
Advanced Operating Systems

UNIT- 3:
Distributed shared memory, architecture, algorithms for implementing DSM, memory
coherence and protocols, design issues. Distributed scheduling, introduction, issues in load
distributing, components of a load distributing algorithm, stability, load distributing
algorithms, performance comparison, selecting a suitable load sharing algorithm, requirements
for load distributing, task migration and associated issues. Failure recovery and fault
tolerance: introduction, basic concepts, classification of failures, backward and forward error
recovery, backward error recovery, recovery in concurrent systems, consistent set of
checkpoints, synchronous and asynchronous checkpointing and recovery, checkpointing for
distributed database systems, recovery in replicated distributed databases.

Distributed Shared Memory (DSM)

Architecture:

 Distributed Shared Memory (DSM) is a concept where a distributed system
(composed of multiple machines) presents the illusion of having a single, unified
memory space to the applications running on it. Each node (machine) in the
distributed system has its own local memory, but DSM allows processes to access the
memory as if it were shared. The architecture of DSM typically consists of:
1. Multiple nodes: Each node has local memory and is connected to others via a
network.
2. Communication layer: The nodes communicate with each other to share
memory content, often utilizing message-passing protocols to update and
access shared memory locations.
3. Synchronization mechanisms: Mechanisms such as locks or semaphores may
be needed to coordinate memory access between processes running on
different nodes to ensure correct operation.
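The three components above can be illustrated with a minimal in-process sketch, where "messages" between nodes are simulated by method calls on a toy network object. All class and method names here (Network, DSMNode, send_read, and so on) are illustrative, not a real DSM API, and the static address partitioning is just one possible design.

```python
import threading

class Network:
    """Toy communication layer: routes read/write requests between nodes."""
    def __init__(self):
        self.nodes = {}

    def register(self, node):
        self.nodes[node.node_id] = node

    def send_read(self, owner_id, addr):
        return self.nodes[owner_id].local_read(addr)

    def send_write(self, owner_id, addr, value):
        self.nodes[owner_id].local_write(addr, value)

class DSMNode:
    """A node with local memory; remote addresses go through the network."""
    def __init__(self, node_id, network, owned_range):
        self.node_id = node_id
        self.network = network
        self.owned_range = owned_range   # addresses this node owns
        self.memory = {}
        self.lock = threading.Lock()     # synchronization mechanism
        network.register(self)

    def owner_of(self, addr):
        # Simple static partitioning: each node owns a fixed address range.
        for node in self.network.nodes.values():
            if addr in node.owned_range:
                return node.node_id
        raise KeyError(addr)

    def local_read(self, addr):
        with self.lock:
            return self.memory.get(addr)

    def local_write(self, addr, value):
        with self.lock:
            self.memory[addr] = value

    def read(self, addr):
        owner = self.owner_of(addr)
        if owner == self.node_id:
            return self.local_read(addr)
        return self.network.send_read(owner, addr)   # remote access

    def write(self, addr, value):
        owner = self.owner_of(addr)
        if owner == self.node_id:
            self.local_write(addr, value)
        else:
            self.network.send_write(owner, addr, value)

net = Network()
n0 = DSMNode(0, net, range(0, 100))
n1 = DSMNode(1, net, range(100, 200))
n0.write(150, "hello")        # transparently forwarded to node 1
print(n1.local_read(150))     # hello
print(n1.read(42) is None)    # True: address 42 was never written
```

The point of the sketch is that the application on node 0 writes address 150 with an ordinary call, and the communication layer decides whether the access is local or remote.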

Algorithms for Implementing DSM:

 Implementing DSM requires coordination between different nodes to ensure that
updates to the shared memory are properly synchronized. The main approaches to
DSM implementation include:
o Lazy Update Protocol: In this approach, the updates to shared memory are
not immediately propagated to all nodes. Updates are only sent to the nodes
that request the data or when necessary. This protocol reduces the overhead of
communication but can lead to temporary inconsistency.
o Eager Update Protocol: Every time a process writes to the shared memory,
the change is immediately broadcast to all other processes. This ensures strong
consistency but can introduce high communication overhead.
o Write-Invalidate Protocol: In this approach, when a process writes to shared
memory, it invalidates other copies of the memory location in the system. This
is typically used in systems with distributed caches to prevent stale data from
being accessed.
o Write-Propagation Protocol: This approach propagates changes to all other
copies in the system as soon as a process modifies a memory location.
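The contrast between these protocols can be made concrete with a toy model of write-invalidate: per-node cached copies plus one authoritative store, where a write removes every other node's copy so that stale data cannot be read. The class below is an illustrative sketch, not a real coherence implementation.

```python
class WriteInvalidateDSM:
    """Toy write-invalidate protocol: one authoritative store plus per-node
    cached copies; a write invalidates every other node's cached copy."""
    def __init__(self, num_nodes):
        self.store = {}                              # authoritative memory
        self.cache = [dict() for _ in range(num_nodes)]

    def read(self, node, addr):
        if addr not in self.cache[node]:             # cache miss: fetch a copy
            self.cache[node][addr] = self.store.get(addr)
        return self.cache[node][addr]

    def write(self, node, addr, value):
        self.store[addr] = value
        self.cache[node][addr] = value
        for other, cache in enumerate(self.cache):   # invalidate other copies
            if other != node:
                cache.pop(addr, None)

dsm = WriteInvalidateDSM(num_nodes=3)
dsm.write(0, "x", 1)
print(dsm.read(1, "x"))      # 1  (miss, fetched from the store)
dsm.write(2, "x", 2)         # invalidates node 0's and node 1's copies
print("x" in dsm.cache[1])   # False: the copy was invalidated
print(dsm.read(1, "x"))      # 2  (re-fetched after invalidation)
```

An eager-update (write-propagation) variant would replace the invalidation loop with one that pushes the new value into every cache, trading extra communication for fewer misses.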

Memory Coherence and Protocols:

 Memory Coherence refers to the requirement that all processes in a distributed
system observe the same order of updates to shared memory locations. In other words,
if process P1 writes to a memory location and process P2 reads that location, P2 must
see the update from P1, or they should be able to synchronize the access to this
memory.
o Sequential Consistency: One of the strongest practical memory consistency
models: the results of operations are guaranteed to be the same as if they were
executed in some sequential order, and all processes see memory updates in the
same order.
o Release Consistency: This model allows for more relaxed memory
synchronization by only ensuring that synchronization happens at explicit
synchronization points (such as barriers or locks) instead of after every
memory write.
o Distributed Coherence Protocols:
 Directory-based protocols: These are used to maintain memory
coherence across a distributed system. A directory at each node tracks
which processes have copies of memory locations, and updates are sent
to all nodes that have a copy of the memory location.
 Broadcast protocols: These protocols propagate updates to all nodes
immediately. However, they are inefficient for large systems with
many nodes.
 Token-based protocols: A token is passed between nodes that hold
shared memory. This token is used to ensure that the memory is
updated in a consistent way across all nodes.
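A directory-based protocol can be sketched as follows: the directory records which nodes hold a copy of each location, and every write goes through the directory, which forwards the update to exactly those sharers. The sketch below is illustrative (a toy update-based directory, with invented names), not a production protocol.

```python
class Directory:
    """Toy directory-based coherence: the directory records which nodes hold
    a copy of each location and forwards every update to those sharers."""
    def __init__(self, num_nodes):
        self.sharers = {}                      # addr -> set of node ids
        self.copies = [dict() for _ in range(num_nodes)]
        self.memory = {}

    def fetch(self, node, addr):
        """A node requests a copy; the directory records it as a sharer."""
        self.sharers.setdefault(addr, set()).add(node)
        self.copies[node][addr] = self.memory.get(addr)
        return self.copies[node][addr]

    def update(self, node, addr, value):
        """A write goes through the directory, which updates every sharer."""
        self.memory[addr] = value
        self.sharers.setdefault(addr, set()).add(node)
        for sharer in self.sharers[addr]:
            self.copies[sharer][addr] = value

d = Directory(num_nodes=3)
d.fetch(1, "flag")
d.fetch(2, "flag")
d.update(0, "flag", True)
print(d.copies[1]["flag"], d.copies[2]["flag"])   # True True
print(sorted(d.sharers["flag"]))                  # [0, 1, 2]
```

The key contrast with a broadcast protocol is the sharer set: updates reach only the nodes the directory knows are holding a copy, instead of every node in the system.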

Design Issues in DSM:

 Scalability: DSM systems need to be scalable, meaning they can handle the
increasing number of nodes, processes, and memory accesses without significant
performance degradation. As the system grows, the number of nodes involved in
memory coherence and the overhead of synchronization increases.
 Consistency: Ensuring consistency is a critical design issue. Different consistency
models (sequential consistency, release consistency, etc.) need to be carefully chosen
based on the application requirements.
 Fault Tolerance: The DSM system must be able to continue functioning correctly if a
node fails. This requires mechanisms like replication and recovery strategies to handle
failures.
 Communication Overhead: Communication between nodes (e.g., sending memory
updates) can introduce latency. Efficient communication protocols are necessary to
minimize this overhead.
Distributed Scheduling

Introduction:

 Distributed scheduling involves the allocation of resources or tasks across multiple
machines in a distributed system. Unlike centralized scheduling, which uses a single
point to manage tasks, distributed scheduling handles tasks in a decentralized manner
to achieve load balancing, optimize resource use, and minimize delays. The main goal
is to improve the performance of the system by utilizing its resources efficiently.

Issues in Load Distributing:

 Load Imbalance: Some nodes in the system become overloaded while others sit
under-utilized. This leads to inefficiency and underperformance.
 Communication Overhead: Load distributing algorithms often require processes to
communicate across the network to share load information. Excessive communication
overhead can hinder performance.
 Scalability: As the system scales (i.e., the number of nodes increases), the algorithms
should continue to function effectively. Some load balancing strategies may not scale
well and may become inefficient.
 Dynamic Changes: The load in the system can change over time (e.g., nodes can join
or leave the system, workloads may change). The load distribution mechanism must
dynamically adjust to these changes.

Components of a Load Distributing Algorithm:

1. Load Measurement: The algorithm must monitor and measure the load at each node.
This could include CPU usage, memory usage, or network bandwidth.
2. Task Assignment: Once the load is measured, tasks must be allocated to nodes. This
could involve direct assignment or task migration.
3. Task Migration: Tasks may need to be moved from overloaded nodes to
underutilized ones to balance the load. This process is known as task migration.
4. Feedback Mechanism: The system must periodically assess the load distribution and
adapt the assignments based on current system status.
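The four components above can be sketched together in one small loop: queue length stands in for load measurement, tasks go to the least-loaded node, and a migration loop with a stopping condition plays the role of the feedback mechanism. Names, thresholds, and the load metric are all illustrative choices.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    tasks: list = field(default_factory=list)

    @property
    def load(self):                     # 1. load measurement (queue length)
        return len(self.tasks)

def assign(nodes, task):                # 2. task assignment: least loaded
    target = min(nodes, key=lambda n: n.load)
    target.tasks.append(task)
    return target.node_id

def rebalance(nodes):                   # 3. task migration + 4. feedback
    moved = 0
    while True:
        busiest = max(nodes, key=lambda n: n.load)
        idlest = min(nodes, key=lambda n: n.load)
        if busiest.load - idlest.load <= 1:   # feedback: stop when balanced
            return moved
        idlest.tasks.append(busiest.tasks.pop())
        moved += 1

nodes = [Node(0, ["t1", "t2", "t3", "t4"]), Node(1), Node(2)]
print(assign(nodes, "t5"))      # 1  (an idle node gets the new task)
moved = rebalance(nodes)
print([n.load for n in nodes])  # [2, 2, 1]
```

The stopping condition (a load difference of at most one task) is what keeps the loop from oscillating, which previews the stability property discussed next.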

Stability:

 A load distributing algorithm is considered stable if it consistently achieves an
optimal or near-optimal load balance without oscillating. In other words, the system
should not continually overcompensate or undercompensate the allocation of tasks
across nodes.

Load Distributing Algorithms:

1. Centralized Algorithms: These algorithms use a central controller that monitors the
load of all nodes and assigns tasks accordingly. While simpler, they introduce a single
point of failure and can become a bottleneck.
2. Decentralized Algorithms: In decentralized algorithms, each node makes decisions
about load balancing based on local knowledge and periodic communication with
other nodes. These systems are more scalable and fault-tolerant than centralized ones.
3. Hybrid Algorithms: Hybrid algorithms combine elements of both centralized and
decentralized approaches. They may use a central controller for overall coordination
but delegate task distribution to individual nodes.
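As an example of the decentralized approach, the sketch below models a sender-initiated threshold policy: an overloaded node probes a few randomly chosen peers and transfers one task to the first peer found under the threshold. The threshold, probe limit, and function name are illustrative parameters, not a standard API.

```python
import random

def sender_initiated(loads, threshold=3, probe_limit=2, rng=None):
    """Toy sender-initiated policy: an overloaded node probes a few random
    peers and hands one task to the first peer under the threshold."""
    rng = rng or random.Random(0)       # fixed seed for a repeatable demo
    transfers = []
    for sender in range(len(loads)):
        if loads[sender] <= threshold:
            continue                    # not overloaded: stay quiet
        peers = [p for p in range(len(loads)) if p != sender]
        for receiver in rng.sample(peers, min(probe_limit, len(peers))):
            if loads[receiver] < threshold:   # probe found a willing peer
                loads[sender] -= 1
                loads[receiver] += 1
                transfers.append((sender, receiver))
                break
    return transfers

loads = [5, 1, 4, 0]
transfers = sender_initiated(loads)
print(transfers)   # the two overloaded nodes (0 and 2) each shed one task
print(loads)
```

Because each node acts only on local knowledge plus a bounded number of probes, there is no central controller and no single point of failure; a receiver-initiated variant would instead have under-loaded nodes do the probing.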

Performance Comparison:

 Efficiency: A good load distributing algorithm must minimize the time taken to
complete tasks and ensure that nodes are neither underloaded nor overloaded.
 Scalability: Algorithms that work well with a small number of nodes may not scale
effectively with more nodes. Performance must remain stable as the system grows.
 Fault Tolerance: The algorithm must handle node failures gracefully, ensuring that
tasks are reassigned in the case of node unavailability.

Selecting a Suitable Load Sharing Algorithm:

 Consider factors such as network topology, system size, load distribution dynamics,
and fault tolerance when selecting an algorithm. Algorithms with high overhead may
be unsuitable for large-scale systems, while simpler algorithms may not provide
optimal performance.

Requirements for Load Distribution:

 Task Granularity: The granularity of tasks (how large or small they are) affects how
easily they can be distributed or migrated across nodes.
 Communication Overhead: The algorithm should minimize the overhead for
communication between nodes to avoid delays in task distribution.
 Adaptability: The load distribution algorithm should adapt to changes in the system,
such as new tasks arriving or nodes failing.

Task Migration and Associated Issues:

 Migration Overhead: Moving tasks between nodes requires communication and
synchronization. Migration should only happen when it results in a net gain in
performance.
 Consistency: Task migration requires that the state of the task is preserved across
nodes to ensure correct execution.
 Synchronization: When tasks are moved between nodes, they may need to
synchronize with other tasks, particularly in concurrent environments.
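The consistency requirement, that a migrated task's state is preserved across the move, can be sketched with simple serialization: capture the state on the source, ship the bytes, and rebuild it on the destination. The `migrate` helper and the dictionary-based node representation are hypothetical, chosen only for illustration.

```python
import pickle

def migrate(task_state, source_node, dest_node):
    """Toy task migration: serialize the task's state on the source, remove
    it there, and resume it on the destination with its state intact."""
    blob = pickle.dumps(task_state)            # capture the task's state
    del source_node["tasks"][task_state["name"]]
    restored = pickle.loads(blob)              # state preserved across the move
    dest_node["tasks"][restored["name"]] = restored
    return restored

source = {"tasks": {"t1": None}}
dest = {"tasks": {}}
task = {"name": "t1", "progress": 42}
migrate(task, source, dest)
print(dest["tasks"]["t1"]["progress"])   # 42: progress survives the move
print(source["tasks"])                   # {}: the source no longer runs it
```

The serialization cost in `pickle.dumps` is exactly the migration overhead the section warns about: it only pays off when the destination node is enough less loaded to recover that cost.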

Failure Recovery and Fault Tolerance

Introduction:
 Distributed systems are inherently prone to failures, and recovery from failures is an
essential part of system design. Fault tolerance ensures that the system can continue
to function even in the presence of failures. Failure recovery mechanisms ensure that
the system can return to a consistent state after a failure.

Basic Concepts:

 Fault Tolerance: The ability of the system to continue operating despite failures.
 Failure Recovery: The ability of the system to restore itself to a consistent state after
a failure has occurred.

Classification of Failures:

1. Crash Failures: A process or node fails by simply stopping and losing its internal
state. These are the most common type of failure in distributed systems.
2. Omission Failures: A process fails to send or receive messages. This can happen if a
network link fails or if a process does not respond to requests.
3. Byzantine Failures: These occur when a process behaves arbitrarily, which could
include sending incorrect data, producing false outputs, or acting maliciously.
4. Network Partitioning: The system is divided into isolated groups of nodes, where
communication is impossible between the groups.

Backward and Forward Error Recovery:

 Backward Error Recovery: In this strategy, the system reverts to a previous,
consistent state (such as a checkpoint) and re-executes any operations that happened
after that point. This is typically used when a failure occurs and the system needs to
"undo" the operations to reach a known good state.
 Forward Error Recovery: This strategy allows the system to continue operation
despite the failure. For example, by using redundancy or error-correcting codes, the
system might be able to proceed with operations even after encountering an error.

Backward Error Recovery:

 Checkpointing: To implement backward recovery, the system periodically saves its
state, known as a checkpoint. If a failure occurs, the system can roll back to the most
recent checkpoint and continue from there.
 Logs: Another method is to keep logs of actions and operations. If a failure occurs,
the system can replay the log to restore itself to a consistent state.
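The two techniques are usually combined: roll back to the latest checkpoint, then replay the log of operations recorded since it. The sketch below shows that combination on a trivial counter; the class and its operations are invented for illustration.

```python
import copy

class RecoverableCounter:
    """Toy backward error recovery: periodic checkpoints of state plus a log
    of operations applied since the last checkpoint."""
    def __init__(self):
        self.state = {"count": 0}
        self.checkpoint_state = copy.deepcopy(self.state)
        self.log = []                        # operations since last checkpoint

    def add(self, n):
        self.state["count"] += n
        self.log.append(("add", n))

    def checkpoint(self):
        self.checkpoint_state = copy.deepcopy(self.state)
        self.log.clear()                     # log only covers the gap since it

    def recover(self):
        """Roll back to the checkpoint, then replay the logged operations."""
        self.state = copy.deepcopy(self.checkpoint_state)
        for op, n in self.log:
            if op == "add":
                self.state["count"] += n

c = RecoverableCounter()
c.add(5)
c.checkpoint()
c.add(3)
c.state["count"] = 999        # simulate state corruption by a failure
c.recover()                   # roll back to the checkpoint, replay the log
print(c.state["count"])       # 8
```

Clearing the log at each checkpoint is what keeps replay cheap: recovery only redoes the work performed since the last saved state, not the whole history.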

Recovery in Concurrent Systems:

 Concurrency makes recovery more complex, as multiple processes may be
interacting with each other at the time of failure. A system must ensure that it can
restore the state of each process and maintain consistency across all processes.

Consistent Set of Checkpoints:


 In distributed systems, checkpoints must be consistent across all processes. This
means that all processes that depend on each other must be at compatible states for
recovery to be meaningful.
 Coordinated Checkpointing: All processes synchronize their checkpoints to ensure a
global consistent state.
 Uncoordinated Checkpointing: Processes checkpoint independently, which may
lead to inconsistencies, requiring more sophisticated recovery mechanisms.

Synchronous and Asynchronous Checkpointing and Recovery:

 Synchronous Checkpointing: All processes coordinate their checkpoints at the same
time, ensuring a consistent global state. However, this approach incurs high overhead.
 Asynchronous Checkpointing: Processes can checkpoint independently, reducing
overhead but potentially leading to an inconsistent global state.
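The synchronous case can be sketched as a barrier at which every process saves its state, so that a later rollback lands the whole group on one consistent global state. This is a deliberately simplified model: real coordinated checkpointing must also quiesce or record in-flight messages, which the toy classes below (with invented names) ignore.

```python
class Process:
    """Toy process with a single integer of state and one saved checkpoint."""
    def __init__(self, pid):
        self.pid = pid
        self.state = 0
        self.saved = None

    def step(self, n):
        self.state += n

    def take_checkpoint(self):
        self.saved = self.state

    def rollback(self):
        self.state = self.saved

def coordinated_checkpoint(processes):
    """Synchronous checkpointing: every process saves its state at the same
    coordination point, so the saved set forms a consistent global state."""
    for p in processes:            # in a real system, channel messages would
        p.take_checkpoint()        # be drained or logged before this barrier

procs = [Process(i) for i in range(3)]
for p in procs:
    p.step(10)
coordinated_checkpoint(procs)       # consistent global state: [10, 10, 10]
procs[1].step(5)                    # progress made after the checkpoint
for p in procs:
    p.rollback()                    # on failure, everyone rolls back together
print([p.state for p in procs])     # [10, 10, 10]
```

Under uncoordinated checkpointing each process would call `take_checkpoint` on its own schedule, and a rollback could force other processes back to even earlier checkpoints (the domino effect), which is the inconsistency risk the text mentions.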

Checkpointing for Distributed Database Systems:

 Distributed databases use checkpointing to ensure consistency across replicas.
Write-ahead logging (WAL) ensures that changes are recorded before being applied,
allowing for recovery after a crash.

Recovery in Replicated Distributed Databases:

 In replicated databases, multiple copies of data are maintained to provide fault
tolerance. If one replica fails, others can take over. Replication mechanisms ensure
that the system remains consistent during recovery by using protocols like
quorum-based replication or two-phase commit (2PC) for transactions.
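The quorum idea can be sketched as follows: with N replicas, a write must succeed on W of them and a read must consult R of them; as long as R + W > N, every read quorum overlaps every write quorum, so the reader always sees at least one up-to-date copy. The class below is an illustrative version-numbered sketch, not a real replication protocol.

```python
class QuorumStore:
    """Toy quorum replication: writes land on W replicas, reads query R
    replicas and keep the newest version; R + W > N forces an overlap."""
    def __init__(self, n=3, w=2, r=2):
        assert r + w > n, "read and write quorums must overlap"
        self.replicas = [dict() for _ in range(n)]
        self.w, self.r = w, r
        self.version = 0

    def write(self, key, value):
        self.version += 1
        for replica in self.replicas[:self.w]:    # reach only a write quorum
            replica[key] = (self.version, value)

    def read(self, key):
        # Query a read quorum; versioned tuples make the newest value win.
        answers = [rep[key] for rep in self.replicas[-self.r:] if key in rep]
        return max(answers)[1]

q = QuorumStore(n=3, w=2, r=2)
q.write("x", "old")     # reaches replicas 0 and 1
q.write("x", "new")     # reaches replicas 0 and 1 again
print(q.read("x"))      # "new": the read quorum {1, 2} overlaps the writes
```

Replica 2 never saw either write, yet the read still returns the latest value, because the overlap guarantee means at least one queried replica holds the newest version; this is how a failed or lagging replica can be tolerated during recovery.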
