UNIT-3:
Distributed shared memory, Architecture, algorithms for implementing DSM, memory
coherence and protocols, design issues. Distributed Scheduling, introduction, issues in load
distributing, components of a load distributing algorithm, stability, load distributing
algorithm, performance comparison, selecting a suitable load sharing algorithm, requirements
for load distributing, task migration and associated issues. Failure Recovery and Fault
Tolerance: introduction, basic concepts, classification of failures, backward and forward error recovery, backward error recovery, recovery in concurrent systems, consistent set of checkpoints, synchronous and asynchronous checkpointing and recovery, checkpointing for distributed database systems, recovery in replicated distributed databases.
Architecture and Design Issues:
Scalability: A DSM system must scale, handling a growing number of nodes, processes, and memory accesses without significant performance degradation. As the system grows, the number of nodes involved in keeping memory coherent and the cost of synchronization both increase.
Consistency: Ensuring consistency is a critical design issue. Different consistency models (sequential consistency, release consistency, etc.) must be chosen carefully based on the application's requirements; a small write-invalidate coherence sketch follows this list.
Fault Tolerance: The DSM system must be able to continue functioning correctly if a
node fails. This requires mechanisms like replication and recovery strategies to handle
failures.
Communication Overhead: Communication between nodes (e.g., sending memory
updates) can introduce latency. Efficient communication protocols are necessary to
minimize this overhead.
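To make the coherence point concrete, here is a minimal sketch, assuming a toy Python model in which each node caches whole pages and uses write-invalidate: before a node writes a page it invalidates every remote copy, so any later read must re-fetch the new value. The class DsmNode and its methods are illustrative, not a real DSM API.

# Minimal write-invalidate DSM sketch (illustrative only: DsmNode and its
# methods are hypothetical, not a real DSM API).

class DsmNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers          # the other DsmNode objects in the system
        self.pages = {}             # page_id -> (data, valid flag)

    def read(self, page_id):
        entry = self.pages.get(page_id)
        if entry is None or not entry[1]:
            # Cache miss or invalidated copy: fetch an up-to-date version.
            data = self.fetch_valid_copy(page_id)
            self.pages[page_id] = (data, True)
        return self.pages[page_id][0]

    def write(self, page_id, data):
        # Write-invalidate: remote copies are invalidated before the write,
        # so no node can keep reading a stale value afterwards.
        for peer in self.peers:
            peer.invalidate(page_id)
        self.pages[page_id] = (data, True)

    def invalidate(self, page_id):
        if page_id in self.pages:
            old_data, _ = self.pages[page_id]
            self.pages[page_id] = (old_data, False)

    def fetch_valid_copy(self, page_id):
        # In a real DSM this is a network request to the page's owner;
        # here we simply scan the peers for a valid copy.
        for peer in self.peers:
            entry = peer.pages.get(page_id)
            if entry is not None and entry[1]:
                return entry[0]
        return None

# Two nodes sharing page 0.
a = DsmNode("A", [])
b = DsmNode("B", [])
a.peers, b.peers = [b], [a]
a.write(0, "v1")
print(b.read(0))    # fetches "v1" from A
b.write(0, "v2")    # invalidates A's cached copy
print(a.read(0))    # re-fetches and prints "v2"

A write-update protocol would instead push the new value to every copy on each write, trading more communication for cheaper reads; which choice is better depends on the sharing pattern, which is exactly the kind of trade-off the design issues above describe.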
Distributed Scheduling
Introduction:
Load Imbalance: Load imbalance occurs when some nodes in the system are overloaded while others are under-utilized, which leads to inefficiency and poor overall performance (a simple way to quantify this is sketched after this list).
Communication Overhead: Load distributing algorithms often require processes to
communicate across the network to share load information. Excessive communication
overhead can hinder performance.
Scalability: As the system scales (i.e., the number of nodes increases), the algorithms
should continue to function effectively. Some load balancing strategies may not scale
well and may become inefficient.
Dynamic Changes: The load in the system can change over time (e.g., nodes can join
or leave the system, workloads may change). The load distribution mechanism must
dynamically adjust to these changes.
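As a small illustration of the load-imbalance issue above, the following sketch (Python, with made-up utilization values) quantifies imbalance as the coefficient of variation of per-node loads; a scheduler could use such a metric to decide whether redistribution is worth its cost.

# Quantifying load imbalance (illustrative values): the coefficient of
# variation of per-node load; 0 means perfectly balanced, larger is worse.

def imbalance(loads):
    mean = sum(loads) / len(loads)
    if mean == 0:
        return 0.0
    variance = sum((x - mean) ** 2 for x in loads) / len(loads)
    return (variance ** 0.5) / mean

balanced   = [0.50, 0.52, 0.48, 0.50]     # hypothetical CPU utilizations
imbalanced = [0.95, 0.90, 0.10, 0.05]

print(imbalance(balanced))                # close to 0
print(imbalance(imbalanced))              # much larger: redistribution helps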
Components of a Load Distributing Algorithm:
1. Load Measurement: The algorithm must monitor and measure the load at each node.
This could include CPU usage, memory usage, or network bandwidth.
2. Task Assignment: Once the load is measured, tasks must be allocated to nodes. This
could involve direct assignment or task migration.
3. Task Migration: Tasks may need to be moved from overloaded nodes to
underutilized ones to balance the load. This process is known as task migration.
4. Feedback Mechanism: The system must periodically reassess the load distribution and adapt assignments to the current system state (the sketch after this list ties these four steps together).
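The sketch below ties the four components together in a few lines of Python. It is a simplified, sender-initiated scheme with a hypothetical threshold: queue length stands in for load measurement, and "migration" is just moving a task object between local queues.

# Sender-initiated load distributing sketch (illustrative threshold; a real
# system would measure CPU/memory load and migrate real process state).

HIGH = 4   # a node with more than HIGH queued tasks is considered overloaded

class Node:
    def __init__(self, name):
        self.name = name
        self.tasks = []

    def load(self):                       # 1. load measurement (queue length)
        return len(self.tasks)

def rebalance(nodes):
    for sender in nodes:
        while sender.load() > HIGH:       # transfer policy: sender is overloaded
            receiver = min(nodes, key=lambda n: n.load())
            if receiver is sender or receiver.load() >= sender.load() - 1:
                break                     # no transfer would improve the balance
            task = sender.tasks.pop()     # 2. task assignment: pick a task
            receiver.tasks.append(task)   # 3. task migration to the receiver

# 4. feedback: in practice rebalance() runs periodically as loads change.
a, b = Node("A"), Node("B")
a.tasks = [f"t{i}" for i in range(8)]
rebalance([a, b])
print(a.load(), b.load())                 # 4 4: the load has been evened out

In a real system, step 1 would sample OS-level metrics (CPU, memory, queue length) and step 3 would transfer process state across the network, which is where most of the cost of task migration lies.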
Stability:
A load distributing algorithm is stable if the overhead it adds (collecting load information, probing, transferring tasks) stays bounded as the system load grows; an unstable algorithm can end up spending more effort moving work around than executing it.
Load Distributing Algorithms:
1. Centralized Algorithms: These algorithms use a central controller that monitors the
load of all nodes and assigns tasks accordingly. While simpler, they introduce a single
point of failure and can become a bottleneck.
2. Decentralized Algorithms: In decentralized algorithms, each node makes load-balancing decisions based on local knowledge and periodic communication with other nodes. These systems are more scalable and fault-tolerant than centralized ones (see the probing sketch after this list).
3. Hybrid Algorithms: Hybrid algorithms combine elements of both centralized and
decentralized approaches. They may use a central controller for overall coordination
but delegate task distribution to individual nodes.
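To contrast with a central controller, the following sketch shows the decentralized idea: an overloaded node probes a few randomly chosen peers and hands a task to the first one that reports spare capacity, so no node needs global knowledge. PROBE_LIMIT and THRESHOLD are hypothetical tuning parameters.

import random

# Decentralized, sender-initiated probing (PROBE_LIMIT and THRESHOLD are
# hypothetical tuning parameters).

PROBE_LIMIT = 3    # how many peers an overloaded node polls
THRESHOLD   = 4    # queue length above which a node tries to shed work

def try_offload(sender, peers):
    if len(sender["queue"]) <= THRESHOLD:
        return False                      # not overloaded: nothing to do
    for peer in random.sample(peers, min(PROBE_LIMIT, len(peers))):
        # "Probe": ask only this peer for its local load.
        if len(peer["queue"]) < THRESHOLD:
            peer["queue"].append(sender["queue"].pop())
            return True
    return False                          # all probed peers were busy

nodes = [{"name": i, "queue": []} for i in range(4)]
nodes[0]["queue"] = list(range(10))       # node 0 starts out overloaded
while try_offload(nodes[0], nodes[1:]):
    pass
print([len(n["queue"]) for n in nodes])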
Performance Comparison:
Efficiency: A good load distributing algorithm must minimize the time taken to
complete tasks and ensure that nodes are neither underloaded nor overloaded.
Scalability: Algorithms that work well with a small number of nodes may not scale
effectively with more nodes. Performance must remain stable as the system grows.
Fault Tolerance: The algorithm must handle node failures gracefully, ensuring that
tasks are reassigned in the case of node unavailability.
Selecting a Suitable Load Sharing Algorithm:
Consider factors such as network topology, system size, load distribution dynamics, and fault tolerance when selecting an algorithm. Algorithms with high overhead may be unsuitable for large-scale systems, while very simple algorithms may not deliver optimal performance.
Requirements for Load Distributing:
Task Granularity: The granularity of tasks (how large or small they are) affects how easily they can be distributed or migrated across nodes.
Communication Overhead: The algorithm should minimize the overhead for
communication between nodes to avoid delays in task distribution.
Adaptability: The load distribution algorithm should adapt to changes in the system,
such as new tasks arriving or nodes failing.
Failure Recovery and Fault Tolerance:
Introduction:
Distributed systems are inherently prone to failures, and recovery from failures is an
essential part of system design. Fault tolerance ensures that the system can continue
to function even in the presence of failures. Failure recovery mechanisms ensure that
the system can return to a consistent state after a failure.
Basic Concepts:
Fault Tolerance: The ability of the system to continue operating despite failures.
Failure Recovery: The ability of the system to restore itself to a consistent state after a failure has occurred, typically by rolling back to a previously saved checkpoint, as sketched below.
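A minimal sketch of backward error recovery, assuming a single Python process that checkpoints its state to a local file with pickle (the file name and state layout are illustrative): after a crash, the process restarts from the most recent checkpoint instead of from the beginning.

import os
import pickle

CHECKPOINT = "state.ckpt"                 # illustrative file name

def save_checkpoint(state):
    # Write to a temporary file and rename, so a crash in the middle of a
    # save never leaves a half-written checkpoint behind.
    with open(CHECKPOINT + ".tmp", "wb") as f:
        pickle.dump(state, f)
    os.replace(CHECKPOINT + ".tmp", CHECKPOINT)

def restore_checkpoint():
    # Backward error recovery: roll back to the last saved consistent state.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"next_item": 0, "total": 0}   # no checkpoint yet: start fresh

state = restore_checkpoint()
items = list(range(100))
for i in range(state["next_item"], len(items)):
    state["total"] += items[i]
    state["next_item"] = i + 1
    if i % 10 == 0:                       # checkpoint every 10 items
        save_checkpoint(state)
print(state["total"])                     # 4950, even if restarted mid-way

In a distributed system the harder problem is ensuring that the checkpoints taken at different nodes form a consistent set, so that rolling back one node does not leave another node having received a message that, after rollback, was never sent.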
Classification of Failures:
1. Crash Failures: A process or node fails by simply stopping and losing its internal state. These are the most common type of failure in distributed systems.
2. Omission Failures: A process fails to send or receive messages, for example because a network link fails or the process does not respond to requests; to the caller this is often indistinguishable from a crash (see the timeout-and-retry sketch after this list).
3. Byzantine Failures: These occur when a process behaves arbitrarily, which could
include sending incorrect data, producing false outputs, or acting maliciously.
4. Network Partitioning: The system is split into isolated groups of nodes, and communication between the groups becomes impossible even though nodes within each group may keep running.
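In practice a caller often cannot tell a crash failure from an omission failure: both show up as a missing reply. The hedged sketch below (Python; unreliable_call is a hypothetical stand-in for a real remote invocation) shows the usual response of a timeout plus a bounded number of retries, after which the peer is treated as failed. Byzantine failures are harder, because a reply may arrive but be wrong, so they call for redundancy and voting rather than simple retries.

import random
import time

# A crashed peer and a dropped message both look like a timeout to the
# caller; unreliable_call is a hypothetical stand-in for a remote invocation.

class Timeout(Exception):
    pass

def unreliable_call(request):
    if random.random() < 0.3:             # sometimes the reply is "lost"
        raise Timeout()
    return f"reply to {request}"

def call_with_retries(request, retries=3, delay=0.1):
    for _ in range(retries):
        try:
            return unreliable_call(request)
        except Timeout:
            time.sleep(delay)             # back off before retrying
    # After repeated omissions the caller assumes the peer has crashed (or is
    # unreachable behind a partition) and reports the failure upward.
    raise RuntimeError(f"peer considered failed after {retries} attempts")

print(call_with_retries("read x"))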