
KAMLA NEHRU INSTITUTE

OF TECHNOLOGY,
SULTANPUR (UP)

ASSIGNMENT – 1, 2, 3, 4, 5
MASTERS IN TECHNOLOGY
1st YEAR 2nd SEMESTER
DISTRIBUTED COMPUTING (RCSE-201)

Submitted To: Prof. Abhay Kumar Agrawal, CSE Department, KNIT Sultanpur
Submitted By: Shubham, Roll No. 2410208
1. What is a distributed system? Give the important goals to build the
distributed system.
A distributed system is a collection of independent computers that appear to the
users as a single coherent system. These computers coordinate with each other to
achieve a common objective, typically through communication over a network. Each
node in the system performs part of the overall task and communicates results to
others. Unlike centralized systems, a well-designed distributed system avoids a single point of failure, and tasks can continue even if one or more nodes fail.

Key Characteristics of Distributed Systems:

Concurrency: Multiple processes run simultaneously on different machines.

Scalability: Ability to handle growth in workload or addition of nodes.

Transparency: Users perceive the system as a single unit despite its distribution.

Fault tolerance: The system can recover from partial failures.

Goals of Distributed Systems:

Transparency: Users perceive the system as a single unit despite its distribution.

Openness: Distributed systems should be open to integration and expansion using standardized protocols and interfaces.

Scalability: A system should handle growth efficiently, whether it involves more users, resources, or geographic spread.

Fault Tolerance and Reliability: Even if parts of the system fail, the system as a whole should continue functioning using replication, checkpointing, and recovery techniques.

Performance: Optimized resource sharing and task distribution should ensure high performance, even under load.

Security: Since the system is accessible over networks, it must ensure data integrity, confidentiality, and secure access.

Resource Sharing: Enables sharing of hardware, software, and data resources across machines.

Concurrency Control: As multiple users/processes access resources simultaneously, mechanisms must ensure consistency and correctness.
2. Compare the Shared Address Space and Message-Passing Architectures.
Shared Address Space Architecture (SAS):

In the Shared Address Space model, multiple processes or threads share a common memory space. They communicate by reading and writing to shared variables in memory.

Characteristics:

 Shared variables act as communication channels.

 Synchronization is essential to avoid conflicts (e.g., using locks or semaphores).

 Primarily used in tightly coupled systems like multi-core or multi-threaded environments.

Advantages:

 Faster Communication: No need for complex message encoding/decoding.

 Low Overhead: Avoids the cost of message passing (no marshalling).

 Ease of Programming: Natural for algorithms where threads cooperate closely on shared data.

Disadvantages:

 Synchronization Complexity: Needs careful coordination using mutexes, semaphores, etc.

 Scalability Issues: As the number of threads grows, contention for shared memory increases.

 Limited to Local Systems: Difficult to use efficiently across physically separate machines.

Message-Passing Architecture (MPA):

In the Message-Passing model, processes have private memory and interact solely
by sending and receiving messages. This model is common in loosely coupled
systems, including distributed environments over a network.

Characteristics:

 No shared memory; each process has its own memory space.

 Communication via explicit message passing (e.g., sockets, RPC, MPI).

 Supports both synchronous (blocking) and asynchronous (non-blocking) messaging.

Advantages:

 Highly Scalable: Works efficiently in distributed systems with thousands of nodes.

 Isolation: Processes are independent, reducing shared state conflicts.

 Suitable for Heterogeneous Systems: Different machines can interact using standardized protocols.

Disadvantages:

 Overhead: Requires serialization and communication over the network.

 Complexity: Handling message formats, communication failures, retries, etc.

 Latency: Network delays can affect performance, especially in synchronous messaging.

3. Explain the Architectural and Fundamental models of the Distributed System.


1. Architectural Models

An Architectural Model defines the structure of a distributed system in terms of its components (like processes, objects, or services) and their relationships. It focuses on how these entities are arranged and how they communicate.

Types of Architectural Models:

a) Client-Server Model

 Description: One or more clients request services from a central server.

 Example: A web browser (client) requests pages from a web server.

 Pros: Simplicity, central control.

 Cons: Single point of failure, scalability bottleneck.

b) Peer-to-Peer Model (P2P)

 Description: All nodes act both as clients and servers. Each peer can request
and provide services.
 Example: BitTorrent, blockchain.

 Pros: High scalability, fault tolerance.

 Cons: Complex management, security concerns.

c) Three-Tier Architecture

 Description: Divides the system into three layers – presentation (UI), logic
(business logic), and data (database).

 Example: Web applications (HTML frontend, application server, database backend).

 Pros: Separation of concerns, easier to manage and scale.

 Cons: Increased latency due to more components.

d) Service-Oriented Architecture (SOA)

 Description: Applications are composed of loosely coupled services that communicate using standard protocols (like SOAP or REST).

 Example: Enterprise systems integrating HR, billing, and logistics services.

 Pros: Reusability, interoperability.

 Cons: Overhead from service orchestration.

2. Fundamental Models

A Fundamental Model focuses on understanding the properties and limitations of distributed systems. It provides insight into behavior like timing, failure, and security, helping guide design choices.

Types of Fundamental Models:

a) Interaction Model

 Describes: How components interact, considering communication delays and timing assumptions.

 Types:

o Synchronous: Known message delay and execution time.

o Asynchronous: No bounds on delays – real-world internet model.


o Partially Synchronous: Hybrid of the above two.

 Importance: Helps in choosing the right algorithms, especially for coordination and synchronization.

b) Failure Model

 Describes: The types of failures that can occur in the system.

 Types of Failures:

o Crash: A process stops unexpectedly.

o Omission: Message lost or not sent/received.

o Timing: Response not delivered on time.

o Byzantine: Arbitrary or malicious failures.

 Use: Helps in designing fault-tolerant protocols and systems.

c) Security Model

 Describes: Threats and mechanisms to ensure confidentiality, integrity, and availability.

 Common Threats:

o Eavesdropping (unauthorized access to messages).

o Masquerading (impersonating a legitimate user).

o Denial of Service (DoS) attacks.

 Mechanisms: Authentication, encryption, access control.

4. Explain the Shared Address Space and Message-Passing Architecture with their requirements and working methodology.
1. Shared Address Space Architecture

In the Shared Address Space (SAS) model, multiple threads or processes operate
within a single address space. This setup is common in systems using shared
memory multiprocessors or in multi-threaded applications on the same machine.

Requirements:
 Shared memory access: A physical or logical memory space must be
accessible to all participating threads or processes.

 Synchronization mechanisms: Such as semaphores, mutexes, condition variables, or monitors to ensure safe concurrent access.

 Thread-safe operations: Developers must handle data races, critical sections, and potential deadlocks.

 Operating System Support: Support for thread/process scheduling and memory protection.

Working Methodology:

 Each thread/process reads or writes shared data in the common memory.

 Synchronization primitives are used to prevent conflicts (e.g., one thread locking a resource while another waits).

 Data exchange is implicit—threads access shared variables directly.

Example Use Case:

A multi-threaded banking system where multiple threads update the account balance stored in a shared variable.
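As a rough illustration of this use case, the sketch below uses Python threads and a lock; the shared variable and function names are illustrative, not part of any standard API.

import threading

balance = 0                      # shared variable in the common address space
balance_lock = threading.Lock()  # synchronization primitive guarding the balance

def deposit(amount, times):
    global balance
    for _ in range(times):
        with balance_lock:       # only one thread updates the balance at a time
            balance += amount

threads = [threading.Thread(target=deposit, args=(1, 10000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # 40000 with the lock; without it, lost updates are possible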

2. Message-Passing Architecture

In the Message-Passing Architecture (MPA), each process has its own private
memory, and all communication occurs by explicit message exchange. This is the
most common communication method in distributed systems.

Requirements:

 Communication channels: Established via sockets, TCP/IP, MPI, RPC, or RMI.

 Message format and protocol: Define how data is serialized and interpreted.

 Error handling and acknowledgment mechanisms: To ensure delivery and deal with failures.

 Concurrency support: Processes may need to handle multiple incoming messages at once.
Working Methodology:

 A process sends a message containing data to another process.

 The recipient receives and processes the message.

 Communication may be:

o Synchronous (blocking): The sender waits for the message to be received.

o Asynchronous (non-blocking): The sender continues execution without waiting.

Example Use Case:

Microservices architecture where independent services interact via RESTful APIs or gRPC messages.
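A minimal message-passing sketch, assuming Python's multiprocessing queues as the communication channel (the message fields are illustrative):

from multiprocessing import Process, Queue

def worker(inbox, outbox):
    # Each process has its own private memory; data arrives only as messages.
    msg = inbox.get()                      # blocking (synchronous-style) receive
    outbox.put({"reply_to": msg["id"], "result": msg["value"] * 2})

if __name__ == "__main__":
    to_worker, from_worker = Queue(), Queue()
    p = Process(target=worker, args=(to_worker, from_worker))
    p.start()
    to_worker.put({"id": 1, "value": 21})  # explicit message send
    print(from_worker.get())               # {'reply_to': 1, 'result': 42}
    p.join()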

5. What is distributed mutual exclusion? Classify the Distributed mutual exclusion.
Distributed Mutual Exclusion (DME) is a mechanism for ensuring mutual
exclusion in a distributed environment, where the participating processes run on
separate nodes and communicate via message-passing.

The goal is to ensure:

 Safety: At most one process is in the critical section at a time.

 Liveness: Every request to enter the CS will eventually be granted.

 Fairness: Requests should be honored in the order they were made (first-
come, first-served, if applicable).

Challenges in Distributed Mutual Exclusion:

 No shared memory or centralized controller.

 No global clock.

 Message delays or failures.

 Fault tolerance.

Classification of Distributed Mutual Exclusion Algorithms:


Distributed mutual exclusion algorithms are typically classified into three
categories:

1. Centralized Algorithms:

Concept: A single coordinator process manages access to the critical section.

Working:

 A process sends a request to the coordinator.

 The coordinator grants access if no one else is using the CS.

 After completion, the process informs the coordinator, who then grants
access to the next requester.

Example: The central coordinator (central server) algorithm.

2. Distributed Algorithms (Decentralized Voting/Quorum-Based):

Concept: No central coordinator. Each process communicates with a group of other processes to get permission.

Working:

 A process sends requests to all other processes (or a quorum subset).

 It enters the CS only after receiving permission from the majority (or all).

 After exiting, it sends release messages.

Example: Ricart and Agrawala's distributed algorithm, Maekawa’s algorithm.

3. Token-Based Algorithms:

Concept: A unique token circulates among processes. Possession of the token grants
access to the CS.

Working:

 When a process wants to enter the CS, it waits for the token.

 If it doesn’t have the token, it sends a request to the current token holder.

 Once it completes the CS, it passes the token to the next requesting process.

Examples: Suzuki-Kasami’s algorithm.
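A toy sketch of the token-passing idea described above (not the full Suzuki-Kasami algorithm; the class, functions, and request-queue handling are simplified assumptions):

class TokenNode:
    """Toy token-based mutual exclusion: whoever holds the token may enter the CS."""
    def __init__(self, pid):
        self.pid = pid
        self.pending = []                  # requests received while holding the token

def enter_cs(node, token_holder):
    if node is token_holder:
        return True                        # already holds the token: enter immediately
    token_holder.pending.append(node.pid)  # send a request to the current holder
    return False                           # wait until the token arrives

def exit_cs(nodes, token_holder):
    # On leaving the CS, pass the token (with outstanding requests) to the next requester.
    if token_holder.pending:
        next_pid = token_holder.pending.pop(0)
        new_holder = nodes[next_pid]
        new_holder.pending = token_holder.pending
        token_holder.pending = []
        return new_holder
    return token_holder                    # keep the token if nobody is waiting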


6. Explain the mutual exclusion algorithm by taking suitable example.
Mutual exclusion in a distributed system ensures that only one process enters the
critical section (CS) at a time when accessing shared resources. Unlike centralized
systems, distributed systems lack shared memory and global clocks, and rely solely
on message passing, making mutual exclusion more complex.

One of the most well-known mutual exclusion algorithms in distributed systems is the Ricart-Agrawala Algorithm.

This is a distributed algorithm for achieving mutual exclusion without a central coordinator and without a token. It is based on message passing and logical timestamps (Lamport clocks).

Assumptions:

 Each process has a unique ID.

 Processes communicate over reliable channels.

 The communication graph is fully connected (i.e., any process can send a
message to any other).

 Lamport logical clocks are used to timestamp events.

Algorithm Steps:

1. Requesting Critical Section (CS):

o A process P that wants to enter CS sends a REQUEST(timestamp, P_ID) message to all other processes.

o It then waits for a REPLY message from all other processes.

2. Receiving a Request:

o When a process Q receives a REQUEST from P, it checks:

 If Q is not interested in the CS or has a later timestamp, it sends a REPLY immediately.

 If Q is already in CS, or has requested access with an earlier timestamp, it defers the reply (queues the request).

3. Entering CS:

o Once P has received REPLY from all other processes, it enters the CS.
4. Exiting CS:

o On exiting the CS, P sends deferred REPLYs to the processes whose requests were queued (a code sketch of these rules appears at the end of this answer).

Example:

Let’s say we have three processes: P1, P2, and P3.

1. P1 wants to enter CS:

o P1’s logical clock is 10.

o It sends REQUEST(10, P1) to P2 and P3.

2. P2 and P3 receive the request:

o If neither is requesting CS, they reply immediately.

o Suppose P2 is also requesting CS with a timestamp of 12.

o Since 10 < 12, P2 replies to P1 and queues its own request.

3. P1 gets REPLY from both and enters CS.

4. When P1 exits CS:

o It sends a REPLY to P2 (who was deferred).

o Now P2 can enter CS after receiving all replies.

Message Complexity:

 2(N - 1) messages per CS entry:

o (N - 1) REQUESTs + (N - 1) REPLYs.

Advantages:

 Fully distributed — no single point of failure.

 No token required.

 Fairness ensured via timestamps (Lamport clocks).

 Works well when requests are frequent.
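As referenced above, here is a compact, single-file sketch of the Ricart-Agrawala rules. Message transport is abstracted away: the send callback, class name, and field names are hypothetical placeholders, not part of the original algorithm's specification.

class RicartAgrawalaNode:
    def __init__(self, pid, all_pids):
        self.pid = pid
        self.peers = [p for p in all_pids if p != pid]
        self.clock = 0
        self.requesting = False
        self.request_ts = None
        self.replies_pending = set()
        self.deferred = []                 # requests answered only after leaving the CS

    def request_cs(self, send):
        self.clock += 1
        self.requesting = True
        self.request_ts = (self.clock, self.pid)   # Lamport timestamp; pid breaks ties
        self.replies_pending = set(self.peers)
        for q in self.peers:
            send(q, ("REQUEST", self.request_ts, self.pid))

    def on_request(self, ts, sender, send):
        self.clock = max(self.clock, ts[0]) + 1
        # Reply immediately unless our own outstanding request has priority (smaller timestamp).
        if self.requesting and self.request_ts < ts:
            self.deferred.append(sender)
        else:
            send(sender, ("REPLY", self.pid))

    def on_reply(self, sender):
        self.replies_pending.discard(sender)
        return not self.replies_pending    # True -> all replies received, may enter CS

    def release_cs(self, send):
        self.requesting = False
        for q in self.deferred:            # answer the deferred requests on exit
            send(q, ("REPLY", self.pid))
        self.deferred.clear()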


7. What do you mean by absence of global clock? Give the impact of the
absence of global time.
Meaning of Absence of Global Clock:

A global clock would allow all processes in the distributed system to have a
consistent view of time. However, in reality:

 Clocks are maintained independently.

 Clock synchronization is not instantaneous or perfect.

 Message delays and system latency make it impossible to assume any fixed
or consistent time ordering across distributed nodes.

As a result, event ordering, which is straightforward in centralized systems, becomes ambiguous in distributed systems.

Impacts of Absence of Global Time:

1. Event Ordering Becomes Uncertain:

o In a centralized system, you can say that Event A happened before Event B by comparing timestamps.

o In a distributed system, without a global clock, it’s difficult to say definitively which event occurred first across different machines.

2. Causal Relationships Are Hard to Track:

o Determining causal dependencies between events (i.e., whether Event A caused Event B) becomes complex.

o This affects debugging, logging, and tracing of system behavior.

3. Difficulties in Coordination and Synchronization:

o Distributed algorithms often rely on the order of operations.

o Without a global time, coordination mechanisms must rely on logical clocks, not actual time.

4. Challenging Consistency Models:

o In distributed databases or systems like Google Spanner, consistency is harder to maintain without synchronized clocks.

o Operations like replication, snapshots, and backups can suffer from inconsistencies if time is not globally coordinated.

5. Problems in Resource Allocation:

o Algorithms that allocate resources or schedule tasks may face race conditions or inconsistencies due to the lack of synchronized time.

6. Consensus and Agreement Issues:

o Protocols like Paxos or Raft need to assume asynchronous timing models and cannot rely on synchronized clocks.

o Reaching consensus becomes difficult because nodes cannot determine when others should have responded.

8. What is Logical clock and Vector clock? How logical clock can be used to
implement semaphores?
Lamport Logical Clock, proposed by Leslie Lamport in 1978, is a simple
mechanism to establish a partial ordering of events in distributed systems.

Vector clocks extend Lamport clocks to capture causal relationships more precisely and to detect concurrent events.

Using Logical Clocks to Implement Semaphores:

In centralized systems, semaphores (e.g., binary or counting semaphores) control access to critical sections. In distributed systems, logical clocks help emulate similar control.

Approach:

1. Request Queue: Each process sends a request to enter the critical section
with its logical timestamp.

2. Ordering Requests: Requests are ordered based on logical clock values. The
lowest timestamp wins.

3. Accessing CS:

o A process enters the critical section only when:

 Its request is at the front of the queue.


 It has received acknowledgments (or REPLYs) from all other
processes.

4. Releasing Semaphore:

o After leaving the critical section, the process sends release messages,
and other processes update their request queues.

This method mimics the behavior of a semaphore in a distributed setup using logical ordering of requests.
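A minimal sketch of this request-queue approach, assuming the broadcasting of requests, acknowledgments, and releases happens outside the class; the class and method names are illustrative.

import heapq

class DistributedSemaphore:
    def __init__(self, pid, n_processes):
        self.pid = pid
        self.n = n_processes
        self.clock = 0
        self.queue = []                    # min-heap of (timestamp, pid) requests
        self.acks = set()

    def local_request(self):
        self.clock += 1
        heapq.heappush(self.queue, (self.clock, self.pid))
        self.acks = set()
        return (self.clock, self.pid)      # broadcast this request to all peers

    def on_remote_request(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        heapq.heappush(self.queue, (ts, sender))
        return (self.clock, self.pid)      # acknowledgment sent back to the sender

    def on_ack(self, sender):
        self.acks.add(sender)

    def may_enter(self):
        # Head of the queue plus acks from every other process -> semaphore acquired.
        return (self.queue and self.queue[0][1] == self.pid
                and len(self.acks) == self.n - 1)

    def release(self):
        heapq.heappop(self.queue)          # also broadcast a RELEASE so peers pop too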

9. What is a Lamport clock? What are its limitations? What is the use of Virtual
time in Lamport's logical clock?
Lamport Clock is a mechanism used in distributed systems to provide a partial
ordering of events. It was introduced by Leslie Lamport in 1978 to handle the issue
of event ordering in systems without a global clock. Unlike physical clocks that
require synchronization across nodes, Lamport’s logical clock provides a way to
order events based on causal relationships.

Limitations of Lamport Clocks:

While Lamport clocks are an effective tool for ordering events, they do have several
limitations:

1. Cannot Detect Concurrent Events:

o Lamport clocks can only guarantee that events in a causal relationship will have an ordered timestamp. However, they do not provide information about whether two events are concurrent or independent.

o In Lamport’s system, two independent events happening at different processes may have the same or overlapping clock values, making it impossible to determine if they occurred concurrently.

2. Limited Causality Information:

o Lamport clocks provide only partial ordering, meaning they only guarantee that if A → B (event A happens before event B), then clock(A) < clock(B).

o They cannot provide any additional information about the relationship between events that are not causally related. For example, if A and B are independent, the Lamport clocks will not distinguish between their ordering.

3. Non-Deterministic Ordering for Concurrent Events:

o If two events are concurrent (they do not have a direct causal relationship), Lamport clocks can assign the same or different timestamps, but there’s no guarantee on which event will have a higher timestamp. This creates ambiguity when handling concurrency, as Lamport clocks do not give a mechanism to distinguish between events that happened simultaneously.

Virtual Time in Lamport's Logical Clock:

Virtual time refers to the system of timestamps created by Lamport clocks that
order events within a distributed system. It is called "virtual" because it does not
rely on physical time or any actual clock.

1. Virtual Time as a Logical Clock:

o The virtual time (logical clock) maintains the sequence of events across distributed processes without needing synchronized physical clocks. Each process maintains its own local logical time that is updated according to Lamport’s rules.

2. Usage of Virtual Time:

o Causal Ordering of Events: Virtual time ensures that all causally related events across processes are ordered properly. If one process sends a message to another, virtual time guarantees that the receiver will see the message after the sender has sent it, based on the timestamps.

o Ensuring Consistency in Distributed Systems: Virtual time helps in ensuring consistent behavior in distributed systems by maintaining an ordering that respects causal relationships, which is important for tasks like distributed logging, debugging, and system consistency.

10. What do you mean by causal ordering of message? Give an example of it.
Causal Ordering:

In a distributed system, messages are sent between processes, and these messages
might carry information about events that have occurred in one process. Causal
ordering ensures that if an event A causes event B (i.e., A → B), then A must happen
before B, and the message conveying event B must be delivered after the message
conveying event A. In other words, causality should be maintained in the order of
message delivery.

Causal ordering does not require a global clock, but it uses a logical clock (such as
Lamport's clock or vector clocks) to track the order of events within each process
and their causal relationships.

Causal Ordering in Distributed Systems:

1. Happens-Before Relation: The happens-before relation is central to causal ordering. If an event A occurs before event B in the same process or
across processes, it must hold that A → B, meaning that B cannot be observed
until A has occurred. This relationship is crucial to avoid situations where
one process perceives an event happening before another, even when
causally it cannot happen that way.

2. Logical Clocks and Causal Ordering: The Lamport clock (or vector clock) is
often used to implement causal ordering. When a process sends a message, it
tags the message with its current logical clock value. When the receiving
process receives the message, it checks whether the clock of the sender is
less than its own clock and updates accordingly, ensuring that the messages
are received in the correct order.

3. Causal Delivery: Causal ordering ensures that messages are delivered in an order where causality is respected, regardless of network delays or
concurrent events in different processes. This ordering helps in distributed
systems like messaging queues, databases, and multi-threaded applications
where the sequence of events is important for correctness.

Example of Causal Ordering:

Consider a distributed system with two processes, P1 and P2, and two events, E1
and E2, where E1 occurs in P1 and causes E2 in P2.

 P1 sends a message to P2 after event E1 occurs.

 P2 then performs event E2 in response to receiving the message from P1.

In this case, E1 causally precedes E2 (i.e., E1 → E2).

Let’s say the events are assigned logical clocks using Lamport’s clock.

1. At P1:
o E1 occurs, and P1 increments its clock, say it becomes clock(P1) = 1.

o P1 sends a message to P2 with the timestamp 1.

2. At P2:

o P2 receives the message, and its own clock is initially clock(P2) = 0.

o P2 updates its clock to clock(P2) = max(clock(P2), 1) + 1 = 2.

o Event E2 occurs at P2, and the system ensures that E2 happens after
E1 due to the happens-before relation.
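A small sketch reproducing the clock arithmetic in this example; the helper function name is illustrative, and only the receive rule is shown.

def receive_update(local_time, msg_time):
    # Lamport receive rule: take the max of the two clocks, then tick once.
    return max(local_time, msg_time) + 1

clock_p1 = 0
clock_p1 += 1                    # E1 occurs at P1 -> clock(P1) = 1
msg_timestamp = clock_p1         # the message to P2 carries timestamp 1

clock_p2 = 0                     # P2 starts with clock(P2) = 0
clock_p2 = receive_update(clock_p2, msg_timestamp)   # max(0, 1) + 1 = 2

print(clock_p1, clock_p2)        # 1 2, so E2 at P2 is ordered after E1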

11. If process p sends two messages m1 and m2 to process q, what problem may arise if received out of order?
Problems Arising from Out-of-Order Message Delivery:

1. Violation of Causal Consistency: One of the most significant problems that arise from out-of-order message delivery is the violation of causal
consistency. In a distributed system, certain events may have causal
dependencies. If a message that depends on another is processed first, it may
lead to inconsistent or erroneous behavior in the system.

2. Inconsistent State: Out-of-order message delivery can cause processes in a distributed system to reach inconsistent states. If the messages are intended
to execute in a specific order, such as one message updating a shared
resource and the next one reading the updated resource, receiving them out
of order can cause incorrect processing.

3. Race Conditions and Deadlocks: Out-of-order message delivery can introduce race conditions in distributed systems. A race condition occurs
when multiple processes or threads compete to access a shared resource or
data in an unsynchronized manner, leading to unpredictable outcomes.

4. Failure in Event Processing: In distributed systems, events may trigger specific actions or processes. If messages are received out of order, it could
cause failures in event processing. For example, if p sends a message to
initiate a transaction, and another message for logging or rollback is received
first, it can result in an incomplete or failed transaction.

5. Inaccurate or Incomplete Communication: In some systems, such as communication protocols or messaging queues, receiving messages out of
order can lead to communication breakdowns or incomplete information
being processed. For instance, in a messaging system where messages are
meant to be processed in sequence, processing them in the wrong order may
lead to incorrect responses or lost data. This is especially important in
protocols where acknowledgments or handshakes are exchanged between
processes in a specific sequence.

6. Incorrect Event Ordering and System Behavior: Certain distributed systems are sensitive to the exact order in which events occur. For example,
in multi-party computations or consensus protocols, the order of
messages often dictates the flow of the algorithm. If messages are delivered
out of order, it can break the correctness of algorithms like Paxos or Raft.

12. Develop an algorithm which guarantees causal ordering of messages.


In distributed systems, ensuring causal ordering of messages is important to
maintain the consistency and correctness of the system. Causal ordering ensures
that messages are delivered in a manner that respects the causal relationships
between events. If an event A causally influences another event B, then message A should be delivered before message B.

To achieve causal ordering, we can use Vector Clocks. A Vector Clock is a mechanism that captures the causal relationships between events in a distributed system and ensures that messages are processed in the correct order.

Causal Ordering Using Vector Clocks

Vector Clocks work by associating a vector of timestamps with each event, where
each process in the system maintains a logical clock. When a process sends a
message, it attaches its vector clock to the message. The receiving process uses this
vector clock to update its own clock, ensuring the message ordering is maintained
according to the causal relationship.

Vector Clock Algorithm for Causal Ordering

1. Initialization: Each process P_i in the distributed system maintains a vector clock V_i. The vector clock is a vector of size n, where n is the total number of processes in the system. Initially, each process has its vector clock set to V_i = [0, 0, ..., 0].

2. Event Generation: When a process P_i performs an event (e.g., computation, state change), it increments its local clock in the vector V_i at position i. This means that the clock for the process P_i is incremented while other entries in the vector clock remain unchanged: V_i[i] = V_i[i] + 1.

3. Message Sending: When a process P_i sends a message to another process P_j, it sends the current vector clock V_i along with the message. This allows the receiving process to know the state of P_i's clock when the message was sent.

4. Message Receiving and Vector Clock Update: Upon receiving a message, the receiving process P_j compares its own vector clock V_j with the vector clock V_i received with the message. The receiving process updates its vector clock as follows: V_j[k] = max(V_j[k], V_i[k]) for every position k, and then increments its own entry V_j[j].

5. Message Delivery Order: To guarantee causal ordering of messages, each process P_i keeps a queue of incoming messages. When a process P_i receives a message, it checks whether it can process the message in causal order. The message can be processed only if all preceding messages (as determined by their vector clocks) have been received. If not, the message is queued until all preceding messages arrive.

6. Causal Ordering of Messages: When processing multiple messages, the system must ensure that the messages are delivered respecting the causal dependencies. For instance, if message m1 causally precedes message m2, the system ensures that m1 is processed before m2, regardless of the physical order in which the messages are received. This is achieved by comparing the vector clocks of the messages.
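A minimal sketch of these rules; the class and method names are illustrative, and a real implementation would also buffer messages that fail the delivery check until their predecessors arrive.

class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n                         # one entry per process

    def local_event(self):
        self.v[self.pid] += 1                    # rule 2: tick own entry

    def send(self):
        self.local_event()
        return list(self.v)                      # rule 3: attach a copy to the message

    def can_deliver(self, msg_v, sender):
        # Causal delivery check (rule 5): the message is the next one expected from
        # the sender, and we have already seen everything the sender had seen.
        return (msg_v[sender] == self.v[sender] + 1 and
                all(msg_v[k] <= self.v[k]
                    for k in range(len(self.v)) if k != sender))

    def on_deliver(self, msg_v):
        # Rule 4: component-wise max merge, then tick own entry.
        self.v = [max(a, b) for a, b in zip(self.v, msg_v)]
        self.v[self.pid] += 1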

13. What are Deadlock and Starvation? Explain the fundamental causes and
detection methods.
Deadlock

A deadlock occurs in a system when a set of processes become blocked because each process is waiting for another process to release resources or perform actions
that would allow them to continue. In other words, a set of processes is in a circular
wait, where no process can proceed because they are all waiting for each other to
release resources.

For example:

 Process A holds resource R1 and is waiting for resource R2.

 Process B holds resource R2 and is waiting for resource R1. In this scenario,
both processes are in a deadlock situation because neither can proceed until
the other releases the resources, but neither can release their resources
because they are waiting for the other.
Conditions for Deadlock

Deadlocks typically occur if all four of the following conditions are met:

1. Mutual Exclusion: Resources involved in the deadlock can only be held by one process at a time.

2. Hold and Wait: A process holding at least one resource is waiting for
additional resources that are being held by other processes.

3. No Preemption: Resources cannot be forcibly taken from the processes holding them; they must be released voluntarily.

4. Circular Wait: A set of processes exists such that each process is waiting for
a resource held by the next process in the set.

Starvation

Starvation is a situation where a process is perpetually delayed in getting access to resources because other processes continuously preempt the resources. While
deadlock involves all processes in a circular waiting pattern, starvation occurs when
a process is indefinitely delayed because other processes are prioritized or consume
resources continuously.

For example, a process may have a low priority and never gets a chance to execute if
higher-priority processes keep consuming CPU time. This may lead to starvation of
the lower-priority process.

Causes of Deadlock and Starvation

 Deadlock Causes:

o Resource contention: When multiple processes compete for limited resources, a deadlock situation can arise if resources are not allocated in a way that avoids circular waits.

o Incorrect process synchronization: Improper synchronization of processes and resource allocation can result in situations where processes wait indefinitely.

o Concurrency issues: Poor handling of simultaneous access to shared resources by multiple processes can lead to deadlock conditions.

 Starvation Causes:
o Priority Inversion: When high-priority processes continuously
preempt low-priority processes, the low-priority processes may never
get a chance to execute.

o Resource hogging: If a process consumes all available resources, others may starve.

o Improper resource scheduling: Poor scheduling algorithms may give an unfair advantage to some processes, causing others to wait indefinitely.

Deadlock Detection Methods

Deadlock detection aims to identify if a deadlock has occurred and resolve it. Some
of the primary methods for detecting deadlocks include:

1. Wait-for Graphs:

o A wait-for graph is a directed graph used to detect deadlock. Each process is represented as a node, and an edge is drawn from process P_i to P_j if P_i is waiting for a resource held by P_j.

o A deadlock is detected when there is a cycle in the wait-for graph. This cycle indicates that the processes in the cycle are waiting for each other, forming a deadlock (a small cycle-check sketch follows this list).

2. Resource Allocation Graph (RAG):

o A resource allocation graph is another graph-based method where nodes represent processes and resources. Edges are drawn between processes and resources, and between resources and processes, indicating the allocation of resources to processes and the request for resources.

o A cycle in this graph indicates a deadlock situation. If no cycle is present, the system is free from deadlock.

3. Timeout Mechanism:

o This involves setting a timer for each process to wait for a resource. If
a process waits for more than a predefined threshold without
receiving the resource, it is assumed that a deadlock has occurred.

4. Banker's Algorithm (for Deadlock Detection):

o A detection variant of the Banker's algorithm is used to check for deadlock in a system where multiple resources are involved. The algorithm checks whether resource allocation leads to a safe state or if a process must wait for resource allocation. If processes cannot finish due to the unavailability of resources, the system enters a deadlock state.
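As referenced above, here is a small sketch of cycle detection on a wait-for graph; the graph is represented as a plain dictionary and the names are illustrative.

def has_deadlock(wait_for):
    """Detect a cycle in a wait-for graph given as {process: set(processes it waits on)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {p: WHITE for p in wait_for}

    def dfs(p):
        color[p] = GRAY
        for q in wait_for.get(p, ()):
            if color.get(q, WHITE) == GRAY:       # back edge -> cycle -> deadlock
                return True
            if color.get(q, WHITE) == WHITE and dfs(q):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and dfs(p) for p in wait_for)

# P1 waits on P2, P2 waits on P1: classic circular wait.
print(has_deadlock({"P1": {"P2"}, "P2": {"P1"}}))   # True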

Starvation Detection and Prevention

Starvation can be detected by monitoring the progress of processes. If a process is continuously waiting while others are executing, it can be flagged as starved. To prevent starvation, systems may:

1. Implement Aging: Gradually increase the priority of a process that has been waiting for a long time, ensuring it eventually gets resources.

2. Fair Scheduling: Use scheduling algorithms such as Round-Robin that allocate resources fairly to processes.

3. Priority Inversion Handling: Use priority inheritance to avoid situations where lower-priority processes block higher-priority ones.

14. Explain the concept of Processes and Threads with state transition diagram.
A process is an instance of a program in execution. It includes the program's code,
its current activity, a stack, a heap, and all the resources needed to perform the
computation. A process is a more heavyweight entity compared to a thread because
it has its own memory space and system resources.

 Components of a Process:

1. Program Code (text section): The instructions to be executed.

2. Data Section: The global variables.

3. Heap: Memory dynamically allocated during execution.

4. Stack: Stores the execution history, including function calls and local
variables.

5. Registers: Stores the program's counter and other states of the CPU
during execution.

A process runs independently, and the operating system is responsible for managing the resources allocated to each process.
A thread, on the other hand, is the smallest unit of execution within a process.
Threads within the same process share the same memory space and resources but
run independently. A process can contain multiple threads, all of which have their
own stack but share the same heap and code section.

 Thread Components:

1. Program Counter (PC): Points to the next instruction to execute.

2. Stack: Stores local variables and function calls for that specific thread.

3. Registers: Used by the CPU for executing instructions related to the thread.

Threads are sometimes referred to as "lightweight processes" because they share resources like memory and I/O with other threads in the same process, which allows for more efficient context switching and resource utilization.

State Transition Diagram

Both processes and threads can have different states throughout their lifecycle. The
state transition diagram helps us visualize how processes and threads move
through various states.

Process State Diagram

The process state diagram consists of the following states:

1. New: The process is being created.

2. Ready: The process is ready to execute but is waiting for CPU time.

3. Running: The process is being executed by the CPU.

4. Blocked/Waiting: The process cannot proceed until some external event occurs (like waiting for I/O or a resource).

5. Terminated: The process has completed its execution and is being removed
from the system.

The transition between these states occurs based on events such as process
creation, scheduling, waiting for resources, or completion.
Thread State Diagram

Threads have similar states, but they can also transition between more specific
states as they interact with the process. The thread states are:

1. New: The thread is created but has not yet started execution.

2. Runnable: The thread is ready to run, but the scheduler is yet to allocate
CPU time to it.

3. Running: The thread is currently executing.

4. Blocked/Waiting: The thread is waiting for a resource or event, such as I/O completion.

5. Terminated: The thread has finished its execution.

In addition to these basic states, threads in some systems also have an "on hold"
state when temporarily suspended by the system for various reasons, like context
switching.

Thread State Transition Diagram:

+-------+     +----------+     +-----------+     +--------------+
|  New  |---->| Runnable |<--->|  Running  |---->|  Terminated  |
+-------+     +----------+     +-----------+     +--------------+
                    ^                |
                    |                v
                    |          +-----------+
                    +----------|  Blocked  |
                               +-----------+
15. What are agreement protocols? Discuss the general system model and
classification.
Agreement protocols are mechanisms used in distributed systems to ensure that a
set of processes or nodes reach a consensus on a particular decision or state. These
protocols are crucial in environments where no single process has control over the
entire system, and therefore, agreements need to be reached in a decentralized
manner. In distributed computing, agreement protocols are particularly important
for maintaining consistency, synchronization, and coordination among distributed
components that may fail or behave unpredictably.

Key Characteristics of Agreement Protocols

 Consistency: Agreement protocols ensure that all processes in the system agree on a single value or decision.

 Fault Tolerance: The protocol must tolerate faults such as message delays, crashes, or network partitions, ensuring that the system continues to function correctly.

 Fairness: Every participating process must have a fair opportunity to contribute to the agreement and obtain the final decision.

In distributed systems, agreement protocols are essential for resolving conflicts, particularly when processes do not have a global view of the system, and their actions may affect others.

General System Model

The general system model for agreement protocols involves the following elements:

1. Processes (Participants): These are the entities that need to reach agreement. Each process has local knowledge and may be subject to failure.

2. Messages: Communication is carried out through messages passed between processes. These messages may be subject to delays, loss, or corruption, making agreement more complex.

3. Decision: The objective of an agreement protocol is to ensure that all processes agree on the same value, decision, or state.

4. Failures: Processes may fail in various ways: by crashing, losing messages, or behaving maliciously. Agreement protocols must ensure that the system can still make progress even in the face of such failures.
The system can be modeled as a set of processes connected by communication
channels. Each process maintains a state and communicates asynchronously with
others to exchange messages. However, despite the lack of a global clock and the
potential for asynchronous failures, these processes must reach a common decision.

Classification of Agreement Protocols

Agreement protocols can be classified based on the types of problems they are
solving and the assumptions made about the system’s behavior. Some of the
common classifications are as follows:

1. Consensus Protocols

Consensus protocols are used to ensure that a group of processes can agree on a
single value despite failures and network issues. The most famous example of a
consensus problem is the Byzantine Generals Problem, where processes (or
generals) must agree on a plan of action (attack or retreat) despite some generals
potentially being traitors.

 Key Problem: Given that some processes may fail or behave maliciously, the
challenge is to reach consensus on a value in a fault-tolerant manner.

 Example: Paxos and Raft are widely used consensus protocols in distributed
systems. These protocols are designed to ensure that even if some processes
fail, the remaining processes can still come to a consensus.

2. Agreement on Binary Decisions (Binary Consensus)

In binary consensus protocols, the processes need to agree on one of two values
(often 0 or 1). This problem becomes complex when processes are asynchronous
and may fail.

 Key Problem: The goal is for every process to either choose a value 0 or 1,
and for all non-failing processes to choose the same value, despite failures
and the possibility of asynchronous communication.

 Example: The Fischer-Lynch-Paterson (FLP) impossibility result proves that, in a fully asynchronous distributed system, no deterministic protocol can guarantee consensus if even one process may crash.

3. Agreement on Multi-Value Decisions (Multi-Value Consensus)

Unlike binary consensus, multi-value consensus involves processes agreeing on a set of potential values. This type of protocol is used when the decision space is larger than two values, which adds complexity to the problem, especially with multiple failures and asynchrony.

 Key Problem: Processes must agree on one of a large number of potential values, and all non-failing processes must converge on the same value.

 Example: In Multi-Paxos or Raft protocols, processes can agree on a sequence of commands or logs, rather than just one value, allowing more complex distributed systems to operate effectively.

4. Byzantine Agreement Protocols

A Byzantine Agreement Protocol (also known as Byzantine Fault Tolerance) is a stronger form of consensus protocol designed to handle situations where some processes might be faulty or malicious and actively try to disrupt the system. This is essential in distributed systems where not only crashes but also arbitrary faults must be tolerated.

 Key Problem: The goal is to ensure that all non-faulty processes agree on a
decision, even if some processes are malicious or behave arbitrarily.

 Example: Byzantine Fault Tolerant protocols like PBFT (Practical Byzantine Fault Tolerance) and Tendermint ensure that processes can still reach a consensus even if fewer than one third of the processes are compromised.

5. Majority-based Agreement Protocols

In some cases, agreement protocols can rely on a majority voting mechanism. In these protocols, processes communicate and vote on a value, and the value that gets the majority of votes becomes the agreed-upon decision.

 Key Problem: Processes need to collect enough votes from other processes
to ensure a majority consensus on a decision, even if some processes are
faulty or slow.

 Example: Raft Consensus Algorithm relies on leader election and majority-based voting to ensure the consistency of logs in distributed systems.

16. Explain the Byzantine Agreement problem and Consensus problem.


The Byzantine Agreement Problem, also known as the Byzantine Generals
Problem, is a fundamental problem in distributed computing and fault tolerance. It
was introduced by Leslie Lamport, Robert Shostak, and Marshall Pease in 1982 to
address the challenges of achieving consensus in the presence of faulty or malicious
participants in a distributed system.

In the Byzantine Agreement Problem, a group of generals (processes) must agree on a common plan of action (e.g., whether to attack or retreat). The challenge arises
because some of the generals may be traitors who attempt to disrupt the agreement
by sending conflicting or incorrect information to others. Despite the traitors, the
goal is for the non-faulty generals to reach a consistent decision.

The problem highlights the difficulty of achieving consensus in systems where processes (or nodes) are not guaranteed to behave correctly, and may fail or act
maliciously. This problem is relevant in scenarios such as distributed databases,
blockchain systems, and multi-agent systems, where consensus must be reached in
the presence of failures or adversarial conditions.

Key Characteristics of the Byzantine Agreement Problem

1. Faulty Processes: Some of the processes may be faulty or malicious and can
send arbitrary, misleading messages to others. These faulty processes are
called Byzantine nodes, and they may attempt to disrupt the agreement or
prevent the system from reaching consensus.

2. Consensus: The goal is for the remaining non-faulty processes (also called
honest nodes) to agree on the same decision, even if some processes are
faulty or malicious. The system must ensure that:

o All non-faulty processes agree on the same value (agreement property).

o If all non-faulty processes initially propose the same value, then they
must decide on that value (validity property).

3. Asynchronous Communication: The system is typically assumed to operate in an asynchronous setting, meaning that messages can be delayed, and there is no guarantee on message delivery times.

4. Arbitrary Failures: A faulty process may behave arbitrarily (i.e., it may send
incorrect or contradictory messages to different processes), making the
problem more challenging than simpler failure models like crash failures.
Byzantine Fault Tolerance (BFT)

To solve the Byzantine Agreement Problem, systems use Byzantine Fault Tolerant
(BFT) algorithms. These algorithms allow a distributed system to reach consensus
even if some of the processes are Byzantine.

 BFT Algorithm Example: One of the most widely known BFT algorithms is
the Practical Byzantine Fault Tolerance (PBFT) protocol. PBFT works by
having a primary process (also known as a leader) propose a value, and then
the system relies on the majority of processes to verify and agree on that
value. PBFT is capable of tolerating up to ⌊(n − 1)/3⌋ Byzantine faulty processes out of a total of n processes, ensuring that a decision can still be reached even if some nodes are malicious. For example, a cluster of n = 7 replicas can tolerate f = 2 Byzantine faults, since n ≥ 3f + 1.

Consensus Problem

The Consensus Problem is a broader and simpler problem compared to the Byzantine Agreement Problem. It deals with ensuring that a set of processes or
nodes in a distributed system agree on a single value or decision, even in the face of
process crashes or network delays. The Consensus Problem is a foundational
problem in distributed systems, and its solution forms the basis for more complex
problems, such as the Byzantine Agreement.

In the Consensus Problem, there is an assumption that processes may fail in one of
two ways:

1. Crash Failures: A process may crash or become unresponsive, but it does not behave maliciously or send arbitrary messages.

2. Byzantine Failures: A process may fail by acting maliciously and sending incorrect or contradictory information (this is part of the Byzantine Agreement Problem).

Key Characteristics of the Consensus Problem

1. Agreement: All non-faulty processes must agree on the same value.

2. Termination: Every non-faulty process must eventually decide on a value (i.e., the protocol must not be stuck in an infinite loop).

3. Validity: If all non-faulty processes propose the same value initially, that
value must be chosen by all processes.
4. Fault Tolerance: The system should continue to function correctly even in
the presence of faults, typically crash failures. Byzantine failures are handled
in more advanced consensus protocols.

17. Compare centralized, distributed and hierarchical approaches for deadlock detection.
Deadlock is a situation in a distributed system where a set of processes are blocked
because they are each waiting for a resource that is held by another process in the
set. This causes a circular dependency, and none of the processes can proceed.
Deadlock detection and resolution are critical in distributed systems to maintain
system efficiency and avoid indefinite process blocking.

There are different approaches to deadlock detection in distributed systems, including centralized, distributed, and hierarchical approaches. Each of these
approaches has its advantages and disadvantages depending on the system’s
architecture, scalability, and failure conditions.

1. Centralized Deadlock Detection

In a centralized approach, a single process (or node) is responsible for monitoring and detecting deadlock in the system. This process, known as the central
coordinator, maintains a global view of the system's resource allocation graph. The
coordinator receives information from all processes about resource allocation and
requests and uses this information to build a global graph for deadlock detection.

How it works:

 Every process in the system sends its resource allocation and request
information to the centralized coordinator.

 The coordinator periodically checks the global resource allocation graph for
cycles. If a cycle is detected, it indicates the presence of a deadlock.

 The coordinator may then take corrective action, such as aborting a process
to break the deadlock.

2. Distributed Deadlock Detection

In a distributed approach, deadlock detection is performed in a decentralized manner, where each process is responsible for detecting deadlocks involving itself
and its neighbors. There is no central coordinator, and each process maintains local
information about resource allocation and requests. The processes communicate
with each other to detect deadlocks.
How it works:

 Each process maintains a local wait-for graph (WFG), which shows which
processes are waiting for resources held by other processes.

 Processes periodically send messages to each other to exchange information about their wait-for relationships.

 If a process detects a cycle in its local graph, it tries to propagate this information to other processes. Eventually, a distributed algorithm will identify whether a global cycle exists, indicating a deadlock.

3. Hierarchical Deadlock Detection

In a hierarchical approach, the system is divided into multiple groups or clusters of processes, and each cluster has a local coordinator responsible for detecting
deadlock within that cluster. The local coordinators are then arranged in a
hierarchical manner, with a top-level coordinator responsible for detecting
deadlocks across clusters.

How it works:

 Each process communicates with a local coordinator to report resource allocation and requests.

 The local coordinator constructs a local wait-for graph and checks for cycles to detect deadlocks within its cluster.

 If a local coordinator detects a deadlock, it may take action, such as terminating one of the processes involved in the deadlock.

 If a deadlock occurs across clusters, the top-level coordinator is responsible for resolving the global deadlock.

18. What is transaction? Explain ACID properties and compare flat vs nested
transactions.
A transaction is a sequence of operations performed as a single logical unit of work
in a distributed system or database. These operations are intended to ensure that
the system remains in a consistent state, even if some operations fail during
execution. The primary goal of a transaction is to ensure data integrity, consistency,
and reliability. In distributed systems, transactions often span multiple machines or
databases.

Transactions must adhere to the ACID properties to maintain correctness:


ACID Properties:

The ACID acronym stands for Atomicity, Consistency, Isolation, and Durability.
Each of these properties ensures that transactions behave in a way that preserves
the integrity of the system.

1. Atomicity:

o A transaction is an "all-or-nothing" operation. Either all of the operations in the transaction are successfully completed, or none of them are. If a failure occurs in the middle of a transaction, any changes made by the transaction are rolled back, and the system remains in its previous state.

o Example: Consider a bank transfer. If money is debited from one account but not credited to the other, the entire transaction is rolled back to maintain consistency.

2. Consistency:

o A transaction brings the system from one consistent state to another. The database constraints (such as foreign keys, integrity rules, etc.) must be respected, and no transaction should leave the system in an inconsistent state.

o Example: A transfer from one bank account to another must maintain the total sum of the money in the system.

3. Isolation:

o Transactions must operate independently of each other. The intermediate states of a transaction are not visible to other transactions. This ensures that the outcome of concurrent transactions is as if they were executed sequentially.

o Isolation levels range from Read Uncommitted (where one transaction can see changes from another uncommitted transaction) to Serializable (where transactions are executed as if they were serialized, with no overlap).

4. Durability:

o Once a transaction has been committed, its changes are permanent and cannot be undone, even in the case of a system failure. Data is written to stable storage, such as a disk or cloud service, ensuring that it persists.

o Example: After completing a bank transaction, even if the system crashes, the change will persist in the system.

Comparison Between Flat and Nested Transactions:

Feature         | Flat Transactions                                        | Nested Transactions
Structure       | Single-level, linear                                     | Multi-level, hierarchical (parent-child)
Atomicity       | Applied to the entire transaction                        | Applied to both parent and sub-transactions
Rollback        | Entire transaction is rolled back on failure             | Only the failed sub-transaction is rolled back
Complexity      | Simpler, easier to implement and understand              | More complex due to multiple levels and dependencies
Concurrency     | Limited concurrency in large systems                     | Better concurrency, as independent sub-transactions can be executed concurrently
Error Handling  | If one operation fails, the whole transaction is undone  | Errors in sub-transactions can be handled independently
Use Case        | Simple operations where all steps must succeed or fail   | Complex operations involving multiple independent tasks
Example         | Bank withdrawal transaction (single set of operations)   | Complex order processing system (order creation, stock checking, payment processing)

19. Explain Two Phase Commit protocol for nested transactions.


Two-Phase Commit Protocol:

The Two-Phase Commit protocol involves two main phases: the prepare phase
and the commit phase. These phases apply not only to flat transactions but are also
adapted to nested transactions to maintain their atomicity and consistency.

1. Phase 1: Prepare Phase (Voting Phase)


 In this phase, the coordinator (usually the initiator of the transaction) sends
a prepare request to all participants involved in the transaction, including
sub-transactions.

 Each participant (including the sub-transactions, which are treated as independent entities in a nested structure) then performs its local operations. These participants check whether they can commit or need to abort based on their local state.

o If a participant is ready to commit, it responds with a "vote yes".

o If a participant encounters an issue (like a failure or inconsistency), it


responds with a "vote no", indicating that the transaction cannot
proceed and must be aborted.

 If all participants (including those of the nested transactions) vote yes, the
transaction proceeds to the second phase. If any participant votes no, the
entire transaction is aborted.

2. Phase 2: Commit Phase (Decision Phase)

 Once the coordinator receives votes from all participants, it makes the final
decision.

o If all participants voted yes, the coordinator sends a commit message


to all participants, signaling that they should finalize their local
operations and make permanent changes to the data.

o If any participant voted no, the coordinator sends an abort message


to all participants, signaling that all participants must roll back their
changes to maintain consistency.

 All participants then either commit or abort their operations accordingly.
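
A minimal sketch of the coordinator side of these two phases is given below. The Participant interface and its prepare/commit/abort methods are assumptions made for illustration; a real coordinator would also write a decision log and handle timeouts and participant failures.

import java.util.List;

// Hypothetical participant interface; in practice these calls would be remote (e.g. via RMI or RPC).
interface Participant {
    boolean prepare();   // returns true for "vote yes", false for "vote no"
    void commit();
    void abort();
}

class TwoPhaseCommitCoordinator {

    // Runs both phases and returns true if the transaction committed.
    boolean execute(List<Participant> participants) {
        // Phase 1: prepare (voting) phase
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.prepare()) {      // any "vote no" aborts the whole transaction
                allYes = false;
                break;
            }
        }
        // Phase 2: commit (decision) phase
        for (Participant p : participants) {
            if (allYes) {
                p.commit();          // all voted yes: make changes permanent
            } else {
                p.abort();           // someone voted no: roll everything back
            }
        }
        return allYes;
    }
}

The same structure applies to nested transactions, where each sub-transaction acts as one of the participants, as described next.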

Nested Transactions in Two-Phase Commit:

In the case of nested transactions, the Two-Phase Commit protocol applies to both
the parent transaction and the sub-transactions. The sub-transactions, like
independent entities, must respond to the prepare request from the coordinator
and vote on whether they are ready to commit or need to abort. The coordinator
considers both the parent and sub-transactions together when making the final
decision.
Nested Two-Phase Commit Protocol Steps:

1. Initiate Parent Transaction:

o The coordinator starts the parent transaction and sends the prepare
request to all participants (including sub-transactions).

2. Sub-Transaction Preparation:

o Each sub-transaction checks its local state and sends a vote yes or
vote no to the coordinator, indicating whether it can commit.

3. Parent Transaction Preparation:

o Once all sub-transactions respond, the coordinator checks their votes


and then prepares the parent transaction.

o If all sub-transactions and the parent vote yes, the parent sends the
prepare request to the next level.

4. Commit or Abort Decision:

o If all sub-transactions and the parent have voted yes, the coordinator
sends a commit message to all involved, indicating that all operations
(in both the parent and sub-transactions) should be committed.

o If any sub-transaction votes no, the parent transaction is aborted, and


all changes are rolled back.

20. Explain Load Distribution in distributed system.


Load Distribution in distributed systems is the process of distributing tasks or
workloads across multiple machines, processors, or nodes in a system to ensure
efficient resource utilization, optimal performance, and better system
responsiveness. The goal is to balance the workload evenly among available
resources, avoid overloading any single node, and achieve high throughput and low
latency.

Load distribution is crucial in distributed systems because these systems typically


involve multiple computing resources located in different geographical locations or
managed by different administrators. The key to maximizing the performance of a
distributed system lies in distributing the work appropriately among its resources.

Key Concepts in Load Distribution:


1. Load Balancing: Load balancing is the process of distributing workloads
evenly across multiple resources to avoid bottlenecks and achieve optimal
performance. In a load-balanced system, no single machine or node should
be overwhelmed with tasks while others remain underutilized.

2. Load Distribution Algorithms: These are mechanisms designed to


distribute tasks across various nodes or processors. The algorithms decide
how to divide the workload efficiently and manage the distribution based on
resource availability, task size, and system health.

3. Task Scheduling: Task scheduling determines when and where to assign


specific tasks or processes to nodes in the system. A good scheduling
mechanism ensures that tasks are completed in a timely manner while
balancing the load effectively across the system.

4. Global Load Information: For efficient load distribution, each node in the
system must have access to the global system load information, which
includes the current workload on each node, resource availability, and the
state of each node (idle, busy, etc.).

Load Distribution Strategies:

Several strategies exist for distributing load in a distributed system. Each strategy
has different benefits and is suited to different system characteristics. Here are the
major strategies:

1. Static Load Distribution:

o In static load distribution, the distribution of tasks is decided in


advance before the tasks are executed. This approach is generally
used when the workload is predictable, and the computing resources
are evenly matched.

o The distribution is fixed, and no adjustments are made at runtime.

o Example: If you have a cluster of servers where each server handles


specific types of requests, the system allocates tasks to servers based
on their capabilities beforehand.

2. Dynamic Load Distribution:

o Dynamic load distribution adapts to the system’s state in real-time by


redistributing tasks based on current system conditions, including
load and available resources.
o It uses algorithms that monitor the load on each node and
dynamically move tasks from overloaded nodes to under-utilized
ones.

o Example: In a cloud computing environment, resources can be scaled


up or down dynamically based on current workload.

3. Centralized Load Distribution:

o In centralized load distribution, one central coordinator (or load


balancer) makes decisions about which tasks should be assigned to
which nodes. The coordinator has a global view of the system’s state
and uses this information to allocate workloads dynamically.

o Example: A web server that handles requests using a central load


balancer. The load balancer checks the server health and resource
utilization and directs requests to the server with the least load.

4. Decentralized Load Distribution:

o In decentralized load distribution, each node makes its own decisions


about load balancing. Nodes communicate with one another to share
state information, but there is no single entity controlling the
distribution process.

o Example: Distributed databases where nodes replicate data and


queries are routed based on the current load and availability of
replicas.

5. Hybrid Load Distribution:

o Hybrid systems combine both centralized and decentralized


approaches. For instance, a central coordinator may handle certain
types of tasks, but nodes can also autonomously handle local tasks
when required.

o Example: A cloud system that uses a combination of central control


for provisioning new virtual machines but allows nodes to decide
locally which tasks to process.
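
As a small, hedged illustration of dynamic, centralized load distribution, the sketch below always assigns the next task to the node with the fewest active tasks. The Node class and its load counter are assumptions for the example; in a real system the load figures would come from monitoring or heartbeat reports rather than a local counter.

import java.util.Comparator;
import java.util.List;

// Hypothetical node with a live load counter, used only for illustration.
class Node {
    final String name;
    int activeTasks;
    Node(String name) { this.name = name; }
}

class LeastLoadedDispatcher {
    private final List<Node> nodes;

    LeastLoadedDispatcher(List<Node> nodes) { this.nodes = nodes; }

    // Dynamic, centralized load distribution: choose the node with the fewest active tasks.
    Node assignTask() {
        Node target = nodes.stream()
                .min(Comparator.comparingInt(n -> n.activeTasks))
                .orElseThrow(() -> new IllegalStateException("no nodes available"));
        target.activeTasks++;   // the coordinator updates its global view of the load
        return target;
    }
}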

21. What is Fault Tolerance in distributed systems?


Fault tolerance in distributed systems refers to the ability of the system to continue
operating correctly even in the presence of hardware or software faults. In a
distributed environment, where multiple independent nodes or processes are
interacting, the system must be able to recover from failures or errors without
interrupting service or data integrity.

Fault tolerance is a critical feature in distributed systems, as these systems typically


span multiple nodes across various locations, each with its own potential for failure.
The presence of faults such as network partitions, server crashes, or communication
delays should not disrupt the overall system's functionality or cause data loss.

Types of Failures in Distributed Systems:

1. Crash Failures:

o These occur when a node or process unexpectedly stops working due


to hardware or software failure. The node may become non-
responsive or go offline. Recovery mechanisms like replication and
checkpointing are often used to handle crash failures.

2. Omission Failures:

o Omission failures happen when a node or process fails to send a


message or fails to receive a message, often due to network issues or
delays. These failures are typically resolved using timeouts or
retransmission mechanisms.

3. Timing Failures:

o A timing failure occurs when a system does not complete an operation


within the specified time bounds. In real-time distributed systems,
this could mean a missed deadline for processing a task, resulting in
incorrect behavior or degraded performance.

4. Response Failures:

o This occurs when a node or process sends incorrect or corrupted data


in response to a request, due to factors such as memory corruption or
logic errors.

5. Communication Failures:

o Failures in communication channels (e.g., network failures, lost


packets, or delays) that prevent processes from exchanging messages,
leading to possible inconsistencies or loss of synchronization in the
system.

6. Byzantine Failures:
o A Byzantine failure refers to the most severe type of failure in
distributed systems, where nodes can behave arbitrarily or
maliciously. In this type of failure, a node may not only crash but can
send incorrect or misleading information to other nodes.

Fault Tolerance Techniques in Distributed Systems:

To ensure that distributed systems can function properly despite failures, various
fault tolerance techniques are implemented. Some of these techniques are discussed
below:

1. Replication:

 Replication involves maintaining copies of data or services across multiple


nodes in the system. If one node or service fails, another replica can take
over, ensuring that the system remains available.

 Data Replication: Frequently used in databases, file systems, and storage


systems to ensure that data is not lost when a node crashes.

 Service Replication: Services or processes are replicated across different


nodes so that if one fails, the others can continue providing the service.

2. Checkpointing:

 Checkpointing is a technique where a system periodically saves the state of


a process or task to a stable storage. If a failure occurs, the system can roll
back to the most recent checkpoint to resume operations without having to
restart from the beginning.

 This is particularly useful in systems where tasks are long-running and must
be preserved even if a failure occurs during execution.
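
A hedged sketch of checkpointing using Java serialization is shown below. The WorkerState class and the checkpoint file name are assumptions for the example; a production system would write checkpoints atomically (for example, write to a temporary file and then rename it).

import java.io.*;

// Hypothetical serializable task state used only to illustrate checkpointing.
class WorkerState implements Serializable {
    private static final long serialVersionUID = 1L;
    long itemsProcessed;
}

class CheckpointExample {
    static final File CHECKPOINT = new File("worker.ckpt");

    // Save the current state to stable storage.
    static void saveCheckpoint(WorkerState state) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(CHECKPOINT))) {
            out.writeObject(state);
        }
    }

    // After a crash, resume from the last checkpoint instead of starting over.
    static WorkerState restore() throws IOException, ClassNotFoundException {
        if (!CHECKPOINT.exists()) return new WorkerState();
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(CHECKPOINT))) {
            return (WorkerState) in.readObject();
        }
    }
}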

3. Redundancy:

 Redundancy involves duplicating critical components (e.g., nodes, services,


communication links) to reduce the risk of system failure. Redundant
components can take over the operations of the failed components.

 Redundancy can be implemented at the hardware, software, or data level,


depending on the system's requirements.

4. Consensus Protocols:
 Consensus protocols help distributed systems agree on a single value or
decision despite failures. They are especially important in fault-tolerant
systems to ensure consistency and reliability when nodes crash or behave
unpredictably.

 Popular consensus protocols include Paxos, Raft, and Quorum-based


protocols. These protocols allow systems to reach agreement even when
some nodes may fail or send incorrect information.

5. Timeouts and Retries:

 Timeouts and retries are mechanisms where a node waits for a response
within a specified time and retries the request if the response is not received.
This ensures that transient network issues or delays do not lead to failure.

 Timeouts are particularly useful in dealing with omission failures (e.g.,


failure to receive a message) and network latency issues.
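
The sketch below shows a simple timeout-and-retry wrapper around a generic request. The use of Callable, the attempt limit, and the timeout value are illustrative assumptions rather than a prescribed design.

import java.util.concurrent.*;

class RetryingCaller {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Runs the request with a per-attempt timeout, retrying a bounded number of times
    // (maxAttempts is assumed to be at least 1).
    <T> T callWithRetry(Callable<T> request, int maxAttempts, long timeoutMillis) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Future<T> future = executor.submit(request);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | ExecutionException e) {
                future.cancel(true);   // give up on this attempt and try again
                last = e;
            }
        }
        throw last;   // all attempts failed: report the failure to the caller
    }
}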

6. Voting and Majority-based Schemes:

 Voting and majority-based schemes are used in distributed systems to


make decisions or recover from faults. When a node fails, the remaining
nodes "vote" to decide the next action or to confirm the failure.

 This is often used in leader election, consensus, and replication protocols to


ensure that the system can tolerate some number of faulty nodes.

7. Failure Detectors:

 Failure detectors monitor the health of nodes or processes and notify the
system when a failure occurs. These detectors help the system decide which
nodes to remove from the network and which nodes to bring back online
after recovery.

 Failure detectors can operate with varying levels of accuracy, from simple
heartbeats to more complex algorithms that detect crashes or arbitrary
failures.
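
A minimal sketch of a heartbeat-based failure detector is given below; the node identifiers and the fixed timeout threshold are assumptions for the example, and real detectors often use adaptive timeouts instead.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatFailureDetector {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    HeartbeatFailureDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    // Called whenever a heartbeat message arrives from a node.
    void heartbeatReceived(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    // A node is suspected to have failed if it has been silent for too long.
    boolean isSuspected(String nodeId) {
        Long last = lastHeartbeat.get(nodeId);
        return last == null || System.currentTimeMillis() - last > timeoutMillis;
    }
}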

8. Backup Systems:

 Backup systems maintain copies of critical data or services that can be


restored in case of failure. This ensures that even if a primary system fails,
the data and functionality can be restored from a backup.
22. What is routing? Discuss correctness criteria for routing algorithms.
Routing in distributed systems refers to the process of determining how data or
messages are forwarded from one node to another within a network, typically
across multiple intermediate nodes. The goal is to find an efficient, reliable, and
scalable way to direct data packets through the network from a source node to a
destination node, while considering various constraints like bandwidth, latency, and
reliability.

In distributed systems, routing is not just about finding paths between two nodes,
but also ensuring that communication occurs efficiently, even if parts of the network
fail or become congested. Routing algorithms are essential for ensuring that data
can traverse the system in an optimal manner, especially in large-scale systems with
a high number of nodes.

Types of Routing in Distributed Systems:

1. Unicast Routing:

o This involves sending a message from one source node to a single


destination node. The routing algorithm ensures that the message
reaches its destination without unnecessary detours.

2. Multicast Routing:

o This involves sending a message from one source node to multiple


destination nodes (a group). Multicast routing algorithms ensure that
messages are efficiently delivered to all members of a group without
excessive overhead.

3. Broadcast Routing:

o This involves sending a message from a source node to all nodes in


the network. Routing protocols must efficiently deliver the message to
all nodes while minimizing network load.

4. Anycast Routing:

o This type of routing allows a message to be delivered to any one of a


set of possible destination nodes, typically the one closest or most
available.

Routing Algorithms:

Routing algorithms can be categorized into two broad types: static and dynamic.
1. Static Routing:

o In static routing, the routes are predetermined and fixed. These routes
do not change during runtime, making the algorithm simple but
inflexible in handling network failures or changes.

o Example: Manually configured, fixed routing-table entries, as used in

small or stable networks, are typical of static routing; adaptive
protocols such as Distance Vector and Link-State routing belong to the
dynamic category described next.

2. Dynamic Routing:

o Dynamic routing allows the routes to be adjusted in real-time based


on changes in the network (e.g., node failures, congestion). It is more
flexible and adaptive, but requires more complex algorithms and
overhead.

o Example: OSPF (Open Shortest Path First), BGP (Border Gateway


Protocol), and Adaptive Routing protocols are examples of dynamic
routing algorithms.
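
As a hedged illustration of the path computation that link-state protocols such as OSPF perform, the sketch below runs a Dijkstra-style shortest-path calculation over an assumed adjacency-map representation of the network; the map layout and node names are illustrative only.

import java.util.*;

class ShortestPathRouting {

    // graph maps each node to its neighbours and the cost of the link to each neighbour.
    // Returns the minimum cost from the source to every reachable node.
    static Map<String, Integer> dijkstra(Map<String, Map<String, Integer>> graph, String source) {
        Map<String, Integer> dist = new HashMap<>();
        PriorityQueue<Map.Entry<Integer, String>> queue =
                new PriorityQueue<Map.Entry<Integer, String>>((a, b) -> Integer.compare(a.getKey(), b.getKey()));
        dist.put(source, 0);
        queue.add(Map.entry(0, source));

        while (!queue.isEmpty()) {
            Map.Entry<Integer, String> current = queue.poll();
            String node = current.getValue();
            if (current.getKey() > dist.getOrDefault(node, Integer.MAX_VALUE)) {
                continue; // stale queue entry: a cheaper path to this node was already found
            }
            for (Map.Entry<String, Integer> edge : graph.getOrDefault(node, Map.of()).entrySet()) {
                int candidate = current.getKey() + edge.getValue();
                if (candidate < dist.getOrDefault(edge.getKey(), Integer.MAX_VALUE)) {
                    dist.put(edge.getKey(), candidate);             // found a cheaper route
                    queue.add(Map.entry(candidate, edge.getKey()));
                }
            }
        }
        return dist;
    }
}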

Correctness Criteria for Routing Algorithms:

When designing and analyzing routing algorithms in distributed systems, several


correctness criteria must be considered to ensure that the algorithm functions
correctly, efficiently, and can adapt to changes or failures in the network. Below are
the primary correctness criteria for routing algorithms:

1. Accuracy (Correctness of Path Calculation):

 The routing algorithm must correctly compute the best or optimal path
between nodes. This means the algorithm must ensure that the chosen path
leads to the destination without getting stuck or going in an incorrect
direction.

2. Termination:

 A routing algorithm should eventually reach a destination. If there are no


routes to the destination, the algorithm should be able to detect that and
return an error or failure message.

3. Optimality:

 The routing algorithm should find the optimal or near-optimal path for data
transmission. In many cases, this means finding the path with the lowest cost,
such as the shortest route, minimal latency, or maximum available
bandwidth.

4. Fault Tolerance (Robustness):

 A good routing algorithm should be fault-tolerant, meaning it can handle


situations where nodes or links in the network fail. The algorithm must be
able to reroute messages via alternate paths if the primary route is
unavailable.

5. Convergence:

 Convergence refers to the ability of the routing algorithm to reach a stable


state where the network has a consistent and optimal set of routes.

 The time it takes for the algorithm to reach this stable state is also crucial. A
routing algorithm with fast convergence is desirable as it can quickly adapt
to network changes such as node failures or additions.

6. Scalability:

 Scalability refers to the algorithm’s ability to handle large-scale networks


efficiently. As the network grows in terms of nodes and links, the routing
algorithm should continue to operate efficiently without significant
performance degradation.

7. Stability:

 Stability ensures that the routing algorithm does not introduce oscillations or
flapping routes. This can happen when there are constant changes in the
network, leading to frequent updates in routing tables. An unstable algorithm
may lead to excessive overhead and unreliable routing decisions.

8. Fairness:

 Fairness refers to the equitable distribution of resources (such as bandwidth


and load) among all nodes in the network. A routing algorithm should avoid
overloading some parts of the network while underutilizing others.

9. Adaptability:

 The routing algorithm must be able to adapt to changing conditions in the


network, such as traffic patterns, topology changes (node joins, leaves, or
failures), and dynamic network environments.
10. Complexity:

 The algorithm should be computationally efficient in terms of memory usage


and processing time. More complex algorithms might provide better
optimizations but at the cost of higher overhead. A balance between
performance and complexity should be maintained.

11. Consistency:

 Consistency ensures that all nodes in the network have a consistent view of
the network topology. This is crucial for maintaining optimal routing
decisions and avoiding routing loops or conflicts.

23. What are distributed objects? Explain RMI with example and role of
proxy/skeleton.
Distributed Objects

In distributed systems, a distributed object refers to an object that exists and


operates across different nodes in a network. Distributed objects represent entities
that can communicate and interact with each other, but they may reside on different
machines within a distributed environment. These objects may encapsulate data
and functionality, just like local objects, but they can be accessed remotely across
the network by other objects.

The main idea behind distributed objects is to allow for transparent communication
between objects that are physically separated but logically part of the same system.
In traditional object-oriented programming (OOP), an object is a bundle of data and
methods that operate on that data. However, in distributed systems, the methods of
an object might be executed on different machines or devices connected over a
network, and these remote interactions should be made as seamless as possible for
developers and users.

Distributed objects support the remote invocation of methods, meaning that a


method call is sent to a remote machine (where the object resides) instead of being
executed locally. The system must manage the complexities of remote
communication, network latency, and potential failure scenarios.

Remote Method Invocation (RMI)

Remote Method Invocation (RMI) is a Java-based mechanism for creating


distributed systems where methods of an object can be invoked remotely (on a
different machine) as if they were local method calls. It allows objects on one
machine to invoke methods on objects residing on another machine in a network.
RMI hides the complexities of network communication, providing an abstraction
that makes it appear as if the local and remote objects are part of the same system.

In a typical RMI setup, a client invokes methods on a remote object as though the
object were local, and the underlying RMI infrastructure takes care of serializing the
method arguments, sending them across the network, and returning the results.

How RMI Works

1. RMI Registry: The RMI registry acts as a directory for all remote objects. The
server registers its remote objects with the RMI registry, and the client looks
up the remote object in the registry before invoking methods.

2. Remote Object: The remote object is a class that implements a remote


interface. The remote interface declares the methods that can be called
remotely, and the implementation of this interface contains the actual
business logic.

3. Client: The client accesses the remote object by looking it up in the RMI
registry, where it then interacts with the object by invoking methods
remotely.

4. RMI Server: The server is responsible for creating the remote object and
registering it with the RMI registry. The server listens for client requests and
responds with results from the remote object.

Role of Proxy and Skeleton in RMI

In the context of RMI, two critical components—Proxy and Skeleton—are used to


facilitate communication between the client and the server:

1. Proxy:

o The proxy (also called the stub) acts as a placeholder for the remote
object on the client side. When a client wants to invoke a method on a
remote object, it interacts with the proxy rather than the actual
remote object. The proxy then forwards the method call to the real
remote object.

o The proxy is responsible for handling the communication between the


client and the remote server. It serializes the method parameters,
sends the request over the network, waits for the response, and then
deserializes the result and returns it to the client.
o The client code doesn't need to know whether it is interacting with a
local or remote object, as the proxy makes this interaction
transparent.

2. Skeleton:

o The skeleton is a server-side component that receives the method


calls sent by the client through the proxy. Once the skeleton receives
the request, it unpacks the data, invokes the appropriate method on
the remote object, and returns the result to the client through the
proxy.

o In earlier versions of Java RMI, explicitly generated skeletons were

necessary for the communication between the proxy and the remote
object. From Java 2 (JDK 1.2) onward the skeleton class is no longer
required, and since JDK 5 stubs can also be generated dynamically at
runtime, so the RMI runtime manages this communication without
explicit skeletons.

Example of RMI

Let's take a look at a simple example of RMI in Java to demonstrate the core
concepts:

1. Remote Interface (defines methods that can be called remotely):

import java.rmi.*;

public interface Hello extends Remote {
    String sayHello() throws RemoteException;
}

2. Remote Object Implementation (implements the remote interface and


provides method logic):

import java.rmi.*;
import java.rmi.server.UnicastRemoteObject;

public class HelloImpl extends UnicastRemoteObject implements Hello {

    public HelloImpl() throws RemoteException {
        super();
    }

    public String sayHello() throws RemoteException {
        return "Hello, world!";
    }
}

3. Server Code (registers the remote object with RMI Registry):

import java.rmi.*;
import java.rmi.registry.*;

public class HelloServer {

    public static void main(String[] args) {
        try {
            // Create an instance of the remote object
            HelloImpl obj = new HelloImpl();

            // Start an RMI registry on the default port (1099)
            LocateRegistry.createRegistry(1099);

            // Bind the object to the RMI registry
            Naming.rebind("Hello", obj);

            System.out.println("Hello Server is ready.");
        } catch (Exception e) {
            System.out.println("Hello Server failed: " + e);
        }
    }
}

4. Client Code (looks up the remote object in the RMI registry and calls the
remote method):

import java.rmi.*;

public class HelloClient {

    public static void main(String[] args) {
        try {
            // Look up the remote object in the RMI registry
            Hello obj = (Hello) Naming.lookup("//localhost/Hello");

            // Call the remote method
            System.out.println(obj.sayHello());
        } catch (Exception e) {
            System.out.println("HelloClient failed: " + e);
        }
    }
}

Explanation:

 The Hello interface defines the remote method sayHello().

 HelloImpl implements the Hello interface and provides the method's logic.

 The HelloServer class creates and registers the HelloImpl object with the
RMI registry.

 The HelloClient class looks up the remote object from the registry and
invokes the sayHello() method remotely.
24. What is Remote Procedure Call (RPC)? Explain the basic RPC operation.
A Remote Procedure Call (RPC) is a protocol that allows a program to execute a
procedure (or function) on a remote server or system, as if it were a local procedure
call. In an RPC, the procedure call is made across a network, but the calling code
doesn't need to worry about the details of network communication. RPC abstracts
the complexities of remote communication and provides a convenient way for
programs to interact over a network.

The key idea behind RPC is that a client can invoke a function that is executed on a
remote machine, just like it would call a function that resides on the local machine.
The client program sends a request to the server, and the server processes the
request and sends back the results, all while maintaining the illusion of a local
procedure call. RPC systems are used widely in distributed systems for
communication between different services or components.

Basic RPC Operation

The basic flow of an RPC operation can be broken down into several steps:

1. Client Side:

o The client invokes a procedure (function) as if it were a local function


call.

o The client-side code is connected to a stub, which is a proxy for the


actual remote procedure.

o The stub serializes the arguments of the procedure into a format that
can be transmitted over the network (a process called marshalling).

o The stub then sends the marshaled request across the network to the
server, using the transport protocol (such as TCP/IP).

2. Server Side:

o On the server side, there is also a stub called the server stub. The
server stub listens for incoming RPC requests and unmarshals the
data received from the client (this process is called unmarshalling).

o The server stub then calls the actual procedure (also called a service
method) with the unmarshalled arguments.

o The service method on the server performs the computation and


sends the result back to the client through the server stub.
3. Return to Client:

o The server stub marshals the result and sends it back to the client
stub over the network.

o The client stub receives the result, unmarshals it, and returns the
result to the client application.

4. Client Application:

o Finally, the client application receives the result of the remote


procedure call, as though it had been executed locally.

RPC Architecture

The architecture of an RPC system typically consists of the following components:

1. Client: The entity that requests the remote procedure to be executed.

2. Client Stub: The local representation of the remote procedure. The client
interacts with the stub, which handles marshalling, sending the request, and
receiving the result.

3. Server Stub: The server-side counterpart to the client stub. It listens for
incoming RPC requests, unmarshals the request, calls the appropriate
method on the server, and sends the result back.

4. Server: The entity that implements the actual procedure being called
remotely. The server performs the requested operation and returns the
result.

5. Communication Protocol: A network protocol (such as TCP or UDP) used to


transmit data between the client and the server.
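
A minimal, hedged sketch of the client-stub idea follows. The AddService interface, the host and port, and the line-based wire format are all assumptions made for illustration; a real RPC framework would use a proper marshalling format (for example binary or JSON) along with error handling and connection reuse.

import java.io.*;
import java.net.Socket;

// The procedure the client thinks it is calling locally.
interface AddService {
    int add(int a, int b) throws IOException;
}

// Client stub: marshals the arguments, sends them to the server, and unmarshals the reply.
class AddServiceStub implements AddService {
    private final String host;
    private final int port;

    AddServiceStub(String host, int port) {
        this.host = host;
        this.port = port;
    }

    @Override
    public int add(int a, int b) throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            out.println("add " + a + " " + b);        // marshalling: encode the call as a text line
            return Integer.parseInt(in.readLine());   // unmarshalling: decode the server's reply
        }
    }
}

On the server side, a matching server stub would read the line, decode the arguments, call the real add procedure, and write the result back, mirroring steps 2 and 3 above.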

25. Write short notes on CORBA RMI.


CORBA (Common Object Request Broker Architecture) RMI

CORBA stands for Common Object Request Broker Architecture. It is a


middleware framework that enables communication between objects in a
distributed environment. CORBA provides a set of standards and specifications to
allow applications written in different programming languages and running on
different platforms to communicate seamlessly. It facilitates communication in a
way that objects can make requests to and receive responses from other objects that
are located remotely, potentially across different networks.
The primary goal of CORBA is to make distributed computing easier and more
interoperable. CORBA defines how objects can communicate across various
operating systems, network protocols, and programming languages, promoting
cross-platform communication in heterogeneous environments.

CORBA RMI

The term CORBA RMI refers to Remote Method Invocation (RMI) in the context of
CORBA. It is an extension of the concept of RMI, commonly used in Java-based
systems, to the CORBA world. While Java’s RMI allows for communication between
objects in the same Java programming environment, CORBA RMI extends this
capability to a more diverse set of environments.

CORBA RMI allows applications to invoke methods on objects located remotely,


much like local method invocations, but over a network, regardless of where the
objects reside (in different languages, operating systems, or hardware). The CORBA
Object Request Broker (ORB) serves as a mediator between the client and the
server, handling communication and request routing.

How CORBA Works

1. Client-Server Model:

o CORBA follows the traditional client-server model. The client sends a


request to invoke a method on a remote object, while the server hosts
the object that implements the method.

2. Object Request Broker (ORB):

o At the core of CORBA is the ORB, which acts as a communication


middleware, enabling clients to call methods on remote objects
without knowing the specifics of the object’s location or the language
it is implemented in.

3. Interface Definition Language (IDL):

o One of the key elements of CORBA is the Interface Definition


Language (IDL). It is used to define the interfaces that remote objects
will implement. IDL provides a language-neutral way of describing the
structure of data and methods that can be called remotely. IDL files
are compiled into code stubs and skeletons, which are used by the
client and server for remote communication.

4. Stubs and Skeletons:


o The client stub is a piece of code generated from the IDL file that
allows the client to call the remote methods as if they were local. The
stub handles the marshalling (serialization) of arguments, sends the
request over the network, and returns the result.

o The server skeleton is responsible for receiving requests from the


client, unmarshalling the arguments, calling the appropriate method
on the server-side object, and returning the result to the client.

5. Location Transparency:

o CORBA provides location transparency, meaning the client does not


need to know where the server object resides or which platform it is
running on. The ORB abstracts the location of objects, allowing clients
to interact with remote objects seamlessly.

6. Interoperability:

o One of CORBA's key features is its ability to support interoperability


between different programming languages and platforms. A CORBA
system can have components written in C++, Java, Python, and other
languages, running on different operating systems, and still
communicate through the ORB.

Steps in CORBA RMI

1. Interface Definition:

o The developer defines the interface of the remote object using IDL.
This interface describes the operations that can be invoked remotely,
and the data types that can be exchanged.

2. IDL Compilation:

o The IDL file is compiled into language-specific stubs (on the client-
side) and skeletons (on the server-side). These stubs and skeletons
provide the necessary code to invoke remote methods.

3. Client and Server Implementation:

o The client and server implement the logic for interacting with the
remote objects. The server provides the implementation for the
methods defined in the IDL interface, while the client calls those
methods using the stubs.
4. Remote Method Invocation:

o The client calls a remote method through the stub. The stub
communicates with the ORB to send the request to the remote object.
The ORB routes the request to the appropriate server, where the
skeleton unmarshals the request, invokes the method on the server-
side object, and sends the result back to the client.

5. Result Delivery:

o The client receives the result of the remote method call, just as if the
method had been executed locally.
