Distributed Systems QA
OF TECHNOLOGY,
SULTANPUR (UP)
ASSIGNMENT – 1, 2, 3, 4, 5
MASTERS IN TECHNOLOGY
1st YEAR 2nd SEMESTER
DISTRIBUTED COMPUTING (RCSE-201)
Transparency: Users perceive the system as a single unit despite its distribution.
Fault Tolerance and Reliability: Even if parts of the system fail, the system as a
whole should continue functioning using replication, checkpointing, and recovery
techniques.
Performance: Optimized resource sharing and task distribution should ensure high
performance, even under load.
Security: Since the system is accessible over networks, it must ensure data
integrity, confidentiality, and secure access.
Characteristics:
Resource sharing, concurrency, scalability, openness, and the absence of a
global clock.
Advantages:
Resource sharing, improved performance through parallelism, scalability, and
fault tolerance.
Disadvantages:
Greater complexity, security challenges, and dependence on the network.
In the Message-Passing model, processes have private memory and interact solely
by sending and receiving messages. This model is common in loosely coupled
systems, including distributed environments over a network.
Characteristics:
Processes communicate only through explicit send and receive operations;
there is no shared state.
Advantages:
Scales naturally across machines and avoids the synchronization problems of
shared data.
Disadvantages:
Message transfer adds communication overhead, and programs must manage
communication explicitly.
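A minimal Java sketch of the message-passing style is given below. This is an
illustration only: the blocking queue stands in for a network channel, and the
class name is invented for this example.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MessagePassingDemo {
    public static void main(String[] args) {
        // The queue plays the role of the communication channel; the two
        // threads share no state other than this channel.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(10);
        Thread sender = new Thread(() -> {
            try { channel.put("hello"); }      // send a message
            catch (InterruptedException ignored) { }
        });
        Thread receiver = new Thread(() -> {
            try { System.out.println("received: " + channel.take()); }  // receive
            catch (InterruptedException ignored) { }
        });
        sender.start();
        receiver.start();
    }
}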
1. Architectural Models
a) Client-Server Model
Description: Clients request services from dedicated server nodes, which
process the requests and return the results.
Example: Web servers, database servers.
b) Peer-to-Peer Model
Description: All nodes act both as clients and servers. Each peer can request
and provide services.
Example: BitTorrent, blockchain.
c) Three-Tier Architecture
Description: Divides the system into three layers – presentation (UI), logic
(business logic), and data (database).
2. Fundamental Models
a) Interaction Model
Types:
Synchronous systems (known bounds on message delay, execution speed, and
clock drift) and asynchronous systems (no such bounds).
b) Failure Model
Types of Failures:
Omission failures, timing failures, and arbitrary (Byzantine) failures.
c) Security Model
Common Threats:
Eavesdropping, masquerading, message tampering, replay, and denial of service.
In the Shared Address Space (SAS) model, multiple threads or processes operate
within a single address space. This setup is common in systems using shared
memory multiprocessors or in multi-threaded applications on the same machine.
Requirements:
Shared memory access: A physical or logical memory space must be
accessible to all participating threads or processes.
Working Methodology:
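As a rough sketch of this working style (assuming ordinary Java threads on one
machine; the class name is invented), two threads update a counter that lives
in the address space they share, so access to it must be synchronized:

public class SharedCounter {
    private int count = 0;                       // state visible to all threads

    synchronized void increment() { count++; }   // synchronized: one writer at a time
    synchronized int get() { return count; }

    public static void main(String[] args) throws InterruptedException {
        SharedCounter c = new SharedCounter();
        Runnable work = () -> { for (int i = 0; i < 100000; i++) c.increment(); };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(c.get());             // 200000: no updates are lost
    }
}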
2. Message-Passing Architecture
In the Message-Passing Architecture (MPA), each process has its own private
memory, and all communication occurs by explicit message exchange. This is the
most common communication method in distributed systems.
Requirements:
Fairness: Requests should be honored in the order they were made (first-
come, first-served, if applicable).
No global clock.
Fault tolerance.
1. Centralized Algorithms:
Working:
After completion, the process informs the coordinator, who then grants
access to the next requester.
2. Distributed (Permission-Based) Algorithms:
Working:
A requesting process sends a timestamped request to the other processes.
It enters the CS only after receiving permission from the majority (or all).
3. Token-Based Algorithms:
Concept: A unique token circulates among processes. Possession of the token grants
access to the CS.
Working:
When a process wants to enter the CS, it waits for the token.
If it doesn’t have the token, it sends a request to the current token holder.
Once it completes the CS, it passes the token to the next requesting process.
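A simplified single-machine sketch of the token idea follows (names invented;
a real system would forward the token in network messages rather than via an
array index):

public class TokenRing {
    private final boolean[] wants;     // wants[i]: does process i want the CS?
    private int tokenAt = 0;           // index of the current token holder

    public TokenRing(int n) { wants = new boolean[n]; }

    public void requestCS(int i) { wants[i] = true; }

    // One step of circulation: the holder enters the CS if it wants to,
    // then passes the token to its successor in the ring.
    public void step() {
        if (wants[tokenAt]) {
            System.out.println("P" + tokenAt + " enters the CS holding the token");
            wants[tokenAt] = false;
        }
        tokenAt = (tokenAt + 1) % wants.length;
    }
}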
Assumptions:
The communication graph is fully connected (i.e., any process can send a
message to any other).
Algorithm Steps:
1. Requesting CS:
o A process P increments its logical clock and sends a timestamped
REQUEST message to all other processes.
2. Receiving a Request:
o A receiving process replies immediately unless it has its own earlier
pending request (or is in the CS), in which case it defers its REPLY.
3. Entering CS:
o Once P has received REPLY from all other processes, it enters the CS.
4. Exiting CS:
o P sends all deferred REPLY messages to the waiting processes.
Example:
Message Complexity:
o (N - 1) REQUESTs + (N - 1) REPLYs = 2(N - 1) messages per CS entry.
Advantages:
No token required.
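The core of Ricart–Agrawala is the rule for deciding whether to REPLY to an
incoming request immediately or defer it. A hedged Java sketch of that rule
for a single process is shown below; message transmission is elided and all
names are invented:

import java.util.ArrayList;
import java.util.List;

public class RicartAgrawala {
    private int clock = 0;      // Lamport clock
    private int myTs = -1;      // timestamp of our own request; -1 = not requesting
    private final int myId;
    private final List<Integer> deferred = new ArrayList<>();

    public RicartAgrawala(int id) { myId = id; }

    public void requestCS() { myTs = ++clock; /* broadcast REQUEST(myTs, myId) */ }

    // On REQUEST(ts, id): reply at once unless our own pending request has
    // priority (smaller timestamp; process ids break ties).
    public boolean onRequest(int ts, int id) {
        clock = Math.max(clock, ts) + 1;
        boolean mineFirst = myTs != -1 && (myTs < ts || (myTs == ts && myId < id));
        if (mineFirst) { deferred.add(id); return false; }   // defer the REPLY
        return true;                                         // send REPLY now
    }

    public void exitCS() {
        myTs = -1;
        /* send REPLY to every process in deferred */
        deferred.clear();
    }
}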
A global clock would allow all processes in the distributed system to have a
consistent view of time. However, in reality:
Message delays and system latency make it impossible to assume any fixed
or consistent time ordering across distributed nodes.
8. What is Logical clock and Vector clock? How logical clock can be used to
implement semaphores?
Lamport Logical Clock, proposed by Leslie Lamport in 1978, is a simple
mechanism to establish a partial ordering of events in distributed systems.
Approach:
1. Request Queue: Each process sends a request to enter the critical section
with its logical timestamp.
2. Ordering Requests: Requests are ordered based on logical clock values. The
lowest timestamp wins.
3. Accessing CS:
o The process whose request has the lowest timestamp (the head of the
queue) is granted access to the critical section.
4. Releasing Semaphore:
o After leaving the critical section, the process sends release messages,
and other processes update their request queues.
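A minimal sketch of the Lamport clock rules that drive this scheme (an assumed
helper class, not part of the original text): increment on each local event or
send, and take the maximum of the two clocks plus one on receive.

public class LamportClock {
    private int time = 0;

    public synchronized int tick() { return ++time; }   // local event

    public synchronized int send() { return ++time; }   // timestamp an outgoing message

    public synchronized int receive(int msgTime) {      // merge clocks on receipt
        time = Math.max(time, msgTime) + 1;
        return time;
    }
}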
9. What is a Lamport clock? What are its limitations? What is the use of Virtual
time in Lamport's logical clock?
Lamport Clock is a mechanism used in distributed systems to provide a partial
ordering of events. It was introduced by Leslie Lamport in 1978 to handle the issue
of event ordering in systems without a global clock. Unlike physical clocks that
require synchronization across nodes, Lamport’s logical clock provides a way to
order events based on causal relationships.
While Lamport clocks are an effective tool for ordering events, they do have several
limitations:
They provide only a partial ordering of events; concurrent events may receive
arbitrary relative timestamps.
C(a) < C(b) does not imply that event a causally happened before event b, so
causality cannot be inferred from timestamps alone.
They bear no relation to physical (wall-clock) time.
Virtual time refers to the system of timestamps created by Lamport clocks that
order events within a distributed system. It is called "virtual" because it does not
rely on physical time or any actual clock.
10. What do you mean by causal ordering of message? Give an example of it.
Causal Ordering:
In a distributed system, messages are sent between processes, and these messages
might carry information about events that have occurred in one process. Causal
ordering ensures that if an event A causes event B (i.e., A → B), then A must happen
before B, and the message conveying event B must be delivered after the message
conveying event A. In other words, causality should be maintained in the order of
message delivery.
Causal ordering does not require a global clock, but it uses a logical clock (such as
Lamport's clock or vector clocks) to track the order of events within each process
and their causal relationships.
2. Logical Clocks and Causal Ordering: The Lamport clock (or vector clock) is
often used to implement causal ordering. When a process sends a message, it
tags the message with its current logical clock value. When the receiving
process receives the message, it updates its own clock to the maximum of its
local value and the received timestamp, plus one, ensuring that the messages
are processed in an order consistent with causality.
Consider a distributed system with two processes, P1 and P2, and two events, E1
and E2, where E1 occurs in P1 and causes E2 in P2.
Let’s say the events are assigned logical clocks using Lamport’s clock.
1. At P1:
o E1 occurs, and P1 increments its clock, say it becomes clock(P1) = 1.
2. At P2:
o The message carrying E1's timestamp arrives, and P2 sets its clock to
max(local, 1) + 1 = 2, so E2 is assigned clock(P2) = 2. Since
clock(E1) < clock(E2), the ordering respects the happens-before
relation E1 → E2.
Vector Clocks work by associating a vector of timestamps with each event, where
each process in the system maintains a logical clock. When a process sends a
message, it attaches its vector clock to the message. The receiving process uses this
vector clock to update its own clock, ensuring the message ordering is maintained
according to the causal relationship.
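These update rules can be sketched in a few lines of Java (an illustrative
class for an N-process system; names are invented):

public class VectorClock {
    private final int[] v;     // one entry per process
    private final int me;      // index of the owning process

    public VectorClock(int n, int me) { v = new int[n]; this.me = me; }

    public synchronized int[] send() {    // step our entry, attach a copy
        v[me]++;
        return v.clone();
    }

    public synchronized void receive(int[] w) {   // element-wise max, then step
        for (int i = 0; i < v.length; i++) v[i] = Math.max(v[i], w[i]);
        v[me]++;
    }
}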
13. What are Deadlock and Starvation? Explain the fundamental causes and
detection methods.
Deadlock
A deadlock is a situation in which a set of processes becomes permanently
blocked because each process holds a resource while waiting for a resource
held by another process in the set.
For example:
Process A holds resource R1 and is waiting for resource R2.
Process B holds resource R2 and is waiting for resource R1. In this scenario,
both processes are in a deadlock situation because neither can proceed until
the other releases the resources, but neither can release their resources
because they are waiting for the other.
Conditions for Deadlock
Deadlocks typically occur if all four of the following conditions are met:
1. Mutual Exclusion: At least one resource is held in a non-sharable mode, so
only one process can use it at a time.
2. Hold and Wait: A process holding at least one resource is waiting for
additional resources that are being held by other processes.
3. No Preemption: Resources cannot be forcibly taken away from a process;
they are released only voluntarily.
4. Circular Wait: A set of processes exists such that each process is waiting for
a resource held by the next process in the set.
Starvation
Starvation occurs when a process waits indefinitely for resources or CPU time
because other processes are continuously favored over it.
For example, a process may have a low priority and never gets a chance to execute if
higher-priority processes keep consuming CPU time. This may lead to starvation of
the lower-priority process.
Deadlock Causes:
Starvation Causes:
o Strict Priority Scheduling: When high-priority processes continuously
preempt low-priority processes, the low-priority processes may never
get a chance to execute.
Deadlock detection aims to identify if a deadlock has occurred and resolve it. Some
of the primary methods for detecting deadlocks include:
1. Wait-for Graphs:
o A wait-for graph records which processes are waiting for resources
held by which other processes; a cycle in this graph indicates a
deadlock.
3. Timeout Mechanism:
o This involves setting a timer for each process to wait for a resource. If
a process waits for more than a predefined threshold without
receiving the resource, it is assumed that a deadlock has occurred.
1. Implement Aging: Gradually increase the priority of a process that has been
waiting for a long time, ensuring it eventually gets resources.
14. Explain the concept of Processes and Threads with state transition diagram.
A process is an instance of a program in execution. It includes the program's code,
its current activity, a stack, a heap, and all the resources needed to perform the
computation. A process is a more heavyweight entity compared to a thread because
it has its own memory space and system resources.
Components of a Process:
1. Program Code (Text Section): The executable instructions of the program.
2. Data Section: Global and static variables used by the program.
3. Heap: Memory that is dynamically allocated during execution.
4. Stack: Stores the execution history, including function calls and local
variables.
5. Registers: Stores the program's counter and other states of the CPU
during execution.
Thread Components:
1. Program Counter and Registers: Track the thread's current point of
execution and its CPU state.
2. Stack: Stores local variables and function calls for that specific thread.
Both processes and threads can have different states throughout their lifecycle. The
state transition diagram helps us visualize how processes and threads move
through various states.
1. New: The process is being created and has not yet been admitted to the
ready queue.
2. Ready: The process is ready to execute but is waiting for CPU time.
3. Running: The process is currently executing on the CPU.
4. Waiting (Blocked): The process is waiting for an event, such as I/O
completion or resource availability.
5. Terminated: The process has completed its execution and is being removed
from the system.
The transition between these states occurs based on events such as process
creation, scheduling, waiting for resources, or completion.
Thread State Diagram
Threads have similar states, but they can also transition between more specific
states as they interact with the process. The thread states are:
1. New: The thread is created but has not yet started execution.
2. Runnable: The thread is ready to run, but the scheduler is yet to allocate
CPU time to it.
3. Running: The thread is currently executing on the CPU.
4. Blocked/Waiting: The thread is waiting for a lock, I/O, or another thread.
5. Terminated: The thread has finished execution.
In addition to these basic states, threads in some systems also have an "on hold"
state when temporarily suspended by the system for various reasons, like context
switching.
Fault Tolerance: The protocol must tolerate faults such as message delays,
crashes, or network partitions, ensuring that the system continues to
function correctly.
The general system model for agreement protocols involves the following elements:
Agreement protocols can be classified based on the types of problems they are
solving and the assumptions made about the system’s behavior. Some of the
common classifications are as follows:
1. Consensus Protocols
Consensus protocols are used to ensure that a group of processes can agree on a
single value despite failures and network issues. The most famous example of a
consensus problem is the Byzantine Generals Problem, where processes (or
generals) must agree on a plan of action (attack or retreat) despite some generals
potentially being traitors.
Key Problem: Given that some processes may fail or behave maliciously, the
challenge is to reach consensus on a value in a fault-tolerant manner.
Example: Paxos and Raft are widely used consensus protocols in distributed
systems. These protocols are designed to ensure that even if some processes
fail, the remaining processes can still come to a consensus.
2. Binary Consensus Protocols
In binary consensus protocols, the processes need to agree on one of two values
(often 0 or 1). This problem becomes complex when processes are asynchronous
and may fail.
Key Problem: The goal is for every process to either choose a value 0 or 1,
and for all non-failing processes to choose the same value, despite failures
and the possibility of asynchronous communication.
3. Byzantine Agreement Protocols
Key Problem: The goal is to ensure that all non-faulty processes agree on a
decision, even if some processes are malicious or behave arbitrarily.
4. Quorum-Based Protocols
Key Problem: Processes need to collect enough votes from other processes
to ensure a majority consensus on a decision, even if some processes are
faulty or slow.
1. Faulty Processes: Some of the processes may be faulty or malicious and can
send arbitrary, misleading messages to others. These faulty processes are
called Byzantine nodes, and they may attempt to disrupt the agreement or
prevent the system from reaching consensus.
2. Consensus: The goal is for the remaining non-faulty processes (also called
honest nodes) to agree on the same decision, even if some processes are
faulty or malicious. The system must ensure that:
o All non-faulty processes must decide on the same value (agreement
property).
o If all non-faulty processes initially propose the same value, then they
must decide on that value (validity property).
4. Arbitrary Failures: A faulty process may behave arbitrarily (i.e., it may send
incorrect or contradictory messages to different processes), making the
problem more challenging than simpler failure models like crash failures.
Byzantine Fault Tolerance (BFT)
To solve the Byzantine Agreement Problem, systems use Byzantine Fault Tolerant
(BFT) algorithms. These algorithms allow a distributed system to reach consensus
even if some of the processes are Byzantine.
BFT Algorithm Example: One of the most widely known BFT algorithms is
the Practical Byzantine Fault Tolerance (PBFT) protocol. PBFT works by
having a primary process (also known as a leader) propose a value, and then
the system relies on the majority of processes to verify and agree on that
value. PBFT is capable of tolerating up to (n − 1)/3 Byzantine faulty
processes out of a total of n processes, ensuring that a decision can still be
reached even if some nodes are malicious.
Consensus Problem
In the Consensus Problem, there is an assumption that processes may fail in one of
two ways: by crashing (halting entirely) or, in stronger failure models, by
behaving arbitrarily (Byzantine failures). A correct consensus protocol must
satisfy the following properties:
1. Agreement: All non-faulty processes must decide on the same value.
2. Termination: Every non-faulty process must eventually reach a decision.
3. Validity: If all non-faulty processes propose the same value initially, that
value must be chosen by all processes.
4. Fault Tolerance: The system should continue to function correctly even in
the presence of faults, typically crash failures. Byzantine failures are handled
in more advanced consensus protocols.
1. Centralized Deadlock Detection:
How it works:
Every process in the system sends its resource allocation and request
information to the centralized coordinator.
The coordinator periodically checks the global resource allocation graph for
cycles. If a cycle is detected, it indicates the presence of a deadlock.
The coordinator may then take corrective action, such as aborting a process
to break the deadlock.
2. Distributed (Hierarchical) Deadlock Detection:
Each process maintains a local wait-for graph (WFG), which shows which
processes are waiting for resources held by other processes.
How it works:
The local coordinator constructs a local wait-for graph and checks for
cycles to detect deadlocks within its cluster.
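The cycle check used in both schemes can be sketched as a depth-first search
over the wait-for graph (illustrative Java; all names are invented):

import java.util.*;

public class WaitForGraph {
    // An edge p -> q means "process p waits for process q".
    private final Map<Integer, List<Integer>> edges = new HashMap<>();

    public void addWait(int p, int q) {
        edges.computeIfAbsent(p, k -> new ArrayList<>()).add(q);
    }

    public boolean hasDeadlock() {
        Set<Integer> done = new HashSet<>();
        for (Integer p : edges.keySet())
            if (dfs(p, new HashSet<>(), done)) return true;
        return false;
    }

    private boolean dfs(int p, Set<Integer> onPath, Set<Integer> done) {
        if (onPath.contains(p)) return true;   // back edge: a cycle, hence deadlock
        if (done.contains(p)) return false;
        onPath.add(p);
        for (int q : edges.getOrDefault(p, Collections.emptyList()))
            if (dfs(q, onPath, done)) return true;
        onPath.remove(p);
        done.add(p);
        return false;
    }
}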
18. What is transaction? Explain ACID properties and compare flat vs nested
transactions.
A transaction is a sequence of operations performed as a single logical unit of work
in a distributed system or database. These operations are intended to ensure that
the system remains in a consistent state, even if some operations fail during
execution. The primary goal of a transaction is to ensure data integrity, consistency,
and reliability. In distributed systems, transactions often span multiple machines or
databases.
The ACID acronym stands for Atomicity, Consistency, Isolation, and Durability.
Each of these properties ensures that transactions behave in a way that preserves
the integrity of the system.
1. Atomicity: A transaction is an all-or-nothing unit; either every operation
in it completes, or none of its effects remain.
2. Consistency: A transaction takes the system from one consistent state to
another, preserving all integrity constraints.
3. Isolation: Concurrent transactions do not interfere with each other; each
behaves as if it were executing alone.
4. Durability: Once a transaction commits, its effects survive subsequent
failures.
The Two-Phase Commit protocol involves two main phases: the prepare phase
and the commit phase. These phases apply not only to flat transactions but are also
adapted to nested transactions to maintain their atomicity and consistency.
If all participants (including those of the nested transactions) vote yes, the
transaction proceeds to the second phase. If any participant votes no, the
entire transaction is aborted.
Once the coordinator receives votes from all participants, it makes the final
decision.
In the case of nested transactions, the Two-Phase Commit protocol applies to both
the parent transaction and the sub-transactions. The sub-transactions, like
independent entities, must respond to the prepare request from the coordinator
and vote on whether they are ready to commit or need to abort. The coordinator
considers both the parent and sub-transactions together when making the final
decision.
Nested Two-Phase Commit Protocol Steps:
1. Initiation:
o The coordinator starts the parent transaction and sends the prepare
request to all participants (including sub-transactions).
2. Sub-Transaction Preparation:
o Each sub-transaction checks its local state and sends a vote yes or
vote no to the coordinator, indicating whether it can commit.
o If all sub-transactions and the parent vote yes, the parent sends the
prepare request to the next level.
3. Commit Decision:
o If all sub-transactions and the parent have voted yes, the coordinator
sends a commit message to all involved, indicating that all operations
(in both the parent and sub-transactions) should be committed.
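A compact sketch of the coordinator's two phases in Java follows. The
Participant interface and all names here are assumptions made for
illustration; a real protocol would also log its decision and handle
participant timeouts.

import java.util.List;

interface Participant {
    boolean prepare();   // phase 1: vote yes (true) or no (false)
    void commit();       // phase 2: make the changes permanent
    void abort();        // phase 2: undo any tentative changes
}

class TwoPhaseCommitCoordinator {
    boolean run(List<Participant> participants) {
        // Phase 1: collect votes; a single "no" aborts the whole transaction.
        for (Participant p : participants) {
            if (!p.prepare()) {
                for (Participant q : participants) q.abort();
                return false;
            }
        }
        // Phase 2: every participant voted yes, so commit everywhere.
        for (Participant p : participants) p.commit();
        return true;
    }
}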
4. Global Load Information: For efficient load distribution, each node in the
system must have access to the global system load information, which
includes the current workload on each node, resource availability, and the
state of each node (idle, busy, etc.).
Several strategies exist for distributing load in a distributed system. Each strategy
has different benefits and is suited to different system characteristics. Here are the
major strategies:
1. Crash Failures:
o A node stops executing and does nothing further; it simply halts.
2. Omission Failures:
o A node fails to send or receive messages it should have handled.
3. Timing Failures:
o A node responds correctly but outside the required time interval.
4. Response Failures:
o A node responds incorrectly, returning a wrong value or making a
wrong state transition.
5. Communication Failures:
o Messages are lost, corrupted, or delayed by the network connecting
the nodes.
6. Byzantine Failures:
o A Byzantine failure refers to the most severe type of failure in
distributed systems, where nodes can behave arbitrarily or
maliciously. In this type of failure, a node may not only crash but can
send incorrect or misleading information to other nodes.
To ensure that distributed systems can function properly despite failures, various
fault tolerance techniques are implemented. Some of these techniques are discussed
below:
1. Replication:
Replication maintains multiple copies of data or services on different nodes,
so that if one copy fails, another can take over.
2. Checkpointing:
Checkpointing periodically saves the state of a process so that, after a
failure, execution can resume from the last saved state instead of restarting
from the beginning.
This is particularly useful in systems where tasks are long-running and must
be preserved even if a failure occurs during execution.
3. Redundancy:
Redundancy provides extra hardware or software components that can take over
when a primary component fails.
4. Consensus Protocols:
Consensus protocols help distributed systems agree on a single value or
decision despite failures. They are especially important in fault-tolerant
systems to ensure consistency and reliability when nodes crash or behave
unpredictably.
5. Timeouts and Retries:
Timeouts and retries are mechanisms where a node waits for a response
within a specified time and retries the request if the response is not received.
This ensures that transient network issues or delays do not lead to failure.
7. Failure Detectors:
Failure detectors monitor the health of nodes or processes and notify the
system when a failure occurs. These detectors help the system decide which
nodes to remove from the network and which nodes to bring back online
after recovery.
Failure detectors can operate with varying levels of accuracy, from simple
heartbeats to more complex algorithms that detect crashes or arbitrary
failures.
8. Backup Systems:
Backup systems keep standby nodes or secondary copies of data that can take
over when a primary component fails, allowing service to continue with
minimal interruption.
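As an illustration of the failure-detector idea above, here is a minimal
heartbeat-style sketch in Java (invented names; real detectors also adapt
their timeouts to observed network delays):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatDetector {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public HeartbeatDetector(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }

    public void onHeartbeat(String node) {      // called whenever a heartbeat arrives
        lastSeen.put(node, System.currentTimeMillis());
    }

    public boolean isSuspected(String node) {   // no heartbeat within the window?
        Long t = lastSeen.get(node);
        return t == null || System.currentTimeMillis() - t > timeoutMillis;
    }
}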
In distributed systems, routing is not just about finding paths between two nodes,
but also ensuring that communication occurs efficiently, even if parts of the network
fail or become congested. Routing algorithms are essential for ensuring that data
can traverse the system in an optimal manner, especially in large-scale systems with
a high number of nodes.
1. Unicast Routing: Delivers a message from one source node to one specific
destination node.
2. Multicast Routing: Delivers a message from one source to a defined group
of destination nodes.
3. Broadcast Routing: Delivers a message from one source to all nodes in the
network.
4. Anycast Routing: Delivers a message to the nearest (or best) member of a
group of candidate destinations.
Routing Algorithms:
Routing algorithms can be categorized into two broad types: static and dynamic.
1. Static Routing:
o In static routing, the routes are predetermined and fixed. These routes
do not change during runtime, making the algorithm simple but
inflexible in handling network failures or changes.
2. Dynamic Routing:
1. Correctness:
The routing algorithm must correctly compute the best or optimal path
between nodes. This means the algorithm must ensure that the chosen path
leads to the destination without getting stuck or going in an incorrect
direction.
2. Termination:
The algorithm must complete its route computation in a finite amount of
time rather than looping indefinitely.
3. Optimality:
The routing algorithm should find the optimal or near-optimal path for data
transmission. In many cases, this means finding the path with the lowest cost,
such as the shortest route, minimal latency, or maximum available
bandwidth.
5. Convergence:
After a change in the network, all nodes should eventually settle on a stable,
consistent set of routes. The time it takes for the algorithm to reach this
stable state is also crucial. A routing algorithm with fast convergence is
desirable as it can quickly adapt to network changes such as node failures or
additions.
6. Scalability:
The algorithm should continue to perform well as the number of nodes and
links in the network grows.
7. Stability:
Stability ensures that the routing algorithm does not introduce oscillations or
flapping routes. This can happen when there are constant changes in the
network, leading to frequent updates in routing tables. An unstable algorithm
may lead to excessive overhead and unreliable routing decisions.
8. Fairness:
No node or traffic flow should be systematically starved of network capacity
by the routing decisions.
9. Adaptability:
The algorithm should adjust its routes in response to changes in traffic load
and network topology.
11. Consistency:
Consistency ensures that all nodes in the network have a consistent view of
the network topology. This is crucial for maintaining optimal routing
decisions and avoiding routing loops or conflicts.
23. What are distributed objects? Explain RMI with example and role of
proxy/skeleton.
Distributed Objects
The main idea behind distributed objects is to allow for transparent communication
between objects that are physically separated but logically part of the same system.
In traditional object-oriented programming (OOP), an object is a bundle of data and
methods that operate on that data. However, in distributed systems, the methods of
an object might be executed on different machines or devices connected over a
network, and these remote interactions should be made as seamless as possible for
developers and users.
In a typical RMI setup, a client invokes methods on a remote object as though the
object were local, and the underlying RMI infrastructure takes care of serializing the
method arguments, sending them across the network, and returning the results.
1. RMI Registry: The RMI registry acts as a directory for all remote objects. The
server registers its remote objects with the RMI registry, and the client looks
up the remote object in the registry before invoking methods.
2. Remote Object: The server-side object that implements a remote interface;
its methods can be invoked from another Java virtual machine.
3. Client: The client accesses the remote object by looking it up in the RMI
registry, where it then interacts with the object by invoking methods
remotely.
4. RMI Server: The server is responsible for creating the remote object and
registering it with the RMI registry. The server listens for client requests and
responds with results from the remote object.
1. Proxy:
o The proxy (also called the stub) acts as a placeholder for the remote
object on the client side. When a client wants to invoke a method on a
remote object, it interacts with the proxy rather than the actual
remote object. The proxy then forwards the method call to the real
remote object.
2. Skeleton:
o The skeleton is the server-side counterpart of the proxy. It receives
the incoming call, unmarshals the arguments, invokes the method on
the actual remote object, and sends the result back to the proxy.
Example of RMI
Let's take a look at a simple example of RMI in Java to demonstrate the core
concepts:
1. Remote Interface (declares the method that can be called remotely):
import java.rmi.*;
public interface Hello extends Remote {
    String sayHello() throws RemoteException;
}
2. Remote Object Implementation (provides the method's logic):
import java.rmi.*;
import java.rmi.server.UnicastRemoteObject;
public class HelloImpl extends UnicastRemoteObject implements Hello {
    public HelloImpl() throws RemoteException {}
    public String sayHello() { return "Hello from the server!"; }
}
3. Server Code (creates the remote object and registers it with the RMI registry):
import java.rmi.*;
import java.rmi.registry.*;
public class HelloServer {
    public static void main(String[] args) {
        try {
            LocateRegistry.createRegistry(1099);   // start the registry
            Naming.rebind("Hello", new HelloImpl());
        } catch (Exception e) { e.printStackTrace(); }
    }
}
4. Client Code (looks up the remote object in the RMI registry and calls the
remote method):
import java.rmi.*;
public class HelloClient {
    public static void main(String[] args) {
        try {
            Hello obj = (Hello) Naming.lookup("rmi://localhost/Hello");
            System.out.println(obj.sayHello());
        } catch (Exception e) { e.printStackTrace(); }
    }
}
Explanation:
The Hello interface declares the remote method sayHello().
HelloImpl implements the Hello interface and provides the method's logic.
The HelloServer class creates and registers the HelloImpl object with the
RMI registry.
The HelloClient class looks up the remote object from the registry and
invokes the sayHello() method remotely.
24. What is Remote Procedure Call (RPC)? Explain the basic RPC operation.
A Remote Procedure Call (RPC) is a protocol that allows a program to execute a
procedure (or function) on a remote server or system, as if it were a local procedure
call. In an RPC, the procedure call is made across a network, but the calling code
doesn't need to worry about the details of network communication. RPC abstracts
the complexities of remote communication and provides a convenient way for
programs to interact over a network.
The key idea behind RPC is that a client can invoke a function that is executed on a
remote machine, just like it would call a function that resides on the local machine.
The client program sends a request to the server, and the server processes the
request and sends back the results, all while maintaining the illusion of a local
procedure call. RPC systems are used widely in distributed systems for
communication between different services or components.
The basic flow of an RPC operation can be broken down into several steps:
1. Client Side:
o The client application calls the client stub, a local procedure that
stands in for the remote one.
o The stub serializes the arguments of the procedure into a format that
can be transmitted over the network (a process called marshalling).
o The stub then sends the marshaled request across the network to the
server, using the transport protocol (such as TCP/IP).
2. Server Side:
o On the server side, there is also a stub called the server stub. The
server stub listens for incoming RPC requests and unmarshals the
data received from the client (this process is called unmarshalling).
o The server stub then calls the actual procedure (also called a service
method) with the unmarshalled arguments.
o The server stub marshals the result and sends it back to the client
stub over the network.
3. Returning the Result:
o The client stub receives the result, unmarshals it, and returns the
result to the client application.
4. Client Application:
o The client application continues execution with the returned value,
unaware that the procedure was executed on a remote machine.
RPC Architecture
1. Client: The application program that initiates the remote procedure call.
2. Client Stub: The local representation of the remote procedure. The client
interacts with the stub, which handles marshalling, sending the request, and
receiving the result.
3. Server Stub: The server-side counterpart to the client stub. It listens for
incoming RPC requests, unmarshals the request, calls the appropriate
method on the server, and sends the result back.
4. Server: The entity that implements the actual procedure being called
remotely. The server performs the requested operation and returns the
result.
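The stub idea can be sketched in plain Java (purely illustrative: here the
client stub calls the server stub directly, where a real RPC framework would
send the marshalled request over the network):

interface Transport { String handle(String request); }

// Server stub: unmarshals "a,b", calls the real procedure, marshals the result.
class AddServerStub implements Transport {
    public String handle(String request) {
        String[] parts = request.split(",");
        int result = Integer.parseInt(parts[0]) + Integer.parseInt(parts[1]);
        return Integer.toString(result);
    }
}

// Client stub: makes the remote call look like a local method.
class AddClientStub {
    private final Transport transport;
    AddClientStub(Transport t) { transport = t; }
    int add(int a, int b) {
        String reply = transport.handle(a + "," + b);   // marshal and "send"
        return Integer.parseInt(reply);                 // unmarshal the reply
    }
}

For example, new AddClientStub(new AddServerStub()).add(2, 3) returns 5, while
the calling code never sees any of the marshalling.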
CORBA RMI
The term CORBA RMI refers to Remote Method Invocation (RMI) in the context of
CORBA. It is an extension of the concept of RMI, commonly used in Java-based
systems, to the CORBA world. While Java’s RMI allows for communication between
objects in the same Java programming environment, CORBA RMI extends this
capability to a more diverse set of environments.
1. Client-Server Model: Clients invoke operations on remote CORBA objects
through the Object Request Broker (ORB), which mediates all communication.
5. Location Transparency: The client does not need to know on which machine
a remote object resides; the ORB locates it.
6. Interoperability: Objects written in different languages and running on
different platforms can interact, since CORBA defines a common,
language-neutral protocol.
1. Interface Definition:
o The developer defines the interface of the remote object using IDL.
This interface describes the operations that can be invoked remotely,
and the data types that can be exchanged.
2. IDL Compilation:
o The IDL file is compiled into language-specific stubs (on the client-
side) and skeletons (on the server-side). These stubs and skeletons
provide the necessary code to invoke remote methods.
3. Implementation:
o The client and server implement the logic for interacting with the
remote objects. The server provides the implementation for the
methods defined in the IDL interface, while the client calls those
methods using the stubs.
4. Remote Method Invocation:
o The client calls a remote method through the stub. The stub
communicates with the ORB to send the request to the remote object.
The ORB routes the request to the appropriate server, where the
skeleton unmarshals the request, invokes the method on the server-
side object, and sends the result back to the client.
5. Result Delivery:
o The client receives the result of the remote method call, just as if the
method had been executed locally.