DS Ass
Q1. How distributed computing systems are going to be evolved in future and
explain it briefly mentioning/citing with proper references.
ANS:
The present computing paradigm does not scale well because it depends on shared memory, whereas most physical systems communicate by message passing, so making progress requires convincing developers to give up one model in favor of the other. The barrier to progress can be lowered by making simultaneous multiprocessing work as a programming paradigm on top of message passing.
Because of the rapid advancement of computer hardware, software, the web, sensor networks, mobile device communications, and multimedia technologies, distributed computing systems have evolved radically to improve and grow various applications with better quality of service and lower cost, particularly those involving human factors [2]. Besides reliability, performance and availability, many other attributes, such as security, privacy, trustworthiness, situation awareness, flexibility and rapid development of various applications, have also become important. Distributed computing systems will continue to serve us and evolve in the long run.
With the rapid development of emerging distributed computing technologies such as Web services, Grid computing, and Cloud computing, computer networks have become an integral part of next-generation distributed computing systems. Therefore, integrating networking and distributed computing systems is an important research problem for building the next-generation high-performance distributed information infrastructure.
In the near future, distributed application frameworks will support mobile code, multimedia data
streams, user and device mobility, and spontaneous networking [3].
Looking further into the future, essential techniques of distributed systems will be incorporated
into an emerging new area, envisioning billions of communicating smart devices forming a
world-wide distributed computing system several orders of magnitude larger than today's
Internet.
a. XML:
Parsing and processing XML documents can be computationally expensive and time-consuming,
especially for large documents.
XML is verbose and can lead to larger file sizes, which can affect network transfer times and
storage requirements.
The complexity of XML can make it difficult to read and understand, which can increase the
likelihood of errors and decrease developer productivity.
b. SOA:
SOA is a complex and highly configurable architecture that can be difficult to design and
implement correctly, leading to higher development and maintenance costs.
The decoupling of services in SOA can lead to increased network traffic, which can affect
system performance.
The service-oriented nature of SOA can lead to a proliferation of services, which can become
difficult to manage and govern.
c. SOAP:
SOAP messages can be verbose and can lead to larger file sizes, which can affect network
transfer times and storage requirements.
SOAP can be slower than lighter-weight alternatives such as REST, due to its use of XML envelopes and additional processing requirements.
SOAP is tightly coupled and can be difficult to modify once deployed, which can affect system
agility and flexibility.
d. RESTful:
RESTful services can be less secure than SOAP services, as they rely on transport-level security (HTTPS) and do not provide built-in message-level security standards such as WS-Security.
RESTful services can be less reliable than SOAP services, as they rely on the statelessness of
HTTP, which can be affected by network errors and server failures.
The lack of a standardized approach to RESTful service design and documentation can lead to
inconsistencies and difficulties in service discovery and integration.
It is important to note that these technologies have their own strengths and are widely used in
various applications despite their limitations. Therefore, it is important to carefully consider the
specific requirements of an application before selecting a technology to use.
Two-phase locking (2PL) and three-phase locking (3PL) are concurrency control protocols used
in database management systems to ensure transaction atomicity, consistency, isolation, and
durability (ACID properties). They are used to prevent conflicts between transactions that may
try to access the same data simultaneously.
Two-phase locking (2PL): In two-phase locking, a transaction acquires all the locks it
needs before it performs any modifications to the data. The protocol consists of two phases: the
growing phase and the shrinking phase.
Growing phase: During the growing phase, a transaction acquires locks on the data items it needs to access. Once a lock is acquired, it cannot be released until the transaction enters its shrinking phase.
Shrinking phase: During the shrinking phase, a transaction releases the locks it has acquired
after it has completed its modifications to the data. Once a lock is released, it cannot be
reacquired.
The two-phase locking protocol guarantees serializability, meaning that the transactions are
executed in a way that produces the same result as if they were executed serially, one after the
other.
Three-phase locking (3PL): In three-phase locking, a validation phase is added between the growing and shrinking phases.
Growing phase: During the growing phase, a transaction acquires locks on the data items it needs to access.
Validation phase: During the validation phase, a transaction checks if it can acquire all the locks
it needs to complete its modifications. If it cannot, it releases all the locks it has acquired and
starts again from the beginning of the growing phase.
Shrinking phase: During the shrinking phase, a transaction releases the locks it has acquired
after it has completed its modifications to the data.
The three-phase locking protocol ensures strict two-phase locking, meaning that a transaction
does not release any locks until the end of the transaction. It also prevents deadlocks by releasing
all locks if a transaction cannot acquire all the locks it needs during the validation phase.
In summary, 2PL and 3PL are concurrency control protocols that ensure the atomicity,
consistency, isolation, and durability of transactions in a database management system. Two-
phase locking acquires all the locks before modifications and releases them all at the end, while
three-phase locking adds a validation phase to prevent deadlock.
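To make the two-phase locking discipline above concrete, here is a minimal single-process Python sketch (not a real lock manager; the transaction class and the data items "x" and "y" are hypothetical). It simply enforces the rule that all acquires happen in the growing phase and all releases happen in the shrinking phase:

```python
import threading

class TwoPhaseLockingTransaction:
    """Minimal sketch of the 2PL discipline: all acquires happen before any release."""

    def __init__(self, locks):
        self.locks = locks      # dict: data item name -> threading.Lock
        self.held = []          # locks acquired so far, in acquisition order
        self.growing = True     # True while still in the growing phase

    def acquire(self, item):
        if not self.growing:
            raise RuntimeError("2PL violation: cannot acquire a lock after releasing one")
        self.locks[item].acquire()
        self.held.append(item)

    def release_all(self):
        # Entering the shrinking phase: no further acquires are allowed.
        self.growing = False
        for item in reversed(self.held):
            self.locks[item].release()
        self.held.clear()

# Usage sketch with two hypothetical data items.
locks = {"x": threading.Lock(), "y": threading.Lock()}
txn = TwoPhaseLockingTransaction(locks)
txn.acquire("x")
txn.acquire("y")
# ... read and write x and y here ...
txn.release_all()
```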
Transaction management is a critical aspect of distributed computing as it ensures data
consistency and integrity when multiple systems interact with each other. In distributed systems,
transactions may span multiple nodes, and failures in any of the participating nodes can cause
data inconsistencies. Transaction management provides a mechanism for coordinating and
controlling transactions, ensuring that they are atomic, consistent, isolated, and durable (ACID).
ACID properties ensure that transactions are executed reliably and that data integrity is
maintained even in the event of failures or conflicts. The following are the essential properties of
a transaction:
Atomicity: A transaction should be treated as a single, indivisible unit of work. Either all of the
changes in the transaction are committed, or none of them are committed.
Consistency: A transaction should ensure that the database is in a consistent state both before
and after the transaction is executed.
Isolation: Transactions should be executed independently of each other. Changes made by one
transaction should not affect other transactions.
Durability: Once a transaction is committed, its changes should persist even in the event of a
system failure.
The two-phase commit (2PC) protocol is commonly used to coordinate such distributed transactions. The transaction coordinator (TC) sends a prepare message to all the nodes participating in the transaction, asking them to prepare to commit the transaction.
Each node checks whether it can commit the transaction. If it can, it sends an agreement message to the TC. If it cannot, it sends an abort message.
The TC collects all the agreement messages from the nodes. If it receives an abort message from
any node, it sends a rollback message to all the nodes, instructing them to abort the transaction.
If the TC receives agreement messages from all the nodes, it sends a commit message to all the
nodes, instructing them to commit the transaction.
This protocol ensures that all the nodes participating in the transaction agree to commit the
transaction before any changes are made, ensuring consistency and durability across the
distributed system.
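The following Python sketch illustrates the commit decision just described. It is an in-process simplification with hypothetical participant objects standing in for real network nodes, and it omits timeouts, logging, and recovery:

```python
class Participant:
    """Hypothetical participant; a real node would answer these calls over the network."""
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit

    def prepare(self):
        # Vote "agree" only if the local work can be made durable.
        return self.can_commit

    def commit(self):
        print(f"{self.name}: committed")

    def rollback(self):
        print(f"{self.name}: rolled back")


def two_phase_commit(participants):
    # Phase 1: the coordinator collects votes from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every participant agreed; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "aborted"


nodes = [Participant("node-A", True), Participant("node-B", True), Participant("node-C", False)]
print(two_phase_commit(nodes))   # -> "aborted", because node-C voted no
```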
In summary, transaction management is essential for ensuring data consistency and integrity in
distributed systems. The two-phase commit protocol provides a mechanism for coordinating and
controlling transactions, ensuring that they are executed reliably and that data integrity is
maintained even in the event of failures or conflicts.
There are several simulation and emulation frameworks available for distributed computing
platforms that allow for the testing, comparison, and optimization of various distributed
computing capabilities. Here are a few examples of such frameworks:
CloudSim: CloudSim is a simulation framework for modeling and simulating cloud computing
environments. It provides a platform for simulating various cloud computing scenarios, including
infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), and software-as-a-service (SaaS)
models. CloudSim can help to analyze and optimize the performance, energy consumption, and
cost-effectiveness of cloud computing environments.
GridSim: GridSim is a simulation framework for modeling and simulating grid computing
environments. It provides a platform for simulating various grid computing scenarios, including
job scheduling, data management, and resource allocation. GridSim can help to analyze and
optimize the performance and efficiency of grid computing environments.
Distem: Distem is an emulation framework for distributed systems and applications. It allows for
the creation of a virtual testbed for running and testing distributed computing applications.
Distem can simulate different network topologies, communication protocols, and application
behaviors to analyze and optimize system performance.
Shadow: Shadow is an emulation framework for network systems and applications. It provides a
platform for running and testing distributed systems and applications in a realistic environment.
Shadow can simulate various network scenarios, including different network topologies, link
delays, packet losses, and congestion, to analyze and optimize the performance of distributed
systems and applications.
These simulation and emulation frameworks can help developers and researchers to analyze,
compare, and optimize the performance of various distributed computing platforms and
applications. By simulating and emulating various scenarios, these frameworks can help to
identify potential issues and bottlenecks, and optimize the performance and efficiency of
distributed computing systems and applications.
Load Balancing: Load balancing is the process of distributing workloads across multiple
machines in a way that ensures no machine is overloaded. This approach can help to improve
performance by making sure that resources are being used efficiently.
Caching: Caching involves storing frequently accessed data in memory, which can help to reduce the number of times the data needs to be retrieved from disk. This approach can help to improve performance by reducing latency (see the caching sketch after this list).
Data Partitioning: Data partitioning involves dividing a dataset into smaller, more manageable
pieces that can be processed in parallel. This approach can help to improve performance by
reducing the amount of data that needs to be processed at any one time.
Replication: Replication involves duplicating data across multiple machines. This approach can
help to improve performance by reducing the time it takes to access data, as data can be retrieved
from the nearest machine.
Message Queuing: Message queuing involves sending messages between machines using a
queuing system. This approach can help to improve performance by reducing the time it takes to
process messages, as messages can be processed asynchronously.
These approaches can be combined and customized based on the specific requirements of a
distributed system to achieve high performance.
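As a small illustration of the caching approach listed above (a single-process sketch; fetch_user and its 0.1-second delay are hypothetical stand-ins for a slow disk or network read), Python's standard functools.lru_cache keeps recently used results in memory so repeated requests skip the slow lookup:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def fetch_user(user_id):
    # Stand-in for a slow disk or network read.
    time.sleep(0.1)
    return {"id": user_id, "name": f"user-{user_id}"}

start = time.time()
fetch_user(42)                       # miss: pays the 0.1 s cost
first = time.time() - start

start = time.time()
fetch_user(42)                       # hit: served from memory
second = time.time() - start
print(f"first call {first:.3f}s, cached call {second:.6f}s")
```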
Q7. Explain major distributed platform areas and their algorithm strengths and
weaknesses.
ANS:
Distributed platforms are designed to support the execution of complex distributed applications
across multiple nodes in a network. They typically provide a set of services and APIs to enable
the development of distributed applications that can leverage the underlying infrastructure's
processing and storage capabilities. Here are some major areas of distributed platforms and their
algorithm strengths and weaknesses:
Distributed Storage:
Distributed storage systems are designed to store and manage large amounts of data across
multiple nodes in a network. These systems typically use algorithms such as distributed hash
tables (DHTs) and gossip protocols to manage data distribution, replication, and consistency.
Strengths:
High availability and fault tolerance: Data is distributed across multiple nodes, making it highly
available and resilient to node failures.
Scalability: The storage capacity can be easily increased by adding more nodes to the network.
Low latency: Data can be accessed quickly from the node closest to the user.
Weaknesses:
Complexity: Managing a distributed storage system can be complex due to the need for data
distribution, replication, and consistency.
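To illustrate the DHT-style data placement mentioned above, here is a minimal consistent-hashing sketch in Python. The storage node names are hypothetical, and a real system would add virtual nodes and replication:

```python
import bisect
import hashlib

def _hash(key):
    # Map any string key onto a large integer position on the ring.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hashing ring: each key lives on the first node clockwise from its hash."""
    def __init__(self, nodes):
        self.ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key):
        h = _hash(key)
        positions = [p for p, _ in self.ring]
        idx = bisect.bisect_right(positions, h) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["storage-1", "storage-2", "storage-3"])   # hypothetical node names
for k in ["user:42", "order:7", "photo:99"]:
    print(k, "->", ring.node_for(k))
```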
Distributed Computing:
Distributed computing systems are designed to distribute computational tasks across multiple nodes in a network. These systems typically use programming models and frameworks such as MapReduce, Apache Spark, and Hadoop to distribute tasks, process data, and aggregate results.
Strengths:
Fault tolerance: Distributed computing systems can continue to function even if some nodes fail.
Efficiency: Distributed computing systems can process large amounts of data in a relatively
short amount of time.
Weaknesses:
Overhead: The overhead of data transfer and coordination between nodes can affect
performance.
Complexity: Developing and managing distributed computing systems can be complex due to
the need for distributed data processing, task coordination, and error handling.
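To make the MapReduce model mentioned above concrete, here is a minimal single-machine word-count sketch in Python; a real framework such as Hadoop or Spark would run the map and reduce functions on many nodes in parallel and handle the shuffle, partitioning, and fault tolerance itself:

```python
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs, as a MapReduce mapper would.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Group intermediate values by key (normally done by the framework).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(intermediate)))   # e.g. {'the': 3, 'quick': 2, ...}
```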
Distributed Messaging:
Distributed messaging systems deliver messages and events between services across a network, typically through message queues or publish/subscribe brokers (a minimal queue-based sketch follows the weaknesses below).
Strengths:
Scalability: Messaging systems can handle large volumes of messages and events.
Asynchronous: Messaging systems can process messages and events asynchronously, allowing
for more efficient use of computational resources.
Weaknesses:
Reliability: Messaging systems can be affected by network latency and failures, which can
affect reliability.
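A minimal in-process sketch of the asynchronous, queue-based decoupling described above, using Python's standard queue and threading modules; a real deployment would place a broker such as Kafka or RabbitMQ between machines, and the event payloads here are hypothetical:

```python
import queue
import threading
import time

messages = queue.Queue()

def producer():
    for i in range(5):
        messages.put(f"event-{i}")     # hypothetical event payloads
    messages.put(None)                 # sentinel: no more messages

def consumer():
    while True:
        msg = messages.get()
        if msg is None:
            break
        time.sleep(0.05)               # stand-in for slow downstream work
        print("processed", msg)

threading.Thread(target=producer).start()
worker = threading.Thread(target=consumer)
worker.start()
worker.join()
```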
In summary, distributed platforms are designed to provide a set of services and APIs to enable
the development of distributed applications that can leverage the underlying infrastructure's
processing and storage capabilities. Each of the major areas of distributed platforms has its own
algorithm strengths and weaknesses, which must be carefully considered when designing and
implementing distributed applications.
Clock synchronization can be achieved in two ways: external and internal clock synchronization. The algorithms used for synchronization fall into two broad classes, centralized and distributed:
1. Centralized: a single time server is used as a reference. The time server propagates its time to the nodes, and all the nodes adjust their clocks accordingly. Because the approach depends on a single time server, if that node fails the whole system loses synchronization. Examples of centralized algorithms are the Berkeley Algorithm, the Passive Time Server, the Active Time Server, etc. (a Berkeley-style averaging sketch follows this list).
2. Distributed: no centralized time server is present. Instead, the nodes adjust their clocks by using their local time and the average of the time differences measured against the other nodes. Distributed algorithms overcome the issues of centralized algorithms, such as limited scalability and the single point of failure. Examples of distributed algorithms are the Global Averaging Algorithm, the Localized Averaging Algorithm, NTP (Network Time Protocol), etc.
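The following Python sketch illustrates the Berkeley-style averaging mentioned above (simplified: it ignores round-trip-time estimation and outlier rejection, and the clock readings are hypothetical). The master averages all clock values and returns the adjustment each node should apply:

```python
def berkeley_sync(master_time, follower_times):
    """Sketch of Berkeley-style averaging: the master averages all clock offsets
    and tells every node (including itself) how much to adjust its clock."""
    clocks = [master_time] + follower_times
    offsets = [t - master_time for t in clocks]
    average_offset = sum(offsets) / len(offsets)
    # Each node receives the adjustment it should apply to its local clock.
    return [master_time + average_offset - t for t in clocks]

# Hypothetical clock readings, in seconds since some common epoch.
adjustments = berkeley_sync(1000.0, [1002.0, 998.5, 1001.5])
print(adjustments)   # master and followers each shift toward the common average
```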
Leadership algorithms for distributed platforms
Many distributed election algorithms have been proposed to resolve the problem of leader election. Among the existing algorithms, the most prominent are the following:
BULLY ALGORITHM
The bully algorithm is one of the most prominent election algorithms; it was presented by Garcia-Molina in 1982.
It requires that every process knows the identity of every other process in the system, so it consumes a large amount of space.
It involves a large number of messages during communication, which creates heavy traffic; its message-passing complexity is O(n²).
MODIFIED BULLY ALGORITHM
The modified bully algorithm overcomes the disadvantages of the original bully algorithm. Its main concept is that the algorithm declares the new coordinator before the actual or current coordinator has crashed.
Disadvantages
It performs better than the bully algorithm but still has O(n²) complexity in the worst case.
RING ALGORITHM
This election algorithm is based on the use of a ring. We assume that the processes are physically or logically ordered, so that each process knows who its successor is [4].
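A minimal single-process Python sketch of the ring election idea described above (the process IDs are hypothetical, and the sketch ignores failures and the final coordinator announcement): the election message travels once around the ring collecting IDs, and the highest ID becomes the coordinator.

```python
def ring_election(ids, initiator_index):
    """Sketch of ring election: the message circulates once around the ring
    collecting process IDs, and the highest collected ID wins."""
    n = len(ids)
    collected = []
    i = initiator_index
    while True:
        collected.append(ids[i])
        i = (i + 1) % n                 # pass the message to the successor
        if i == initiator_index:        # message has returned to the initiator
            break
    return max(collected)

# Hypothetical process IDs arranged in a logical ring; the process at index 2 starts the election.
print(ring_election([7, 3, 12, 9, 5], initiator_index=2))   # -> 12
```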
This section traces the history of distributed computing systems from the mainframe era to the present day. It is important to understand the history of anything in order to track how far we have progressed. Distributed computing is essentially a story of evolution from centralization to decentralization: it depicts how centralized systems evolved over time towards decentralization. We had centralized systems such as the mainframe around 1955, but today we use decentralized systems such as edge computing and containers.
5. P2P, Grids & Web Services: Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers without the need for a central coordinator. Peers share equal privileges, and in a P2P network each node acts as both a client and a server. P2P file sharing was introduced in 1999 when American college student Shawn Fanning created the music-sharing service Napster. P2P networking enables a decentralized internet. With the introduction of Grid computing, multiple tasks can be completed by computers jointly connected over a network. It basically makes use of a data grid, i.e., a set of computers that can directly interact with each other to perform similar tasks by using middleware. During 1994–2000, we also saw the creation of effective x86 virtualization. With the introduction of web services, platform-independent communication was established, using XML-based information exchange systems that rely on the Internet for direct application-to-application interaction. Through web services, Java can talk with Perl and Windows applications can talk with Unix applications. Peer-to-peer networks are often created by collections of 12 or fewer machines.
6. Cloud, Mobile & IoT: Cloud computing emerged from the convergence of cluster technology, virtualization, and middleware. Through cloud computing, you can manage your resources and applications online over the internet without hosting them on your own hard drive or server. The major advantage is that these resources can be accessed by anyone from anywhere in the world. Many cloud providers offer subscription-based services: after paying for a subscription, customers can access all the computing resources they need. Customers no longer need to update outdated servers, buy hard drives when they run out of storage, install software updates or buy software licenses; the vendor does all that for them. Mobile computing allows us to transmit data, such as voice and video, over a wireless network; we no longer need to connect our mobile phones with switches.
The evolution of Application Programming Interface (API) based communication over the REST model was needed to provide scalability, flexibility, portability, caching, and security. Instead of implementing these capabilities separately in each and every API, the requirement arose for a common component that applies these features on top of the API. This requirement led to the evolution of API management platforms, and today they have become one of the core features of any distributed system. Likewise, instead of treating one computer as a single machine, the idea of running multiple systems within one computer came into existence.
7. Fog and Edge Computing: When the data produced by mobile computing and IoT services started to grow tremendously, collecting and processing millions of data items in real time was still an issue. This led to the concept of edge computing, in which client data is processed at the periphery of the network; it is largely a matter of location. Instead of moving data across a WAN such as the internet to a centralized data center, which can cause latency issues, the data is processed and analyzed closer to the point where it is created, such as a corporate LAN. Fog
computing greatly reduces the need for bandwidth by not sending every bit of information over
cloud channels, and instead aggregating it at certain access points. This type of distributed
strategy lowers costs and improves efficiencies. Companies like IBM are the driving force
behind fog computing. The combination of fog and edge computing further extends the cloud computing model away from centralized stakeholders to decentralized multi-stakeholder systems that are capable of providing ultra-low service response times and increased aggregate bandwidth.
Today, distributed systems are programmed by application programmers, while the underlying infrastructure is managed by a cloud provider. This is the current state of distributed computing, and it keeps evolving.
Distributed computing platforms have become increasingly popular in recent years due to their
ability to handle large-scale and complex computations. Here are the current limitations and
strengths of some leading distributed computing platforms:
Apache Hadoop: Strengths: Hadoop is widely used for processing large volumes of data in a
fault-tolerant and scalable manner. It supports a variety of data sources and offers a flexible
data processing framework. Hadoop is well-supported by a large community and offers a
variety of tools for data analytics.
Limitations: Hadoop has a high latency due to its reliance on disk I/O and can suffer from
performance issues when processing small files. It is not designed for real-time data processing
and can be complex to set up and manage.
Apache Spark: Strengths: Spark is a high-performance data processing engine that can
handle both batch and real-time processing. It supports a variety of data sources and offers a
flexible programming model. Spark is well-supported by a large community and offers a
variety of tools for data analytics.
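As an illustration of Spark's programming model, here is a minimal PySpark word-count sketch. It assumes PySpark is installed and runs in local mode; the input file name input.txt is a placeholder:

```python
from pyspark import SparkContext

# Local Spark context; on a cluster the master URL would point at the cluster manager.
sc = SparkContext("local[*]", "WordCount")

counts = (sc.textFile("input.txt")                      # placeholder input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
sc.stop()
```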
Apache Flink: Strengths: Flink is a high-performance data processing engine that offers both
batch and stream processing capabilities. It is designed for low-latency data processing and
offers a flexible programming model. Flink is well-suited for complex event processing and
real-time analytics.
Limitations: Flink is relatively new compared to other distributed computing platforms and may
not have as large of a community or ecosystem of tools. It may also require more expertise to set
up and manage compared to other platforms.
Apache Kafka: Strengths: Kafka is a high-performance messaging system that can handle
large volumes of data streams. It offers low-latency and high-throughput data processing and
is well-suited for real-time data processing and event-driven architectures. Kafka is widely
used for building scalable and reliable data pipelines.
Limitations: Kafka is not a full-featured data processing engine and may require additional tools
for data processing and analytics. It may also require more expertise to set up and manage
compared to other messaging systems.
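A minimal sketch of producing and consuming a Kafka stream with the kafka-python client. It assumes kafka-python is installed, a broker is reachable at localhost:9092, and the topic name "events" is a placeholder:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a few JSON events to the (placeholder) "events" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for i in range(3):
    producer.send("events", {"event_id": i, "type": "demo"})
producer.flush()

# Consumer: read the events back from the beginning of the topic.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,          # stop iterating if no new messages arrive
)
for record in consumer:
    print(record.value)
```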
In summary, distributed computing platforms have strengths and limitations that should be
considered when selecting a platform for a particular use case. Apache Hadoop, Spark, Flink,
and Kafka are popular platforms with different strengths and limitations that make them suitable
for different types of data processing and analytics.
References
[1] M. van Steen and A. S. Tanenbaum, “A brief introduction to distributed
systems,” Computing, vol. 98, no. 10, pp. 967–1009, 2016, doi:
10.1007/s00607-016-0508-7.
[2] S. S. Yau, “Challenges and Future Trends of Distributed Computing
Systems,” pp. 758–758, 2011, doi: 10.1109/hpcc.2011.151.
[3] J. Brier and L. D. Jayanti, No 主観的健康感を中心とした在宅高齢者における健康関連指標に関する共分散構造分析 Title, vol. 21, no. 1, 2020. [Online]. Available: http://journal.um-surabaya.ac.id/index.php/JKM/article/view/2203
[4] S. Balhara and K. Khanna, “Leader Election Algorithms in Distributed
Systems,” Int. J. Comput. Sci. Mob. Comput., vol. 3, no. 6, pp. 374–379,
2014.
[5] M. Zaharia et al., "Spark: Cluster Computing with Working Sets," in Proc. 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud), 2010.
[6] P. Carbone et al., "Apache Flink: Stream and Batch Processing in a Single Engine," IEEE Data Engineering Bulletin, vol. 38, no. 4, 2015.
[7] Apache Kafka, "Why Use Kafka?," 2022. [Online].
Web sites
1. https://www.geeksforgeeks.org/evolution-of-distributed-computing-systems/
2. https://insights.daffodilsw.com/blog/distributed-cloud-computing-benefits-and-limitations
3. https://www.geeksforgeeks.org/limitation-of-distributed-system/
4. https://www.techtarget.com/whatis/definition/distributed-computing
5. https://www.geeksforgeeks.org/synchronization-in-distributed-systems/