Week 5
A distributed database refers to a database system in which data is spread across multiple nodes
or locations, often connected via a network. Unlike a centralized database where all data is
stored in a single location, a distributed database distributes data across different sites for
improved scalability, availability, and performance.
In a distributed database, each node typically manages a portion of the data independently
while still being able to communicate and coordinate with other nodes in the network. This
allows for parallel processing, fault tolerance, and better load balancing.
1. Replication: Data is duplicated across multiple nodes, providing redundancy and fault
tolerance. Changes to data on one node are propagated to the other nodes to keep the
copies consistent (see the sketch after this list).
2. Partitioning: The database is divided into fragments (partitions), and each fragment is
stored on a different node, so that every node manages only part of the data.
3. Federated: Each node maintains its own database, but a federated system provides a
unified interface to access data across multiple databases. Queries can be distributed
and executed across different nodes transparently to the user.
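To make the replication idea in item 1 concrete, here is a minimal Python sketch of a primary node propagating writes to its replicas. The Node and PrimaryNode classes and the in-memory dictionaries are hypothetical details invented for this illustration; a real system would add durable storage, networking, and failure handling.

```python
# Minimal sketch of primary-based replication: a write applied at the
# primary is propagated to every replica so all copies stay consistent.

class Node:
    """A database node holding a simple in-memory key-value store."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class PrimaryNode(Node):
    """The node that accepts writes and forwards them to its replicas."""
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas

    def write(self, key, value):
        self.apply(key, value)              # update the local copy first
        for replica in self.replicas:       # propagate the change
            replica.apply(key, value)


replicas = [Node("replica-1"), Node("replica-2")]
primary = PrimaryNode("primary", replicas)
primary.write("account:42", {"balance": 100})

# Every node now holds the same value, so reads can be served locally.
print(all(n.data["account:42"] == primary.data["account:42"] for n in replicas))
```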
Distributed databases are commonly used in scenarios where data needs to be accessed and
updated by users or applications distributed across different geographical locations, such as in
global enterprises, cloud computing environments, and decentralized applications. However,
designing and managing distributed databases can be complex due to challenges related to data
consistency, concurrency control, and network latency. Various distributed database
management systems (DBMS) and architectural patterns have been developed to address these
challenges and provide efficient and reliable data management in distributed environments.
Here's how a distributed database might be implemented in the context of the global financial
system:
By distributing data and processing across multiple locations and using replication, partitioning,
federation, and hybrid strategies, the global financial system can ensure high availability,
scalability, and security while meeting the stringent requirements of the financial industry and
regulatory authorities. However, managing a distributed database for such a system requires
sophisticated technologies, robust security measures, and stringent governance processes to
protect sensitive financial data and ensure the integrity and reliability of financial transactions.
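As a rough illustration (not a description of any real system), the following Python sketch shows how regional databases could sit behind one unified, federated interface: a single-account query is routed to that account's home region, while a global report fans out to every site. The region names, classes, and data are assumptions made only for this example.

```python
# Simplified sketch of a federated setup: each region keeps its own
# database, and a thin coordinator presents one unified interface.

class RegionalDatabase:
    """Stands in for an independent database at one site."""
    def __init__(self, region, accounts):
        self.region = region
        self.accounts = accounts        # account_id -> balance

    def get_balance(self, account_id):
        return self.accounts.get(account_id)


class FederatedCoordinator:
    """Routes single-account queries and fans out global queries."""
    def __init__(self, databases):
        self.databases = databases      # region -> RegionalDatabase

    def get_balance(self, region, account_id):
        # A lookup for one customer touches only that customer's region.
        return self.databases[region].get_balance(account_id)

    def total_assets(self):
        # A global report is executed on every site and combined here.
        return sum(sum(db.accounts.values()) for db in self.databases.values())


coordinator = FederatedCoordinator({
    "EU": RegionalDatabase("EU", {"acct-1": 250.0}),
    "US": RegionalDatabase("US", {"acct-2": 900.0}),
})
print(coordinator.get_balance("EU", "acct-1"))   # served by one site
print(coordinator.total_assets())                # combined across sites
```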
Advantages of Distributed Databases
1. Improved Scalability: Distributed databases can scale horizontally by adding more nodes
or partitions, allowing them to handle increasing data volumes and user loads more
efficiently. This scalability is essential for accommodating growth in data-driven
applications and services without experiencing performance bottlenecks (see the sketch
after this list).
5. Data Redundancy and Disaster Recovery: Distributed databases often replicate data
across multiple nodes, providing redundancy and resilience against hardware failures,
network outages, and natural disasters. In the event of a failure or data loss, redundant
copies of the data can be used for recovery, ensuring data integrity and business
continuity.
6. Flexible Data Placement: Distributed databases offer flexibility in how data is stored and
managed, allowing organizations to partition data based on different criteria such as
geographical location, access patterns, or compliance requirements. This flexibility
enables organizations to optimize data storage and access for their specific needs and
use cases.
7. Support for Big Data and Analytics: Distributed databases are well-suited for handling
large volumes of data and performing complex analytics tasks such as data mining,
machine learning, and real-time analytics. By distributing data and processing across
multiple nodes, distributed databases can efficiently process and analyze massive
datasets, enabling organizations to derive valuable insights and make data-driven
decisions.
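As referenced in item 1 above, the sketch below shows in Python how a simple hash-based placement rule spreads data across nodes, and how adding a node lowers each node's share. The hash-modulo rule and node names are assumptions for illustration; real systems usually prefer schemes such as consistent hashing to limit data movement when the cluster grows.

```python
import hashlib
from collections import Counter

def owner(key, nodes):
    """Pick the node responsible for a key using a simple hash-modulo rule."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes_small = ["node-1", "node-2"]
nodes_large = nodes_small + ["node-3"]          # scale out by adding a node

keys = [f"customer:{i}" for i in range(10_000)]

# With more nodes each one owns a smaller share of the data, which is what
# lets the cluster absorb a growing data volume and user load.
print(Counter(owner(k, nodes_small) for k in keys))
print(Counter(owner(k, nodes_large) for k in keys))

# Caveat: with hash-modulo placement, adding a node reassigns most keys;
# consistent hashing is commonly used instead to limit that movement.
```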
Overall, distributed databases offer a robust and scalable solution for managing and processing
data in modern, distributed environments, providing organizations with the flexibility,
availability, and performance they need to meet the demands of today's data-driven world.
Disadvantages of Distributed Databases
5. Increased Overhead: Replicating and synchronizing data across distributed nodes incurs
additional overhead in terms of storage, bandwidth, and computational resources. This
overhead can impact performance and scalability, especially for write-intensive
workloads and large-scale deployments. Optimizing data replication strategies and
resource allocation is essential to minimize overhead and maximize efficiency.
6. Data Partitioning Challenges: Partitioning data across distributed nodes requires careful
consideration of data distribution strategies and key design decisions. Inadequate
partitioning can lead to data skew, hotspots, and uneven workload distribution, affecting
performance and scalability. Balancing data partitioning with query performance and
data access patterns is a non-trivial task (a short sketch of data skew follows this list).
7. Vendor Lock-in: Adopting a specific distributed database technology or vendor may lead
to vendor lock-in, limiting flexibility and interoperability with other systems and
platforms. Migrating data between different distributed databases or transitioning to a
different architecture can be complex and costly, especially if the system relies on
proprietary features or vendor-specific APIs.
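As noted in item 6 above, the following small Python sketch illustrates data skew: partitioning customer rows by a country attribute concentrates most of the data (and therefore most of the load) on one node, while hashing the primary key spreads rows evenly. The data and the choice of partitioning keys are invented for the example.

```python
import random
from collections import Counter

random.seed(0)

# Synthetic customer rows: most customers come from a single country, so
# partitioning on the country attribute piles the data onto one node.
country_pool = ["US"] * 80 + ["DE"] * 12 + ["JP"] * 8
rows = [{"id": i, "country": random.choice(country_pool)} for i in range(10_000)]

by_country = Counter(row["country"] for row in rows)
print("partition by country:", dict(by_country))     # heavily skewed (hotspot)

# Hashing the primary key spreads rows almost evenly, avoiding the hotspot,
# but attribute-based queries may then have to touch every partition.
by_hash = Counter(hash(row["id"]) % 3 for row in rows)
print("partition by hash(id):", dict(by_hash))
```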
Overall, while distributed databases offer numerous benefits, addressing the challenges
associated with complexity, data consistency, network dependency, security, overhead, data
partitioning, and vendor lock-in is essential for successfully deploying and managing distributed
database systems. Organizations need to carefully evaluate their requirements, architecture, and
deployment strategies to mitigate these disadvantages and maximize the benefits of distributed
databases.
Distributed Database Management System (DDBMS)
A Distributed Database Management System (DDBMS) is a software system that manages a
distributed database, which is a database that is spread across multiple nodes or locations, often
connected via a network. The primary purpose of a DDBMS is to provide efficient and reliable
access to distributed data while ensuring data consistency, availability, and integrity.
1. Data Distribution and Replication: A DDBMS facilitates the distribution and replication
of data across multiple nodes in a distributed environment. Data distribution involves
dividing the database into partitions or fragments and assigning them to different nodes
based on certain criteria. Data replication involves creating copies of data fragments and
distributing them across multiple nodes to improve availability and fault tolerance.
4. Data Consistency and Concurrency Control: Ensuring data consistency and managing
concurrency control in a distributed environment is a critical aspect of a DDBMS.
Techniques such as distributed locking, timestamp-based concurrency control, and
distributed snapshot isolation are used to manage concurrent access to distributed data
and maintain data consistency (a small timestamp-ordering sketch follows this list).
5. Data Recovery and Fault Tolerance: A DDBMS implements mechanisms for data
recovery and fault tolerance to handle failures and ensure the availability and integrity
of data in the event of node failures, network partitions, or other disruptions. This
includes techniques such as distributed logging, distributed checkpointing, and
distributed recovery protocols.
6. Security and Access Control: A DDBMS provides features for ensuring the security and
privacy of distributed data, including authentication, authorization, encryption, and data
masking. Access control mechanisms are used to enforce security policies and regulate
access to sensitive data stored across distributed nodes.
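To make the timestamp-based concurrency control mentioned in item 4 more concrete, here is a very small Python sketch of basic timestamp ordering on a single data item. It deliberately ignores distribution, recovery, and the restart of aborted transactions, and the class and method names are invented for this illustration.

```python
class TimestampedItem:
    """One data item with the read/write timestamps used by basic
    timestamp ordering: operations that arrive 'too late' are rejected."""

    def __init__(self, value):
        self.value = value
        self.read_ts = 0     # largest timestamp that has read this item
        self.write_ts = 0    # largest timestamp that has written this item

    def read(self, ts):
        if ts < self.write_ts:
            # The item was already overwritten by a younger transaction.
            raise RuntimeError(f"abort T{ts}: read too late")
        self.read_ts = max(self.read_ts, ts)
        return self.value

    def write(self, ts, value):
        if ts < self.read_ts or ts < self.write_ts:
            # A younger transaction has already read or written the item.
            raise RuntimeError(f"abort T{ts}: write too late")
        self.write_ts = ts
        self.value = value


item = TimestampedItem(100)
item.write(ts=2, value=150)   # T2 writes first
try:
    item.read(ts=1)           # older T1 arrives late and must abort
except RuntimeError as err:
    print(err)
```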
Overall, a Distributed Database Management System plays a crucial role in enabling efficient and
reliable access to distributed data in modern, interconnected environments, facilitating
applications and services that span multiple locations and require scalable and resilient data
management capabilities.
There are several common types of distributed database systems, each offering different
characteristics, advantages, and challenges. The choice of a distributed database system
depends on factors such as scalability requirements, data consistency and availability
constraints, performance objectives, and organizational needs.
Data distribution
Data distribution in distributed database systems refers to the process of dividing the database
into smaller fragments or partitions and distributing these fragments across multiple nodes or
locations within a distributed computing environment. This distribution enables efficient data
storage, access, and management across distributed nodes, providing benefits such as improved
performance, scalability, fault tolerance, and availability. Data distribution strategies play a
crucial role in designing and optimizing distributed database systems.
Data distribution in distributed database systems introduces several key issues: data
fragmentation, data allocation, and data replication. Let's explore each of these issues in more
detail:
1. Data Fragmentation:
Data fragmentation refers to the process of dividing a database into smaller fragments or
partitions distributed across multiple nodes in a distributed environment. The decision regarding
which portions of the database will be stored at which site is generally made during
distributed database design.
1. Horizontal Fragmentation: Divides the database table rows into subsets based on a specified
condition or attribute value. For example, customer records can be horizontally fragmented
based on geographical regions (see the sketch after the list below).
The horizontal partitions for a distributed database have the following major advantages:
Efficiency: Data are stored close to where they are used and separate from other data
used by other users or applications.
Local optimization: Data can be stored to optimize performance for local access.
Security: Data not relevant to usage at a particular site are not made available.
Ease of querying: Combining data across horizontal partitions is easy because rows are
simply merged by unions across the partitions.
The major disadvantages of horizontal partitions are:
Inconsistent access speed: When data from several partitions are required, the access
time can be significantly different from local-only data access.
Backup vulnerability: When data at one site become inaccessible or damaged, users
cannot switch to another site where a copy exists, because data are not replicated.
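As mentioned above, the following minimal Python sketch shows horizontal fragmentation of a customer table by region; the rows and region values are made up for the example.

```python
# Horizontal fragmentation: each fragment contains complete rows, selected
# by a condition on an attribute (here, the customer's region).

customers = [
    {"id": 1, "name": "Asha",  "region": "EU"},
    {"id": 2, "name": "Bilal", "region": "US"},
    {"id": 3, "name": "Chen",  "region": "EU"},
]

fragments = {}
for row in customers:
    fragments.setdefault(row["region"], []).append(row)

# Each fragment can now be stored at the site closest to its users;
# the original table is recovered by taking the union of the fragments.
print(fragments["EU"])
print(sum(fragments.values(), []))   # union of all horizontal fragments
```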
2. Vertical Fragmentation: Divides the database table columns into subsets, typically based on
column relevance or access patterns. For example, frequently accessed columns can be
placed in one fragment, while less accessed columns can be placed in another (see the
sketch after the next paragraph).
The advantages and disadvantages of vertical partitions are identical to those for horizontal
partitions, with the exception that combining data across vertical partitions is more difficult than
across horizontal partitions.
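For comparison, here is an equally small sketch of vertical fragmentation. The particular column grouping is an assumption; the point is that every fragment repeats the primary key so that full rows can be reconstructed with a join.

```python
# Vertical fragmentation: each fragment keeps a subset of the columns,
# and every fragment repeats the primary key so rows can be rejoined.

customers = [
    {"id": 1, "name": "Asha",  "email": "asha@example.com",  "credit_limit": 500},
    {"id": 2, "name": "Bilal", "email": "bilal@example.com", "credit_limit": 900},
]

frequently_used = [{"id": c["id"], "name": c["name"]} for c in customers]
rarely_used = [{"id": c["id"], "email": c["email"], "credit_limit": c["credit_limit"]}
               for c in customers]

# Reconstructing a full row requires a join on the shared primary key,
# which is why combining vertical fragments is harder than a simple union.
joined = [{**a, **b} for a in frequently_used for b in rarely_used if a["id"] == b["id"]]
print(joined[0])
```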
3. Data Replication:
Data replication involves creating and maintaining copies of data fragments across multiple
nodes to improve fault tolerance, availability, and data access performance. The replication of
data improves the performance, availability and reliability of the distributed database system.
Reliability: If one or more sites containing the database fail, the copy of the database can
always be found at another site without network traffic delays.
Fast Response: Every site that has a full copy of database can process queries locally,
thus queries can be processed rapidly.
Possible Avoidance of Complicated Distributed Transaction Integrity Routines: Replicated
databases are usually refreshed at scheduled intervals, thus most forms of replication
are used when some relaxing of synchronization across database copies is acceptable.
Node Decoupling: If some sites are down, busy, or disconnected, a transaction is handled
when the user desires. This is possible since each transaction may proceed without
coordination across the network.
Reduced Network Traffic at Prime Time: In general, data updates happen during prime
business hours, when network traffic is highest and the demands for rapid response are
greatest. Because replicated copies can be refreshed with delayed, scheduled updates,
much of this update traffic can be shifted to off-peak hours (see the sketch after this list).
Replication also has disadvantages:
Storage Requirements: Each site that holds a full copy must have the same storage
capacity as if the data were stored centrally, and each copy of the database must be
updated at every site that holds it. This requires both storage space and processing time.
Complexity and Cost of Updating: Whenever the database is updated, the update must be
applied at each site that holds a copy. Careful coordination is required to synchronize the
updates in near real time.
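As referenced in the list above, the following Python sketch illustrates delayed, scheduled propagation of updates: changes are applied locally right away, queued, and pushed to the replicas in one batch outside prime hours. The queue-and-flush structure is an assumption made for illustration; real systems rely on durable logs or change-data-capture mechanisms.

```python
# Sketch of delayed (scheduled) replication: updates are queued locally
# and pushed to the replicas in one batch outside prime hours.

class LazyReplicatedStore:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas      # plain dicts standing in for remote copies
        self.pending = []             # updates not yet sent to the replicas

    def write(self, key, value):
        self.data[key] = value        # the local copy is updated immediately
        self.pending.append((key, value))

    def refresh_replicas(self):
        """Called on a schedule (e.g. nightly) rather than on every write."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


store = LazyReplicatedStore(replicas=[{}, {}])
store.write("rate:EURUSD", 1.09)      # fast, local-only during prime time
store.write("rate:GBPUSD", 1.27)
store.refresh_replicas()              # batched propagation later
print(store.replicas[0])
```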
There are three broad approaches to replication:
Full Replication
No Replication
Partial Replication
FULL REPLICATION
In full replication, a replica of the whole database is stored at every site in the distributed
system. This means every relation is available locally to every user.
Advantages :
Availability increases drastically, since the system continues to operate as long as at
least one site is up.
Performance increases, since the result of every query can be obtained locally from any
site.
Queries can be processed rapidly.
Disadvantages :
It slows down the update operations drastically, since the update must be performed on
every copy of the database to keep the copies consistent.
The concurrency control and recovery techniques become more expensive.
Since each site has a full copy, it must have the same storage capacity that would be
required if the data were stored centrally.
NO REPLICATION
In no replication, each fragment of the database is stored at exactly one site. Thus all fragments
of the database are disjoint, except for the primary key.
Advantages :
Updating is easy, since the data needs to be updated in only one place.
The concurrency control and recovery techniques are less expensive.
Storage requirements are much lower compared to full replication.
Disadvantages :
Availability and reliability are reduced: if the only site holding a fragment fails, that data
becomes inaccessible until the site recovers.
Queries that need data from several fragments must access multiple sites, so response
time depends on the network.
PARTIAL REPLICATION
In partial replication, some fragments of the database may be replicated whereas others may
not. The number of copies of each fragment can vary from one up to the total number of sites in
the distributed system.
Advantages :
The number of copies of a fragment can be tailored to how frequently and where it is
accessed, giving a compromise between the performance and availability of full
replication and the low storage and update cost of no replication.
Disadvantages :
Replicated fragments still have to be kept consistent, so some update and
synchronization overhead remains, and deciding which fragments to replicate, and at
which sites, adds to the design complexity.
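To tie the three approaches together, here is a small hypothetical placement map in Python showing that full, partial, and no replication are essentially different replication factors per fragment; the fragment and site names are invented for the example.

```python
# Each fragment is assigned to one or more sites. A replication factor of 1
# is "no replication"; a copy at every site is "full replication"; anything
# in between is partial replication.

sites = ["london", "newyork", "singapore"]

placement = {
    "customers_eu":   ["london"],                    # no replication
    "exchange_rates": sites,                         # full replication (read-heavy)
    "transactions":   ["london", "newyork"],         # partial replication
}

for fragment, locations in placement.items():
    factor = len(locations)
    kind = {1: "no replication", len(sites): "full replication"}.get(
        factor, "partial replication")
    print(f"{fragment}: {factor} cop{'y' if factor == 1 else 'ies'} -> {kind}")
```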