Week 5

A distributed database is a system where data is spread across multiple nodes for improved scalability, availability, and performance, allowing for independent management of data at each node. It can be structured through replication, partitioning, federation, or hybrid approaches, and is commonly used in global enterprises and cloud environments. While offering advantages like enhanced availability and support for big data, distributed databases also present challenges such as complexity, data consistency, and security risks.

Uploaded by

haederredha
Distributed Database

A distributed database refers to a database system in which data is spread across multiple nodes
or locations, often connected via a network. Unlike a centralized database where all data is
stored in a single location, a distributed database distributes data across different sites for
improved scalability, availability, and performance.

In a distributed database, each node typically manages a portion of the data independently
while still being able to communicate and coordinate with other nodes in the network. This
allows for parallel processing, fault tolerance, and better load balancing.

Distributed databases can be structured in various ways, including:

1. Replication: Data is duplicated across multiple nodes, providing redundancy and fault
tolerance. Changes to data in one node are propagated to other nodes to ensure
consistency.

2. Partitioning: Data is partitioned or sharded across multiple nodes based on certain
criteria (e.g., key ranges, hash values). Each node is responsible for storing and managing
a specific subset of the data.

3. Federated: Each node maintains its own database, but a federated system provides a
unified interface to access data across multiple databases. Queries can be distributed
and executed across different nodes transparently to the user.

4. Hybrid: Combines elements of replication, partitioning, and federation to meet specific
requirements of the application or workload.
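
The partitioning criteria mentioned in item 2 (key ranges and hash values) can be illustrated with a minimal Python sketch. The node names, the boundary values, and the function names are assumptions made for this example; they do not come from any particular DBMS.

```python
import hashlib
from bisect import bisect_right

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names

def hash_partition(key: str, nodes=NODES) -> str:
    """Route a key to a node by hashing it (stable across runs)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

# Range partitioning with assumed boundaries:
# key < "h" -> node-a, "h" <= key < "p" -> node-b, everything else -> node-c.
BOUNDARIES = ["h", "p"]

def range_partition(key: str, boundaries=BOUNDARIES, nodes=NODES) -> str:
    """Route a key to the node that owns the key range it falls into."""
    return nodes[bisect_right(boundaries, key)]
```

Hash partitioning tends to spread keys evenly but scatters neighboring keys, while range partitioning keeps neighboring keys together, which helps range scans but can concentrate load on one node.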

Distributed databases are commonly used in scenarios where data needs to be accessed and
updated by users or applications distributed across different geographical locations, such as in
global enterprises, cloud computing environments, and decentralized applications. However,
designing and managing distributed databases can be complex due to challenges related to data
consistency, concurrency control, and network latency. Various distributed database
management systems (DBMS) and architectural patterns have been developed to address these
challenges and provide efficient and reliable data management in distributed environments.

A real-life example of a distributed database


A real-life example of a distributed database is the global financial system used by banks and
financial institutions. This system is distributed across multiple geographical locations and
involves various components such as transaction processing, customer accounts, fraud
detection, and regulatory compliance.

Here's how a distributed database might be implemented in the context of the global financial
system:

1. Replicated Customer Accounts: Customer account information, including account
balances, transaction history, and personal details, is replicated across multiple data
centers or regions. This ensures that customer data is available and accessible for
transactions, inquiries, and reporting regardless of the customer's location or the data
center's availability.

2. Partitioned Transaction Processing: Transaction processing is partitioned based on
geographical regions or types of transactions. For example, transactions initiated in
Europe may be processed by data centers located in Europe, while transactions initiated
in Asia may be processed by data centers located in Asia. This partitioning approach
helps reduce latency and ensures compliance with regional regulations governing
financial transactions.

3. Federated Fraud Detection: Fraud detection systems use a federated approach to
analyze transactions across multiple data centers in real-time. Each data center hosts its
own fraud detection system, but these systems share information and coordinate their
efforts to detect and prevent fraudulent activities, such as unauthorized transactions or
identity theft, on a global scale.

4. Hybrid Regulatory Compliance: Regulatory compliance requirements, such as
anti-money laundering (AML) regulations or know your customer (KYC) regulations, are
managed using a hybrid approach. Common compliance rules and policies may be
enforced centrally across all data centers, while region-specific regulations may be
implemented locally in each data center to ensure compliance with local laws and
regulations.

By distributing data and processing across multiple locations and using replication, partitioning,
federation, and hybrid strategies, the global financial system can ensure high availability,
scalability, and security while meeting the stringent requirements of the financial industry and
regulatory authorities. However, managing a distributed database for such a system requires
sophisticated technologies, robust security measures, and stringent governance processes to
protect sensitive financial data and ensure the integrity and reliability of financial transactions.

Distributed databases advantages


Distributed databases offer several advantages over traditional centralized databases, especially
in modern, globally interconnected environments. Some of the key advantages include:

1. Improved Scalability: Distributed databases can scale horizontally by adding more nodes
or partitions, allowing them to handle increasing data volumes and user loads more
efficiently. This scalability is essential for accommodating growth in data-driven
applications and services without experiencing performance bottlenecks.

2. Enhanced Availability: By distributing data across multiple nodes or locations,
distributed databases can provide higher availability and fault tolerance. Even if some
nodes fail or become inaccessible, the system can continue to operate and serve
requests using data from other nodes, reducing the risk of downtime and data loss.

3. Geographical Distribution: Distributed databases can store data closer to where it is
needed, reducing latency and improving the responsiveness of applications for users in
different geographical regions. This is particularly beneficial for global enterprises and
online services with users distributed across multiple locations.

4. Increased Performance: Distributed databases can leverage parallel processing and
distributed query execution to improve performance for complex analytical queries and
data-intensive operations. By distributing the workload across multiple nodes,
distributed databases can achieve faster response times and better utilization of
computational resources.

5. Data Redundancy and Disaster Recovery: Distributed databases often replicate data
across multiple nodes, providing redundancy and resilience against hardware failures,
network outages, and natural disasters. In the event of a failure or data loss, redundant
copies of the data can be used for recovery, ensuring data integrity and business
continuity.

6. Flexible Data Placement: Distributed databases offer flexibility in how data is stored and
managed, allowing organizations to partition data based on different criteria such as
geographical location, access patterns, or compliance requirements. This flexibility
enables organizations to optimize data storage and access for their specific needs and
use cases.

7. Support for Big Data and Analytics: Distributed databases are well-suited for handling
large volumes of data and performing complex analytics tasks such as data mining,
machine learning, and real-time analytics. By distributing data and processing across
multiple nodes, distributed databases can efficiently process and analyze massive
datasets, enabling organizations to derive valuable insights and make data-driven
decisions.

Overall, distributed databases offer a robust and scalable solution for managing and processing
data in modern, distributed environments, providing organizations with the flexibility,
availability, and performance they need to meet the demands of today's data-driven world.
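
The distributed query execution mentioned under Increased Performance can be sketched as a scatter-gather aggregate: each node computes a partial result over its own rows, and a coordinator combines the partials. This is a minimal illustration only. The partition contents and function names are hypothetical, and threads stand in for separate nodes, so the sketch shows the structure of the computation rather than real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each list stands in for the rows held by one node.
PARTITIONS = {
    "node-a": [120, 80, 45],
    "node-b": [300, 10],
    "node-c": [75, 75, 75, 25],
}

def local_aggregate(rows):
    """Work done on one node: a partial sum and a row count."""
    return sum(rows), len(rows)

def distributed_average(partitions):
    """Scatter the aggregate to every partition, then gather and combine."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(local_aggregate, partitions.values()))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```

Only the small partial results (a sum and a count per node) cross the network in this design, not the rows themselves.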

Distributed databases disadvantages


While distributed databases offer numerous advantages, they also come with some inherent
disadvantages and challenges. Here are several drawbacks associated with distributed
databases:

1. Complexity: Distributed databases are inherently more complex to design, implement,
and manage compared to centralized databases. They require expertise in distributed
systems, networking, and database administration. Dealing with issues such as data
consistency, replication, partitioning, and distributed query optimization adds to the
complexity.

2. Data Consistency: Ensuring data consistency across distributed nodes can be
challenging. With data being replicated and distributed, maintaining consistency in the
face of concurrent updates, network partitions, and node failures requires sophisticated
consistency models, distributed transactions, and coordination protocols. Achieving
strong consistency often comes at the cost of increased latency and reduced availability.

3. Network Dependency: Distributed databases rely heavily on network communication for
data replication, synchronization, and coordination. Network outages, latency, and
bandwidth limitations can impact the performance and availability of distributed
systems. Ensuring robust network infrastructure and mitigating network-related issues
are crucial for maintaining the reliability of distributed databases.

4. Security Risks: Distributed databases introduce additional security risks compared to
centralized databases. Data being distributed across multiple nodes increases the attack
surface and complexity of security management. Securing data transmission, enforcing
access controls, and protecting against unauthorized access, data breaches, and insider
threats require comprehensive security measures and encryption techniques.

5. Increased Overhead: Replicating and synchronizing data across distributed nodes incurs
additional overhead in terms of storage, bandwidth, and computational resources. This
overhead can impact performance and scalability, especially for write-intensive
workloads and large-scale deployments. Optimizing data replication strategies and
resource allocation is essential to minimize overhead and maximize efficiency.

6. Data Partitioning Challenges: Partitioning data across distributed nodes requires careful
consideration of data distribution strategies and key design decisions. Inadequate
partitioning can lead to data skew, hotspots, and uneven workload distribution, affecting
performance and scalability. Balancing data partitioning with query performance and
data access patterns is a non-trivial task.

7. Vendor Lock-in: Adopting a specific distributed database technology or vendor may lead
to vendor lock-in, limiting flexibility and interoperability with other systems and
platforms. Migrating data between different distributed databases or transitioning to a
different architecture can be complex and costly, especially if the system relies on
proprietary features or vendor-specific APIs.

Overall, while distributed databases offer numerous benefits, addressing the challenges
associated with complexity, data consistency, network dependency, security, overhead, data
partitioning, and vendor lock-in is essential for successfully deploying and managing distributed
database systems. Organizations need to carefully evaluate their requirements, architecture, and
deployment strategies to mitigate these disadvantages and maximize the benefits of distributed
databases.
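
The data skew and hotspots described under Data Partitioning Challenges can be made concrete with a small Python experiment. The workload below is synthetic, and the year-based range scheme and the helper names are assumptions chosen only to show the effect.

```python
import hashlib
from collections import Counter

def hash_node(key: str, n_nodes: int) -> int:
    """Hash partitioning: spread keys by a stable hash of the whole key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes

def range_node_by_year(key: str, n_nodes: int) -> int:
    """Hypothetical range scheme: partition order ids by their year prefix."""
    year = int(key[:4])
    return (year - 2021) % n_nodes  # 2021 -> 0, 2022 -> 1, 2023 -> 2

def skew(counts: Counter, n_nodes: int) -> float:
    """Ratio of the busiest node's load to a perfectly even share."""
    total = sum(counts.values())
    return max(counts.values()) / (total / n_nodes)

# Synthetic workload: 90% of the keys belong to the most recent year.
keys = (
    [f"2023-{i:05d}" for i in range(900)]
    + [f"2022-{i:05d}" for i in range(50)]
    + [f"2021-{i:05d}" for i in range(50)]
)
range_counts = Counter(range_node_by_year(k, 3) for k in keys)
hash_counts = Counter(hash_node(k, 3) for k in keys)
```

With this workload the year-based scheme sends 900 of the 1000 keys to a single node, so that node carries 2.7 times its even share, while hashing the full key spreads the same keys far more evenly.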

Distributed Database Management System (DDBMS)

A Distributed Database Management System (DDBMS) is a software system that manages a
distributed database, which is a database that is spread across multiple nodes or locations, often
connected via a network. The primary purpose of a DDBMS is to provide efficient and reliable
access to distributed data while ensuring data consistency, availability, and integrity.

Key features of a Distributed Database Management System include:

1. Data Distribution and Replication: A DDBMS facilitates the distribution and replication
of data across multiple nodes in a distributed environment. Data distribution involves
dividing the database into partitions or fragments and assigning them to different nodes
based on certain criteria. Data replication involves creating copies of data fragments and
distributing them across multiple nodes to improve availability and fault tolerance.

2. Transaction Management: A DDBMS supports distributed transactions, which involve
multiple operations that must be executed atomically, consistently, and durably across
distributed nodes. The DDBMS ensures transactional properties such as ACID (Atomicity,
Consistency, Isolation, Durability) compliance, distributed deadlock detection and
resolution, and distributed concurrency control.

3. Query Processing and Optimization: A DDBMS provides mechanisms for processing
queries that span multiple nodes and for optimizing query execution in a distributed
environment. This includes distributed query optimization techniques, distributed query
processing algorithms, and query routing and forwarding mechanisms to efficiently
retrieve data from distributed nodes.

4. Data Consistency and Concurrency Control: Ensuring data consistency and managing
concurrency control in a distributed environment is a critical aspect of a DDBMS.
Techniques such as distributed locking, timestamp-based concurrency control, and
distributed snapshot isolation are used to manage concurrent access to distributed data
and maintain data consistency.

5. Data Recovery and Fault Tolerance: A DDBMS implements mechanisms for data
recovery and fault tolerance to handle failures and ensure the availability and integrity
of data in the event of node failures, network partitions, or other disruptions. This
includes techniques such as distributed logging, distributed checkpointing, and
distributed recovery protocols.

6. Security and Access Control: A DDBMS provides features for ensuring the security and
privacy of distributed data, including authentication, authorization, encryption, and data
masking. Access control mechanisms are used to enforce security policies and regulate
access to sensitive data stored across distributed nodes.

7. Administrative and Monitoring Tools: A DDBMS includes administrative tools and
monitoring utilities for managing and monitoring distributed database operations,
performance, and resource utilization. This includes tools for configuring the
distributed database, monitoring distributed transactions, and diagnosing
performance issues.
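
The atomicity requirement under Transaction Management is commonly met with an atomic commit protocol. The text above does not name one; the sketch below assumes two-phase commit as an example and reduces it to its core idea. The `Participant` class and its methods are hypothetical, and timeouts, logging, and coordinator failure are deliberately left out.

```python
class Participant:
    """A hypothetical node taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants) -> bool:
    """Commit everywhere or nowhere; returns True if the transaction committed."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if all(votes):
        for p in participants:                   # phase 2: commit on all nodes
            p.commit()
        return True
    for p in participants:                       # phase 2: abort on all nodes
        p.abort()
    return False
```

A single "no" vote aborts the transaction on every node, which is what keeps the outcome atomic across sites.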

Overall, a Distributed Database Management System plays a crucial role in enabling efficient and
reliable access to distributed data in modern, interconnected environments, facilitating
applications and services that span multiple locations and require scalable and resilient data
management capabilities.

Types of distributed database systems


Distributed database systems can be classified into different types based on various criteria such
as data distribution model, architecture, and deployment approach. Here are some common
types of distributed database systems:

1. Homogeneous Distributed Database: In a homogeneous distributed database system,
all nodes use the same database management system (DBMS) software, and the data
model and schema are uniform across all nodes. This type of distributed database
system offers simplicity in management and interoperability but may lack flexibility in
accommodating diverse data models or specialized requirements.

2. Heterogeneous Distributed Database: In contrast to homogeneous distributed
databases, heterogeneous distributed database systems involve nodes that use different
types of DBMS software or support different data models. These systems may involve
data integration and translation mechanisms to enable interoperability and data
exchange between disparate systems. Heterogeneous distributed databases are often
used in environments where legacy systems or diverse data sources need to be
integrated.

3. Federated Database System: A federated database system integrates multiple
autonomous and independent databases into a single unified view or virtual database.
Each database retains its autonomy and management while participating in a federation
that provides a unified query interface. Federated database systems enable data sharing
and interoperability across distributed data sources without requiring data replication or
consolidation.

4. Replicated Database System: In a replicated database system, copies of data are
maintained across multiple nodes or locations for redundancy, fault tolerance, and
improved availability. Updates to data in one node are propagated to other replicas to
ensure consistency. Replicated database systems can be synchronous or asynchronous,
depending on the consistency guarantees and replication latency requirements.

5. Partitioned Database System: Partitioned database systems distribute data across
multiple nodes or partitions based on certain criteria, such as key ranges, hash values, or
geographical regions. Each node is responsible for storing and managing a subset of the
data, and data access is routed to the appropriate node based on the partitioning
scheme. Partitioned database systems enable scalable and efficient data storage and
access by distributing the workload across multiple nodes.

6. Hybrid Distributed Database System: Hybrid distributed database systems combine
elements of different distributed database types to meet specific requirements or
address diverse use cases. For example, a hybrid system may incorporate both
replication and partitioning strategies to achieve fault tolerance, scalability, and
performance optimization. Hybrid distributed database systems offer flexibility in design
and deployment to accommodate varied application needs.

7. Cloud Database Systems: Cloud database systems leverage cloud computing
infrastructure to provide distributed storage, processing, and management of data.
These systems often utilize virtualization, elastic scaling, and on-demand resource
allocation to support dynamic workloads and accommodate changing resource
requirements. Cloud database systems can be homogeneous or heterogeneous and may
involve various deployment models such as public cloud, private cloud, or hybrid cloud.

These are some of the common types of distributed database systems, each offering different
characteristics, advantages, and challenges. The choice of a distributed database system
depends on factors such as scalability requirements, data consistency and availability
constraints, performance objectives, and organizational needs.

Data distribution

Data distribution in distributed database systems refers to the process of dividing the database
into smaller fragments or partitions and distributing these fragments across multiple nodes or
locations within a distributed computing environment. This distribution enables efficient data
storage, access, and management across distributed nodes, providing benefits such as improved
performance, scalability, fault tolerance, and availability. Data distribution strategies play a
crucial role in designing and optimizing distributed database systems.

Data distribution raises three main issues: data fragmentation, data allocation, and data
replication. Let's explore each of these issues in more detail:

1. Data Fragmentation:
Data fragmentation refers to the process of dividing a database into smaller fragments or
partitions distributed across multiple nodes in a distributed environment. The decision about
which portions of the database are stored at which site is generally made during distributed
database design.

There are several ways to fragment the database:

1. Horizontal Fragmentation: Divides the database table rows into subsets based on a specified
condition or attribute value. For example, customer records can be horizontally fragmented
based on geographical regions.

The horizontal partitions for a distributed database have the following major advantages:

• Efficiency: Data are stored close to where they are used and separate from other data
used by other users or applications.
• Local optimization: Data can be stored to optimize performance for local access.
• Security: Data not relevant to usage at a particular site are not made available.
• Ease of querying: Combining data across horizontal partitions is easy because rows are
simply merged by unions across the partitions.

Horizontal partitions also have the following disadvantages:

• Inconsistent access speed: When data from several partitions are required, the access
time can be significantly different from local-only data access.
• Backup vulnerability: When data at one site become inaccessible or damaged, users
cannot switch to another site where a copy exists, because data are not replicated.

2. Vertical Fragmentation: Divides the database table columns into subsets, typically based on
column relevance or access patterns. For example, frequently accessed columns can be
placed in one fragment, while less accessed columns can be placed in another.

The advantages and disadvantages of vertical partitions are identical to those for horizontal
partitions, with the exception that combining data across vertical partitions is more difficult than
across horizontal partitions.

3. Hybrid fragmentation: A hybrid fragmentation can be obtained by intermixing horizontal
and vertical fragmentation. The original relation is reconstructed by applying UNION and
OUTER UNION (or OUTER JOIN) operations in the appropriate order.
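
The fragmentation schemes above can be sketched in Python on a small in-memory table. The `customers` relation, its column names, and the helper functions are hypothetical and serve only to show how horizontal fragments are recombined with a union, while vertical fragments need a join on the primary key.

```python
# Hypothetical CUSTOMER relation; the columns are illustrative only.
customers = [
    {"id": 1, "name": "Ana", "region": "EU", "balance": 120.0},
    {"id": 2, "name": "Bao", "region": "ASIA", "balance": 75.5},
    {"id": 3, "name": "Cleo", "region": "EU", "balance": 0.0},
]

def horizontal_fragment(rows, predicate):
    """Select whole rows that satisfy a condition."""
    return [r for r in rows if predicate(r)]

def vertical_fragment(rows, columns, key="id"):
    """Project a subset of columns, always keeping the primary key."""
    return [{c: r[c] for c in (key, *columns)} for r in rows]

def union(*fragments):
    """Reconstruct a relation from its horizontal fragments."""
    return [r for frag in fragments for r in frag]

def join_on_key(left, right, key="id"):
    """Reconstruct a relation from vertical fragments via the primary key."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left]

eu = horizontal_fragment(customers, lambda r: r["region"] == "EU")
asia = horizontal_fragment(customers, lambda r: r["region"] == "ASIA")
profile = vertical_fragment(customers, ["name", "region"])
account = vertical_fragment(customers, ["balance"])
```

Repeating the primary key in every vertical fragment is what makes the join possible, and it is also why combining vertical partitions is more work than merging horizontal ones.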

2. Data Allocation:
• Data allocation is the process of assigning each fragment, or each copy of a fragment,
to a particular site in a distributed system.
• The choice of sites and the degree of replication depend on many factors, such as
performance, availability, and the type and frequency of transactions submitted at each
site.
• A fully replicated database is better if the requirement is high availability.
• A partially replicated database is better if data is accessed at multiple sites and many
updates are performed.
• Finding an optimal solution to distributed data allocation is therefore a complex
problem.
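
The trade-off behind these allocation rules can be illustrated with a simple greedy heuristic in Python. This is an assumed rule of thumb for illustration, not an optimal allocation method: a site gets a copy of a fragment when its local reads outweigh the cost of applying every update issued at other sites to that extra copy. The function name, the `update_cost` parameter, and the statistics are hypothetical.

```python
def allocate(reads: dict, updates: dict, update_cost: float = 1.0) -> set:
    """Greedy heuristic (an assumption, not an optimal method).

    reads/updates map a site name to how often that site reads or
    updates the fragment. A site keeps a copy if its local reads
    outweigh the cost of applying updates that originate elsewhere.
    """
    total_updates = sum(updates.values())
    chosen = {
        site
        for site, r in reads.items()
        if r > update_cost * (total_updates - updates.get(site, 0))
    }
    if not chosen:
        # The fragment must live somewhere: pick the busiest site.
        busiest = max(reads, key=lambda s: reads[s] + updates.get(s, 0))
        chosen = {busiest}
    return chosen
```

For a read-mostly fragment the heuristic replicates to every site that reads it often, and for an update-heavy fragment it falls back to a single copy, which matches the guidance above to limit replication when many updates are performed.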

3. Data Replication:
Data replication involves creating and maintaining copies of data fragments across multiple
nodes. Replication improves the performance, availability, and reliability of the distributed
database system. Its main advantages are:

• Reliability: If one or more sites containing the database fail, a copy of the database can
still be found at another site without network traffic delays.
• Fast Response: Every site that has a full copy of the database can process queries locally,
so queries can be processed rapidly.
• Possible Avoidance of Complicated Distributed Transaction Integrity Routines: Replicated
databases are usually refreshed at scheduled intervals, so most forms of replication
are used when some relaxing of synchronization across database copies is acceptable.
• Node Decoupling: Each transaction may proceed without coordination across the network,
so if some sites are down, busy, or disconnected, a transaction is still handled when the
user desires.
• Reduced Network Traffic at Prime Time: Updates to data usually happen during prime
business hours, when network traffic is highest and the demands for rapid response are
greatest. With replication, the updating of copies can be delayed, which moves the
network traffic for sending updates to other nodes to non-prime-time hours.

Replication has the following disadvantages:

• Storage Requirements: Each site that has a full copy must have the same storage
capacity as if the data were stored centrally. Each copy requires storage space, and
processing time is required to update each copy at each site.
• Complexity and Cost of Updating: Whenever a database is updated, it must be updated
at each site that holds a copy. Synchronizing the updates in near real time requires
careful coordination.
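
The difference between updating every copy immediately and refreshing copies at scheduled intervals can be sketched as follows. The `Primary` and `Replica` classes are hypothetical and assume a single primary copy that forwards writes; real systems also have to deal with failures and conflicting writes, which this sketch ignores.

```python
class Replica:
    """A hypothetical site holding a copy of the data."""

    def __init__(self):
        self.data = {}

class Primary:
    """A hypothetical primary copy that forwards writes to its replicas."""

    def __init__(self, replicas, synchronous: bool):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # updates not yet shipped (asynchronous mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            for r in self.replicas:  # every copy is updated before returning
                r.data[key] = value
        else:
            self.pending.append((key, value))  # shipped at the next refresh

    def refresh(self):
        """Scheduled refresh: apply queued updates to every replica in order."""
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()
```

In synchronous mode every write pays the cost of touching all copies, while in asynchronous mode writes return quickly but replicas can serve stale data until the next refresh.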

Types of data replication:

• Full Replication
• No Replication
• Partial Replication

FULL REPLICATION
In full replication, a replica of the whole database is stored at every site in the distributed
system. This means every relation is available to every user locally.

Advantages:

• Availability increases drastically, since the system continues to operate as long as at least
one site is up.
• Performance increases, since the result of every query can be obtained locally from any
site.
• Queries can be processed rapidly.

Disadvantages:

• Update operations slow down drastically, since each update must be performed on
every copy of the database to keep the copies consistent.
• The concurrency control and recovery techniques become more expensive.
• Each site that has a full copy must have the same storage capacity that would be
required if the data were stored centrally.

NO REPLICATION
In no replication, each fragment of the database is stored at exactly one site. Thus all fragments
of the database are disjoint, except that the primary key is repeated across vertical fragments.

Advantages:

• Updating is very easy, since the data need to be updated in only one place.
• The concurrency control and recovery techniques are less expensive.
• The storage requirement is much lower than with full replication.

Disadvantages:

• The availability decreases drastically.
• The performance decreases, since not every query can be executed locally.

PARTIAL REPLICATION
In partial replication, some fragments of the database may be replicated whereas others may
not. The number of copies of each fragment can range from one up to the total number of sites
in the distributed system.

Advantages:

• The availability of data is considerable.
• The performance is good.
• The queries are quite fast.

Disadvantages:

• Updating is more complex than with no replication.
• The concurrency control and recovery techniques are more expensive than with no
replication.
• The storage requirements are considerable.
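
The availability claims for the three replication types can be compared with a back-of-the-envelope calculation. The sketch below assumes that every site is up independently with the same probability `p`, which is a simplification, and the function names are illustrative.

```python
def availability_full(p: float, n_sites: int) -> float:
    """Full replication: data is reachable if at least one site is up."""
    return 1 - (1 - p) ** n_sites

def availability_none(p: float, n_sites: int) -> float:
    """No replication: a query touching every fragment needs all sites up."""
    return p ** n_sites

def availability_partial(p: float, copies: int) -> float:
    """Partial replication: one fragment stored as `copies` replicas."""
    return 1 - (1 - p) ** copies
```

With three sites that are each up 90% of the time, full replication keeps the data reachable about 99.9% of the time, a fragment with two copies is reachable about 99% of the time, and a query that needs all three unreplicated fragments succeeds only about 72.9% of the time.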
