Lec 10 Distributed Databases System

The document provides an overview of distributed database systems, covering architecture types such as client-server, peer-to-peer, and multi-tier, as well as management techniques like data replication and partitioning. It details key components of a Distributed Database Management System (DDBMS) and discusses distributed query processing and fault tolerance mechanisms. Challenges associated with distributed databases, including coordination complexity and consistency versus availability trade-offs, are also highlighted.

Uploaded by

mhariskhan513
Copyright
© All Rights Reserved

Distributed Database Systems

Topics Covered
⚫ Distributed Database Architecture
⚫ Distributed Database Management Techniques
⚫ Data Replication and Partitioning
⚫ Components of Distributed Database
⚫ Distributed Query Processing and Fault Tolerance
Distributed Database
⚫ A distributed database is a collection of multiple,
logically interrelated databases distributed over a
computer network.
⚫ It appears to users as a single database, but the data is
actually stored across multiple physical locations,
which may be geographically dispersed.
⚫ The management system that coordinates and provides
access to this database is called a Distributed Database
Management System (DDBMS).
Distributed Database Architecture
1. Client-Server Architecture
⚫ Description: The system is divided into two
roles—clients (request services) and servers (provide
services).
⚫ Example: Clients send SQL queries to a central server,
which processes them and returns the results.
⚫ Use Case: Common in traditional DBMS setups with
centralized processing.
2. Peer-to-Peer (P2P) Architecture
⚫ Description: All nodes (sites) are equal; each can act as
both client and server.
⚫ Advantages: High availability, fault tolerance, and
scalability.
⚫ Example: Blockchain databases, some NoSQL
systems.
3. Multi-Tier Architecture
⚫ Description: Involves three or more
layers—presentation (UI), application (logic), and data
(storage).
⚫ Advantages: Modularity, better security, scalability.
⚫ Use Case: Web applications and enterprise distributed
systems.
4. Federated Architecture (Heterogeneous DDB)
⚫ Description: Independent databases are integrated
while retaining autonomy.
⚫ Types:
⚫ Loosely Coupled: Minimal coordination; suitable for
dynamic environments.
⚫ Tightly Coupled: Centralized control over global schema.
⚫ Example: A university integrates data from different
departments.
5. Cluster-Based Architecture
⚫ Description: Multiple servers (nodes) work together as a
single database system.
⚫ Advantages: High performance, failover support.
⚫ Examples:
1. Apache Cassandra: NoSQL database with a peer-to-peer
clustered architecture, used by Netflix, Facebook, etc.
2. Oracle RAC (Real Application Clusters): Traditional
RDBMS with a clustered setup for high availability and
load balancing.
3. Google Spanner: Globally distributed, clustered relational
database used internally by Google and also offered as a
cloud service. It supports SQL queries, strong
consistency, and horizontal scaling.
Architecture    Centralized?   Data Control             Scalability   Fault Tolerance   Best Use Case
Client-Server   Yes            Central Server           Moderate      Low               Small distributed systems
Peer-to-Peer    No             Distributed              High          High              Decentralized systems (e.g., P2P sharing)
Multi-tier      Yes            Tiered (DB in backend)   High          Moderate          Enterprise applications
Federated       No             Independent systems      Moderate      Moderate          Integration of multiple DBs
Cluster-based   Semi           Shared/Partitioned       Very High     Very High         High-availability & cloud DBs
Key Components in Architecture
⚫ Distribution Transparency:
⚫ Users perceive the database as a single logical entity, unaware
of the actual physical distribution.
⚫ Includes:
⚫ Location transparency: Users don’t need to know where the data
resides.
⚫ Replication transparency: Users are unaware of data replication.
⚫ Fragmentation transparency: Users don’t see the data fragmentation
(horizontal/vertical).
⚫ Data Independence:
⚫ Logical and physical data independence is maintained, similar
to centralized databases.
⚫ Autonomy:
⚫ Each site can control its own data, providing local autonomy.
⚫ Concurrency Control:
⚫ Coordinates transactions that run simultaneously at
different sites so that they do not conflict.
⚫ Reliability and Availability:
⚫ Distributed systems are more fault-tolerant; if one site
fails, others can continue to operate.
⚫ Scalability:
⚫ Easier to expand the database system by adding more
sites.
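Fragmentation transparency can be illustrated with a small sketch. It assumes, purely for illustration, that a single in-memory SQLite database stands in for the global schema and that the hypothetical tables employee_site_a and employee_site_b stand in for fragments stored at two sites. A view then presents them to the user as one logical relation.

```python
import sqlite3

# One in-memory database stands in for the whole distributed system;
# the two tables play the role of fragments held at different sites.
# Table names and sample rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee_site_a (id INTEGER, name TEXT, salary INTEGER);
    CREATE TABLE employee_site_b (id INTEGER, name TEXT, salary INTEGER);
    INSERT INTO employee_site_a VALUES (1, 'Ayesha', 60000), (2, 'Bilal', 40000);
    INSERT INTO employee_site_b VALUES (3, 'Sana', 75000);
    -- The view hides the horizontal fragmentation from the user.
    CREATE VIEW employee AS
        SELECT * FROM employee_site_a
        UNION ALL
        SELECT * FROM employee_site_b;
""")

# The user queries the logical relation and never names a fragment.
rows = conn.execute("SELECT name FROM employee ORDER BY id").fetchall()
names = [name for (name,) in rows]
print(names)
```

In a real DDBMS the fragments would live on separate machines and the system catalog, not a local view, would map the logical relation to them.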
Distributed Database Management
Techniques
1. Data Replication
Purpose: Improve data availability and fault tolerance.
Types:
⚫ Master-Slave: One node accepts writes; the others replicate its changes.
⚫ Multi-Master: All nodes can accept writes; conflicting updates must be resolved.
Benefits:
⚫ Improved read performance
⚫ Fault tolerance
⚫ Load balancing
Challenges:
⚫ Data consistency
⚫ Synchronization overhead
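A minimal sketch of master-slave replication, under simplifying assumptions: all nodes live in one process, replication is synchronous, and reads are spread round-robin over the replicas. The class and method names (Node, PrimaryReplicaStore, write, read) are hypothetical and not taken from any particular product.

```python
class Node:
    """A single site holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class PrimaryReplicaStore:
    """One node accepts writes; the others receive copies of each write."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._next = 0  # round-robin cursor used for load balancing

    def write(self, key, value):
        # All writes go through the primary node.
        self.primary.data[key] = value
        # Synchronous replication: copy the change to every replica
        # before returning. This is the synchronization overhead.
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, key):
        # Reads are served by replicas in turn, which spreads the load.
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica.name, replica.data.get(key)


store = PrimaryReplicaStore(Node("primary"), [Node("replica-1"), Node("replica-2")])
store.write("balance:42", 100)
first = store.read("balance:42")
second = store.read("balance:42")
print(first, second)
```

With asynchronous replication the write would return before the replicas are updated, which lowers write latency but allows a read to see stale data.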
2. Data Partitioning (Sharding)
Definition: Splitting a large database into smaller, faster, more manageable parts
(shards).
Types:
⚫ Horizontal Partitioning: Divide rows.
⚫ Vertical Partitioning: Divide columns.
⚫ Sharding strategies (how rows are assigned to shards): range-based, hash-based, list-based.
Benefits:
⚫ Scalability
⚫ Faster query performance
⚫ Resource optimization
Challenges:
⚫ Cross-shard queries
⚫ Complex joins
⚫ Data rebalancing
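The three sharding strategies can be sketched as routing functions that map a key to a shard number. The function names, the choice of SHA-256, and the sample boundaries are assumptions made for illustration.

```python
import bisect
import hashlib


def hash_shard(key, num_shards):
    """Hash-based: a stable hash of the key picks the shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(value, upper_bounds):
    """Range-based: upper_bounds is a sorted list of exclusive upper limits.

    Values at or above the last bound fall into one extra, final shard.
    """
    return bisect.bisect_right(upper_bounds, value)


def list_shard(value, assignments):
    """List-based: an explicit table maps each value to a shard."""
    return assignments[value]


bounds = [1000, 2000]  # shard 0: < 1000, shard 1: 1000-1999, shard 2: >= 2000
regions = {"PK": 0, "AE": 0, "US": 1}

print(hash_shard("customer-17", 4))
print(range_shard(1500, bounds))
print(list_shard("US", regions))
```

Hash-based routing spreads keys evenly but makes range scans touch every shard, while range-based routing keeps neighbouring keys together at the risk of hot shards.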
3. Allocation
Definition: Allocation is the strategy used to distribute data
fragments or entire databases across the sites (nodes) of a
distributed database system (DDBS). The goal is to optimize
performance, availability, reliability, and cost.
Types:
⚫ Centralized Allocation: All data is stored at a single central site.
⚫ Partitioned (Fragmented) Allocation: Database is divided into
fragments (horizontal, vertical, or mixed) and each fragment is stored
at a different site.
⚫ Replicated Allocation: Copies of the same data are stored at multiple
sites.
Benefits:
⚫ Optimize Performance
⚫ Availability
⚫ Reliability
Challenges:
⚫ Data Redundancy and Consistency
⚫ Optimal Data Placement: where to allocate data
fragments to minimize access time, communication
cost, and storage cost.
⚫ Load Balancing
⚫ Network Latency and Failures
⚫ Dynamic Access Patterns
⚫ Scalability
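The optimal data placement challenge can be made concrete with a toy cost model. The sketch below assumes a single, non-replicated fragment, a known access frequency per site, and a fixed cost per remote access. All numbers and the function name best_site are made up for illustration.

```python
def best_site(access_freq, cost):
    """Pick the site that minimizes total access cost for one fragment.

    access_freq[s] : accesses per unit time issued from site s
    cost[s][t]     : cost of one access from site s to data stored at site t
    """
    def total(target):
        return sum(freq * cost[src][target] for src, freq in access_freq.items())

    site = min(cost, key=total)  # iterate over candidate sites
    return site, total(site)


# Hypothetical workload: most accesses come from site A.
access_freq = {"A": 50, "B": 10, "C": 5}
cost = {
    "A": {"A": 0, "B": 4, "C": 6},
    "B": {"A": 4, "B": 0, "C": 3},
    "C": {"A": 6, "B": 3, "C": 0},
}

site, total_cost = best_site(access_freq, cost)
print(site, total_cost)
```

Real allocation problems also weigh storage cost, update cost for replicas, and access patterns that change over time, which is why they are usually solved heuristically.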
Components of a Distributed DBMS
⚫ Transaction Manager: Ensures consistency and
ACID properties across sites.
⚫ Query Processor: Decomposes queries and routes
subqueries to appropriate sites.
⚫ Communication Manager: Manages communication
between sites.
⚫ Concurrency Control Manager: Ensures correct
concurrent transaction execution.
⚫ Recovery Manager: Handles failures and restores the
system.
Distributed Query Processing and Fault
Tolerance
⚫ Distributed Query Processing is the process of
decomposing a high-level user query into subqueries,
executing them at the appropriate remote sites, and
then assembling the results to present a unified answer
to the user.
⚫ DQP covers the methods and techniques used to
process a user's database query in a distributed
database system.
⚫ The goal is to execute queries efficiently by
minimizing communication cost, response time, and
resource usage while ensuring correctness and
completeness of results.
Phases of Distributed Query Processing
1. Query Decomposition:
⚫ The high-level SQL query is parsed and transformed
into a relational algebra or logical representation.
⚫ It is analyzed for syntactic and semantic correctness.
2. Data Localization:
⚫ Identify where the required data (relations/fragments)
is stored.
⚫ Convert logical relations into physical fragments based
on fragmentation and allocation information.
3. Query Optimization:
⚫ Generate multiple query execution plans (QEPs).
⚫ Select the most cost-effective plan based on:
⚫ Communication cost
⚫ Local processing cost
⚫ Data transfer time
⚫ Join strategies
4. Local Optimization and Execution:
⚫ Each subquery is sent to its corresponding site.
⚫ Local DBMSs optimize and execute their subqueries.
5. Result Assembly:
⚫ Subquery results are transferred to a coordinating site.
⚫ Final result is constructed (e.g., through joins, unions,
aggregations).
⚫ Output is returned to the user.
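The query optimization phase can be sketched as picking the cheapest of several candidate plans. The cost formula, the weights, and the three plans below are hypothetical; a real optimizer derives such estimates from catalog statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    messages: int        # number of messages exchanged between sites
    bytes_shipped: int   # total bytes moved over the network
    local_cost: float    # estimated local processing cost


def plan_cost(plan, per_message=1.0, per_byte=0.001):
    """Total cost = communication cost + local processing cost."""
    communication = plan.messages * per_message + plan.bytes_shipped * per_byte
    return communication + plan.local_cost


def choose_plan(plans, **weights):
    """Return the plan with the lowest estimated cost."""
    return min(plans, key=lambda p: plan_cost(p, **weights))


# Candidate plans for joining Employee (large) with Department (small),
# stored at different sites. All figures are invented for the example.
plans = [
    Plan("ship_employee", messages=1, bytes_shipped=1_000_000, local_cost=5),
    Plan("ship_department", messages=1, bytes_shipped=10_000, local_cost=8),
    Plan("semijoin", messages=2, bytes_shipped=4_000, local_cost=12),
]

best = choose_plan(plans)
print(best.name)
```

Changing the weights changes the winner: on a fast local network a plan with more data transfer but less local work may become the cheapest.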
Example
Assume relation Employee is horizontally fragmented across
Site A and Site B:
SELECT name FROM Employee WHERE salary > 50000;
⚫ Query Decomposition: Break the query into one subquery per fragment:
SELECT name FROM Employee_A WHERE salary > 50000;
SELECT name FROM Employee_B WHERE salary > 50000;
⚫ Execution: Each subquery runs locally, at Site A and Site B respectively.
⚫ Result Assembly: Results from both sites are merged and
returned to the user.
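The example above can be simulated in a few lines. This sketch assumes each site is a separate in-memory SQLite database holding its own fragment of Employee; the helper names make_site and run_distributed and the sample rows are hypothetical.

```python
import sqlite3


def make_site(rows):
    """Create one 'site': an independent database holding one fragment."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employee VALUES (?, ?)", rows)
    return conn


site_a = make_site([("Ayesha", 60000), ("Bilal", 40000)])
site_b = make_site([("Sana", 75000), ("Omar", 50000)])

# The same subquery is sent to every site that holds a fragment.
SUBQUERY = "SELECT name FROM employee WHERE salary > ?"


def run_distributed(sites, threshold):
    """Run the subquery at each site, then merge (union) the partial results."""
    result = []
    for conn in sites:
        result.extend(name for (name,) in conn.execute(SUBQUERY, (threshold,)))
    return result


names = run_distributed([site_a, site_b], 50000)
print(sorted(names))
```

Because the fragments are horizontal and disjoint, a plain union of the partial results is enough; a join or aggregation would need an extra step at the coordinating site.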
Fault Tolerance
⚫ Fault Tolerance in distributed databases refers to the
system's ability to continue functioning correctly even
when one or more components fail.
⚫ The main goal is to ensure data integrity, availability,
and system reliability, despite failures in hardware,
software, or the network.
⚫ It ensures that transactions are processed correctly, and
the system can recover automatically or with minimal
manual intervention.
Mechanisms to Achieve Fault Tolerance
1. Replication:
⚫ Data is stored at multiple sites.
⚫ If one site fails, another replica can serve the data.
⚫ Must maintain data consistency through synchronization.
2. Commit Protocols (for transaction atomicity):
⚫ Ensure that either all parts of a distributed transaction commit, or
none do.
⚫ Two-Phase Commit (2PC):
⚫ Phase 1: Coordinator asks all sites to prepare.
⚫ Phase 2: Based on responses, coordinator tells them to commit or
abort.
⚫ Three-Phase Commit (3PC):
⚫ Adds a "pre-commit" phase to reduce uncertainty in the event of
failure.
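The two phases of 2PC can be sketched as follows. This is a simplified, single-process illustration: it assumes reliable messaging and no coordinator crash, and the Participant class and its can_commit flag are hypothetical stand-ins for real sites.

```python
class Participant:
    """A site taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a local success or failure
        self.state = "INIT"

    def prepare(self):
        # Phase 1: vote yes (READY) or no (ABORTED).
        self.state = "READY" if self.can_commit else "ABORTED"
        return self.can_commit

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"


def two_phase_commit(participants):
    # Phase 1: the coordinator asks every site to prepare and collects votes.
    votes = [p.prepare() for p in participants]

    # Phase 2: commit only if every site voted yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.abort()
    return "ABORT"


ok_sites = [Participant("A"), Participant("B")]
bad_sites = [Participant("A"), Participant("B", can_commit=False)]
print(two_phase_commit(ok_sites), two_phase_commit(bad_sites))
```

The sketch leaves out the blocking problem: if the coordinator fails after participants have voted, they must wait, which is the uncertainty that 3PC tries to reduce.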
3. Logging and Recovery:
⚫ Write-ahead logs (WALs) record changes before they are applied to the database.
⚫ After failure, logs are used to redo or undo transactions to ensure
consistency.
⚫ Checkpointing periodically saves system state to reduce
recovery time.
4. Failover and Redundancy:
⚫ Automatic switching to a standby system or site when a failure
occurs.
⚫ May involve active-passive (hot standby) or active-active (load
sharing) configurations.
5. Timeouts and Retry Mechanisms:
⚫ Detect failures by expecting timely responses.
⚫ Retry failed communications or redirect requests.
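The logging-and-recovery mechanism (item 3 above) can be sketched as a redo/undo pass over a write-ahead log. The record layout ("UPDATE", txn, key, old, new) and ("COMMIT", txn) is an assumption made for this example, and checkpoints are left out.

```python
def recover(log, db):
    """Bring db to a consistent state after a crash, using the log.

    Committed transactions are redone; uncommitted ones are undone.
    """
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}

    # Redo pass (forward): reapply every change of a committed transaction.
    for rec in log:
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, _txn, key, _old, new = rec
            db[key] = new

    # Undo pass (backward): restore old values written by uncommitted ones.
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _txn, key, old, _new = rec
            db[key] = old

    return db


# T1 committed, but its change never reached the database before the crash.
# T2 did not commit, but its change had already been written.
log = [
    ("UPDATE", "T1", "x", 100, 50),
    ("COMMIT", "T1"),
    ("UPDATE", "T2", "y", 10, 60),
]
db_after_crash = {"x": 100, "y": 60}

recovered = recover(log, dict(db_after_crash))
print(recovered)
```

A checkpoint would let recovery start from a later point in the log instead of scanning it from the beginning.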
Example
A transaction to transfer money between accounts in two different
sites is in progress.
⚫ Site A successfully debits the amount.
⚫ Before Site B can credit the amount, Site B crashes.

Without fault tolerance:
⚫ Data inconsistency arises: the amount has been debited at Site A but never credited at Site B, so the money is lost.
With fault tolerance:
⚫ The system detects the failure.
⚫ Logs at Site A allow rollback (undo debit), or
⚫ System retries when Site B recovers, or
⚫ Uses a backup replica of Site B to complete the credit.
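The rollback option in this example can be sketched as below. It assumes both sites are objects in one process and that a crashed site raises an exception when contacted; Site, SiteDown, and transfer are hypothetical names.

```python
class SiteDown(Exception):
    """Raised when a request reaches a site that has crashed."""


class Site:
    def __init__(self, name, balance, up=True):
        self.name = name
        self.balance = balance
        self.up = up

    def apply(self, delta):
        if not self.up:
            raise SiteDown(self.name)
        self.balance += delta


def transfer(src, dst, amount):
    """Debit src and credit dst; undo the debit if the credit fails."""
    undo_log = [(src, amount)]  # logged first: how to reverse the debit
    src.apply(-amount)          # Site A debits the amount
    try:
        dst.apply(amount)       # Site B credits the amount
    except SiteDown:
        # Failure detected: roll back using the undo log.
        for site, amt in reversed(undo_log):
            site.apply(amt)
        return False
    return True


site_a = Site("A", 100)
site_b = Site("B", 50, up=False)   # Site B has crashed

first_try = transfer(site_a, site_b, 30)
balances_after_failure = (site_a.balance, site_b.balance)

site_b.up = True                   # Site B recovers; retry the transfer
second_try = transfer(site_a, site_b, 30)
print(first_try, balances_after_failure, second_try)
```

The retry after recovery corresponds to the second option in the list; using a backup replica would instead redirect the credit to another copy of Site B's data.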
Challenges
⚫ Complex coordination among sites
⚫ Trade-off between consistency and availability during
network partitions (CAP theorem)
⚫ Maintaining performance under failure scenarios
⚫ Cost of redundant hardware and data replication
