Lec 10: Distributed Database Systems
Topics Covered
⚫ Distributed Database Architecture
⚫ Distributed Database Management Techniques
⚫ Data Replication and Partitioning
⚫ Components of Distributed Database
⚫ Distributed Query Processing and Fault Tolerance
Distributed Database
⚫ A distributed database is a collection of multiple,
logically interrelated databases distributed over a
computer network.
⚫ It appears to users as a single database, but the data is
actually stored across multiple physical locations,
which may be geographically dispersed.
⚫ The management system that coordinates and provides
access to this database is called a Distributed Database
Management System (DDBMS).
Distributed Database Architecture
1. Client-Server Architecture
⚫ Description: The system is divided into two
roles—clients (request services) and servers (provide
services).
⚫ Example: Clients send SQL queries to a central server
which processes and returns the result.
⚫ Use Case: Common in traditional DBMS setups with
centralized processing.
2. Peer-to-Peer (P2P) Architecture
⚫ Description: All nodes (sites) are equal; each can act as
both client and server.
⚫ Advantages: High availability, fault tolerance, and
scalability.
⚫ Example: Blockchain databases, some NoSQL
systems.
3. Multi-Tier Architecture
⚫ Description: Involves three or more
layers—presentation (UI), application (logic), and data
(storage).
⚫ Advantages: Modularity, better security, scalability.
⚫ Use Case: Web applications and enterprise distributed
systems.
4. Federated Architecture (Heterogeneous DDB)
⚫ Description: Independent databases are integrated
while retaining autonomy.
⚫ Types:
⚫ Loosely Coupled: Minimal coordination; suitable for
dynamic environments.
⚫ Tightly Coupled: Centralized control over global schema.
⚫ Example: A university integrates data from different
departments.
5. Cluster-Based Architecture
⚫ Description: Multiple servers (nodes) work together as a
single database system.
⚫ Advantages: High performance, failover support.
⚫ Examples:
1. Apache Cassandra: NoSQL database with a peer-to-peer
clustered architecture used by Netflix, Facebook, etc.
2. Oracle RAC (Real Application Clusters): Traditional
RDBMS with a clustered setup for high availability and
load balancing.
3. Google Spanner: globally distributed, clustered relational
database used internally by Google and also offered as a
cloud service. It supports SQL queries, strong
consistency, and horizontal scaling.
Architecture  | Centralized? | Data Control           | Scalability | Fault Tolerance | Best Use Case
Client-Server | Yes          | Central server         | Moderate    | Low             | Small distributed systems
Peer-to-Peer  | No           | Distributed            | High        | High            | Decentralized systems (e.g., P2P sharing)
Multi-tier    | Yes          | Tiered (DB in backend) | High        | Moderate        | Enterprise applications
Federated     | No           | Independent systems    | Moderate    | Moderate        | Integration of multiple DBs
Cluster-based | Semi         | Shared/Partitioned     | Very High   | Very High       | High-availability & cloud DBs
Key Components in Architecture
⚫ Distribution Transparency:
⚫ Users perceive the database as a single logical entity, unaware
of the actual physical distribution.
⚫ Includes:
⚫ Location transparency: Users don’t need to know where the data
resides.
⚫ Replication transparency: Users are unaware of data replication.
⚫ Fragmentation transparency: Users don’t see the data fragmentation
(horizontal/vertical).
⚫ Data Independence:
⚫ Logical and physical data independence is maintained, similar
to centralized databases.
⚫ Autonomy:
⚫ Each site can control its own data, providing local autonomy.
⚫ Concurrency Control:
⚫ Transactions running simultaneously at different sites
are coordinated so that they do not conflict.
⚫ Reliability and Availability:
⚫ A distributed system can tolerate the failure of a site:
if one site fails, the others can continue to operate.
⚫ Scalability:
⚫ Easier to expand the database system by adding more
sites.
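The transparency properties listed above can be illustrated with a small sketch. It assumes a hypothetical `GlobalTable` facade and `Site` class (neither comes from a real DDBMS): the caller issues one logical query and never names a site, a fragment, or a replica.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical site holding one copy of a horizontal fragment."""
    name: str
    rows: list = field(default_factory=list)
    up: bool = True


class GlobalTable:
    """Hypothetical facade that presents fragments and replicas as one table."""

    def __init__(self, fragments):
        # fragments: list of replica groups; every Site in a group holds
        # an identical copy of the same horizontal fragment.
        self.fragments = fragments

    def select(self, predicate):
        result = []
        for replicas in self.fragments:
            # Replication transparency: use any replica that is running.
            site = next((s for s in replicas if s.up), None)
            if site is None:
                raise RuntimeError("no replica available for a fragment")
            # Location and fragmentation transparency: the caller never
            # says which site or which fragment holds the rows.
            result.extend(row for row in site.rows if predicate(row))
        return result


# Illustrative data: fragment 1 is replicated at two sites, fragment 2 is not.
site_a = Site("site_a", [{"id": 1, "dept": "CS"}])
site_a_copy = Site("site_a_copy", [{"id": 1, "dept": "CS"}])
site_b = Site("site_b", [{"id": 2, "dept": "Math"}])

students = GlobalTable([[site_a, site_a_copy], [site_b]])
site_a.up = False  # simulate a site failure
all_rows = students.select(lambda row: True)
print(all_rows)
```

Even with `site_a` down, the query still returns both rows, because the facade falls back to the surviving replica without the caller noticing.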
Distributed Database Management Techniques
1. Data Replication
Purpose: Improve data availability and fault tolerance.
Types:
⚫ Master-Slave: One node (the master) accepts all writes; the others (slaves) replicate its changes.
⚫ Multi-Master: All nodes can write; need conflict resolution.
Benefits:
⚫ Improved read performance
⚫ Fault tolerance
⚫ Load balancing
Challenges:
⚫ Data consistency
⚫ Synchronization overhead
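Master-slave replication and its consistency challenge can be sketched as follows. The `Master` and `Slave` classes and the explicit `sync()` step are assumptions made for illustration; real systems ship changes continuously over the network.

```python
class Slave:
    """Read-only copy that receives changes from the master."""

    def __init__(self):
        self.data = {}


class Master:
    """The single node that accepts writes."""

    def __init__(self, slaves):
        self.data = {}
        self.slaves = slaves
        self.pending = []  # changes written locally but not yet replicated

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def sync(self):
        """Ship every pending change to every slave (the synchronization overhead)."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave.data[key] = value
        self.pending.clear()


slave = Slave()
master = Master([slave])

master.write("balance", 100)
stale_read = slave.data.get("balance")  # the slave has not seen the write yet
master.sync()
fresh_read = slave.data.get("balance")  # the slave is now consistent with the master
print(stale_read, fresh_read)
```

The read taken before `sync()` shows why data consistency is listed as a challenge: a slave can serve an outdated value until replication catches up.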
2. Data Partitioning (Sharding)
Definition: Splitting a large database into smaller, faster, more manageable parts
(shards).
Types:
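⚫ Horizontal Partitioning: Divide the rows of a table among shards; every shard has the same columns.
⚫ Vertical Partitioning: Divide the columns of a table into separate parts; each part keeps the key so rows can be rejoined.
⚫ Sharding strategies (how rows are assigned to shards): range-based, hash-based, list-based.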
Benefits:
⚫ Scalability
⚫ Faster query performance
⚫ Resource optimization
Challenges:
⚫ Cross-shard queries
⚫ Complex joins
⚫ Data rebalancing
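The three sharding strategies named above differ only in how a row's key is mapped to a shard. A minimal sketch, with hypothetical function names and illustrative boundaries:

```python
import bisect
import hashlib


def hash_shard(key: str, num_shards: int) -> int:
    """Hash-based: a stable hash of the key, modulo the number of shards."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(value: int, upper_bounds: list) -> int:
    """Range-based: upper_bounds are sorted, exclusive upper limits.

    Values at or above the last bound go to one extra, final shard.
    """
    return bisect.bisect_right(upper_bounds, value)


def list_shard(value: str, mapping: dict, default: int) -> int:
    """List-based: an explicit value-to-shard table, with a default shard."""
    return mapping.get(value, default)


bounds = [1000, 2000]            # shard 0: < 1000, shard 1: 1000-1999, shard 2: the rest
regions = {"EG": 0, "US": 1}     # illustrative region-to-shard mapping

print(range_shard(999, bounds), range_shard(1000, bounds), range_shard(2500, bounds))
print(list_shard("EG", regions, default=2), list_shard("FR", regions, default=2))
print(hash_shard("user-42", 4))
```

Note that changing `num_shards` in `hash_shard` remaps most keys, which is one source of the data rebalancing challenge listed above.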
3. Allocation
Definition: Allocation is the strategy used to distribute data
fragments or entire databases across the sites (nodes) of a
distributed database system (DDBS). The goal is to optimize
performance, availability, reliability, and cost.
Types:
⚫ Centralized Allocation: All data is stored at a single central site.
⚫ Partitioned (Fragmented) Allocation: Database is divided into
fragments (horizontal, vertical, or mixed) and each fragment is stored
at a different site.
⚫ Replicated Allocation: Copies of the same data are stored at multiple
sites.
Benefits:
⚫ Optimized performance
⚫ Availability
⚫ Reliability
Challenges:
⚫ Data Redundancy and Consistency
⚫ Optimal Data Placement: where to allocate data
fragments to minimize access time, communication
cost, and storage cost.
⚫ Load Balancing
⚫ Network Latency and Failures
⚫ Dynamic Access Patterns
⚫ Scalability
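The optimal data placement problem can be made concrete with a toy cost model. This is a sketch under simplifying assumptions (one fragment, no replication, only communication cost counted); the site names, access frequencies, and link costs are invented for illustration.

```python
def placement_cost(candidate, access_freq, link_cost):
    """Total communication cost if the fragment is stored at `candidate`.

    access_freq[s]   : accesses per unit time issued from site s
    link_cost[a][b]  : cost of one access from site a to data at site b
                       (0 when a == b, since local access needs no network)
    """
    return sum(freq * link_cost[origin][candidate]
               for origin, freq in access_freq.items())


def best_site(access_freq, link_cost):
    """Return the (site, cost) pair with the lowest total communication cost."""
    sites = sorted(access_freq)  # sorted for deterministic tie-breaking
    winner = min(sites, key=lambda s: placement_cost(s, access_freq, link_cost))
    return winner, placement_cost(winner, access_freq, link_cost)


access_freq = {"A": 100, "B": 20, "C": 5}
link_cost = {
    "A": {"A": 0, "B": 2, "C": 5},
    "B": {"A": 2, "B": 0, "C": 1},
    "C": {"A": 5, "B": 1, "C": 0},
}

site, cost = best_site(access_freq, link_cost)
print(site, cost)
```

Under these numbers the fragment belongs at site A, where most accesses originate. If the access frequencies shift over time (the dynamic access patterns challenge), the best site can change.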
Components of a Distributed DBMS
⚫ Transaction Manager: Ensures consistency and
ACID properties across sites.
⚫ Query Processor: Decomposes queries and routes
subqueries to appropriate sites.
⚫ Communication Manager: Manages communication
between sites.
⚫ Concurrency Control Manager: Ensures correct
concurrent transaction execution.
⚫ Recovery Manager: Handles failures and restores the
system.
Distributed Query Processing and Fault Tolerance
⚫ Distributed Query Processing (DQP) is the process of
decomposing a high-level user query into subqueries,
executing them at the appropriate remote sites, and
then assembling the results to present a unified answer
to the user.
⚫ DQP covers the methods and techniques used to
process a user's query in a distributed database
system.
⚫ The goal is to execute queries efficiently by
minimizing communication cost, response time, and
resource usage while ensuring correctness and
completeness of results.
Phases of Distributed Query Processing
1. Query Decomposition:
⚫ The high-level SQL query is parsed and transformed
into a relational algebra or logical representation.
⚫ It is analyzed for syntactic and semantic correctness.
2. Data Localization:
⚫ Identify where the required data (relations/fragments)
is stored.
⚫ Convert logical relations into physical fragments based
on fragmentation and allocation information.
3. Query Optimization:
⚫ Generate multiple query execution plans (QEPs).
⚫ Select the most cost-effective plan based on:
⚫ Communication cost
⚫ Local processing cost
⚫ Data transfer time
⚫ Join strategies
4. Local Optimization and Execution:
⚫ Each subquery is sent to its corresponding site.
⚫ Local DBMSs optimize and execute their subqueries.
5. Result Assembly:
⚫ Subquery results are transferred to a coordinating site.
⚫ Final result is constructed (e.g., through joins, unions,
aggregations).
⚫ Output is returned to the user.
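The plan selection step of the Query Optimization phase can be sketched as a comparison of estimated costs. The `Plan` class, the two candidate plans, and all the numbers below are hypothetical; a real optimizer derives such estimates from catalog statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical query execution plan (QEP) with estimated costs."""
    name: str
    kb_shipped: int   # data transferred between sites, in KB
    local_cost: int   # estimated local processing cost, in the same cost units


def plan_cost(plan: Plan, cost_per_kb: int) -> int:
    """Total cost = communication cost + local processing cost."""
    return plan.kb_shipped * cost_per_kb + plan.local_cost


def choose_plan(plans, cost_per_kb: int) -> Plan:
    """Pick the QEP with the lowest estimated total cost."""
    return min(plans, key=lambda p: plan_cost(p, cost_per_kb))


# Two ways to join a large relation at Site A with a small one at Site B.
plans = [
    Plan("ship large relation to Site B", kb_shipped=10_000, local_cost=50),
    Plan("ship small relation to Site A", kb_shipped=20, local_cost=80),
]

best = choose_plan(plans, cost_per_kb=1)
print(best.name, plan_cost(best, cost_per_kb=1))
```

The second plan wins although its local processing is more expensive, because communication cost usually dominates when sites are connected by a network.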
Example
Assume relation Employee is horizontally fragmented across
Site A and Site B:
SELECT name FROM Employee WHERE salary > 50000;
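One way this query could be processed is sketched here, using two in-memory SQLite databases to stand in for Site A and Site B. The employee rows and the helper names `make_site` and `run_distributed` are made up for illustration. Because the fragmentation is horizontal, the same subquery runs at each site and the coordinator takes the union of the partial results.

```python
import sqlite3

QUERY = "SELECT name FROM Employee WHERE salary > 50000"


def make_site(rows):
    """Create an in-memory database holding one horizontal fragment of Employee."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Employee (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO Employee VALUES (?, ?)", rows)
    return conn


def run_distributed(query, sites):
    """Run the subquery at every site, then assemble the results (union)."""
    result = []
    for conn in sites:
        # Local execution: selection and projection happen at the site,
        # so only qualifying names travel to the coordinator.
        result.extend(name for (name,) in conn.execute(query))
    return result


# Illustrative fragments; the data is invented.
site_a = make_site([("Amal", 62000), ("Bilal", 41000)])
site_b = make_site([("Chen", 75000), ("Dina", 50000)])

names = sorted(run_distributed(QUERY, [site_a, site_b]))
print(names)
```

Pushing the `WHERE` clause down to each site keeps the communication cost low: rows that fail the salary test never leave their site.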