Introduction to Distributed Query Processing
Distributed query processing retrieves data from multiple database
nodes efficiently, so that data stored in different locations can be
accessed seamlessly. For example, sales data can be gathered from
regional databases to create comprehensive reports.
Architectures for Distributed Databases
Homogeneous Databases
Same DBMS across sites with uniform schema and query processing.
Heterogeneous Databases
Different DBMS types; integration challenges due to schema and query
differences.
Federated Database Systems
Loosely coupled, each site retains control, sharing data on demand.
Multidatabase Systems
Tighter integration, unified query engine operating on multiple databases.
Steps in Distributed Query Processing
Query Decomposition
Break complex query into smaller subqueries.
Data Localization
Identify which data fragments are relevant.
Global Optimization
Choose plans reducing data transfer and cost.
Distributed Execution
Run subqueries on corresponding database nodes.
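The four steps can be illustrated with a small scatter-gather sketch. It assumes three hypothetical regional sites, each modeled as an in-memory SQLite database holding one fragment of a `sales` table; the site names, the schema, and the `total_sales` helper are illustrative assumptions, not part of any particular distributed DBMS.

```python
import sqlite3

def make_site(rows):
    """Create an in-memory database standing in for one regional node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

# Hypothetical sites, each holding the rows of one region.
sites = {
    "north": make_site([("north", 100.0), ("north", 250.0)]),
    "south": make_site([("south", 75.0)]),
    "west": make_site([("west", 300.0), ("west", 25.0)]),
}

def total_sales(sites, regions):
    # Data localization: contact only the sites whose fragment can match.
    relevant = [name for name in sites if name in regions]
    # Query decomposition: the global SUM becomes one local SUM per site.
    subquery = "SELECT COALESCE(SUM(amount), 0) FROM sales"
    # Distributed execution: each site returns a single partial result,
    # which keeps the data transfer small (the optimization goal).
    partials = [sites[name].execute(subquery).fetchone()[0] for name in relevant]
    # The coordinator assembles the final answer.
    return sum(partials)

report_total = total_sales(sites, {"north", "west"})
print(report_total)
```

Pushing the aggregation down to each site means only one number per site crosses the network instead of every sales row.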
Query Decomposition and Localization
Transform SQL to Relational Algebra
Convert queries into algebraic expressions for processing.
Supports systematic query breakdown and optimization.
Fragmentation and Allocation
Horizontal: Split rows between sites
Vertical: Split columns across sites
Mixed: Combination of both
Allocate fragments strategically across nodes
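A minimal sketch of horizontal and vertical fragmentation, assuming a hypothetical `customers` relation stored as a list of dictionaries. The helper names are illustrative. The point is that horizontal fragments are recombined with a union, while vertical fragments are recombined with a join on the primary key, which is why each vertical fragment keeps the key.

```python
# Hypothetical relation: each row is a dict keyed by column name.
customers = [
    {"id": 1, "name": "Ana", "city": "Lima", "balance": 120.0},
    {"id": 2, "name": "Ben", "city": "Quito", "balance": 40.0},
    {"id": 3, "name": "Caro", "city": "Lima", "balance": 0.0},
    {"id": 4, "name": "Dan", "city": "Quito", "balance": 310.5},
]

def horizontal_fragments(rows, key):
    """Split rows into fragments according to the value of `key`."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[key], []).append(row)
    return fragments

def vertical_fragment(rows, columns, pk="id"):
    """Project rows onto `columns`, always keeping the primary key."""
    keep = [pk] + [c for c in columns if c != pk]
    return [{c: row[c] for c in keep} for row in rows]

def rebuild_from_vertical(left, right, pk="id"):
    """Join two vertical fragments back together on the primary key."""
    by_key = {row[pk]: row for row in right}
    return [{**row, **by_key[row[pk]]} for row in left]

# Horizontal: one fragment per city, e.g. one per regional site.
by_city = horizontal_fragments(customers, "city")

# Vertical: profile columns at one site, account columns at another.
profile = vertical_fragment(customers, ["name", "city"])
account = vertical_fragment(customers, ["balance"])

# Reconstruction: union for horizontal fragments, join for vertical ones.
union = sorted(
    (row for frag in by_city.values() for row in frag),
    key=lambda row: row["id"],
)
rebuilt = rebuild_from_vertical(profile, account)
```

A mixed scheme would apply `vertical_fragment` to each horizontal fragment (or the reverse).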
Distributed Query Optimization
Cost-Based Models
Use metrics like CPU, I/O, and transfer costs.
Minimize Data Transfer
Choose query plans reducing communication overhead.
Join Ordering
Optimize order of operations for efficiency.
Semi-Join Strategies
Reduce data sent by filtering before join.
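The semi-join idea can be sketched by counting what crosses the network under each plan. The example assumes two hypothetical sites: site A holds `customers` and wants the join result, site B holds `orders`. Rows are plain dictionaries and the "network" is just a counter, so this illustrates the strategy and is not a real executor.

```python
# Site A (hypothetical): already-filtered customers.
customers = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Ben"},
]

# Site B (hypothetical): all orders.
orders = [
    {"order_id": 10, "customer_id": 1, "total": 20.0},
    {"order_id": 11, "customer_id": 3, "total": 15.0},
    {"order_id": 12, "customer_id": 4, "total": 99.0},
    {"order_id": 13, "customer_id": 2, "total": 5.0},
    {"order_id": 14, "customer_id": 5, "total": 42.0},
    {"order_id": 15, "customer_id": 1, "total": 7.5},
]

def join_at_a(customers, order_rows):
    """Local hash join performed at site A."""
    by_id = {c["id"]: c for c in customers}
    return [
        {**by_id[o["customer_id"]], **o}
        for o in order_rows
        if o["customer_id"] in by_id
    ]

def naive_plan(customers, orders):
    """Ship every order row from B to A, then join."""
    shipped = list(orders)
    return join_at_a(customers, shipped), len(shipped)

def semi_join_plan(customers, orders):
    """Ship join keys A -> B, filter at B, ship only matching rows B -> A."""
    keys = {c["id"] for c in customers}                         # A -> B
    reduced = [o for o in orders if o["customer_id"] in keys]   # filter at B
    return join_at_a(customers, reduced), len(keys), len(reduced)

naive_result, naive_rows = naive_plan(customers, orders)
semi_result, keys_sent, rows_sent = semi_join_plan(customers, orders)
print(naive_rows, keys_sent, rows_sent)
```

Both plans return the same result. The semi-join pays off only when the shipped keys plus the reduced rows are smaller than the full relation, which is the kind of comparison a cost-based optimizer makes.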
Join Strategies in Distributed Databases
Semi-Join
Filters data to minimize transmission during joins.
Bloom Join
Uses probabilistic filtering with bloom filters for efficiency.
Fragmentation Join
Leverages data locality to join fragments at their sites.
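A Bloom join can be sketched as a semi-join in which the exact key set is replaced by a fixed-size bit array. The `BloomFilter` class below is a deliberately small, illustrative implementation built on `hashlib`; the filter size, the number of hash functions, and the sample relations are assumptions chosen for readability, not tuned values.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: false positives possible, false negatives not."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Hypothetical relations at two sites.
customers = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 2},
    {"order_id": 13, "customer_id": 7},
]

def bloom_join(customers, orders):
    # Site A summarizes its join keys and ships only the bit array.
    bloom = BloomFilter()
    for c in customers:
        bloom.add(c["id"])
    # Site B ships rows that might match (may include false positives).
    candidates = [o for o in orders if o["customer_id"] in bloom]
    # Site A runs the exact join, which discards any false positives.
    by_id = {c["id"]: c for c in customers}
    joined = [
        {**by_id[o["customer_id"]], **o}
        for o in candidates
        if o["customer_id"] in by_id
    ]
    return joined, candidates

joined, candidates = bloom_join(customers, orders)
```

Compared with a plain semi-join, the message from A to B has a fixed size regardless of how many keys A holds, at the price of occasionally shipping rows that do not join.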
Data Transfer Cost Estimation
Factor             Description
Network Bandwidth  Limits speed of data transfer between nodes
Latency            Delay before data transfer begins
CPU & I/O Costs    Processing overhead at each database site
Example: Transferring 10 GB (80 Gb) over a 1 Gbps network takes about
80 seconds, ignoring latency and protocol overhead.
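The arithmetic behind this example can be written as a small estimator. This is a simplified model, assuming decimal units (1 GB = 10^9 bytes, 1 Gbps = 10^9 bits per second), a single fixed latency, and no protocol overhead; the function and parameter names are illustrative.

```python
def transfer_seconds(size_bytes, bandwidth_bps, latency_s=0.0):
    """Estimate transfer time as latency plus size divided by bandwidth."""
    return latency_s + (size_bytes * 8) / bandwidth_bps

def plan_cost_seconds(size_bytes, bandwidth_bps, latency_s=0.0, local_s=0.0):
    """Add per-site CPU and I/O time (`local_s`) to the transfer estimate."""
    return local_s + transfer_seconds(size_bytes, bandwidth_bps, latency_s)

GB = 10**9    # bytes, decimal gigabyte (assumption)
GBPS = 10**9  # bits per second

t = transfer_seconds(10 * GB, 1 * GBPS)
print(t)  # 80.0
```

An optimizer could call such an estimate once per candidate plan and pick the smallest total.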
Concurrency Control and Recovery
Distributed Transactions
Ensure consistency and ACID properties across all sites.
Two-Phase Commit (2PC)
Coordinate commit operations to maintain atomicity.
Failure Handling
Manage site failures and network partitions effectively.
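The voting and decision phases of 2PC can be sketched as follows. This is a minimal in-process model, assuming hypothetical `Participant` objects that always respond; it omits the logging, timeouts, and recovery that a real implementation needs to handle site failures and network partitions.

```python
class Participant:
    """One site taking part in a distributed transaction (illustrative)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator asks every site to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every vote was yes.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

ok_sites = [Participant("A"), Participant("B")]
mixed_sites = [Participant("A"), Participant("B", can_commit=False)]
ok_outcome = two_phase_commit(ok_sites)
mixed_outcome = two_phase_commit(mixed_sites)
print(ok_outcome, mixed_outcome)
```

A single "no" vote aborts the transaction at every site, which is how the protocol preserves atomicity.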
Challenges in Distributed Query Processing
Data Heterogeneity
Conflicts in schema, data models, and query languages.
Network Limitations
Latency and bandwidth constraints affect performance.
Security Issues
Access control and data privacy across multiple sites.
Future Trends and Conclusion
Cloud-Based Databases
Elastic scalable systems with global reach.
Big Data & NoSQL
Handle massive, varied datasets beyond traditional DBMS.
Adaptive Query Processing
Dynamic optimization reacting to environment changes.
Distributed query processing enables scalable, efficient access to
decentralized data.