Ads Unit 3
A Parallel Database Management System (PDBMS) is a type of database system that uses parallel
processing to improve the performance of database operations. It divides tasks like query processing,
data storage, and transaction management across multiple processors, disks, or machines to execute
operations concurrently. This improves efficiency, reduces response times, and enables handling
large volumes of data.
Parallel databases are essential for applications requiring high performance, such as data
warehousing, big data analytics, and real-time processing. A PDBMS exploits three main forms of
parallelism:
1. Data Parallelism:
o Each processor works on its portion of the data independently and simultaneously.
o Examples: Partitioning tables into chunks and processing each chunk on different
nodes.
2. Task Parallelism:
o Different processors perform different operations at the same time.
o For example, one processor executes a join operation while another processor
performs sorting.
3. Pipeline Parallelism:
o Operations are organized into stages of a pipeline, and each stage runs concurrently.
o For instance, the output of one operation (e.g., filtering) is passed directly to another
operation (e.g., aggregation) in parallel.
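The filtering-feeds-aggregation pipeline described above can be sketched with Python generators, where each stage consumes rows as the previous stage produces them. The table contents and the salary threshold are illustrative, not from the text.

```python
# A minimal sketch of pipeline parallelism: each stage pulls rows from the
# previous stage as they are produced, so filtering and aggregation overlap
# instead of running strictly one after the other.

def scan(rows):
    for row in rows:                      # stage 1: produce rows one at a time
        yield row

def filter_stage(rows, predicate):
    for row in rows:                      # stage 2: filter rows as they arrive
        if predicate(row):
            yield row

def aggregate_stage(rows):
    total = 0
    for row in rows:                      # stage 3: aggregate incrementally
        total += row["salary"]
    return total

# Illustrative data; in a real PDBMS each stage could run on its own processor.
employees = [{"salary": s} for s in (40_000, 55_000, 70_000, 85_000)]
pipeline = filter_stage(scan(employees), lambda r: r["salary"] > 50_000)
high_earner_total = aggregate_stage(pipeline)
```

Because the stages are generators, no stage materializes the full intermediate result; rows stream through the pipeline one at a time.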
Parallel Query Processing involves executing database queries by leveraging multiple processors or
nodes to divide and conquer the workload. Key aspects include:
1. Partitioned Parallelism:
o Queries are divided into subqueries that are executed independently on different
data partitions.
o Example: Scanning rows from different partitions concurrently.
2. Inter-query Parallelism:
o Multiple independent queries are executed simultaneously on different
processors.
3. Intra-query Parallelism:
o A single query is broken into smaller sub-tasks, and these tasks are executed in
parallel.
4. Optimization:
o Efficient execution plans are crucial for parallel query processing to minimize
communication overhead and balance workload.
Parallel databases are built on one of several hardware architectures:
1. Shared-Memory Architecture:
o All processors share a common main memory and the same disks.
2. Shared-Disk Architecture:
o Processors have their own memory but share access to the same disk storage.
3. Shared-Nothing Architecture:
o Each node has its own processor, memory, and disks; nodes communicate over a
network.
4. Hybrid Architecture:
o Combines the above approaches, e.g., a shared-nothing cluster whose nodes are
shared-memory multiprocessors.
Relational operators (e.g., SELECT, JOIN, PROJECT, UNION) are fundamental to query processing in
relational databases. In PDBMS, these operators are parallelized to improve performance:
1. Parallel Selection:
o Filters rows based on a condition in parallel across different partitions of the data.
o Example: If data is partitioned across multiple nodes, each node evaluates the
selection condition on its portion.
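The per-partition evaluation described above can be sketched with a thread pool standing in for the nodes; the partition contents and the predicate are illustrative assumptions.

```python
# A minimal sketch of parallel selection, assuming the table is already
# partitioned into chunks; each worker evaluates the same predicate on its
# own partition, and the partial results are concatenated at the end.
from concurrent.futures import ThreadPoolExecutor

partitions = [
    [3, 17, 42, 8],      # partition held by "node" 1 (illustrative data)
    [99, 5, 61],         # partition held by "node" 2
    [12, 73, 28, 50],    # partition held by "node" 3
]

def local_select(partition, predicate):
    """Each node filters only its own rows."""
    return [row for row in partition if predicate(row)]

predicate = lambda r: r > 40          # illustrative selection condition
with ThreadPoolExecutor() as pool:
    parts = pool.map(local_select, partitions, [predicate] * len(partitions))
    selected = [row for part in parts for row in part]
```

Selection parallelizes trivially because each row can be tested without looking at any other partition.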
2. Parallel Join:
o Techniques:
▪ Hash Partitioning: Divide both relations based on the hash values of join
keys.
▪ Broadcast Join: A smaller relation is sent to all nodes, and each node
performs the join locally.
▪ Pipeline Join: Intermediate results are streamed directly to the next join
operation.
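The hash-partitioning technique above can be sketched as follows: both relations are routed to nodes by hashing the join key, so matching tuples always land on the same node and each node joins its pair of partitions independently. The relations and node count are illustrative.

```python
# A minimal sketch of a hash-partitioned join. Tuples with equal join keys
# hash to the same node, so each node's local join finds all matches.
N_NODES = 3

def partition(rows, key_index):
    """Route each tuple to a node by hashing its join key."""
    parts = [[] for _ in range(N_NODES)]
    for row in rows:
        parts[hash(row[key_index]) % N_NODES].append(row)
    return parts

employees = [(1, "Ann"), (2, "Bo"), (3, "Cy")]    # (emp_id, name), illustrative
salaries  = [(1, 40_000), (3, 70_000)]            # (emp_id, salary)

emp_parts = partition(employees, 0)
sal_parts = partition(salaries, 0)

joined = []
for node in range(N_NODES):                # conceptually, each node joins locally
    local = {e[0]: e[1] for e in emp_parts[node]}
    for emp_id, salary in sal_parts[node]:
        if emp_id in local:
            joined.append((emp_id, local[emp_id], salary))
```

A broadcast join would instead skip partitioning the smaller relation and copy it whole to every node.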
3. Parallel Aggregation:
o Example: Each node calculates partial aggregates for its partition, and the final
aggregation is done by combining results.
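The two-phase scheme above (local partials, then a combine step) can be sketched for an average; the partition contents are illustrative. Note that AVG cannot be combined from local averages directly, so each node returns a (sum, count) pair.

```python
# A minimal sketch of two-phase parallel aggregation: each node computes a
# partial (sum, count) for its partition, then the partials are merged.
partitions = [[40_000, 55_000], [70_000], [85_000, 60_000]]  # illustrative salaries

# Phase 1: local partial aggregates, one per node.
partials = [(sum(p), len(p)) for p in partitions]

# Phase 2: combine the partials into the global result.
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
average_salary = total / count
```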
4. Parallel Sorting:
o Data is divided and sorted in chunks across multiple processors, and then merged.
5. Parallel Projection:
o Each node projects the required columns from its local partition, with no
coordination needed between nodes.
Main Memory Database Management Systems (MMDBMS) store data entirely in RAM, reducing disk
I/O overhead and enabling faster processing. Parallelism in MMDBMS focuses on maximizing CPU
and memory utilization:
1. Thread-Level Parallelism:
o Multiple threads operate on different parts of the in-memory data
simultaneously.
2. Vectorized Execution:
o Operators process batches of values at a time, exploiting SIMD instructions and
CPU caches.
3. Conflict-Free Locking:
o Fine-grained or lock-free synchronization minimizes contention between
concurrent transactions.
4. NUMA-Aware Optimization:
o Data and threads are placed so that each processor mostly accesses its local
memory region.
Integrity constraints (e.g., primary keys, foreign keys, uniqueness) ensure data validity and
consistency. Parallel handling involves:
1. Partitioned Constraint Checking:
o Data is partitioned, and each processor checks constraints for its local partition.
o Example: For a uniqueness constraint, processors check locally and then merge
results to identify duplicates.
2. Parallel Foreign Key Validation:
o Foreign key checks are split by data partition and executed concurrently.
3. Parallel Index Maintenance:
o Parallel creation and validation of indexes enforce constraints like primary keys.
4. Batch Updates:
o Constraint checks are deferred and applied to a batch of updates at once,
reducing per-row overhead.
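The uniqueness check described above (local checks, then a merge) can be sketched as follows; the key partitions are illustrative. The merge step is needed because a duplicate can span two partitions and is invisible to either local check alone.

```python
# A minimal sketch of a parallel uniqueness check: each processor finds
# duplicates inside its own partition, then the per-partition key counts
# are merged to catch duplicates that span partitions.
from collections import Counter

partitions = [[101, 102, 103], [104, 102], [105, 105]]  # illustrative key partitions

local_dups = set()
global_counts = Counter()
for part in partitions:            # conceptually runs on separate processors
    counts = Counter(part)
    local_dups |= {k for k, c in counts.items() if c > 1}
    global_counts.update(counts)   # merge step: combine per-partition counts

cross_dups = {k for k, c in global_counts.items() if c > 1}
duplicates = local_dups | cross_dups   # 105 is a local dup, 102 a cross-partition dup
```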
Integrated I/O parallelism optimizes data retrieval and storage across multiple disks or nodes:
1. Striping:
o Data is divided into fixed-size chunks and distributed across multiple disks, so a
single large read or write is served by all disks at once.
2. Overlapping I/O with Computation:
o While one processor performs I/O, another handles computation tasks, reducing idle
time.
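The round-robin chunk placement used by striping can be sketched as a simple offset-to-disk mapping; the disk count and stripe size are illustrative.

```python
# A minimal sketch of round-robin striping: fixed-size chunks of a file are
# assigned to disks in rotation, so a sequential scan keeps all disks busy.
N_DISKS = 4
CHUNK_SIZE = 64 * 1024          # 64 KiB stripes (illustrative)

def disk_for_chunk(byte_offset):
    """Which disk holds the chunk containing this byte offset?"""
    chunk_index = byte_offset // CHUNK_SIZE
    return chunk_index % N_DISKS

# The first eight chunks cycle over the four disks: 0, 1, 2, 3, 0, 1, 2, 3.
placement = [disk_for_chunk(i * CHUNK_SIZE) for i in range(8)]
```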
3. Distributed Caching:
o Frequently accessed data is cached across multiple nodes to reduce I/O overhead.
4. Asynchronous I/O:
o I/O requests are issued without blocking, so processors continue computing
while data transfers complete in the background.
5. Load Balancing:
o I/O requests are balanced across all available storage resources to prevent
bottlenecks.
Parallel query processing divides a query into smaller tasks or subqueries that can be executed
simultaneously across multiple processors or nodes. The goal is to improve performance, reduce
query execution time, and ensure efficient utilization of resources.
1. Inter-Query Parallelism
• Definition: Multiple independent queries are executed simultaneously, each
potentially on a different processor.
• Use Case: Efficient in multi-user environments where users submit separate queries
simultaneously.
• Example:
o Query 1: SELECT AVG(salary) FROM employees;
o A second, independent query runs concurrently on another processor.
2. Intra-Query Parallelism
• Definition: A single query is divided into smaller tasks or subqueries that are executed
concurrently.
• Subcategories:
o Intra-Operation Parallelism: a single operation (e.g., a scan) is split across
processors.
o Inter-Operation Parallelism: different operations of the same query run
concurrently (e.g., in a pipeline).
3. Intra-Operation Parallelism
• Focuses on breaking a single database operation (e.g., scan, join, aggregation) into smaller
tasks.
• Examples:
o Parallel Aggregation:
▪ Each processor computes partial aggregates, which are then combined
into the final result.
o Parallel Sorting:
▪ Divide data into chunks, sort them in parallel, and merge results.
Parallel query optimization identifies the most efficient plan for executing a query in a parallel
environment. Key considerations include balancing workload, minimizing communication overhead,
and exploiting parallelism effectively.
1. Steps in Parallel Query Optimization:
o Query Decomposition:
▪ Break down the query into sub-operations that can be executed in parallel.
o Partitioning Strategy:
▪ Decide how to distribute data across nodes (e.g., hash, range, or round-robin
partitioning).
o Plan Generation:
▪ Create multiple parallel execution plans considering costs like I/O, CPU, and
network communication.
o Plan Selection:
▪ Choose the plan with the lowest estimated overall cost.
2. Load Balancing:
o Avoids situations where some processors are idle while others are overloaded.
o Techniques: dynamic task scheduling, work stealing, and repartitioning skewed
data.
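The three partitioning strategies named above (hash, range, round-robin) can be sketched side by side on the same set of keys; the keys and range boundaries are illustrative.

```python
# A minimal sketch of hash, range, and round-robin partitioning over
# three nodes, applied to the same keys for comparison.
N = 3
keys = [15, 42, 7, 88, 23, 61]

# Hash partitioning: node chosen by hashing the key.
hash_parts = [[k for k in keys if hash(k) % N == i] for i in range(N)]

# Range partitioning: node chosen by which key range the value falls in.
ranges = [(0, 30), (30, 60), (60, 10**9)]          # illustrative boundaries
range_parts = [[k for k in keys if lo <= k < hi] for lo, hi in ranges]

# Round-robin partitioning: rows dealt to nodes in rotation, ignoring values.
rr_parts = [[k for j, k in enumerate(keys) if j % N == i] for i in range(N)]
```

Hash partitioning clusters equal keys together (good for joins), range partitioning supports range predicates, and round-robin gives the evenest spread but no key locality.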
Join operations are computationally expensive and benefit greatly from parallelism. Techniques
include:
1. Partitioned Join:
o Both relations are partitioned on the join key (e.g., by hashing), and each
processor joins its matching pair of partitions locally.
2. Broadcast Join:
o A smaller table is replicated and sent to all processors, while the larger table is
partitioned.
o Each processor joins its local partition with the broadcasted table.
3. Pipelined Join:
o Intermediate results of one join are passed directly to the next join operation
without waiting for the first to complete.
4. Sort-Merge Join:
o Data is sorted in parallel across partitions, and the merge phase is distributed.
The quality of parallel query optimization can be evaluated using the following metrics and methods:
1. Execution Time:
o Measure the total query execution time for optimized and non-optimized plans.
2. Speedup:
o Ratio of execution time on one processor to execution time on N processors;
ideally close to N (linear speedup).
3. Scale-Up:
o Ability to keep execution time constant as data size and resources grow
proportionally.
o Example: Doubling the data and processors should result in similar execution times.
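The two metrics above can be made concrete with illustrative timings (the numbers below are assumptions, not measurements): speedup fixes the workload and grows the processors, while scale-up grows workload and processors together.

```python
# A minimal sketch of computing speedup and scale-up from measured times.
t_1_proc  = 120.0   # seconds for a query on 1 processor (illustrative)
t_8_procs = 18.0    # seconds for the same query on 8 processors

speedup = t_1_proc / t_8_procs          # ideal (linear) would be 8.0

t_1x_data_1x_procs = 60.0               # baseline workload on baseline system
t_2x_data_2x_procs = 66.0               # doubled data on doubled processors

scale_up = t_1x_data_1x_procs / t_2x_data_2x_procs   # ideal is 1.0
```

Here speedup is about 6.7 out of an ideal 8, and scale-up about 0.91 out of an ideal 1.0; the shortfalls reflect communication overhead and load imbalance.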
4. Resource Utilization:
o Assess the degree to which all processors or nodes are utilized during query
execution.
5. Communication Overhead:
o Evaluate the time spent on data transfer between nodes versus computation.
6. Load Balancing:
o Check whether work is distributed evenly, so that no processor sits idle while
others are overloaded.