Chapter 4 - Distributed Database System
Instructor: Melaku M.
Target Group: G3 SE
Outline
❖ Database systems that run on each site are independent of each other.
❖Data integrity − The need to update data at multiple sites poses problems of data integrity.
❖Sites are aware of each other and agree to cooperate in processing user requests.
❖Sites may not be aware of each other and may provide only limited
facilities for cooperation in transaction processing.
Key Features of Distributed Database
A. Autonomy
• Determines the extent to which individual nodes can operate
independently.
• Design autonomy: Independence of data model usage and
transaction management techniques among nodes
• Communication autonomy: Determines the extent to which each
node can decide on sharing information with other nodes
• Execution autonomy
Cont’d
B. Distributed Data Storage
❖Full replication of a relation is the case where the relation is stored at all sites. Fully
redundant databases are those in which every site contains a copy of the entire database
❖Advantages and requirements of replication:
✓Reduced data transfer: relation r is available locally at each site containing a replica
of r.
✓Fault Tolerance: If one node fails, data is still available on other nodes.
✓Improved Performance: Queries can access the nearest replica, reducing latency.
✓Consistency Management: Ensure all replicas are updated during write operations.
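The points above can be illustrated with a minimal sketch of a fully replicated relation that reads from one copy and writes to all copies. The class name `ReplicatedRelation`, the site names, and the read-one/write-all policy are assumptions made for illustration; real systems use more elaborate protocols.

```python
class ReplicatedRelation:
    """Toy fully replicated key-value relation (hypothetical class)."""

    def __init__(self, sites):
        # One independent copy of the relation per site.
        self.replicas = {site: {} for site in sites}
        self.down = set()

    def write(self, key, value):
        # Write-all: refuse the write unless every replica can be updated,
        # so that the copies never diverge.
        if self.down:
            raise RuntimeError(f"cannot write, sites down: {sorted(self.down)}")
        for copy in self.replicas.values():
            copy[key] = value

    def read(self, key, preferred_site):
        # Read-one: try the local replica first, then any other live site.
        candidates = [preferred_site] + [s for s in self.replicas if s != preferred_site]
        for site in candidates:
            if site not in self.down:
                return site, self.replicas[site][key]
        raise RuntimeError("no replica available")


r = ReplicatedRelation(["S1", "S2", "S3"])
r.write("A-101", 500)
local_site, local_value = r.read("A-101", preferred_site="S2")
r.down.add("S2")  # simulate a node failure
fallback_site, fallback_value = r.read("A-101", preferred_site="S2")
```

Reads stay available after S2 fails, but under this policy writes are blocked until every site is reachable again, which is the cost of keeping all replicas consistent.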
❖Types of Fragmentation:
• Horizontal Fragmentation
• Vertical Fragmentation
D. Data Fragmentation (cont’d)
Horizontal Fragmentation
❖Each fragment contains a subset of rows, and the union of all fragments
reconstructs the original table.
❖Example:
• Employee_US = SELECT * FROM Employee WHERE Country = 'US';
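A small sketch using Python's built-in `sqlite3` module can check the reconstruction property for horizontal fragments. The second fragment `Employee_Other`, the trimmed column list, and the sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (ID INTEGER PRIMARY KEY, Name TEXT, Country TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Abebe", "US"), (2, "Sara", "ET"), (3, "John", "US")],
)

# One fragment per selection predicate; together the predicates cover every row.
conn.execute("CREATE TABLE Employee_US AS SELECT * FROM Employee WHERE Country = 'US'")
conn.execute("CREATE TABLE Employee_Other AS SELECT * FROM Employee WHERE Country <> 'US'")

original = conn.execute("SELECT * FROM Employee ORDER BY ID").fetchall()
# The union of the fragments should give back the original table.
rebuilt = conn.execute(
    "SELECT * FROM Employee_US UNION SELECT * FROM Employee_Other ORDER BY ID"
).fetchall()
us_count = conn.execute("SELECT COUNT(*) FROM Employee_US").fetchone()[0]
```

In a real deployment each fragment would live on a different site; here both sit in one in-memory database only to make the check easy to run.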
Vertical Fragmentation
❖Each fragment contains a subset of the columns plus the key, and the join of all
fragments on that key reconstructs the original table.
❖Example:
• Employee_Personal = SELECT ID, Name, Address FROM Employee;
• Employee_Job = SELECT ID, Salary, Department FROM Employee;
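The same kind of check works for vertical fragments: joining them on the shared key `ID` should give back the original rows. This is a sketch with `sqlite3`; the column types and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employee (ID INTEGER PRIMARY KEY, Name TEXT, Address TEXT, "
    "Salary REAL, Department TEXT)"
)
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    [(1, "Abebe", "Addis Ababa", 60000.0, "SE"), (2, "Sara", "Adama", 45000.0, "IT")],
)

# Both fragments keep ID so that they can be joined back together.
conn.execute("CREATE TABLE Employee_Personal AS SELECT ID, Name, Address FROM Employee")
conn.execute("CREATE TABLE Employee_Job AS SELECT ID, Salary, Department FROM Employee")

original = conn.execute("SELECT * FROM Employee ORDER BY ID").fetchall()
rebuilt = conn.execute(
    "SELECT p.ID, p.Name, p.Address, j.Salary, j.Department "
    "FROM Employee_Personal AS p JOIN Employee_Job AS j ON p.ID = j.ID "
    "ORDER BY p.ID"
).fetchall()
```

Repeating the key in every fragment is what makes the join lossless; dropping `ID` from either fragment would make reconstruction impossible.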
D. Data Fragmentation (cont’d)
Mixed (Hybrid) Fragmentation
❖Example:
First, apply horizontal fragmentation (split by country), then apply
vertical fragmentation (split by columns) to each fragment.
Horizontal Fragmentation of account Relation
Vertical Fragmentation of employee_info Relation
Benefits of Fragmentation:
• Improved Performance: Queries can access only the relevant fragments
instead of the entire dataset.
❖The primary goal is to ensure that the database system is efficient, scalable,
reliable, and capable of providing high performance while addressing
challenges like data distribution, replication, and consistency.
3. Data Locality
❖Placing data closer to the users or applications that frequently access it to minimize
network latency.
Key Considerations
4. Transaction Management: Ensuring data consistency and preventing conflicts
when multiple transactions access and modify data concurrently.
❖Consistency: Maintaining the validity of data across all nodes after a transaction is completed.
❖Isolation: Preventing interference between concurrent transactions, ensuring that each transaction
sees a consistent view of the data.
❖Durability: Guaranteeing that once a transaction is committed, it will not be lost due to failures.
✓It is not acceptable for a transaction to be committed at one site and aborted at another.
• Commit/Rollback Phase: If all participants vote yes, the coordinator instructs them to
commit; otherwise, it instructs them to roll back.
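The commit/rollback phase described above reads like the second phase of a two-phase commit protocol. A minimal in-memory sketch follows, assuming a preceding voting phase; the `Participant` class and its method names are hypothetical, and the sketch ignores timeouts, logging, and coordinator failure.

```python
class Participant:
    """One site taking part in the transaction (hypothetical class)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Voting phase: promise to commit (yes) or refuse (no).
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes; otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "aborted"


ok = two_phase_commit([Participant("S1"), Participant("S2")])
sites = [Participant("S1"), Participant("S2", can_commit=False)]
failed = two_phase_commit(sites)
```

Because the decision is made once by the coordinator and applied everywhere, no site ends up committed while another is aborted.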
❖Optimistic Concurrency Control: Assumes that conflicts are rare and only checks
for conflicts at the end of a transaction. If conflicts occur, the transaction is aborted
and retried.
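One common way to realize optimistic concurrency control is version checking at commit time. The sketch below assumes each key carries a version number; `VersionedStore`, `add_with_retry`, and the `interfere` hook that simulates a concurrent writer are hypothetical names for illustration.

```python
class VersionedStore:
    """Each key maps to (value, version); hypothetical helper."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key, (None, 0))

    def commit(self, read_versions, writes):
        # Validation: abort if anything this transaction read has since changed.
        for key, seen_version in read_versions.items():
            if self.read(key)[1] != seen_version:
                return False
        # Write phase: install the new values and bump their versions.
        for key, value in writes.items():
            self.data[key] = (value, self.read(key)[1] + 1)
        return True


def add_with_retry(store, key, amount, max_attempts=3, interfere=None):
    for attempt in range(1, max_attempts + 1):
        value, version = store.read(key)
        if interfere and attempt == 1:
            interfere()  # simulate a concurrent transaction committing first
        if store.commit({key: version}, {key: (value or 0) + amount}):
            return attempt
    raise RuntimeError("too many conflicts")


store = VersionedStore()
store.commit({}, {"balance": 100})
attempts = add_with_retry(
    store, "balance", 50,
    interfere=lambda: store.commit({}, {"balance": 200}),
)
```

The first attempt fails validation because the balance changed underneath it; the retry reads the fresh value and succeeds. No locks are held while the transaction does its work.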
Cont’d
4. Distributed Deadlock Detection:
• Algorithms to detect and resolve deadlocks, where two or more
transactions are waiting for each other to release resources.
• Common approaches include centralized and distributed deadlock
detection algorithms.
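A centralized detector can be sketched as follows, assuming each site reports a local wait-for graph (transaction to the set of transactions it is waiting on) and a coordinator merges them and searches for a cycle. The function names and the sample graphs are hypothetical.

```python
def find_deadlock(wait_for):
    """Return a list of transactions forming a cycle, or None."""
    visiting, done = set(), set()

    def visit(txn, path):
        visiting.add(txn)
        path.append(txn)
        for blocker in wait_for.get(txn, ()):
            if blocker in visiting:
                # Back edge: everything from blocker onward is a cycle.
                return path[path.index(blocker):]
            if blocker not in done:
                cycle = visit(blocker, path)
                if cycle:
                    return cycle
        visiting.discard(txn)
        done.add(txn)
        path.pop()
        return None

    for txn in list(wait_for):
        if txn not in done:
            cycle = visit(txn, [])
            if cycle:
                return cycle
    return None


def merge_graphs(*local_graphs):
    # The coordinator builds the global graph as the union of local edges.
    merged = {}
    for graph in local_graphs:
        for txn, blockers in graph.items():
            merged.setdefault(txn, set()).update(blockers)
    return merged


site1 = {"T1": {"T2"}}  # at site 1, T1 waits for T2
site2 = {"T2": {"T1"}}  # at site 2, T2 waits for T1
global_graph = merge_graphs(site1, site2)
cycle = find_deadlock(global_graph)
```

Neither local graph contains a cycle on its own; the deadlock only becomes visible in the merged graph, which is why purely local detection is not enough. Resolution would then abort one transaction in the cycle.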
Failures in a Distributed System
❖Types of Failure:
– Transaction failure
– Node failure
– Media failure
– Network failure
Distributed Query Processing
❖Distributed Query Processing refers to the process of executing a query in a distributed
database system where data is stored/spread across multiple sites or nodes, connected via a
network.
❖The goal of distributed query processing is to retrieve results efficiently by minimizing costs
such as communication overhead, data transfer, and response time, and to maximize performance by
strategically distributing the processing workload among the available nodes.
❖For centralized systems, the primary criterion for measuring the cost of a particular strategy is
the number of disk accesses.
❖In a distributed system, other costs must also be taken into account, chiefly the cost of
transmitting data over the network.
Key Components of Distributed Query Processing
A. Query Decomposition: The input query (usually in SQL) is decomposed into smaller
subqueries or operations, which can be executed independently or in parallel on different
nodes.
B. Data Localization: Determines where the data required by the query resides. Subqueries
are directed to the appropriate nodes storing the relevant data.
C. Query Optimization: Focuses on reducing the cost of query execution. Choosing the most
efficient execution plan involves considering various factors, such as data locality, amount
of data transfer, communication costs, and available processing power at each node.
D. Query Execution: Executes subqueries on the distributed nodes. Collects and aggregates
the results from all the nodes to form the final result.
Query Optimization Techniques/Common Approaches
1. Join Optimization: Minimize the cost of joining tables across nodes by:
• Reducing data transfer.
❖Ship copies of all three relations to site SI and choose a strategy for processing the entire query
locally at site SI.
❖Ship a copy of the account relation to site S2 and compute temp1 = account ⋈ depositor at S2. Ship
temp1 from S2 to S3, and compute temp2 = temp1 ⋈ branch at S3. Ship the result temp2 to SI.
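The two strategies above can be compared with a rough transfer-cost estimate. Every number below is a hypothetical tuple count chosen for illustration, and the sketch assumes that site SI holds none of the three relations and that cost is simply the number of tuples shipped.

```python
# Hypothetical relation sizes in tuples; none of these numbers come from the text.
sizes = {"account": 10_000, "depositor": 12_000, "branch": 50}
temp1_size = 12_000  # assumed size of account ⋈ depositor
temp2_size = 12_000  # assumed size of temp1 ⋈ branch


def cost_ship_all(sizes):
    # Strategy 1: every relation travels to the query site SI.
    return sum(sizes.values())


def cost_pipeline(sizes, temp1_size, temp2_size):
    # Strategy 2: account -> S2, temp1 -> S3, temp2 -> SI.
    return sizes["account"] + temp1_size + temp2_size


ship_all = cost_ship_all(sizes)
pipeline = cost_pipeline(sizes, temp1_size, temp2_size)
```

With these particular sizes, shipping everything to SI moves fewer tuples; with highly selective joins (small temp1 and temp2) the second strategy would win. Neither strategy is always best, so the optimizer has to estimate intermediate result sizes.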
• Query:
SELECT A.name, B.salary
FROM A, B
WHERE A.id = B.id AND B.salary > 50000;
Example of Distributed Query Processing
1. Decomposition:
Break the query into subqueries:
• Retrieve rows from B where salary > 50000 (processed locally on Node 2).
• Perform a join between A and the filtered rows of B.
2. Optimization:
• Push the salary > 50000 condition to Node 2 to reduce transferred data.
• Use a semi-join to minimize data movement during the join operation.
3. Execution:
• Node 2 sends only the filtered rows of B to Node 1.
• Node 1 performs the join and returns the final result.
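The steps above can be simulated in a few lines, assuming A(id, name) is stored on Node 1 and B(id, salary) on Node 2. The sample rows and function names are made up for illustration, and the two "nodes" are just Python lists in one process.

```python
# Node 1 stores A(id, name); Node 2 stores B(id, salary). Sample rows are made up.
node1_A = [(1, "Abebe"), (2, "Sara"), (3, "John")]
node2_B = [(1, 60000), (2, 40000), (4, 90000)]


def node2_filter(b_rows, a_ids):
    # Runs on Node 2: apply the pushed-down predicate, then the semi-join
    # against the ids that Node 1 shipped over.
    return [(i, s) for (i, s) in b_rows if s > 50000 and i in a_ids]


def node1_join(a_rows, b_rows):
    # Runs on Node 1: hash join on id (assumes id is unique in B).
    salary_by_id = dict(b_rows)
    return [(name, salary_by_id[i]) for (i, name) in a_rows if i in salary_by_id]


a_ids = {i for (i, _) in node1_A}       # Node 1 -> Node 2: ids only
shipped = node2_filter(node2_B, a_ids)  # Node 2 -> Node 1: reduced B
result = node1_join(node1_A, shipped)
```

Without the pushed-down predicate and the semi-join, all three rows of B would cross the network; here only one row does. The price is the extra message carrying the ids of A, so a semi-join pays off when that id list is small relative to the rows it eliminates.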