DDBMS

QUESTION ONE

(a) Discuss the impact of network latency on distributed query performance and how optimization techniques can mitigate it.
Network latency refers to the delay that occurs when data is transmitted from one location to
another within a network. When latency is high, it results in slower data transfer and delayed
responses. In contrast, low latency means that data can move more quickly, leading to faster
communication between systems.

Impact of Network Latency on Distributed Query Performance

• Longer Response Times: When a query is executed across multiple systems, data often
needs to travel from one node to another. If there's a delay in the network, the overall
query takes more time to finish because of the slow data transfer between those nodes.
• Slower Join Processes: In distributed systems, joining tables that are stored on different
machines is common. These operations depend on moving data between nodes. When
latency is high, this movement becomes slower, which affects the speed of the join.
• Problems with Real-Time and Live Applications: Some applications, like dashboards
or online transaction systems, depend on quick data access. If the network is slow, users
may notice delays when trying to view or interact with real-time data.
• Poor Query Optimization: Sometimes, the system that plans how a query should run
does not consider the delays in the network. This can result in inefficient plans that
move too much data between nodes, making the query slower.
• More Errors and Timeouts: A slow network may cause delays in communication that
lead to packet loss or timeouts. This can force the system to resend data, adding even
more delay and reducing overall performance.

Ways to Reduce the Effects of Network Latency

• Keeping Data Close to Where It’s Needed: To cut down on delays, it's helpful to
organize data so that related pieces are stored together on the same server. This reduces
the need for sending data between different machines. Also, by making queries aware
of how data is divided (sharded), systems can avoid contacting all nodes unnecessarily.
• Smarter Query Execution: Some parts of a query like filtering or summarizing data can
be done early, right where the data is stored. This cuts down the amount of information
that needs to be sent over the network. In addition, query planners can be designed to
consider the cost of network delays when deciding how to execute a query efficiently.
• Using Caching to Save Time: Frequently requested results can be stored temporarily,
so the system doesn’t have to run the same query again or pull the same data from
remote servers. Tools like Redis and Memcached can help by acting as shared memory
storage across the network.
• Reducing the Amount of Network Traffic: Combining many small data requests into
one larger one helps to reduce the time lost in sending multiple messages. Also,
compressing the data before it’s sent helps lower the amount of information traveling
over the network, which speeds things up.
• Investing in Faster Network Gear: Using modern, high-speed network hardware like
fiber optics or faster Ethernet can reduce the actual physical delay of moving data
between systems.
• Running Tasks in Parallel and Without Waiting: If a system can handle multiple tasks
at once and doesn’t wait for one task to finish before starting the next, it can greatly
reduce how long users wait for results. This approach makes better use of time,
especially when dealing with multiple remote nodes.
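As a small illustration of the "smarter query execution" point above, the following Python sketch compares the bytes that cross the network when raw rows are shipped versus when the aggregation is pushed down to the data node. The row count and byte sizes are invented for illustration, not taken from the question:

```python
# Sketch: pushing aggregation to the data node shrinks what crosses the network.
# Row count, row size, and result size below are illustrative assumptions.
rows = [("order", i, 10.0) for i in range(1_000)]  # rows held on a remote node
ROW_BYTES = 100      # assumed serialized size of one raw row
RESULT_BYTES = 8     # assumed size of one aggregated number

# Without pushdown: ship every raw row, then sum at the querying site.
bytes_without_pushdown = len(rows) * ROW_BYTES
# With pushdown: sum where the data lives, ship only one number.
bytes_with_pushdown = RESULT_BYTES

print(bytes_without_pushdown, bytes_with_pushdown)  # 100000 8
```

Even in this toy setting the pushed-down plan moves four orders of magnitude less data, which is exactly why latency-aware planners prefer it.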

a)

i. Propose and describe three alternative execution plans for this query, considering:
shipping all data to Site A, performing local aggregation and sending partial results,
and shipping all data to a different site (e.g., Site B or C).

Three Alternative Execution Plans

Plan 1: Ship All Data to Site A (Centralized Aggregation)


In this plan, all the records from Site B (Asia) and Site C (America) are sent over the network
to Site A (Europe), where the aggregation (SUM of Amount) is performed. This means full
data transfer of raw records from B and C to A.

Plan 2: Perform Local Aggregation and Send Partial Results to Site A


Here, each site computes the SUM(Amount) locally. Then, only the partial results (which are
very small, just the SUM values) are transferred to Site A. This plan reduces data transfer
significantly since only numeric summaries are moved.

Plan 3: Ship All Data to a Different Site (e.g., Site B)


Instead of Site A, all data from Site A and Site C is moved to Site B (Asia), and the
aggregation is performed at Site B. Similar to Plan 1 but with a different central site for
aggregation.

ii. Network Transfer Cost Calculations

Transfer Cost Formula = Number of Records × Avg. Record Size × Transfer Cost per Byte
(0.01 units/byte)

Plan 1: Ship all to Site A

• Site B to A:
30,000 records×100 bytes×0.01=30,000 units
• Site C to A:
10,000×150×0.01=15,000 units
• Total Cost = 30,000 + 15,000 = 45,000 units

Plan 2: Local Aggregation, Send Partial Results

• Only one SUM value per site is sent. Assume 8 bytes per result.
• Site B to A: 8 bytes×0.01=0.08 units
• Site C to A: 8×0.01=0.08 units
• Total Cost = 0.08 + 0.08 = 0.16 units

Plan 3: Ship all to Site B

• Site A to B:
20,000×120×0.01=24,000 units
• Site C to B:
10,000×150×0.01=15,000 units
• Total Cost = 24,000 + 15,000 = 39,000 units
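The three cost calculations above can be reproduced in a short Python sketch. The per-site record counts and average record sizes come from the question; the 8-byte partial-result size is the same assumption used in Plan 2:

```python
# Per-site statistics from the question: (number of records, avg bytes/record).
SITES = {"A": (20_000, 120), "B": (30_000, 100), "C": (10_000, 150)}
RESULT_BYTES = 8  # assumed size of one partial SUM value (as in Plan 2)

def cost(n_bytes):
    """Transfer cost at 0.01 units per byte, written as a division by 100."""
    return n_bytes / 100

def ship_all_to(dest):
    """Plans 1 and 3: ship every raw record from the other sites to dest."""
    return sum(cost(n * size) for s, (n, size) in SITES.items() if s != dest)

def local_aggregation(dest):
    """Plan 2: each remote site sends only one small partial SUM."""
    return sum(cost(RESULT_BYTES) for s in SITES if s != dest)

print(ship_all_to("A"))        # Plan 1: 45000.0 units
print(local_aggregation("A"))  # Plan 2: 0.16 units
print(ship_all_to("B"))        # Plan 3: 39000.0 units
```

The printed values match the hand calculations, confirming Plan 2's dramatic advantage.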

iii. Recommended Strategy


Recommended Plan: Plan 2 (Perform Local Aggregation, Send Partial Results)

Reason:

• It has by far the lowest network transfer cost (only 0.16 units) compared to Plan 1
(45,000 units) and Plan 3 (39,000 units).
• Since local aggregation cost is considered negligible, this plan is most efficient.
• It also reduces network congestion and improves response time in distributed
environments.

b) The 2-Step Algorithm in Distributed Query Optimization

In distributed query optimization, the 2-step algorithm is used to improve how a query runs
across multiple sites. It involves creating two types of plans: a static plan and a dynamic plan.
Each one has its own role in making sure the query runs efficiently.

1. Static Plan

• When it's generated: The static plan is created before the query runs, at compile time.
• What it does: It gives a global overview of how the query should be executed across all the
distributed sites, including choosing the best sites to access data, how to join tables,
the order of operations, and how to reduce data transfer.
• What influences it: metadata about data distribution, estimated sizes of tables/fragments,
and network costs.
• Role: It helps to plan ahead and decide the most cost-effective strategy for executing the
query, even before it starts running.

2. Dynamic Plan

• When it's generated: The dynamic plan is made at runtime, while the query is being executed.
• What it does: It adjusts the query plan based on the real-time conditions of the system,
such as current network delays, actual data sizes, or server load.
• What influences it: actual runtime data values, current network speed or congestion,
and server availability or performance.
• Role: It makes on-the-fly improvements to the static plan if needed, ensuring the query
remains efficient even if something unexpected happens during execution.

How They Work Together

The static plan sets the foundation, providing a well-thought-out strategy.

The dynamic plan adjusts as needed, helping the query adapt to real-time changes.

Together, they balance planning and flexibility, leading to faster and more efficient query
execution in distributed systems.
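The interplay described above can be sketched in a few lines of Python. The site names, latency numbers, and cost model below are invented for illustration; real optimizers use far richer statistics, but the two-step shape is the same:

```python
# Toy sketch of the two-step idea: a static plan chosen from compile-time
# estimates, then a runtime adjustment when measured conditions disagree.
# Site names and latency figures are illustrative assumptions.

def static_plan(estimated_latency_ms):
    """Compile time: pick the site with the lowest *estimated* latency."""
    return min(estimated_latency_ms, key=estimated_latency_ms.get)

def dynamic_adjust(plan, measured_latency_ms):
    """Runtime: switch sites only if the chosen one is no longer cheapest."""
    best = min(measured_latency_ms, key=measured_latency_ms.get)
    return best if measured_latency_ms[plan] > measured_latency_ms[best] else plan

plan = static_plan({"A": 20, "B": 5, "C": 12})           # estimates favour B
plan = dynamic_adjust(plan, {"A": 4, "B": 50, "C": 12})  # B is congested now
print(plan)  # A
```

The static step commits to B based on estimates; the dynamic step observes that B has become slow and reroutes to A, mirroring the GPS analogy in the conclusion.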

Conclusion:


In short, the static plan is like making a smart route before a trip, while the dynamic plan is
like using GPS to avoid traffic during the trip. Both are important for getting the best
performance in a distributed query system.
QUESTION TWO

❖ Apart from Incremental Maintenance (Differential Updates), which is typically the best
choice when performance and network efficiency are a priority, other techniques for
materialized view maintenance include Deferred (Lazy) Maintenance, Immediate (Eager)
Maintenance, Trigger-Based Maintenance, and Log-Based Maintenance (Change Data
Capture). Using the internet and books, describe each of these techniques briefly.
ANSWERS

1. Deferred (Lazy) Maintenance

Deferred or lazy maintenance delays updating the materialized view until it is actually
queried by a user. Changes made to the base tables are not immediately reflected in the view,
meaning the view may contain outdated information until a request for it is made.

2. Immediate (Eager) Maintenance

Immediate, or eager, maintenance ensures the materialized view is updated instantly
whenever the base tables are modified. Any INSERT, UPDATE, or DELETE operation on
the base tables triggers a synchronous update to the view, maintaining real-time consistency
between the view and the underlying data.

3. Trigger-Based Maintenance

Trigger-based maintenance uses database triggers to automatically update the materialized
view when specific events (like insertions, updates, or deletions) occur on the base tables.
These triggers are predefined procedures that execute in response to changes, allowing the
view to be updated either immediately or in a deferred manner depending on the
configuration.

4. Log-Based Maintenance (Change Data Capture - CDC)

Log-based maintenance, also known as Change Data Capture (CDC), tracks changes in the
base tables by reading the database's transaction logs instead of directly querying the tables.
This technique allows changes to be captured and applied to the materialized view
asynchronously, making it highly efficient and minimally invasive to the primary operations
of the database.
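As a rough illustration of the log-based idea (not any particular database's implementation), the following Python sketch keeps a materialized SUM view current by replaying deltas from a change log instead of re-scanning the base table. The log format and view shape are invented for the example:

```python
# Sketch of log-based (CDC-style) maintenance: the view consumes the
# transaction log and applies only the deltas. Log entries are assumptions.
view = {"total": 0.0}  # materialized SUM(Amount) over a base table

log = [
    ("INSERT", 100.0),
    ("INSERT", 250.0),
    ("DELETE", 100.0),
    ("UPDATE", (250.0, 300.0)),  # (old value, new value)
]

def apply_log(view, log):
    """Replay captured changes against the view, one delta at a time."""
    for op, payload in log:
        if op == "INSERT":
            view["total"] += payload
        elif op == "DELETE":
            view["total"] -= payload
        elif op == "UPDATE":
            old, new = payload
            view["total"] += new - old

apply_log(view, log)
print(view["total"])  # 300.0
```

No matter how large the base table grows, the maintenance work here is proportional to the number of logged changes, which is what makes CDC minimally invasive.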
❖ Incremental view maintenance and triangle counting are highly complementary in
distributed graph-processing systems, particularly when working with materialized
views that are frequently updated. Triangle counting exemplifies a practical scenario
where full recomputation after each update would be computationally expensive.
Instead, incremental update techniques—such as delta processing—allow efficient
maintenance by applying only the changes triggered by recent updates. This approach
is especially effective in large-scale, dynamic, and distributed environments. In such
systems, delta processing is often employed to maintain views involving join
operations, particularly in cases of light-light or heavy-heavy interactions. Explain
how delta processing functions under these join interaction patterns and why it is a
suitable technique for efficient view maintenance in distributed environments.

ANSWERS
Delta processing is a technique used in incremental view maintenance to update materialized views
efficiently by applying only the changes (called deltas) resulting from recent updates in the base data,
rather than recomputing the entire view.
(1) Light-Light Interactions. This pattern occurs when both vertices involved in the join have low
degrees (i.e., few connections). For example, when a new edge connects two less-connected nodes,
only a small number of potential triangles can be formed or destroyed. Delta processing handles this
efficiently by only checking for triangles that directly involve the updated edge and the immediate
neighbors of those two vertices. Because the scope of the update is small, the computation and
communication overhead is minimal.
(2) Heavy-Heavy Interactions. In this case, the join involves two high-degree vertices, which
participate in many relationships or edges. A single update (such as adding or removing an edge
between these two nodes) can potentially affect a large number of triangles. Even though the impact is
larger, delta processing remains efficient by avoiding a full graph scan. Instead, it focuses only on the
subsets of the graph connected to the updated edge. Optimization techniques like indexing,
neighbourhood caching, or partial recomputation are often used to limit the scope of computation.
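The delta computation for an edge insertion can be sketched concretely. In the following Python example (graph data invented for illustration), inserting an edge (u, v) only inspects the common neighbours of u and v, so the work is proportional to the endpoints' degrees rather than the graph size:

```python
# Sketch of delta processing for triangle counting: when edge (u, v) arrives,
# only common neighbours of u and v can complete new triangles, so the delta
# is |N(u) ∩ N(v)|. No full recount of the graph is needed.
from collections import defaultdict

adj = defaultdict(set)  # adjacency sets: vertex -> neighbours
triangles = 0           # running triangle count maintained incrementally

def add_edge(u, v):
    """Apply the delta for one edge insertion and update the running count."""
    global triangles
    delta = len(adj[u] & adj[v])  # triangles this new edge completes
    triangles += delta
    adj[u].add(v)
    adj[v].add(u)
    return delta

add_edge(1, 2)
add_edge(2, 3)
print(add_edge(1, 3))  # closing edge 1-3 completes exactly 1 triangle
print(triangles)       # 1
```

For light-light endpoints the set intersection is tiny; for heavy-heavy endpoints it is larger but still bounded by the two neighbourhoods, which is why indexing and neighbourhood caching focus on exactly this lookup.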

QUESTION THREE

❖ (a)Given the CUSTOMER table:


CUSTOMER(CustID, Name, Region, CreditLimit)
Write SQL queries to create horizontal fragments based on the Region attribute such that:
❖ Fragment 1 contains customers from 'East',
ANSWER
CREATE TABLE CUSTOMER_EAST AS
SELECT * FROM CUSTOMER
WHERE Region = 'East';

❖ Fragment 2 contains customers from 'West'


ANSWER
CREATE TABLE CUSTOMER_WEST AS
SELECT * FROM CUSTOMER
WHERE Region = 'West';

❖ Fragment 3 contains customers from all other regions


ANSWER
CREATE TABLE CUSTOMER_OTHERS AS
SELECT * FROM CUSTOMER
WHERE Region NOT IN ('East', 'West');

(b) Given the EMPLOYEE table:


EMPLOYEE(EmpID, Name, Address, Department, Salary)
Write SQL queries to create vertical fragments as follows:
❖ Fragment 1 should contain EmpID, Name, and Address
ANSWER
CREATE TABLE EMPLOYEE_ONE AS
SELECT EmpID, Name, Address
FROM EMPLOYEE;

❖ Fragment 2 should contain EmpID, Department, and Salary

ANSWER
CREATE TABLE EMPLOYEE_TWO AS
SELECT EmpID, Department, Salary
FROM EMPLOYEE;

(c) Given the vertically fragmented tables:


• EmpPersonal(EmpID, Name, Address)
• EmpJob(EmpID, Department, Salary)
Write an SQL query to reconstruct the original EMPLOYEE table using a join.
ANSWER
SELECT *
FROM EmpPersonal
JOIN EmpJob ON EmpPersonal.EmpID=EmpJob.EmpID;
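The reconstruction can be checked end to end with Python's built-in sqlite3 module. The sample employee rows below are invented for the test; the fragmentation and join SQL mirror the answers above:

```python
# Runnable check that joining the two vertical fragments on EmpID
# reconstructs the original EMPLOYEE rows. Sample data is illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE EMPLOYEE(EmpID INT, Name TEXT, Address TEXT, Department TEXT, Salary REAL);
INSERT INTO EMPLOYEE VALUES (1, 'Asha', 'Dodoma', 'IT', 900.0),
                            (2, 'Juma', 'Mwanza', 'HR', 700.0);
CREATE TABLE EmpPersonal AS SELECT EmpID, Name, Address FROM EMPLOYEE;
CREATE TABLE EmpJob      AS SELECT EmpID, Department, Salary FROM EMPLOYEE;
""")

rebuilt = con.execute("""
    SELECT p.EmpID, p.Name, p.Address, j.Department, j.Salary
    FROM EmpPersonal p JOIN EmpJob j ON p.EmpID = j.EmpID
    ORDER BY p.EmpID
""").fetchall()

original = con.execute("SELECT * FROM EMPLOYEE ORDER BY EmpID").fetchall()
print(rebuilt == original)  # True
```

Because every fragment keeps the primary key EmpID, the join is lossless; dropping the key from either fragment would make reconstruction impossible.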

(d) Under what conditions would horizontal fragmentation be more advantageous than
vertical fragmentation? Explain with an example scenario.
ANSWER

Horizontal fragmentation is more advantageous when:

• The application frequently accesses specific rows based on some criterion (such as
location, department, or region), rather than just a subset of columns.
• Users or applications work only with certain subsets of rows.
• Data is geographically distributed and each site handles a specific region's data.
• You want to reduce access time by storing only relevant rows at each location.

EXAMPLE

Table: CUSTOMER(CustID, Name, Region, CreditLimit)

Suppose a national company operates in three regions: East, West, and South, and each
branch office only deals with customers in its own region.

❖ Horizontal Fragmentation:

• CUSTOMER_EAST: Contains only customers from the East.


• CUSTOMER_WEST: Contains only customers from the West.
• CUSTOMER_SOUTH: Contains only customers from the South region.
➢ Advantage:

• The East branch can query only the CUSTOMER_EAST fragment.


• This improves performance (fewer rows to scan), localizes data, and improves
autonomy.
• Also, it reduces network overhead in distributed systems.

From the above explanation we can conclude that horizontal fragmentation is more
advantageous when row-level filtering (e.g., by region, department, or location)
dominates access patterns, especially in distributed or region-based systems.
