DDBMS
(a) Discuss the impact of network latency on distributed query performance and how
optimization techniques can mitigate it.
Network latency refers to the delay that occurs when data is transmitted from one location to
another within a network. When latency is high, it results in slower data transfer and delayed
responses. In contrast, low latency means that data can move more quickly, leading to faster
communication between systems.
• Longer Response Times: When a query is executed across multiple systems, data often
needs to travel from one node to another. If there's a delay in the network, the overall
query takes more time to finish because of the slow data transfer between those nodes.
• Slower Join Processes: In distributed systems, joining tables that are stored on different
machines is common. These operations depend on moving data between nodes. When
latency is high, this movement becomes slower, which affects the speed of the join.
• Problems with Real Time and Live Applications: Some applications, like dashboards
or online transaction systems, depend on quick data access. If the network is slow, users
may notice delays when trying to view or interact with real time data.
• Poor Query Optimization: Sometimes, the system that plans how a query should run
does not consider the delays in the network. This can result in inefficient plans that
move too much data between nodes, making the query slower.
• More Errors and Timeouts: A slow network may cause delays in communication that
lead to packet loss or timeouts. This can force the system to resend data, adding even
more delay and reducing overall performance.
The following optimization techniques can mitigate the effects of network latency:
• Keeping Data Close to Where It’s Needed: To cut down on delays, it helps to
organize data so that related pieces are stored together on the same server. This reduces
the need to send data between different machines. Also, by making queries aware of
how data is divided (sharded), systems can avoid contacting all nodes unnecessarily.
• Smarter Query Execution: Some parts of a query like filtering or summarizing data can
be done early, right where the data is stored. This cuts down the amount of information
that needs to be sent over the network. In addition, query planners can be designed to
consider the cost of network delays when deciding how to execute a query efficiently.
• Using Caching to Save Time: Frequently requested results can be stored temporarily,
so the system doesn’t have to run the same query again or pull the same data from
remote servers. Tools like Redis and Memcached can help by acting as shared memory
storage across the network.
• Reducing the Amount of Network Traffic: Combining many small data requests into
one larger one helps to reduce the time lost in sending multiple messages. Also,
compressing the data before it’s sent helps lower the amount of information traveling
over the network, which speeds things up.
• Investing in Faster Network Gear: Using modern, high-speed network hardware like
fiber optics or faster Ethernet can reduce the actual physical delay of moving data
between systems.
• Running Tasks in Parallel and Without Waiting: If a system can handle multiple tasks
at once and doesn’t wait for one task to finish before starting the next, it can greatly
reduce how long users wait for results. This approach makes better use of time,
especially when dealing with multiple remote nodes.
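The parallel-execution idea above can be sketched in Python. Here `query_node` is a hypothetical stand-in for a real remote call, with `time.sleep` simulating network latency; the point is that sequential waiting costs the *sum* of the latencies while parallel execution costs roughly the *maximum*:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def query_node(node, delay):
    """Simulate a remote query whose network latency is `delay` seconds
    (a hypothetical stand-in for a real call to a remote site)."""
    time.sleep(delay)
    return f"rows from {node}"

nodes = {"A": 0.05, "B": 0.05, "C": 0.05}  # site -> simulated latency

# Sequential: total wait is roughly the SUM of all latencies.
start = time.perf_counter()
seq = [query_node(n, d) for n, d in nodes.items()]
t_seq = time.perf_counter() - start

# Parallel: total wait is roughly the MAX of the latencies.
start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    par = list(pool.map(lambda nd: query_node(*nd), nodes.items()))
t_par = time.perf_counter() - start

print(t_par < t_seq)  # parallel finishes sooner
```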
(a)
i. Propose and describe three alternative execution plans for this query, considering:
shipping all data to Site A; performing local aggregation and sending partial results;
shipping all data to a different site (e.g., Site B or C).
Transfer Cost = Number of Records × Avg. Record Size × Transfer Cost per Byte
(0.01 units/byte)
Plan 1: Ship all data to Site A
• Site B to A: 30,000 records × 100 bytes × 0.01 = 30,000 units
• Site C to A: 10,000 × 150 × 0.01 = 15,000 units
• Total Cost = 30,000 + 15,000 = 45,000 units
Plan 2: Perform local aggregation and send partial results to Site A
• Only one SUM value per site is sent. Assume 8 bytes per result.
• Site B to A: 8 bytes × 0.01 = 0.08 units
• Site C to A: 8 × 0.01 = 0.08 units
• Total Cost = 0.08 + 0.08 = 0.16 units
Plan 3: Ship all data to Site B
• Site A to B: 20,000 × 120 × 0.01 = 24,000 units
• Site C to B: 10,000 × 150 × 0.01 = 15,000 units
• Total Cost = 24,000 + 15,000 = 39,000 units
Recommended plan: Plan 2 (local aggregation). Reason:
• It has by far the lowest network transfer cost (only 0.16 units) compared to Plan 1
(45,000 units) and Plan 3 (39,000 units).
• Since local aggregation cost is considered negligible, this plan is most efficient.
• It also reduces network congestion and improves response time in distributed
environments.
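The three plan costs above can be checked with a short Python calculation using the transfer-cost formula from the question (record counts and sizes taken from the workings above):

```python
RATE = 0.01  # transfer cost per byte, from the question

def transfer_cost(records, avg_size_bytes, rate=RATE):
    """Cost = Number of Records x Avg. Record Size x Transfer Cost per Byte."""
    return records * avg_size_bytes * rate

# Plan 1: ship all raw data to Site A
plan1 = transfer_cost(30_000, 100) + transfer_cost(10_000, 150)

# Plan 2: local aggregation -- each site sends a single 8-byte SUM
plan2 = transfer_cost(1, 8) + transfer_cost(1, 8)

# Plan 3: ship all raw data to Site B
plan3 = transfer_cost(20_000, 120) + transfer_cost(10_000, 150)

print(plan1, plan2, plan3)  # 45000.0 0.16 39000.0
```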
In distributed query optimization, the two-step algorithm is used to improve how a query runs
across multiple sites. It involves creating two types of plans: a static plan and a dynamic plan.
Each one has its own role in making sure the query runs efficiently.
1. Static Plan
When it's generated:
The static plan is created at compile time, before the query starts running, using statistics
kept in the catalog (such as estimated data sizes and transfer costs).
What it does:
It decides the initial execution strategy: which sites take part, where operations such as
joins are performed, and how data should move between sites.
Role:
It helps to plan ahead and decide the most cost-effective strategy for executing the
query, even before it starts running.
2. Dynamic Plan
When it's generated:
The dynamic plan is made at runtime (while the query is being executed).
What it does:
It adjusts the query plan based on the real-time conditions of the system, like current network
delays, actual data sizes, or server load.
Role:
It makes on-the-fly improvements to the static plan if needed. This ensures the query remains
efficient even if something unexpected happens during execution.
The dynamic plan adjusts as needed, helping the query adapt to real-time changes.
Together, they balance planning and flexibility, leading to faster and more efficient query
execution in distributed systems.
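A minimal sketch of this two-step idea, assuming a toy cost model with only two candidate plans (the names `ship_all` and `local_agg` and all statistics are invented for illustration; the 0.01 units/byte rate reuses the figure from part (a)):

```python
# Toy cost functions for two candidate plans (hypothetical).
PLANS = {
    "ship_all":  lambda s: s["rows"] * s["row_bytes"] * 0.01,  # ship raw rows
    "local_agg": lambda s: s["sites"] * 8 * 0.01,              # one 8-byte SUM per site
}

def pick_plan(stats):
    """Return the cheapest plan under the given statistics."""
    return min(PLANS, key=lambda p: PLANS[p](stats))

# Step 1 (static, compile time): choose using catalog ESTIMATES.
estimated = {"rows": 10, "row_bytes": 1, "sites": 2}
static_plan = pick_plan(estimated)            # tiny estimate -> shipping looks cheap

# Step 2 (dynamic, run time): re-evaluate with the statistics actually observed;
# if reality contradicts the estimate, switch plans on the fly.
actual = {"rows": 40_000, "row_bytes": 100, "sites": 2}
dynamic_plan = pick_plan(actual)

print(static_plan, dynamic_plan)  # ship_all local_agg
```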
2. Deferred (Lazy) Maintenance
Deferred or lazy maintenance delays updating the materialized view until it is actually
queried by a user. Changes made to the base tables are not immediately reflected in the view,
meaning the view may contain outdated information until a request for it is made.
3. Trigger-Based Maintenance
Trigger-based maintenance defines database triggers on the base tables: whenever an insert,
update, or delete occurs, the trigger fires and refreshes the affected part of the materialized
view automatically.
4. Log-Based Maintenance
Log-based maintenance, also known as Change Data Capture (CDC), tracks changes in the
base tables by reading the database's transaction logs instead of directly querying the tables.
This technique allows changes to be captured and applied to the materialized view
asynchronously, making it highly efficient and minimally invasive to the primary operations
of the database.
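A toy illustration of applying captured changes to a materialized view (here a SUM-per-department view; the log format, department names, and figures are invented for illustration):

```python
# Materialized view: total salary per department.
view = {"Sales": 300.0}

def apply_log(view, log):
    """Apply captured changes (deltas) to the view instead of
    re-reading the base tables."""
    for op, dept, amount in log:
        if op == "insert":
            view[dept] = view.get(dept, 0.0) + amount
        elif op == "delete":
            view[dept] = view.get(dept, 0.0) - amount
        elif op == "update":          # amount is a (old, new) pair
            old, new = amount
            view[dept] = view.get(dept, 0.0) - old + new
    return view

log = [("insert", "Sales", 50.0),
       ("insert", "HR", 80.0),
       ("delete", "Sales", 20.0)]
apply_log(view, log)
print(view)  # {'Sales': 330.0, 'HR': 80.0}
```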
❖ Incremental view maintenance and triangle counting are highly complementary in
distributed graph-processing systems, particularly when working with materialized
views that are frequently updated. Triangle counting exemplifies a practical scenario
where full recomputation after each update would be computationally expensive.
Instead, incremental update techniques—such as delta processing—allow efficient
maintenance by applying only the changes triggered by recent updates. This approach
is especially effective in large-scale, dynamic, and distributed environments. In such
systems, delta processing is often employed to maintain views involving join
operations, particularly in cases of light-light or heavy-heavy interactions. Explain
how delta processing functions under these join interaction patterns and why it is a
suitable technique for efficient view maintenance in distributed environments.
ANSWERS
Delta processing is a technique used in incremental view maintenance to update materialized views
efficiently by applying only the changes (called deltas) resulting from recent updates in the base data,
rather than recomputing the entire view.
(1) Light-Light Interactions. This pattern occurs when both vertices involved in the join have low
degrees (i.e., few connections). For example, when a new edge connects two less-connected nodes,
only a small number of potential triangles can be formed or destroyed. Delta processing handles this
efficiently by only checking for triangles that directly involve the updated edge and the immediate
neighbors of those two vertices. Because the scope of the update is small, the computation and
communication overhead is minimal.
(2) Heavy-Heavy Interactions. In this case, the join involves two high-degree vertices, which
participate in many relationships or edges. A single update (such as adding or removing an edge
between these two nodes) can potentially affect a large number of triangles. Even though the impact is
larger, delta processing remains efficient by avoiding a full graph scan. Instead, it focuses only on the
subsets of the graph connected to the updated edge. Optimization techniques like indexing,
neighbourhood caching, or partial recomputation are often used to limit the scope of computation.
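A minimal sketch of the delta step for triangle counting: when a new edge (u, v) arrives, only the common neighbours of u and v need to be inspected, which is cheap for light-light updates and bounded by the smaller neighbourhood for heavy-heavy ones:

```python
from collections import defaultdict

def add_edge(adj, count, u, v):
    """Delta step: a new edge (u, v) can only create triangles with
    common neighbours of u and v, so only that intersection is inspected
    instead of rescanning the whole graph."""
    delta = len(adj[u] & adj[v])   # triangles created by this edge
    adj[u].add(v)
    adj[v].add(u)
    return count + delta

adj = defaultdict(set)
count = 0
for u, v in [(1, 2), (2, 3), (1, 3), (3, 4)]:
    count = add_edge(adj, count, u, v)
print(count)  # 1 -- the triangle {1, 2, 3}
```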
QUESTION THREE
ANSWER
CREATE TABLE EMPLOYEE_TWO AS
SELECT EmpID, Department, Salary
FROM EMPLOYEE;
(d) Under what conditions would horizontal fragmentation be more advantageous than
vertical fragmentation? Explain with an example scenario.
ANSWER
Horizontal fragmentation is more advantageous WHEN:
• The application frequently accesses specific rows based on some criteria (like
location, department, or region), rather than just a subset of columns within the table.
• Users or applications work only with certain subsets of rows.
• Data is geographically distributed, and each site handles a specific region's data.
• You want to reduce access time by storing only the relevant rows at each location.
EXAMPLE
Suppose a national company operates in three regions: East, West, and South, and each
branch office only deals with customers in its own region.
❖ Horizontal Fragmentation: