Query Optimization in Distributed Systems
Query Optimization in Distributed Systems
The tables required in a global query have fragments distributed across multiple sites.
The local databases have information only about local data. The controlling site uses the
global data dictionary to gather information about the distribution and reconstructs the
global view from the fragments.
If there is no replication, the global optimizer runs local queries at the sites where the
fragments are stored. If there is replication, the global optimizer selects the site based
upon communication cost, workload, and server speed.
The global optimizer generates a distributed execution plan so that least amount of data
transfer occurs across the sites. The plan states the location of the fragments, order in
which query steps needs to be executed and the processes involved in transferring
intermediate results.
The local queries are optimized by the local database servers. Finally, the local query
results are merged together through union operation in case of horizontal fragments and
join operation for vertical fragments.
For example, let us consider that the following Project schema is horizontally fragmented
according to City, the cities being New Delhi, Kolkata and Hyderabad.
PROJECT
Suppose there is a query to retrieve details of all projects whose status is “Ongoing”.
In order to get the overall result, we need to union the results of the three queries as follows −
Query trading.
Reduction of solution space of the query.
A distributed system has a number of database servers in the various sites to perform the
operations pertaining to a query. Following are the approaches for optimal resource utilization −
Operation Shipping − In operation shipping, the operation is run at the site where the data is
stored and not at the client site. The results are then transferred to the client site. This is
appropriate for operations where the operands are available at the same site. Example: Select
and Project operations.
Data Shipping − In data shipping, the data fragments are transferred to the database server,
where the operations are executed. This is used in operations where the operands are distributed
at different sites. This is also appropriate in systems where the communication costs are low,
and local processors are much slower than the client server.
Hybrid Shipping − This is a combination of data and operation shipping. Here, data fragments are
transferred to the high-speed processors, where the operation runs. The results are then sent to
the client site.
Query Trading
In query trading algorithm for distributed database systems, the controlling/client site for a
distributed query is called the buyer and the sites where the local queries execute are called
sellers. The buyer formulates a number of alternatives for choosing sellers and for reconstructing
the global results. The target of the buyer is to achieve the optimal cost.
The algorithm starts with the buyer assigning sub-queries to the seller sites. The optimal plan is
created from local optimized query plans proposed by the sellers combined with the
communication cost for reconstructing the final result. Once the global optimal plan is
formulated, the query is executed.
Optimal solution generally involves reduction of solution space so that the cost of query and data
transfer is reduced. This can be achieved through a set of heuristic rules, just as heuristics in
centralized systems.
Perform selection and projection operations as early as possible. This reduces the data
flow over communication network.
Simplify operations on horizontal fragments by eliminating selection conditions which are
not relevant to a particular site.
In case of join and union operations comprising of fragments located in multiple sites,
transfer fragmented data to the site where most of the data is present and perform
operation there.
Use semi-join operation to qualify tuples that are to be joined. This reduces the amount of
data transfer which in turn reduces communication cost.
Merge the common leaves and sub-trees in a distributed query tree.