Distributed Database Management Notes - 3
This chapter introduces the basic concepts of distributed query processing. A query involves the retrieval of data from the database. In this chapter, the different phases and sub-phases of distributed query processing are briefly discussed with suitable examples. An important phase in distributed query processing is distributed query optimization. Different query optimization strategies, the distributed cost model, and the cardinalities of intermediate result relations are the focus of this chapter. Finally, efficient algorithms for centralized and distributed query optimization are also presented.
The organization of this chapter is as follows. Section 11.1 introduces the fundamentals of query processing, and the objectives of distributed query processing are presented in Section 11.2. The different steps of distributed query processing, such as query transformation, query fragmentation, global query optimization, and local query optimization, are briefly discussed in Section 11.3. In Section 11.4, join strategies for fragmented relations are illustrated. The algorithms for global query optimization are presented in Section 11.5.
In a centralized DBMS, query processing can be divided into four main steps: query decomposition (consisting of scanning, parsing and validation), query optimization, code generation and query execution. In the query decomposition step, the query parser checks the validity of the query and then translates it into an internal form, usually a relational algebraic expression or something equivalent. The query optimizer examines all relational algebraic expressions that are equivalent to the given query and chooses the one that is estimated to be the cheapest. The code generator generates the code for the access plan selected by the query optimizer, and the query processor actually executes the query. The steps of query processing in a centralized DBMS are illustrated in figure 11.1.
[Figure 11.1: Steps of query processing in a centralized DBMS - query decomposition (scanning, parsing and validating, using the system catalog), query optimization producing an execution plan, code generation producing the generated code, and query execution producing the query output.]
Query optimization is a critical performance issue in a centralized database management system, because the main difficulty is to choose the one execution strategy that minimizes the consumption of computing resources. Another objective of query optimization is to reduce the total execution time of the query, which is the sum of the execution times of all individual operations that make up the query. There are two main techniques for query optimization in centralized DBMSs, although the two methods are usually combined in practice. The first approach uses heuristic rules that order the operations in a query, while the second approach compares different strategies based on their relative costs and selects the one that requires the minimum computing resources. In a distributed system, several additional factors further complicate query execution. In general, the relations are fragmented in a distributed database; therefore, the distributed query processor transforms a high-level query on a logical distributed database (entire relations) into a sequence of database operations (relational algebra) on relation fragments. The data accessed by the query in a distributed database must be localized so that the database operations can be performed on local data or relation fragments. Finally, the query on fragments must be extended with communication operations, and optimization should be done with respect to a cost function that minimizes the use of computing resources such as disk I/O, CPU, and the communication network.
Example 11.1
Let us consider the two relations Employee (empid, ename, salary, designation, deptno) and Department (deptno, dname, location), which are stored in a centralized DBMS, and the query
"Retrieve the names of all employees whose department location is 'inside' the campus"
where empid and deptno are the primary keys of the relations Employee and Department respectively, and deptno is a foreign key of the relation Employee.
Two equivalent relational algebraic expressions that correspond to the above query are as follows:
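The expressions themselves are not reproduced in these notes; based on the discussion that follows, they are presumably of the form:

    Πename (σlocation = 'inside' ∧ Employee.deptno = Department.deptno (Employee × Department))

    Πename (σEmployee.deptno = Department.deptno (Πename, deptno (Employee) × Πdeptno (σlocation = 'inside' (Department))))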
In the first relational algebraic expression, the projection and the selection operations are performed after computing the Cartesian product of the two relations Employee and Department, while in the second expression the Cartesian product is performed after the selection and projection operations on the individual relations. Obviously, the use of computing resources is smaller in the second expression. Thus, in a centralized DBMS, it is relatively easy to choose the optimum execution strategy from a number of relational algebraic expressions that are equivalent to the same query. In a distributed context, query processing is significantly more difficult, because the choice of the optimum execution strategy also depends on other factors such as data transfer among sites and the selection of the best site for query execution. The problem of distributed query processing is discussed below with the help of an example.
Example 11.2
Let us consider the same query as in example 11.1 in a distributed database environment, where the Employee and Department relations are fragmented and stored at different sites. For simplicity, let us assume that the Employee relation has been horizontally fragmented into two partitions EMP1 and EMP2, which are stored at Site1 and Site2 respectively, and that the Department relation has been horizontally fragmented into two relations DEPT1 and DEPT2, which are stored at Site3 and Site4 respectively, as listed below.
Further assume that the above query is generated at Site5 and the result is required at that site. Two different strategies for executing the query in the distributed environment are depicted in figure 11.2.
[Figure 11.2: Two distributed execution strategies. (i) Strategy A: the fragments EMP1, EMP2, DEPT1 and DEPT2 are all transferred to Site5, where the query is evaluated. (ii) Strategy B: D1 and D2 are produced by selection at Site3 and Site4 and sent to Site1 and Site2, where E1 = EMP1 ⋈deptno D1 and E2 = EMP2 ⋈deptno D2 are computed; E1 and E2 are then sent to Site5.]
In the first strategy, all data are transferred to Site5 before processing the query. In the second strategy, selection operations are performed individually on the fragmented relations DEPT1 and DEPT2 at Site3 and Site4 respectively, and the resultant data D1 and D2 are transferred to Site1 and Site2 respectively. After evaluating the join operations of D1 with EMP1 at Site1 and of D2 with EMP2 at Site2, the resultant data are transmitted to Site5, where the final projection operation is performed.
To calculate the costs of the above two execution strategies, let us assume that the cost of accessing a tuple from any relation is 1 unit and the cost of transferring a tuple between any two sites is 10 units. Further, consider that the numbers of tuples in the Employee and Department relations are 1,000 and 20 respectively, of which 8 departments are located inside the campus. For the sake of simplicity, assume that the tuples are uniformly distributed among the sites and that the relations Employee and Department are locally clustered on the attributes 'deptno' and 'location' respectively.
(i) The cost for transferring 10 tuples of DEPT1 from Site3 and 10 tuples of DEPT2
from Site4 to Site5 = (10 + 10) * 10 = 200
(ii) The cost for transferring 500 tuples of EMP1 from Site1 and 500 tuples of EMP2
from Site2 to Site5 = (500 + 500) * 10 = 10000
(iii) The cost of performing the selection operations on DEPT1 and DEPT2 = 20 * 1 = 20
(iv) The cost of performing the join operation between EMP1, EMP2 and the selected Department data = 1000 * 8 * 1 = 8000
(v) The cost for evaluating projection operation at Site5 to retrieve the employee
names = 8 * 1 = 8
The cost of performing the projection operation and the selection operation on a tuple is the same, because in both cases it equals the cost of accessing a tuple. Obviously, execution strategy B is much more beneficial than strategy A. Furthermore, slower communication between the sites and a higher degree of fragmentation may increase the difference between the alternative query execution strategies.
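A minimal Python sketch of this arithmetic, using the cost parameters assumed above (1 unit per tuple access, 10 units per tuple transfer); grouping the five items into a single "ship everything to Site5" total is an illustrative assumption, not part of the original notes:

    ACCESS = 1      # cost of accessing one tuple
    TRANSFER = 10   # cost of transferring one tuple between any two sites

    # (i) ship 10 + 10 Department tuples (DEPT1 from Site3, DEPT2 from Site4) to Site5
    c1 = (10 + 10) * TRANSFER                 # 200
    # (ii) ship 500 + 500 Employee tuples (EMP1 from Site1, EMP2 from Site2) to Site5
    c2 = (500 + 500) * TRANSFER               # 10000
    # (iii) selection over the 20 Department tuples
    c3 = 20 * ACCESS                          # 20
    # (iv) join of the 1000 Employee tuples with the 8 selected departments
    c4 = 1000 * 8 * ACCESS                    # 8000
    # (v) final projection over the 8 matching departments
    c5 = 8 * ACCESS                           # 8

    # Strategy A pays all five items, since everything is shipped to Site5 first.
    # Strategy B ships only the selected departments and the matching join results,
    # avoiding most of item (ii), which is where the bulk of the cost lies.
    print("items:", c1, c2, c3, c4, c5, "strategy A total:", c1 + c2 + c3 + c4 + c5)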
Distributed query processing involves retrieving data from physically distributed databases that provide the view of a single logical database to users. It has a number of objectives, as listed below.
• In distributed query processing, the total cost of executing a distributed query should be minimized. If no relation is fragmented in a distributed DBMS, query execution involves only a local processing cost. On the other hand, if relations are fragmented, a communication cost is incurred in addition to the local processing cost. In this case, the aim of distributed query processing is to minimize the total execution cost of the query, which includes the total processing cost (the sum of all local processing costs at the participating sites) and the communication cost. The local processing cost of a distributed query is evaluated in terms of the number of disk accesses (I/O cost) and the CPU cost. The CPU cost is incurred when performing data operations in main memory at the participating sites. The I/O cost can be minimized by using an efficient buffer management technique. In a distributed query execution, the communication cost is incurred to exchange data between participating sites. Hence, the communication cost depends on several factors, such as the amount of data transferred between participating sites, the selection of the best site for query execution, the number of messages transferred between participating sites, and the communication network. In wide area networks with low bandwidth (a few kilobytes per second), the communication cost is the dominant factor, and the optimization of CPU cost and I/O cost can be ignored. The optimization of the local processing cost is of greater significance in the case of local networks.
In a distributed database system, data distribution details are hidden from the users; thus, it provides data distribution transparency. The distributed query processor is responsible for transforming a query on global relations into a number of subqueries on fragmented relations in the local databases. Therefore, distributed query processing is more complicated than query processing in a centralized DBMS, and it requires some extra steps in addition to all the steps of a centralized query processing technique. Whenever a distributed query is generated at any site of a distributed system, it follows a sequence of phases to retrieve the answer to the query. These are query decomposition, query fragmentation, global query optimization and local query optimization, as illustrated in figure 11.3.
The first three phases of distributed query processing are controlled by the site where the query is generated. Local query optimization is done at the local participating sites, and the optimized local queries are executed on the local databases. Finally, the results of the local queries are sent back to the site of the distributed system where the query was generated. The details of all these phases are described in the next section.
[Figure 11.3: Phases in distributed query processing - a query in a high-level language on global relations passes through query decomposition, query fragmentation (producing a fragmented query), global query optimization (using the global system catalog), and local query optimization.]

11.3.1 Query Decomposition
Query decomposition is the first phase of distributed query processing. The objective of this phase is to transform a query expressed in a high-level language on global relations into a relational algebraic query on global relations. The information required for the transformation is available in the global conceptual schema describing the global relations. In this phase, the syntactic and semantic correctness of the query is also checked. The query decomposition phase is the same in both centralized and distributed DBMSs. The four successive steps of query decomposition are normalization, analysis, simplification and query restructuring, which are briefly discussed in the following sections.
11.3.1.1 Normalization
In the normalization step, the query is converted into a normalized form to facilitate further processing. A given query can be arbitrarily complex, depending on the predicates (the WHERE clause in SQL) specified in the query. In this step, the complex query is converted into one of two possible normal forms by using the transformation rules listed below.
• P1 ∧ P2 ⇔ P2 ∧ P1
• P1 ∨ P2 ⇔ P2 ∨ P1
• P1 ∧ (P2 ∧ P3) ⇔ (P1 ∧ P2) ∧ P3
• P1 ∨ (P2 ∨ P3) ⇔ (P1 ∨ P2) ∨ P3
• P1 ∧ (P2 ∨ P3) ⇔ (P1 ∧ P2) ∨ (P1 ∧ P3)
• P1 ∨ (P2 ∧ P3) ⇔ (P1 ∨ P2) ∧ (P1 ∨ P3)
• ¬(P1 ∧ P2) ⇔ ¬P1 ∨ ¬P2
• ¬(P1 ∨ P2) ⇔ ¬P1 ∧ ¬P2
• ¬(¬P1) ⇔ P1
The two possible normal forms are conjunctive normal form and disjunctive normal form.
Conjunctive Normal Form: In conjunctive normal form, preference is given to the ∧ (AND) operator: the predicate is a conjunction of terms connected by the ∧ (AND) operator, where each term contains one or more simple predicates connected by the ∨ (OR) operator. For instance, (P1 ∨ P2) ∧ (P3 ∨ P4) ∧ P5, where each Pi represents a simple predicate.
Disjunctive Normal Form: In disjunctive normal form, preference is given to the ∨ (OR) operator: the predicate is a disjunction of terms connected by the ∨ (OR) operator, where each term contains one or more simple predicates connected by the ∧ (AND) operator, for instance (P1 ∧ P2) ∨ (P3 ∧ P4). In this normal form, a query can be processed as independent conjunctive subqueries connected by union operations.
Example 11.3
“Retrieve the names of all employees whose designation is Manager and department
name is Production or Printing”.
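The SQL statement and its normalized predicate are not reproduced in these notes. Assuming the predicates refer to the attributes designation and dname, the WHERE condition and its disjunctive normal form would be:

    designation = 'Manager' ∧ (dname = 'Production' ∨ dname = 'Printing')
    ⇔ (designation = 'Manager' ∧ dname = 'Production') ∨ (designation = 'Manager' ∧ dname = 'Printing')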
Hence, in the above disjunctive normal form, each disjunct connected by the ∨ (OR) operator can be processed as an independent conjunctive subquery.
11.3.1.2 Analysis
The objective of the analysis step is to reject normalized queries that are incorrectly formulated or contradictory. The query is lexically and syntactically analyzed in this step using the compiler of the high-level query language in which the query is expressed. In addition, this step verifies whether the relations and attributes specified in the query are defined in the global conceptual schema. It is also checked in the analysis step that the operations applied to database objects in the given query are correct for the object types. When any such incorrectness is detected, the query is returned to the user with an explanation; otherwise, the high-level query is transformed into an internal form for further processing. Incorrectness in the query is detected based on the corresponding query graph, or relation connection graph, which can be constructed as follows:
• A node is created in the query graph for the result and for each base relation specified in the query.
• An edge is drawn between two nodes in the query graph for each join operation and for each project operation in the query. An edge between two nodes that are not result nodes represents a join operation, while an edge whose destination node is the result node represents a project operation.
• A node in the query graph that is not the result node is labeled by a selection operation or a self-join operation specified in the query.
The relation connection graph is used to check the semantic correctness of the subset of queries that do not contain disjunction and negation. Such a query is semantically incorrect if its relation connection graph is not connected. A join graph for a query is a subgraph of the relation connection graph that represents only the join operations specified in the query, and it can be derived from the corresponding query graph. The join graph is useful in the query optimization phase. A query is contradictory if its normalized attribute connection graph [Rosenkrantz & Hunt, 1980] contains a cycle for which the valuation sum is negative. The normalized attribute connection graph for a query is constructed as follows.
• A node is created for each attribute referenced in the query and an additional node is
created for a constant 0.
• A directed edge between two attribute nodes is drawn to represent each join operation
in the query. Similarly, a directed edge between an attribute node and a constant 0
node is drawn for each select operation specified in the query.
• A weight is assigned to edges depending on the inequality condition mentioned in the
query. A weight v is assigned to the directed edge a1→a2, if there is an inequality
condition in the query that satisfies a1 ≤ a2 + v. Similarly, a weight –v is assigned to
the directed edge 0→a1, if there is an inequality condition in the query that represents
a1 ≥ v.
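Detecting a contradictory query thus reduces to finding a negative-weight cycle in this graph, which can be done with a Bellman-Ford style relaxation. A minimal Python sketch; the edge set below encodes the predicates of Example 11.5 further below and is an assumption for illustration:

    # Nodes: attributes plus the constant node "0".
    # Edge (a1, a2, v) encodes the predicate a1 <= a2 + v.
    edges = [
        ("duration", "0", 2),    # duration < 2  =>  duration <= 0 + 2
        ("0", "duration", -3),   # duration > 3  =>  0 <= duration - 3
    ]
    nodes = {"0", "duration"}

    def has_negative_cycle(nodes, edges):
        # Bellman-Ford from a virtual source connected to every node with weight 0.
        dist = {n: 0 for n in nodes}
        for _ in range(len(nodes)):
            changed = False
            for a1, a2, v in edges:
                if dist[a1] + v < dist[a2]:
                    dist[a2] = dist[a1] + v
                    changed = True
            if not changed:
                return False
        # A relaxation still succeeds after |nodes| rounds: negative cycle exists.
        return any(dist[a1] + v < dist[a2] for a1, a2, v in edges)

    print(has_negative_cycle(nodes, edges))   # True: the query is contradictory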
Example 11.4
Let us consider the relations Student (s-id, sname, address, course-id, year) and Course (course-id, course-name, duration, course-fee, intake-no, coordinator), and the query "Retrieve the names, addresses and course names of all those students whose year of admission is 2008 and whose course duration is 4 years".
Select sname, address, course-name from Student, Course where year = 2008 and duration = 4 and Student.course-id = Course.course-id.
Here, course-id is a foreign key of the relation Student, referencing the relation Course. The above SQL query can be syntactically or semantically incorrect for several reasons. For instance, the attributes sname, address, course-name, year and duration may not be declared in the corresponding schema, or the relations Student and Course may not be defined in the global conceptual schema. Furthermore, if the operations "= 2008" and "= 4" are incompatible with the data types of the attributes year and duration respectively, then the above SQL query is incorrect.
The query graph and the join graph for the above SQL query are depicted in figure 11.4.
[Figure 11.4: (a) The query graph - nodes for Student, Course and Result; the edge between Student and Course is labeled with the join predicate Student.course-id = Course.course-id, and the edges into the Result node are labeled with the projected attributes sname, address and course-name. (b) The join graph - only the edge Student.course-id = Course.course-id between Student and Course.]
In the above SQL query, if the join condition between the two relations (that is, Student.course-id = Course.course-id) were missing, there would be no edge between the nodes representing the relations Student and Course in the corresponding query graph (figure 11.4(a)). In that case the SQL query is semantically incorrect, since the relation connection graph is disconnected; either the query is rejected or an implicit Cartesian product between the relations is assumed.
Example 11.5
Let us consider the query "Retrieve the names of all those students who are admitted into courses where the course duration is greater than 3 years and less than 2 years", which involves the relations Student and Course.
Select sname from Student, Course where duration > 3 and Student.course-id =
Course.course-id and duration < 2.
The normalized attribute connection graph for the above query is illustrated in figure
11.5.
[Figure 11.5: Normalized attribute connection graph - nodes duration, 0, Student.course-id and Course.course-id; the edges between duration and 0 carry weights 2 and -3 (forming a cycle with a negative valuation sum), and the edges between Student.course-id and Course.course-id carry weight 0.]
In the above normalized attribute connection graph, there is a cycle between the nodes
duration and 0 with a negative valuation sum, which indicates that the query is
contradictory.
11.3.1.3 Simplification
In this step, all redundant predicates in the query are detected and common subexpressions are eliminated in order to transform the query into a simpler and more efficiently computed form. This transformation must preserve semantic correctness. Typically, view definitions, access restrictions and integrity constraints are considered in this step, some of which may introduce redundancy into the query. The well-known idempotency rules of Boolean algebra, listed below, are used to eliminate redundancies from the given query.
• P ∧ P ⇔ P
• P ∨ P ⇔ P
• P ∧ true ⇔ P
• P ∧ false ⇔ false
• P ∨ true ⇔ true
• P ∨ false ⇔ P
• P ∧ (¬P) ⇔ false
• P ∨ (¬P) ⇔ true
• P ∧ (P ∨ Q) ⇔ P
• P ∨ (P ∧ Q) ⇔ P
Example 11.6
Let us consider the following view definition and query on the view that involves the
relation Employee (empid, ename, salary, designation, deptno).
Create view V1 as select empid, ename, salary from Employee where deptno = 10;
Select empid, ename, salary from Employee where (deptno = 10 and salary > 10000)
and deptno = 10;
Here, the duplicated predicate deptno = 10 is redundant, and the WHERE condition reduces to "deptno = 10 and salary > 10000".
11.3.1.4 Query Restructuring
In this step, the query in the high-level language is rewritten into an equivalent relational algebraic form. This involves two sub-steps: the query is first converted into an equivalent relational algebraic form, and the relational algebraic query is then restructured to improve performance. The relational algebraic query is represented by a query tree or operator tree, which can be constructed as follows.
In the relational data model, the conversion from an SQL query to a relational algebraic form can be done in a straightforward way. The leaf nodes of the query tree are created from the FROM clause of the SQL query. The root node is created as a project operation involving the result attributes from the SELECT clause of the SQL query. The sequence of relational algebraic operations, which depends on the WHERE clause of the SQL query, is directed from the leaves to the root of the query tree. After generating the equivalent relational algebraic form from the SQL query, the relational algebraic query is restructured by using transformation rules from relational algebra, which are listed below. In listing these rules, three different relations R, S and T are considered, where the relation R is defined over the attributes A = {A1, A2, ..., An} and the relation S is defined over the attributes B = {B1, B2, ..., Bn}.
(i) Commutativity of binary operators – the Cartesian product and the natural join operation are commutative:
R × S ⇔ S × R and R ⋈ S ⇔ S ⋈ R
Similarly, R ∪ S ⇔ S ∪ R and R ∩ S ⇔ S ∩ R
This rule is not applicable to the set difference and semijoin operations of relational algebra.
(ii) Associativity of binary operators – Cartesian product and natural join operation
are always associative:
(R х S) х T ⇔ R х (S х T) and
(R ⋈ S) ⋈ T ⇔ R ⋈ (S ⋈ T)
(iii) Idempotence of unary operators – successive projections on the same relation can be combined, ΠA″(ΠA′(R)) ⇔ ΠA″(R), where A, A′ and A″ are sets of attributes defined on relation R, A′ and A″ are subsets of A, and A″ is a subset of A′. Similarly, successive selections on the same relation can be grouped into a single selection whose predicate is the conjunction of the individual predicates.
(v) Commuting selection with binary operators – if the selection predicate involves only attributes of R:
σp(Ai)(R × S) ⇔ (σp(Ai)(R)) × S and σp(Ai)(R ⋈ S) ⇔ (σp(Ai)(R)) ⋈ S
Similarly, the selection operation can be commuted with the union and set difference operations if both relations are defined over the same schema:
σp(Ai)(R ∪ S) ⇔ σp(Ai)(R) ∪ σp(Ai)(S) and σp(Ai)(R − S) ⇔ σp(Ai)(R) − σp(Ai)(S)
(vi) Commuting projection with binary operators – similarly, the projection operation can be commuted with the union and set difference operations when both relations are defined over the same schema, for example ΠC(R ∪ S) ⇔ ΠC(R) ∪ ΠC(S).
Proofs of the above transformation rules are available in [Aho et al., 1982]. By applying these rules, a large number of equivalent query trees can be constructed for a given query, among which the most cost-effective one is chosen in the optimization phase. However, generating an excessively large number of operator trees for the same query is not practical. Therefore, the above transformation rules are applied in a methodical way in order to restructure the query tree for a given query. The transformation rules are used in the following sequence.
• Unary operations in the query are separated first in order to simplify the query.
• Unary operations on the same relation are grouped so that common expressions are
computed only once [rule no. (iii)].
• Unary operations are commuted with binary operations [rule no. (v) & (vi)].
• Binary operations are ordered.
Example 11.7
Let us consider the following SQL query which involves the relations Student (s-id,
sname, address, course-id, year) and Course (course-id, course-name, duration,
course-fee, intake-no, coordinator):
Select sname, course-name from Student, Course where s-id > 0 and year = 2007
and Student.course-id = Course.course-id and duration = 4 and intake-no = 60.
The query tree for the above SQL query is depicted in figure 11.6.
[Figure 11.6: Query tree for the SQL query - the projection Πsname, course-name at the root, the selection predicates of the WHERE clause (including σduration = 4 ∧ intake-no = 60) below it, and the join Student ⋈Student.course-id = Course.course-id Course over the leaf nodes Student and Course.]
[Figure 11.7: Restructured query tree - the selections on Student (s-id > 0 ∧ year = 2007) and on Course (duration = 4 ∧ intake-no = 60) are pushed down to the leaf relations, followed by the join on course-id and the final projection Πsname, course-name.]
Here, the unary operations specified in the query are separated first, and all unary operations on the same relation are grouped together in order to restructure the query tree. However, the query tree in figure 11.7 is not necessarily the optimal tree.
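A minimal Python sketch of this restructuring step - pushing the selection predicates of Example 11.7 below the join so that each relation is filtered before being joined. The tree encoding, the attribute partitioning and the helper function are assumptions chosen to mirror the example, not part of the original notes:

    # Each node is (op, payload, children); leaves are ("rel", name, []).
    student = ("rel", "Student", [])
    course = ("rel", "Course", [])

    # Initial tree: project( select(all predicates, join(Student, Course)) )
    initial = ("project", ["sname", "course-name"],
               [("select", ["s-id > 0", "year = 2007", "duration = 4", "intake-no = 60"],
                 [("join", "Student.course-id = Course.course-id", [student, course])])])

    # Predicates referring only to Student attributes (per the schemas of Example 11.7).
    STUDENT_ATTRS = {"s-id", "year"}

    def push_selections(tree):
        op, payload, children = tree
        if op == "select" and children[0][0] == "join":
            _, join_pred, (left, right) = children[0]
            left_preds = [p for p in payload if p.split()[0] in STUDENT_ATTRS]
            right_preds = [p for p in payload if p.split()[0] not in STUDENT_ATTRS]
            # Commute the selection with the join: filter each operand first.
            return ("join", join_pred,
                    [("select", left_preds, [left]),
                     ("select", right_preds, [right])])
        return (op, payload, [push_selections(c) for c in children])

    restructured = push_selections(initial)
    print(restructured)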
11.3.2 Query Fragmentation
A simple way to generate the fragment query is to replace the global relations at the leaves of the query tree (operator tree) of the distributed query with their reconstruction rules. The relational algebraic query tree generated by applying the reconstruction rules is known as the generic tree. However, this approach is not efficient, because further simplifications and restructurings of the generic tree are possible. Therefore, in order to generate a simpler and optimized query from the generic query, reduction techniques are used, where the reduction technique depends on the type of fragmentation. Reduction techniques for different types of fragmentation are illustrated with examples in the following sections.
11.3.2.1 Reduction for Horizontal Fragmentation

Example 11.8
Let us consider the following SQL query that involves the relation Employee (empid, ename, salary, designation, deptno):
Assume that the Employee relation is partitioned into two horizontal fragments EMP1 and EMP2 depending on the selection predicates mentioned below:
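(The SQL query and the fragment definitions are not reproduced in these notes. Judging from the reduced tree in figure 11.8, which retains the selection σsalary > 10000, the fragments are presumably of the form EMP1 = σsalary ≤ 10000 (Employee) and EMP2 = σsalary > 10000 (Employee), and the query selects employees with salary > 10000.)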
Now, the relation Employee can be reconstructed from its horizontal fragments EMP1 and EMP2 by using the reconstruction rule Employee = EMP1 ∪ EMP2.
Therefore, in the generic tree of the above SQL query, the leaf node corresponding to the Employee relation can be replaced by the reconstruction rule EMP1 ∪ EMP2. The selection predicate of the query contradicts the definition of the horizontal fragment EMP1, thereby producing an empty relation, so this branch can be eliminated from the generic tree, as shown in figure 11.8.
[Figure 11.8: The generic query tree applies σsalary > 10000 to EMP1 ∪ EMP2; in the reduced query tree, the subtree for EMP1 is eliminated (the selection contradicts its definition) and σsalary > 10000 is applied to EMP2 only.]
Example 11.9
Let us assume that the relation Employee (empid, ename, salary, designation, deptno) is horizontally fragmented into two partitions EMP1 and EMP2, and that the relation Department (deptno, dname, location) is horizontally fragmented into two relations DEPT1 and DEPT2. These horizontally fragmented relations are defined in the following:
The reconstruction rules for the above horizontal fragments are Employee = EMP1 ∪ EMP2 and Department = DEPT1 ∪ DEPT2.
Let us consider the following SQL query in a distributed environment, which involves the relations Employee and Department.
The generic query tree and the reduced query tree for the above query are depicted in
figure 11.9.
[Figure 11.9: The generic query tree joins (EMP1 ∪ EMP2) with (DEPT1 ∪ DEPT2) on Employee.deptno = Department.deptno; in the reduced query tree the join is distributed over the unions and the useless partial joins are removed, leaving (EMP1 ⋈Employee.deptno = Department.deptno DEPT1) ∪ (EMP2 ⋈Employee.deptno = Department.deptno DEPT2).]
Hence, the commutativity of the join operation with the union operation is very important in a distributed DBMS, because it allows a join of two relations to be implemented as a union of partial joins, where each part of the union can be executed in parallel.
11.3.2.2 Reduction for Vertical Fragmentation
Vertical fragmentation partitions a relation based on projection operations; thus, the reconstruction rule for vertical fragmentation is the join operation. The reduction technique for vertical fragmentation involves the removal of vertical fragments that have no attributes in common with the attributes required by the query, except the key attribute.
Example 11.10
Let us assume that the Employee relation is vertically fragmented into two relations
EMP1 and EMP2 as defined below:
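(The fragment definitions and the SQL query are not reproduced in these notes. Since figure 11.10 projects ename and salary and joins the fragments on empid, the fragments are presumably of the form EMP1 = Πempid, ename, salary (Employee) and EMP2 = Πempid, designation, deptno (Employee), and the query retrieves the names and salaries of employees.)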
In this query, the projection operation on the relation EMP2 is redundant, since the attributes ename and salary are not part of EMP2. The generic query tree of the query is illustrated in figure 11.10(a). By commuting the projection operation with the join operation and removing the vertical fragment EMP2, the reduced query tree shown in figure 11.10(b) is produced.
[Figure 11.10: (a) The generic query tree applies Πename, salary to EMP1 ⋈empid EMP2; (b) in the reduced query tree the useless fragment EMP2 is removed and Πename, salary is applied to EMP1 only.]
11.3.2.3 Reduction for Derived Fragmentation
Derived fragmentation partitions a child relation horizontally based on the selection predicates that define the fragments of its parent relation; it is used to facilitate the join between fragments. As with any horizontal fragmentation, the reconstruction rule for derived fragmentation is the union operation. The reduction technique for derived fragmentation involves the commutativity of the join operation with the union operation and the removal of useless join operations and selection operations.
Example 11.11
Let us assume that the relation Department (deptno, dname, location) is horizontally fragmented into two relations DEPT1 and DEPT2, defined in the following:
Further assume that the fragmentation of the Employee relation is derived from the Department relation; thus, EMPi = Employee ⋉deptno DEPTi, i = 1, 2.
Let us consider the following SQL query in a distributed environment, which involves the relations Employee and Department.
The generic query tree for the derived fragmentation is depicted in figure 11.11(a).
[Figure 11.11: (a) The generic query tree applies the selection σdeptno > 10 and joins the unions of the fragments on Department.deptno = Employee.deptno; (b) after distributing the join over the unions and removing the contradictory selection and the useless partial joins, the reduced query tree retains only the join of DEPT2 with EMP2.]
The search space is defined as the set of equivalent query trees for a given query, which can be generated by using the transformation rules. In query optimization, join trees are particularly important, because they determine the join order of the relations involved in a given query, which affects the performance of query processing. If a given query involves many operators and many relations, then the search space is very large, because it contains a large number of equivalent query trees for the given query. Exploring the whole search space can be more expensive than the actual execution of the query, and therefore query optimizers typically impose some restrictions on the size of the search space to be considered. Most query optimizers use heuristic rules that order the relational algebraic operations (selection, projection and Cartesian product) in the query. Another restriction is imposed on the
projection and Cartesian product) in the query. Another restriction is imposed on the
shape of the join tree. There are two different kinds of join trees, known as linear join
trees and bushy join trees. In a linear join tree, at least one operand of each operator
node is a base relation. On the other hand, a bushy join tree is more general and may have
operators with no base relations as operands. Linear join trees reduce the size of the
search space, whereas bushy join trees facilitate parallelism. Examples of a linear join tree and a bushy join tree are illustrated in the following figure.
[Figure: (a) An example of a linear join tree - ((((R1 ⋈ R2) ⋈ R3) ⋈ R4) ⋈ R5); (b) an example of a bushy join tree - (R1 ⋈ R2) ⋈ (R3 ⋈ R4).]
There are three different kinds of optimization strategies, known as static optimization
strategy, dynamic optimization strategy and randomized optimization strategy, which are
described in the following.
(a) Static Optimization Strategy - In the static optimization strategy, the given query is parsed, validated, and optimized once. This strategy is similar to the approach taken by a compiler for a programming language. The advantages of the static optimization strategy are that the run-time overhead is removed and that there may be more time available to evaluate a large number of execution strategies, thereby increasing the chances of finding a more nearly optimum strategy. The disadvantage of the static optimization strategy is that the execution strategy selected as optimal at compile time may no longer be optimal at run time.
(b) Dynamic Optimization Strategy – The most popular optimization strategy is the dynamic optimization strategy, which is deterministic. In the dynamic optimization strategy, the execution plan starts from the base relations and proceeds by joining one more relation at each step until the complete plan is obtained. The advantage of the dynamic optimization strategy is that it is almost exhaustive and ensures the best plan. The disadvantages are that the performance of the query is affected, because the query has to be parsed, validated and optimized each time before execution. Moreover, it is necessary to reduce the number of execution strategies analyzed in order to achieve an acceptable overhead, which may have the effect of selecting a less than optimum strategy. This strategy is suitable when the number of relations involved in the given query is small.
(c) Randomized Optimization Strategy - The most recent optimization strategy is the randomized optimization strategy, which reduces the optimization complexity but does not guarantee the best execution plan. The randomized optimization strategy allows the query optimizer to trade optimization time for execution time [Lanzelotte et al., 1993]. In this strategy, one or more start plans are built by a greedy strategy, and the optimizer then tries to improve a start plan by visiting its neighbors. A neighbor is obtained by applying a random transformation to a plan; one typical random transformation is exchanging two randomly chosen operand relations of the plan. The randomized optimization strategy provides better performance than the other optimization strategies as soon as the query involves more than a few relations.
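A minimal Python sketch of this neighbor-visiting idea - iterative improvement over join orders, where a neighbor is produced by swapping two randomly chosen operand relations. The toy cost function and relation sizes are illustrative assumptions only:

    import random

    # Assumed relation cardinalities (illustrative only).
    sizes = {"R1": 1000, "R2": 20, "R3": 500, "R4": 50}

    def cost(plan):
        # Toy cost model: sum of intermediate result sizes for a left-deep join order,
        # pretending each join keeps min(left, right) tuples.
        total, current = 0, sizes[plan[0]]
        for rel in plan[1:]:
            current = min(current, sizes[rel])
            total += current
        return total

    def neighbor(plan):
        # Random transformation: exchange two randomly chosen operand relations.
        i, j = random.sample(range(len(plan)), 2)
        new_plan = list(plan)
        new_plan[i], new_plan[j] = new_plan[j], new_plan[i]
        return new_plan

    # Greedy start plan: order relations by increasing size.
    plan = sorted(sizes, key=sizes.get)
    best = cost(plan)
    for _ in range(100):                 # visit neighbors of the current plan
        candidate = neighbor(plan)
        if cost(candidate) < best:
            plan, best = candidate, cost(candidate)

    print(plan, best)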
The distributed cost model includes cost functions to predict the cost of operators, database statistics and base data, and formulas to calculate the sizes of intermediate results.
Cost Functions
In a distributed system, the cost of processing a query can be expressed in terms of the total cost measure or the response time measure [Yu and Chang, 1984]. The total cost measure is the sum of all cost components. If no relation is fragmented in the distributed system and the given query includes only selection and projection operations, then the total cost measure involves the local processing cost only. However, when join and semijoin operations are executed, communication costs between different sites may be incurred in addition to the local processing cost. Local processing costs are usually evaluated in terms of the number of disk accesses and the CPU processing time, while communication costs are expressed in terms of the total amount of data transmitted. For geographically dispersed computer networks, the communication cost is normally the dominant consideration, but the local processing cost is of greater significance for local networks. Thus, most early distributed DBMSs designed for wide area networks ignored the local processing cost and concentrated on minimizing the communication cost. The total cost measure can be represented by the following formula.
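The formula itself is not reproduced in these notes; from the definitions that follow, it is presumably of the standard form:

    Total cost = TCPU * insts + TI/O * ops + C0 + C1 * X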
where TCPU is the CPU processing cost per instruction, insts represents the total number of CPU instructions, TI/O is the I/O processing cost per I/O operation, ops represents the total number of I/O operations, C0 is the start-up cost of initiating a transmission, C1 is a proportionality constant, and X is the amount of data to be transmitted. For wide area networks, the above formula is simplified as follows.
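Presumably, keeping only the communication component:

    Total cost = C0 + C1 * X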
The response time measure is the time from the initiation of the query to the time when the answer is produced. The response time measure must consider the parallel local processing costs and the parallel communication costs. A general formula for the response time can be expressed as follows.
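The formula is not reproduced in these notes; based on the definitions that follow, it is presumably of the form:

    Response time = TCPU * seq_insts + TI/O * seq_ops + C0 + C1 * seq_X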
where seq_insts represents the maximum number of CPU instructions that must be performed sequentially, seq_ops represents the maximum number of I/O operations that must be performed sequentially, and seq_X indicates the amount of data that must be transmitted sequentially. If the local processing cost is ignored, then the above formula is simplified to
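a plausible simplified form, consistent with Example 11.12 below:

    Response time = C0 + C1 * seq_X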
In this case, any processing and communication that is done in parallel is ignored. The
following example illustrates the difference between total cost measure and response time
measure.
Example 11.12
Let us consider that P amount of data is to be transmitted from Site1 to Site2, and Q amount of data is to be transmitted from Site3 to Site2, for the execution of a query as shown in figure 11.13.
[Figure 11.13: P amount of data is transferred from Site1 to Site2, and Q amount of data is transferred from Site3 to Site2.]
Hence, the total cost measure = C0 + C1 * P + C0 + C1 * Q = 2C0 + C1 * (P + Q), while the response time measure = max{C0 + C1 * P, C0 + C1 * Q}, because the two data transmissions can be done in parallel. In this case, the local processing cost is ignored. The response time measure can be minimized by increasing the degree of parallel execution, while the total cost measure can be minimized by improving the utilization of resources.
Database Statistics
The cost of an execution strategy depends on the sizes of the intermediate result relations that are produced during the execution of a query. In distributed query optimization, it is very important to estimate the sizes of the intermediate results of relational algebraic operations in order to minimize the data transmission cost, because the intermediate result relations are transmitted over the network. This estimation is done based on the database statistics of the base relations stored in the system catalog and on formulas for predicting the cardinalities of the results of the relational algebraic operations. Typically, a distributed DBMS is expected to hold the following information in its system catalog to predict the size of intermediate result relations.
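(The list is not reproduced in these notes. Such catalog statistics typically include, for each relation R, its cardinality card(R) and the size of a tuple, and, for each attribute A, the length of its values, the number of distinct values card(ΠA(R)), and the minimum and maximum values of its domain.)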
In addition to the above information, the system catalog sometimes also holds the join selectivity factor for some pairs of relations. The join selectivity factor of two relations R and S is a real value between 0 and 1, and it can be defined as follows.
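The definition itself is missing from these notes; it is presumably the standard one:

    SFJ(R, S) = card(R ⋈ S) / (card(R) * card(S))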
1. Selection Operation.
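The cardinality estimate is not reproduced in these notes; it is presumably the standard one,

    card(σF(R)) = SFS(F) * card(R)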
where SFS(F), the selectivity factor of the selection, depends on the selection formula F. The values SFS(F) can be calculated as follows.
SFS (p(Ai) ∧ p(Aj)) = SFS (p(Ai)) * SFS (p(Aj))
SFS (p(Ai) ∨ p(Aj)) = SFS (p(Ai)) + SFS (p(Aj)) – (SFS (p(Ai)) * SFS (p(Aj)))
In this case, Ai and Aj are two different attributes, and p(Ai) and p(Aj) denote the selection predicates.
2. Projection Operation
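The formula is missing from these notes. For a projection over a set of attributes A that includes a key of R, it is presumably card(ΠA(R)) = card(R); for a projection on a single non-key attribute A, card(ΠA(R)) equals the number of distinct values of A, one of the statistics held in the system catalog. (Item 3, the Cartesian product, is also missing; its cardinality is simply card(R × S) = card(R) * card(S).)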
4. Join Operation
There is no general way to evaluate the cardinality of the join operation without additional information about the join. Typically, the upper bound on the cardinality of the join operation is the cardinality of the Cartesian product. However, there is a common case in which the evaluation is simple: if the relation R, with its attribute A, is joined with the relation S, with its attribute B, via an equi-join operation, where A is a key of the relation R and B is a foreign key of the relation S, then the cardinality of the join operation can be evaluated as follows.
card(R⋈A=BS) = card(S).
In the above evaluation, it is assumed that each tuple of the relation S participates in the join operation; therefore, the above estimate is an upper bound. In other cases, the cardinality can be calculated as
card(R⋈S) = SFJ * card(R) * card(S)
5. Semijoin Operation
The selectivity factor of the semijoin of the relation R by the relation S is the fraction of tuples of R that join with tuples of S. An approximation of the semijoin selectivity factor is given as follows [Hevner and Yao, 1979].
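The formula itself is not reproduced in these notes; it is presumably the standard approximation

    SFSJ(R ⋉A S) = card(ΠA(S)) / card(dom[A])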
The above formula depends only on the attribute A of the relation S. Thus, it is often
called the selectivity factor of attribute A of S and is denoted by SFSJ (S.A). Now, the
cardinality of semijoin operation can be expressed by the following formula.
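The formula is missing from these notes; presumably it is

    card(R ⋉A S) = SFSJ(S.A) * card(R)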
This formula applies in the common case where the attribute R.A is a foreign key referencing the relation S. In that case, the semijoin selectivity factor is 1, because card(∏A(S)) = card(dom[A]).
6. Union Operation
It is very difficult to evaluate the cardinality of union operation. The simple formulas for
evaluating the upper bound and lower bound cardinalities of union operation are listed in
the following.
card(R∪S) (upper bound) = card(R) + card(S)
card(R∪S) (lower bound) = max {card(R), card(S)}
Like union operation, the formulas for evaluating the upper bound and lower bound
cardinalities of set difference operation, denoted by card(R-S), are card(R) and 0
respectively.
In the context of distributed query processing, local query optimization is of utmost importance, because the distributed query is fragmented into several subqueries, each of which is processed at a local site in a centralized way. Moreover, the query optimization techniques for centralized DBMSs are often extended for use in distributed DBMSs. There are several query optimization techniques for centralized DBMSs. One such centralized query optimization technique, used by the popular relational database system INGRES [Stonebraker et al., 1976], is introduced in the following section.
INGRES uses a dynamic query optimization strategy that recursively partitions a high-level query into smaller queries. In this approach, a query is first decomposed into a sequence of queries having a unique relation in common. Each of these single-relation queries is then processed by a one-variable query processor (OVQP). The OVQP optimizes the access to a single relation by selecting, based on the selection predicates, the best access method for that relation. For instance, if there is a selection predicate in the query of the form <A = value>, an index available on attribute A will be used. This algorithm executes the unary operations first and tries to minimize the sizes of intermediate results before performing binary operations.
To perform decomposition, two basic techniques are used in the INGRES algorithm, known as detachment and substitution. The query processor uses the detachment technique to partition a given query into several smaller queries based on a common relation. For example, using the detachment technique, an SQL query q1 can be decomposed into two smaller queries q11 and q12 as follows.
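The generic forms of q1, q11 and q12 are not reproduced in these notes; from the definitions that follow, they are presumably of the form (R1' is a hypothetical name for the intermediate relation):

    q1:  Select R2.A2, ..., Rn.An from R1, R2, ..., Rn
         where P1(R1.A1') and P2(R1.A1, R2.A2, ..., Rn.An)

    q11: Select R1.A1 into R1' from R1 where P1(R1.A1')

    q12: Select R2.A2, ..., Rn.An from R1', R2, ..., Rn
         where P2(R1'.A1, R2.A2, ..., Rn.An)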
where Ai and Ai′ represent lists of attributes of relation Ri, P1 is a selection predicate involving the attributes A1′ of the relation R1, and P2 is a selection predicate involving attributes of the relations R1, R2, ..., Rn. This step is necessary to reduce the sizes of relations before performing binary operations. The detachment technique extracts the selection operations, which are usually the most selective ones.
Multi-relation queries that cannot be further detached are called irreducible. A query is
said to be irreducible if and only if the corresponding query graph is a chain with two
nodes. Irreducible queries are converted into single relation queries by tuple substitution.
In tuple substitution, for a given n-relation query, the tuples of one relation are substituted by their values, thereby generating a set of (n−1)-relation queries. Tuple substitution can be implemented in the following way. Assume that the relation R is chosen for tuple substitution in an n-relation query q1. For each tuple in R, the attributes of R referred to in q1 are replaced by their actual values from that tuple, thereby producing a query q1′ with (n−1) relations. Therefore, the total number of queries produced by tuple substitution is card(R). An example of the INGRES algorithm is illustrated in the following.
Example 11.13
Let us consider the following SQL query that involves three different relations
Student(sreg-no, sname, street, city, course-id), Course(course-id, cname, duration,
fees), and Teacher(T-id, name, designation, salary, course-id).
Using the detachment technique, the above query can be replaced by the following queries q1 and q2, where Course1 is an intermediate relation.
q2: Select sname, name from Student, Course1, Teacher where Student.course-id
= Course1.course-id and Course1.course-id = Teacher.course-id.
Similarly, the successive detachment of q2 may generate the queries q21 and q22 as
follows.
q21: Select name, Teacher.course-id into Teacher1 from Teacher, Course1 where
Teacher.course-id = Course1.course-id.
Assume that in query q22 three tuples are selected, whose course-id values are C01, C03, and C06. The tuple substitution of the Teacher1 relation then produces three one-relation subqueries, which are listed below.
11.4 Join Strategies in Fragmented Relations
The ordering of join operations is very important in both centralized and distributed query optimization. The algorithms presented in this section do not explicitly consider the fragmentation of relations; thus, the generic term 'relation' is used here to denote either a fragmented or a non-fragmented relation. There are two basic approaches to join strategies in fragmented relations. One is the simple join strategy, which tries to optimize the ordering of join operations directly. The alternative approach is the semijoin strategy, which replaces join operations by semijoin operations in order to minimize communication costs.
The objective of the simple join strategy is to optimize the ordering of the join operations of relations in order to minimize the query processing cost. For the sake of simplicity, only specific join queries are considered here, whose operand relations are stored at different sites; it is assumed that relation transfers between different sites of the distributed system are done on a set-at-a-time basis rather than a tuple-at-a-time basis. Finally, the transfer time for delivering the result to the result site is ignored here.
Let us consider a simple query that involves the joining of two relations R and S, which are stored at different sites. In performing R ⋈ S, the smaller relation should be transferred to the site of the larger relation; therefore, it is necessary to estimate the sizes of the relations R and S. If the query involves one more relation T, then the obvious choice is to transfer the intermediate result R ⋈ S or the relation T, whichever is smaller.
The difficulty with the simple join strategy is that a join operation may reduce or increase the size of the intermediate result. Hence, it is necessary to estimate the sizes of the results of join operations. One solution is to estimate the communication costs of all alternative strategies and select the one for which the communication cost is minimized. However, the number of alternative strategies increases rapidly as the number of relations increases.
The main drawback of the simple join strategy is that entire operand relations must be transmitted between the sites. The objective of the semijoin strategy is to minimize the communication cost by replacing the join operations of a query with semijoin operations. The join operation R ⋈ S between two relations R and S over the attribute A, where the relations are stored at different sites of the distributed system, can be replaced by semijoin operations as follows.
R ⋈A S ⇔ (R ⋉A S) ⋈A S ⇔ R ⋈A (S ⋉A R) ⇔ (R ⋉A S) ⋈A (S ⋉A R)
It is necessary to estimate the cost of the above semijoin operations in order to understand the benefit of the semijoin approach over the join. The local processing costs are not considered here, for simplicity. The join operation R ⋈A S and the semijoin-based operation (R ⋉A S) ⋈A S can be implemented in the following way, assuming the relations R and S are stored at Site1 and Site2 respectively, and size(R) < size(S).
R ⋈A S:
1. The relation R is transferred to Site2.
2. The join operation is performed at Site2.
(R ⋉A S) ⋈A S:
1. The projection ∏A(S) is performed at Site2 and the result is sent to Site1.
2. Site1 computes R ⋉A S, say T.
3. The result T is transferred to Site2.
4. The join operation T ⋈A S is computed at Site2.
The communication cost of the above join operation is C0 + C1 * size(R), while the communication cost of the semijoin-based operation is 2C0 + C1 * (size(∏A(S)) + size(R ⋉A S)). Thus, the semijoin approach is beneficial if size(∏A(S)) + size(R ⋉A S) is less than size(R), because C0 is negligible compared to the cost of the data transfer.
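A minimal Python sketch of this comparison, using the communication cost model C0 + C1 * x for transmitting x units of data; the relation and projection sizes are illustrative assumptions only:

    C0, C1 = 1, 10          # start-up cost and cost per unit of data transferred

    def transfer(amount):
        return C0 + C1 * amount

    size_R = 100            # size of the smaller relation R (assumed)
    size_proj_A_S = 10      # size of the projection of S on the join attribute A (assumed)
    size_R_semijoin_S = 30  # size of R semijoined with S, i.e. the matching R tuples (assumed)

    # Join strategy: ship R to the site of S.
    join_cost = transfer(size_R)

    # Semijoin strategy: ship the projection of S to R's site, then ship the reduced R back.
    semijoin_cost = transfer(size_proj_A_S) + transfer(size_R_semijoin_S)

    print(join_cost, semijoin_cost)
    # The semijoin approach wins whenever size_proj_A_S + size_R_semijoin_S < size_R
    # (C0 being negligible), i.e. when few tuples of R participate in the join.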
Generally, the semijoin operation is useful for reducing the sizes of the operand relations involved in multiple-join queries. The semijoin program that reduces the relations the most is called a full reducer [Chiu and Ho, 1980]. The determination of full reducers is a difficult task; one solution is to evaluate the size reduction of all possible semijoin strategies and select the best one. Full reducers cannot be found for the class of queries that have cycles in their join graphs, known as cyclic queries. For the other class of queries, called tree queries, full reducers exist, but the number of alternative semijoin strategies increases as the number of relations increases, which complicates the above solution.
The semijoin strategy is beneficial if only a small number of tuples participate in the join operation, while the simple join strategy is beneficial if most of the tuples participate in the join operation, because the semijoin strategy involves an additional data transfer.
Example 11.14
Let us consider the following join operation, which involves the three relations Student, Course, and Teacher, joined over the attribute course-id.
[Figure: Join graph for Example 11.14 - Student (at Site1) is joined with Course (at Site2) over course-id, and Course is joined with Teacher (at Site3) over course-id.]
If the above join operations are replaced by semijoin operations, then the number of operations increases, but possibly on smaller operands. In this case, instead of sending the entire Student relation to Site2, only the values of the join attribute course-id of the Student relation are sent to Site2. If the length of the join attribute is significantly less than the length of the entire tuple, then the semijoin has good selectivity in this case and reduces the communication cost significantly. The same applies to the join operation between the relations Course and Teacher. However, the semijoin operation may increase the local processing time, because one of the two operand relations must be accessed twice.
11.5 Global Query Optimization Algorithms
This section introduces three global query optimization algorithms, namely, the distributed INGRES algorithm, the R* algorithm and the SDD-1 algorithm. These algorithms differ from one another in major ways, depending on several factors such as the objective function, the optimization strategy, the network topology and so on. A comparison of the three algorithms is shown in the following table.
Algorithm | Objective function | Optimization timing | Network topology | Optimization factors | Use of semijoin | Support for fragments
SDD-1 | Total cost measure | Static | General | Msg. transfer cost | Yes | No
R* | Total cost measure | Static | General or local | Msg. transfer and amount of data transfer cost, I/O cost, CPU cost | No | No
For a given query, all monorelation queries (unary operations such as selection and projection) can be detached first, and these are processed locally. Then the reduction algorithm [Wong and Youssefi, 1976] is applied to the original query, and it produces two different kinds of subqueries: irreducible subqueries and monorelation subqueries. Reduction is a technique that separates all irreducible subqueries and monorelation subqueries from the original query by the detachment technique. Since monorelation subqueries are already processed locally, no further action is taken for them. Assume that the reduction technique produces a sequence of irreducible subqueries q1→q2→…→qn, with at least one relation in common between two consecutive subqueries. It is shown in [Wong and Youssefi, 1976] that such a sequence is unique. Based on the sequence of irreducible subqueries and the size of each fragment, one subquery with at least two variables, say qi, is chosen from the list. To process the subquery qi, the best strategy for the subquery is determined first. This strategy is expressed as a list of pairs (F, S), where F denotes a fragment that is to be transferred to the processing site S. After transferring all fragments to the corresponding processing sites, the subquery qi is finally executed. This procedure is repeated for all irreducible subqueries in the sequence q1→q2→…→qn, and then the algorithm terminates. The algorithm is presented below.
Begin
Step1: Detach all monorelation queries from the given query q and execute by OVQPs
(One-Variable Query Processors) at local sites same as centralized INGRES
algorithm.
Step2: Apply reduction technique to q, which will produce a list of monorelation
subqueries and a sequence of irreducible subqueries q1→q2→…..→qn.
Step3: Ignore all monorelation subqueries (since they are already processed by OVQPs
at local sites).
Step4: For i = 1 to n, repeat Step5 to Step8 [n is the total number of irreducible subqueries].
Step5: Choose the irreducible subquery qi involving the smallest fragments.
Step6: Determine the best strategy, a list of pairs (F, S), for qi. [F represents a fragment and S represents its processing site.]
Step7: For each pair (F, S), transfer the fragment F to the corresponding processing site
S.
Step8: Execute the query qi. Check for the termination of for loop.
Step9: Output the result.
End.
The query optimization is mainly done in Step5 and Step6 of the distributed INGRES algorithm. In this algorithm, subqueries are produced depending on several components and their dependency order. Since the relation involved in a subquery may be fragmented and stored at different sites, the subquery cannot be further subdivided. The main difficulty in Step6 is to determine how to execute the subquery by selecting the fragments that will be transferred to the sites where the processing will take place. For an n-relation subquery, fragments from n−1 relations must be transferred to the site(s) of the fragments of the remaining relation and replicated there. Further, the remaining relation may be divided into several equal fragments in order to increase parallelism. This approach is called fragment-and-replicate; it performs a substitution of fragments rather than of tuples, as in the centralized INGRES algorithm. The selection of the remaining relation and of the number of processing sites depends on the objective function and the communication scheme. The choice of the number of processing sites is a trade-off between the total time measure and the response time measure: as the number of sites increases, the response time decreases due to parallelism, but the total time increases owing to higher communication cost. The cost of producing the result is ignored here.
The objective of the distributed R* algorithm [Selinger and Adiba, 1980] is to reduce the total cost measure, which includes the local processing costs and the communication costs. The distributed R* algorithm uses an exhaustive search approach to select the best strategy. Although an exhaustive search incurs an overhead, the overhead can pay off if the query is executed frequently. The implemented version of the distributed R* algorithm does not support fragmentation or replication; thus, it involves entire relations as basic units. This algorithm chooses one site, called the master site, where the query is initiated. The query optimizer at the master site takes all intersite decisions, such as the selection of the execution sites, the fragments to be used, and the method of transferring data. The other participating sites that store the relations involved in the query, called apprentice sites, make the remaining local decisions and generate local access plans for the query execution. The distributed R* algorithm is implemented in the following way.
Distributed R* Algorithm:
In this algorithm, based on the database statistics, the formulas used to estimate the sizes of intermediate results, and access path information, the query optimizer determines the join ordering, the join algorithms and the access path for each relation. In addition, it also determines the sites of the join results and the method of data transfer between sites. For performing the join between two relations, there are three candidate sites: the site of the first relation, the site of the second relation, and the site of the result relation. Two intersite data transfer techniques are supported by the distributed R* algorithm, called ship-whole and fetch-as-needed. In the ship-whole technique, the entire relation is moved to the join site and stored in a temporary relation before performing the join operation. If the join algorithm is a merge join, the incoming tuples can be processed in a pipelined fashion as they arrive at the join site; in this case, the relation does not need to be stored in a temporary relation. The fetch-as-needed technique is similar to a semijoin of the internal relation with the external relation. In this method, the external relation is sequentially scanned, and for each tuple the join value is sent to the site of the internal relation. The internal tuples that match the value of the external tuple are selected there and sent to the site of the external relation.
The ship-whole technique requires a larger data transfer but fewer message transfers than the fetch-as-needed technique. Obviously, the ship-whole technique is beneficial for smaller relations, while the fetch-as-needed technique is beneficial if the relation is large and the join has good selectivity. For joining an external relation with an internal relation over an attribute, there are four possible join strategies. These join strategies and their corresponding costs are described in the following, where LC denotes the local processing cost (CPU cost and I/O cost), CC represents the communication cost, and A denotes the average number of tuples of the internal relation S that match the value of one tuple of the external relation R.
Strategy 1: Ship the entire external relation to the site of the internal relation.
In this case, the tuples of the external relation R can be joined with the tuples of the internal relation S as they arrive, so no temporary storage is needed. The total cost can therefore be expressed as follows.
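The formula is not reproduced in these notes; by analogy with the Strategy 3 formula given below, it is presumably:

Total cost measure = LC (retrieving card(R) tuples from R) + CC (size(R)) + LC (retrieving A tuples from S) * card(R)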
Strategy 2: Ship the entire internal relation to the site of the external relation.
In this case, the tuples of the internal relation S cannot be joined with the tuples of the external relation R as they arrive; the entire internal relation S has to be stored in a temporary relation first. The total cost can therefore be expressed as follows.
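Again the formula is missing; presumably:

Total cost measure = LC (retrieving card(S) tuples from S) + CC (size(S)) + LC (storing card(S) tuples in a temporary relation T) + LC (retrieving card(R) tuples from R) + LC (retrieving A tuples from T) * card(R)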
Strategy 3: Fetch tuples of the internal relation as needed for each tuple of the
external relation.
In this case, for each tuple in R, the join attribute value is sent to the site of the internal relation S. Then the A tuples of S that match this value are retrieved and sent to the site of R, to be joined as they arrive. Therefore,
Total cost measure = LC (retrieving card(R) tuples from R) + CC (length (join attribute
value)) * card(R) + LC (retrieving A tuples from S) * card(R) + CC (A * length (S)) *
card(R).
Strategy 4: Move both relations to a third site and compute the join there.
In this case, the internal relation S is first moved to the third site and stored in a temporary
relation T. Then the external relation R is moved to the third site, and its tuples are joined
with T as they arrive. Hence,
Total cost measure = LC (retrieving card(S) tuples from S) + CC (card(S) * length(S)) +
LC (storing card(S) tuples in T) + LC (retrieving card(R) tuples from R) +
CC (card(R) * length(R)) + LC (retrieving A tuples from T) * card(R).
The cost of producing the final result is not considered here. In the distributed R*
algorithm, the complexity increases rapidly as the number of relations increases, since it
performs an exhaustive search over several factors such as join order, join algorithm,
access path, data transfer mode and result site.
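To make the trade-off concrete, the following minimal sketch (not part of R* itself) evaluates the four strategy cost formulas above for hypothetical statistics and per-unit LC/CC costs, and picks the cheapest transfer strategy; all numbers and helper names are illustrative assumptions.

```python
# Illustrative comparison of the four R* inter-site join strategies using the
# cost formulas above. All statistics and unit costs are hypothetical.

def lc(tuples):             # local processing cost for handling 'tuples' tuples
    return 1.0 * tuples     # assumed unit CPU + I/O cost per tuple

def cc(volume):             # communication cost for shipping 'volume' bytes
    return 0.1 * volume     # assumed unit transmission cost per byte

def r_star_join_costs(card_R, len_R, card_S, len_S, A, len_attr):
    """Total cost measure of each of the four strategies."""
    s1 = lc(card_R) + cc(card_R * len_R) + lc(A) * card_R
    s2 = (lc(card_S) + cc(card_S * len_S) + lc(card_S)     # store S in temp T
          + lc(card_R) + lc(A) * card_R)
    s3 = (lc(card_R) + cc(len_attr) * card_R
          + lc(A) * card_R + cc(A * len_S) * card_R)
    s4 = (lc(card_S) + cc(card_S * len_S) + lc(card_S)
          + lc(card_R) + cc(card_R * len_R) + lc(A) * card_R)
    return {"ship R to S": s1, "ship S to R": s2,
            "fetch as needed": s3, "ship both to third site": s4}

costs = r_star_join_costs(card_R=1000, len_R=100, card_S=5000, len_S=80,
                          A=2, len_attr=8)
print(costs, "-> cheapest:", min(costs, key=costs.get))
```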
The SDD-1 query optimization algorithm [Bernstein et al., 1981] is derived from the first
distributed query processing algorithm, called hill-climbing. In the hill-climbing algorithm,
an initial feasible query execution strategy is determined and then refined
recursively until no further cost improvement is possible. The objective of the algorithm
is to minimize an arbitrary cost function, which may include the total time measure as well as the
response time measure. The algorithm does not support fragmentation and replication,
and does not use the semijoin operation. The hill-climbing algorithm works in the
following way, where the inputs to the algorithm are the query graph, the locations of the
relations involved in the query, and relation statistics.
First, the algorithm selects an initial feasible solution, which is a global execution
schedule that includes all intersite communication. It is obtained by computing the cost of
every execution strategy that transfers all the required relations to a single candidate
result site and selecting the minimum-cost strategy. Assume this initial solution is S. The
query optimizer then splits S into two strategies, S1 followed by S2, where S1 consists of
sending one of the relations involved in the join operation to the site of the other relation.
The two relations are joined locally there, and the resulting relation is sent to the chosen
result site (S2). If the sum of the costs of executing S1 and S2 and the cost of the local join
processing is less than the cost of S, then S is replaced by the schedule S1 followed by S2.
This process continues recursively until no further beneficial strategy can be found. It is to be
noted that if an n-way join is involved in the given query, then S is divided
into n subschedules instead of two.
The main disadvantage of this greedy approach is that strategies with a higher initial cost,
which could nevertheless lead to better overall strategies, are never explored. Moreover, the
algorithm may get stuck at a local minimum-cost solution and fail to reach the global
minimum-cost solution.
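As a rough illustration under strongly simplified assumptions (a single two-relation join, caller-supplied cost() and join_cost() estimators), the refinement step can be sketched as follows; this is not the actual hill-climbing or SDD-1 code.

```python
# Minimal sketch of hill-climbing refinement for a two-relation join.
# 'cost(rel, from_site, to_site)' and 'join_cost(r1, r2)' are hypothetical
# estimators supplied by the caller; "result" names the shipped join result.

def hill_climb(sites, result_site, cost, join_cost):
    # Initial feasible solution S: ship every relation to the result site.
    best = sum(cost(rel, site, result_site) for rel, site in sites.items())
    # Refinement: send one relation to the other's site, join locally there,
    # then ship the join result to the result site (S1 followed by S2).
    rels = list(sites)
    for r1 in rels:
        for r2 in rels:
            if r1 == r2:
                continue
            s1 = cost(r1, sites[r1], sites[r2])             # ship r1 to r2's site
            s2 = cost("result", sites[r2], result_site)     # ship join result
            candidate = s1 + join_cost(r1, r2) + s2
            if candidate < best:                            # keep only improvements
                best = candidate
    return best
```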
In the SDD-1 algorithm also, an initial solution is selected and refined recursively. A
post-optimization phase is added here to improve the total time of the selected solution.
The main step of the algorithm consists of determining and ordering semijoin operations
in order to minimize cost. There are four phases in the SDD-1 algorithm, known as
initialization, selection of beneficial semijoin operations, result site selection, and
post-optimization. In the initialization phase, an execution strategy that includes
only local processing is selected; in this phase, a set BS of beneficial semijoins is also produced for
use in the next phase. The second phase recursively selects beneficial semijoin operations from
BS, appends them to the execution strategy, and modifies the database statistics and BS accordingly.
This step terminates when all semijoin operations in BS have been appended to the execution strategy.
The execution order of the semijoin operations is the order in which they
were appended to the execution strategy. In the third phase, the result site is decided
based on the cost of transferring data to each candidate site. Finally, in the post-optimization
phase, semijoin operations that involve only relations stored at the result site are removed
from the execution strategy. This is necessary because the result site is chosen only after
all semijoin operations have been ordered.
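The core of the second phase can be pictured as a greedy loop; the sketch below assumes hypothetical benefit(), cost() and update_statistics() helpers, so it only illustrates the selection and ordering idea, not the real SDD-1 cost model.

```python
# Greedy selection of beneficial semijoins, in the spirit of SDD-1's second phase.
# benefit(sj) and cost(sj) are assumed estimators: a semijoin is beneficial when
# the reduction in data to be transferred exceeds the cost of shipping the join
# attribute values. update_statistics(sj) applies the resulting size reduction.

def select_semijoins(candidates, benefit, cost, update_statistics):
    strategy = []                       # ordered list of semijoins to execute
    beneficial = [sj for sj in candidates if benefit(sj) > cost(sj)]
    while beneficial:
        # pick the most beneficial remaining semijoin
        best = max(beneficial, key=lambda sj: benefit(sj) - cost(sj))
        strategy.append(best)           # execution order = order of appending
        update_statistics(best)         # relation sizes change after applying it
        beneficial = [sj for sj in candidates
                      if sj not in strategy and benefit(sj) > cost(sj)]
    return strategy
```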
• Query processing involves the retrieval of data from the database, both in centralized
and in distributed DBMSs. A query processor is a software module that performs this
processing. Distributed query processing involves four phases: query decomposition,
query fragmentation, global query optimization and local query optimization.
• The objective of the query decomposition phase is to transform a high-level query on
global relations into a relational algebraic query on global relations. The four
successive steps of query decomposition are normalization, analysis, simplification
and query restructuring.
• In the normalization step, the query is converted into a normalized form to facilitate
further processing. The two possible normal forms are conjunctive normal form and
disjunctive normal form.
• The objective of the analysis step is to reject normalized queries that are incorrectly
formulated or contradictory.
• In the simplification step, redundant predicates in the query are detected and common
subexpressions are eliminated in order to transform the query into a simpler and more
efficiently computable form.
• In the query restructuring step, the high-level query is rewritten in an equivalent
relational algebraic form.
• In the query fragmentation phase, a relational algebraic query on global relations is
converted into an algebraic query expressed on physical fragments, called a fragment
query, taking into account the data distribution in the distributed database.
• A query typically has many possible execution strategies, and the process of choosing
a suitable one for processing a query is known as query optimization. Both
centralized query optimization and distributed query optimization are very important
in the context of distributed query processing.
• The optimal query execution strategy is selected by a software module known as the
query optimizer, which can be characterized by three components: the search space,
the cost model and the optimization strategy.
• The search space is defined as the set of equivalent query trees for a given query,
which can be generated by applying transformation rules.
• There are three different kinds of optimization strategies, known as static
optimization strategy, dynamic optimization strategy and randomized optimization
strategy.
• In a distributed system, the cost of processing a query can be expressed in terms of
the total cost measure or the response time measure.
Exercises
(vii) The normalized attribute connection graph is used in
a. Normalization step
b. Simplification step
c. Query restructuring step
d. Analysis step.
(xii) Which of the following is not included in the query execution cost of a distributed
DBMS?
a. CPU cost
b. I/O cost
c. Storage cost
d. Communication cost.
(xiii) Which of the following components are used to represent a query optimizer?
a. Cost model
b. Search space
c. Optimization strategy
d. All of these.
a. In a linear join tree, at least one operand of each node is a base relation
b. In a bushy join tree, all operands must be base relations.
c. Bushy join trees reduce the size of the search space
d. Linear join trees facilitate parallelism.
a. Total cost measure
b. Response time measure
c. Both total cost and response time measure
d. None of these.
SELECT ENAME, PNAME FROM EMP, ASG, PROJ WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO AND (TITLE = ‘ELECT.ENG’ OR ASG.PNO < ‘P3’).
Draw the generic query tree (canonical tree) and transform it into an optimized
reduced query tree.
SELECT ENO FROM ASG WHERE RESP = ‘Analyst’ AND NOT (PNO = ‘P2’
OR DUR = 12) AND PNO ≠ ‘P2’ AND DUR = 12, considering
ASG (ENO, PNO, RESP, DUR).
5. Consider the following relations EMP (eno, ename, title) and ASG (eno, pno,
resp, dur). Express, in SQL-like syntax and in relational algebra, the query that finds
the names of employees who are managers of any project. Is the query optimized?
If not, optimize it.
EMP1 = σ PNO ≤ ‘P3’ (EMP)
EMP2 = σ PNO > ‘P3’ (EMP)
Draw the generic query tree for the following query and convert it into a reduced
query on fragments.
Draw the generic query tree for the following query and convert it into a reduced
query on fragments.
Using the INGRES algorithm, apply detachment and tuple substitution to the query
“Retrieve the names of employees working on the ORACLE project”, considering that
there are four different values for ENO, which are E1, E2, E3 and E4.