Query Processing in Distributed Database Systems
Issue No. 03 - May 1979 (vol. 5)
ISSN: 0098-5589
pp: 177-187
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TSE.1979.234179
S.B. Yao
A.R. Hevner, Department of Computer Science, Purdue University
ABSTRACT
Query processing in a distributed system requires the transmission of data between
computers in a network. The arrangement of data transmissions and local data
processing is known as a distribution strategy for a query. Two cost measures, response
time and total time, are used to judge the quality of a distribution strategy. Simple
algorithms are presented that derive distribution strategies which have minimal response
time and minimal total time, for a special class of queries. These optimal algorithms are
used as a basis to develop a general query processing algorithm. Distributed query
examples are presented and the complexity of the general algorithm is analyzed. The
integration of a query processing subsystem into a distributed database management
system is discussed.
INDEX TERMS
system modeling, computer network, database, distributed database systems,
distributed processing, distribution strategy, heuristic algorithms, query processing,
redundant data, relational data model
o The total cost that will be incurred in processing the query. It is the sum of
all times incurred in processing the operations of the query at various sites
and in intersite communication.
o The response time of the query. This is the elapsed time for executing the
query. Since operations can be executed in parallel at different sites, the
response time of a query may be significantly less than its total cost (see the
sketch after this list).
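As a rough illustration of the difference between the two measures, the sketch below uses a hypothetical linear transmission-cost model (the numbers and function names are illustrative only, not from the paper): total time adds up every transmission, while response time is driven by the longest branch that cannot be overlapped.

```python
# Illustrative sketch with made-up numbers: total time sums all work,
# response time only counts the longest of the parallel branches.

def transfer_time(nbytes, startup=1.0, rate=0.01):
    """Hypothetical linear cost model: fixed startup cost plus per-byte cost."""
    return startup + rate * nbytes

# Two sites send intermediate results to the query site in parallel.
transmissions = [2000, 5000]   # bytes sent from site 1 and site 2

total_time = sum(transfer_time(b) for b in transmissions)     # every transfer counted
response_time = max(transfer_time(b) for b in transmissions)  # parallel transfers overlap

print(f"total time    = {total_time:.1f}")    # 72.0
print(f"response time = {response_time:.1f}") # 51.0
```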
Query Decomposition
The first layer decomposes the calculus query into an algebraic query on global relations. The
information needed for this transformation is found in the global conceptual schema describing the
global relations. However, the information about data distribution is not used here but in the next
layer. Thus the techniques used by this layer are those of a centralized DBMS.
Query decomposition can be viewed as four successive steps. First, the calculus query is rewritten in
a normalized form that is suitable for subsequent manipulation. Normalization of a query generally
involves the manipulation of the query quantifiers and of the query qualification by applying logical
operator priority.
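For concreteness, the small sketch below shows one common choice of normal form, conjunctive normal form, using sympy's boolean rewriting; the tool and the predicate names are assumptions made for illustration, not something the text prescribes.

```python
# Minimal sketch: normalizing a qualification into conjunctive normal form.
# sympy and the predicate names are assumptions used only for illustration.
from sympy import symbols
from sympy.logic.boolalg import to_cnf

p1, p2, p3 = symbols("p1 p2 p3")

# A qualification such as p1 OR (p2 AND p3) is not yet conjunctive;
# distributing OR over AND normalizes it.
qualification = p1 | (p2 & p3)
print(to_cnf(qualification))   # (p1 | p2) & (p1 | p3)
```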
Second, the normalized query is analyzed semantically so that incorrect queries are detected and
rejected as early as possible. Techniques to detect incorrect queries exist only for a subset of
relational calculus. Typically, they use some sort of graph that captures the semantics of the query.
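One such graph-based check, sketched below under the assumption that the query graph is given as an adjacency list (relations as nodes, join predicates as edges), rejects a query whose graph is disconnected, since a disconnected relation is referenced without any join predicate linking it to the rest of the query.

```python
# Sketch: detect one class of semantically incorrect queries by testing
# whether the query graph is connected.  The representation and the
# relation names are hypothetical.
from collections import deque

def is_connected(graph):
    """Breadth-first search over an adjacency-list query graph."""
    if not graph:
        return True
    start = next(iter(graph))
    seen, queue = {start}, deque([start])
    while queue:
        for neighbor in graph[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(graph)

# EMP joins PROJ, but PAY is referenced with no join predicate at all.
query_graph = {"EMP": ["PROJ"], "PROJ": ["EMP"], "PAY": []}
if not is_connected(query_graph):
    print("reject: disconnected query graph (missing join predicate)")
```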
Third, the correct query (still expressed in relational calculus) is simplified. One way to simplify a
query is to eliminate redundant predicates. Note that redundant predicates are likely to arise when a
query is the result of system transformations applied to the user query. Such transformations are used
for performing semantic data control (views, protection, and semantic integrity control).
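A small worked example of such a simplification is given below, using a hypothetical predicate and sympy's boolean simplifier (neither is mandated by the text).

```python
# Sketch: a predicate made redundant by, e.g., merging a view definition
# into the user query can be removed by boolean simplification.
from sympy import symbols
from sympy.logic.boolalg import simplify_logic

p, q = symbols("p q")
print(simplify_logic(p & (p | q)))   # p  -- the disjunct (p | q) is redundant
```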
Fourth, the calculus query is restructured as an algebraic query. Several algebraic queries can
be derived from the same calculus query, and some algebraic queries are “better” than others.
The quality of an algebraic query is defined in terms of expected performance. The traditional way to
do this transformation toward a “better” algebraic specification is to start with an initial algebraic query
and transform it in order to find a “good” one. The initial algebraic query is derived immediately from
the calculus query by translating the predicates and the target statement into relational operators as
they appear in the query. This directly translated algebra query is then restructured through
transformation rules. The algebraic query generated by this layer is good in the sense that the worst
executions are typically avoided. For instance, a relation will be accessed only once, even if there are
several select predicates. However, this query is generally far from providing an optimal execution,
since information about data distribution and fragment allocation is not used at this layer.
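The sketch below illustrates one typical transformation rule of this kind, pushing a selection below a join so that it is applied before the more expensive operator; the expression-tree encoding and the relation names are assumptions made for illustration.

```python
# Sketch of one restructuring rule: push a selection below a join when its
# predicate mentions only one join operand.  The encoding is hypothetical.

def push_select(tree):
    """Rewrite select[p on R](R join S) into (select[p] R) join S."""
    if tree[0] == "select":
        _, pred, pred_rel, child = tree
        if child[0] == "join":
            _, left, right = child
            if left == ("rel", pred_rel):
                return ("join", ("select", pred, pred_rel, left), right)
            if right == ("rel", pred_rel):
                return ("join", left, ("select", pred, pred_rel, right))
    return tree

initial = ("select", "budget > 200000", "PROJ",
           ("join", ("rel", "EMP"), ("rel", "PROJ")))
print(push_select(initial))
# ('join', ('rel', 'EMP'), ('select', 'budget > 200000', 'PROJ', ('rel', 'PROJ')))
```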
Data Localization
The input to the second layer is an algebraic query on global relations. The main role of the second
layer is to localize the query’s data using data distribution information in the fragment schema. We
saw that relations are fragmented and stored in disjoint subsets, called fragments, each being stored
at a different site. This layer determines which fragments are involved in the query and transforms the
distributed query into a query on fragments. Fragmentation is defined by fragmentation predicates
that can be expressed through relational operators. A global relation can be reconstructed by applying
the fragmentation rules, and then deriving a program, called a localization program, of relational
algebra operators, which then act on fragments. Generating a query on fragments is done in two
steps. First, the query is mapped into a fragment query by substituting each relation by its
reconstruction program (also called materialization program). Second, the fragment query is simplified
and restructured to produce another “good” query. Simplification and restructuring may be done
according to the same rules used in the decomposition layer. As in the decomposition layer, the final
fragment query is generally far from optimal because information regarding fragments is not utilized.
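As a small illustration of this mapping and simplification, the sketch below assumes a relation EMP horizontally fragmented on department number (the fragmentation and the query predicate are hypothetical): after substituting EMP by its reconstruction program, the union can be reduced to the single fragment whose fragmentation predicate is compatible with the query.

```python
# Sketch: simplifying a localized query on a horizontally fragmented
# relation.  Fragment definitions and the query are hypothetical.

# EMP is fragmented by department number; the reconstruction program is
# the union of the two fragments.
fragments = {
    "EMP1": range(1, 11),     # fragmentation predicate: 1 <= dno <= 10
    "EMP2": range(11, 100),   # fragmentation predicate: 11 <= dno < 100
}

def localize(query_dno):
    """Fragments whose predicate is compatible with dno = query_dno."""
    return [name for name, dnos in fragments.items() if query_dno in dnos]

# Query: ... WHERE EMP.dno = 5  ->  the union over {EMP1, EMP2} reduces to EMP1.
print(localize(5))   # ['EMP1']
```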
Query Optimization
Query optimization consists of finding the “best” ordering of operators in the query, including
communication operators that minimize a cost function. The cost function, often defined in terms of
time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost,
communication cost, and so on. Generally, it is a weighted combination of I/O, CPU, and
communication costs. Nevertheless, a typical simplification made by the early distributed DBMSs, as
we mentioned before, was to consider communication cost as the most significant factor. This used to
be valid for wide area networks, where the limited bandwidth made communication much more costly
than local processing. This is no longer true today, and communication cost can be lower than I/O
cost. To select the ordering of operators it is necessary to predict execution costs of alternative
candidate orderings. Determining execution costs before query execution (i.e., static optimization) is
based on fragment statistics and the formulas for estimating the cardinalities of results of relational
operators. Thus the optimization decisions depend on the allocation of fragments and available
statistics on fragments, which are recorded in the allocation schema.
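A minimal sketch of such a weighted cost function is given below; the weights and the estimated resource figures are invented for illustration and would in practice come from the fragment statistics just mentioned.

```python
# Sketch of a weighted cost function combining CPU, I/O, and communication
# components.  All weights and plan estimates are hypothetical.

def total_cost(cpu_insts, disk_ios, messages, nbytes,
               w_cpu=1e-6, w_io=1e-2, w_msg=1.0, w_byte=1e-4):
    """Weighted combination of CPU, disk I/O, and communication costs."""
    return w_cpu * cpu_insts + w_io * disk_ios + w_msg * messages + w_byte * nbytes

# Static optimization: compare estimated costs of two candidate orderings.
plan_a = total_cost(cpu_insts=5e6, disk_ios=400, messages=2, nbytes=1e5)
plan_b = total_cost(cpu_insts=2e6, disk_ios=900, messages=6, nbytes=2e4)
print("choose plan A" if plan_a <= plan_b else "choose plan B")
```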
An important aspect of query optimization is join ordering, since permutations of the joins within the
query may lead to improvements of orders of magnitude. One basic technique for optimizing a
sequence of distributed join operators is through the semijoin operator. The main value of the semijoin
in a distributed system is to reduce the size of the join operands and thus the communication cost.
However, techniques which consider local processing costs as well as communication costs may not
use semijoins because they might increase local processing costs. The output of the query
optimization layer is an optimized algebraic query with communication operators included on
fragments. It is typically represented and saved (for future executions) as a distributed query
execution plan.
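The sketch below illustrates the semijoin idea on two hypothetical relations stored at different sites: instead of shipping all of EMP, the site holding PROJ first sends only the join-column values, EMP is reduced by a semijoin, and only the reduced operand is transferred.

```python
# Sketch of a semijoin-based reduction (relation contents are hypothetical).

emp = [  # stored at site 1
    {"eno": 1, "pno": "P1"}, {"eno": 2, "pno": "P7"},
    {"eno": 3, "pno": "P1"}, {"eno": 4, "pno": "P9"},
]
proj = [{"pno": "P1", "budget": 100}]   # stored at site 2

# Site 2 -> site 1: project PROJ on the join attribute (a small message).
proj_pnos = {t["pno"] for t in proj}

# Site 1: EMP semijoin PROJ -- keep only the EMP tuples that can match.
emp_reduced = [t for t in emp if t["pno"] in proj_pnos]

# Site 1 -> site 2: ship the reduced operand and join locally at site 2.
result = [{**e, **p} for e in emp_reduced for p in proj if e["pno"] == p["pno"]]
print(f"{len(emp)} -> {len(emp_reduced)} EMP tuples shipped; result = {result}")
```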