
Query Processing & Optimization

What is query processing?

Query processing refers to the range of activities involved in extracting data from a database.

What are the different phases of query processing? Explain them briefly.

Query processing can be divided into four main phases:
o Decomposition (scanning, parsing, validating, forming the query tree),
o Optimization,
o Code generation, and
o Execution.

Decomposition (scanning, parsing, validating, forming the query tree). The scanner (lexical analyzer) identifies the query tokens, such as SQL keywords, attribute names, and relation names, that appear in the text of the query, whereas the parser (syntactic analyzer) checks the query syntax to determine whether it is formulated according to the syntax rules (rules of grammar) of the query language. The query must also be validated by checking that all attribute and relation names are valid and semantically meaningful names in the schema of the particular database being queried. An internal representation of the query is then created, usually as a tree data structure called a query tree.

Optimization. The DBMS must then devise an execution strategy for retrieving the results of the query from the database files. A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimization. The term optimization is actually a misnomer, because in some cases the chosen execution plan is not the optimal (or absolute best) strategy; it is just a reasonably efficient strategy for executing the query. The query optimizer module has the task of producing a good execution plan.

Code generation. The code generator generates the code to execute that plan.

Execution. The runtime database processor has the task of running (executing) the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime error occurs, an error message is generated by the runtime database processor.
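As a concrete illustration of the scanning step, here is a minimal sketch (Python, standard library only; the token classes and the KEYWORDS set are illustrative, not a complete SQL lexer) of how a scanner might split a query into tokens before the parser checks the grammar:

```python
import re

# A toy scanner: splits a query string into (token_type, value) pairs.
# KEYWORDS and the token classes are illustrative, not a full SQL lexer.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}

TOKEN_RE = re.compile(r"\s*(?:(\d+)|('[^']*')|([A-Za-z_][A-Za-z0-9_.]*)|(=|,|;))")

def scan(query):
    tokens = []
    for num, string, name, punct in TOKEN_RE.findall(query):
        if num:
            tokens.append(("NUMBER", num))
        elif string:
            tokens.append(("STRING", string))
        elif name:
            kind = "KEYWORD" if name.upper() in KEYWORDS else "NAME"
            tokens.append((kind, name))
        else:
            tokens.append(("PUNCT", punct))
    return tokens

print(scan("SELECT ENAME FROM EMP WHERE DUR = 12;"))
```

A real scanner would also track source positions for error messages and reject characters that match no token class; this sketch only shows the token-classification idea.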

The aims of query decomposition


o To transform a high-level query into a relational algebra query

o To check that the query is syntactically and semantically correct

What are the typical phases of query decomposition? What is the difference between conjunctive and disjunctive normal forms?

The typical stages of query decomposition are: analysis, normalization, semantic analysis, simplification, and query restructuring.

Analysis. The query is lexically and syntactically analyzed using the techniques of programming-language compilers. (Lexical analysis is the process of converting a sequence of characters into a sequence of tokens; a program or function that performs lexical analysis is called a lexical analyzer, lexer, or scanner. Parsing, or syntactic analysis, is the process of analyzing a sequence of tokens to determine its grammatical structure with respect to a given formal grammar.) On completion of the analysis, the high-level query has been transformed into an internal representation (a query tree) that is more suitable for processing. For example, an SQL query is translated into an equivalent extended relational algebra expression.

[Figure: a query tree — root (result), intermediate operations, leaves (stored relations).]

Normalization. The input query may be arbitrarily complex, depending on the facilities provided by the language. The goal of normalization is to transform the query into a normalized form to facilitate further processing. With relational languages such as SQL, the most important transformation is that of the query qualification (the WHERE clause), which may be arbitrarily complex. There are two possible normal forms for the predicate, one giving precedence to AND (∧) and the other to OR (∨).

The conjunctive normal form is a conjunction (∧) of disjunctive predicates:
(p11 ∨ p12 ∨ ... ∨ p1n) ∧ ... ∧ (pm1 ∨ pm2 ∨ ... ∨ pmn)
where each pij is a simple predicate.

The disjunctive normal form is a disjunction (∨) of conjunctive predicates:
(p11 ∧ p12 ∧ ... ∧ p1n) ∨ ... ∨ (pm1 ∧ pm2 ∧ ... ∧ pmn)
where each pij is a simple predicate.

The conjunctive normal form is more practical since query qualifications typically include more AND than OR predicates. However, it leads to predicate replication for queries involving many disjunctions and few conjunctions, a rare case. In the disjunctive normal form, the query can be processed as independent conjunctive subqueries linked by unions (corresponding to the disjunctions). However, this form may lead to replicated join and select predicates, because join and select predicates are very often linked to the other predicates by AND. Applying the following rule, with p1 a join or select predicate, results in replicating p1:
p1 ∧ (p2 ∨ p3) ⇔ (p1 ∧ p2) ∨ (p1 ∧ p3)


Example: Consider the following engineering database:
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET)
ASG(ENO, PNO, RESP, DUR)
Consider the following query on the engineering database: "Find the names of employees who have been working on project P1 for 12 or 24 months." The query expressed in SQL is:
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = 'P1'
AND DUR = 12 OR DUR = 24;
The qualification in conjunctive normal form is
EMP.ENO = ASG.ENO ∧ ASG.PNO = 'P1' ∧ (DUR = 12 ∨ DUR = 24)

The qualification in disjunctive normal form is
(EMP.ENO = ASG.ENO ∧ ASG.PNO = 'P1' ∧ DUR = 12) ∨ (EMP.ENO = ASG.ENO ∧ ASG.PNO = 'P1' ∧ DUR = 24)
In the latter form, treating the two conjunctions independently may lead to redundant work.
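The CNF-to-DNF conversion above can be sketched in a few lines of Python. In this illustrative snippet (the predicate strings are just the ones from the example; a real system would manipulate predicate trees, not strings), the qualification is held in conjunctive normal form as a list of disjunctions, and distributing AND over OR yields one conjunctive subquery per combination:

```python
from itertools import product

# Qualification from the example query, already in conjunctive normal form:
# each inner list is a disjunction; the outer list is the conjunction.
cnf = [
    ["EMP.ENO = ASG.ENO"],
    ["ASG.PNO = 'P1'"],
    ["DUR = 12", "DUR = 24"],
]

# Distributing AND over OR yields the disjunctive normal form:
# one conjunctive subquery per choice of one disjunct from each conjunct.
dnf = [list(combo) for combo in product(*cnf)]

for conjunct in dnf:
    print(" AND ".join(conjunct))
```

Note how the join predicate EMP.ENO = ASG.ENO and the select predicate ASG.PNO = 'P1' are replicated into both conjunctive subqueries, exactly the replication discussed above.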

Semantic analysis. Semantic analysis enables rejection of normalized queries for which further processing is either impossible or unnecessary. The main reasons for rejection are that the query is type incorrect or semantically incorrect. When one of these cases is detected, the query is simply returned to the user with an explanation; otherwise, query processing continues.

A query is type incorrect if any of its attribute or relation names are not defined in the global schema, or if operations are being applied to attributes of the wrong type. A query is semantically incorrect if components of it do not contribute in any way to the generation of the result.

Simplification. In this step the query is simplified by eliminating redundancies, e.g., redundant predicates. Redundancies are often due to semantic integrity constraints expressed in the query language. Queries on views are expanded into queries on relations that satisfy certain integrity and security constraints.

Query restructuring. The last step of query decomposition rewrites the query in relational algebra. This is typically divided into two substeps: (1) straightforward transformation of the query from the high-level language into relational algebra, and (2) restructuring of the relational algebra query to improve performance. For clarity it is customary to represent the relational algebra query graphically by an operator tree. An operator tree is a tree in which each leaf node is a relation stored in the database, and each non-leaf node is an intermediate relation produced by a relational algebra operator. The sequence of operations is directed from the leaves to the root, which represents the answer to the query.

What is Query Optimization?


Query optimization is the process of selecting the most efficient query-evaluation plan from among the many strategies usually possible for processing a given query, especially if the query is complex.

The difference in cost (in terms of evaluation time) between a good strategy and a bad strategy is often substantial, and may be several orders of magnitude. Hence, it is worthwhile for the system to spend a substantial amount of time on the selection of a good strategy for processing a query, even if that query is executed only once.
We do not expect users to write their queries so that they can be processed efficiently. Rather, we expect the system to construct a query-evaluation plan that minimizes the cost of query evaluation. This is where query optimization comes into play.

One aspect of optimization occurs at the relational-algebra level, where the system attempts to find an expression that is equivalent to the given expression but more efficient to execute. Another aspect is selecting a detailed strategy for processing the query, such as choosing the algorithm to use for executing an operation, choosing the specific indices to use, and so on.

Example of Query Optimization

BANKING SCHEMA
branch (branch_name, branch_city, assets)
customer (customer_name, customer_street, customer_city)
loan (loan_number, branch_name, amount)
borrower (customer_name, loan_number)
account (account_number, branch_name, balance)
depositor (customer_name, account_number)

Consider the relational-algebra expression for the query "Find the names of all customers who have an account at any branch located in Brooklyn." on the above BANKING schema. One correct answer is:

Π_customer_name (σ_branch_city='Brooklyn' (branch ⋈ (account ⋈ depositor)))

This expression constructs a large intermediate relation, branch ⋈ account ⋈ depositor. However, we are interested in only a few tuples of this relation (those pertaining to branches located in Brooklyn), and in only one of its six attributes. Since we are concerned with only those tuples of the branch relation that pertain to branches located in Brooklyn, we do not need to consider the tuples that do not have branch_city = 'Brooklyn'. By reducing the number of tuples of the branch relation that we need to access, we reduce the size of the intermediate result. Our query is now represented by the following relational-algebra expression, which is equivalent to the original but generates smaller intermediate relations:

Π_customer_name (σ_branch_city='Brooklyn' (branch) ⋈ (account ⋈ depositor))

Figure 2 depicts the initial and transformed expressions. Given a relational-algebra expression, it is the job of the query optimizer to come up with a query-evaluation plan that computes the same result as the given expression, and is the least costly way of generating the result (or, at least, is not much costlier than the least costly way). To choose among different query-evaluation plans, the optimizer has to estimate the cost of each evaluation plan. Computing the precise cost of a plan is usually not possible without actually evaluating it. Instead, optimizers make use of statistical information about the relations, such as relation sizes and index depths, to make a good estimate of the cost of a plan. Disk access, which is slow compared to memory access, usually dominates the cost of processing a query.

To find the least-costly query-evaluation plan, the optimizer needs to generate alternative plans that produce the same result as the given expression, and to choose the least costly one. Generation of query-evaluation plans involves two steps: (1) generating expressions that are logically equivalent to the given expression, and (2) annotating the resultant expressions in alternative ways to generate alternative query-evaluation plans. The two steps are interleaved in the query optimizer: some expressions are generated and annotated, then further expressions are generated and annotated, and so on.

Estimating Statistics of Expression Results

The cost of an operation depends on the size and other statistics of its inputs. Given an expression such as a ⋈ (b ⋈ c), to estimate the cost of joining a with (b ⋈ c) we need estimates of statistics such as the size of b ⋈ c. These estimates are not very accurate, since they are based on assumptions that may not hold exactly. A query-evaluation plan that has the lowest estimated execution cost may therefore not actually have the lowest actual execution cost. However, real-world experience has shown that even if estimates are not precise, the plans with the lowest estimated costs usually have actual execution costs that are either the lowest or close to the lowest.
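As a sketch of how such size estimates are computed, the snippet below applies a common textbook formula for the size of an equi-join, |R ⋈ S| ≈ |R| * |S| / max(V(A, R), V(A, S)), where V(A, R) is the number of distinct values of the join attribute A in R. The cardinalities are hypothetical, chosen only for illustration:

```python
# Textbook estimate for the size of a natural join R ⋈ S on attribute A:
#   |R ⋈ S| ≈ |R| * |S| / max(V(A, R), V(A, S))
# where V(A, R) is the number of distinct values of A appearing in R.
def estimate_join_size(n_r, n_s, v_a_r, v_a_s):
    return (n_r * n_s) // max(v_a_r, v_a_s)

# Hypothetical numbers: account has 10,000 tuples and account_number is its
# key, so V(account_number, account) = 10,000; depositor has 8,000 tuples
# with 5,000 distinct account numbers.
print(estimate_join_size(10_000, 8_000, 10_000, 5_000))  # 8000
```

Since each depositor tuple matches at most one account tuple here (account_number is a key of account), the estimate of 8,000 tuples is exact under the uniformity assumption; in general the formula is only an approximation.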

Compare static and dynamic query optimization techniques


A query may be optimized at different times relative to the actual time of query execution. Optimization can be done statically before executing the query or dynamically as the query is executed.

Static query optimization technique


Static query optimization is done at query compilation time. Thus the cost of optimization may be amortized over multiple query executions. Therefore, this timing is appropriate for use with the exhaustive search method. Since the sizes of the intermediate relations of a strategy are not known until run time, they must be estimated using database statistics.

Dynamic query optimization technique


Dynamic query optimization proceeds at query execution time. At any point of execution, the choice of the best next operation can be based on accurate knowledge of the results of the operations executed previously. Therefore, database statistics are not needed to estimate the sizes of intermediate results, although they may still be useful in choosing the first operations; this avoids the suboptimal strategy choices that errors in static estimates can lead to. The main advantage over static query optimization is that the actual sizes of intermediate relations are available to the query processor, thereby minimizing the probability of a bad choice. The main shortcoming is that query optimization, an expensive task, must be repeated for each execution of the query; this approach is therefore best suited to ad-hoc queries.

Hybrid query optimization attempts to provide the advantages of static query optimization while avoiding the issues generated by inaccurate estimates. The approach is basically static, but dynamic query optimization may take place at run time when a significant difference between the predicted and actual sizes of intermediate relations is detected.

Equivalence Rules

An equivalence rule says that expressions of two forms are equivalent. We can replace an expression of the first form by an expression of the second form, or vice versa (that is, we can replace an expression of the second form by an expression of the first form), since the two expressions generate the same result on any valid database. The optimizer uses equivalence rules to transform expressions into other logically equivalent expressions.
We now list a number of general equivalence rules on relational-algebra expressions. We use θ, θ1, θ2, and so on to denote predicates; L1, L2, L3, and so on to denote lists of attributes; and E, E1, E2, and so on to denote relational-algebra expressions. A relation name r is simply a special case of a relational-algebra expression, and can be used wherever E appears.

1. Conjunctive selection operations can be deconstructed into a sequence of individual selections. This transformation is referred to as a cascade of σ:
σ_θ1∧θ2(E) = σ_θ1(σ_θ2(E))

2. Selection operations are commutative:
σ_θ1(σ_θ2(E)) = σ_θ2(σ_θ1(E))

3. Only the final operation in a sequence of projection operations is needed; the others can be omitted. This transformation can be referred to as a cascade of Π:
Π_L1(Π_L2(...(Π_Ln(E))...)) = Π_L1(E)

4. Selections can be combined with Cartesian products and theta joins.
a. σ_θ(E1 × E2) = E1 ⋈_θ E2 (this is just the definition of the theta join)
b. σ_θ1(E1 ⋈_θ2 E2) = E1 ⋈_θ1∧θ2 E2

5. Theta-join operations are commutative:
E1 ⋈_θ E2 = E2 ⋈_θ E1
Strictly speaking, the order of attributes differs between the left-hand side and the right-hand side, so the equivalence does not hold if the order of attributes is taken into account. A projection operation can be added to one side of the equivalence to reorder the attributes appropriately, but for simplicity we omit the projection and ignore the attribute order in most of our examples. Recall that the natural-join operator is simply a special case of the theta-join operator; hence, natural joins are also commutative.

6. a. Natural-join operations are associative:

(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)

b. Theta joins are associative in the following manner:

(E1 ⋈_θ1 E2) ⋈_θ2∧θ3 E3 = E1 ⋈_θ1∧θ3 (E2 ⋈_θ2 E3)

where θ2 involves attributes from only E2 and E3. Any of these conditions may be empty; hence, it follows that the Cartesian product (×) operation is also associative. The commutativity and associativity of join operations are important for join reordering in query optimization.

7. The selection operation distributes over the theta-join operation under the following two conditions:
a. It distributes when all the attributes in selection condition θ0 involve only the attributes of one of the expressions (say, E1) being joined:
σ_θ0(E1 ⋈_θ E2) = (σ_θ0(E1)) ⋈_θ E2
b. It distributes when selection condition θ1 involves only the attributes of E1 and θ2 involves only the attributes of E2:
σ_θ1∧θ2(E1 ⋈_θ E2) = (σ_θ1(E1)) ⋈_θ (σ_θ2(E2))

8. The projection operation distributes over the theta-join operation under the following conditions:
a. Let L1 and L2 be attributes of E1 and E2, respectively. Suppose that the join condition θ involves only attributes in L1 ∪ L2. Then
Π_L1∪L2(E1 ⋈_θ E2) = (Π_L1(E1)) ⋈_θ (Π_L2(E2))
b. Consider a join E1 ⋈_θ E2. Let L1 and L2 be sets of attributes from E1 and E2, respectively. Let L3 be attributes of E1 that are involved in join condition θ but are not in L1 ∪ L2, and let L4 be attributes of E2 that are involved in θ but are not in L1 ∪ L2. Then
Π_L1∪L2(E1 ⋈_θ E2) = Π_L1∪L2((Π_L1∪L3(E1)) ⋈_θ (Π_L2∪L4(E2)))

9. The set operations union and intersection are commutative:
E1 ∪ E2 = E2 ∪ E1
E1 ∩ E2 = E2 ∩ E1
Set difference is not commutative.

10. Set union and intersection are associative:
(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
(E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)

11. The selection operation distributes over the union, intersection, and set-difference operations:
σ_P(E1 − E2) = σ_P(E1) − σ_P(E2)
The preceding equivalence, with − replaced by either ∪ or ∩, also holds. Further,
σ_P(E1 − E2) = σ_P(E1) − E2
The preceding equivalence, with − replaced by ∩, also holds, but it does not hold if − is replaced by ∪.
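Rule 7.a can be mechanized over an operator tree. The sketch below (hypothetical, minimal node classes, not any real optimizer's API) pushes a selection below a join whenever the predicate's attributes all come from one side:

```python
from dataclasses import dataclass

# Hypothetical, minimal operator-tree nodes for illustration only.
@dataclass
class Relation:
    name: str
    attrs: frozenset

@dataclass
class Select:
    pred_attrs: frozenset  # attributes referenced by the predicate
    pred: str
    child: object

@dataclass
class Join:
    left: object
    right: object

def attrs_of(node):
    if isinstance(node, Relation):
        return node.attrs
    if isinstance(node, Select):
        return attrs_of(node.child)
    return attrs_of(node.left) | attrs_of(node.right)

def push_selection(node):
    """Rule 7.a: sigma_theta(E1 join E2) = sigma_theta(E1) join E2
    when theta references attributes of only one join input."""
    if isinstance(node, Select) and isinstance(node.child, Join):
        j = node.child
        if node.pred_attrs <= attrs_of(j.left):
            return Join(Select(node.pred_attrs, node.pred, j.left), j.right)
        if node.pred_attrs <= attrs_of(j.right):
            return Join(j.left, Select(node.pred_attrs, node.pred, j.right))
    return node  # rule not applicable; leave the tree unchanged

branch = Relation("branch", frozenset({"branch_name", "branch_city", "assets"}))
account = Relation("account", frozenset({"account_number", "branch_name", "balance"}))
expr = Select(frozenset({"branch_city"}), "branch_city = 'Brooklyn'", Join(branch, account))
pushed = push_selection(expr)
print(isinstance(pushed, Join) and isinstance(pushed.left, Select))  # True
```

A real optimizer would apply such rewrites repeatedly, driven by cost estimates; this sketch only shows a single application of one rule.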

12. The projection operation distributes over the union operation:
Π_L(E1 ∪ E2) = (Π_L(E1)) ∪ (Π_L(E2))

This is only a partial list of equivalences. More equivalences involving extended relational operators, such as the outer join and aggregation, are possible.

Examples of Transformations
Illustration of the equivalence rules using the BANKING schema:

branch (branch_name, branch_city, assets)
customer (customer_name, customer_street, customer_city)
loan (loan_number, branch_name, amount)
borrower (customer_name, loan_number)
account (account_number, branch_name, balance)
depositor (customer_name, account_number)

Consider again the relational-algebra expression for the query "Find the names of all customers who have an account at any branch located in Brooklyn.":

Π_customer_name (σ_branch_city='Brooklyn' (branch ⋈ (account ⋈ depositor)))

This was transformed into the following expression,

Π_customer_name (σ_branch_city='Brooklyn' (branch) ⋈ (account ⋈ depositor))

which is equivalent to the original algebra expression but generates smaller intermediate relations. We can carry out this transformation by using rule 7.a (the selection operation distributes over the theta-join operation). Remember that the rule merely says that the two expressions are equivalent; it does not say that one is better than the other.

Multiple equivalence rules can be used, one after the other, on a query or on parts of the query. As an illustration, suppose that we modify the original query to restrict attention to customers who have a balance over $1000. The new relational-algebra query is

Π_customer_name (σ_branch_city='Brooklyn' ∧ balance>1000 (branch ⋈ (account ⋈ depositor)))

We cannot apply the selection predicate directly to the branch relation, since the predicate involves attributes of both the branch and account relations. However, we can first apply rule 6.a (associativity of the natural join) to transform the join branch ⋈ (account ⋈ depositor) into (branch ⋈ account) ⋈ depositor:

Π_customer_name (σ_branch_city='Brooklyn' ∧ balance>1000 ((branch ⋈ account) ⋈ depositor))

Then, using rule 7.a, we can rewrite our query as

Π_customer_name ((σ_branch_city='Brooklyn' ∧ balance>1000 (branch ⋈ account)) ⋈ depositor)

Let us examine the selection subexpression within this expression.
Using rule 1, we can break the selection into two selections, to get the following subexpression:

σ_branch_city='Brooklyn' (σ_balance>1000 (branch ⋈ account))

Both of the preceding expressions select tuples with branch_city = 'Brooklyn' and balance > 1000. However, the latter form provides a new opportunity to apply the "perform selections early" rule, resulting in the subexpression

σ_branch_city='Brooklyn' (branch) ⋈ σ_balance>1000 (account)

The following figure depicts the initial expression and the final expression after all these transformations. We could equally well have used rule 7.b to get the final expression directly, without using rule 1 to break the selection into two selections. In fact, rule 7.b can itself be derived from rules 1 and 7.a.

Minimal Equivalence Rules

A set of equivalence rules is said to be minimal if no rule can be derived from any combination of the others. An expression equivalent to the original expression may be generated in different ways; the number of different ways of generating an expression increases when we use a nonminimal set of equivalence rules. Query optimizers therefore use minimal sets of equivalence rules.

Now consider the following form of our example query:

Π_customer_name ((σ_branch_city='Brooklyn' (branch) ⋈ account) ⋈ depositor)

When we compute the subexpression

σ_branch_city='Brooklyn' (branch) ⋈ account

we obtain a relation whose schema is

(branch_name, branch_city, assets, account_number, balance)

We can eliminate several attributes from this schema by pushing projections down, based on equivalence rules 8.a and 8.b. The only attributes that we must retain are those that either appear in the result of the query or are needed to process subsequent operations. By eliminating unneeded attributes, we reduce the number of columns, and hence the size, of the intermediate result. In our example, the only attribute we need from the join of branch and account is account_number. Therefore, we can modify the expression to

Π_customer_name (Π_account_number (σ_branch_city='Brooklyn' (branch) ⋈ account) ⋈ depositor)

The projection Π_account_number reduces the size of the intermediate join results.

Join Ordering

A good ordering of join operations is important for reducing the size of temporary results; hence, most query optimizers pay a lot of attention to the join order. As mentioned in rule 6.a, the natural-join operation is associative. Thus, for all relations r1, r2, and r3,

(r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)

Although these expressions are equivalent, the costs of computing them may differ. Consider again the expression

Π_customer_name ((σ_branch_city='Brooklyn' (branch) ⋈ account) ⋈ depositor)

We could choose to compute account ⋈ depositor first, and then to join the result with σ_branch_city='Brooklyn' (branch). However, account ⋈ depositor is likely to be a large relation, since it contains one tuple for every account. In contrast,

σ_branch_city='Brooklyn' (branch) ⋈ account

is probably a small relation. To see that it is, note that, since the bank has a large number of widely distributed branches, it is likely that only a small fraction of the bank's customers have accounts in branches located in Brooklyn. Thus, the preceding expression results in one tuple for each account held by a resident of Brooklyn, and the temporary relation that must be stored is smaller than it would have been had we computed account ⋈ depositor first.
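The size argument above can be made concrete with back-of-the-envelope arithmetic. The cardinalities below are hypothetical, chosen only to illustrate why filtering branch first pays off:

```python
# Hypothetical cardinalities for the banking example.
n_branch = 1_000       # branches in total
n_brooklyn = 20        # branches with branch_city = 'Brooklyn'
n_account = 100_000    # accounts, assumed spread evenly across branches
n_depositor = 100_000  # one depositor tuple per account, for simplicity

# Plan 1: (sigma_Brooklyn(branch) join account) join depositor
# The first join keeps only accounts held at Brooklyn branches.
plan1_intermediate = n_account * n_brooklyn // n_branch

# Plan 2: branch join (account join depositor)
# The first join materializes one tuple per account.
plan2_intermediate = n_account

print(plan1_intermediate, plan2_intermediate)  # 2000 100000
```

Under these assumptions the first plan's intermediate result is fifty times smaller, which is exactly the kind of difference a cost-based optimizer tries to detect from its statistics.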
There are other options to consider for evaluating our query. We do not care about the order in which attributes appear in a join, since it is easy to change the order before displaying the result. Thus, for all relations r1 and r2,

r1 ⋈ r2 = r2 ⋈ r1

That is, the natural join is commutative (equivalence rule 5). Using the associativity and commutativity of the natural join (rules 5 and 6), we could rewrite our relational-algebra expression as

Π_customer_name (((σ_branch_city='Brooklyn' (branch)) ⋈ depositor) ⋈ account)

That is, we could compute σ_branch_city='Brooklyn' (branch) ⋈ depositor first and then join the result with account. Note, however, that there are no attributes in common between the branch and depositor schemas, so the join is just a Cartesian product. If there are b branches in Brooklyn and d tuples in the depositor relation, this Cartesian product generates b * d tuples, one for every possible pair of depositor tuple and branch (without regard for whether the account in depositor is maintained at the branch). This Cartesian product would thus produce a large temporary relation, so we would reject this strategy.

6. Consider the following relational algebra (MCA 2nd Year, 2009):

Π_customer_name (σ_branch_city='kolkata' ∧ balance>1000 (Branch ⋈ (Account ⋈ Deposit)))

Draw the expression tree of the above relational algebra and optimize it.

Expression tree of the above relational algebra

Initial expression:
Π_customer_name (σ_branch_city='kolkata' ∧ balance>1000 (Branch ⋈ (Account ⋈ Deposit)))  ----------(1)

By the associativity rule for natural joins (rule 6.a), expression (1) is changed into
Π_customer_name (σ_branch_city='kolkata' ∧ balance>1000 ((Branch ⋈ Account) ⋈ Deposit))  ----------(2)

Here the selection operation distributes over the join because branch_city and balance are both attributes of Branch ⋈ Account. Using this, expression (2) is changed into
Π_customer_name ((σ_branch_city='kolkata' ∧ balance>1000 (Branch ⋈ Account)) ⋈ Deposit)  ----------(3)

Again the selection operation distributes over the join, because branch_city is an attribute of Branch and balance is an attribute of Account. So expression (3) can be changed into
Π_customer_name (((σ_branch_city='kolkata' (Branch)) ⋈ (σ_balance>1000 (Account))) ⋈ Deposit)  ----------(4)

The only attributes that we must retain are those that either appear in the result of the query or are needed to process subsequent operations. By eliminating unneeded attributes we reduce the number of columns, and hence the size, of the intermediate result. So expression (4) can be changed into
Π_customer_name ((Π_account_number ((Π_branch_name (σ_branch_city='kolkata' (Branch))) ⋈ (Π_branch_name,account_number (σ_balance>1000 (Account))))) ⋈ Deposit)  ----------(5)

After optimization, the expression tree is:

Π_customer_name
      |
      ⋈
     /  \
Π_account_number          Deposit
      |
      ⋈
     /  \
Π_branch_name             Π_branch_name, account_number
     |                         |
σ_branch_city='kolkata'   σ_balance>1000
     |                         |
  Branch                   Account

Differentiate between cost-based query optimization and heuristic-based query optimization

Cost based query optimization A cost based query optimizer estimates and compares the costs of executing a query using different execution strategies, and it then chooses the strategy with the lowest cost estimate. This approach is generally referred to as cost-based query optimization. For this approach to work, accurate cost estimates are required so that different strategies can be compared fairly and realistically. In addition, the optimizer must limit the number of execution strategies to be considered; otherwise, too much time will be spent making cost estimates for the many possible execution strategies. The cost functions used in query optimization are estimates and not exact cost functions, so the optimization may select a query execution strategy that is not the optimal (absolute best) one. Cost Components for Query Execution The cost of executing a query includes the following components: 1. Access cost to secondary storage. This is the cost of transferring (reading and writing) data blocks between secondary disk storage and main memory buffers. This is also known as disk I/O (input/output) cost. The cost of searching for records in a disk file depends on the type of access structures on that file, such as ordering, hashing, and primary or secondary indexes. In addition, factors such as whether the file blocks are allocated contiguously on the same disk cylinder or scattered on the disk affect the access cost. 2. Disk storage cost. This is the cost of storing on disk any intermediate files that are generated by an execution strategy for the query. 3. Computation cost. This is the cost of performing inmemory operations on the records within the data buffers during query execution. Such operations include searching for and sorting records, merging records for a join or a sort operation, and performing computations on field values. This is also known as CPU (central processing unit) cost. 4. Memory usage cost. 
This is the cost pertaining to the number of main memory buffers needed during query execution. 5. Communication cost. This is the cost of shipping the query and its results from the database site to the site or terminal where the query originated. In distributed databases, it would also include the cost of transferring tables and results among various computers during query

Heuristic based query optimization A heuristic based query optimizer would use some rules of thumb without finding out whether the cost is reduced by this transformation. This approach is generally referred to as heuristic-based query optimization.

A drawback of cost-based optimization is the cost of optimization itself. Although the cost of query processing can be reduced by clever optimizations, cost-based optimization is still expensive. Hence, many systems use heuristics to reduce the number of choices that must be made in a cost-based fashion. Some systems even choose to use only heuristics, and do not use cost-based optimization at all.

Estimating cost is itself difficult. For large databases, the main emphasis is often on minimizing the access cost to secondary storage; simple cost functions ignore other factors and compare different query execution strategies in terms of the number of block transfers between disk and main memory buffers. For smaller databases, where most of the data in the files involved in the query can be held completely in memory, the emphasis is on minimizing computation cost. In distributed databases, where many sites are involved, communication cost must be minimized as well. It is difficult to include all the cost components in a (weighted) cost function because of the difficulty of assigning suitable weights to them; that is why some cost functions consider only a single factor: disk access.

Outline of a Heuristic Algebraic Optimization Algorithm

1. Break up any SELECT operations with conjunctive conditions into a cascade of SELECT operations. This permits a greater degree of freedom in moving SELECT operations down different branches of the tree.

2. Move each SELECT operation as far down the query tree as is permitted by the attributes involved in its condition. If the condition involves attributes from only one table (a selection condition), the operation is moved all the way down to the leaf node that represents that table. If the condition involves attributes from two tables (a join condition), it is moved to the point in the tree just after the two tables are combined.

3. Rearrange the leaf nodes of the tree using the following criteria. First, position the leaf-node relations with the most restrictive SELECT operations so that they are executed first in the query tree. "Most restrictive" can mean either the SELECT that produces a relation with the fewest tuples or the one that produces the smallest absolute size. Second, make sure that the new ordering of leaf nodes does not introduce CARTESIAN PRODUCT operations; for example, if the two relations with the most restrictive SELECTs do not have a direct join condition between them, it may be desirable to change the order of leaf nodes to avoid Cartesian products.

4. Combine a CARTESIAN PRODUCT operation with a subsequent SELECT operation into a JOIN operation, if the SELECT condition represents a join condition.

5. Break down and move lists of projection attributes down the tree as far as possible, creating new PROJECT operations as needed. Only those attributes needed in the query result and in subsequent operations in the query tree should be kept after each PROJECT operation.

6. Identify subtrees that represent groups of operations that can be executed by a single algorithm.
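Steps 1 and 2 of the algorithm above can be sketched in code. The following is a minimal, illustrative Python sketch on a toy query-tree representation (all class and function names here are invented for this example, not part of any real DBMS): a conjunctive SELECT is cascaded into separate conjuncts, and each conjunct is pushed below a Cartesian product whenever its attributes come from only one branch.

```python
from dataclasses import dataclass

# Toy query-tree nodes (names invented for this illustration).
@dataclass
class Rel:            # leaf: a base table with a known attribute set
    name: str
    attrs: frozenset

@dataclass
class Select:         # sigma: a list of conjunctive conditions, each with the attributes it uses
    conds: list       # [(label, attrs_used)]
    child: object

@dataclass
class Cross:          # Cartesian product
    left: object
    right: object

def attrs_of(node):
    if isinstance(node, Rel):
        return node.attrs
    if isinstance(node, Select):
        return attrs_of(node.child)
    return attrs_of(node.left) | attrs_of(node.right)

def push_selects(node):
    """Steps 1-2: cascade conjunctive SELECTs, then push each conjunct down."""
    if isinstance(node, Select):
        child = push_selects(node.child)
        for label, used in node.conds:      # Step 1: treat each conjunct separately
            child = push_one(label, used, child)
        return child
    if isinstance(node, Cross):
        return Cross(push_selects(node.left), push_selects(node.right))
    return node

def push_one(label, used, node):
    if isinstance(node, Cross):
        if used <= attrs_of(node.left):     # condition mentions only the left branch
            return Cross(push_one(label, used, node.left), node.right)
        if used <= attrs_of(node.right):    # condition mentions only the right branch
            return Cross(node.left, push_one(label, used, node.right))
    return Select([(label, used)], node)    # join condition: stays above the CROSS

tree = Select(
    [("Pname='Aquarius'", frozenset({"Pname"})),
     ("Pnumber=Pno", frozenset({"Pnumber", "Pno"}))],
    Cross(Rel("WORKS_ON", frozenset({"Essn", "Pno", "Hours"})),
          Rel("PROJECT", frozenset({"Pname", "Pnumber", "Plocation", "Dnum"}))))

optimized = push_selects(tree)
# The single-table condition ends up directly above PROJECT;
# the join condition stays above the CROSS node.
```

Running `push_selects` on the toy tree leaves the join condition Pnumber=Pno above the Cartesian product, while the selection Pname='Aquarius' sinks onto the PROJECT leaf, mirroring Step 2.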

An example of Heuristic Based Query Optimization


Use the following COMPANY schema:

Employee (Fname, Minit, Lname, Ssn, Bdate, Address, Sex, Salary, Super_ssn, Dno)
Department (Dname, Dnumber, Mgr_ssn, Mgr_start_date)
Dept_locations (Dnumber, Dlocation)
Works_on (Essn, Pno, Hours)
Project (Pname, Pnumber, Plocation, Dnum)
Dependent (Essn, Dependent_name, Sex, Bdate, Relationship)

Consider the query: Find the last names of employees born after 1957 who work on a project named 'Aquarius'. This query can be specified in SQL as follows:

SELECT Lname
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE Pname='Aquarius' AND Pnumber=Pno AND Essn=Ssn AND Bdate > '1957-12-31';
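The canonical query tree corresponds to the following relational algebra expression: Cartesian products of the FROM tables, a single SELECT carrying the whole WHERE clause, and a final PROJECT on the SELECT list.

```latex
\pi_{\text{Lname}}\Big(
  \sigma_{\text{Pname}=\text{'Aquarius'} \,\wedge\, \text{Pnumber}=\text{Pno}
          \,\wedge\, \text{Essn}=\text{Ssn} \,\wedge\, \text{Bdate}>\text{'1957-12-31'}}
  \big((\text{EMPLOYEE} \times \text{WORKS\_ON}) \times \text{PROJECT}\big)
\Big)
```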

Steps in converting a query tree during heuristic optimization


(a) The initial (canonical) query tree for the SQL query above is shown in Figure a.


(b) Moving SELECT operations down the query tree

Executing this tree directly would first create a very large file containing the CARTESIAN PRODUCT of the entire EMPLOYEE, WORKS_ON, and PROJECT files. That is why the initial query tree is never executed as-is, but is transformed into an equivalent tree that is more efficient to execute. This particular query needs only one record from the PROJECT relation (the one for the 'Aquarius' project) and only the EMPLOYEE records for those whose date of birth is after 1957-12-31. After applying Steps 1 and 2 of the heuristic algebraic optimization algorithm, the initial query tree is transformed into the tree shown in Figure b.
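To see concretely why the canonical tree is never executed directly, consider a toy in-memory version of the query (the sample rows below are invented for this sketch) that compares the size of the full Cartesian product against a filter-first strategy:

```python
from itertools import product

# Invented sample data, just to compare intermediate result sizes.
employee = [{"Ssn": s, "Lname": f"E{s}", "Bdate": b}
            for s, b in [(1, "1950-01-01"), (2, "1960-05-05"), (3, "1962-07-07")]]
works_on = [{"Essn": 1, "Pno": 10}, {"Essn": 2, "Pno": 10}, {"Essn": 3, "Pno": 20}]
project  = [{"Pnumber": 10, "Pname": "Aquarius"}, {"Pnumber": 20, "Pname": "Gemini"}]

# Canonical strategy: full Cartesian product first, then filter.
cartesian = list(product(employee, works_on, project))       # 3 * 3 * 2 = 18 rows
result = [(e, w, p) for e, w, p in cartesian
          if p["Pname"] == "Aquarius" and p["Pnumber"] == w["Pno"]
          and w["Essn"] == e["Ssn"] and e["Bdate"] > "1957-12-31"]

# Pushed-down strategy: apply the restrictive SELECTs first, then combine.
aquarius = [p for p in project if p["Pname"] == "Aquarius"]
young    = [e for e in employee if e["Bdate"] > "1957-12-31"]
joined   = [(e, w, p) for p in aquarius for w in works_on
            if w["Pno"] == p["Pnumber"] for e in young if e["Ssn"] == w["Essn"]]

print(len(cartesian), len(result), len(joined))
```

Even at this tiny scale, the canonical plan materializes 18 intermediate rows to find a single answer; with realistic table sizes the Cartesian product becomes enormous, while the pushed-down plan never builds it.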

(c) Applying the more restrictive SELECT operation first

A further improvement is achieved by switching the positions of the EMPLOYEE and PROJECT relations in the tree. The reason is that Pname='Aquarius' is a more restrictive SELECT operation than Bdate>'1957-12-31', since it is unlikely that many projects share the name 'Aquarius'. After applying Step 3 of the heuristic algebraic optimization algorithm, query tree b is transformed into the tree shown in Figure c.


(d) Replacing CARTESIAN PRODUCT and SELECT with JOIN operations

We can further improve the query tree by replacing any CARTESIAN PRODUCT operation that is immediately followed by a SELECT carrying a join condition with a single JOIN operation. After applying Step 4 of the heuristic algebraic optimization algorithm, query tree c is transformed into the tree shown in Figure d.
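Step 4 is a purely structural rewrite, so it can be sketched in a few lines. The toy node classes below are invented for this illustration (leaves are just table-name strings); the rewrite folds every SELECT-with-join-condition sitting directly on a CROSS into a JOIN node:

```python
from dataclasses import dataclass

# Toy query-tree nodes, invented for this sketch; leaves are table-name strings.
@dataclass
class Cross:
    left: object
    right: object

@dataclass
class Select:
    cond: str          # condition text, e.g. "Pnumber=Pno"
    is_join: bool      # True if the condition is a join condition
    child: object

@dataclass
class Join:
    cond: str
    left: object
    right: object

def fold_joins(node):
    """Step 4: replace SELECT(join condition) over CROSS with a JOIN."""
    if isinstance(node, Select):
        child = fold_joins(node.child)
        if node.is_join and isinstance(child, Cross):
            return Join(node.cond, child.left, child.right)
        return Select(node.cond, node.is_join, child)
    if isinstance(node, Cross):
        return Cross(fold_joins(node.left), fold_joins(node.right))
    return node

# A tree shaped roughly like Figure c: most restrictive SELECT lowest.
tree = Select("Essn=Ssn", True,
              Cross("EMPLOYEE",
                    Select("Pnumber=Pno", True,
                           Cross(Select("Pname='Aquarius'", False, "PROJECT"),
                                 "WORKS_ON"))))

folded = fold_joins(tree)
# Both join-condition SELECTs over CROSS nodes become JOIN nodes;
# the single-table SELECT on PROJECT is left untouched.
```

The selection Pname='Aquarius' survives as a SELECT because its condition involves only one table; only SELECTs whose condition relates the two branches of the CROSS become JOINs.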

(e) Moving PROJECT operations down the query tree

Another improvement is to keep only the attributes needed by subsequent operations in the intermediate relations, by including PROJECT (π) operations as early as possible in the query tree. After applying Step 5 of the heuristic algebraic optimization algorithm, query tree d is transformed into the tree shown in Figure e.
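Step 5 amounts to bookkeeping: at each node, keep only the attributes needed above it, plus any attributes its own join condition uses. A rough Python sketch of that bookkeeping follows (node classes and the simplified join-only tree are invented for this example; selections are omitted for brevity):

```python
from dataclasses import dataclass

# Toy query-tree nodes, invented for this sketch.
@dataclass
class Rel:
    name: str
    attrs: frozenset

@dataclass
class Join:
    cond_attrs: frozenset      # attributes mentioned in the join condition
    left: object
    right: object

@dataclass
class Project:
    attrs: frozenset
    child: object

def push_projects(node, needed):
    """Step 5: project each base relation down to the attributes needed above it."""
    if isinstance(node, Rel):
        keep = needed & node.attrs
        # Insert a PROJECT only if it actually narrows the relation.
        return Project(keep, node) if keep != node.attrs else node
    # A join needs its own condition attributes in addition to `needed`.
    want = needed | node.cond_attrs
    return Join(node.cond_attrs,
                push_projects(node.left, want),
                push_projects(node.right, want))

# Join-only skeleton of the tree in Figure d (selections omitted for brevity).
tree = Join(frozenset({"Essn", "Ssn"}),
            Rel("EMPLOYEE", frozenset({"Fname", "Lname", "Ssn", "Bdate", "Dno"})),
            Join(frozenset({"Pnumber", "Pno"}),
                 Rel("PROJECT", frozenset({"Pname", "Pnumber", "Plocation"})),
                 Rel("WORKS_ON", frozenset({"Essn", "Pno", "Hours"}))))

plan = push_projects(tree, frozenset({"Lname"}))
# EMPLOYEE is projected to {Lname, Ssn}; WORKS_ON to {Essn, Pno};
# PROJECT to {Pnumber} -- only what the result and later joins require.
```

Only Lname reaches the final result, yet Ssn, Essn, Pno, and Pnumber must be retained just long enough to evaluate the joins above them, which is exactly the trade-off Step 5 formalizes.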

