Genetic Algorithm: An Adequate Search Technique in Query Optimization for Emerging Database Applications
Swapnil H. Chandane, Prof. Mahip M. Bartere
ABSTRACT
Query optimization is a complex task: it is a search for the best solution from among the semantically equivalent alternatives that can be generated for a given query. It therefore seems logical to consider query optimization in terms of search algorithms. Today, various search algorithms are applied to find an optimal, or best-fitting, plan for query execution. As queries grow more and more complex, the complexity of the search grows with them, and the available query optimization techniques are inadequate to support some emerging database applications. In recent years, the genetic algorithm has become a serious competitor to the established techniques and an accepted method for difficult optimization problems. This work reviews studies carried out on the application of genetic algorithms to database query optimization. The studies reviewed suggest that genetic algorithms are a viable alternative to existing query optimizers for the optimization of very large queries.
Keywords: Query parser, Query optimizer, Query evaluator, Sailors-Reserves schema, genetic algorithm
1. INTRODUCTION
At present, in the age of information technology, databases have become a necessary and fundamental tool for managing and exploiting the power of information. Because the amount of data in a database grows ever larger as time passes, one of the most important characteristics of a database is its ability to maintain a consistent and acceptable level of performance. The principal mechanism through which a database maintains an optimal level of performance is known as the database query optimizer; without a well-designed query optimizer, even small databases would be noticeably sluggish. The query optimizers for some of the most popular commercial-quality databases are estimated to have required about 50 man-years of development. It should therefore go without saying that the specific processes involved in designing the internal structure of a real-world optimizer can be overwhelmingly complex. Nevertheless, because of the optimizer's paramount importance to the robustness and flexibility of a database, it is worthwhile to survey the theory behind the rudimentary components of a basic, cost-based query optimizer. Throughout this paper, the canonical Sailors-Reserves schema will be used to provide concrete examples of how an optimizer generates, evaluates, and selects query evaluation plans. This schema models the data kept at a hypothetical watercraft rental service that allows sailors with various attributes to make reservations for boats on different days. The specific instance of the Sailors-Reserves schema used here is defined as follows:

Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: date, rname: string)
2. QUERY PROCESSING
Whenever a SQL query is issued, the query is first parsed and then presented to the database's query optimizer before being executed. To visualize the main components of a database query optimizer and how these components interact to produce a query plan that is ready for evaluation, it may be helpful to consider the following figure:
Figure 1 Query processing

This paper will focus specifically on some of the details involved in generating query evaluation plans and in estimating the costs of such plans. Before proceeding any further, let us define exactly what constitutes a query evaluation plan: in general, a query evaluation plan is a tree with relational operators at the intermediate nodes and relations at the leaf nodes. In broad terms, the purpose of a database query optimizer is to find a good evaluation plan for any given query. Typically, an optimizer will consider only a subset of all the possible plans, because the number of possible plans can be extremely large; considering each plan in turn and executing the most optimal one would actually be more time consuming than considering only one plan and executing it, even if that plan were exceedingly sub-optimal. As a matter of practice, then, many query optimizers are designed simply to avoid the poorest of evaluation plans.

Although the actual implementation of a database's query optimizer varies from system to system, in theory, optimizing a SQL query involves three basic steps. First, the SQL must be rewritten in terms of relational algebra. Specifically, the query is treated as a collection of projections (π), selections (σ), and Cartesian products (×), with any remaining operators carried out on the result of the given π-σ-× expression. Next, once such a relational algebra expression has been formed, the optimizer must enumerate various alternative plans for evaluating the expression. Again, a typical optimizer does not consider every possible evaluation plan of a given expression, since this would require excessive overhead, thereby rendering any possible timesaving optimizations moot. Finally, the optimizer must estimate the cost associated with each of the enumerated plans and choose the plan with the lowest estimated cost (or at least avoid the plans with the highest estimated costs) [2][8].

Now let us consider the query optimization process with a concrete SQL example using our Sailors-Reserves schema:

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5

We can represent the previous query in relational algebra as follows:

π[sname](σ[bid=100 ∧ rating>5](Reserves ⋈[sid=sid] Sailors))

This expression is straightforward to evaluate: first we join the Reserves and Sailors relations along the sid attribute, then we select only those tuples that satisfy the conditions bid=100 and rating>5, and finally we project the sname attribute of the resulting tuples. In terms of an extended relational algebra tree, the above expression evaluates to the plan shown in Figure 2, with the selection and projection applied above the join.

Figure 2 Query evaluation plan
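Since a query evaluation plan is simply an operator tree, it can be modeled directly in code. The following Python sketch shows one way to represent the relational algebra expression above as a tree of operator nodes; the class names (Relation, Join, Select, Project) and the string form of the predicates are illustrative assumptions, not part of any particular optimizer.

# Minimal sketch: the example query as a relational algebra tree.
# All class and attribute names here are illustrative assumptions.

class Relation:          # leaf node: a base relation
    def __init__(self, name):
        self.name = name

class Join:              # intermediate node: join two subtrees
    def __init__(self, left, right, condition):
        self.left, self.right, self.condition = left, right, condition

class Select:            # intermediate node: filter tuples by a predicate
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

class Project:           # intermediate node: keep only the listed attributes
    def __init__(self, child, attributes):
        self.child, self.attributes = child, attributes

# pi[sname](sigma[bid=100 AND rating>5](Reserves join[sid=sid] Sailors))
plan = Project(
    Select(
        Join(Relation("Reserves"), Relation("Sailors"), "R.sid = S.sid"),
        "bid = 100 AND rating > 5"),
    ["sname"])

An optimizer's plan enumerator would produce many such trees for the same query (for example, one with the Select pushed below the Join) and compare their estimated costs.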
Figure 3 Query evaluation plan

Note that this plan is semantically equivalent to the plan in Figure 2. In terms of minimizing cost, however, Figure 3 is superior, since we perform our selections on the Sailors and Reserves relations before they are joined. Thus, in most cases, plans in which selections (and projections, and indeed any operator that reduces the number of tuples in a relation) are pushed ahead of joins should be constructed whenever possible. As a final heuristic, an optimizer must consider whether indexes are available for any fields in the given relations of the query, and if so, how these indexes should interact with the heuristics of pipelining and pushing selections [4][6][7]. For example, suppose that there exists an index on the bid attribute of the Reserves relation and an index on the sid attribute of the Sailors relation. Given these two indexes, consider the following query evaluation plan:
Figure 4 Query evaluation plan

Here, we are able to perform the selection bid=100 using the index on bid to retrieve only those tuples that match the given criterion. For each selected tuple, we next retrieve the matching tuples from the Sailors relation using the index on sid. As Figure 4 illustrates, the selected Reserves tuples are not materialized, and the join is pipelined. For each tuple resulting from the join, we perform the selection rating > 5 and then pipeline the result set into the projection. Perhaps the most noticeable aspect of this plan is that the selection rating > 5 is not pushed ahead of the join, a clear violation of one of our basic heuristics. The reason for this violation is that there is no index on the rating attribute of Sailors: if the selection were performed before the join, it would involve scanning through every tuple of the Sailors relation. Making matters worse, once the selection has been performed, we no longer have an index on the sid field of the result of the selection, which would in turn increase the cost of the subsequent join. Thus, pushing selections ahead of joins is a good general heuristic, but it is not always the best approach; in most cases, the existence of a useful index on an attribute takes precedence over pushing selections.

3.1 Heuristic optimizer cost estimation

Now that we have investigated three of the basic heuristics that optimizers use to enumerate plans that are likely not to be exorbitantly costly, it is worthwhile to explore how an optimizer estimates the cost of each plan. The term cost is very broad, and the specific formula used to estimate the cost of a plan varies from system to system. For the purposes of this survey, however, we will use a cost model that incorporates the amount of disk I/O (usually measured as the number of memory pages that must be fetched from disk) and the amount of time used by the CPU while performing the various necessary calculations (such as comparing two pieces of data or copying data from one location to another). Of the two factors that comprise our cost model, the number of page I/Os is by far the more significant. Because executing page I/Os involves the physical movement of the mechanical read-write head around the disk (thereby incurring rotational latency and seek time), even a small number of page I/Os can take a relatively long time, making the cost estimate very large. Nonetheless, for the sake of accuracy, both page I/Os and CPU usage are usually taken into account when estimating the cost of a plan.

Given this cost model, for an optimizer to assign a cost to a plan, it must be able to make two estimations for each part of the relational algebra expression tree. First, for each node in the tree, the optimizer must estimate the cost of performing the corresponding relational operation. The cost of the operation is determined most notably by whether pipelining is used or whether temporary relations must be created to pass the output of an operator to its parent. Second, for each node in the tree, the optimizer must estimate the size of the result and whether or not it is sorted. It is important to realize that the result being estimated is the input for the operation corresponding to the parent of the node under consideration; consequently, the size and sort order are essential, since they will in turn affect the estimation of the size, sort order, and cost for the parent.
To perform both of these important estimations accurately, the optimizer must have intimate knowledge of the various parameters of the input relations, such as the number of memory pages required for a relation or the availability of indexes on certain fields. Fortunately (for most production-level databases, at least), such metadata is readily available, since it is maintained in the database's system catalogs. For this same reason, however, it can be difficult to give a general formula for calculating cost, because any given cost estimation depends on which statistics are actually stored in the system catalogs, and the contents of a database's catalogs vary from system to system. As an example of how the contents of the system catalogs can affect how the cost model is applied, let us examine how a typical optimizer estimates the size of the result computed by an operator on a given input. Consider a basic query of the following form:

SELECT attribute list
FROM relation list
WHERE clause
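Although the specific formula depends on the statistics available in the system catalogs, a hedged illustration of one common approach is the System R-style reduction-factor estimate, in which the estimated result size is the product of the cardinalities of the relations in the FROM clause multiplied by a reduction factor for each term in the WHERE clause. The Python sketch below uses 1/NKeys(attribute) for an equality term on an attribute with known distinct-value statistics and a default factor otherwise; the catalog layout, the Term representation, and the default factor of 0.1 are illustrative assumptions, not values taken from this paper.

from collections import namedtuple

# A WHERE-clause term, reduced to the attribute it mentions and its operator.
Term = namedtuple("Term", ["attr", "op"])

DEFAULT_RF = 0.1   # assumed default reduction factor when no statistic applies

def reduction_factor(term, catalog):
    """Estimate the fraction of tuples satisfying one WHERE term."""
    if term.op == "=" and term.attr in catalog:
        return 1.0 / catalog[term.attr]["nkeys"]   # 1 / number of distinct keys
    return DEFAULT_RF

def estimate_result_size(relation_cardinalities, terms, catalog):
    """System R-style estimate: product of cardinalities times reduction factors."""
    size = 1.0
    for card in relation_cardinalities:
        size *= card
    for term in terms:
        size *= reduction_factor(term, catalog)
    return max(1, int(size))

# Assumed statistics: 100 distinct bid values; Reserves has 1000 tuples, Sailors 500.
catalog = {"bid": {"nkeys": 100}}
terms = [Term("bid", "="), Term("rating", ">")]
print(estimate_result_size([1000, 500], terms, catalog))   # prints 500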
Finishing the crossover on the other two strings in the mating pool gives us the second generation of our population. Clearly, even after just a single generation, our population has improved in terms of total, average, and maximum fitness, and it does not seem presumptuous to suggest that after several more generations the population can be expected to stabilize and converge around the string 11111, which is the optimal answer for which we were searching.

The mechanisms of reproduction and crossover are surprisingly simple to implement; they involve nothing more than random number generation, string copying, and some partial string exchanging. Nonetheless, fitness-based reproduction combined with the structured, though randomized, information exchange of crossover gives genetic algorithms most of their power. But if reproduction and crossover provide the bulk of a genetic algorithm's processing power, then what is the purpose of the mutation operator? Many computer scientists (and even biologists) disagree about the degree of importance of the mutation operator, but classically it plays a decidedly secondary role in the algorithmic process. Mutation is needed because even though reproduction and crossover effectively search through and recombine specific members of the population, occasionally they may become overzealous and lose some potentially useful genetic material. In our previous example, this genetic material corresponds to various arrangements of 1s and 0s at particular string positions. In artificial genetic systems, therefore, the mutation operator protects against the irrecoverable loss of important genetic material that may eventually be needed to lead the algorithm to the globally optimal solution. In practical terms, the mutation operator is implemented by the occasional random alteration of the value of a string position; in the previous example, this equates simply to occasionally changing a 1 to a 0 or vice versa. Although significant research has gone into abstracting other genetic operators and reproductive schemes from the biological realm, the three operators discussed above (reproduction, crossover, and mutation) have continually proved to be both computationally simple and statistically effective in attacking the majority of important optimization problems [3]. A minimal code sketch of these three operators is given after the list below.

3.3 Specific characteristics and snags of genetic query optimization

3.3.1 Usage of a steady-state GA (replacement of the least fit individuals in a population, not whole-generational replacement) allows fast convergence toward improved query plans. This is essential for handling queries in reasonable time.
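As promised above, the following Python sketch puts the three operators together on the 5-bit toy problem, maximizing the number of 1s so that the population converges toward 11111. The population size, mutation rate, and generation count are assumed values chosen for illustration, not parameters taken from this paper.

import random

STRING_LEN, POP_SIZE, MUTATION_RATE, GENERATIONS = 5, 4, 0.01, 20

def fitness(s):
    return s.count("1")   # toy objective: count the 1s in the string

def reproduce(pop):
    """Fitness-proportional (roulette-wheel) selection into a mating pool."""
    weights = [fitness(s) + 0.01 for s in pop]   # small floor so an all-zero
    return random.choices(pop, weights=weights, k=len(pop))   # population survives

def crossover(a, b):
    """One-point crossover: exchange partial strings after a random cut."""
    point = random.randint(1, STRING_LEN - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(s):
    """Occasionally flip a bit, guarding against lost genetic material."""
    return "".join(c if random.random() > MUTATION_RATE
                   else ("0" if c == "1" else "1") for c in s)

pop = ["".join(random.choice("01") for _ in range(STRING_LEN))
       for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    pool = reproduce(pop)
    nxt = []
    for i in range(0, POP_SIZE, 2):          # pair off the mating pool
        a, b = crossover(pool[i], pool[i + 1])
        nxt += [mutate(a), mutate(b)]
    pop = nxt
print(max(pop, key=fitness))                 # typically 11111 after a few generations

Note that this sketch uses whole-generational replacement; the steady-state variant mentioned in 3.3.1 would instead replace only the least fit individuals at each step.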
4. CONCLUSION
To conclude, we reflect on the fact that the purpose of this survey was twofold. One purpose was to explore the fundamental concepts behind the two topics at hand, database query optimization and genetic algorithms, and to condense and explain these key concepts. The other purpose was, given this newly acquired information, to look toward the possibility of integrating these two topics into a challenging and substantive capstone project. To this end, it is encouraging to see that these two seemingly disparate topics do indeed have a point of intersection. Specifically, GAs are especially well suited to tackling the inherently complex process of searching for and selecting an optimal plan out of the population of relational algebra trees generated by the query optimizer. Henceforth, our focus on query optimization and genetic algorithms must shift from information gathering to integration and implementation in the context of a large-scale development project. We have to find a compromise in the parameter settings to satisfy two competing demands:
- Optimality of the query plan
- Computing time
In the current implementation, the fitness of each candidate join sequence is estimated by running the standard planner's join selection and cost estimation code from scratch. To the extent that different candidates use similar subsequences of joins, a great deal of work is repeated. This could be made significantly faster by retaining cost estimates for sub-joins; the problem is to avoid expending unreasonable amounts of memory on retaining that state. At a more basic level, it is not clear that solving query optimization with a GA designed for the traveling salesman problem (TSP) is appropriate. In the TSP case, the cost associated with any substring (partial tour) is independent of the rest of the tour, but this is certainly not true for query optimization. Thus it is questionable whether edge recombination crossover is the most effective recombination procedure for this problem.
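The sub-join caching idea can be sketched as a memo table keyed by join-sequence prefixes, so that candidate sequences sharing a prefix reuse its cost instead of re-planning it. The function names and the placeholder cost model below are assumptions for illustration; a real implementation would call the planner's actual cost estimation code and would need to bound the memory the table consumes.

def plan_cost(join_sequence, memo, step_cost):
    """Cost a join sequence, reusing cached costs for shared prefixes."""
    total = 0.0
    for i in range(1, len(join_sequence) + 1):
        prefix = tuple(join_sequence[:i])
        if prefix not in memo:                    # first time this prefix is seen
            memo[prefix] = total + step_cost(prefix)
        total = memo[prefix]                      # reuse the cached prefix cost
    return total

def step_cost(prefix):
    return 10.0 * len(prefix)    # placeholder for the planner's real estimator

memo = {}
c1 = plan_cost(["R", "S", "T"], memo, step_cost)
c2 = plan_cost(["R", "S", "U"], memo, step_cost)  # reuses ("R",) and ("R", "S")

Keying the memo by the ordered prefix rather than by the set of joined relations reflects the point made above: unlike a partial tour in the TSP, the cost of a partial join sequence depends on how it was built.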
References
[1] D. Beasley, D. R. Bull, and R. R. Martin, "An Overview of Genetic Algorithms: Part 1, Fundamentals," University Computing, vol. 15, no. 2, pp. 170-181, Feb. 1993.
[2] J. C. Freytag, "A Rule-Based View of Query Optimization," in Proceedings of ACM SIGMOD, 1987, pp. 173-180.
[3] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley, 1989.
[4] S. B. Yao, "Optimization of Query Algorithms," ACM Transactions on Database Systems, vol. 4, no. 2, pp. 133-155, June 1979.
[5] J. H. Holland, "Genetic Algorithms," Scientific American, vol. 267, no. 1, pp. 66-72, July 1992.
[6] W. Kim, "On Optimizing an SQL-like Nested Query," ACM Transactions on Database Systems, vol. 7, no. 3, pp. 443-469, Sept. 1982.
[7] G. M. Lohman, D. Daniels, L. M. Haas, R. Kistler, and P. G. Selinger, "Optimization of Nested Queries in a Relational Database," IBM Research Laboratory, San Jose, CA, Tech. Rep. RJ4260, Apr. 1983.
[8] D. McGoveran, "Evaluating Optimizers," Database Programming and Design, pp. 38-49, Jan. 1990.