Genetic Algorithms and The Search For Optimal Database Index Selection
Abstract
The problem of selecting an optimum set of database indexes is NP-complete. Genetic algorithms have been shown to be robust algorithms for searching large spaces for optimal objective function values. Genetic algorithms use historical information to speculate about new areas of the search space with expected improved performance. This paper studies the feasibility of applying genetic algorithms to optimal database index selection.
1. Introduction
In a database file whose records have several attributes, an index mechanism for the file will decrease the cost of transactions against the file. The problem of determining an optimum index, the optimum index selection problem (OISP), has been shown to be NP-complete [Comer 78, Piatetsky 83].
Genetic Algorithms (GA's) are designed for searching large spaces for optimal function
values [Holland 75]. GA's are based on the mechanics of natural selection and combine
survival of the fittest among string structures with randomized information exchange to
form a search algorithm. Even though GA's are random, they are not a simple random
walk. They use historical information to speculate about new areas in the search space with
expected improved performance [Goldberg 89]. GA's are also part of the rule discovery mechanism used by machine learning systems called classifier systems [Booker 87, Holland 86].
In this paper, the theory behind GA's is explained, the difficulty of optimum index selection is discussed, a machine learning model that learns about database indexes is presented, and an implementation of such a model using GA's is described.
2. Genetic Algorithms
The work on genetic algorithms is based on the area of adaptation in natural and artificial
systems [Holland 75]. Adaptation is defined as the change a process undergoes as its structures are progressively modified to improve its performance.
The set A is the space of structures being searched for optimal structures. The set of operators Ω is the means of generating new structures during the search; each operator ω ∈ Ω is a mapping ω: A → P, where P is some set of probability distributions over A. The set I is often defined as a payoff function, an objective function that evaluates a generated member of A; it represents the environmental response. T is the strategy for searching the problem space, a function T: I × A → Ω. Given a member of A and the current environmental response at time t, T determines which operator from Ω is to be applied to A at time t.
The operators of Ω are reproduction, crossover and mutation. The operators reproduction and crossover are applied in one functional step. In crossover, two chromosomes are selected randomly, based on their observed payoff, from the current population A. If the length of the chromosome strings is N, then a random number from 1 to N is selected; this is the crossover point, and the two strings exchange the segments that follow it to form two new strings. The mutation operator introduces a random change at a random point in a chromosome. Mutation is an operator that is normally applied with a low probability.
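As an illustrative sketch only (not code from the original study), the three operators for fixed-length binary strings might look as follows; the function names and the choice of roulette-wheel selection are assumptions.

```python
import random

def reproduce(population, payoffs):
    """Roulette-wheel (fitness-proportionate) selection of one chromosome."""
    pick = random.uniform(0, sum(payoffs))
    running = 0.0
    for chromosome, payoff in zip(population, payoffs):
        running += payoff
        if running >= pick:
            return chromosome
    return population[-1]

def crossover(parent_a, parent_b):
    """Swap the segments that follow a randomly chosen crossover point."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, prob=0.01):
    """Flip each bit with a small probability."""
    return ''.join(bit if random.random() > prob else str(1 - int(bit))
                   for bit in chromosome)
```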
Genetic algorithms differ from more traditional search methods in the following ways
[Goldberg 89]:
• GA's work with a coding of the parameter set, not the parameters themselves.
• GA's search from a population of points, not a single point.
• GA's use payoff (objective function) information, not auxiliary knowledge.
• GA's use probabilistic transition rules, not deterministic rules.
3. Schema
From the description of genetic algorithms above, there is nothing that suggests anything more than an exhaustive search of the structure space. Yet in the reproductive phase, the most fit chromosome string in the initial random population will come to dominate, unless this chromosome is destroyed by mutation or crossover. This seems to indicate something better than trial and error.
The question of what information contained in a given population of strings guides the
search for improvement is answered with the concept of schema.
A schema [Holland 75] is a similarity template that describes a subset of strings with similarities at certain string positions. A schema can be thought of as a pattern matching device that behaves in the following fashion: a schema matches a particular string if at every location a "1" in the schema matches a "1" in the string, a "0" matches a "0" in the string, and a "#" matches either. For example, consider the strings and schemata of length 4. The schema #111 matches the strings {0111, 1111}, and the schema #11# describes the string set {0110, 0111, 1110, 1111}.
The number of possible schemata depends on the cardinality of the alphabet Σ and the length l of the strings and schemata being considered. The schemata space is of size (cardinality(Σ) + 1)^l. In general, a population of structures of size n contains between 2^l and n*2^l schemata.
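A minimal sketch of schema matching, under the binary alphabet and "#" convention described above (the helper names are illustrative):

```python
from itertools import product

def matches(schema, string):
    """True if the string is an instance of the schema ('#' matches either bit)."""
    return all(s == '#' or s == b for s, b in zip(schema, string))

def schemata_of(string):
    """Enumerate the 2**l schemata that a single string of length l belongs to."""
    for mask in product((False, True), repeat=len(string)):
        yield ''.join('#' if m else bit for m, bit in zip(mask, string))

# The schema #11# describes {0110, 0111, 1110, 1111}:
assert matches('#11#', '0111') and not matches('#11#', '0101')
# A single string of length 4 is an instance of 2**4 schemata:
assert len(set(schemata_of('0111'))) == 2 ** 4
```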
Since the number of schemata grows exponentially with the length of the chromosome, it is not possible to keep a list of all possible schemata and their average measures of fitness. The power of the genetic algorithm is derived from its ability to manipulate a large flow of information in a manner that is algorithmically feasible. This is accomplished by the genetic operators themselves, which implicitly process a large number of schemata while manipulating only the strings of the population.
Reproduction has a very simple effect on the schemata population. Since more highly fit strings have a higher probability of selection, a chromosome with a good schema will have a high reproductive rate, and an ever increasing number of chromosomes carrying this schema will occur in the population.
Crossover leaves a schema intact if it does not cut through the schema, but will affect the schema if it does. For example, consider the schemata 1###0 and ##11#: crossover is very likely to disrupt the former, while the latter is not likely to be destroyed. As a result, schemata of short defining length are most likely to be unaffected by the crossover operator and are reproduced at a good rate by the reproduction operator.
Mutation, being a low probability operator, does not affect a particular schema very frequently. Mutation has the effect of introducing change to particular gene positions that may have become fixed in value, and it helps to prevent the loss of potentially important genetic material.
Under the genetic operators, highly fit schemata of short defining length, called building blocks [Goldberg 89], are propagated from generation to generation in exponentially increasing samples of the observed best. It is this propagation of building blocks, combined with the crossover operator, that gives the genetic algorithm its ability to achieve improved performance.
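This growth behavior is commonly summarized by the schema theorem of [Holland 75]; a standard statement, following [Goldberg 89] and added here for reference, is that the expected number m(H, t) of instances of a schema H with defining length δ(H), order o(H) and observed fitness f(H) obeys

```latex
m(H, t+1) \;\ge\; m(H, t)\,\frac{f(H)}{\bar{f}}
  \left[\, 1 - p_c\,\frac{\delta(H)}{l-1} - o(H)\,p_m \right]
```

where f̄ is the average population fitness, l the string length, and p_c and p_m the crossover and mutation probabilities; short, low-order, highly fit schemata therefore receive exponentially increasing trials.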
4. The Optimum Index Selection Problem
Given a file on secondary storage whose records have several attributes, it is often necessary to build an index mechanism to decrease the cost of accessing the file. The index selection depends on the usage of the file and may not be static: the demands on the file change with the set of queries using the information in the file. For the purposes of this paper, it is assumed that the types of queries being made against the file are not known in advance.
Formally, the Optimum Index Selection Problem (OISP) is defined as follows: given a file F with n records and k attributes, and an integer p, does there exist an indexing set for F with size no more than p?
OISP has been shown to be NP-complete for files of degree d, d ≥ 2 [Comer 78]. A program to solve OISP on a file of k attributes might require 2^k steps (or worse), and adding just one more attribute to the file may double the running time. Such a program is practical only for small values of k.
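To make the exponential growth concrete, a brute-force search would have to score every one of the 2^k candidate index sets, as in the following sketch; cost_fn and best_index_set are illustrative names, not part of the original study.

```python
from itertools import combinations

def best_index_set(k, cost_fn):
    """Exhaustively score all 2**k subsets of k attributes.

    cost_fn maps a tuple of attribute positions to an estimated transaction
    cost; the candidate space doubles with each attribute added to the file.
    """
    candidates = [subset for r in range(k + 1)
                  for subset in combinations(range(k), r)]
    return min(candidates, key=cost_fn)
```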
5. A Learning System Model
In a traditional database management system (DBMS), the user query is sent directly to the Query Processor unit, which determines the best access method (i.e., an access method that minimizes the transaction cost) for processing the query. To determine the best access method, this unit examines the existing set of indexes, which is specified by the user at the time the database is created. The problem with this strategy is that over the life cycle of the database, the users' request patterns may change. Therefore, the existing set of indexes may need to change dynamically along with the users' request patterns. Here we present a model of a learning system which observes the pattern of user requests and decides on which attributes indexing is profitable. This information is then passed to the Query Processor unit. Figure 1 illustrates this concept.
Figure 1. The learning system observes user queries and passes an indexing schema to the Query Processor and access methods of the DBMS.
6. An Implementation
In order to utilize GA's in an empirical approximation of a solution to the optimal index set problem, a computer model of a library database was created. The following tables describe the physical and logical specifications of the database, whose records have five attributes: Classification, ISBN, Title, Subject and Author.
The genetic code utilized by the system is based on a simple five-bit chromosome string. There is a one-to-one relationship between a chromosome bit (an allele) and the status of an index on an attribute, such that the chromosome string 01011 indicates the presence of indexes on the attributes ISBN, Subject and Author. Note that a user query is also represented as a five-bit string. For example, the query "Find the ISBN of the book titled 'Introduction to Databases' by J.D. Ullman" is represented by 00101.
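A sketch of this encoding is given below; the bit ordering (Classification, ISBN, Title, Subject, Author) is inferred from the 01011 and 00101 examples in the text and may differ from the original specification tables.

```python
# Assumed bit order, inferred from the paper's 01011 / 00101 examples.
ATTRIBUTES = ("Classification", "ISBN", "Title", "Subject", "Author")

def decode(chromosome):
    """Return the set of attributes whose index bit is set in the chromosome."""
    return {attr for attr, bit in zip(ATTRIBUTES, chromosome) if bit == '1'}

print(decode("01011"))   # -> indexes on ISBN, Subject and Author
```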
To model the performance of the GA, the fitness function that assigns a reward to a chromosome is defined as follows:
Payoff = 1 - (number_of_no_index_matches * PENALTY) - (error_indexes * 0.5 * PENALTY),
where PENALTY = 1 / (number of attributes), number_of_no_index_matches is the number of attributes in the user query which have no index, and error_indexes is the number of attributes which have indexes but are not included in the query. For example, for the index schema 01011 and the query 00101, number_of_no_index_matches = 1 (no index for the attribute Title) and error_indexes = 2 (unused indexes on the ISBN and Subject attributes). Note that the Payoff function is maximum when number_of_no_index_matches and error_indexes are both zero.
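A direct transcription of this fitness function (the function name payoff is illustrative):

```python
def payoff(index_chromosome, query_chromosome):
    """Payoff = 1 - no_index_matches*PENALTY - error_indexes*0.5*PENALTY."""
    penalty = 1.0 / len(index_chromosome)
    no_index_matches = sum(q == '1' and i == '0'
                           for i, q in zip(index_chromosome, query_chromosome))
    error_indexes = sum(i == '1' and q == '0'
                        for i, q in zip(index_chromosome, query_chromosome))
    return 1.0 - no_index_matches * penalty - error_indexes * 0.5 * penalty

print(payoff("01011", "00101"))   # the example above: 1 - 0.2 - 0.2 = 0.6
```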
Queries are generated against the database with variable probabilities that are entered at runtime. A query generated using these probabilities may access more than one attribute of the database. For the above database we considered the following set of probabilities for the attributes: P(Classification) = P(ISBN) = P(Author) = P(Subject) = 0.01, and P(Title) = 0.96.
The computer model used these parameters to generate a simulation of the utilization of such a database. Every time a chromosome string was evaluated, a query was generated by the system and matched against the alleles of the chromosome; an objective value was then assigned to the chromosome string according to the fitness function described above.
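The evaluation step might then look like the following sketch, which reuses the ATTRIBUTES and payoff helpers sketched earlier; the probability table reflects the experiment described above, and generate_query is an illustrative name.

```python
import random

# Per-attribute query probabilities for the experiment described above.
QUERY_PROBS = {"Classification": 0.01, "ISBN": 0.01, "Author": 0.01,
               "Subject": 0.01, "Title": 0.96}

def generate_query():
    """Build a five-bit query string; each attribute is requested with its probability."""
    return ''.join('1' if random.random() < QUERY_PROBS[attr] else '0'
                   for attr in ATTRIBUTES)

def evaluate(index_chromosome):
    """Objective value of an index chromosome against one freshly generated query."""
    return payoff(index_chromosome, generate_query())
```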
The results of the above experiment, along with others in which the attribute probabilities are changed, are shown in Figure 2. The X-axis is the number of generations, and the Y-axis is the average Payoff for each set of secondary indexes.
Figure 2. Average Payoff (Y-axis) versus generation (X-axis, 0-20) for two experiments, each with population size 30 and crossover probability 0.6; the first uses mutation probability 0.0, the second mutation probability 0.01 with attribute probabilities P(3)=0.05, P(4)=0.30, P(5)=0.30, P(6-10)=0.0.
7. Conclusions
The application of genetic algorithms to the search of optimal indexes was explored in this
study. The results provide some insight into the application of adaptive, if not machine
learning systems to dynamic database systems. One can envision a database system
attached to a GA like system to perform physical and logical optimization of the database.
This system may use the GA's as the basis for the discovery of rules that affect the database
performance. Given the length of genetic memory needed to maintain appropriate
reasoning about indexes, the encoding of chromosome structure must be made more
sophisticated. The results obtained in this study should motivate further study of the
application of adaptive planning and machine learning to database optimization.
References
[Booker 87], Booker, L.B., Goldberg, D.E., Holland, J.H., Classifier Systems and Genetic
Algorithms. Cognitive Science and Machine Learning Laboratory, Technical
Report No. 8. The University of Michigan, Ann Arbor, MI.
[Comer 78], Comer, D. The difficulty of optimum index selection. ACM Transactions on
Database Systems, 3(4), 440-445.
[Goldberg 89], Goldberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley.
[Grefenstette 85], Grefenstette, J.J., Gopal, R., Rosmaita, B.J., & Van Gucht, D. Genetic Algorithms for the Traveling Salesman Problem. In Proceedings of an International Conference on Genetic Algorithms and Their Applications (pp. 136-140), Carnegie-Mellon University, Pittsburgh, PA.
[Holland 75], Holland, J.H. Adaptation in natural and artificial systems. The University of
Michigan Press, Ann Arbor, MI.
[Holland 86], Holland, J.H. Escaping brittleness: The possibilities of general purpose learning algorithms applied to parallel rule-based systems. In R.S. Michalski, J.G. Carbonell, & T.M. Mitchell (Eds.), Machine Learning II (pp. 593-623). Los Altos, CA: Morgan Kaufmann.