Cost Estimation in Query Optimization
* The main aim of query optimization is to
choose the most efficient way of implementing
the relational algebra operations at the lowest
possible cost.
* The query optimizer should not depend solely
on heuristic rules; it should also estimate
the cost of executing the different strategies
and choose the strategy with the minimum
cost estimate.
* The cost functions used in query optimization are
estimates, not exact cost functions.
* The cost of an operation depends heavily on
its selectivity, that is, the proportion of the input
that the select operation(s) pass on to the output.
* In general, different algorithms are suitable
for low- and high-selectivity queries.
* For the query optimizer to choose a suitable
algorithm for an operation, an estimate of the
cost of executing that algorithm must be
provided.
* The cost of an algorithm depends on the
cardinality of its input.
* To estimate the cost of different query execution
strategies, the query tree is viewed as containing
a series of basic operations which are linked in
order to perform the query.
* It is also important to know the expected
cardinality of an operation’s output because this
forms the input to the next operation.

Cost Components of Query Execution
The cost of executing the query includes the
following components:
— Access cost to secondary storage.
— Storage cost.
— Computation cost.
— Memory usage cost.
— Communication cost.

Importance of Access cost
Out of the above five cost components, the most
important is the secondary storage access cost.
The emphasis of cost minimization depends
on the size and type of the database application.
For example, in a smaller database the emphasis is
on minimizing the computation cost, because
most of the data in the files involved in the query
can be stored completely in main memory.
For a large database, the main emphasis is on
minimizing the access cost to secondary storage.
* For a distributed database, the emphasis is on
minimizing the communication cost, because many
sites are involved in the data transfer.
* To estimate the cost of various execution
strategies, we must keep track of any information
that is needed for the cost function.
* This information may be stored in the database
catalog, where it is accessed by the query
optimizer.

Information in the System Catalog
The number of tuples in relation R as [nTuples(R)].
The average record size in relation R.
The number of blocks required to store relation R as
[nBlocks(R)].
The blocking factor of relation R (that is, the number of
tuples of R that fit into one block) as [bFactor(R)].
Primary access method for each file.
Primary access attributes for each file.
The number of levels of each multilevel index I (primary,
secondary or clustering) as [nLevelsA(I)].
The number of first-level index blocks as [nBlocksA(I)].
The number of distinct values that appear for
attribute A in relation R as [nDistinctA(R)].
The minimum and maximum possible values for
attribute A in relation R as [minA(R), maxA(R)].
The selectivity of an attribute, which is the fraction of
records satisfying an equality condition on the
attribute.
The selection cardinality of a given attribute A in relation
R as [SCA(R)].
The selection cardinality is the average number of
tuples that satisfy an equality condition on attribute
A.

Cost functions for SELECT Operation
* Linear Search:
— [nBlocks(R)/2], if the record is found.
— [nBlocks(R)], if no record satisfies the condition.
* Binary Search:
— [log2(nBlocks(R))], if the equality condition is on a key attribute,
because SCA(R) = 1 in this case.
— [log2(nBlocks(R))] + [SCA(R)/bFactor(R)] - 1, otherwise.
* Equality condition on Primary key:
— [nLevelsA(I) + 1]
* Equality condition on Non-Primary key:
— [nLevelsA(I) + 1] + [nBlocks(R)/2]

Cost functions for JOIN Operation
* The JOIN operation is the most time-consuming
operation to process.
* An estimate for the size (number of tuples) of the
file that results after the JOIN operation is
required to develop reasonably accurate cost
functions for JOIN operations.
* The JOIN operation defines the relation
containing the tuples of the Cartesian product of
two relations R and S that satisfy a specific
predicate F.

Different strategies for JOIN operations
Strategies                 Cost Estimation
Block nested-loop JOIN     a) nBlocks(R) + (nBlocks(R) * nBlocks(S)),
                              if the buffer has only one block
                           b) nBlocks(R) + [nBlocks(S) * (nBlocks(R)/(nBuffer-2))],
                              if (nBuffer-2) blocks are available for R
                           c) nBlocks(R) + nBlocks(S),
                              if all blocks of R can be read into the database buffer
Indexed nested-loop JOIN   a) nBlocks(R) + nTuples(R) * (nLevelsA(I) + 1),
                              if join attribute A in S is a primary key
                           b) nBlocks(R) + nTuples(R) * (nLevelsA(I) + [SCA(R)/bFactor(R)]),
                              if clustering index I is on attribute A
Sort-merge JOIN            a) nBlocks(R) * [log2(nBlocks(R))] + nBlocks(S) * [log2(nBlocks(S))],
                              for the sort
                           b) nBlocks(R) + nBlocks(S),
                              for the merge
Hash JOIN                  a) 3(nBlocks(R) + nBlocks(S)),
                              if the hash index is in memory
                           b) 2(nBlocks(R) + nBlocks(S)) * [log(nBlocks(S)) - 1] + nBlocks(R) + nBlocks(S),
                              otherwise
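
The SELECT and JOIN cost formulas above can be evaluated directly from catalog statistics. The following Python sketch is illustrative only and rests on several assumptions: the class and function names are hypothetical, the square brackets in the formulas are read as ceilings, nBlocks(R) is derived from nTuples(R) and bFactor(R) by assuming densely packed blocks, the sort term for S uses nBlocks(S), and "only one block" in the block nested-loop case is modelled as nBuffer = 3. The second hash JOIN case is left out because the base of its logarithm is not given here.

```python
import math
from dataclasses import dataclass


@dataclass
class RelationStats:
    """Catalog statistics for one relation (hypothetical helper class)."""
    n_tuples: int  # nTuples(R)
    b_factor: int  # bFactor(R): tuples that fit into one block

    @property
    def n_blocks(self) -> int:
        # nBlocks(R), assuming tuples are packed densely into blocks.
        return math.ceil(self.n_tuples / self.b_factor)


def ceil_log2(n: int) -> int:
    # [log2(n)], with the brackets read as a ceiling (assumption).
    return math.ceil(math.log2(n))


def linear_search_cost(r: RelationStats, found: bool) -> int:
    # [nBlocks(R)/2] if the record is found, nBlocks(R) otherwise.
    return math.ceil(r.n_blocks / 2) if found else r.n_blocks


def binary_search_cost(r: RelationStats, sc: int = 1) -> int:
    # sc stands for SCA(R); sc = 1 is the key-attribute case, where the
    # formula collapses to [log2(nBlocks(R))].
    return ceil_log2(r.n_blocks) + math.ceil(sc / r.b_factor) - 1


def block_nested_loop_cost(r: RelationStats, s: RelationStats,
                           n_buffer: int = 3) -> int:
    # Case (b). With n_buffer = 3 it equals case (a); once all of R fits
    # into the n_buffer - 2 blocks it equals case (c).
    if n_buffer < 3:
        raise ValueError("n_buffer must be at least 3")
    passes_over_s = math.ceil(r.n_blocks / (n_buffer - 2))
    return r.n_blocks + s.n_blocks * passes_over_s


def indexed_nested_loop_cost(r: RelationStats, n_levels: int) -> int:
    # Case (a): the join attribute is the primary key of S and is
    # reached through an index with n_levels levels.
    return r.n_blocks + r.n_tuples * (n_levels + 1)


def sort_merge_cost(r: RelationStats, s: RelationStats) -> int:
    # Sort cost (a) plus merge cost (b), for inputs not already sorted.
    sort = (r.n_blocks * ceil_log2(r.n_blocks)
            + s.n_blocks * ceil_log2(s.n_blocks))
    merge = r.n_blocks + s.n_blocks
    return sort + merge


def hash_join_cost_in_memory(r: RelationStats, s: RelationStats) -> int:
    # Case (a): the hash index is held in memory.
    return 3 * (r.n_blocks + s.n_blocks)


# Made-up statistics, chosen only to make the arithmetic easy to follow.
r = RelationStats(n_tuples=6000, b_factor=30)    # 200 blocks
s = RelationStats(n_tuples=100000, b_factor=50)  # 2000 blocks

estimates = {
    "block nested-loop (nBuffer=52)": block_nested_loop_cost(r, s, 52),
    "indexed nested-loop (2 levels)": indexed_nested_loop_cost(r, 2),
    "sort-merge": sort_merge_cost(r, s),
    "hash (in memory)": hash_join_cost_in_memory(r, s),
}
best = min(estimates, key=estimates.get)
print(best, estimates[best])  # hash (in memory) 6600
```

With these made-up statistics the estimates are 8200, 18200, 25800 and 6600 block accesses respectively, so an optimizer comparing only these four strategies would pick the in-memory hash JOIN. Changing nBuffer or the index depth changes the ranking, which is why the optimizer needs the catalog statistics rather than a fixed rule.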