Database Tuning
Database tuning describes a group of activities used to optimize and homogenize the performance of a database.
The goal is to maximize use of system resources to perform work as efficiently and
rapidly as possible. Most systems are designed to manage work efficiently, but it is
possible to greatly improve performance by customizing settings and the configuration
for the database and the DBMS being tuned.
DBMS tuning refers to tuning of the DBMS and the configuration of the memory and
processing resources of the computer running the DBMS. This is typically done through
configuring the DBMS, but the resources involved are shared with the host system.
Tuning the DBMS can involve setting the recovery interval (the time needed to restore the
state of the data to a particular point in time), assigning parallelism (the breaking up of work
from a single query into tasks assigned to different processing resources), and choosing the
network protocols used to communicate with database consumers.
Memory is allocated for data, execution plans, procedure cache, and work space. It is
much faster to access data in memory than data on storage, so maintaining a sizable
cache of data makes activities perform faster. The same consideration is given to work
space. Caching execution plans and procedures means that they are reused instead of
recompiled when needed. It is important to take as much memory as possible, while
leaving enough for other processes and the OS to use without excessive paging of
memory to storage.
1. Algorithms for implementing SELECT Operation
• S1: Linear search (brute force)
– Example (OP3): σDNO=5(EMPLOYEE)
– Retrieve every record in the file and test whether its attribute values satisfy the
selection condition: an expensive approach.
– Cost: b/2 on average if the selection condition is on a key attribute, and b if it is not.
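A minimal sketch of the brute-force approach in Python, treating the file as an in-memory list of blocks (each block a list of record dictionaries); all names here are illustrative, not taken from the source:

# S1: linear search (brute force): scan every block and test every record.
def linear_select(blocks, attribute, value, is_key=False):
    result = []
    for block in blocks:                 # one "block access" per iteration
        for record in block:
            if record.get(attribute) == value:
                result.append(record)
                if is_key:               # a key match is unique: stop early,
                    return result        # giving the average cost of b/2 blocks
    return result                        # non-key attribute: all b blocks are read

employee_blocks = [
    [{"SSN": 1, "DNO": 5}, {"SSN": 2, "DNO": 4}],
    [{"SSN": 3, "DNO": 5}, {"SSN": 4, "DNO": 1}],
]
print(linear_select(employee_blocks, "DNO", 5))   # OP3: select DNO=5 from EMPLOYEE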
• S2: Binary search
– An equality comparison on a key attribute on which the file is ordered.
– Cost: approximately log2(b).
• S3: Using a primary index (or hash key) on an equality comparison
– An equality comparison on a key attribute with a primary index (or hash key).
– This condition retrieves a single record (at most).
– Cost: primary index: b_ind/2 + 1 (hash key: 1 bucket access if there is no collision).
• S4: Using a primary index to retrieve multiple records
– The comparison condition is >, >=, <, or <= on a key field with a primary index.
– Use the index to find the record satisfying the corresponding equality condition
(DNUMBER=5), then retrieve all subsequent records in the (ordered) file.
– For the condition (DNUMBER < 5), retrieve all the preceding records instead.
– The method is also used for range queries (i.e. queries that retrieve records in a certain
range).
– Cost: b_ind/2 + ?, where ‘?’ can be determined if the number of duplicates is known.
• S5: Using a clustering index to retrieve multiple records:
– The method can be used to retrieve a single record if the indexing field is a key,
or to retrieve multiple records if the indexing field is not a key.
– It can also be used for comparisons involving >, >=, <, or <=, and hence for range
queries.
– Cost to retrieve: on a key = height + 1; on a non-key = height + 1 (extra level) + ?;
for a comparison = (height - 1) + ? + ?.
• Many search methods can be used for a complex selection that involves a conjunctive condition:
S7 through S9.
• S7: Conjunctive selection using an individual index
– If an attribute involved in any single simple condition in the conjunctive condition has an
access path that permits the use of one of the methods S2 to S6, use that condition to
retrieve the records.
– Then check whether each retrieved record satisfies the remaining simple conditions in the
conjunctive condition.
• S8: Conjunctive selection using a composite index:
– If two or more attributes are involved in equality conditions in the conjunctive condition
and a composite index (or hash structure) exists on the combined fields, the index can be
used directly.
– Example: if an index has been created on the composite key (ESSN, PNO) of the
WORKS_ON file, we can use the index directly to evaluate
– (OP5): σESSN=‘123456789’ AND PNO=10(WORKS_ON).
• S9: Conjunctive selection by intersection of record pointers
– If the secondary indexes are available on more than one of the fields involved in simple
conditions in the conjunctive condition, and if the indexes include record pointers (rather
than block pointers), then each index can be used to retrieve the set of record pointers that
satisfy the individual condition.
– The intersection of these sets of record pointers gives the record pointers that satisfy the
conjunctive condition.
– If only some of the conditions have secondary indexes, each retrieved record is further
tested to determine whether it satisfies the remaining conditions.
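A sketch of S9 in Python, assuming each secondary index is modelled as a dictionary mapping an attribute value to a set of record pointers (record ids); the relation, index, and helper names are illustrative:

# S9: intersect the pointer sets returned by each usable secondary index,
# then fetch only the surviving records and test any remaining conditions.
def conjunctive_select(indexes, conditions, fetch_record, remaining_check=None):
    pointer_sets = [indexes[attr].get(val, set()) for attr, val in conditions]
    candidates = set.intersection(*pointer_sets) if pointer_sets else set()
    records = [fetch_record(rid) for rid in candidates]
    if remaining_check:                       # conditions that have no index
        records = [r for r in records if remaining_check(r)]
    return records

# Illustrative data: WORKS_ON with secondary indexes on ESSN and PNO.
works_on = {0: {"ESSN": "123456789", "PNO": 10}, 1: {"ESSN": "123456789", "PNO": 20}}
indexes = {"ESSN": {"123456789": {0, 1}}, "PNO": {10: {0}, 20: {1}}}
print(conjunctive_select(indexes, [("ESSN", "123456789"), ("PNO", 10)],
                         works_on.__getitem__))     # OP5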
• Commercial systems: Informix uses S9. Sybase ASE does it using bitmap operations. Oracle 8
uses several techniques for intersection of record pointers (“hash join of indexes” and “AND bitmap”).
Microsoft SQL Server implements intersection of record pointers by index join.
2. Algorithms for implementing JOIN Operation
• Join is a time-consuming operation. We will consider only the natural join operation here.
• J1: Nested-loop join (brute force)
– For each record t in R (outer loop), retrieve every record s from S (inner loop) and test
whether the two records satisfy the join condition t[A] = s[B].
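A minimal sketch of the nested-loop join in Python on small in-memory relations (the real algorithm works block by block; this tuple-at-a-time version only illustrates the double loop, and all names are illustrative):

# J1: nested-loop join of R and S on the condition R.A = S.B.
def nested_loop_join(R, S, A, B):
    result = []
    for t in R:                 # outer loop over R
        for s in S:             # inner loop over S
            if t[A] == s[B]:    # test the join condition t[A] = s[B]
                result.append({**t, **s})
    return result

employees = [{"SSN": 1, "DNO": 5}, {"SSN": 2, "DNO": 4}]
departments = [{"DNUMBER": 5, "DNAME": "Research"}]
print(nested_loop_join(employees, departments, "DNO", "DNUMBER"))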
• J2: Single-loop join (using an access structure to retrieve the matching records)
– If an index (or hash key) exists for one of the two join attributes (e.g. B of S), retrieve
each record t in R, one at a time (single loop), and then use the access structure to retrieve
directly all matching records s from S that satisfy s[B] = t[A].
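A sketch of J2 in Python, assuming a dictionary stands in for the index (or hash key) on the join attribute B of S; names are illustrative:

# J2: single-loop join using an access structure on the join attribute B of S.
def index_join(R, S_index, A):
    result = []
    for t in R:                                  # single loop over R
        for s in S_index.get(t[A], []):          # index lookup instead of scanning S
            result.append({**t, **s})
    return result

departments_by_dnumber = {5: [{"DNUMBER": 5, "DNAME": "Research"}]}
print(index_join([{"SSN": 1, "DNO": 5}], departments_by_dnumber, "DNO"))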
• J3. Sort-merge join:
– If the records of R and S are physically sorted (ordered) by value of the join attributes A
and B, respectively, we can implement the join in the most efficient way.
– Both files are scanned concurrently in order of the join attributes, matching the records
that have the same values for A and B.
– If the files are not sorted, they may be sorted first by using external sorting.
– Pairs of file blocks are copied into memory buffers in order, and the records of each file are
scanned only once for matching with the other file, provided A and B are key attributes.
– The method is slightly modified in the case where A and B are not key attributes.
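A sketch of the merge phase in Python, assuming both inputs are already sorted on the join attributes and that A and B are key attributes (the non-key case needs the modification mentioned above); names are illustrative:

# J3: sort-merge join, assuming R is sorted on A, S is sorted on B,
# and both A and B are key attributes (no duplicate join values).
def sort_merge_join(R, S, A, B):
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):     # each input is scanned only once
        if R[i][A] < S[j][B]:
            i += 1
        elif R[i][A] > S[j][B]:
            j += 1
        else:                            # matching join values: combine the records
            result.append({**R[i], **S[j]})
            i += 1
            j += 1
    return result

R = sorted([{"A": 1, "x": "r1"}, {"A": 3, "x": "r2"}], key=lambda r: r["A"])
S = sorted([{"B": 3, "y": "s1"}, {"B": 4, "y": "s2"}], key=lambda s: s["B"])
print(sort_merge_join(R, S, "A", "B"))   # matches on value 3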
• J4: Hash-join
– The records of files R and S are both hashed to the same hash file using the same hashing
function on the join attributes A of R and B of S as hash keys.
• Partitioning Phase
– First, a single pass through the file with fewer records (say, R) hashes its records to the
hash file buckets.
– Assumption: The smaller file fits entirely into memory buckets after the first phase.
• If the above assumption is not satisfied, the method becomes more complex, and a number of
variations have been proposed to improve efficiency: partition-hash join and hybrid hash join.
• Probing Phase
– A single pass through the other file (S) then hashes each of its records to probe the
appropriate bucket, and that record is combined with all matching records from R in that
bucket.
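A sketch of the two phases in Python under the stated assumption that the smaller file R fits in the in-memory buckets; the hash function and all names are illustrative:

# J4: hash join of R and S on R.A = S.B.
def hash_join(R, S, A, B, n_buckets=8):
    h = lambda v: hash(v) % n_buckets
    # Partitioning phase: one pass over the smaller file R fills the buckets.
    buckets = [[] for _ in range(n_buckets)]
    for t in R:
        buckets[h(t[A])].append(t)
    # Probing phase: one pass over S; each record probes only its own bucket.
    result = []
    for s in S:
        for t in buckets[h(s[B])]:
            if t[A] == s[B]:
                result.append({**t, **s})
    return result

print(hash_join([{"A": 5, "x": "r"}], [{"B": 5, "y": "s"}, {"B": 6, "y": "t"}], "A", "B"))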
• Commercial systems: Sybase ASE supports single-loop join and sort-merge join. Oracle 8
supports page-oriented nested-loop join, sort-merge join, and a variant of hybrid hash join. IBM
DB2 supports single-loop join, sort-merge join, and hybrid hash join. Microsoft SQL Server supports
single-loop join, sort-merge join, hash join, and a technique called hash teams. Informix supports
nested-loop, single-loop, and hybrid hash joins.
3. Algorithms for implementing PROJECT Operation
If the attribute list of the projection operation includes a key of the relation, the result will have
the same number of tuples, but with only the values of the attributes in the list.
In the other case (the attribute list does not include a key), duplicates must be eliminated:
– Remove the unwanted attributes (those not specified in the projection).
– Scan EMPLOYEE and produce a set of tuples that contain only the desired attributes.
– Sort this set of tuples using the combination of all its attributes as the key for sorting.
– Scan the sorted result, comparing adjacent tuples, and discard duplicates.
– Alternatively, hashing can be used: as each record is hashed (using a hash function on the
attribute list of the projection) and inserted into a bucket of the hash file in memory, it is
checked against the records already in the bucket.
– If it is a duplicate, it is not inserted.
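A sketch in Python of both duplicate-elimination strategies for PROJECT when the attribute list does not include a key; relation and attribute names are illustrative:

# PROJECT with duplicate elimination: sort-based and hash-based variants.
def project_sort(records, attrs):
    projected = sorted(tuple(r[a] for a in attrs) for r in records)
    result = []
    for row in projected:                       # duplicates are adjacent after sorting
        if not result or row != result[-1]:
            result.append(row)
    return result

def project_hash(records, attrs):
    buckets = {}                                # dict stands in for the in-memory hash file
    for r in records:
        row = tuple(r[a] for a in attrs)
        buckets.setdefault(hash(row), set()).add(row)   # a duplicate is not inserted twice
    return [row for bucket in buckets.values() for row in bucket]

emp = [{"FNAME": "John", "SEX": "M"}, {"FNAME": "Anna", "SEX": "F"}, {"FNAME": "John", "SEX": "M"}]
print(project_sort(emp, ["FNAME", "SEX"]))
print(project_hash(emp, ["FNAME", "SEX"]))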
Combining Operations
• Materialization Alternative:
– Execute a single operation at a time, generating a temporary file that is used as the input
to the next operation.
– OP8: compute σSEX=‘M’(EMPLOYEE) and store the result in a temporary file. Then join this
temporary file with DEPARTMENT and store the result in a new temporary file. Finally,
compute the projection as the result file. So there are 2 input files, 2 temporary files, and
a result file.
– This is a time-consuming approach because it generates and stores many temporary files.
We first discuss how to transform a relational algebra expression into an equivalent one, and
then we discuss the generation of the query execution plan.
The query tree is a data structure that represents the relational algebra expression in the
query optimization process. The leaf nodes in the query tree correspond to the input
relations of the query. The internal nodes represent the operators in the query. When
executing the query, the system executes an internal node operation whenever its
operands are available, and then replaces the internal node by the relation obtained
from that execution.
There are many rules that can be used to transform relational algebra operations
into equivalent ones. We state here some useful rules for query optimization.
Commutativity of Join and Cartesian Product
E1 ⋈F E2 ≡ E2 ⋈F E1 and E1 × E2 ≡ E2 × E1
Note that the Natural Join operator is a special case of Join, so Natural Joins are also
commutative.
Associativity of Join and Cartesian Product operations
(E1 ⋈F1 E2) ⋈F2 E3 ≡ E1 ⋈F1 (E2 ⋈F2 E3)
The Join operation is associative in this manner when F1 involves attributes from only E1
and E2, and F2 involves attributes from only E2 and E3. The Cartesian Product is associative
unconditionally.
Cascade of Projection
πX1(πX2(...(πXn(E))...)) ≡ πX1(E), provided X1 ⊆ X2 ⊆ ... ⊆ Xn
Cascade of Selection
σF1∧F2∧...∧Fn(E) ≡ σF1(σF2(...(σFn(E))...))
Commutativity of Selection
σF1(σF2(E))≡σF2(σF1(E))
Commuting Selection with Projection
πX(σF(E))≡σF(πX(E))
This rule holds if the selection condition F involves only the attributes in set X.
Selection with Cartesian Product and Join
If all the attributes in the selection condition F involve only the attributes of one
of the expressions, say E1, then the Selection and Join can be combined as follows:
σF(E1 ⋈C E2) ≡ (σF(E1)) ⋈C E2
The same rule applies if the Join operation is replaced by a Cartesian Product operation.
Commuting Projection with Join and Cartesian Product
Let X and Y be sets of attributes from E1 and E2, respectively. If the join condition
involves only attributes in XY (the union of the two sets), then:
πXY(E1 ⋈C E2) ≡ (πX(E1)) ⋈C (πY(E2))
The same rule applies when the Join is replaced by a Cartesian Product.
Commuting Selection with Set Operations
The Selection commutes with all three set operations (Union, Intersection, Set Difference):
σF(E1 ∪ E2) ≡ σF(E1) ∪ σF(E2)
The same rule applies when the Union is replaced by Intersection or Set Difference.
Commuting Projection with Union
πX(E1 ∪ E2) ≡ πX(E1) ∪ πX(E2)
Commutativity of set operations: Union and Intersection are commutative, but
Set Difference is not.
Associativity of set operations: Union and Intersection are associative, but Set
Difference is not.
Consider the following query on the COMPANY database: “Find the names of employees born
after 1967 who work on a project named ‘Greenlife’.”
SELECT Name
FROM EMPLOYEE E, JOIN J, PROJECT P
WHERE E.EID = J.EID and PCode = Code and Bdate > ’31-12-1967’ and P.Name
= ‘Greenlife’;
Here are the steps of an algorithm that utilizes the equivalence rules to transform the query
tree. The previous example illustrates the transformation of a query tree using this algorithm.
Query optimizers use the above equivalence rules to generate an enumeration of expressions
that are logically equivalent to the given query expression. However, generating expressions is
just one part of the optimization process. As mentioned above, the evaluation plan
includes the detailed algorithm for each operation in the expression and specifies how the
execution of the operations is coordinated. Figure 6 shows an evaluation plan.
As we know, the output of the Parsing and Translating step in query processing is a
relational algebra expression. For a complex query, this expression consists of several
operations and involves various relations. Thus the evaluation of the expression can be
very costly in terms of both time and memory space. Now we consider how to evaluate
an expression containing multiple operations. The obvious way is simply to evaluate
one operation at a time, in an appropriate order. The result of each individual evaluation
is stored in a temporary relation, which must be written to disk and may be used as the
input for a following evaluation. Another approach is to evaluate several operations
simultaneously in a pipeline, in which the result of one operation is passed on to the
next one and no temporary relation is created.
These two approaches for evaluating expressions are materialization and pipelining.
Materialization
We will illustrate how to evaluate an expression using the materialization approach by
looking at an example expression.
When we apply the materialization approach, we start from the lowest-level operations in
the expression. In our example, there is only one such operation: the SELECTION on
DEPARTMENT. We execute this operation using a suitable algorithm, for example
retrieving multiple records using a secondary index on DName. The result is stored in a
temporary relation. We can then use this temporary relation to execute the operation at the
next level up in the tree; so in our example the inputs of the join operation are the
EMPLOYEE relation and the temporary relation just created. Now, evaluating the
JOIN operation generates another temporary relation. The execution terminates when
the root node is executed and produces the result relation for the query. The root node in
this example is the PROJECTION applied to the temporary relation produced by
executing the join.
Pipelining
We can improve query evaluation efficiency by reducing the number of temporary
relations that are produced. To achieve this reduction, it is common to combine several
operations into a pipeline of operations. To illustrate this idea, consider our example:
rather than being implemented separately, the JOIN can be combined with the SELECTION
on DEPARTMENT, the EMPLOYEE relation, and the final PROJECTION operation.
When the select operation generates a tuple of its result, that tuple is passed immediately,
along with a tuple from the EMPLOYEE relation, to the join. The join receives the two tuples as
input and processes them; if a result tuple is generated by the join, that tuple is again passed
immediately to the projection operation to produce a tuple of the final result relation.
Using pipelining in this situation reduces the number of temporary files and thus the
cost of query evaluation. In general, when pipelining is applicable, the cost of the two
approaches can differ substantially. However, there are cases where only materialization
is feasible.
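A sketch of pipelined evaluation in Python using generators, so each operator passes one tuple at a time to the next without writing a temporary relation; the relation contents, attribute names, and the use of an index on EMPLOYEE are illustrative assumptions:

# Pipelined evaluation of  π_Name( EMPLOYEE ⋈ σ_DName='Research'(DEPARTMENT) ).
def select(relation, predicate):
    for t in relation:
        if predicate(t):
            yield t                      # each qualifying tuple flows downstream at once

def join(outer, inner_index, attr):
    for t in outer:
        for s in inner_index.get(t[attr], []):
            yield {**t, **s}             # join result tuples are also produced one by one

def project(tuples, attrs):
    for t in tuples:
        yield tuple(t[a] for a in attrs)

department = [{"DNumber": 5, "DName": "Research"}, {"DNumber": 1, "DName": "HQ"}]
employee_by_dno = {5: [{"Name": "John", "DNO": 5}]}

pipeline = project(
    join(select(department, lambda d: d["DName"] == "Research"),
         employee_by_dno, "DNumber"),
    ["Name"],
)
print(list(pipeline))                    # no temporary relations are written to disk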
Access cost to secondary storage: the cost of searching for, reading, and writing data
blocks on secondary storage such as disk.
Computation cost: the cost of performing in-memory operations on the data buffers
during execution. This can be considered as the CPU time needed to execute a query.
Storage cost: the cost of storing intermediate files that are generated during execution.
Communication cost: the cost of transferring the query and its result from site to site
(in a distributed or parallel database system).
Memory usage cost: the number of buffers needed during execution.
In a large database, access cost is usually the most important cost, since disk accesses are
slow compared to in-memory operations.
In a small database, where almost all the data reside in memory, the emphasis is on
computation cost. In a distributed system, communication cost should be minimized.
It is difficult to include all the cost components in a cost function. Therefore, some cost
functions consider only the disk access cost as a reasonable measure of the cost of a query-
evaluation plan.
Query optimizers use the statistical information stored in the DBMS catalog to estimate the
cost of a plan. The relevant catalog information about a relation includes the number of
records, the number of blocks, the blocking factor, and the number of distinct values of each
attribute.
The statistical information listed here is simplified. The optimizer of a real database
management system may maintain further information to improve the accuracy of its cost
estimates.
With the statistical information maintained in the DBMS catalog and a measure of query
cost based on the number of disk accesses, we can estimate the cost of different relational
algebra operations. Here we give a simple example of using the cost model to
estimate the cost of a selection operation. However, we do not intend to go into the details of
this issue in this course; please refer to the textbook and reference books if you want to
go deeper.
Linear search: scan all file blocks; all records in a block are checked to see whether
they satisfy the search condition. In general, the cost of this method is C = br, where br is
the number of blocks of the relation. For a selection on a key attribute, half of the blocks are
scanned on average, so C = br/2.
Binary search: if the file is ordered on an attribute A and the selection condition is an
equality comparison on A, we can use binary search. The estimated number of blocks to be
scanned is
C = ⌈log2 br⌉ + ⌈sA / fr⌉ - 1
where sA is the number of records satisfying the condition and fr is the blocking factor.
The first term is the cost of locating the first satisfying tuple by a binary search; the second
term is the number of blocks containing records that satisfy the selection condition, one of
which has already been retrieved, which is why the third term (1) is subtracted.
Example: σDeptId=1(EMPLOYEE)
The file EMPLOYEE has the following statistical information: 1,000 records, 10 distinct values
of DeptId, a blocking factor of 20 records per block, and 50 blocks.
The average number of records that satisfy the condition is 1000/10 = 100 records.
The number of blocks containing these tuples is 100/20 = 5.
A binary search for the first tuple takes ⌈log2 50⌉ = 6 block accesses.
Thus the total cost is 6 + 5 - 1 = 10 block accesses.
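The arithmetic of this example can be checked with a small Python sketch of the cost formula reconstructed above; the statistics are the ones assumed in the example:

import math

# Cost of selection via binary search: ceil(log2(b)) + ceil(s / f) - 1
def binary_search_cost(num_blocks, matching_records, blocking_factor):
    locate_first = math.ceil(math.log2(num_blocks))           # binary search for first match
    matching_blocks = math.ceil(matching_records / blocking_factor)
    return locate_first + matching_blocks - 1                 # first matching block counted once

# EMPLOYEE: 1000 records, 10 distinct DeptId values, 20 records/block, 50 blocks.
matching = 1000 // 10                          # 100 records satisfy DeptId = 1
print(binary_search_cost(50, matching, 20))    # -> 10 block accesses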
HASHING TECHNIQUES:
Hashing provides very fast access to records on certain search conditions. This
organization is usually called a hash file.
The search condition must be an equality condition on a single field, called the hash field
of the file. The hash field is also called the hash key.
The idea behind hashing is to provide a function ‘h’ called a hash function (or)
randomizing function, that is applied to the hash field value of a record and yields the
address of the disk block in which the record is stored.
Hashing is also used as an internal search structure within a program whenever a group of
records is accessed exclusively by using the value of one field.
Static Hashing
A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block).
The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1, ...,
bucketM-1. Typically, a bucket corresponds to one (or a fixed number of) disk block.
In a hash file organization we obtain the bucket of a record directly from its search-key
value using a hash function, h (K).
The record with hash key value K is stored in bucket i, where i = h(K).
Hash function is used to locate records for access, insertion as well as deletion.
Records with different search-key values may be mapped to the same bucket; thus
entire bucket has to be searched sequentially to locate a record.
Primary pages are fixed, allocated sequentially, and never de-allocated; overflow pages are
used if needed.
One of the file fields is designated to be the hash key, K, of the file.
Collisions occur when a new record hashes to a bucket that is already full.
An overflow file is kept for storing such records. Overflow records that hash to each
bucket can be linked together.
To reduce overflow records, a hash file is typically kept 70-80% full.
The hash function h should distribute the records uniformly among the buckets;
otherwise, search time will be increased because many overflow records will exist.
The hash function works on the search key field of record r. It must distribute values over the
range 0 ... M-1.
Typical hash functions perform computation on the internal binary representation of the
search-key.
For example, for a string search-key, the binary representations of all the
characters in the string could be added and the sum modulo the number of
buckets could be returned.
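A sketch in Python of the string hash function just described and of a static hash file with chained overflow; the bucket capacity and all names are illustrative:

# Hash function: sum of character codes modulo the number of buckets M.
def h(key, M):
    return sum(ord(c) for c in str(key)) % M

# Static hash file: M fixed buckets plus an overflow chain per bucket.
class StaticHashFile:
    def __init__(self, M, bucket_capacity=2):
        self.M = M
        self.capacity = bucket_capacity
        self.buckets = [[] for _ in range(M)]     # primary pages, never de-allocated
        self.overflow = [[] for _ in range(M)]    # overflow records linked per bucket

    def insert(self, record, hash_field):
        i = h(record[hash_field], self.M)
        target = self.buckets[i] if len(self.buckets[i]) < self.capacity else self.overflow[i]
        target.append(record)                     # collision on a full bucket goes to overflow

    def search(self, key, hash_field):
        i = h(key, self.M)                        # only bucket i (and its overflow) is examined
        return [r for r in self.buckets[i] + self.overflow[i] if r[hash_field] == key]

f = StaticHashFile(M=4)
for ssn in ["1111", "2222", "3333"]:
    f.insert({"SSN": ssn}, "SSN")
print(f.search("2222", "SSN"))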
Ideal hash function is random, so each bucket will have the same number of records
assigned to it irrespective of the actual distribution of search-key values in the file.
Hashing techniques are adapted to allow the dynamic growth and shrinking of the
number of file records.
o Dynamic hashing
o Extendible hashing
o Linear hashing.
These hashing techniques use the binary representation of the hash value h(K).
In dynamic hashing the directory is a binary tree.
In extendible hashing the directory is an array of size 2^d, where d is called the global
depth.
The directories can be stored on disk, and they expand or shrink dynamically. Directory
entries point to the disk blocks that contain the stored records.
An insertion in a disk block that is full causes the block to split into two blocks and the
records are redistributed among the two blocks.
The directory is updated appropriately.
Dynamic and extendible hashing do not require an overflow area.
Linear hashing does require an overflow area but does not use a directory. Blocks are
split in linear order as the file expands.
Dynamic Hashing
Hash function generates values over a large range —typically b-bit integers, with b = 32.
At any time use only a prefix of the hash function to index into a table of bucket
addresses.
Let the length of the prefix be i bits, 0 ≤ i ≤ 32.
Bucket address table size = 2^i. Initially i = 0.
Value of i grows and shrinks as the size of the database grows and shrinks.
Multiple entries in the bucket address table may point to the same bucket.
Thus, the actual number of buckets is < 2^i.
The number of buckets also changes dynamically due to coalescing and splitting of
buckets.
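A sketch in Python of the prefix-addressing idea only (directory lookup by an i-bit prefix of a 32-bit hash value), ignoring bucket splitting and coalescing; the hash function and directory contents are illustrative:

import zlib

B = 32                                    # hash values are b-bit integers, here b = 32

def h32(key):
    return zlib.crc32(str(key).encode()) & 0xFFFFFFFF   # illustrative 32-bit hash

def directory_index(key, i):
    # Use only the first i bits (the prefix) of h(K) to index the bucket address table.
    return 0 if i == 0 else h32(key) >> (B - i)

# With i = 2 the bucket address table has 2**i = 4 entries; several entries may
# point to the same physical bucket, so the number of buckets can be < 2**i.
i = 2
table = [0, 0, 1, 1]                      # directory entries -> bucket numbers
buckets = {0: [], 1: []}
for k in ["A", "B", "C"]:
    buckets[table[directory_index(k, i)]].append(k)
print(buckets)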
Linear Hashing