
Database tuning

Database tuning describes a group of activities used to optimize and homogenize the


performance of a database. It usually overlaps with query tuning, but also refers to the design of
the database files and the selection of the database management system (DBMS), operating
system, and CPU that the DBMS runs on.

The goal is to maximize use of system resources to perform work as efficiently and
rapidly as possible. Most systems are designed to manage work efficiently, but it is
possible to greatly improve performance by customizing settings and the configuration
for the database and the DBMS being tuned.

DBMS tuning refers to tuning of the DBMS and the configuration of the memory and
processing resources of the computer running the DBMS. This is typically done through
configuring the DBMS, but the resources involved are shared with the host system.

Tuning the DBMS can involve setting the recovery interval (the time needed to restore the
state of the data to a particular point in time), assigning parallelism (the breaking up of work
from a single query into tasks assigned to different processing resources), and choosing the
network protocols used to communicate with database consumers.

Memory is allocated for data, execution plans, procedure cache, and work space. It is
much faster to access data in memory than data on storage, so maintaining a sizable
cache of data makes activities perform faster. The same consideration is given to work
space. Caching execution plans and procedures means that they are reused instead of
recompiled when needed. It is important to take as much memory as possible, while
leaving enough for other processes and the OS to use without excessive paging of
memory to storage.

Processing resources are sometimes assigned to specific activities to


improve concurrency. On a server with eight processors, six could be reserved for the
DBMS to maximize available processing resources for the database.
Basic Algorithms for Executing Relational Query Operations
 An RDBMS must include one or more alternative algorithms that implement each
relational algebra operation (SELECT, JOIN,…) and, in many cases, each
combination of these operations.
 Each algorithm may apply only to particular storage structures and access paths (such as an
index).
 Only execution strategies that can be implemented by the RDBMS algorithms and that
apply to the particular query and particular database design can be considered by the
query optimization module.

1. Algorithms for implementing SELECT operation


 These algorithms depend on the file having specific access paths and may apply only to
certain types of selection conditions.
 We will use the following examples of SELECT operations:
– (OP1): σSSN=‘123456789’ (EMPLOYEE)

– (OP2): σDNUMBER > 5 (DEPARTMENT)

– (OP3): σDNO=5 (EMPLOYEE)

– (OP4): σDNO=5 AND SALARY>30000 AND SEX = ‘F’ (EMPLOYEE)

– (OP5): σESSN=‘123456789’ AND PNO=10 (WORKS_ON)

 Many search methods can be used for simple selection: S1 through S6 (a sketch of S1 and S2 appears after this list).


• S1: Linear Search (brute force) — a full scan, in Oracle’s terminology

– Retrieves every record in the file and tests whether its attribute values satisfy the
selection condition: an expensive approach.
– Cost: b/2 on average if the selection attribute is a key, and b if it is not a key.
• S2: Binary Search

– If the selection condition involves an equality comparison on a key attribute on


which the file is ordered.
– SSN is the ordering attribute.
– Cost: log2(b) if the selection is on the ordering key.
• S3: Using a Primary Index (hash key)

– An equality comparison on a key attribute with a primary index (or hash key).
– This condition retrieves a single record (at most).
– Cost (primary index): bind/2 + 1, where bind is the number of index blocks (hash key: 1 bucket access if there is no collision).
• S4: Using a primary index to retrieve multiple records

– Comparison condition is >, >=, <, or <= on a key field with a primary index

– Use the index to find the record satisfying the corresponding equality condition
(DNUMBER=5), then retrieve all subsequent records in the (ordered) file.
– For the condition (DNUMBER <5), retrieve all the preceding records.
– This method is used for range queries too (i.e., queries that retrieve records in a certain
range).
– Cost: bind/2 + ?, where ‘?’ can be determined if the number of matching records is known.

• S5: Using a clustering index to retrieve multiple records:

– If the selection condition involves an equality comparison on a non-key attribute


with a clustering index.
– σDNO=5 (EMPLOYEE)
– Use the index to retrieve all the records satisfying the condition.
– Cost: log2(bind) + ?, where ‘?’ can be determined if the number of duplicates is known.
• S6: Using a secondary (B+-tree) index on an equality comparison:

– The method can be used to retrieve a single record if the indexing field is a key
or to retrieve multiple records if the indexing field is not a key.
– This can also be used for comparisons involving >, >=, <, or <=.
– This method is used for range queries too.
– Cost to retrieve: a key = height + 1; a non-key = height + 1 (extra level) + ?;
a comparison = (height − 1) + ? + ?
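The following is a minimal sketch (not from the text) of S1 and S2 over a file held in memory as a list of blocks; the block layout, the key name SSN, and the record contents are assumptions made only for illustration.

# Illustrative sketch of S1 (linear search) and S2 (binary search on the
# ordering key SSN); the file is modelled as a list of blocks, each block a
# list of records. Data and field names are assumed, not taken from the text.
blocks = [
    [{"SSN": "100"}, {"SSN": "200"}],   # block 0 (file ordered on SSN)
    [{"SSN": "300"}, {"SSN": "400"}],   # block 1
]

def s1_linear_search(blocks, ssn):
    # S1: brute force -- read every block until the record is found (cost ~ b).
    for block in blocks:
        for record in block:
            if record["SSN"] == ssn:
                return record
    return None

def s2_binary_search(blocks, ssn):
    # S2: binary search over the blocks of a file ordered on SSN (cost ~ log2(b)).
    lo, hi = 0, len(blocks) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        first, last = blocks[mid][0]["SSN"], blocks[mid][-1]["SSN"]
        if ssn < first:
            hi = mid - 1
        elif ssn > last:
            lo = mid + 1
        else:
            # The key, if present, must be in this block; scan only this block.
            return next((r for r in blocks[mid] if r["SSN"] == ssn), None)
    return None

print(s1_linear_search(blocks, "300"), s2_binary_search(blocks, "300"))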
• Many search methods can be used for a complex selection that involves a conjunctive condition:
S7 through S9.

– Conjunctive condition: several simple conditions connected with the AND


logical connective.
– (OP4): σDNO=5 AND SALARY>30000 AND SEX = ‘F’ (EMPLOYEE).
• S7:Conjunctive selection using an individual index.

– If an attribute involved in any single simple condition in the conjunctive condition has an
access path that permits the use of one of the Methods S2 to S6, use that condition to
retrieve the records.
– Then check whether each retrieved record satisfies the remaining simple conditions in the
conjunctive condition
• S8:Conjunctive selection using a composite index:

– If two or more attributes are involved in equality conditions in the conjunctive condition
and a composite index (or hash structure) exists on the combined fields.
– Example: If an index has been created on the composite key (ESSN, PNO) of the
WORKS_ON file, we can use the index directly.
– (OP5): σESSN=‘123456789’ AND PNO=10 (WORKS_ON).
• S9: Conjunctive selection by intersection of record pointers

– If the secondary indexes are available on more than one of the fields involved in simple
conditions in the conjunctive condition, and if the indexes include record pointers (rather
than block pointers), then each index can be used to retrieve the set of record pointers that
satisfy the individual condition.
– The intersection of these sets of record pointers gives the record pointers that satisfy the
conjunctive condition.
– If only some of the conditions have secondary indexes, each retrieved record is further
tested to determine whether it satisfies the remaining conditions (see the sketch after this list).
• Commercial systems: Informix uses S9. Sybase ASE does it using bitmap operations. Oracle 8
implements intersection of record pointers in several ways (“hash join of indexes” and “AND bitmap”).
Microsoft SQL Server implements intersection of record pointers by index join.
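As an illustration of S9, the sketch below intersects sets of record pointers obtained from two secondary indexes; the index contents and attribute values are invented for the example.

# Sketch of S9: conjunctive selection by intersection of record pointers.
# Each secondary index maps an attribute value to a set of record pointers
# (rids); the contents below are made up for illustration.
index_dno = {5: {1, 2, 4, 7}, 4: {3, 5}}          # secondary index on DNO
index_sex = {"F": {2, 4, 6}, "M": {1, 3, 5, 7}}   # secondary index on SEX

def conjunctive_select(dno, sex):
    # Intersect the pointer sets that satisfy each indexed simple condition.
    rids = index_dno.get(dno, set()) & index_sex.get(sex, set())
    # A condition without an index (e.g. SALARY > 30000) would be checked on
    # the records retrieved through these pointers afterwards.
    return sorted(rids)

print(conjunctive_select(5, "F"))   # pointers of records with DNO=5 AND SEX='F'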
2. Algorithms for implementing JOIN Operation
• Join is a time-consuming operation. We will consider only the natural join operation.

– Two-way join: join on two files.


– Multiway join: involving more than two files.

• The following examples of the two-way JOIN operation (R ⋈A=B S) will be used:

– OP6: EMPLOYEE ⋈DNO=DNUMBER DEPARTMENT

– OP7: DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE
• J1: Nested-loop join (brute force)

– For each record t in R (outer loop), retrieve every record s from S (inner loop) and test
whether the two records satisfy the join condition t[A] = s[B].
• J2: Single-loop join (using an access structure to retrieve the matching records)

– If an index (or hash key) exists for one of the two join attributes (e.g., B of S), retrieve
each record t in R, one at a time (single loop), and then use the access structure to retrieve
directly all matching records s from S that satisfy s[B] = t[A].
• J3. Sort-merge join:

– If the records of R and S are physically sorted (ordered) by value of the join attributes A
and B, respectively, we can implement the join in the most efficient way.
– Both files are scanned concurrently in order of the join attributes, matching the records
that have the same values for A and B.
– If the files are not sorted, they may be sorted first by using external sorting.
– Pairs of file blocks are copied into memory buffers in order and records of each file are
scanned only once each for matching with the other file if A & B are key attributes.
– The method is slightly modified in case where A and B are not key attributes.
• J4: Hash-join

– The records of files R and S are both hashed to the same hash file using the same hashing
function on the join attributes A of R and B of S as hash keys.
• Partitioning Phase

– First, a single pass through the file with fewer records (say, R) hashes its records to the
hash file buckets.
– Assumption: The smaller file fits entirely into memory buckets after the first phase.
• If the above assumption is not satisfied, the method becomes more complex, and a number of
variations have been proposed to improve efficiency: partitioned hash join and hybrid hash join.

• Probing Phase

– A single pass through the other file (S) then hashes each of its records to probe the
appropriate bucket, and that record is combined with all matching records from R in that
bucket (a sketch of this join appears after this list).
• Commercial systems: Sybase ASE supports single-loop join and sort-merge join. Oracle 8
supports page-oriented nested-loop join, sort-merge join, and a variant of hybrid hash join. IBM
DB2 supports single-loop join, sort-merge join, and hybrid hash join. Microsoft SQL Server supports
single-loop join, sort-merge join, hash join, and a technique called hash teams. Informix supports
nested-loop, single-loop, and hybrid hash joins.
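A compact, in-memory sketch of the J4 hash join for OP6 is shown below; the relation contents are invented, and the smaller file (DEPARTMENT) is assumed to fit in memory, as in the description above.

# Illustrative hash join (J4) for OP6: EMPLOYEE joined with DEPARTMENT on
# DNO = DNUMBER. Relation contents are assumed for the example.
DEPARTMENT = [{"DNUMBER": 5, "DNAME": "Research"},
              {"DNUMBER": 4, "DNAME": "Administration"}]
EMPLOYEE = [{"SSN": "111", "DNO": 5},
            {"SSN": "222", "DNO": 4},
            {"SSN": "333", "DNO": 5}]

def hash_join(R, S, A, B):
    # Partitioning phase: hash the smaller file R into in-memory buckets on A.
    buckets = {}
    for r in R:
        buckets.setdefault(hash(r[A]), []).append(r)
    # Probing phase: one pass over S; each record probes the matching bucket
    # and is combined with every matching record of R found there.
    result = []
    for s in S:
        for r in buckets.get(hash(s[B]), []):
            if r[A] == s[B]:
                result.append({**r, **s})
    return result

print(hash_join(DEPARTMENT, EMPLOYEE, "DNUMBER", "DNO"))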

3. Algorithms for implementing PROJECTION Operation

 If the attribute list of the projection operation includes the key: the result will have the
same number of tuples but with only the values of the attribute list.
 In the other case:
– Remove unwanted attributes (not specified in the projection).

– Eliminate any duplicate tuples.

SELECT DISTINCT SSN, LNAME

FROM EMPLOYEE

(Duplicates are not removed if DISTINCT is not used.)

• Projection based on Sorting

– Scan EMPLOYEE and produce a set of tuples that contain only the desired attributes.
– Sort this set of tuples using the combination of all its attributes as the key for sorting.
– Scan the sorted result, comparing adjacent tuples, and discard duplicates.

• Projection Based on Hashing: Hashing is used to eliminate duplicates.

– As each record is hashed (hash function on the attribute list of the projection operation)
and inserted into a bucket of the hash file in memory, it is checked against those already
in the bucket.
– If it is a duplicate, it is not inserted.

• Projection Based on Indexing:


– An existing index is useful if the key includes all the attributes that we wish to retain in
the projection.
– We can simply retrieve the key values from the index (without ever accessing the actual
relation) and apply our projection techniques to this (much smaller) set of pages. This
technique is called an index-only scan.
– If we have an ordered index whose search key includes the wanted attributes as a prefix,
we can do even better: Just retrieve the data entries in order, discarding unwanted fields,
and compare adjacent entries to check for duplicates.
 Since external sorting is required for a variety of reasons, most database systems
have a sorting utility, which can be used to implement projection relatively easily.
 Sorting is the standard approach for projection.
 Commercial Systems: Informix uses hashing; IBM DB2, Oracle 8, and Sybase
ASE use sorting. Microsoft SQL Server and Sybase ASIQ implement both hash-based and
sort-based algorithms (a sketch of the hash-based approach follows).
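Below is a brief sketch of hash-based duplicate elimination for projection, following the description above; the relation contents and bucket count are assumptions made for the example.

# Hash-based duplicate elimination for a projection (sketch; data is made up).
EMPLOYEE = [
    {"SSN": "111", "LNAME": "Smith", "DNO": 5},
    {"SSN": "222", "LNAME": "Wong",  "DNO": 4},
    {"SSN": "111", "LNAME": "Smith", "DNO": 5},   # duplicate after projection
]

def project_distinct(relation, attrs, n_buckets=8):
    buckets = {}                                   # in-memory hash file
    for record in relation:
        t = tuple(record[a] for a in attrs)        # drop unwanted attributes
        bucket = buckets.setdefault(hash(t) % n_buckets, [])
        if t not in bucket:                        # check against tuples already there
            bucket.append(t)                       # duplicates are not inserted
    return [t for b in buckets.values() for t in b]

print(project_distinct(EMPLOYEE, ["SSN", "LNAME"]))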

Combining Operations

 An SQL query will be translated into a sequence of relational operations.


– OP8: πLNAME(σSEX=‘M’ (EMPLOYEE) ⋈MGRSSN=SSN DEPARTMENT)

• Materialization Alternative:

– Execute a single operation at a time, generating a temporary file that is used as
the input to the next operation.
– OP8: compute σSEX=‘M’ (EMPLOYEE) and store it in a temporary file. Then compute the join of
this temporary file with DEPARTMENT and store the result in a new temporary file.
Finally, compute the projection as the result file. So there are 2 input files, 2 temporary files, and a
result file.
– This is a time-consuming approach because it generates and stores many temporary files.

• Pipelining (stream-based) Alternative:

– Generate query execution code that corresponds to algorithms for combinations of
operations in a query.
– As the result tuples from one operation are produced, they are provided as input to the
parent operation, so there is no need to store temporary files on disk.
– OP8: don’t store the result of σSEX=‘M’ (EMPLOYEE); instead, pass its tuples directly to the
join. Similarly, don’t store the result of the join; pass its tuples directly to the projection. So there
are only two input files and one result file.
– Pipelines can be executed in two ways: demand driven and producer driven (a demand-driven sketch follows).
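The demand-driven pipeline for OP8 can be sketched with Python generators, where each operator yields tuples to its parent and no temporary file is written; the relation contents and attribute names below are assumptions for illustration only.

# Demand-driven pipelining sketch for OP8 (data and attribute names are assumed).
EMPLOYEE = [
    {"LNAME": "Smith",  "SEX": "M", "SSN": "111"},
    {"LNAME": "Wong",   "SEX": "M", "SSN": "222"},
    {"LNAME": "Zelaya", "SEX": "F", "SSN": "333"},
]
DEPARTMENT = [{"DNAME": "Research", "MGRSSN": "222"}]

def select_male(employees):                  # sigma SEX='M'
    for e in employees:
        if e["SEX"] == "M":
            yield e

def join_on_mgrssn(employees, departments):  # join on MGRSSN = SSN
    for e in employees:                      # consumes the selection's output lazily
        for d in departments:
            if d["MGRSSN"] == e["SSN"]:
                yield {**e, **d}

def project_lname(tuples):                   # pi LNAME
    for t in tuples:
        yield t["LNAME"]

# Tuples flow straight from selection to join to projection; no temporary files.
print(list(project_lname(join_on_mgrssn(select_male(EMPLOYEE), DEPARTMENT))))
# -> ['Wong']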

In this section, we first discuss how to transform a relational algebra expression into an
equivalent one; then we discuss the generation of the query execution plan.

2.1 Transformation of Relational Expression


As we mentioned, one aspect of optimization occurs at the relational algebra level.
This involves transforming an initial expression (tree) into an equivalent expression (tree)
that is more efficient to execute. Two relational algebra expressions are said to be
equivalent if the two expressions generate relations with the same set of attributes and
containing the same set of tuples, although their attributes may be ordered differently.

The query tree is a data structure that represents the relational algebra expression in the
query optimization process. The leaf nodes in the query tree correspond to the input
relations of the query, and the internal nodes represent the operators in the query. When
executing the query, the system executes an internal node's operation whenever its
operands are available; the internal node is then replaced by the relation obtained
from that execution.

2.1.1 Equivalence Rules for transforming relational expressions

There are many rules which can be used to transform relational algebra operations
to equivalent ones. We will state here some useful rules for query optimization.

In this section, we use the following notation:

 E1, E2, E3,… : denote relational algebra expressions


 X, Y, Z : denote sets of attributes
 F, F1, F2, F3 ,… : denote predicates (selection or join conditions)
 Commutativity of Join and Cartesian Product operations

E1 ⋈F E2 ≡ E2 ⋈F E1 and E1 × E2 ≡ E2 × E1

 Note that the Natural Join operator is a special case of Join, so Natural Joins are also
commutative.
 Associativity of Join and Cartesian Product operations

The Join operation is associative in the following manner, where F1 involves attributes from only E1
and E2 and F2 involves only attributes from E2 and E3:

(E1 ⋈F1 E2) ⋈F2 E3 ≡ E1 ⋈F1 (E2 ⋈F2 E3)

 Cascade of Projection
πX1(πX2(...(πXn(E))...))≡πX1(E)
 Cascade of Selection
σF1∧F2∧...∧Fn (E) ≡ σF1(σF2(...(σFn(E))...))

 Commutativity of Selection
σF1(σF2(E))≡σF2(σF1(E))
 Commuting Selection with Projection
πX(σF(E))≡σF(πX(E))
This rule holds if the selection condition F involves only the attributes in set X.
 Selection with Cartesian Product and Join
 If all the attributes in the selection condition F involve only the attributes of one
of the expressions, say E1, then the Selection and Join can be combined as follows:

σF(E1 ⋈ E2) ≡ (σF(E1)) ⋈ E2
 If the selection condition F = F1 AND F2, where F1 involves only attributes of
expression E1 and F2 involves only attributes of expression E2, then we have:

σF(E1 ⋈ E2) ≡ (σF1(E1)) ⋈ (σF2(E2))
 If the selection condition F = F1 AND F2, where F1 involves only attributes of
expression E1 and F2 involves attributes from both E1 and E2, then we have:

σF(E1 ⋈ E2) ≡ σF2((σF1(E1)) ⋈ E2)

The same rules apply if the Join operation is replaced by a Cartesian Product operation.
 Commuting Projection with Join and Cartesian Product
 Let X and Y be sets of attributes of E1 and E2 respectively. If the join condition
involves only attributes in XY (the union of the two sets), then:

πXY(E1 ⋈F E2) ≡ (πX(E1)) ⋈F (πY(E2))

The same rule applies when the Join is replaced by a Cartesian Product.

 If the join condition involves additional attributes, say Z of E1 and W of E2, and
Z, W are not in XY, then:

πXY(E1 ⋈F E2) ≡ πXY((πXZ(E1)) ⋈F (πYW(E2)))

 Commuting Selection with set operations

The Selection commutes with all three set operations (Union, Intersection, Set Difference):

σF(E1 ∪ E2) ≡ σF(E1) ∪ σF(E2)

The same rule applies when the Union is replaced by Intersection or Set Difference.
 Commuting Projection with Union

πX(E1 ∪ E2) ≡ πX(E1) ∪ πX(E2)
 Commutativity of set operations: The Union and Intersection are commutative but
Set Difference is not.

 Associativity of set operations: Union and Intersection are associative, but Set
Difference is not.

 Converting a Cartesian Product followed by a Selection into a Join

If the selection condition corresponds to a join condition, we can do the conversion as
follows:

σF(E1 × E2) ≡ E1 ⋈F E2

2.1.2 Example of Transformation

Consider the following query on the COMPANY database: “Find the names of employees born
after 1967 who work on a project named ‘Greenlife’.”

The SQL query is:

SELECT E.Name
FROM EMPLOYEE E, JOIN J, PROJECT P
WHERE E.EID = J.EID AND J.PCode = P.Code AND E.Bdate > ’31-12-1967’ AND P.Name
= ‘Greenlife’;

The initial query tree for this SQL query is

Figure 2: Initial query tree for query in example

We can transform the query in the following steps:

- Use transformation rule number 7 on the Cartesian Product and Selection
operations to move Selections down the tree. Selection operations help
reduce the size of the temporary relations involved in the Cartesian Product.
Figure 3: Move Selection down the tree

- Use rule number 13 to convert the sequence (Selection, Cartesian Product) into a
Join. In this situation, since the join condition is an equality comparison on the same attributes,
we can convert it to a Natural Join.

Figure 4: Combine Cartesian Product with
subsequent Selection into Join

- Use rule number 8 to move Projection operations down the tree.


Figure 5: Move projections down the query tree

2.2 Heuristic Algebraic Optimization algorithm

Here are the steps of an algorithm that utilizes the equivalence rules to transform the query
tree.

 Break up any Selection operation with conjunctive conditions into a cascade of


Selection operations. This step is based on equivalence rule number 4.
 Move Selection operations as far down the query tree as possible. This step uses
the commutativity and associativity of Selection, as given in equivalence rules
5, 6, 7 and 9.
 Rearrange the leaf nodes of the tree so that the most restrictive selections are done first.
The most restrictive selection is the one that produces the fewest tuples. In
addition, make sure that the ordering of leaf nodes does not lead to Cartesian
Product operations. This step relies on the rules of associativity of binary operations,
such as rules 2 and 12.
 Combine a Cartesian Product with a subsequent Selection operation into a Join
operation if the selection condition represents a join condition (rule 13)
 Break down and move lists of projection attributes down the tree as far as possible,
creating new Projection operations as needed (rules 3, 6, 8, 10).
 Identify subtrees that represent groups of operations that can be pipelined, and
execute them using pipelining (a small sketch of the selection push-down step appears below).

The previous example illustrates the transformation of a query tree using this algorithm.
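The following toy sketch illustrates one step of the algorithm: pushing a Selection below a Cartesian Product when its condition mentions attributes of only one operand (rule 7). The node structure and attribute names are invented for the example and are not the text's own code.

# Toy sketch of pushing a selection below a Cartesian product (rule 7).
class Node:
    def __init__(self, op, children=(), attrs=frozenset(), cond=None):
        self.op, self.children = op, list(children)
        self.attrs, self.cond = set(attrs), cond

def push_selection_down(select_node):
    # select_node is a SELECT over a CARTESIAN node; if its condition uses
    # attributes of only one child, move it onto that child.
    product = select_node.children[0]
    for i, child in enumerate(product.children):
        if select_node.cond["attr"] in child.attrs:
            product.children[i] = Node("SELECT", [child], child.attrs, select_node.cond)
            return product        # the product node replaces the selection node
    return select_node            # condition spans both children: leave it in place

emp  = Node("RELATION", attrs={"EID", "Name", "Bdate"})
proj = Node("RELATION", attrs={"Code", "PName"})
tree = Node("SELECT",
            [Node("CARTESIAN", [emp, proj], attrs=emp.attrs | proj.attrs)],
            cond={"attr": "Bdate", "op": ">", "value": "31-12-1967"})

tree = push_selection_down(tree)
print(tree.op, [c.op for c in tree.children])   # CARTESIAN ['SELECT', 'RELATION']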

2.3 Converting query tree to query evaluation plan

Query optimizers use the above equivalence rules to generate an enumeration of expressions
logically equivalent to the given query expression. However, generating expressions is
just one part of the optimization process. As mentioned above, the evaluation plan
includes the detailed algorithm for each operation in the expression and how the execution of
the operations is coordinated. Figure 6 shows an evaluation plan.

Figure 6: An evaluation plan

As we know, the output of the Parsing and Translating step in query processing is a
relational algebra expression. For a complex query, this expression consists of several
operations and involves various relations. Thus the evaluation of the expression can be
very costly in terms of both time and memory space. Now we consider how to evaluate
an expression containing multiple operations. The obvious way to evaluate the expression
is simply to evaluate one operation at a time in an appropriate order. The result of each
individual evaluation is stored in a temporary relation, which must be written to disk
and may be used as the input for a following evaluation. Another approach is to evaluate
several operations simultaneously in a pipeline, in which the result of one operation is passed
on to the next one and no temporary relation is created.

These two approaches for evaluating expression are materialization and pipelining.

Materialization
We will illustrate how to evaluate an expression using the materialization approach by
looking at an example expression.

Consider the expression

The corresponding query tree is

Figure 7: Sample query tree for a relational algebra expression

When we apply the materialization approach, we start from the lowest-level operations in
the expression. In our example, there is only one such operation: the SELECTION on
DEPARTMENT. We execute this operation using one of the algorithms for it, for example
retrieving multiple records using a secondary index on DName. The result is stored in a
temporary relation. We can use this temporary relation to execute the operation at the
next level up in the tree. So, in our example, the inputs of the join operation are the
EMPLOYEE relation and the temporary relation that was just created. Now, evaluate the
JOIN operation, generating another temporary relation. The execution terminates when
the root node is executed and produces the result relation for the query. The root node in
this example is the PROJECTION applied to the temporary relation produced by
executing the join.

Using materialized evaluation, every intermediate operation produces a temporary relation,
which is then used for the evaluation of higher-level operations. These temporary relations
vary in size and might have to be stored on disk. Thus, the cost of this evaluation is the sum
of the costs of all operations plus the cost of writing and reading the intermediate
results to and from disk (where applicable).

Pipelining
We can improve query evaluation efficiency by reducing the number of temporary
relations that are produced. To achieve this reduction, it is common to combine several
operations into a pipeline of operations. To illustrate this idea, consider our example:
rather than being implemented separately, the JOIN can be combined with the SELECTION
on DEPARTMENT, the EMPLOYEE relation, and the final PROJECTION operation.
When the Selection operation generates a tuple of its result, that tuple is passed immediately,
along with a tuple from the EMPLOYEE relation, to the join. The join receives the two tuples as
input and processes them; if a result tuple is generated by the join, that tuple is again passed
immediately to the projection operation, which produces a tuple of the final result relation.

We can implement the pipeline by generating query execution code.

Using pipelining in this situation reduces the number of temporary files and thus reduces
the cost of query evaluation. In general, when pipelining is applicable,
the cost of the two approaches can differ substantially. However, there are cases where only
materialization is feasible.

3. Cost Estimates in Query Optimization


Typically, a query optimizer does not depend only on heuristic rules; it also
estimates and compares the cost of executing different plans, and then chooses the query
execution plan with the lowest cost.

3.1 Measure of Query Cost

The cost of a query execution plan includes the following components:

 Access cost to secondary storage: This is the cost of searching for, reading,
writing data blocks of secondary storage such as disk.
 Computation cost: This is the cost of performing in-memory operations on the data
buffers during execution. It can be considered the CPU time needed to execute the query.
 Storage cost: This is the cost of storing intermediate files that are generated during
execution.
 Communication cost: This is the cost of transferring the query and its result from
site to site (in a distributed or parallel database system).
 Memory usage cost: Number of buffers needed during execution.

In a large database, access cost is usually the most important cost, since disk accesses are
slow compared to in-memory operations.
In a small database, where almost all data reside in memory, the emphasis is on
computation cost. In a distributed system, communication cost should be minimized.

It is difficult to include all the cost components in a cost function. Therefore, some cost
functions consider only disk access cost as a reasonable measure of the cost of a query-
evaluation plan.

3.2 Catalog Information for Cost Estimation

Query optimizers use the statistical information stored in the DBMS catalog to estimate the
cost of a plan. The relevant catalog information about a relation includes:

 Number of tuples in relation r: denoted by nr

 Number of blocks containing tuples of relation r: br
 Size of a tuple of relation r (assuming records in a file are all of the same type): sr
 Blocking factor of relation r, which is the number of tuples that fit into one block:
fr
 V(A,r) is the number of distinct values of an attribute A in relation r. This value
is the same as the size of πA(r). If A is a key attribute, then V(A,r) = nr.
 SC(A,r) is the selection cardinality of attribute A of relation r. This is the average
number of records that satisfy an equality condition on attribute A.

In addition to relation information, some information about indices is also used:

 Number of levels in index i.


 Number of lowest-level index blocks in index i (the number of blocks at the leaf level
of the index).

The statistical information listed here is simplified. The optimizers of real database
management systems maintain further information to improve the accuracy of their cost
estimates.

With the statistical information maintained in the DBMS catalog and a measure of query
cost based on the number of disk accesses, we can estimate the cost of different relational
algebra operations. Here, we give a simple example of using the cost model to
estimate the cost of a selection operation. However, we do not intend to go into the details of
this issue in this course; please refer to the textbook and reference books if you want to
go deeper into this topic.

Example of cost functions for SELECTION


Consider a selection operation on a relation whose tuples are all stored in one file. The
simplest algorithms to implement selection are linear search and binary search.

 Linear search: Scan all file blocks; all records in a block are checked to see
whether they satisfy the search condition. In general, the cost of this method is C = br.
For a selection on a key attribute, half of the blocks are scanned on average, so C =
br/2.
 Binary search: If the file is ordered on an attribute A and the selection condition is an
equality comparison on A, we can use binary search. The estimated number of
blocks to be scanned is

C = ⌈log2(br)⌉ + ⌈SC(A,r)/fr⌉ − 1

The first term is the cost of locating the first satisfying tuple by a binary search; the second
term is the number of blocks containing records that satisfy the selection condition, one of
which has already been retrieved, which is why we subtract one (the third term).

Now, consider a selection on the EMPLOYEE file:

σDeptId=1(EMPLOYEE)
The file EMPLOYEE has the following statistical information:

 f = 20 (20 tuples fit in one block)

 V(DeptID, EMPLOYEE) = 10 (there are 10 different departments)
 n = 1000 (there are 1000 tuples in the file)

The cost of a linear search is b = 1000/20 = 50 block accesses.

The cost of a binary search on the ordering attribute DeptID (checked in the short sketch below):

 The average number of records that satisfy the condition is 1000/10 = 100 records.
 The number of blocks containing these tuples is 100/20 = 5.
 A binary search for the first tuple takes ⌈log2(50)⌉ = 6 block accesses.
 Thus the total cost is 5 + 6 − 1 = 10 block accesses.
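The arithmetic above can be checked with a few lines of Python; the statistics are the ones assumed for the EMPLOYEE file in this example.

# Quick check of the selection-cost estimates for the EMPLOYEE example.
import math

n_r, f_r, V = 1000, 20, 10        # tuples, blocking factor, distinct DeptID values
b_r = n_r // f_r                  # 50 blocks in the file

linear_cost = b_r                                   # full scan: 50 block accesses
sc = n_r // V                                       # selection cardinality: 100 records
blocks_with_matches = math.ceil(sc / f_r)           # 5 blocks hold the matching tuples
binary_cost = math.ceil(math.log2(b_r)) + blocks_with_matches - 1   # 6 + 5 - 1 = 10

print(linear_cost, binary_cost)   # 50 10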

HASHING TECHNIQUES:

 Hashing provides very fast access to records on certain search conditions. This
organization is usually called a hash file.
 The search condition must be an equality condition on a single field, called the hash field
of the file. The hash field is also called the hash key.
 The idea behind hashing is to provide a function h, called a hash function (or
randomizing function), that is applied to the hash field value of a record and yields the
address of the disk block in which the record is stored.
 Hashing is also used as an internal search structure within a program whenever a group of
records is accessed exclusively by using the value of one field.

Static Hashing

 A bucket is a unit of storage containing one or more records (a bucket is typically a disk
block).
 The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1, ...,
bucketM−1. Typically, a bucket corresponds to one (or a fixed number of) disk block(s).
 In a hash file organization we obtain the bucket of a record directly from its search-key
value using a hash function h(K).
 The record with hash key value K is stored in bucket i, where i = h(K).
 Hash function is used to locate records for access, insertion as well as deletion.
 Records with different search-key values may be mapped to the same bucket; thus the
entire bucket has to be searched sequentially to locate a record.
 Primary pages are fixed, allocated sequentially, and never de-allocated; overflow pages are
used if needed.

h(K) mod M = the bucket to which the data entry with key K belongs (M = number of buckets).

Static External Hashing

 One of the file fields is designated to be the hash key, K, of the file.
 Collisions occur when a new record hashes to a bucket that is already full.
 An overflow file is kept for storing such records. Overflow records that hash to each
bucket can be linked together.
 To reduce overflow records, a hash file is typically kept 70-80% full.
 The hash function h should distribute the records uniformly among the buckets;
otherwise, search time will be increased because many overflow records will exist.
 The hash function works on the search-key field of record r and must distribute values over the
range 0 ... M−1.

h(K) = (a * K + b) usually works well, where
a and b are constants;
a lot is known about how to tune h.

 Typical hash functions perform computation on the internal binary representation of the
search-key.

For example, for a string search-key, the binary representations of all the
characters in the string could be added and the sum modulo the number of
buckets could be returned.

 The ideal hash function is random, so each bucket will have the same number of records
assigned to it irrespective of the actual distribution of search-key values in the file
(a sketch of a static hash file with overflow chaining follows).
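A minimal sketch of a static hash file with an overflow area is given below; the bucket count, block capacity, and hash-function constants are assumed values chosen only for illustration.

# Static hashing with overflow chaining (sketch; parameters are assumed).
M = 7                                 # number of primary buckets
BLOCK_CAPACITY = 3                    # records per primary bucket ("disk block")
a, b = 31, 17                         # constants of the hash function

buckets  = [[] for _ in range(M)]     # primary area
overflow = [[] for _ in range(M)]     # overflow records chained per bucket

def h(K):
    return (a * K + b) % M            # h(K) mod M gives the bucket number

def insert(K, record):
    slot = buckets[h(K)]
    if len(slot) < BLOCK_CAPACITY:
        slot.append((K, record))              # primary bucket has room
    else:
        overflow[h(K)].append((K, record))    # collision: bucket full, use overflow

def search(K):
    # Equality search on the hash key: one primary bucket, then its overflow chain.
    for key, record in buckets[h(K)] + overflow[h(K)]:
        if key == K:
            return record
    return None

insert(1234, "record A"); insert(5678, "record B")
print(search(1234), search(9999))     # record A None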

Dynamic and Extendible Hashing Techniques

 Hashing techniques are adapted to allow the dynamic growth and shrinking of the
number of file records.

These techniques include the following:

o Dynamic hashing
o Extendible hashing
o Linear hashing.

 These hashing techniques use the binary representation of the hash value h(K).
 In dynamic hashing the directory is a binary tree.
 In extendible hashing the directory is an array of size 2^d, where d is called the global
depth.
 The directories can be stored on disk, and they expand or shrink dynamically. Directory
entries point to the disk blocks that contain the stored records.
 An insertion in a disk block that is full causes the block to split into two blocks and the
records are redistributed among the two blocks.
 The directory is updated appropriately.
 Dynamic and extendible hashing do not require an overflow area.
 Linear hashing does require an overflow area but does not use a directory. Blocks are
split in linear order as the file expands.

Dynamic Hashing

 Good for databases that grow and shrink in size


 Allows the hash function to be modified dynamically

Extendable hashing – one form of dynamic hashing

 Hash function generates values over a large range — typically b-bit integers, with b = 32.
 At any time, use only a prefix of the hash value to index into a table of bucket
addresses.
 Let the length of the prefix be i bits, 0 ≤ i ≤ 32.
 Bucket address table size = 2^i. Initially i = 0.
 The value of i grows and shrinks as the size of the database grows and shrinks.
 Multiple entries in the bucket address table may point to the same bucket.
 Thus, the actual number of buckets is ≤ 2^i.
 The number of buckets also changes dynamically due to coalescing and splitting of
buckets (a small sketch appears below).

General Extendable Hash Structure
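A compact sketch of an extendible hash structure is shown below; it uses the low-order bits of the hash value as the directory index, and the bucket capacity and class names are assumptions for illustration, not part of the text.

# Sketch of extendible hashing: a directory of 2**global_depth entries, each
# pointing to a bucket that records its own local depth (illustrative only).
BUCKET_CAPACITY = 2                    # assumed small so splits are easy to see

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}                # key -> record

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 0
        self.directory = [Bucket(0)]   # 2**0 = 1 entry initially

    def _index(self, key):
        # Use the last global_depth bits of h(key) to index the directory.
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key, record):
        bucket = self.directory[self._index(key)]
        if key in bucket.items or len(bucket.items) < BUCKET_CAPACITY:
            bucket.items[key] = record
            return
        # Bucket is full: double the directory only if its local depth equals
        # the global depth, then split the bucket and redistribute its records.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and (i & high_bit):
                self.directory[i] = sibling    # redirect half of the entries
        for k, r in list(bucket.items.items()):
            if self.directory[self._index(k)] is sibling:
                sibling.items[k] = r
                del bucket.items[k]
        self.insert(key, record)               # retry; may trigger another split

    def search(self, key):
        return self.directory[self._index(key)].items.get(key)

eh = ExtendibleHash()
for k in range(8):
    eh.insert(k, "rec%d" % k)
print(eh.global_depth, eh.search(5))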

Linear Hashing

 This is another dynamic hashing scheme, an alternative to Extendible Hashing.


 LH handles the problem of long overflow chains without using a directory, and handles
duplicates.
 Idea: Use a family of hash functions h0, h1, h2,...

hi(key) = h(key) mod (2^i · N); N = initial number of buckets

h is some hash function (its range is not 0 to N−1).
If N = 2^d0, for some d0, hi consists of applying h and looking at the last di bits,
where di = d0 + i.
hi+1 doubles the range of hi (similar to directory doubling); a short sketch of this
hash-function family follows.
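The family of hash functions used by linear hashing can be sketched as below; N and the keys are assumed values for illustration.

# Sketch of the hash-function family h_i used by linear hashing.
N = 4                                  # assumed initial number of buckets (N = 2**d0)

def h(key):
    return hash(key)                   # base hash function; range is not 0..N-1

def h_i(key, i):
    # h_i(key) = h(key) mod (2**i * N); h_{i+1} doubles the range of h_i.
    return h(key) % (2 ** i * N)

# h_0 maps into 0..N-1, h_1 into 0..2N-1, h_2 into 0..4N-1, ...
for key in (13, 27, 41):
    print(key, h_i(key, 0), h_i(key, 1), h_i(key, 2))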
