File Organization & Indexing: Reading: C&B, Appendix C
Introduction
The DBMS has to store data somewhere. The choices are:
Main memory
Expensive compared to secondary and tertiary storage. Fast: in-memory operations are fast. Volatile: data cannot be saved from one run to the next. Used for storing current data.
Disk
Less expensive than main memory. Slower than main memory, but faster than tape. Persistent: data from one run can be saved to disk and used in the next run. Used for storing the database.
Tape
Cheapest. Slowest: sequential data access only. Used for data archives.
Because disk I/O operations are slow, query performance depends on how data is stored on hard disks.
The lowest component of the DBMS performs storage management activities.
Other DBMS components need not know how these low-level activities are performed.
Dept. of Computing Science, University of Aberdeen
File Organization
File organization: the physical arrangement of data in a file into records and pages on the disk.
File organization determines the set of access methods for storing and retrieving records from a file.
Therefore, file organization is synonymous with access method.
We study three types of file organization:
Unordered or heap files
Ordered or sequential files
Hash files
Heap File
Insert Operation
Fast, because the incoming record is written at the end of the last page of the file.
Search Operation
Slow, because a linear search is performed over the pages.
Delete Operation
Slow, because the record to be deleted must first be searched for. Deleting the record creates a hole in the page, so periodic file compaction is required to reclaim the wasted space.
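The three heap file operations can be sketched as follows. This is a toy in-memory model, not how any particular DBMS lays out pages: the class name `HeapFile`, the constant `PAGE_CAPACITY`, and the use of `None` to mark a hole are all assumptions made for illustration.

```python
PAGE_CAPACITY = 2  # hypothetical: number of record slots per page


class HeapFile:
    """Toy unordered file: a list of pages, each page a list of record slots."""

    def __init__(self):
        self.pages = [[]]

    def insert(self, record):
        # Fast: only the last page of the file is touched.
        if len(self.pages[-1]) >= PAGE_CAPACITY:
            self.pages.append([])
        self.pages[-1].append(record)

    def search(self, key):
        # Slow: linear scan over the pages until the key turns up.
        for page_no, page in enumerate(self.pages):
            for slot, record in enumerate(page):
                if record is not None and record["key"] == key:
                    return page_no, slot
        return None

    def delete(self, key):
        # Slow: the record is searched for first; removing it leaves a hole.
        location = self.search(key)
        if location is None:
            return False
        page_no, slot = location
        self.pages[page_no][slot] = None
        return True

    def compact(self):
        # Periodic reorganisation: rewrite the file without the holes.
        live = [r for page in self.pages for r in page if r is not None]
        self.pages = [[]]
        for record in live:
            self.insert(record)
```

Holes are deliberately not reused by `insert` in this sketch, which is why `compact` is needed to reclaim the space.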
Hash File
A hash file is an array of buckets.
Given a record r, a hash function h(r) computes the index of the bucket to which record r belongs.
h uses one or more fields in the record, called hash fields.
Hash key: the key of the file when it is used by the hash function.
Example: assume that the staff last name is used as the hash field. Assume also that the hash file has 26 buckets, one for each letter of the alphabet. Then a hash function can be defined that computes the bucket address (index) from the first letter of the last name.
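A minimal sketch of the first-letter hash function from the example above. The function name `bucket_index` and the sample last names are made up for illustration; the sketch also assumes last names start with a letter A to Z.

```python
import string

NUM_BUCKETS = 26  # one bucket per letter, as in the example


def bucket_index(last_name: str) -> int:
    """Map a last name to a bucket index 0..25 using its first letter."""
    name = last_name.strip()
    if not name or name[0].upper() not in string.ascii_uppercase:
        raise ValueError(f"cannot hash {last_name!r}: first character is not A-Z")
    return ord(name[0].upper()) - ord("A")


# Hypothetical sample data: distribute a few last names over the buckets.
buckets = [[] for _ in range(NUM_BUCKETS)]
for name in ["White", "Beech", "Ford", "Howe", "Brand", "Lee"]:
    buckets[bucket_index(name)].append(name)
```

Note that "Beech" and "Brand" land in the same bucket. A first-letter hash is easy to follow, but it spreads records unevenly because some initials are far more common than others.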
Search Operation
Fast, because the hash function computes the index of the bucket.
Performance may degrade if the record is not found in the bucket suggested by the hash function.
Delete Operation
Fast, once again because the hash function can locate the record quickly.
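The sketch below shows why a search degrades when the record is not in its home bucket. The slides do not say how overflow is handled, so this example assumes one particular scheme: fixed-capacity buckets plus a single shared overflow area that is scanned linearly. `HashFile`, `BUCKET_CAPACITY`, and the first-letter hash are all illustrative assumptions.

```python
BUCKET_CAPACITY = 2  # hypothetical: records per bucket


class HashFile:
    def __init__(self, num_buckets=26):
        self.buckets = [[] for _ in range(num_buckets)]
        self.overflow = []  # assumed: one shared overflow area

    def _home(self, last_name):
        # First-letter hash; assumes the name starts with a letter A-Z.
        return (ord(last_name[0].upper()) - ord("A")) % len(self.buckets)

    def insert(self, last_name):
        bucket = self.buckets[self._home(last_name)]
        if len(bucket) < BUCKET_CAPACITY:
            bucket.append(last_name)
        else:
            self.overflow.append(last_name)  # home bucket is full

    def search(self, last_name):
        """Return (found, number_of_records_examined)."""
        bucket = self.buckets[self._home(last_name)]
        if last_name in bucket:
            return True, bucket.index(last_name) + 1
        # Not in the home bucket: fall back to a linear scan of the overflow.
        examined = len(bucket)
        for record in self.overflow:
            examined += 1
            if record == last_name:
                return True, examined
        return False, examined

    def delete(self, last_name):
        # Same lookup path as search, then remove the record.
        bucket = self.buckets[self._home(last_name)]
        if last_name in bucket:
            bucket.remove(last_name)
            return True
        if last_name in self.overflow:
            self.overflow.remove(last_name)
            return True
        return False
```

With only a few records the difference is small, but as the overflow area grows, a search that misses the home bucket approaches the cost of a heap file scan.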
Indexing
Can we do anything else to improve query performance, other than selecting a good file organization? Yes: the answer lies in indexing.
Index: a data structure that allows the DBMS to locate particular records in a file more quickly.
It is very similar to the index at the end of a book, which is used to locate the various topics covered in the book.
Types of Index
A sparse index has entries for only some of the search key values in the file.
A dense index has an entry for every search key value in the file.
Primary index: one primary index per file.
Clustering index: one clustering index per file. The data file is ordered on a non-key field and the index file is built on that non-key field.
Secondary index: many secondary indexes per file.
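To make the dense/sparse distinction concrete, here is a small sketch that builds both kinds of index over the same ordered file. The page layout is an assumption, and "one entry per page, holding the first key on that page" is just one common way to choose which keys a sparse index keeps.

```python
# Data file sorted on the key; each inner list is one page (assumed layout).
pages = [["B002", "B003"], ["B004", "B005"], ["B007"]]

# Dense index: one (key, page number) entry for every search key value.
dense = [
    (key, page_no)
    for page_no, page in enumerate(pages, start=1)
    for key in page
]

# Sparse index: one entry per page, holding the first key on that page.
sparse = [(page[0], page_no) for page_no, page in enumerate(pages, start=1)]

print(dense)   # [('B002', 1), ('B003', 1), ('B004', 2), ('B005', 2), ('B007', 3)]
print(sparse)  # [('B002', 1), ('B004', 2), ('B007', 3)]
```

The sparse index is smaller, but it only works because the data file is ordered on the indexed field: a key that is missing from the index must lie on the page of the nearest smaller indexed key.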
Primary Indexes
The data file is sequentially ordered on the key field.
The index file stores all (dense) or some (sparse) values of the key field, together with the page number of the data file in which the corresponding record is stored.
Index file (key value, page number of the data file):
B002  1
B003  1
B004  2
B005  2
B007  3

Data file:
Page 1: Branch B002 record, Branch B003 record
Page 2: Branch B004 record, Branch B005 record
Page 3: Branch B007 record

Branch
BranchNo  Street        City      Postcode
B002      56 Clover Dr  London    NW10 6EU
B003      163 Main St   Glasgow   G11 9QX
B004      32 Manse Rd   Bristol   BS99 1NZ
B005      22 Deer Rd    London    SW1 4EH
B007      16 Argyll St  Aberdeen  AB2 3SU
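A lookup through a primary index might look like the sketch below. It uses a sparse variant (first key of each page) over the Branch pages to show how the ordering of the data file is exploited; a dense index would give the page number directly. The function name `find_page` and the in-memory layout are assumptions for illustration.

```python
from bisect import bisect_right

# Data file ordered on BranchNo, grouped into pages.
pages = {1: ["B002", "B003"], 2: ["B004", "B005"], 3: ["B007"]}

# Sparse primary index: first key on each page and that page's number.
sparse_keys = ["B002", "B004", "B007"]
sparse_pages = [1, 2, 3]


def find_page(branch_no):
    """Return the page number holding branch_no, or None if it is absent."""
    # Binary search for the last index entry whose key is <= branch_no.
    pos = bisect_right(sparse_keys, branch_no) - 1
    if pos < 0:
        return None  # smaller than every key in the file
    page_no = sparse_pages[pos]
    # Only this one page is read and searched.
    return page_no if branch_no in pages[page_no] else None
```

For example, `find_page("B005")` consults the index, lands on the entry for B004, and reads only page 2, instead of scanning the file from the start.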
Because the data file must stay in key order, new records cannot simply be appended to it. You need an overflow file, which periodically needs to be merged with the main file.
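The overflow file idea can be sketched as below. This is an illustrative model only: `OrderedFile` and `reorganise` are hypothetical names, and keys stand in for whole records.

```python
from bisect import bisect_left
from heapq import merge


class OrderedFile:
    def __init__(self, sorted_keys):
        self.main = list(sorted_keys)  # main file, kept in key order
        self.overflow = []             # new records, in arrival order

    def insert(self, key):
        # Cheap: the ordered main file is left untouched.
        self.overflow.append(key)

    def search(self, key):
        # Binary search on the main file, then a linear scan of the overflow.
        pos = bisect_left(self.main, key)
        if pos < len(self.main) and self.main[pos] == key:
            return True
        return key in self.overflow

    def reorganise(self):
        # Periodic merge of the overflow file back into the main file.
        self.main = list(merge(self.main, sorted(self.overflow)))
        self.overflow = []
```

Until `reorganise` runs, every unsuccessful binary search also pays for a scan of the overflow file, which is why the merge has to happen periodically.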
Secondary Indexes
An index file that uses a non-primary field as its index, e.g. the City field in the Branch table.
Secondary indexes improve the performance of queries that use attributes other than the primary key.
You can use a separate index for every attribute you wish to use in the WHERE clause of your SELECT query.
But there is the overhead of maintaining a large number of these indexes.
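As a rough sketch, a secondary index on City can be modelled as a mapping from each city to the positions of the matching records. The names `city_index`, `branches_in`, and `add_branch` are hypothetical, and a real DBMS would store page addresses rather than list positions.

```python
from collections import defaultdict

# Branch records: (BranchNo, Street, City, Postcode), ordered on BranchNo.
branches = [
    ("B002", "56 Clover Dr", "London", "NW10 6EU"),
    ("B003", "163 Main St", "Glasgow", "G11 9QX"),
    ("B004", "32 Manse Rd", "Bristol", "BS99 1NZ"),
    ("B005", "22 Deer Rd", "London", "SW1 4EH"),
    ("B007", "16 Argyll St", "Aberdeen", "AB2 3SU"),
]

# Secondary index on the non-key field City: city -> record positions.
city_index = defaultdict(list)
for position, (_, _, city, _) in enumerate(branches):
    city_index[city].append(position)


def branches_in(city):
    """Answer a query like WHERE City = ... without scanning the whole file."""
    return [branches[pos][0] for pos in city_index.get(city, [])]


def add_branch(record):
    # Maintenance overhead: every insert must also update the index,
    # and the same applies to each additional secondary index.
    branches.append(record)
    city_index[record[2]].append(len(branches) - 1)
```

Because City is not unique, one index entry can point to several records; `branches_in("London")` returns both B002 and B005.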
Summary
File organization or access method determines the performance of search, insert and delete operations.
Access methods are the primary means of achieving improved performance.