File Organizations and Indexes
Rangana Jayashanka
Introduction
• Databases are stored physically as files of
records, which are typically stored on
magnetic disks.
• We will discuss organization of databases in
storage and the techniques for accessing them
efficiently using various algorithms, some of
which require auxiliary data structures called
indexes.
Introduction
• Data stored on disk is organized as files of
records.
• There are primary file organizations, which
determine how the records of a file are
physically placed on the disk, and hence how
the records can be accessed.
Introduction
• Heap file (unordered file) – places the records on
disk in no particular order by appending new
records at the end of the file.
• Sorted file (sequential file) – keeps the records
ordered by the value of a particular field (sort key).
• Hashed file – uses a hash function applied to a
particular field (hash key) to determine a record’s
placement on disk.
• B-tree – uses a tree structure.
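To make the difference between the first two organizations concrete, here is a minimal in-memory sketch. It assumes records are Python tuples and a "file" is a plain list; the function names heap_insert and sorted_insert are hypothetical and only illustrate where a new record ends up, not how a real DBMS writes to disk.

```python
import bisect

def heap_insert(records, record):
    """Heap file: append the new record at the end, in no particular order."""
    records.append(record)

def sorted_insert(records, record, key_index=0):
    """Sorted file: insert the record so the list stays ordered by the sort key."""
    keys = [r[key_index] for r in records]
    pos = bisect.bisect_right(keys, record[key_index])
    records.insert(pos, record)

heap_file, sorted_file = [], []
for rec in [(30, "Perera"), (10, "Silva"), (20, "Fernando")]:
    heap_insert(heap_file, rec)      # keeps arrival order
    sorted_insert(sorted_file, rec)  # keeps sort-key order
```

After the loop, heap_file holds the records in arrival order (30, 10, 20), while sorted_file holds them ordered by the first field (10, 20, 30).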
Introduction
• A secondary organization or auxiliary access
structure allows efficient access to the records
of a file based on fields other than those used
for the primary file organization.
• Most of these exist as indexes.
Files
• A file is a sequence of records.
• In many cases, all records in a file are of the
same record type.
• If every record in the file has exactly the same
size (in bytes), the file is said to be made up of
fixed-length records.
• If different records in the file have different
sizes, the file is said to be made up of
variable-length records.
Variable length records
• A file may have variable-length records for
several reasons.
1. One or more of the fields are of varying size
(variable-length fields), e.g. a NAME field.
2. One or more fields may have multiple values
for individual records.
3. One or more fields are optional.
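The difference between fixed-length and variable-length records can be sketched with Python's struct module. The layout below is an assumption made only for illustration: a 20-byte NAME field plus a 4-byte ENO for the fixed-length case, and a 2-byte length prefix in front of NAME for the variable-length case. The function names are hypothetical.

```python
import struct

# Assumed fixed-length layout: 20-byte NAME (padded with zero bytes), 4-byte ENO.
FIXED = struct.Struct("<20sI")

def pack_fixed(name, eno):
    """Every record occupies exactly FIXED.size bytes, whatever the name length."""
    return FIXED.pack(name.encode("utf-8"), eno)

def pack_variable(name, eno):
    """Variable-length record: 2-byte length prefix, NAME bytes, then 4-byte ENO."""
    data = name.encode("utf-8")
    return struct.pack("<H", len(data)) + data + struct.pack("<I", eno)

def unpack_variable(buf):
    """Read the length prefix first to find out where NAME ends."""
    (n,) = struct.unpack_from("<H", buf, 0)
    name = buf[2:2 + n].decode("utf-8")
    (eno,) = struct.unpack_from("<I", buf, 2 + n)
    return name, eno
```

With this layout, pack_fixed always produces 24 bytes, while pack_variable("Ann", 1) produces 9 bytes and pack_variable("Jayashanka", 2) produces 16 bytes, so record size depends on the NAME value.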
Hashing Technology
• Provides very fast access to records on certain
search conditions.
• This organization is usually called a hash file.
• The search condition must be an equality
condition on a single field, called the hash field
of the file.
• In most cases, the hash field is also a key field
of the file, in which case it is called the hash
key.
Hashing Technology
• The idea behind hashing is to provide a
function h, called a hash function or
randomization function.
• The hash function is applied to the hash field
value of a record and yields the address of the
disk block in which the record is stored.
Hashing Technology
[Figure: an array of M record slots, numbered 0
to M-1; each slot holds one record with the
fields Name, ENO, JOB, Salary.]
Hashing Technology
• Hashing is typically implemented as a hash
table through the use of an array of records.
• Assume the array index range is from 0 to M-1;
then we have M slots whose addresses
correspond to the array indexes.
• We choose a hash function that transforms
the hash field value into an integer between 0
and M-1, e.g. h(K) = K mod M.
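As a small illustration of h(K) = K mod M, the sketch below assumes M = 7 slots and integer hash field values (for example, ENO numbers). The value of M and the sample keys are chosen only for the example.

```python
M = 7  # assumed number of slots, addresses 0 .. M-1

def h(key):
    """Hash function h(K) = K mod M: maps an integer key to a slot address."""
    return key % M

# Place a few sample keys into an array of M slots.
table = [None] * M
for key in (10, 23, 14):
    table[h(key)] = key
```

Here h(10) = 3, h(23) = 2 and h(14) = 0, so the three keys land in slots 3, 2 and 0 respectively.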
Collision
• The problem with most hashing functions is
that they do not guarantee that distinct values
will hash to distinct addresses.
• The hash field space (the number of possible
values a hash field can take) is usually much
larger than the address space (the number of
available addresses for records).
• The hashing function maps the hash field
space to the address space.
Collision
• A collision occurs when the hash field value of
a record that is being inserted hashes to an
address that already contains a different
record.
• In this situation, the new record must be
placed in some other position. The process of
finding another position is called collision
resolution.
Collision resolution methods
1. Open Addressing
2. Chaining
3. Multiple Hashing
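Two of these methods can be sketched in a few lines. This is a simplified in-memory illustration, not a disk-based implementation: open addressing is shown as linear probing (trying successive slots until an empty one is found), and chaining is shown with a Python list per slot standing in for a chain of overflow records. M = 5 and the function names are assumptions for the example; multiple hashing, which applies a second hash function when the first one collides, is not shown.

```python
M = 5  # assumed number of slots

def h(key):
    return key % M

def insert_open_addressing(table, key):
    """Linear probing: start at h(key) and try successive slots until one is empty."""
    for i in range(M):
        slot = (h(key) + i) % M
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

def insert_chaining(table, key):
    """Chaining: colliding keys are kept together in the chain of their home slot."""
    slot = h(key)
    table[slot].append(key)
    return slot

# The keys 12, 7 and 22 all hash to slot 2, so they collide.
open_table = [None] * M
placed = [insert_open_addressing(open_table, k) for k in (12, 7, 22)]

chain_table = [[] for _ in range(M)]
for k in (12, 7, 22):
    insert_chaining(chain_table, k)
```

With open addressing the three keys end up in slots 2, 3 and 4; with chaining all three stay in the chain of slot 2.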
Blocking
• The records of a file must be allocated to disk
blocks because a block is the unit of data
transfer between disk and memory.
• Blocking refers to storing a number of records
in one block on the disk.
• Blocking factor (bfr) refers to the number of
records per block.
Blocking
• Blocking factor bfr = ⌊B/r⌋ (B/r rounded
down), the maximum number of records that can
be stored in a block.
B - block size (bytes)
r - record length (bytes)
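A short worked example of the blocking factor, assuming fixed-length records that are not split across blocks. The values B = 512 and r = 100 are chosen only for illustration, and blocks_needed is a hypothetical helper that computes how many blocks a file with a given number of records occupies.

```python
import math

def blocking_factor(block_size, record_length):
    """bfr = floor(B / r): whole records that fit in one block."""
    if record_length <= 0 or record_length > block_size:
        raise ValueError("record length must be between 1 and the block size")
    return block_size // record_length

def blocks_needed(num_records, bfr):
    """Number of blocks required to hold num_records at bfr records per block."""
    return math.ceil(num_records / bfr)

B, r = 512, 100               # assumed block size and record length in bytes
bfr = blocking_factor(B, r)   # 512 // 100 = 5 records per block
unused = B - bfr * r          # 512 - 500 = 12 bytes left over in each block
```

With these values, each block holds 5 records and leaves 12 bytes unused, and a file of 1000 records needs blocks_needed(1000, bfr) = 200 blocks.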