File Organizations and Indexes

The document discusses file organizations and indexing structures in databases, detailing how records are stored on disks and accessed efficiently through various methods such as heap, sorted, and hashed files. It explains the importance of indexing for optimizing database performance, including primary and clustering indexes, and addresses challenges like collisions in hashing and the limitations of primary indexes. Additionally, it covers the characteristics of dense and sparse indexes and their applications in improving data retrieval speed.

Rangana Jayashanka
Introduction
• Databases are stored physically as files of
records, which are typically stored on
magnetic disks.
• We will discuss organization of databases in
storage and the techniques for accessing them
efficiently using various algorithms, some of
which require auxiliary data structures called
indexes.
Introduction
• The data stored in disk is organized as files of
records.
• There are primary file organizations, which
determine how the records of a file are
physically placed on the disk, and hence how
the records can be accessed.
Introduction
• Heap file (unordered file) – places the records on
disk in no particular order by appending new
records at the end of the file.
• Sorted file (sequential file) – keeps the records
ordered by the value of a particular field (sort key).
• Hashed file – uses a hash function applied to a
particular field (hash key) to determine a record’s
placement on disk.
• B-trees – use tree structures.
Introduction
• A secondary organization or auxiliary access
structure allows efficient access to the records
of a file based on fields other than those
that have been used for the primary file
organization.
• Most of these exist as indexes.
Files
• A file is a sequence of records.
• In many cases, all records in a file are of the
same record type.
• If every record in the file has exactly the same
size (in bytes), the file is said to be made up of
fixed-length records.
• If different records in the file have different
sizes, the file is said to be made up of
variable-length records.
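With fixed-length records, the position of the i-th record can be computed directly from the record size. The following is a minimal sketch of that idea; the field widths and the sample job and salary values are assumptions made for illustration, not part of the slides.

```python
import struct

# Assumed layout: 20-byte name, 4-byte employee number,
# 10-byte job title, 4-byte salary (38 bytes per record).
RECORD = struct.Struct("<20sI10sI")

def pack(name, eno, job, salary):
    """Encode one record; short strings are padded with null bytes."""
    return RECORD.pack(name.encode(), eno, job.encode(), salary)

def read_record(data, i):
    """Read record i by jumping to offset i * record size."""
    name, eno, job, salary = RECORD.unpack_from(data, i * RECORD.size)
    return (name.rstrip(b"\x00").decode(), eno,
            job.rstrip(b"\x00").decode(), salary)

# Two hypothetical records stored back to back, as in a file.
file_bytes = pack("Abbas", 1, "Clerk", 1000) + pack("Agarkar", 2, "Engineer", 1500)
second = read_record(file_bytes, 1)
```

With variable-length records this direct offset calculation is no longer possible, because the start of record i depends on the sizes of all earlier records.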
Variable length records
• A file may have variable-length records for
several reasons.
1. One or more of the fields are of varying size.
(variable length fields) Ex: NAME field
2. One or more field may have multiple values
for individual records.
3. One or more fields are optional.
Hashing Technology
• Provides very fast access to records on certain
search conditions.
• This organization is usually called a hash file.
• The search condition must be an equality
condition on a single field, called the hash field
of the file.
• In most cases, the hash field is also a key field
of the file, in which case it is called the hash
key.
Hashing Technology
• The idea behind hashing is to provide a
function h, called a hash function or
randomization function.
• The hash function is applied to the hash field
value of a record and yields the address of the
disk block in which the record is stored.
Hashing Technology
[Figure: a hash table with M slots, addressed 0, 1, 2, 3, …, M-2, M-1, holding records with the fields Name, ENO, JOB and Salary.]
Hashing Technology
• Hashing is typically implemented as a hash
table through the use of an array of records.
• Assume the array index range is from 0 to M-1;
then we have M slots whose addresses
correspond to the array indexes.
• We choose a hash function that transforms
the hash field value into an integer between 0
and M-1, e.g. h(K) = K mod M.
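The mapping h(K) = K mod M can be sketched in a few lines. This is only an illustration: the table size M = 7 and the employee numbers are assumed values, and a real hash file would map to disk block addresses rather than to a Python list.

```python
M = 7  # number of slots, assumed small for illustration

def h(key):
    """Map an integer hash field value to a slot address in 0..M-1."""
    return key % M

# Hypothetical employee numbers (ENO) used as hash field values.
table = [None] * M
for eno in (15, 23, 40):
    table[h(eno)] = eno
```

Here the three keys land in different slots; keys such as 15 and 22 would both map to slot 1.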
Collision
• The problem with most hashing functions is that
they do not guarantee that distinct values will
hash to distinct addresses.
• Hash field space – the number of possible
values a hash field can take is usually much
larger than the address space – the number of
available addresses for records.
• The hashing function maps the hash field
space to the address space.
Collision
• A collision occurs when the hash field value of
a record that is being inserted hashes to an
address that already contains a different
record.
• The process of finding another position is
called collision resolution.
Collision resolution methods
1. Open Addressing
2. Chaining
3. Multiple Hashing
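Two of the methods listed above can be sketched as follows, assuming the same h(K) = K mod M function and an in-memory table. Linear probing is used here as one simple form of open addressing; multiple hashing is not shown. The keys are assumed values chosen so that they all collide.

```python
M = 7  # number of slots, assumed for illustration

def h(key):
    return key % M

def insert_open_addressing(table, key):
    """Open addressing (linear probing): starting at h(key),
    place the key in the first free slot found."""
    for step in range(M):
        slot = (h(key) + step) % M
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")

def insert_chaining(buckets, key):
    """Chaining: each slot keeps a list of all keys that hash to it."""
    buckets[h(key)].append(key)
    return h(key)

open_table = [None] * M
chained = [[] for _ in range(M)]
for key in (15, 22, 29):  # all three keys hash to slot 1
    insert_open_addressing(open_table, key)
    insert_chaining(chained, key)
```

With open addressing the colliding keys spill into the following slots, while with chaining they stay together in the overflow list of slot 1.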
Blocking
• The records of a file must be allocated to disk
blocks because a block is the unit of data
transfer between disk and memory.
• Blocking: refers to storing a number of records
in one block on the disk.
• Blocking factor ( bfr ) refers to the number of
records per block.
Blocking
• Blocking factor bfr = ⌊B/R⌋, the maximum no. of
records that can be stored in a block.
B - block size (bytes)
R - record length (bytes)
• Normally R does not divide B exactly,
therefore there is unused space of
B − (bfr × R) bytes in each block.
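The blocking factor and the unused space per block can be computed as below. This is a small sketch for unspanned, fixed-length records; the block size of 512 bytes and the record length of 100 bytes are assumed values.

```python
def blocking_factor(block_size, record_length):
    """Whole records that fit in one block (floor of B / record length)."""
    return block_size // record_length

def unused_bytes(block_size, record_length):
    """Bytes left over in each block when records are not spanned."""
    return block_size - blocking_factor(block_size, record_length) * record_length

B = 512               # assumed block size in bytes
record_length = 100   # assumed record length in bytes
bfr = blocking_factor(B, record_length)
wasted = unused_bytes(B, record_length)
```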
Indexing Structures
• We assume that a file already exists with some
primary organization such as the unordered,
ordered, or hashed organizations.
• We will describe additional auxiliary access
structures called indexes.
• Indexes are used to speed up the retrieval of
records in response to certain search conditions.
Indexing Structures
• The Indexing Structures typically provide
secondary access paths, which provide
alternative ways of accessing the records
without affecting the physical placement of
records on disk.
• They enable efficient access to records based
on the indexing fields that are used to
construct the index.
Indexing Structures
• Any field of the file can be used to create an
index and multiple indexes on different fields
can be constructed on the same file.
• Indexes based on ordered files – single-level
indexes
• Indexes based on tree data structures –
multilevel indexes, B+ trees
Indexing Structures
• Single-level ordered indexes.
primary
secondary
clustering
• By viewing a single-level index as an ordered
file, one can develop additional indexes for it,
giving rise to the concept of multilevel
indexes.
Single-level Ordered Indexes
• The idea behind an ordered index access
structure is similar to that behind the index
used in a textbook, which lists important terms
at the end of the book in alphabetical order
along with a list of page numbers where each
term appears in the book.
Single-level Ordered Indexes
• An indexing access structure is usually defined
on a single field of a file, called an indexing
field.
• The index typically stores each value of the
index field along with a list of pointers to all
disk blocks that contain records with that field
value.
• The values in the index are ordered so that we
can do a binary search on the index.
Single-level Ordered Indexes
• The index file is much smaller than the data
file, so searching the index using a binary
search is reasonably efficient.
• Types of single-level indexes
1. Primary indexes
2. Clustering indexes
3. Secondary indexes
Single-level Ordered Indexes
OrderID  CustID  Value  Date
001      1       1000   2023-01-10
002      2       1500   2023-02-15
003      1       2000   2023-01-17
……….     3
……….     1
……….     1
020      2       2750   2023-02-10
Single-level Ordered Indexes
• SELECT * FROM Orders WHERE CustID = 1 ORDER BY Date;
• EXPLAIN SELECT * FROM Orders WHERE CustID = 1 ORDER BY Date;
• CREATE INDEX CustIndex ON Orders(CustID);
• EXPLAIN SELECT * FROM Orders WHERE CustID = 1 ORDER BY Date;
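The before-and-after comparison above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the slides do not say which database system is used, SQLite spells the command EXPLAIN QUERY PLAN rather than EXPLAIN, and only the four fully shown rows of the Orders table are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID TEXT, CustID INTEGER, Value INTEGER, Date TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        ("001", 1, 1000, "2023-01-10"),
        ("002", 2, 1500, "2023-02-15"),
        ("003", 1, 2000, "2023-01-17"),
        ("020", 2, 2750, "2023-02-10"),
    ],
)

QUERY = "SELECT * FROM Orders WHERE CustID = 1 ORDER BY Date"

def plan(sql):
    """Return the textual steps SQLite reports for executing a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(QUERY)   # no index yet: the table is scanned
conn.execute("CREATE INDEX CustIndex ON Orders(CustID)")
plan_after = plan(QUERY)    # the plan now refers to CustIndex

rows = conn.execute(QUERY).fetchall()
```

Printing plan_before and plan_after shows the change from a full table scan to a search that uses CustIndex; the exact wording of the plan text depends on the SQLite version.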
Primary Indexes
• A primary index is an ordered file whose
records are of fixed length with two fields.
• First field – ordering key field (primary key).
• Second field – pointer to the disk block (a
block address).
• There is one index entry in the index file for
each block in the data file.
Primary Indexes
• Each index entry has the value of the primary
key field for the first record in a block and a
pointer to that block as its two field values.
• We refer to index entry i as <K(i), P(i)>.
Primary Indexes
• We use the NAME field to create a primary index
on the ordered file.
• Assume that each value of NAME is unique.
• Each entry in the index has a NAME value and a
pointer.
<K(1) = (Abbas), P(1) = address of block 1>
<K(2) = (Agarkar), P(2) = address of block 2>
<K(3) = (Akthar), P(3) = address of block 3>
Primary Indexes
• The total number of entries in the index is the
same as the number of disk blocks in the
ordered data file.
• The first record in each block of the data file is
called the anchor record of the block, or
simply the block anchor.
• Indexes can also be characterized as dense or
sparse.
Primary Indexes
• Dense index – has an index entry for every
search key value in the data file.
• Sparse index – has index entries for only some
of the search values.
• A primary index is a sparse (nondense) index. It
includes one entry for each disk block of the
data file, holding the key of the block's anchor
record, rather than one entry for every record.
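A lookup through a sparse primary index can be sketched as follows: a binary search over the anchor keys finds the last anchor that is not greater than the search key, and only that one block is then read. The names other than Abbas, Agarkar and Akthar, the two-records-per-block layout, and the use of list positions as block addresses are assumptions for illustration.

```python
from bisect import bisect_right

# Ordered data file split into blocks (2 records per block, assumed).
blocks = [
    ["Abbas", "Adams"],
    ["Agarkar", "Ahmed"],
    ["Akthar", "Ali"],
]

# One index entry <K(i), P(i)> per block: anchor key and block number.
index = [(block[0], block_no) for block_no, block in enumerate(blocks)]
anchors = [key for key, _ in index]

def lookup(name):
    """Return the block number holding `name`, or None if it is absent."""
    pos = bisect_right(anchors, name) - 1   # last anchor <= name
    if pos < 0:
        return None                         # smaller than every anchor
    block_no = index[pos][1]
    return block_no if name in blocks[block_no] else None
```

For example, lookup("Ahmed") searches three index entries and then reads only block 1, even though Ahmed itself has no index entry.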
Primary Indexes
• Index file for a primary index needs
substantially fewer blocks than does the data
file, for two reasons.
1. There are fewer index entries than there are
records in the data file.
2. Each index entry is typically smaller in size
than a data record because it has only two
fields.
Primary Indexes
• A binary search on the index file hence requires
fewer block accesses than a binary search on the
data file.
• A binary search on an ordered data file of b
blocks requires about log₂(b) block accesses.
• With a primary index file of bi blocks, locating
a record requires a total of log₂(bi) + 1 accesses.
• The log₂(bi) accesses locate the index entry for
the search key, and the one additional access
reads the data block it points to.
Average Access Time for Basic File Organizations

Type              Access Method   Average time to access a specific record (block accesses)
Heap (Unordered)  Linear Search   b/2
Ordered           Linear Search   b/2
Ordered           Binary Search   log₂(b)
Primary Indexes – Example 1
Suppose that we have an ordered file with r = 30000 records stored on a disk
with block size B = 1024 bytes. File records are of fixed size and are unspanned
with record length R = 100 bytes. How many block accesses are required to
search a record on the data file?

The blocking factor for the file is bfr = ⌊B/R⌋
= ⌊1024/100⌋
= 10 records per block.
The number of blocks needed for the file is b = ⌈r/bfr⌉
= ⌈30,000/10⌉ = 3000 blocks.
A binary search on the data file would need approximately
⌈log₂(b)⌉
= ⌈log₂(3000)⌉ = 12 block accesses.
Primary Indexes – Example 1
Now suppose that the ordering key field of the file is V = 9 bytes
long, a block pointer is P = 6 bytes long, and we have constructed
a primary index for the file. How many block accesses are
required to search a record using the index?

The size of each index entry is Ri = (9 + 6) = 15 bytes, so the
blocking factor for the index is bfri = ⌊B/Ri⌋ = ⌊1024/15⌋ = 68
entries per block.

The total number of index entries ri is equal to the number of
blocks in the data file, which is 3000.
Primary Indexes – Example 1
• The number of index blocks is hence bi = ⌈ri/bfri⌉
= ⌈3000/68⌉ = 45 blocks.
• A binary search on the index file would need
⌈log₂(bi)⌉ = ⌈log₂(45)⌉ = 6 block accesses.
• To search for a record using the index, we need
one additional block access to the data file for a
total of 6 + 1 = 7 block accesses -- an
improvement over binary search on the data file,
which required 12 block accesses.
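The arithmetic of Example 1 can be checked with a short script. The variable names follow the example (r, B, R, V, P); the helper function name is an assumption introduced here.

```python
import math

def binary_search_accesses(num_blocks):
    """Approximate block accesses for a binary search over num_blocks blocks."""
    return math.ceil(math.log2(num_blocks))

r, B, R = 30000, 1024, 100   # records, block size, record length
V, P = 9, 6                  # key field size, block pointer size

bfr = B // R                              # records per data block
b = math.ceil(r / bfr)                    # data blocks
data_accesses = binary_search_accesses(b)

Ri = V + P                                # index entry size
bfri = B // Ri                            # index entries per block
bi = math.ceil(b / bfri)                  # index blocks (one entry per data block)
index_accesses = binary_search_accesses(bi) + 1   # +1 to read the data block
```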
Limitations of Primary indexes
• The main problem with a primary index is
insertion and deletion of records.
• If we attempt to insert a record in its correct
position in the data file, we have to not only
move records to make space for new record
but also change some index entries, since
moving records will change the anchor records
of some blocks.
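The effect of an insertion on the anchor records can be seen in a small simulation. It assumes completely full blocks of two records with no free or overflow space, which is the worst case; the names other than Abbas, Agarkar and Akthar are hypothetical.

```python
def to_blocks(records, bfr):
    """Split an ordered list of records into blocks of bfr records."""
    return [records[i:i + bfr] for i in range(0, len(records), bfr)]

def anchor_records(blocks):
    """The first record of each block is its anchor."""
    return [block[0] for block in blocks]

records = ["Abbas", "Adams", "Agarkar", "Ahmed", "Akthar", "Ali"]
before = anchor_records(to_blocks(records, 2))

# Inserting "Aaron" at its correct position shifts every later record.
after = anchor_records(to_blocks(sorted(records + ["Aaron"]), 2))

changed = sum(1 for old, new in zip(before, after) if old != new)
```

In this case all three original index entries have to be rewritten, and a fourth entry is needed for the new last block.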
Summary
Indexing is used to optimize the performance
of a database by minimizing the number of
disk accesses required when a query is
processed.
The index is a type of data structure.
It is used to locate and access the data in a
database table quickly.
Summary
• Index structure:
Indexes can be created using some database
columns.
Summary
• The first column of the index is the search key, which
contains a copy of the primary key or candidate key of the
table.
• The values of the search key are stored in sorted order
so that the corresponding data can be accessed easily.
• The second column of the index is the data
reference. It contains a set of pointers holding the
address of the disk block where the value of the
particular key can be found.
Indexing Methods
• Primary Index − Primary index is defined on
an ordered data file. The data file is ordered
on a key field. The key field is generally the
primary key of the relation.
• Clustering Index − Clustering index is defined
on an ordered data file. The data file is
ordered on a non-key field.
Indexing Methods
• An ordered index can be classified into two
types: dense index and sparse index.
• The dense index contains an index record for
every search key value in the data file. It
makes searching faster.
• In a sparse index, index records appear only
for some of the items in the data file. Each
index record points to a block.
Indexing Methods
Dense index
The dense index contains an index record for every
search key value in the data file. It makes searching
faster, because every key can be found in the index itself.
The number of records in the index table is the
same as the number of records in the main table.
It needs more space to store the index records themselves.
Each index record has the search key and a pointer to
the actual record on the disk.
Indexing Methods
Sparse index
In a sparse index, index records appear only for
some of the items in the data file. Each index
record points to a block.
Instead of pointing to each record in the main
table, the index points to records in the main
table at intervals.
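The difference in size between the two kinds of index can be illustrated on a tiny ordered file. The key values and the block layout below are assumptions; the dense index stores a (block, offset) pointer per record, while the sparse index stores one block pointer per block.

```python
# Ordered data file: three blocks of two records each (assumed keys).
blocks = [[10, 20], [30, 40], [50, 60]]

# Dense index: one entry per search key value, pointing at the record.
dense = [
    (key, (block_no, offset))
    for block_no, block in enumerate(blocks)
    for offset, key in enumerate(block)
]

# Sparse index: one entry per block, keyed by the block's first record.
sparse = [(block[0], block_no) for block_no, block in enumerate(blocks)]
```

The dense index has six entries and the sparse index three, which is why the sparse index needs less space but requires a search inside the block it points to.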
Indexing Methods
Clustering Index
A clustering index is defined on an ordered data
file whose ordering field is a non-key field, so
its values may not be unique for each record.
Records with the same value of that field are
stored together, and the index has one entry for
each distinct value.
Some descriptions instead group two or more
columns to get a unique value and create the
index out of them.
Indexing Methods
Clustering Index
Example: suppose a company has several employees in
each department. Suppose we use a clustering index, where all
employees who belong to the same Dept_ID are considered
to be within a single cluster, and index pointers point to the
cluster as a whole. Here Dept_ID is a non-unique key.
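The department example can be sketched as follows. The employee names, the blocking factor of 2, and the use of list positions as block addresses are assumptions; the index keeps one entry per distinct Dept_ID, pointing to the first block of that cluster.

```python
# Data file ordered on the non-unique field Dept_ID (hypothetical rows).
records = [(1, "Ann"), (1, "Bob"), (2, "Cid"), (2, "Dee"), (2, "Eve"), (3, "Fay")]
bfr = 2
blocks = [records[i:i + bfr] for i in range(0, len(records), bfr)]

# Clustering index: Dept_ID -> first block containing that Dept_ID.
clustering_index = {}
for block_no, block in enumerate(blocks):
    for dept_id, _ in block:
        clustering_index.setdefault(dept_id, block_no)

def employees_in(dept_id):
    """Follow the index to the cluster, then read blocks until it ends."""
    start = clustering_index.get(dept_id)
    if start is None:
        return []
    found = []
    for block in blocks[start:]:
        matches = [name for d, name in block if d == dept_id]
        if not matches:
            break   # the file is ordered, so the cluster has ended
        found.extend(matches)
    return found
```

Because a cluster may start in the middle of a block and continue into the next one, the lookup for department 2 reads two consecutive blocks.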