
L2.2-File Organization Techniques

File organization techniques include unordered files, ordered files, and hash files. Unordered files store records in the order they are inserted, making insertion efficient but searching inefficient. Ordered files store records in sorted order, enabling efficient searching but expensive insertion. Hash files map keys to addresses using a hash function, allowing constant-time access but requiring collision resolution methods. Indexing improves access by creating an index structure that maps attribute values to records.

Uploaded by

Edmunds Larry

File Organization Techniques

What you will learn

• Unordered files (heap files)
• Ordered files (sorted files)
• Hash files
Unordered files
• Also known as a heap file
• Records are placed in the file in the same order they were inserted, so new records are placed at the end of the file.

Inserting a record is efficient:
• Take the last disk block into a buffer;
• Add the new record to it;
• Rewrite the block back to the disk.

Deleting a record:
• Find the corresponding block;
• Copy the block into a buffer;
• Delete the record from the buffer;
• Rewrite the block back to the disk.

Searching a record:
• Is not efficient.
• It needs scanning the file record by record (linear search).

Deletion leaves unused space in the disk block, resulting in wasted storage space.
The file must be periodically reorganized to reclaim the unused space.
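
The insert, delete, search and reorganize steps above can be sketched in a few lines of Python. This is only an illustration: the class name HeapFile, the block size of 3 records and the use of None to mark a deleted slot are assumptions, not part of the slides.

```python
BLOCK_SIZE = 3  # records per block (assumed, kept tiny for illustration)


class HeapFile:
    def __init__(self):
        # Each block is a list of (key, value) records; None marks a deleted slot.
        self.blocks = [[]]

    def insert(self, record):
        # Take the last block, add the record, start a new block if it is full.
        if len(self.blocks[-1]) == BLOCK_SIZE:
            self.blocks.append([])
        self.blocks[-1].append(record)

    def search(self, key):
        # Linear search: scan block by block, record by record.
        for block_no, block in enumerate(self.blocks):
            for slot, record in enumerate(block):
                if record is not None and record[0] == key:
                    return block_no, slot
        return None

    def delete(self, key):
        position = self.search(key)
        if position is None:
            return False
        block_no, slot = position
        self.blocks[block_no][slot] = None  # leaves unused space in the block
        return True

    def reorganize(self):
        # Repack the live records to reclaim the space left by deletions.
        live = [r for block in self.blocks for r in block if r is not None]
        self.blocks = [live[i:i + BLOCK_SIZE]
                       for i in range(0, len(live), BLOCK_SIZE)] or [[]]


heap = HeapFile()
for key in [8, 5, 12, 6, 15]:
    heap.insert((key, f"rec{key}"))
heap.delete(5)
wasted_before = sum(r is None for block in heap.blocks for r in block)
heap.reorganize()
wasted_after = sum(r is None for block in heap.blocks for r in block)
```

After the delete there is one wasted slot; after reorganize() there is none, at the cost of rewriting the whole file.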
Ordered files
• Also known as a sequential file
• Records in a file can be sorted based on the values of one or more fields, called the ordering field

Example: File sorted with name as ordering field


Ordered files
Binary Search

1. Retrieve the middle page of the file.
• Check whether the required record lies between the first and the last records of the page.
• If so, the required record, if it exists, is on this page and no more pages need to be retrieved.

2. If the value of the key field in the first record on the page is greater than the required value, the required value, if it exists, occurs on an earlier page.
• Therefore, we repeat the above steps using the lower half of the file as the new search area.

3. If the value of the key field in the last record on the page is less than the required value, the required value occurs on a later page, and so we repeat the above steps using the upper half of the file as the new search area.

Using binary search, half of the remaining search space is eliminated with each page retrieved.
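
A minimal Python sketch of the page-level binary search described above. The function name find_page and the sample pages are assumptions made for illustration; each page is modelled as a sorted, non-empty list of keys.

```python
def find_page(pages, key):
    """Return (page_number or None, pages_read) for a file of sorted pages."""
    lo, hi = 0, len(pages) - 1
    pages_read = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        page = pages[mid]          # retrieve the middle page of the search area
        pages_read += 1
        if key < page[0]:          # smaller than the first record: earlier page
            hi = mid - 1
        elif key > page[-1]:       # larger than the last record: later page
            lo = mid + 1
        else:                      # key lies within this page's range
            return (mid if key in page else None), pages_read
    return None, pages_read


pages = [[3, 4], [5, 6], [7, 8], [9, 10], [12, 15]]
hit = find_page(pages, 9)
miss = find_page(pages, 11)
```

With five pages, no search in this example reads more than three of them.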
Ordered files

Searching a record is efficient:
• But searching on the non-ordering field values is the same as in the unordered file.

Insertion can be expensive:
• Find the position to insert the record with key k;
• Make space for the record with k;
• Put the record with k in place.

Deletion can be done efficiently:
• Find the record;
• Delete it and mark the place;
• Periodically reorganize the file.
Ordered files
Two methods for making insertion more efficient
• Keep some unused space in each block.
• Maintain a temporary unordered file called an overflow or transaction file. Periodically, the overflow file is merged with the main file.
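
The overflow-file method can be sketched as follows. The class name OrderedFileWithOverflow and the in-memory lists standing in for the two files are assumptions made for illustration only.

```python
import bisect


class OrderedFileWithOverflow:
    def __init__(self, keys=()):
        self.main = sorted(keys)   # ordered main file
        self.overflow = []         # unordered overflow (transaction) file

    def insert(self, key):
        # Cheap: append to the overflow file, the main file is not shifted.
        self.overflow.append(key)

    def search(self, key):
        # Binary search on the main file, then a linear scan of the overflow file.
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return True
        return key in self.overflow

    def merge(self):
        # Periodic reorganization: merge the overflow file into the main file.
        self.main = sorted(self.main + self.overflow)
        self.overflow = []


ordered_file = OrderedFileWithOverflow([3, 7, 10])
ordered_file.insert(5)
found_before_merge = ordered_file.search(5)
ordered_file.merge()
```

The trade-off is that every search may also have to scan the overflow file until the next merge.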
Searching files
Linear search, O(n)
• Elements are placed in arbitrary order in an array

A:      8  5 12  6 15  9  4  3  7 10
Index:  0  1  2  3  4  5  6  7  8  9

Binary search, O(log n)
• Elements are placed in sorted order in an array

A:      3  4  5  6  7  8  9 10 12 15
Index:  0  1  2  3  4  5  6  7  8  9

Both searches take longer as n grows, so a technique that aims for constant time, O(1), is needed: hashing.
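
To make the difference concrete, the two searches can be run on the arrays from the slide while counting comparisons. The function names and the step counters are assumptions added for illustration.

```python
def linear_search(a, target):
    """Return (index or -1, elements examined)."""
    steps = 0
    for i, value in enumerate(a):
        steps += 1
        if value == target:
            return i, steps
    return -1, steps


def binary_search(a, target):
    """Return (index or -1, elements examined); a must be sorted."""
    lo, hi, steps = 0, len(a) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if a[mid] == target:
            return mid, steps
        if a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps


unsorted_a = [8, 5, 12, 6, 15, 9, 4, 3, 7, 10]
sorted_a = [3, 4, 5, 6, 7, 8, 9, 10, 12, 15]
linear_result = linear_search(unsorted_a, 10)   # examines all 10 elements
binary_result = binary_search(sorted_a, 10)     # examines 2 elements
```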
Hash files
• Hashing provides a function h, called a hash function, which is applied to the hash field value of a record and yields the address of the disk block in which the record is stored.
• The base field is called the hash field, or, if it is also a key of the file, the hash key.
• Records in a hash file appear to be randomly distributed across the available space.
• Character strings are converted into integers before the function is applied, using some type of code such as alphabetic position or ASCII values.


Hash files
• Keys: 8, 3, 13, 6, 4, 10

Hashing O(1)
• Each element is placed at the index equal to its own value, so that 8 is stored at index 8

Index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
A:      -  -  -  3  4  -  6  -  8  - 10  -  - 13  -

• Space is wasted.
• A mathematical model (a hash function) is developed to improve this technique
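
A small Python sketch of this direct placement, using the keys from the slide. The table of 15 slots mirrors the index row 0 to 14; the variable names are assumptions.

```python
keys = [8, 3, 13, 6, 4, 10]
table = [None] * 15            # slots 0..14, as in the slide

for k in keys:
    table[k] = k               # h(x) = x: each key is stored at its own index

used = sum(slot is not None for slot in table)
wasted = len(table) - used     # slots that stay empty
```

Lookup is a single array access, but 9 of the 15 slots hold nothing, and the table would have to grow with the largest key rather than with the number of keys.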
Hashing
Hash Function
• A hash function maps the key space to the hash table; the simplest choice is h(x) = x.
• To avoid wasting space, the hash function is improved using different methods (hash algorithms):

a. Division modulus method

h(x) = x % 10, where 10 is the size of the hash table

• The remainder of the division becomes the disk address
• When the same address is generated for two or more records, a collision is said to have occurred
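
Applying the division modulus method to the keys used earlier (8, 3, 13, 6, 4, 10) shows a collision directly. The names TABLE_SIZE, addresses and collisions are assumptions made for this sketch.

```python
from collections import defaultdict

TABLE_SIZE = 10


def h(x):
    # Division modulus method: the remainder becomes the address.
    return x % TABLE_SIZE


keys = [8, 3, 13, 6, 4, 10]
addresses = {k: h(k) for k in keys}

# Group keys by address; any address with more than one key is a collision.
by_address = defaultdict(list)
for k in keys:
    by_address[h(k)].append(k)
collisions = {addr: ks for addr, ks in by_address.items() if len(ks) > 1}
```

Keys 3 and 13 both map to address 3, so one of them has to be placed elsewhere.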
Hash files
• When a collision occurs, the record must be inserted at another hash address.
• There are several techniques that can be used to manage collisions:

• Open addressing
• Chained overflow
• Multiple hashing
• Unchained overflow
Hash files
Open addressing

• If a collision occurs, the system performs a linear search to find the first available slot to insert a new record.
• When the last bucket has been searched, the system starts back at the first bucket.
• Searching for a record employs the same technique used to store a record, except that the record is considered not to exist when an unused slot is encountered before the record is located.
Hash files
Open Addressing
• In case of a collision, insert the record into the next available slot.

[Figure: hash table with slots 0 to 10]
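
Open addressing with a linear search for the next free slot can be sketched as below. This is an illustration under assumptions: the class name OpenAddressingTable is invented, the table has 10 slots to match h(x) = x % 10 from the earlier slide, and deletion is left out.

```python
class OpenAddressingTable:
    def __init__(self, size=10):
        self.slots = [None] * size   # None marks an unused slot

    def _probe(self, key):
        # Visit slots starting at the hash address, wrapping around at the end.
        start = key % len(self.slots)
        for i in range(len(self.slots)):
            yield (start + i) % len(self.slots)

    def insert(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is None or self.slots[idx] == key:
                self.slots[idx] = key
                return idx
        raise RuntimeError("hash table is full")

    def search(self, key):
        for idx in self._probe(key):
            if self.slots[idx] is None:
                return None          # unused slot reached: record does not exist
            if self.slots[idx] == key:
                return idx
        return None


table = OpenAddressingTable()
placed = {k: table.insert(k) for k in [8, 3, 13, 6, 4, 10, 9, 19]}
```

Key 13 collides with 3 and lands in slot 4, which in turn pushes key 4 to slot 5; key 19 wraps around past the last slot and ends up in slot 1.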
Hash files
Chained Overflow

• An overflow area is maintained for collisions that cannot be placed at the hash address.
• With the chained overflow technique, each bucket has an additional field, sometimes called a synonym pointer, that indicates whether a collision has occurred and, if so, points to the overflow page used.
• If the pointer is zero, then no collision has occurred.


Hash files
Chained Overflow
• In case of a collision, the record is put in an overflow region, and a pointer from the bucket points to that region.
• The pointers are traversed to locate the record
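
A minimal sketch of chained overflow in Python. The class name, the one-record buckets and the use of None (rather than zero, as on the slide) for "no synonym pointer" are assumptions made to keep the example short.

```python
TABLE_SIZE = 10


class ChainedOverflowTable:
    def __init__(self):
        # Each entry is [key, synonym_pointer]; the pointer is an index
        # into the overflow area, or None when no collision has occurred.
        self.buckets = [[None, None] for _ in range(TABLE_SIZE)]
        self.overflow = []

    def insert(self, key):
        node = self.buckets[key % TABLE_SIZE]
        if node[0] is None:
            node[0] = key
            return
        # Collision: store the record in the overflow area and link it
        # to the end of the bucket's chain.
        self.overflow.append([key, None])
        while node[1] is not None:
            node = self.overflow[node[1]]
        node[1] = len(self.overflow) - 1

    def search(self, key):
        node = self.buckets[key % TABLE_SIZE]
        if node[0] is None:
            return False
        while True:                    # traverse the synonym pointers
            if node[0] == key:
                return True
            if node[1] is None:
                return False
            node = self.overflow[node[1]]


table = ChainedOverflowTable()
for k in [3, 13, 23, 8]:
    table.insert(k)
```

Here 13 and 23 both collide with 3, so bucket 3 points to the overflow entry for 13, which points to the entry for 23.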
Hash files
Objectives of a hash function

• Minimize collisions
• Uniform distribution of hash values
• Easy to calculate
• Resolve any collisions


Indexing

• It is not sufficient simply to scatter the records that represent tuples of a relation among various blocks.

Example 1:

SELECT * FROM R;

• If the tuples of relation R are scattered across arbitrary blocks, we would have to examine every block in the storage system to find them.
• A better idea is to reserve some blocks, perhaps several whole cylinders, for relation R.
• Now, at least, we can find the tuples of R without scanning the entire data store.
Indexing

• Reserving some blocks, perhaps several whole cylinders, for relation R does not help much with queries that specify values for one or more attributes.

Example 2:

SELECT * FROM R WHERE a=10;

This query specifies a value for attribute a, so without further help every record of R has to be read and its value of a compared with 10.

• An index is any data structure that takes the value of one or more fields and finds the records with that value “quickly.”
• An index lets us find a record without having to look at more than a small fraction of all possible records.
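
The effect of an index on the query in Example 2 can be sketched in Python. The blocks of relation R, their contents and the function names scan and lookup are all made up for illustration; the index here is simply a mapping from each value of a to the blocks that contain it.

```python
from collections import defaultdict

# Relation R stored as blocks of (a, b) tuples; the values are invented.
blocks = [
    [(10, "x"), (20, "y")],
    [(30, "z"), (10, "w")],
    [(40, "v"), (50, "u")],
]


def scan(value):
    """Without an index: examine every block. Returns (tuples, blocks read)."""
    hits = [t for block in blocks for t in block if t[0] == value]
    return hits, len(blocks)


# Build the index on attribute a: value -> set of block numbers.
index = defaultdict(set)
for block_no, block in enumerate(blocks):
    for t in block:
        index[t[0]].add(block_no)


def lookup(value):
    """With the index: read only the blocks it points to."""
    block_nos = sorted(index.get(value, ()))
    hits = [t for b in block_nos for t in blocks[b] if t[0] == value]
    return hits, len(block_nos)
```

Both functions return the same tuples for a=10, but scan reads all three blocks while lookup reads two.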
Structure of an Index
An index file consists of a search key value and a block pointer address

• Search Key: contains a copy of the primary key or candidate key of a table
• Data Reference (block pointer): contains a set of pointers holding the addresses of disk blocks

Each index file associates values of the search key with pointers to data-file records that have that value for the attribute(s) of the search key.
Classification of an Index
Dense Index
An index entry is created for every search key value (that is, for every record in the data file).

Sparse Index
An index entry is created for only some of the search key values.
Classification of an Index
Dense Index
• A dense index is a sequence of blocks holding only the keys of the records and pointers to the records themselves. The dense index supports queries that ask for records with a given search key value.

Example:

Given key value K, we search the index blocks for K, and when we find it, we follow the associated pointer to the record with key K.

• It might appear that we need to examine every block of the index, or half the blocks of the index, on average, before we find K.
• However, there are several factors that make the index-based search more efficient than it seems.
Classification of an Index
Dense Index
• Factors that make index-based search more efficient include:

1. The number of index blocks is usually small compared with the number of data blocks.

2. Since keys are sorted, we can use binary search to find K. If there are n blocks of the index, we only look at log2 n of them.

3. The index may be small enough to be kept permanently in main memory buffers. If so, the search for key K involves only main-memory accesses, and there are no expensive disk I/Os to be performed.
Classification of an Index
Dense Index

• Since keys and pointers presumably take much less space than complete records, we expect to use many fewer blocks for the index file than for the data file itself.
• In the example, the first index block contains pointers to the first four records, the second block has pointers to the next four records, and so on.
Classification of an Index
Sparse Index

• A sparse index typically has only one key-pointer pair per block of the data file.
• It thus uses less space than a dense index, at the expense of somewhat more time to find a record given its key.
• Figure 14.3 shows a sparse index with one key-pointer per data block. The keys are for the first records on each data block.
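
The difference between the two kinds of index can be sketched over a small sorted data file. The data values and the names dense_lookup and sparse_lookup are assumptions made for illustration; both indexes are kept as sorted lists of (key, block number) pairs.

```python
import bisect

# Sorted data file: each block holds two keys (values are invented).
data_blocks = [[10, 20], [30, 40], [50, 60], [70, 80]]

# Dense index: one (key, block number) entry per record.
dense = [(k, b) for b, block in enumerate(data_blocks) for k in block]
# Sparse index: one entry per data block, keyed by the block's first record.
sparse = [(block[0], b) for b, block in enumerate(data_blocks)]


def dense_lookup(key):
    keys = [k for k, _ in dense]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return dense[i][1]
    return None                      # absence is known from the index alone


def sparse_lookup(key):
    keys = [k for k, _ in sparse]
    i = bisect.bisect_right(keys, key) - 1   # last entry with key <= target
    if i < 0:
        return None
    block_no = sparse[i][1]
    # The data block itself must be read to confirm the record exists.
    return block_no if key in data_blocks[block_no] else None
```

The sparse index has 4 entries instead of 8, but it can only answer "not found" after reading a data block.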


Types of Indexes
Single-level index
• Primary index (primary key + ordered data file)
• Secondary index (non-key or candidate key)
• Cluster index (non-key + ordered data file)

Multilevel index
• B-tree
• B+ tree
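Types of Indexes
Single-level index
Index and data communicate directly: each index entry points straight to a data block.

[Figure: an index with Search key and Pointer columns (entries 200 → BP1 and 201 → BP2) pointing to data blocks 200, 201 and 202]

Multilevel index
The index is broken down into several smaller indices.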
Single level indexes
Single-level index

Primary Indexing

• It is an index specified on the primary key (or ordering field) of an ordered file.
• There is one index entry for each block in the data file.
• Each index entry holds the value of the primary key field for the first record in a block and a pointer to that block.
• The first record in each block of the data file, for which the index entry is created, is known as the anchor record of the block (block anchor).

Primary index: a fixed-length index with 2 fields:
• Ordering field (primary key)
• Pointer (disk block address)

Number of index entries = number of disk blocks
Single-level Indexing
Primary Indexing

• The total number of entries in the index is equal to the total number of disk blocks in the data file.
• To access data, the number of block accesses required = log2 n + 1, where n = number of blocks occupied by the index file (log2 n accesses for the binary search of the index, plus 1 to read the data block).
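
The formula can be checked with a few lines of Python. Rounding log2 n up to a whole number of block accesses is an assumption made here, and the block counts (a 3000-block data file with a 45-block index) are illustrative numbers, not taken from the slides.

```python
import math


def binary_search_accesses(blocks):
    # Block accesses for a binary search over a sorted file of `blocks` blocks.
    return math.ceil(math.log2(blocks))


def primary_index_accesses(index_blocks):
    # Binary search of the index file, plus one access for the data block.
    return binary_search_accesses(index_blocks) + 1


without_index = binary_search_accesses(3000)   # binary search on the data file
with_index = primary_index_accesses(45)        # search the index, then 1 data block
```

Under these assumed sizes the index reduces the cost from 12 block accesses to 7.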


Single-level Indexing
Primary Indexing
 Example:

Single level Index:
Cluster Indexes
Single-level Index
Clustered Indexing
• If records are physically ordered on a non-key field, which does not have a distinct value for each record, that field is called the clustering field and the data file is called a clustered file.
• A clustered index is created to retrieve records that have the same value for the clustering field.
• A clustering index is an ordered file with two fields:

1. Clustering field of the data file
2. Disk block pointer
Single-level Index
Clustered Indexing

There is only one entry in the clustering index for each distinct value in the clustering field.

A clustered index contains the value and a pointer to the first block in the data file that has a record with that value for its clustering field.
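
A clustering index over a small file ordered on a non-key field can be sketched as follows. The department numbers, names and the function fetch are invented for illustration.

```python
# Data file ordered on a non-key clustering field (a department number).
data_blocks = [
    [(1, "Ann"), (1, "Bob")],
    [(1, "Cy"), (2, "Dee")],
    [(2, "Eve"), (3, "Fay")],
]

# One entry per distinct clustering value, pointing to the first block
# that contains a record with that value.
cluster_index = {}
for block_no, block in enumerate(data_blocks):
    for value, _ in block:
        cluster_index.setdefault(value, block_no)


def fetch(value):
    """Return all records with the given clustering value."""
    start = cluster_index.get(value)
    if start is None:
        return []
    result = []
    for block in data_blocks[start:]:
        matches = [record for record in block if record[0] == value]
        if not matches:
            break          # file is ordered, so no later block can match
        result.extend(matches)
    return result
```

Because the file is ordered on the clustering field, records with the same value sit in consecutive blocks and the scan can stop at the first block without a match.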
Single level indexes
Secondary Index
• Provides a secondary means of accessing data files for which a primary means already exists.
• Data file records could be ordered, unordered, or hashed.
• The secondary index could be created on a candidate key with unique values in every record or a non-key field with duplicate values.
• The secondary index is an ordered file with two fields:
1. Indexing field (same data type as some non-ordering field)
2. Block pointer or a record pointer

Many secondary indexes can be created for the same file.


Single level indexes
Secondary Index
• Here the data is not sorted by the search key.
• However, the keys in the index file are sorted.
• The result is that the pointers in one index block can go to many different data blocks, instead of one or a few consecutive blocks.
• For example, to retrieve all the records with search key 20, we not only have to look at two index blocks, but we are sent by their pointers to three different data blocks.
• Thus, using a secondary index may result in many more disk I/Os than if we get the same number of records via a primary index.
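
The scattering effect can be reproduced with a small unordered data file. The block contents and the function blocks_for are assumptions made for illustration; the example only mirrors the "three data blocks for key 20" situation, not the index blocks of the original figure.

```python
import bisect

# Unordered data file: search key values per block (values are invented).
data_blocks = [
    [20, 40],
    [10, 20],
    [50, 30],
    [20, 60],
]

# Secondary index: sorted (key, block number) entries, one per record.
entries = sorted(
    (key, block_no)
    for block_no, block in enumerate(data_blocks)
    for key in block
)


def blocks_for(key):
    """Data blocks that must be read to fetch all records with this key.

    Assumes integer keys, so that (key + 1,) bounds the range of entries.
    """
    lo = bisect.bisect_left(entries, (key,))
    hi = bisect.bisect_left(entries, (key + 1,))
    return [block_no for _, block_no in entries[lo:hi]]


blocks_to_read = blocks_for(20)
```

The index entries for key 20 are adjacent, but they point to blocks 0, 1 and 3, so three separate data blocks have to be read.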
Multilevel Indexes
• A multi-level index reduces the search space by the blocking factor of the index (number of index entries per block) at each step, an improvement over binary search, which reduces the search space by a factor of 2.
• The blocking factor is also known as the fan-out (fo) of the multi-level index.
• Searching a multi-level index requires approximately log_fo(bi) block accesses, where bi is the number of blocks in the first-level index.
• A multi-level index treats the index file, that is, the first level (or base level), as an ordered file with a distinct value for each key.


Multilevel Indexes
• Considering the first level as a sorted data file, a primary index is created for the first level. This index to the first level is called the second level of the multi-level index.
• Since the second level is a primary index, we can use block anchors such that the second level has one entry for each block of the first level.
• The process is repeated until all entries of some level t fit in a single block, called the top index.
• Each level reduces the number of entries by a factor of the index fan-out.
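
The construction described above can be sketched in Python. The fan-out of 3, the 27 first-level keys and the function names build_levels and search are assumptions chosen to keep the example small; pointers to data blocks are omitted, so the search only reports whether the key is present in the first level and how many index blocks were read.

```python
import bisect

FAN_OUT = 3  # index entries per block (assumed, kept tiny for illustration)


def build_levels(first_level_keys):
    """Return the index levels, base level first; each level is a list of blocks."""
    levels = []
    keys = list(first_level_keys)
    while True:
        blocks = [keys[i:i + FAN_OUT] for i in range(0, len(keys), FAN_OUT)]
        levels.append(blocks)
        if len(blocks) == 1:
            return levels                       # top index: a single block
        keys = [block[0] for block in blocks]   # block anchors form the next level


def search(levels, key):
    """Walk from the top index down; return (found, index blocks read)."""
    block_no, reads = 0, 0
    for depth in range(len(levels) - 1, -1, -1):
        block = levels[depth][block_no]
        reads += 1
        i = bisect.bisect_right(block, key) - 1  # last anchor <= key
        if i < 0:
            return False, reads
        if depth == 0:
            return block[i] == key, reads
        block_no = block_no * FAN_OUT + i        # block to read one level down


levels = build_levels(range(10, 280, 10))        # 27 keys: 10, 20, ..., 270
```

With 27 first-level entries and a fan-out of 3, the levels have 9, 3 and 1 blocks, so a search reads one block per level, three in total.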
Multi-level Index
Example:
