L2.2-File Organization Techniques

File organization techniques include unordered files, ordered files, and hash files. Unordered files store records in the order they are inserted, making insertion efficient but searching inefficient. Ordered files store records in sorted order, enabling efficient searching but expensive insertion. Hash files map keys to addresses using a hash function, allowing constant-time access but requiring collision resolution methods. Indexing improves access by creating an index structure that maps attribute values to records.

Uploaded by

Edmunds Larry

File Organization Techniques

What You Will Learn

• Unordered files (heap files)
• Ordered files (sorted files)
• Hash files
Unordered files
• Also known as a heap file.
• Records are placed in the file in the order in which they are inserted, so new records are placed at the end of the file.

Inserting a record is efficient:
• Copy the last disk block into a buffer;
• Add the new record to it;
• Rewrite the block back to the disk.

Deleting a record:
• Find the corresponding block;
• Copy the block into a buffer;
• Delete the record from the buffer;
• Rewrite the block back to the disk.

Searching for a record is not efficient:
• It requires scanning the file record by record (linear search).

Deletion leaves unused space in the disk block, which wastes storage space. The file must be periodically reorganized to reclaim the unused space.
Ordered files
• Also known as a sequential file.
• Records in the file are sorted on the values of one or more fields, called the ordering field.

Example: a file sorted with name as the ordering field.

Ordered files: binary search

1. Retrieve the middle page of the file.
   • Check whether the required record lies between the first and the last records of the page.
   • If so, the required record, if it exists, is on this page and no more pages need to be retrieved.

2. If the value of the key field in the first record of the page is greater than the required value, the required value, if it exists, occurs on an earlier page.
   • Repeat the above steps using the lower half of the file as the new search area.

3. If the value of the key field in the last record of the page is less than the required value, the required value, if it exists, occurs on a later page.
   • Repeat the above steps using the upper half of the file as the new search area.

With binary search, half of the remaining search space is eliminated with each page retrieved.
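
The page-level binary search described above can be sketched as follows. This is a minimal illustration, assuming the file is modelled as a list of non-empty pages, each a sorted list of integer keys; the name `find_page` and the sample data are hypothetical.

```python
from typing import List, Optional

def find_page(pages: List[List[int]], key: int) -> Optional[int]:
    """Return the number of the page holding `key`, or None if it is absent."""
    low, high = 0, len(pages) - 1
    while low <= high:
        mid = (low + high) // 2            # retrieve the middle page
        page = pages[mid]
        if page[0] <= key <= page[-1]:     # key lies within this page's range
            return mid if key in page else None
        if key < page[0]:                  # continue with the earlier pages
            high = mid - 1
        else:                              # continue with the later pages
            low = mid + 1
    return None

# Hypothetical ordered file of four pages.
pages = [[3, 4, 5], [6, 7, 8], [9, 10, 12], [15, 18, 21]]
print(find_page(pages, 10))  # 2
print(find_page(pages, 11))  # None
```

Each loop iteration corresponds to one page retrieval, so the number of pages read grows with the logarithm of the file size.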
Ordered files: searching, insertion, and deletion

Searching for a record on the ordering field is efficient:
• Searching on a non-ordering field, however, is the same as in an unordered file.

Insertion can be expensive:
• Find the position at which to insert the record with key k;
• Make space for the record with key k;
• Put the record with key k in place.

Deletion can be done efficiently:
• Find the record;
• Delete it and mark the place;
• Periodically reorganize the file.
Ordered files: making insertion more efficient
Two methods:
• Keep some unused space in each block.
• Maintain a temporary unordered file, called an overflow or transaction file. Periodically, the overflow file is merged with the main file.
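
A rough sketch of the overflow-file method, assuming both files are modelled as in-memory lists of integer keys. The class name `OrderedFileWithOverflow` and its methods are hypothetical, chosen only for illustration.

```python
import bisect
import heapq

class OrderedFileWithOverflow:
    def __init__(self, sorted_keys):
        self.main = list(sorted_keys)   # ordered main file
        self.overflow = []              # unordered overflow (transaction) file

    def insert(self, key):
        # Cheap: append to the overflow file, no records are shifted.
        self.overflow.append(key)

    def contains(self, key):
        # Binary search in the main file, then a linear scan of the overflow file.
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return True
        return key in self.overflow

    def merge(self):
        # Periodic reorganization: merge the sorted overflow into the main file.
        self.main = list(heapq.merge(self.main, sorted(self.overflow)))
        self.overflow = []

f = OrderedFileWithOverflow([3, 5, 8])
f.insert(7)
f.insert(4)
print(f.contains(7))   # True (found in the overflow file)
f.merge()
print(f.main)          # [3, 4, 5, 7, 8]
```

The trade-off is that searches must also scan the overflow file until the next merge.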
Searching files
Linear search, O(n)
• Elements are placed in arbitrary order in an array.

Index:  0   1   2   3   4   5   6   7   8   9
A:      8   5   12  6   15  9   4   3   7   10

Binary search, O(log n)
• Elements are placed in sorted order in an array.

Index:  0   1   2   3   4   5   6   7   8   9
A:      3   4   5   6   7   8   9   10  12  15

Neither method runs in constant time, O(1); hashing is the technique that aims for it.
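
To make the difference concrete, the sketch below counts comparisons for both searches on the two arrays shown above. The function names and the comparison counters are additions for illustration only.

```python
def linear_search(a, target):
    """Scan left to right; return (index, comparisons made)."""
    comparisons = 0
    for i, value in enumerate(a):
        comparisons += 1
        if value == target:
            return i, comparisons
    return -1, comparisons

def binary_search(a, target):
    """Requires `a` to be sorted; return (index, comparisons made)."""
    low, high, comparisons = 0, len(a) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        comparisons += 1
        if a[mid] == target:
            return mid, comparisons
        if a[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, comparisons

unordered = [8, 5, 12, 6, 15, 9, 4, 3, 7, 10]
ordered = [3, 4, 5, 6, 7, 8, 9, 10, 12, 15]
print(linear_search(unordered, 10))  # (9, 10)
print(binary_search(ordered, 10))    # (7, 2)
```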
Hash files
• Hashing provides a function h, called a hash function, which is applied to the hash field value of a record and yields the address of the disk block in which the record is stored.
• The field to which h is applied is called the hash field or, if it is also a key of the file, the hash key.
• Records in a hash file appear to be randomly distributed across the available space.
• Character strings are converted into integers before the function is applied, using some type of code such as alphabetic position or ASCII values.

Hash files: direct addressing
• Keys: 8, 3, 13, 6, 4, 10

Hashing, O(1)
• Each element is placed at the index equal to its own key, so that key 8 is stored at index 8.

Index:  0  1  2  3  4  5  6  7  8  9  10 11 12 13 14
A:      -  -  -  3  4  -  6  -  8  -  10 -  -  13 -

• Space is wasted.
• A mathematical model (a hash function) is developed to improve this technique.
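
The wasted space can be seen in a small sketch of this direct-addressing scheme, using the keys from the slide and a table of 15 slots (indexes 0 to 14). The variable names are hypothetical.

```python
keys = [8, 3, 13, 6, 4, 10]
table = [None] * 15          # slots 0..14, as in the slide

for key in keys:
    table[key] = key         # h(x) = x: the key is its own address

used = sum(slot is not None for slot in table)
print(table[8])                    # 8, found in one step
print(used, len(table) - used)     # 6 used, 9 wasted
```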
Hashing: hash function
• A hash function maps the key space to the hash table. The simplest choice is h(x) = x.
• To avoid wasting space, the hash function is improved using different methods (hash algorithms):

a. Division (modulus) method

h(x) = x % 10, where 10 is the size of the hash table

• The remainder of the division becomes the disk address.
• When the same address is generated for two or more records, a collision is said to have occurred.
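
As a small worked example, the sketch below applies the division method to the keys used earlier in the slides (8, 3, 13, 6, 4, 10) and records any collisions. The names `TABLE_SIZE`, `h`, and `collisions` are illustrative.

```python
TABLE_SIZE = 10

def h(x: int) -> int:
    """Division (modulus) method: the remainder is the address."""
    return x % TABLE_SIZE

keys = [8, 3, 13, 6, 4, 10]
table = [None] * TABLE_SIZE
collisions = []                      # (key, address) pairs that could not be placed

for key in keys:
    address = h(key)
    if table[address] is None:
        table[address] = key
    else:
        collisions.append((key, address))

print(table)        # [10, None, None, 3, 4, None, 6, None, 8, None]
print(collisions)   # [(13, 3)] because 13 % 10 == 3 % 10
```

The table now needs only 10 slots instead of 15, at the price of one collision that still has to be resolved.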
Hash files: collisions
• When a collision occurs, the record must be inserted at another hash address.
• Several techniques can be used to manage collisions:
   • Open addressing
   • Chained overflow
   • Multiple hashing
   • Unchained overflow
Hash files: open addressing
• If a collision occurs, the system performs a linear search to find the first available slot in which to insert the new record.
• When the last bucket has been searched, the system starts back at the first bucket.
• Searching for a record employs the same technique used to store a record, except that the record is considered not to exist when an unused slot is encountered before the record is located.

[Figure: hash table with slots numbered 0 to 10]
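
A minimal sketch of open addressing with linear probing, assuming integer keys, one record per slot, and the division method as the hash function. The class `OpenAddressingTable` is hypothetical and omits deletion.

```python
class OpenAddressingTable:
    def __init__(self, size=10):
        self.size = size
        self.slots = [None] * size

    def _probe(self, key):
        # Visit slots starting at the hash address, wrapping around to slot 0.
        start = key % self.size
        for step in range(self.size):
            yield (start + step) % self.size

    def insert(self, key):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return i
        raise RuntimeError("hash table is full")

    def search(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:   # unused slot reached: record does not exist
                return None
            if self.slots[i] == key:
                return i
        return None

t = OpenAddressingTable()
for key in [8, 3, 13, 6, 4, 10]:
    t.insert(key)
print(t.slots)        # [10, None, None, 3, 13, 4, 6, None, 8, None]
print(t.search(13))   # 4
print(t.search(23))   # None
```

Note how key 13 displaces key 4 from its own address, which is a known weakness of linear probing.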
Hash files: chained overflow
• An overflow area is maintained for collisions that cannot be placed at the hash address.
• With the chained overflow technique, each bucket has an additional field, sometimes called a synonym pointer, that indicates whether a collision has occurred and, if so, points to the overflow page used.
• If the pointer is zero, no collision has occurred.
• In case of a collision, the record is put in the overflow area and a pointer is set to point to it.
• The pointers are traversed to locate the record.
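
The chained overflow technique might be sketched as below. This assumes one record per bucket and uses `None` rather than zero to mean "no pointer", since list index 0 is a valid overflow position in Python. The class and attribute names are hypothetical.

```python
class ChainedOverflowTable:
    def __init__(self, size=10):
        self.size = size
        self.primary = [None] * size   # one record per bucket
        self.pointer = [None] * size   # synonym pointer into the overflow area
        self.overflow = []             # entries are [key, next_overflow_index]

    def insert(self, key):
        bucket = key % self.size
        if self.primary[bucket] is None:
            self.primary[bucket] = key
            return
        # Collision: place the record in the overflow area and link it in.
        self.overflow.append([key, None])
        new_index = len(self.overflow) - 1
        if self.pointer[bucket] is None:
            self.pointer[bucket] = new_index
        else:
            i = self.pointer[bucket]
            while self.overflow[i][1] is not None:
                i = self.overflow[i][1]
            self.overflow[i][1] = new_index

    def search(self, key):
        bucket = key % self.size
        if self.primary[bucket] == key:
            return True
        i = self.pointer[bucket]       # traverse the chain of pointers
        while i is not None:
            if self.overflow[i][0] == key:
                return True
            i = self.overflow[i][1]
        return False

c = ChainedOverflowTable()
for key in [3, 13, 23, 8]:
    c.insert(key)
print(c.search(23))    # True
print(c.search(33))    # False
```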
Hash files: objectives of a hash function
• Minimize collisions
• Distribute hash values uniformly
• Be easy to calculate
• Resolve any collisions

Indexing
• It is not sufficient simply to scatter the records that represent tuples of a relation among various blocks.

Example 1:

SELECT * FROM R;

• If the tuples of relation R are placed in different blocks, we would have to examine every block in the storage system to find them.
• A better idea is to reserve some blocks, perhaps several whole cylinders, for relation R.
• Now, at least, we can find the tuples of R without scanning the entire data store.
Indexing: queries on attribute values
• Reserving some blocks, perhaps several whole cylinders, for relation R may not work well with queries that specify values for one or more attributes.

Example 2:

SELECT * FROM R WHERE a=10;

This query specifies a value for attribute a, so without further support the a-value of every record of R has to be compared with 10.

• An index is any data structure that takes the value of one or more fields and finds the records with that value “quickly.”
• An index lets us find a record without having to look at more than a small fraction of all possible records.
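
As a rough in-memory analogy for Example 2, the sketch below compares a full scan with a lookup through an index on attribute a. The sample records, the `id` field, and the function names are hypothetical; a real index lives in disk blocks rather than in a Python dictionary.

```python
from collections import defaultdict

# Hypothetical tuples of relation R.
records = [
    {"id": 1, "a": 10},
    {"id": 2, "a": 7},
    {"id": 3, "a": 10},
    {"id": 4, "a": 3},
]

def scan(value):
    """Without an index: compare every record's a-value."""
    return [r for r in records if r["a"] == value]

# Build an index that maps each value of a to the positions of matching records.
index_on_a = defaultdict(list)
for position, record in enumerate(records):
    index_on_a[record["a"]].append(position)

def lookup(value):
    """With an index: go straight to the matching records."""
    return [records[p] for p in index_on_a.get(value, [])]

print([r["id"] for r in lookup(10)])   # [1, 3]
```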
Structure of an Index
Each entry of an index file consists of a search key value and a block pointer (address):

• Search key: contains a copy of the primary key or a candidate key of a table.
• Data reference (block pointer): contains a set of pointers holding the addresses of disk blocks.

Each index file associates values of the search key with pointers to the data-file records that have that value for the attribute(s) of the search key.
Classification of an Index
Dense index
An index entry is created for every search key value (that is, for every record in the data file).

Sparse index
An index entry is created for only some of the search key values.
Dense index
• A dense index is a sequence of blocks holding only the keys of the records and pointers to the records themselves. It supports queries that ask for records with a given search key value.

Example:

Given a key value K, we search the index blocks for K and, when we find it, follow the associated pointer to the record with key K.

• It might appear that we need to examine every block of the index, or half the blocks on average, before we find K.
• However, several factors make the index-based search more efficient than it seems.
Dense index: factors that make index-based search more efficient

1. The number of index blocks is usually small compared with the number of data blocks.

2. Since the keys are sorted, we can use binary search to find K. If the index has n blocks, we look at only about log2 n of them.

3. The index may be small enough to be kept permanently in main-memory buffers. If so, the search for key K involves only main-memory accesses, and no expensive disk I/Os need to be performed.
Dense index: space
• Since keys and pointers presumably take much less space than complete records, we expect to use many fewer blocks for the index file than for the data file itself.
• In the example, the first index block contains pointers to the first four records, the second block has pointers to the next four records, and so on.
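
A dense index lookup can be sketched with Python's `bisect` module, assuming the index is held in memory as a sorted list of keys with a parallel list of record pointers (here, list positions). The data and names are hypothetical.

```python
import bisect

# Hypothetical data file: (key, payload) records.
data_file = [(10, "A"), (20, "B"), (30, "C"), (40, "D"), (50, "E"), (60, "F")]

# Dense index: one key-pointer pair for every record in the data file.
dense_keys = [key for key, _ in data_file]
dense_pointers = list(range(len(data_file)))

def dense_lookup(key):
    """Binary-search the index for `key`, then follow the pointer to the record."""
    i = bisect.bisect_left(dense_keys, key)
    if i < len(dense_keys) and dense_keys[i] == key:
        return data_file[dense_pointers[i]]
    return None   # the index alone proves the key is absent

print(dense_lookup(40))   # (40, 'D')
print(dense_lookup(45))   # None
```

Because every key appears in the index, an unsuccessful search never needs to touch the data file.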
Sparse index
• A sparse index typically has only one key-pointer pair per block of the data file.
• It thus uses less space than a dense index, at the expense of somewhat more time to find a record given its key.
• Figure 14.3 shows a sparse index with one key-pointer pair per data block. The keys are those of the first record on each data block.
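
A sparse index lookup might look like the following sketch, assuming the data file is a list of sorted blocks and the index holds the first key of each block. The data and the name `sparse_lookup` are hypothetical.

```python
import bisect

# Hypothetical data file: three blocks of two keys each, sorted on the key.
data_blocks = [[10, 20], [30, 40], [50, 60]]

# Sparse index: the first key of each data block; the pointer is the block number.
sparse_keys = [block[0] for block in data_blocks]

def sparse_lookup(key):
    """Return the number of the block holding `key`, or None if it is absent."""
    # Find the last index entry whose key is <= the search key.
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None                      # smaller than every key in the file
    # The block itself must be read and searched to settle the question.
    return i if key in data_blocks[i] else None

print(sparse_lookup(40))   # 1
print(sparse_lookup(35))   # None
```

Unlike the dense case, the data block has to be read even for keys that turn out to be absent, which is the extra time mentioned above.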

Types of Indexes
Single-level index
• Primary index (primary key + ordered data file)
• Secondary index (nonkey field or candidate key)
• Clustering index (nonkey field + ordered data file)

Multilevel index
• B-tree
• B+ tree
Types of Indexes: single-level and multilevel
Single-level index
The index and the data communicate directly.

[Figure: an index with columns Search key and Pointer (entries 200 with BP1, 201 with BP2) pointing to data blocks 200, 201, and 202]

Multilevel index
The index is broken down into several indices.
Single-level index: primary indexing
• A primary index is specified on the primary key (or ordering field) of an ordered file.
• There is one index entry for each block in the data file.
• Each index entry holds the value of the primary key field for the first record in a block and a pointer to that block.
• The first record in each block of the data file, for which the index entry is created, is known as the anchor record of the block (block anchor).

A primary index is a fixed-length index with two fields:
• Ordering field (primary key)
• Pointer (disk block address)

Number of index entries = number of disk blocks in the data file
Primary indexing: cost
• The total number of entries in the index equals the total number of disk blocks in the data file.
• To access a record, the number of block accesses required = log2 n + 1, where n is the number of blocks occupied by the index file: about log2 n accesses for a binary search of the index, plus one access to fetch the data block.
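
A worked instance of this formula, using made-up file parameters (30,000 records of 100 bytes, 1,024-byte blocks, 15-byte index entries). All numbers are assumptions chosen for illustration, and log2 n is rounded up to a whole number of accesses.

```python
import math

def primary_index_accesses(index_blocks: int) -> int:
    """Binary search over the index blocks, plus one access for the data block."""
    return math.ceil(math.log2(index_blocks)) + 1

# Hypothetical file parameters.
records, block_size, record_size, entry_size = 30_000, 1024, 100, 15

records_per_block = block_size // record_size               # 10
data_blocks = math.ceil(records / records_per_block)        # 3000 blocks, one index entry each
entries_per_block = block_size // entry_size                # 68
index_blocks = math.ceil(data_blocks / entries_per_block)   # 45

print(primary_index_accesses(index_blocks))   # 7 block accesses with the index
print(math.ceil(math.log2(data_blocks)))      # 12 for binary search on the data file itself
```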

Single-level index: clustering index
• If records are physically ordered on a nonkey field, which does not have a distinct value for each record, that field is called the clustering field and the data file is called a clustered file.
• A clustering index is created to retrieve records that have the same value for the clustering field.
• A clustering index is an ordered file with two fields:
   1. Clustering field of the data file
   2. Disk block pointer
Clustering index: entries
• There is only one entry in the clustering index for each distinct value of the clustering field.
• Each entry contains the value and a pointer to the first block in the data file that has a record with that value for its clustering field.
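
Building such an index can be sketched as follows, assuming the clustered file is a list of blocks already ordered on the clustering field. The sample values and the name `build_clustering_index` are hypothetical.

```python
def build_clustering_index(blocks):
    """Map each distinct clustering value to the first block that contains it."""
    index = {}
    for block_no, block in enumerate(blocks):
        for value in block:
            index.setdefault(value, block_no)   # keep only the first occurrence
    return index

# Hypothetical clustered file: values of the clustering field, three per block.
clustered_file = [[1, 1, 2], [2, 2, 3], [3, 4, 4]]
clustering_index = build_clustering_index(clustered_file)
print(clustering_index)   # {1: 0, 2: 0, 3: 1, 4: 2}
```

To fetch all records with value 2, a reader would start at block 0 and continue into the following blocks until a larger value appears.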
Single-level index: secondary index
• Provides a secondary means of accessing a data file for which a primary means of access already exists.
• The data file records can be ordered, unordered, or hashed.
• The secondary index can be created on a candidate key, which has a unique value in every record, or on a nonkey field with duplicate values.
• The secondary index is an ordered file with two fields:
   1. Indexing field (same data type as some non-ordering field of the data file)
   2. Block pointer or record pointer

Many secondary indexes can be created for the same file.

Secondary index: access cost
• Here the data is not sorted by the search key.
• However, the keys in the index file are sorted.
• The result is that the pointers in one index block can go to many different data blocks, instead of one or a few consecutive blocks.
• For example, to retrieve all the records with search key 20, we not only have to look at two index blocks, but we are also sent by their pointers to three different data blocks.
• Thus, using a secondary index may result in many more disk I/Os than retrieving the same number of records via a primary index.
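
The scattering effect can be illustrated with a small sketch, assuming an unordered data file of five blocks and a secondary index stored as a sorted list of (value, block number) pairs. The data is invented so that search key 20 lands in three different blocks.

```python
import bisect

# Hypothetical data file, not sorted on the indexed field.
data_blocks = [[20, 40], [10, 20], [50, 30], [10, 50], [60, 20]]

# Secondary index: sorted (value, block number) pairs, one per record.
secondary_index = sorted(
    (value, block_no)
    for block_no, block in enumerate(data_blocks)
    for value in block
)

def blocks_for(value):
    """Return the data blocks that must be read to fetch all records with `value`."""
    i = bisect.bisect_left(secondary_index, (value, -1))
    result = []
    while i < len(secondary_index) and secondary_index[i][0] == value:
        result.append(secondary_index[i][1])
        i += 1
    return result

print(blocks_for(20))   # [0, 1, 4]: three separate data blocks for three records
```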
Multilevel Indexes
• A multilevel index reduces the search space at each step by the blocking factor of the index (the number of index entries per block). This improves on binary search, which reduces the search space by a factor of 2.
• The blocking factor is also known as the fan-out (fo) of the multilevel index.
• Searching a multilevel index requires approximately log_fo(bi) block accesses, where bi is the number of blocks in the first-level index.
• A multilevel index treats the index file, that is, the first level (or base level), as an ordered file with a distinct value for each key.

Multilevel Indexes: building the levels
• Treating the first level as a sorted data file, a primary index is created for it. This index to the first level is called the second level of the multilevel index.
• Since the second level is a primary index, we can use block anchors, so that the second level has one entry for each block of the first level.
• The process is repeated until all entries of some level t fit in a single block, called the top index.
• Each level reduces the number of entries by a factor of the index fan-out.
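
The level-by-level construction can be sketched as below, assuming each index block holds exactly `fan_out` entries and that a level is represented simply by its list of keys. The function name `build_levels` and the sample sizes are hypothetical.

```python
def build_levels(first_level_keys, fan_out):
    """Return the index levels, from the first (base) level up to the top index."""
    levels = [list(first_level_keys)]
    # Keep adding levels until the newest one fits in a single block.
    while len(levels[-1]) > fan_out:
        previous = levels[-1]
        # One entry per block of the previous level: its block anchor.
        levels.append([previous[i] for i in range(0, len(previous), fan_out)])
    return levels

levels = build_levels(range(64), fan_out=4)
print([len(level) for level in levels])   # [64, 16, 4]
print(len(levels))                        # 3 levels, so 3 index block accesses per search
```

Each level is smaller than the one below it by a factor of the fan-out, which is where the logarithm to base fo comes from.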
