Query Processing and Indexing, Hashing

File Organization in DBMS

A database consists of a huge amount of data. In an RDBMS the data is grouped into tables, and each table holds related records. A user sees the data in the form of tables, but in actuality this huge amount of data is stored on physical storage in the form of files.
What is a File?
• A file is a named collection of related information that is recorded on secondary storage such as magnetic disks, magnetic tapes, and optical disks.
What is File Organization?
• File Organization refers to the logical relationships among various records that constitute
the file, particularly with respect to the means of identification and access to any specific
record.
• In simple terms, storing files in a certain order is called file organization.
• File Structure refers to the format of the label and data blocks and of any logical control
record.
The Objective of File Organization
• It enables faster selection of records, i.e., it speeds up record retrieval.
• Operations such as inserting, deleting, and updating records become faster and easier.
• It helps prevent the insertion of duplicate records.
Types of File Organizations
• Various methods have been introduced to organize files. Each method has advantages and disadvantages with respect to access and selection, so it is up to the programmer to choose the file organization method best suited to the requirements.
Some types of File Organizations are:
• Sequential File Organization
• Heap File Organization
• Hash File Organization
• B+ Tree File Organization
• Clustered File Organization
• ISAM (Indexed Sequential Access Method)

Sequential File Organization


• The easiest method of file organization is the sequential method. In this method, records are stored in the file one after another in a sequential manner. There are two ways to implement this method:
1. Pile File Method
• This method is quite simple: records are stored in sequence, one after the other, in the order in which they are inserted into the table.
• Insertion of a new record: let R1, R3, R5, and R4 be four records stored in insertion order (a record here is simply a row of a table). When a new record R2 has to be inserted, it is simply placed at the end of the file.
2. Sorted File Method
• In this method, as the name suggests, a new record is always inserted so that the file stays in sorted (ascending or descending) order. The sorting may be based on the primary key or any other key.
• Insertion of a new record: assume a pre-existing sorted sequence of four records R1, R3, R7, and R8. When a new record R2 has to be inserted, it is first placed at the end of the file and the sequence is then re-sorted.
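A minimal Python sketch of the two insertion strategies (illustrative only; the record values are the ones used above):

import bisect

# Pile file: new records are simply appended in arrival order.
pile = ["R1", "R3", "R5", "R4"]
pile.append("R2")                  # R2 goes at the end of the file
# pile -> ['R1', 'R3', 'R5', 'R4', 'R2']

# Sorted file: the file is kept ordered on the key, so the new
# record ends up at its correct sorted position.
sorted_file = ["R1", "R3", "R7", "R8"]
bisect.insort(sorted_file, "R2")   # keeps the sequence sorted
# sorted_file -> ['R1', 'R2', 'R3', 'R7', 'R8']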
Advantages of Sequential File Organization
• Fast and efficient for huge amounts of data; simple design.
• Files can easily be stored on magnetic tape, i.e., a cheaper storage mechanism.
Disadvantages of Sequential File Organization
• Time is wasted because we cannot jump directly to a required record; we have to move through the file sequentially.
• The sorted file method is inefficient because sorting the records takes extra time and space.
Query Processing and Indexing

• Query processing is the process of extracting data from a database. It includes translating high-level queries into low-level expressions that can be used at the physical level of the file system, optimizing the query, and actually executing it to get the result.
• It requires a basic understanding of relational algebra and file organization, and it covers the variety of tasks involved in getting data out of the database.
• Query processing involves several steps:
• Parsing and translation
• Optimization
• Evaluation
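For instance (a standard textbook illustration, not from the original slides), during parsing and translation the SQL query

SELECT salary FROM employee WHERE salary > 75000

can be translated into the relational-algebra expression

π_salary (σ_salary>75000 (employee))

or, equivalently, σ_salary>75000 (π_salary (employee)); the optimizer later chooses among such equivalent forms.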

Step 1: Parsing
During the parse call, the database performs the following checks: a syntax check, a semantic check, and a shared pool check, after converting the query into relational algebra.
First, the user's query, written in a high-level database language such as SQL, is translated into expressions that can be applied further at the physical level of the file system. The query is then evaluated along with a number of query-optimizing transformations. A computer system cannot process a human-readable query directly, so the query, for which SQL (Structured Query Language) is the best option for humans, must be converted into a form the system can work with.
The parser performs the following checks:
Syntax check: verifies the syntactic validity of the SQL statement.
Example:
SELECT * FORM employee
Here, the misspelling of FROM (written as FORM) is caught by this check.
Semantic check: determines whether the statement is meaningful.
Example: a query that references a table that does not exist is rejected by this check.
Shared pool check: every query is given a hash code during its execution, and this check determines whether that hash code already exists in the shared pool. If it does, the database skips the remaining optimization and execution setup and reuses the existing plan (a soft parse).
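A conceptual Python sketch of the shared-pool idea (a deliberate simplification; the function names are hypothetical and real database internals differ):

import hashlib

shared_pool = {}   # hash of SQL text -> previously built execution plan

def hard_parse_and_optimize(sql):
    # Placeholder for the expensive syntax/semantic checks and optimization.
    return ("PLAN", sql)

def parse(sql):
    key = hashlib.sha256(sql.encode()).hexdigest()
    if key in shared_pool:
        return shared_pool[key]          # soft parse: reuse the cached plan
    plan = hard_parse_and_optimize(sql)  # hard parse: build a new plan
    shared_pool[key] = plan              # cache it for future executions
    return plan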
Step 2: Optimization
• During the optimization stage, the database must perform a hard parse at least once for every unique DML statement, and it performs optimization during that parse. The database never optimizes DDL unless it includes a DML component, such as a subquery, that requires optimization. Optimization is the process in which multiple execution plans for satisfying a query are examined and the most efficient plan is selected for execution. The database catalog stores the execution plans, and the optimizer passes the lowest-cost plan on for execution.
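As a standard illustration of equivalent plans (not from the original slides): a query such as σ_dept_name='CS' (employee ⋈ department) can also be evaluated as employee ⋈ σ_dept_name='CS' (department), pushing the selection below the join so that far fewer tuples participate in the join. Both plans produce the same result, but their costs differ, and the optimizer picks the cheaper one.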
• Row Source Generation
• Row Source Generation is software that receives an optimal execution
plan from the optimizer and produces an iterative execution plan that
is usable by the rest of the database. The iterative plan is the binary
program that, when executed by the SQL engine, produces the result
set.
Step 3: Evaluation
• The query-execution engine takes the chosen evaluation plan, executes it against the stored data, and returns the result of the query to the user.
Indexing in DBMS

• A database index is a data structure that improves the speed of data access. A database index helps quickly locate data in the database without having to search every row. The process of creating an index for a database is known as indexing.
Real-life example of indexing
The first few pages of a book contain its index, which tells which topic is covered on which page number. This helps us quickly locate a topic in the book. Without the index, we would have to scan the entire book to look for the topic, which would take a long time.

• An index is defined by a field expression that we specify when we create the index.
• Typically, the field expression is a single field name, like EMP_ID.
• For example, an index created on the EMP_ID field contains a sorted list of the employee ID values in the table.
• A database driver can use indexes to find records quickly. An index on the EMP_ID field, for
example, greatly reduces the time that the driver spends searching for a particular employee ID
value.
Consider the following WHERE clause:
WHERE emp_id = 'E10001'
• Without an index, the driver must search the entire database table to find the record with an employee ID of E10001.
• By using an index on the EMP_ID field, however, the driver can find this record quickly.
• Indexing is used to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed.
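A minimal Python sketch of the difference (the table contents and names are made up for illustration):

# Without an index: a linear scan over every row.
table = [{"emp_id": "E10001", "name": "Asha"},
         {"emp_id": "E10002", "name": "Ravi"}]

def full_scan(emp_id):
    for row in table:                  # touches every row in the worst case
        if row["emp_id"] == emp_id:
            return row

# With an index: a structure maps each key to its row's location.
index = {row["emp_id"]: i for i, row in enumerate(table)}

def indexed_lookup(emp_id):
    return table[index[emp_id]]        # jumps straight to the record

# indexed_lookup("E10001") -> {'emp_id': 'E10001', 'name': 'Asha'}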
• Index structure: indexes are created on database columns and consist of two columns.

• The first column of the index is the search key: it contains a copy of the primary key or candidate key of the table. These values are stored in sorted order so that the corresponding data can be accessed easily.
• The second column of the index is the data reference. It contains a set of pointers holding the address of the disk block where the value of the particular key can be found.
• Indexing Methods:
Ordered indices: indices that are kept sorted to make searching faster are known as ordered indices.
• Example: suppose we have an employee table with thousands of records, each 10 bytes long, and the IDs start at 1, 2, 3, and so on. We have to search for the employee with ID 543.
• In a database with no index, we must scan the disk blocks from the start until we reach 543, so the DBMS reads 543 × 10 = 5430 bytes before finding the record.
• With an index whose entries are 2 bytes each, the DBMS finds the record after reading only 542 × 2 = 1084 bytes of index, which is far less than in the previous case.
Primary Index
• If the index is created on the basis of the primary key of the table, it is known as a primary index. Primary keys are unique to each record, so there is a 1:1 relation between index entries and records.
• As primary keys are stored in sorted order, the performance of the searching
operation is quite efficient.
• The primary index can be classified into two types: Dense index and Sparse index.
Dense index: a dense index contains an index record for every search-key value in the data file, which makes searching faster.
• The number of records in the index table is the same as the number of records in the main table.
• It needs more space to store the index records themselves. Each index record holds the search-key value and a pointer to the actual record on the disk.

• Sparse index: an index record appears for only some of the items in the data file, and each entry points to a block.
• Instead of pointing to every record in the main table, the index points to records in the main table at intervals; the block an entry points to is then scanned for the required record.
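A minimal Python sketch of a sparse-index lookup (the block boundaries are hypothetical):

import bisect

# Sparse index: one entry per block, holding the first search key in that block.
block_first_keys = [1, 101, 201, 301]
blocks = {0: "block-0", 1: "block-1", 2: "block-2", 3: "block-3"}

def locate(key):
    # Find the last index entry whose key is <= the search key,
    # then scan sequentially inside that block for the exact record.
    i = bisect.bisect_right(block_first_keys, key) - 1
    return blocks[i]

# locate(150) -> 'block-1' (101 is the largest indexed key <= 150)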
Clustering index: a clustered index is defined on an ordered data file. Sometimes the index is created on non-primary-key columns, which may not be unique for each record.
In this case, to identify the records faster, we group two or more columns to obtain a unique value and create an index on them. This method is called a clustering index. Records with similar characteristics are grouped together, and indexes are created for these groups.
Example: suppose a company has several employees in each department. If we use a clustering index, all employees who belong to the same Dept_ID are considered to be within a single cluster, and the index pointers point to the cluster as a whole.
Here Dept_Id is a non-unique key.
• This scheme is a little confusing when one disk block is shared by records that belong to different clusters. Using a separate disk block for each cluster is the better technique.
Secondary Index
• In sparse indexing, as the size of the table grows, the size of the mapping also grows. These mappings are usually kept in primary memory so that address fetches are fast; the actual data is then searched in secondary memory based on the address obtained from the mapping. If the mapping itself grows too large, fetching the address becomes slow, and the sparse index is no longer efficient.
In secondary indexing, another level of indexing is introduced to reduce the size of the mapping. First, a wide range of column values is selected so that the first-level mapping stays small; each range is then divided into smaller ranges. The first-level mapping is stored in primary memory so that address fetches are fast, while the second-level mapping and the actual data are stored in secondary memory (the hard disk).
For example, to find the record with roll number 111: search the first-level index for the highest entry that is smaller than or equal to 111; this gives 100. Then, in the second-level index for that range, again find the highest entry that is smaller than or equal to 111; this gives 110. Using the address 110, go to the data block and search each record sequentially until roll 111 is found. This is how a search is performed in this method; inserting, updating, and deleting are done in the same manner.
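A minimal Python sketch of this two-level lookup, using hypothetical range boundaries that reproduce the roll-111 example:

import bisect

# Outer (first-level) index, kept in primary memory.
level1 = [1, 100, 200]
# Inner (second-level) indices, kept on disk, mapping to data-block addresses.
level2 = {1:   [1, 10, 20, 30],
          100: [100, 110, 120, 130],
          200: [200, 210, 220, 230]}

def find_block(roll):
    # Largest first-level entry <= roll (here: 100 for roll 111).
    outer = level1[bisect.bisect_right(level1, roll) - 1]
    inner = level2[outer]
    # Largest second-level entry <= roll (here: 110 for roll 111).
    return inner[bisect.bisect_right(inner, roll) - 1]

# find_block(111) -> 110; the data block at address 110 is then
# scanned sequentially until the record with roll 111 is found.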
Hashing in DBMS

Hashing in DBMS is a technique to quickly locate a data record in a database, irrespective of the size of the database. For larger databases containing thousands or millions of records, the indexing technique becomes inefficient because searching for a specific record through the index consumes more time. The hashing technique is used to counter this problem.
What is Hashing?
• The hashing technique utilizes an auxiliary hash table to store the data
records using a hash function. There are 2 key components in hashing:
• Hash table: a hash table is an array-like data structure whose size is determined by the total volume of data records present in the database. Each memory location in a hash table is called a bucket (or hash index); it stores a data record's exact location and is accessed through the hash function. These buckets generally store a disk block, which in turn stores multiple records.
• Hash function: a hash function is a mathematical equation or algorithm that takes a data record's primary key as input and computes the hash index as output.
• It computes the index, or location, where the data record is to be stored in the hash table so that the record can be accessed efficiently later. The hash function is the most crucial component, as it determines the speed of fetching data.
Working of Hash Function
• The hash function generates a hash index through the primary key of the
data record.
Now, there are 2 possibilities:
1. The hash index generated isn’t already occupied by any other value. So,
the address of the data record will be stored here.
2. The hash index generated is already occupied by some other value. This is called a collision; to counter it, a collision resolution technique is applied.
3. Now, whenever we query a specific record, the hash function is applied and the data record is returned comparatively faster than with indexing, because we can reach the exact location of the record directly through the hash function rather than searching through index entries one by one.
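A minimal Python sketch of this working (the bucket count and the collision handler are illustrative placeholders):

def hash_index(primary_key, table_size):
    # Simple modulo hash: maps a numeric primary key to a bucket number.
    return primary_key % table_size

hash_table = [None] * 5       # 5 buckets; each would hold a disk-block address

def insert(primary_key, record_address):
    i = hash_index(primary_key, len(hash_table))
    if hash_table[i] is None:
        hash_table[i] = record_address        # case 1: the bucket is free
    else:
        resolve_collision(i, record_address)  # case 2: collision resolution

def resolve_collision(i, record_address):
    # Placeholder: chaining or open addressing would go here (see below).
    pass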

Types of Hashing in DBMS


• There are two primary hashing techniques in DBMS.
1. Static Hashing
• In static hashing, the number of buckets is fixed, and the hash function always generates the same bucket address for a given key.
For example, if we have a data record with employee_id = 106 and the hash function is h(x) = x mod 5, where x is the id, then the operation works like this:
• h(106) = 106 mod 5 = 1.
This indicates that the data record should be placed in, or searched for in, bucket 1.
• Example (static hashing): the primary key is used as the input to the hash function, and the hash function generates the hash index (the bucket's address), which contains the address of the actual data record on the disk block.
Static Hashing has the following Properties
Data buckets: the number of buckets in memory remains constant. The size of the hash table is decided initially; it may also implement chaining to handle some collision issues, though that is only a slight optimization and may not prove worthwhile if the database size keeps fluctuating.
Hash function: it uses the simplest hash function to map data records to their appropriate buckets, generally a modulo hash function.
Efficient for known data size: it is very efficient when we know the data size and its distribution in the database.
• It is inefficient and inaccurate when the data size varies dynamically, because space is limited and the hash function always generates the same value for every specific input. When the data size fluctuates often, it is not useful at all, because collisions keep happening and result in problems such as bucket skew and insufficient buckets.
• To resolve this problem of bucket overflow, techniques such as chaining and open addressing are used:
1. Chaining: chaining is a mechanism in which the hash table is implemented using an array of nodes, where each bucket is of node type and can hold a long chain (linked list) of data records. So, even if the hash function generates the same value for several data records, they can still be stored in the bucket by adding new nodes.
• However, this gives rise to the problem of bucket skew: if the hash function keeps generating the same value again and again, hashing becomes inefficient, as the remaining data buckets stay unoccupied or store minimal data.
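A minimal Python sketch of chaining (names are illustrative):

class Node:
    def __init__(self, key, value, nxt=None):
        self.key, self.value, self.next = key, value, nxt

buckets = [None] * 5                 # each bucket heads a linked list

def insert(key, value):
    i = key % len(buckets)
    # Prepend a new node; colliding keys simply extend the chain.
    buckets[i] = Node(key, value, buckets[i])

def search(key):
    node = buckets[key % len(buckets)]
    while node is not None:          # walk the chain in this bucket
        if node.key == key:
            return node.value
        node = node.next
    return None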
2. Open Addressing/Closed Hashing
• Also called closed hashing, this aims to solve the problem of collisions by looking for the next empty slot available that can store the data. It uses techniques such as linear probing, quadratic probing, and double hashing.
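A minimal Python sketch of linear probing (assumes the table is not full):

table = [None] * 5              # closed hashing: all records live in the table

def insert(key, value):
    i = key % len(table)
    while table[i] is not None: # slot taken: probe the next slot
        i = (i + 1) % len(table)
    table[i] = (key, value)

def search(key):
    i = key % len(table)
    while table[i] is not None:
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % len(table)
    return None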
2. Dynamic Hashing
• Dynamic hashing, also known as extendible hashing, is used to handle databases whose data sets change frequently. This method offers a way to add and remove data buckets on demand, dynamically.
Properties of Dynamic Hashing
• The buckets vary in size dynamically as changes are made, offering more flexibility.
• Dynamic Hashing aids in improving overall performance by minimizing or
completely preventing collisions.
It has the following major components: data buckets, a flexible hash function, and directories.
• A flexible hash function means that it generates more dynamic values and keeps changing periodically according to the requirements of the database.
• Directories are containers that store pointers to buckets. If problems such as bucket overflow or bucket skew occur, bucket splitting is done to maintain efficient retrieval of data records. Each directory has a directory id.
• Global depth: the number of bits in each directory id. A directory with global depth k has 2^k entries.
Working of Dynamic Hashing
Example: if the global depth is k = 2, keys are mapped to the hash index using the k bits starting from the LSB. That leaves us with the following 4 possibilities: 00, 01, 10, 11.

The k bits taken from the LSBs of a key's hash index map the key, through the directory IDs, to its appropriate bucket: the hash indices point to the directories, the k bits are matched against the directories' IDs, and the key is then placed in the corresponding bucket. Each bucket holds the values whose low-order bits, written in binary, equal its directory ID.
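A minimal Python sketch of this directory mapping for k = 2 (bucket splitting on overflow is omitted):

GLOBAL_DEPTH = 2                  # k = 2 -> directory ids 00, 01, 10, 11

directory = {"00": [], "01": [], "10": [], "11": []}

def bucket_id(key):
    # Take the k least-significant bits of the key's binary form.
    return format(key & ((1 << GLOBAL_DEPTH) - 1), f"0{GLOBAL_DEPTH}b")

def insert(key):
    directory[bucket_id(key)].append(key)

for k in (4, 5, 6, 7):
    insert(k)
# 4 -> '00', 5 -> '01', 6 -> '10', 7 -> '11'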
