Query Processing and Indexing, Hashing
A database holds a huge amount of data. In an RDBMS the data is grouped into tables, and each table contains related records. To the user the data appears to be stored in the form of tables, but in actuality this data is kept on physical storage in the form of files.
What is a File?
• A file is a named collection of related information that is recorded on secondary storage
such as magnetic disks, magnetic tapes, and optical disks.
What is File Organization?
• File Organization refers to the logical relationships among various records that constitute
the file, particularly with respect to the means of identification and access to any specific
record.
• In simple terms, storing the records of a file in a certain order is called file organization.
• File Structure refers to the format of the label and data blocks and of any logical control
record.
The Objective of File Organization
• It helps in the faster selection of records, i.e., it speeds up record access.
• Operations such as inserting, deleting, and updating records become faster and easier.
• It helps prevent duplicate records from being inserted via various operations.
Types of File Organizations
• Various methods have been introduced to organize files. Each method has advantages and disadvantages with respect to access and selection, so it is up to the programmer to choose the file organization method best suited to the requirements.
Some types of File Organizations are:
• Sequential File Organization
• Heap File Organization
• Hash File Organization
• B+ Tree File Organization
• Clustered File Organization
• ISAM (Indexed Sequential Access Method)
Step-1: Parsing
During the parse call, the database performs the following checks: syntax check, semantic check, and shared pool check, after converting the query into relational algebra.
Users write queries in a high-level database language such as SQL because it is readable and understandable by humans. The system first translates such a query into expressions, typically relational algebra, that can be applied further at the file system's physical level. Following this, the query is actually evaluated, along with a number of query-optimizing transformations.
The parser performs the following checks:
Syntax check:
checks the syntactic validity of the SQL statement.
Example:
SELECT * FORM employee
Here, the misspelling of the keyword FROM is caught by this check.
Semantic check:
determines whether the statement is meaningful.
Example: a query that refers to a table name that does not exist is rejected by this check.
Shared pool check:
Every query is assigned a hash code during its execution. This check determines whether that hash code already exists in the shared pool; if it does, the database skips the remaining steps of hard parsing and reuses the existing parsed representation (a soft parse).
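To make the idea concrete, here is a rough Python sketch (not a real DBMS API; the shared_pool dictionary and the placeholder plan are invented for illustration) of how a hash of the SQL text can decide between a soft parse and a hard parse:

import hashlib

# Hypothetical in-memory "shared pool": hash of SQL text -> parsed representation.
shared_pool = {}

def parse(sql_text):
    """Return a parsed representation, reusing the shared pool when possible."""
    key = hashlib.sha256(sql_text.encode()).hexdigest()
    if key in shared_pool:
        # Soft parse: the statement was seen before, so syntax/semantic
        # re-checking and optimization are skipped and the cached entry reused.
        return shared_pool[key]
    # Hard parse: perform syntax check, semantic check and optimization
    # (represented here only by a placeholder), then cache the result.
    parsed = {"sql": sql_text, "plan": "<optimized plan>"}
    shared_pool[key] = parsed
    return parsed

plan1 = parse("SELECT * FROM employee WHERE emp_id = 'E10001'")  # hard parse
plan2 = parse("SELECT * FROM employee WHERE emp_id = 'E10001'")  # soft parse (reused)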
Step-2: Optimization
• During the optimization stage, the database must perform a hard parse at least once for every unique DML statement and performs optimization during this parse. The database never optimizes DDL unless it includes a DML component, such as a subquery, that requires optimization. Optimization is the process in which multiple execution plans for satisfying a query are examined and the most efficient plan is selected for execution. The database catalog stores the execution plans, and the optimizer passes the lowest-cost plan on for execution (a small sketch of this choice follows after this list).
• Row Source Generation
• The row source generator is software that receives the optimal execution plan from the optimizer and produces an iterative execution plan that is usable by the rest of the database. The iterative plan is a binary program that, when executed by the SQL engine, produces the result set.
• Step-3: Evaluation
• In the evaluation step, the SQL engine executes the plan produced by the row source generator and returns the result set to the user.
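The cost-based choice made during optimization can be illustrated with a small Python sketch; the candidate plans and their cost numbers below are made up purely for illustration:

# Hypothetical candidate execution plans with estimated costs (invented numbers).
candidate_plans = [
    {"name": "full table scan", "estimated_cost": 5430},
    {"name": "index range scan on EMP_ID", "estimated_cost": 12},
    {"name": "hash join then filter", "estimated_cost": 480},
]

# The optimizer examines every candidate plan and passes the lowest-cost
# plan on for execution.
best_plan = min(candidate_plans, key=lambda p: p["estimated_cost"])
print("chosen plan:", best_plan["name"])  # -> index range scan on EMP_ID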
Indexing in DBMS
• A database index is a data structure that helps improve the speed of data access. The index helps quickly locate data in the database without having to search every row. The process of creating an index for a database is known as indexing.
Real life example of Indexing
The first few pages of a book contain the index of the book, which tells which topic is covered at which page number. This helps us quickly locate a topic in the book using the index. Without the index, we would have to scan the entire book to find the topic, which would take a long time.
• An index is defined by a field expression that we specify when we create the index.
• Typically, the field expression is a single field name, like EMP_ID.
• For example, an index created on the EMP_ID field contains a sorted list of the employee ID values in the table.
• A database driver can use indexes to find records quickly. An index on the EMP_ID field, for
example, greatly reduces the time that the driver spends searching for a particular employee ID
value.
Consider the following Where clause:
WHERE emp_id = 'E10001'
• Without an index, the driver must search the entire database table to find the record with an employee ID of E10001.
• By using an index on the EMP_ID field, however, the driver can quickly find this record.
• Indexing is used to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed.
• Index structure: Indexes can be created using one or more database columns.
• The first column of the index is the search key; it contains a copy of the primary key or candidate key of the table. These values are stored in sorted order so that the corresponding data can be accessed easily.
• The second column of the index is the data reference. It contains a set of pointers holding the address of the disk block where the value of the particular key can be found.
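A minimal Python sketch of this two-column structure, sorted search keys plus data references, applied to the earlier WHERE emp_id = 'E10001' lookup; the table contents and block addresses are hypothetical:

import bisect

# Hypothetical data file: each "block address" maps to a stored record.
data_blocks = {
    0x10: {"emp_id": "E10001", "name": "Asha"},
    0x20: {"emp_id": "E10002", "name": "Ravi"},
    0x30: {"emp_id": "E10003", "name": "Meena"},
}

# Index: first column = sorted search keys, second column = data references
# (pointers to the disk blocks holding the records).
search_keys = ["E10001", "E10002", "E10003"]
data_refs   = [0x10, 0x20, 0x30]

def lookup(emp_id):
    """Find a record via the index instead of scanning every row."""
    i = bisect.bisect_left(search_keys, emp_id)
    if i < len(search_keys) and search_keys[i] == emp_id:
        return data_blocks[data_refs[i]]
    return None

print(lookup("E10001"))  # satisfies: WHERE emp_id = 'E10001'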
• Indexing Methods:
Ordered indices: The indices are usually sorted to make searching faster.
• The indices which are sorted are known as ordered indices.
• Example: Suppose we have an employee table with thousands of records, each of which is 10 bytes long.
• Suppose their IDs start from 1, 2, 3... and so on and we have to search for the employee with ID 543.
• In the case of a database with no index, we have to search the disk blocks from the start until we reach 543. The DBMS will find the record after reading 543*10 = 5430 bytes.
• In the case of an index (assuming each index entry is 2 bytes long), the DBMS will find the record after reading 542*2 = 1084 bytes, which is far less than in the previous case.
Primary Index
• If the index is created on the basis of the primary key of the table, then it is known as primary indexing. Primary keys are unique to each record and have a 1:1 relation with the records.
• As primary keys are stored in sorted order, the performance of the searching
operation is quite efficient.
• The primary index can be classified into two types: Dense index and Sparse index.
Dense index: The dense index contains an index record for every search-key value in the data file. It makes searching faster.
• In this case, the number of records in the index table is the same as the number of records in the main table.
• It needs more space to store the index records themselves. Each index record contains the search-key value and a pointer to the actual record on the disk.
• Sparse index: In the sparse index, an index record appears for only a few items in the data file. Each index entry points to a block.
• Instead of pointing to each record in the main table, the index points to records in the main table at intervals (a sketch comparing the two follows below).
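A small Python sketch comparing the two, assuming a hypothetical sorted data file split into blocks of three records each:

# Hypothetical sorted data file split into blocks of 3 records each.
records = [(i, f"employee-{i}") for i in range(1, 13)]   # keys 1..12
blocks = [records[i:i + 3] for i in range(0, len(records), 3)]

# Dense index: one entry per search-key value (key -> block number).
dense_index = {key: b for b, block in enumerate(blocks) for key, _ in block}

# Sparse index: one entry per block, holding only the first key of that block.
sparse_index = [(block[0][0], b) for b, block in enumerate(blocks)]

def sparse_lookup(key):
    """Find the block via the sparse index, then scan inside that block."""
    target = 0
    for first_key, b in sparse_index:
        if first_key <= key:
            target = b
        else:
            break
    for k, value in blocks[target]:
        if k == key:
            return value
    return None

print(dense_index[8])      # dense index points directly to the block
print(sparse_lookup(8))    # sparse index finds the block, then scans it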
Clustering Index: A clustered index can be defined as an ordered data file. Sometimes the index is created on non-primary-key columns, which may not be unique for each record.
In this case, to identify the records faster, we group two or more columns to obtain a unique value and create an index out of them. This method is called a clustering index.
Records that have similar characteristics are grouped together, and indexes are created for these groups.
Example: suppose a company has several employees in each department. If we use a clustering index, all employees who belong to the same Dept_ID are considered to be within a single cluster, and the index pointers point to the cluster as a whole.
Here Dept_ID is a non-unique key.
• This scheme can be a little confusing because one disk block may be shared by records that belong to different clusters. Using a separate disk block for each cluster is considered the better technique.
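A rough Python sketch of a clustering index on the non-unique Dept_ID column; the employee records are invented, and each cluster is modelled simply as the group of records reached through one index pointer:

from collections import defaultdict

# Hypothetical employee records; Dept_ID is a non-unique key.
employees = [
    {"emp_id": "E1", "dept_id": "D10", "name": "Asha"},
    {"emp_id": "E2", "dept_id": "D20", "name": "Ravi"},
    {"emp_id": "E3", "dept_id": "D10", "name": "Meena"},
    {"emp_id": "E4", "dept_id": "D20", "name": "Kiran"},
]

# Clustering index: one pointer per Dept_ID, pointing to the whole cluster
# of records that share that Dept_ID.
clusters = defaultdict(list)
for rec in employees:
    clusters[rec["dept_id"]].append(rec)

# All employees of department D10 are reached through a single index entry.
print(clusters["D10"])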
Secondary Index
• In sparse indexing, as the size of the table grows, the size of the mapping also grows. These mappings are usually kept in primary memory so that address fetches are fast; the actual data is then searched in secondary memory using the address obtained from the mapping. If the mapping itself grows too large, fetching the address becomes slow, and in that case the sparse index is no longer efficient.
In secondary indexing, to reduce the size of the mapping, another level of indexing is introduced. In this method, a large range of column values is selected initially so that the first-level mapping stays small; each range is then divided into smaller ranges. The first-level mapping is stored in primary memory, so that address fetches are fast. The second-level mapping and the actual data are stored in secondary memory (hard disk).
For example:
If we want to find the record with roll number 111, the search first looks in the first-level index for the highest entry that is smaller than or equal to 111 and gets 100 at this level. Then, in the second-level index, it again finds the highest entry less than or equal to 111 and gets 110. Using the address associated with 110, it goes to the data block and scans each record until it finds 111. This is how a search is performed in this method; inserting, updating, or deleting a record is done in the same manner.
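The two-level lookup for roll number 111 can be sketched in Python as below; the index ranges, block addresses, and data block contents are hypothetical, mirroring the 100, then 110, then scan path described above:

import bisect

# First-level index (in primary memory): first roll number covered by each
# second-level page.
first_level = [1, 100, 200]

# Second-level index (on disk): for each first-level entry, the first roll
# number of each data block it covers, paired with that block's address.
second_level = {
    1:   [(1, "blk0"), (50, "blk1")],
    100: [(100, "blk2"), (110, "blk3"), (150, "blk4")],
    200: [(200, "blk5")],
}

# Data blocks (on disk): each block is scanned record by record.
data_blocks = {
    "blk3": [110, 111, 112, 113],
    # other blocks omitted in this sketch
}

def find(roll):
    # 1) highest first-level entry <= roll  (gives 100 for roll 111)
    i = bisect.bisect_right(first_level, roll) - 1
    page = second_level[first_level[i]]
    # 2) highest second-level entry <= roll (gives 110 for roll 111)
    keys = [k for k, _ in page]
    j = bisect.bisect_right(keys, roll) - 1
    _, block_addr = page[j]
    # 3) scan the data block until the record is found
    return roll in data_blocks.get(block_addr, [])

print(find(111))  # True: 111 is reached via 100 -> 110 -> scan of blk3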
Hashing in DBMS
Hashing maps a search key directly to the address of the disk block (bucket) that holds the corresponding record, using a hash function. In the directory-based scheme described here, k bits from the LSBs (least significant bits) of a key's hash value are used to map the key to its appropriate bucket through the directory IDs: the hash value points into the directory, the k bits are matched against the directories' IDs, and each directory entry maps to a bucket. Each bucket holds the values whose hash values, converted to binary, end in those k bits.
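A toy Python sketch of this mapping, assuming k = 2 (so the two least significant bits of the hash value select the directory entry); the keys are arbitrary, and the bucket splitting that real dynamic hashing performs on overflow is omitted:

k = 2                                         # number of least significant bits used
directory = {i: [] for i in range(2 ** k)}    # directory ID (in binary) -> bucket

def insert(key):
    h = hash(key)
    directory_id = h & ((1 << k) - 1)         # take the k LSBs of the hash value
    directory[directory_id].append(key)       # bucket holds keys whose hash ends in these bits

for key in [16, 4, 6, 22, 24, 10, 31, 7, 9]:
    insert(key)

for directory_id, bucket in directory.items():
    print(format(directory_id, "02b"), bucket)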