Primary Indexing
A primary index is built on a data file that is sorted on the table's primary key. It is a
single-level index in which each entry pairs a key value with a pointer into the data file, either to
a block or to an individual record. Primary indexes speed up access to records when the primary key
is used for searching. There are two main types of primary indexes:
1. Dense Index:
o In a dense index, there is an index entry for every record in the data file.
o Each index entry holds the primary key and a pointer to the exact location of the
corresponding record.
o Dense indexes provide faster lookups because every record has a direct entry, but
they consume more space than sparse indexes.
Example: If we have a table with 1000 records and a dense primary index, the index will have 1000
entries.
2. Sparse Index:
o In a sparse index, index entries exist for only some of the records, typically the first
record in every block of the data file.
o Each entry in the sparse index points to a block rather than an individual record.
o Sparse indexes are more space-efficient compared to dense indexes but require
more time to locate specific records.
Example: For a table of 1000 records stored across 10 blocks, a sparse index might have only 10
entries, one for each block.
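The difference between the two lookups can be sketched in a few lines of Python. This is only an illustration under assumed values: the 12-record file, the block size of 4, and the names `dense_lookup` and `sparse_lookup` are hypothetical and not part of the notes above.

```python
from bisect import bisect_left, bisect_right

# Hypothetical data file: 12 records sorted by primary key, 4 records per block.
BLOCK_SIZE = 4
records = [(key, f"row-{key}") for key in range(10, 130, 10)]  # keys 10, 20, ..., 120
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

# Dense index: one (key, block number, slot) entry for every record.
dense_index = [(rec[0], b, s) for b, blk in enumerate(blocks) for s, rec in enumerate(blk)]

# Sparse index: one (first key, block number) entry for every block.
sparse_index = [(blk[0][0], b) for b, blk in enumerate(blocks)]


def dense_lookup(key):
    """Binary-search the index, then jump straight to the record."""
    keys = [entry[0] for entry in dense_index]
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        _, b, s = dense_index[i]
        return blocks[b][s]
    return None


def sparse_lookup(key):
    """Find the last block whose first key is <= key, then scan that block."""
    first_keys = [entry[0] for entry in sparse_index]
    i = bisect_right(first_keys, key) - 1
    if i < 0:
        return None  # key is smaller than every key in the file
    for rec in blocks[sparse_index[i][1]]:
        if rec[0] == key:
            return rec
    return None
```

Here the dense index holds 12 entries and the sparse index only 3, but the sparse lookup has to scan inside the chosen block, which is the space/time trade-off described above.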
Disadvantages:
Maintaining the index can become cumbersome if the data file is updated frequently, as
every insertion or deletion may require reorganization of the index.
Secondary Indexing
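A secondary index is built on an attribute other than the primary key, so the data file is generally
not sorted on the indexed attribute.
1. Dense Secondary Index:
o Similar to the dense primary index, it has an index entry for every record in the table.
o Each entry contains the value of the secondary key and a pointer to the record in the
data file.
o Use Case: If you frequently query the table using a non-primary key, a dense
secondary index can help speed up access.
Example: If we have a table of employees, and we frequently query based on the "Department"
column (which is not the primary key), creating a dense secondary index on "Department" can
speed up those queries.
2. Sparse Secondary Index:
o A sparse secondary index contains entries only for some records, and each entry
points to a block or a range of records.
o Like the sparse primary index, it is more space-efficient but less precise in terms of
exact record lookup.
o Because the data file is not sorted on the secondary key, a sparse index cannot locate
records in the data file directly; it is normally used as an upper level built over a dense
secondary index.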
Advantages:
Useful for tables where searches are frequently performed on multiple fields.
Disadvantages:
Secondary indexes add additional overhead for maintenance because the index needs to be
updated when records are inserted, updated, or deleted.
They may also result in slower inserts and updates, as multiple indexes may need to be
updated.
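A dense secondary index on a non-unique column, and the extra work it adds to every insert, might look like the sketch below. The employee rows and the helper names `build_secondary_index` and `insert` are made up for illustration; a real DBMS would store the index on disk rather than in a dictionary.

```python
from collections import defaultdict

# Hypothetical employee file, stored in primary-key (emp_id) order.
employees = [
    (1, "Asha", "Sales"),
    (2, "Ben", "Engineering"),
    (3, "Chen", "Sales"),
    (4, "Dita", "HR"),
    (5, "Emil", "Engineering"),
]


def build_secondary_index(rows, column):
    """Dense secondary index: every record's position is listed under its key value."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index


def insert(rows, index, row, column):
    """Appending a record also requires updating the index (the maintenance overhead)."""
    rows.append(row)
    index[row[column]].append(len(rows) - 1)


dept_index = build_secondary_index(employees, column=2)

# Look up all employees in one department without scanning the whole file.
engineers = [employees[p] for p in dept_index["Engineering"]]
```

With several secondary indexes on the same table, `insert` would have to update each of them, which is why inserts and updates slow down as indexes are added.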
Cluster Indexing
Cluster indexing is a type of index that groups rows with the same value for the indexed attribute(s)
together. Unlike a secondary index, which leaves the order of the data file unchanged, a cluster
index organizes the physical storage of the data so that related records are stored in contiguous
blocks.
Cluster indexing is most effective when there are multiple records in the table with the same value
for an attribute or a group of attributes. It helps in improving the performance of queries that
retrieve data based on the clustered attribute(s).
1. Clustering Key:
o Records with the same clustering key values are stored physically close to each other
in the data file.
o The clustering key is not necessarily unique (unlike a primary key), which allows
multiple rows to share the same value.
2. Clustered Table:
o When a cluster index is created, the table is referred to as a clustered table because
the rows are physically rearranged to group records with the same values for the
clustering key.
o The physical arrangement of data helps in reducing the number of disk I/O
operations when retrieving records with the same clustering key.
o The index does not contain entries for every record but rather for blocks of records
that have the same clustering key values. It points to the first record of each group
(cluster) in the data file.
o In some cases, a dense index can be used, where there is an entry for every record in
the cluster. However, typically, a sparse index is preferred because the data is already
grouped physically, so fewer index entries are required.
Consider a table of students with the following attributes: Student_ID, Department, Name, Age.
If you create a cluster index on the Department column, students belonging to the same department
will be stored together on the disk. So, all students from the "Computer Science" department will be
stored in one block or adjacent blocks, followed by students from "Mechanical Engineering," and so
on.
Here, the cluster index is built on the Department column. Students are stored together based on
their department.
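The student example can be sketched as follows. The sample rows and the function name `fetch_department` are hypothetical; the sketch only shows the idea of reordering the file by the clustering key and keeping one index entry per group.

```python
# Hypothetical student rows: (Student_ID, Department, Name, Age), in arrival order.
students = [
    (101, "Mechanical Engineering", "Ravi", 20),
    (102, "Computer Science", "Mina", 19),
    (103, "Computer Science", "Omar", 21),
    (104, "Mechanical Engineering", "Lena", 22),
    (105, "Computer Science", "Tariq", 20),
]

# Creating the cluster index physically reorders the file by the clustering key.
clustered_file = sorted(students, key=lambda s: s[1])

# Sparse cluster index: one entry per distinct department,
# pointing to the position of the first record of that group.
cluster_index = {}
for position, student in enumerate(clustered_file):
    cluster_index.setdefault(student[1], position)


def fetch_department(dept):
    """Jump to the first record of the group, then read contiguous records."""
    start = cluster_index.get(dept)
    if start is None:
        return []
    result = []
    for student in clustered_file[start:]:
        if student[1] != dept:
            break  # the group has ended; no need to read further
        result.append(student)
    return result
```

Because all rows of a department sit next to each other, `fetch_department` stops as soon as the department changes instead of scanning the whole file.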
Advantages
When querying data that involves a range of values or multiple records with the same key (such as
fetching all students from a particular department), cluster indexing is highly efficient. Since the
records are stored together, fewer disk I/O operations are needed.
Cluster indexes are particularly useful when there are multiple rows with the same value for the
clustering key (non-unique values).
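Disadvantages
1. Insert/Update Overhead: Keeping records physically grouped by the clustering key means that an
insertion, or an update that changes the key, may require moving records or splitting blocks.
2. Less Flexibility: A table can be physically ordered in only one way, so it can have only one
cluster index.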
Multilevel Indexing
Multilevel indexing is a technique used to improve the efficiency of searching large datasets by
creating multiple levels of indexes, similar to a hierarchical structure. When a single-level index
becomes too large to fit into memory, multilevel indexing divides it into smaller, more manageable
parts, reducing the number of disk accesses required to find a record.
1. First-Level Index:
o The first level consists of a primary index (e.g., sparse or dense), where each entry
points to a block or set of records in the data file.
2. Second-Level Index:
o When the first-level index grows too large, a second-level index is created, which
indexes the first-level index.
o The second-level index contains entries pointing to blocks in the first-level index.
3. Higher-Level Indexes:
o If even the second-level index becomes large, a third-level index is created, and so
on, forming a hierarchical structure.
The idea is to keep adding levels until the topmost index is small enough to stay in memory (ideally
a single block), so that a search reads only one block per level instead of scanning a large index.
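The construction and search can be sketched as below, assuming a sparse first-level index that holds the first key of each data block. The fan-out of 4 entries per index block and the names `build_levels` and `search` are assumptions chosen to keep the example small.

```python
from bisect import bisect_right

FANOUT = 4  # hypothetical number of index entries that fit in one index block


def build_levels(first_keys, fanout=FANOUT):
    """levels[0] is the first-level index; each higher level indexes the blocks below it."""
    levels = [list(first_keys)]
    while len(levels[-1]) > fanout:  # stop once the top level fits in a single block
        below = levels[-1]
        levels.append([below[i] for i in range(0, len(below), fanout)])
    return levels


def search(levels, key, fanout=FANOUT):
    """Return (data block number, index blocks read); block number is None if key is too small."""
    block_no = 0  # the top level consists of exactly one block
    reads = 0
    for level in reversed(levels):
        start = block_no * fanout
        entries = level[start:start + fanout]  # read one index block
        reads += 1
        i = bisect_right(entries, key) - 1     # last entry whose key is <= search key
        if i < 0:
            return None, reads
        block_no = start + i                   # block to read at the next level down
    return block_no, reads


# 64 data blocks whose first keys are 0, 10, 20, ..., 630.
levels = build_levels(range(0, 640, 10))
```

With these assumed numbers the index has three levels of 64, 16, and 4 entries, and any lookup reads exactly three index blocks before going to the data file.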
Advantages
Scalable
Disadvantages
Increased Complexity
B-Tree Properties:
1. Balanced Structure: B-trees remain balanced, so the height of the tree is logarithmic in the
number of keys, which keeps search, insertion, and deletion efficient.
2. Order (M): Each node in a B-tree can have at most M children, and each internal node other
than the root has at least ⌈M/2⌉ children.
3. Key Distribution: Keys are distributed across all nodes, both internal and leaf nodes.
4. Minimum Occupancy: Every node except the root must be at least half full.
6. Leaf Level: All leaves are at the same level (balanced structure).
7. Search Efficiency: Searches can be performed in O(log n), where n is the
number of keys.
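A search over such a tree can be sketched as follows. The tree is built by hand as a small order-3 example; `BTreeNode` and `btree_search` are illustrative names, and insertion and node splitting are left out.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    keys: list                                      # sorted keys stored in this node
    children: list = field(default_factory=list)    # empty for a leaf node


def btree_search(node, key):
    """Return (node, position) if the key is stored anywhere in the tree, else None."""
    while True:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node, i            # keys can be found in internal nodes too
        if not node.children:
            return None               # reached a leaf without finding the key
        node = node.children[i]       # descend into the subtree between the two keys


# Hand-built B-tree of order 3: at most 3 children and 2 keys per node,
# all leaves on the same level.
root = BTreeNode(
    keys=[20, 40],
    children=[BTreeNode([5, 10]), BTreeNode([25, 30]), BTreeNode([45, 50])],
)
```

Searching for 40 stops at the root, while searching for 30 descends one level to a leaf, which illustrates that keys are distributed across both internal and leaf nodes.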
B+ Tree Properties:
1. Balanced Tree Structure: Similar to B-trees, B+ trees are balanced, ensuring logarithmic
depth.
2. Internal and Leaf Nodes: Internal nodes contain keys only, while leaf nodes contain the
actual data (or pointers to data).
3. Linked Leaves: Leaf nodes are linked together to form a sorted linked list for efficient
sequential access.
4. Efficient Range Queries: Since all data resides in leaf nodes linked together, range queries
and sequential access can be done efficiently.
6. Full Utilization of Nodes: Non-leaf nodes hold only keys and child pointers for guiding
searches, so more keys fit in each internal node and the tree has a higher fan-out.
7. Search Optimization: A search follows the internal nodes down to a leaf, and every lookup ends
at the leaf level, where the key is either found or known to be absent.
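These properties can be illustrated with a small hand-built B+ tree. The class names `Leaf` and `Internal` and the functions `find_leaf`, `get`, and `range_scan` are hypothetical; the sketch assumes the convention that keys greater than or equal to a separator go to the right child, and it omits insertion and deletion.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list
    values: list                      # the data (or pointers to data) lives only here
    next: Optional["Leaf"] = None     # link to the next leaf in key order


@dataclass
class Internal:
    keys: list                        # separator keys only, no data
    children: list


def find_leaf(node, key):
    """Follow separator keys down to the leaf that could contain the key."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def get(root, key):
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.values[i]
    return None


def range_scan(root, low, high):
    """Descend once to the leaf for `low`, then walk the linked leaves up to `high`."""
    leaf = find_leaf(root, low)
    out = []
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > high:
                return out
            if k >= low:
                out.append((k, v))
        leaf = leaf.next
    return out


# Hand-built tree: one internal node over three linked leaves.
l1 = Leaf([5, 10], ["a", "b"])
l2 = Leaf([20, 30], ["c", "d"])
l3 = Leaf([40, 50], ["e", "f"])
l1.next, l2.next = l2, l3
root = Internal(keys=[20, 40], children=[l1, l2, l3])
```

Note that `range_scan` touches the internal node only once; after that it moves sideways through the leaf links, which is what makes sequential access cheap.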
B+ Tree Indexing:
B+ trees are extensively used in database indexing because of their efficient storage structure and
ability to handle a large number of records. Here's how they help in indexing:
1. Efficient Search: The logarithmic height of the B+ tree ensures fast search operations.
2. Sorted Leaf Nodes: Leaf nodes in B+ trees are linked, making range queries efficient and
sequential access easier.
3. Minimized Disk Access: Because internal nodes hold only keys, each node has a high fan-out
and the tree stays shallow, so a lookup reads only a few blocks; adjacent keys are also stored
together in the same leaf.
4. Better Space Utilization: By storing data only in leaf nodes, internal nodes stay small, which
makes indexing more efficient.
5. Support for Range Queries: Since all records are stored at the leaf level in sorted order and
are linked, B+ trees allow efficient range searches (e.g., finding all records between two
keys).
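To get a feel for how shallow such an index stays, the number of levels can be estimated with a short calculation. The figures used here (4096-byte blocks, 8-byte keys, 8-byte pointers) are assumptions for illustration only, and the estimate is optimistic because it treats every node as completely full.

```python
def bplus_tree_levels(num_records, fanout, leaf_capacity):
    """Levels (including the leaf level) needed when every node is completely full."""
    nodes = -(-num_records // leaf_capacity)   # ceiling division: number of leaf nodes
    levels = 1
    while nodes > 1:
        nodes = -(-nodes // fanout)            # nodes needed one level up
        levels += 1
    return levels


# Assumed sizes: a 4096-byte block holding 8-byte keys with 8-byte pointers.
BLOCK_BYTES = 4096
ENTRY_BYTES = 8 + 8
FANOUT = BLOCK_BYTES // ENTRY_BYTES            # 256 entries per node

levels_for_one_million = bplus_tree_levels(1_000_000, FANOUT, FANOUT)
levels_for_hundred_million = bplus_tree_levels(100_000_000, FANOUT, FANOUT)
```

Under these assumptions one million records need 3 levels and one hundred million need 4, so a lookup reads only a handful of blocks even for very large tables.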