Lec20Indexing v1

The document discusses database storage and indexing mechanisms, emphasizing the importance of indexing for improving data access speed. It covers various types of indexes, including primary, secondary, dense, and sparse indexes, along with their advantages and disadvantages. Additionally, it introduces B-trees as an efficient structure for maintaining indexes in databases, highlighting their self-organizing capabilities and balanced nature.

INDEXING

Spring2024
Dr. Shamsad Parvin
Chapter 14, Silberschatz; Chapter 14, Ullman

Database storage
• The DBMS assumes that the primary storage location of the database is on a non-volatile disk.
• The DBMS's components manage the movement of data between non-volatile and volatile storage.
[Figure: storage hierarchy. Volatile storage is faster and more expensive; non-volatile storage is slower and cheaper.]

Database storage
[Figure: access time across the storage hierarchy. Source: https://gist.github.com/hellerbarde/2843375]

Basic Concepts
• Indexing mechanisms are used to speed up access to desired data.
  – E.g., the author catalog in a library
• When relations are very large, it becomes expensive to scan all of a relation's tuples to find the tuples that match a given condition.
• Example: assume that Movies has 10,000 tuples, of which only 100 are movies made in 1990.

    SELECT *
    FROM Movies
    WHERE studioname = 'Disney' AND Year = 1990;

[Figure: search time plotted against file size.]

Basic concepts
• Primary mechanism to improve the performance of the database
• Persistent data structure stored in a database
• Functionality: an index on A helps with a condition such as T.A = 'Apple'

    A           B    C
    Apple       2    ..
    Banana      5    ..
    Kiwi        6    ..
    Orange      8    ..
    Apple       9    ..
    Banana      11   ..
    Strawberry  5    ..

Basic concepts
• Primary mechanism to improve the performance of the database
• Persistent data structure stored in a database
• Functionality:
  – Index on A: T.A = 'Apple'
  – Index on B: T.B = '5', T.B < '8'
  – Index on A and B: T.A = 'Apple' and T.B > 2

    A           B    C
    Apple       2    ..
    Banana      5    ..
    Kiwi        6    ..
    Orange      8    ..
    Apple       9    ..
    Banana      11   ..
    Strawberry  5    ..

Indexes (or Indices)
• Data structures used for quickly locating tuples that meet a specific type of condition
  – Equality condition: find Movie tuples where Director = X
  – Range condition: find Employee tuples where Salary > 40 AND Salary < 50
• Contains (key, value) pairs:
  – The key k = an attribute value
  – The value k* = one of:
      • a pointer to the record (secondary index)
      • or the record itself (primary index)
• The search key for an index on a relation can be any subset of the fields of the relation. It is different from a 'key' (a minimal set of fields that uniquely identifies a record in a relation).

Indexes (or Indices)
• Search key: an attribute or set of attributes used to look up records in a file.
• An index file consists of records (called index entries) of the form

    search-key | pointer

  The search key is, for example, the primary key; the pointer holds the address of a disk block.
• Index files are typically much smaller than the original file
• Two basic kinds of indices:
  – Ordered indices: search keys are stored in sorted order
  – Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.

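The two kinds of index can be contrasted with a small in-memory sketch. Everything here is illustrative rather than taken from the lecture: the `records` list stands in for a data file, list positions stand in for disk addresses, and the helper names `lookup_equal` and `lookup_range` are hypothetical.

```python
import bisect

# Hypothetical toy data file: (key, payload) records; the list position
# stands in for a disk address.
records = [(40, "d"), (10, "a"), (30, "c"), (20, "b")]

# Hash index: search key -> record position (good for equality conditions).
hash_index = {key: pos for pos, (key, _) in enumerate(records)}

# Ordered index: (search key, record position) entries kept in sorted order.
ordered_index = sorted((key, pos) for pos, (key, _) in enumerate(records))
ordered_keys = [key for key, _ in ordered_index]


def lookup_equal(key):
    """Equality condition answered through the hash index."""
    pos = hash_index.get(key)
    return None if pos is None else records[pos]


def lookup_range(low, high):
    """Range condition low <= key <= high answered through the ordered index."""
    start = bisect.bisect_left(ordered_keys, low)
    end = bisect.bisect_right(ordered_keys, high)
    return [records[pos] for _, pos in ordered_index[start:end]]
```

The hash index has no useful order, so the range condition is served by the ordered index, where the matching entries are contiguous.
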
Index Evaluation Metrics
• There are many types of indexes; evaluate them on:
  – Access types
  – Access time
  – Insertion time
  – Deletion time
  – Space overhead

Indexes (or Indices)
• Advantages of using indexing
  – Indexing reduces I/O time
• Disadvantages
  – Extra space: marginal
  – Overhead: medium
  – Index maintenance: a tradeoff between speed and cost
• The benefit of an index depends on
  – How big the relation is
  – Data distribution
  – Query vs. update load

Physical Design Advisor
• Input: database and workload
• Output: recommended indexes
[Figure: the query optimizer takes the database statistics, a query or update, and a set of indexes, and produces the best execution plan with its estimated cost.]

SQL Syntax
• CREATE INDEX IndexName ON T(A)
• CREATE INDEX IndexName ON T(A, B, C, …)
• DROP INDEX IndexName

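As a quick way to try these statements, the sketch below runs them against an in-memory SQLite database through Python's standard `sqlite3` module. The `Movies` schema, the sample rows, and the index names `YearIndex` and `StudioYearIndex` are made up for illustration; other DBMSs accept the same basic statements but may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (title TEXT, year INTEGER, studioName TEXT)")
conn.executemany(
    "INSERT INTO Movies VALUES (?, ?, ?)",
    [("Movie A", 1990, "Disney"), ("Movie B", 1991, "Fox"), ("Movie C", 1990, "Fox")],
)

# A single-attribute index and a multi-attribute index (hypothetical names).
conn.execute("CREATE INDEX YearIndex ON Movies(year)")
conn.execute("CREATE INDEX StudioYearIndex ON Movies(studioName, year)")

# The query result is the same with or without indexes; only the cost changes.
rows = conn.execute(
    "SELECT title FROM Movies WHERE studioName = ? AND year = ?", ("Disney", 1990)
).fetchall()

# SQLite lists its indexes in the sqlite_master catalog table.
index_names = {
    name
    for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
}

conn.execute("DROP INDEX YearIndex")
remaining = {
    name
    for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
}
```
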
Index Classification
• Primary: the index is over attributes that include the primary key
• Secondary: otherwise
• A primary index is classified as dense or sparse
• Clustered/unclustered
  – Clustered = records close in the index are close in the data
  – Unclustered = records close in the index may be far apart in the data
• A file can be clustered on at most one search key
• The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!

                     Key               Non Key
    Ordered file     Primary index     Clustered index
    Unordered file   Secondary index   Secondary index

Sequential Files
• A sequential file is created by sorting the tuples of a relation by their primary key. The tuples are then distributed among blocks, in this order.
• The index file needs far fewer blocks than the data file, and is hence easier to fit in memory.
• It is common to leave some free space in each block; otherwise, insertions of new records need to be handled by overflow blocks.

Dense Index
• A dense index is a sequence of blocks holding only the keys of the records and pointers to the records themselves.
• Index blocks of the dense index maintain these keys in the same sorted order as in the main data file itself.
• Then, by using the index (once the index itself is in main memory), we can find any record given its search key with only one disk I/O per lookup.

Dense Index Files: Example
• Dense index: an index record appears for every search-key value in the file.
• E.g., an index on the ID attribute of the instructor relation
• Number of rows in the index table = number of rows in the main table

Dense Index Files (Cont.)
• Dense clustering index on dept_name, with the instructor file sorted on dept_name
• The file is ordered on dept_name, which is a non-key attribute

Class Activity
• When could a dense index possibly help?
  – When the number of index blocks is small compared to the number of data blocks

Sparse Index
• Typically, only one key per data block
• Searching: find the index record with the largest value that is less than or equal to the value we are looking for
• What is its advantage over a dense index?
  – It uses less space than a dense index, at the expense of somewhat more time to find a record given its key.
• When should you use it?
  – Only use a sparse index if the data file is sorted by the search key; a dense index can be used for any search key.
• Number of rows in the index table = number of blocks in the main table

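The difference between the two lookups can be sketched in a few lines. This is an illustration under simplifying assumptions, not the lecture's own code: the `blocks` list plays the role of a data file sorted by search key with two records per block, and `dense_lookup` and `sparse_lookup` are hypothetical helper names.

```python
import bisect

# Hypothetical sorted data file: each block holds up to 2 (key, payload) records.
blocks = [
    [(10, "a"), (20, "b")],
    [(30, "c"), (40, "d")],
    [(50, "e"), (60, "f")],
]

# Dense index: one (key, block number, slot) entry per record.
dense = [(key, b, s) for b, blk in enumerate(blocks) for s, (key, _) in enumerate(blk)]
dense_keys = [key for key, _, _ in dense]

# Sparse index: one (first key, block number) entry per block.
sparse = [(blk[0][0], b) for b, blk in enumerate(blocks)]
sparse_keys = [key for key, _ in sparse]


def dense_lookup(key):
    i = bisect.bisect_left(dense_keys, key)
    if i == len(dense_keys) or dense_keys[i] != key:
        return None  # absence is decided from the index alone
    _, b, s = dense[i]
    return blocks[b][s]


def sparse_lookup(key):
    # Largest index entry whose key is <= the search key.
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None
    _, b = sparse[i]
    for record in blocks[b]:  # the block must be read and scanned
        if record[0] == key:
            return record
    return None
```

Here the dense index has six entries and the sparse index three, matching the row counts stated on the slides.
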
Multiple Levels of Index
• An index file can cover many blocks.
  – Even if we use binary search to find the desired index entry, we may still need to do many disk I/Os to get to the record we want.
• By putting an index on the index, we can make the use of the first level of index more efficient.
• However, this idea has its limits, and we prefer the B-tree structure, to be discussed a bit later.
• In the example, the second level is a sparse index. Can we have multiple layers of dense index instead to enhance search efficiency?

Multilevel Index
• If the index does not fit in memory, access becomes expensive.
• Solution: treat the index kept on disk as a sequential file and construct a sparse index on it.
  – outer index: a sparse index of the basic index
  – inner index: the basic index file
• If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.
• Indices at all levels must be updated on insertion into or deletion from the file.

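To get a feel for how quickly the levels shrink, here is a rough sizing sketch. It assumes every level, including the inner one, is sparse with one entry per block of the level below; the function name `index_levels` and its parameters are hypothetical.

```python
def index_levels(num_data_blocks, entries_per_block, memory_blocks=1):
    """Block count of each index level, from the inner index outwards.

    Assumes each level is a sparse index with one entry per block of the
    level below, and stops once the outermost level fits in memory_blocks.
    """
    if entries_per_block < 2:
        raise ValueError("an index block must hold at least 2 entries")
    levels = []
    blocks_below = num_data_blocks
    while True:
        # Ceiling division: blocks needed to hold one entry per block below.
        level_blocks = (blocks_below + entries_per_block - 1) // entries_per_block
        levels.append(level_blocks)
        if level_blocks <= memory_blocks:
            return levels
        blocks_below = level_blocks
```

For example, with 1,000,000 data blocks and 100 index entries per block, the levels need 10,000, 100, and 1 blocks, so three levels are enough for the outermost one to fit in a single block.
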
Multilevel Index (Cont.)

Secondary Index
• It serves the purpose of any index
• It is a data structure that facilitates finding records given a value for one or more fields.
• A secondary index does not determine the placement of records in the data file. Rather, the secondary index tells us the current locations of records; that location may have been decided by a primary index on some other field.

                     Key               Non Key
    Ordered file     Primary index     Clustered index
    Unordered file   Secondary index   Secondary index

Applications of Secondary Index
• Suppose there are relations R and S, with a many-one relationship from the tuples of R to the tuples of S. It may make sense to store each tuple of R with the tuple of S to which it is related, rather than according to the primary key of R.
• Example:

    Movie(title, year, length, genre, studioName, producerC#)
    Studio(name, address, presC#)

    SELECT title, year
    FROM Movie, Studio
    WHERE presC# = zzz AND Movie.studioName = Studio.name;

• Searching by the primary key (title, year) is not helpful here

Example
• We can create a clustered file structure for both relations, with the configuration below.
[Figure: clustered file in which each Studio tuple is stored together with the Movie tuples related to it.]
• Then, if we create an index for Studio with search key presC#, whatever the value of zzz is, we can quickly find the tuple for the proper studio.

Indirection in Secondary Index
• What happens when some keys are repeated?
• Try to search for key value 10
• When would this approach be helpful?



Secondary Indices Example
• Secondary index on the salary field of instructor
• An index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
• Secondary indices have to be dense

Indirection in Secondary Indexes
• Suppose a query has several conditions, and each condition has a secondary index to help it:

    SELECT title
    FROM Movie
    WHERE year = 2005 AND Movie.studioName = 'Disney';

• Find the bucket pointers that satisfy all the conditions by intersecting the sets of pointers in memory, and retrieve only the records pointed to by the surviving pointers.
• This saves the I/O cost of retrieving records that satisfy some, but not all, of the conditions.

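The pointer-intersection idea can be illustrated with sets of record ids. The sample `movie_file` contents and the helper `build_buckets` are invented for this sketch; in a real system the buckets would hold disk addresses rather than dictionary keys.

```python
# Hypothetical Movie file: record id -> (title, year, studioName).
movie_file = {
    0: ("Movie A", 2005, "Disney"),
    1: ("Movie B", 2005, "Fox"),
    2: ("Movie C", 1999, "Disney"),
    3: ("Movie D", 2005, "Disney"),
}


def build_buckets(field):
    """Secondary index with indirection: field value -> bucket of record ids."""
    buckets = {}
    for rid, record in movie_file.items():
        buckets.setdefault(record[field], set()).add(rid)
    return buckets


year_index = build_buckets(1)
studio_index = build_buckets(2)

# Intersect the two buckets in memory; only the surviving pointers are followed.
surviving = year_index.get(2005, set()) & studio_index.get("Disney", set())
titles = sorted(movie_file[rid][0] for rid in surviving)
```

Records 1 and 2 each satisfy only one condition, so they are never fetched from the data file.
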
Index Update: Deletion
• If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index also.
• Single-level index entry deletion:
  – Dense indices: deletion of the search key is similar to file record deletion.
  – Sparse indices:
      • If an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order).
      • If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Index Update: Insertion
• Single-level index insertion:
  – Perform a lookup using the search-key value of the record to be inserted.
  – Dense indices: if the search-key value does not appear in the index, insert it.
  – Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.
      • If a new block is created, the first search-key value appearing in the new block is inserted into the index.

B+-Tree Index Files
• Disadvantage of indexed-sequential files
  – Performance degrades as the file grows, since many overflow blocks get created.
  – Periodic reorganization of the entire file is required.
• Advantage of B+-tree index files:
  – The tree automatically reorganizes itself with small, local changes in the face of insertions and deletions.
  – Reorganization of the entire file is not required to maintain performance.
• (Minor) disadvantage of B+-trees:
  – Extra insertion and deletion overhead, space overhead.
• Advantages of B+-trees outweigh disadvantages
  – B+-trees are used extensively

B Trees
• A more general structure that is commonly used in commercial systems
• The particular variant that is most often used is known as a B+ tree
  – B-trees automatically maintain as many levels of index as is appropriate for the size of the file being indexed.
  – B-trees manage the space on the blocks they use so that every block is between half used and completely full
  – Balanced, i.e., all the paths from the root to the leaves have the same length
  – Disk-based: one node per block; each block has space for n search-key values and n + 1 pointers.

    P1 | K1 | P2 | K2 | ... | Pn | Kn | Pn+1

Example of B+-Tree

B+-Tree Node Structure
• Typical node

    P1 | K1 | P2 | ... | Pn–1 | Kn–1 | Pn

  – Ki are the search-key values
  – Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
• The search keys in a node are ordered
    K1 < K2 < K3 < . . . < Kn–1
  (Initially assume no duplicate keys; duplicates are addressed later)

B+-Tree Index Files (Cont.)
A B+-tree is a rooted tree satisfying the following properties:
• All paths from root to leaf are of the same length: the tree is balanced
• Each intermediate node has between ⌈n/2⌉ and n children.
• A leaf node has between ⌈(n–1)/2⌉ and n–1 key values
• Special cases:
  – If the root is not a leaf, it has at least 2 children.
  – If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) key values

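The occupancy rules above are easy to tabulate for a given n. The helper below is only a convenience sketch; its name `bplus_bounds` and the dictionary layout are not part of the lecture, and n is taken to be the maximum number of pointers per node, as on this slide.

```python
import math


def bplus_bounds(n):
    """Occupancy bounds for a B+-tree whose nodes hold up to n pointers."""
    return {
        # (minimum, maximum) number of key values in a non-root leaf
        "leaf_keys": (math.ceil((n - 1) / 2), n - 1),
        # (minimum, maximum) number of children of a non-root internal node
        "internal_children": (math.ceil(n / 2), n),
        # minimum number of children of a root that is not a leaf
        "root_children_min": 2,
    }
```

With n = 6 this gives 3 to 5 keys per leaf and 3 to 6 children per internal node.
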
Example: Leaf Nodes in B+-Trees
• A leaf node for the instructor B+-tree index (n = 4).

Class Activity
• Suppose our blocks are 4096 bytes. Also let keys be integers of 4 bytes and let pointers be 8 bytes. If there is no header information kept on the blocks, then we want to find the largest integer value of n such that an entire node can fit in one block.

    4n + 8(n + 1) ≤ 4096
    n = 340

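The same calculation can be written as a small function, which makes it easy to try other block, key, and pointer sizes. The name `max_keys_per_block` and the optional `header_bytes` parameter are additions for illustration; the node layout assumed is n keys plus n + 1 pointers, as in this activity.

```python
def max_keys_per_block(block_bytes, key_bytes, pointer_bytes, header_bytes=0):
    """Largest n such that n keys and n + 1 pointers fit in one block."""
    # Set aside room for the one extra pointer, then count (key, pointer) pairs.
    usable = block_bytes - header_bytes - pointer_bytes
    return max(usable // (key_bytes + pointer_bytes), 0)
```

For the numbers on the slide, `max_keys_per_block(4096, 4, 8)` solves 12n + 8 ≤ 4096, i.e. n ≤ 340.67, so n = 340.
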
Rules for the Blocks of a B-tree
• The keys in leaf nodes are copies of keys from the data file. These keys are distributed among the leaves in sorted order, from left to right.
• At the root, there are at least two used pointers. All pointers point to B-tree blocks at the level below.
• At a leaf, the last pointer points to the next leaf block to the right. Among the other n pointers in a leaf block, at least ⌊(n + 1)/2⌋ of these pointers are used and point to data records; unused pointers are null and do not point anywhere.
• At an interior node, all n + 1 pointers can be used to point to B-tree blocks at the next lower level. At least ⌈(n + 1)/2⌉ of them are actually used (but if the node is the root, then we require only that at least 2 be used, regardless of how large n is).

Example: A complete B-Tree
Notice that at the leaves, all the keys appear just once, in order, as we look across the leaves from left to right.

Class Activity
• B+-tree for the instructor file (n = 6)
  – How many search keys for each node?
  – What is the minimum number of search keys at a leaf node?
  – What is the minimum number of children for an internal node?
• Leaf nodes must have between 3 and 5 values (⌈(n–1)/2⌉ and n–1, with n = 6).
• Non-leaf nodes other than the root must have between 3 and 6 children (⌈n/2⌉ and n, with n = 6).
• The root must have at least 2 children.

B-tree vs. B+ Tree
• We will discuss the B+ tree, which has several advantages over the standard B-tree, including:
  – Although some keys are duplicated, the B+ tree allows data pointers to be present only in the leaf nodes, which makes search and updates more efficient
  – Leaf nodes are linked together as a linked list

Balancing Constraints
• Height constraint: all leaves are at the same lowest level
• Fan-out constraint: all nodes are at least half full (except the root)

Rules for the Blocks of a B-tree
• The keys in leaf nodes are copies of keys from the data file. These keys are distributed among the leaves in sorted order, from left to right.
• Class activity:
  – What is the value of n? Answer: 4
  – How many search keys for each node? Answer: 3

Lookups in B-tree
Suppose we have a B-tree index and we want to find a record with search-key value K.
• Basis: if we are at a leaf, look among the keys there. If the i-th key is K, then the i-th pointer will take us to the desired record.
• Induction: if we are at an interior node with keys K1, K2, . . . , Kn, there is only one child that could lead to a leaf with key K. If K < K1, then it is the first child; if K1 ≤ K < K2, it is the second child; and so on. Recursively apply the search procedure at this child.

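The basis and induction steps translate almost directly into code. The sketch below is an in-memory illustration, not a disk-based implementation: the `Leaf` and `Interior` classes and the `lookup` function are hypothetical names, and an interior node with keys K1..Km is assumed to have m + 1 children, with child i covering keys from K(i) up to but not including K(i+1).

```python
import bisect
from dataclasses import dataclass


@dataclass
class Leaf:
    keys: list      # sorted search-key values
    records: list   # records[i] is the record (or pointer) for keys[i]


@dataclass
class Interior:
    keys: list      # separator keys K1 < K2 < ...
    children: list  # len(children) == len(keys) + 1


def lookup(node, key):
    """Return the record stored under key, or None if the key is absent."""
    # Induction: at an interior node exactly one child can contain the key.
    while isinstance(node, Interior):
        node = node.children[bisect.bisect_right(node.keys, key)]
    # Basis: at a leaf, the i-th pointer belongs to the i-th key.
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.records[i]
    return None
```

`bisect_right` sends a key equal to a separator to the child on its right, which matches the rule K1 ≤ K < K2 for the second child.
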
Lookups
• select * from R where k = 179;
• select * from R where k = 32;

Range Queries on B+-Trees (Cont.)
• Range queries find all records with search-key values in a given range
• Example: with a B+-tree on attribute salary of instructor, we can find all instructor records with salary in a specified range such as [50000, 100000]
• We can find all records with search-key values in a specified range [lb, ub].
• The search first traverses to a leaf in a manner similar to the lookup method; the leaf may or may not actually contain the value lb. It then steps through records in that and subsequent leaf nodes, collecting pointers to all records with key values C.Ki such that lb ≤ C.Ki ≤ ub, where C is the current leaf.
• The function stops when C.Ki > ub, or there are no more keys in the tree.

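A minimal sketch of this procedure, assuming an in-memory tree whose leaves are chained through a `next` reference. The class and function names (`Leaf`, `Interior`, `find_leaf`, `range_query`) are illustrative, and the range is treated as inclusive, as in [lb, ub].

```python
import bisect
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list
    records: list
    next: Optional["Leaf"] = None  # right sibling in the leaf chain


@dataclass
class Interior:
    keys: list
    children: list  # len(children) == len(keys) + 1


def find_leaf(node, key):
    """Descend to the leaf where key would appear."""
    while isinstance(node, Interior):
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node


def range_query(root, lb, ub):
    """Collect records with lb <= key <= ub by walking the leaf chain."""
    leaf = find_leaf(root, lb)
    result = []
    while leaf is not None:
        for key, record in zip(leaf.keys, leaf.records):
            if key > ub:
                return result  # passed the upper bound: stop
            if key >= lb:
                result.append(record)
        leaf = leaf.next
    return result  # ran out of keys in the tree
```

Only one root-to-leaf traversal is needed; the rest of the work is a sequential scan along the leaves.
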
Range Query
• select * from R where k > 32 AND k < 179;
• Start by looking up 32

Updates on B+-Trees: Insertion
Assume the record has already been added to the file. Let
• pr be the pointer to the record, and
• v be the search-key value of the record
1. Find the leaf node in which the search-key value would appear
   a. If there is room in the leaf node, insert the (v, pr) pair in the leaf node
   b. Otherwise, split the node (along with the new (v, pr) entry) as discussed in the next slide, and propagate updates to parent nodes.

Updates on B+-Trees: Insertion (Cont.)
• Splitting a leaf node:
  – Take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
  – Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.
  – If the parent is full, split it and propagate the split further up.
• Splitting of nodes proceeds upwards until a node that is not full is found.
  – In the worst case the root node may be split, increasing the height of the tree by 1.

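The leaf-level part of insertion, including the split, can be sketched as a pure function on sorted lists. This covers only the leaf step: propagating (k, p) into the parent is left out, and the function name `insert_into_leaf` and its return convention are assumptions made for the illustration. A leaf is taken to hold at most n – 1 keys.

```python
import bisect
import math


def insert_into_leaf(keys, pointers, v, pr, n):
    """Insert (v, pr) into a leaf that can hold at most n - 1 keys.

    Returns (left_keys, left_pointers, split). split is None if the entry
    fit; otherwise it is (k, right_keys, right_pointers), where k is the
    least key of the new right node and must be inserted into the parent.
    """
    i = bisect.bisect_left(keys, v)
    keys = keys[:i] + [v] + keys[i:]
    pointers = pointers[:i] + [pr] + pointers[i:]
    if len(keys) <= n - 1:
        return keys, pointers, None  # there was room in the leaf
    half = math.ceil(n / 2)  # the first ceil(n/2) pairs stay in the original node
    right_keys, right_pointers = keys[half:], pointers[half:]
    return keys[:half], pointers[:half], (right_keys[0], right_keys, right_pointers)
```

With n = 4, inserting 25 into the full leaf [10, 20, 30] leaves [10, 20] in the original node, creates a new node [25, 30], and hands key 25 to the parent.
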
Example: Insertion
• Insert a record with search-key value 32
• Look up where the inserted key should go
• Insert it right there

Another Insertion Example
• Insert a record with search-key value 152
• Oops! The leaf is full: we need to split it to make room.

Another Insertion Example
• Insert a record with search-key value 152
• A pointer to the newly created node is added
• Oops! This node becomes full now!

Another Insertion Example
• Insert a record with search-key value 152
• A pointer to the newly created node is added
• In the worst case, node splitting can “propagate” all the way up to the root of the tree (not illustrated here)
  – Splitting the root introduces a new root of fan-out 2 and causes the tree to grow “up” by one level

Another Insertion Example
• Construct a B-tree for the following set of key values: 1, 2, 3, 4, 5, 6, 7
• The number of pointers that will fit in one node is 4
• Maximum number of keys = 3
• Minimum number of keys = (4/2) – 1 = 1

Deletion
• Delete a record with search-key value 130
• Look up 130, the key to be deleted
• And delete it
• Oh, the node now seems to be too empty
• Steal one key from a sibling, if the sibling has more than enough keys

Deletion
[Figure: B+-tree after the deletion; label: 156.]
• Remember to fix the key in the least common ancestor of the affected nodes
