Index and Hashing 2017 Combined
Week 4
National College of Ireland
Dublin, Ireland.
Indexing and Hashing
• Introduction to Indexing
• Basic Concepts
• Ordered Indices
• Static Hashing
• Dynamic Hashing
• Comparison of Ordered Indexing and Hashing
• Index Definition in SQL
• Tree Structures
Introduction to Indexing
• Main memory (RAM or ROM) is significantly faster than a hard disk drive (HDD).
• SSDs are faster than HDDs, but they have not yet eliminated this difference.
• In general, we assume that each relation (table) is stored in a separate file
with B blocks and R records per block.
File Organizations
• Technique for physically arranging records of a file on secondary
storage
• Factors for selecting file organization:
▪ Fast data retrieval and throughput
▪ Minimizing need for reorganization
▪ Efficient storage space utilization
▪ Accommodating growth
▪ Protection from failure and data loss
▪ Security from unauthorized use
Sequential storage:
• Records of the file are stored in sequence by the primary key field values.
• Average time to find desired record = log2 n
• If this were a heap, average time to find desired record = n/2
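The two search costs can be checked with a small sketch. It counts key comparisons for a linear scan (heap file) and a binary search (file sorted on the key). The file size n = 1024 and the integer keys are assumptions chosen for illustration, and comparisons stand in for record accesses.

```python
def linear_search(records, key):
    """Scan a heap file record by record; return (position, comparisons)."""
    for comparisons, value in enumerate(records, start=1):
        if value == key:
            return comparisons - 1, comparisons
    return -1, len(records)


def binary_search(records, key):
    """Search a file sorted on the key; return (position, comparisons)."""
    lo, hi, comparisons = 0, len(records) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if records[mid] == key:
            return mid, comparisons
        if records[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons


n = 1024                               # assumed file size
sorted_file = list(range(n))           # hypothetical keys 0..n-1

# Average cost over every key that is present in the file.
avg_linear = sum(linear_search(sorted_file, k)[1] for k in sorted_file) / n
avg_binary = sum(binary_search(sorted_file, k)[1] for k in sorted_file) / n
max_binary = max(binary_search(sorted_file, k)[1] for k in sorted_file)
```

For n = 1024 the linear scan averages 512.5 comparisons (about n/2), while binary search never needs more than 11 (about log2 n).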
Indexed File Organizations
• Storage of records sequentially or
nonsequentially with an index that
allows software to locate individual
records
Heap File
no specific structure
Use sequential (linear) access to records
Hash File
Uses hash function of a set of hash fields
Allows direct access if the hash field values are known
Introduction to Indexing
Indexes
An index access structure is associated with a particular search key
and contains records consisting of the key value and the address of
the logical record in the data file containing the key value
• Index files are typically much smaller than the original file
• Two basic kinds of indices:
i. Ordered indices: search keys are stored in sorted order
ii. Hash indices: search keys are distributed uniformly across
“buckets” using a “hash function”.
Files without an Index
Ordered File
Consider the following example of Staff table:
Sno   Lname  Position   NIN        Bno
SG14  White  Manager    WK4416     B5
SG21  Black  Snr Asst   WL7868     B3
SG24  Ford   Deputy     WL8767 8   B3
SG36  Brown  Assistant  WF7656 75  B4
• Find Sno = SG36
• Find Sno from SG36 to SL37
• Find Lname of Murphy
Primary Indexes
• If the data file is sequentially
ordered and the indexing field
is a key field of the file (i.e., it
is guaranteed to have a unique
value in each record) then the
index is called a primary
index.
Primary index on the ordering key field of the file
Clustering Index
• If the data file is sequentially ordered on a non-key field (values may be repeated) and the indexing field corresponds to this non-key field, then the index is called a clustering index.
• In SQL Server, when the key of a clustered index is not unique, an additional hidden column is added to make the key unique.
• A clustered index is the most common type of table organization.
A clustering index on the Dept_number ordering nonkey field of an EMPLOYEE file
Secondary Index
• A Secondary Index is a
data structure that contains a
subset of attributes from a
table, along with an alternate
key to
support Query Operations.
Dense Index Files
• Dense index — Index record appears for every search-key value in the file.
• e.g., index on ID attribute of instructor relation
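As a rough sketch of the difference between a dense index and a sparse index over the same sorted file: the dense index keeps one entry per search-key value, the sparse index keeps one entry per block. The ID values, the block size of 3 records, and all function names are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

RECORDS_PER_BLOCK = 3   # assumed block size

# Hypothetical data file, sorted on the search key (instructor ID).
data_file = [10101, 12121, 15151, 22222, 32343, 33456, 45565, 58583, 76543]
blocks = [data_file[i:i + RECORDS_PER_BLOCK]
          for i in range(0, len(data_file), RECORDS_PER_BLOCK)]

# Dense index: one (key, pointer) entry for every search-key value.
dense_keys = [key for block in blocks for key in block]
dense_ptrs = [(b, o) for b, block in enumerate(blocks)
              for o in range(len(block))]

# Sparse index: one entry per block, holding the first key in that block.
sparse_keys = [block[0] for block in blocks]


def lookup_dense(key):
    """Binary search the index; the entry points straight at the record."""
    i = bisect_left(dense_keys, key)
    if i < len(dense_keys) and dense_keys[i] == key:
        return dense_ptrs[i]
    return None


def lookup_sparse(key):
    """Find the last entry whose key is <= the search key, then scan its block."""
    b = bisect_right(sparse_keys, key) - 1
    if b < 0 or key not in blocks[b]:
        return None
    return (b, blocks[b].index(key))
```

Both lookups return the same (block, offset) pair; the sparse index holds 3 entries here against 9 for the dense index, at the price of scanning one block.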
Single-level Ordered Indexes
Primary Index Performance
• The index file requires significantly fewer blocks than the data
file
i. Sparse index
ii. Index file record typically smaller in size than data file record
• A binary search on the index file requires fewer block accesses
than a binary search on the data file
• Insertion and deletion of records is problematic
• We have to move records in the data file and also change some index entries
• Storage overhead is not a serious problem
Single-level Ordered Indexes
[Figure: single-level ordered index over data Blocks 1 to 3]
Single-level Ordered Indexes
Secondary Index Performance
• A secondary index is built on a non-ordering field of the data file
• The index file is itself another sorted file whose records are of fixed or variable length consisting of two fields
• The first field is of the same data type as the indexing field of the data file
• The second field is a pointer to a disk block or a record
Multi-level Indexes
Multi-level Indexes Performance
• Search performance is increased when searching for a record based
on a specified indexing field value
[Figure: multi-level index, with Index Blocks 0 and 1 pointing to Data Blocks 0 and 1]
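A worked count of block accesses may make the gain concrete. The sketch below assumes a binary search over a single-level index costs about log2 of the number of index blocks, that a multi-level index costs one block read per level, and that one further read fetches the data block. The figures (30,000 first-level entries, 68 entries per index block) are assumptions chosen for illustration.

```python
import math


def single_level_accesses(index_blocks):
    """Binary search over the index blocks, plus one data-block read."""
    return math.ceil(math.log2(index_blocks)) + 1


def multi_level_accesses(first_level_entries, fan_out):
    """One block read per index level, plus one data-block read."""
    levels, entries = 0, first_level_entries
    while True:
        blocks = math.ceil(entries / fan_out)   # blocks needed at this level
        levels += 1
        if blocks <= 1:                         # top level fits in one block
            break
        entries = blocks                        # next level indexes these blocks
    return levels + 1


entries, fan_out = 30_000, 68                   # assumed sizes
index_blocks = math.ceil(entries / fan_out)     # 442 first-level blocks
```

With these numbers, `single_level_accesses(index_blocks)` gives 10 block accesses and `multi_level_accesses(entries, fan_out)` gives 4.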
Index Evaluation Metrics
Index Update: Deletion
• If the deleted record was the only record in the file with its particular search-key value, the search key is deleted from the index also.
Index Update: Insertion
Single-level index insertion:
• Perform a lookup using the search-key value appearing in the
record to be inserted.
• (See figure in previous slide.)
• Hashing can be used not only for file organization, but also
for index-structure creation. A hash index organizes the
search keys, with their associated record pointers, into a
hash file structure.
Hash index buckets (account numbers):
bucket 1: A-215, A-305
bucket 2: A-101, A-110
bucket 3: A-217, A-102, A-201
bucket 4: A-218
bucket 5: (empty)
bucket 6: A-222

Account file (branch name, account number, balance):
Brighton    A-217  750
Downtown    A-101  500
Downtown    A-110  600
Miami       A-215  700
Perryridge  A-102  400
Perryridge  A-201  900
Perryridge  A-218  700
Redwood     A-222  700
Round Hill  A-305  350
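A minimal sketch of a hash index over the account file follows. The hash function is an assumption (sum of the digits of the account number, modulo 7); it is used here because it reproduces the bucket assignments in the figure data, not because the slides state it. The names `h`, `buckets` and `lookup` are hypothetical.

```python
NUM_BUCKETS = 7   # assumed: buckets numbered 0..6


def h(account_number):
    """Assumed hash function: sum of the digits, modulo the bucket count."""
    return sum(int(c) for c in account_number if c.isdigit()) % NUM_BUCKETS


# Data file: list position acts as the record pointer.
account_file = [
    ("Brighton", "A-217", 750),
    ("Downtown", "A-101", 500),
    ("Downtown", "A-110", 600),
    ("Miami", "A-215", 700),
    ("Perryridge", "A-102", 400),
    ("Perryridge", "A-201", 900),
    ("Perryridge", "A-218", 700),
    ("Redwood", "A-222", 700),
    ("Round Hill", "A-305", 350),
]

# Hash index: each bucket holds (search key, record pointer) pairs.
buckets = [[] for _ in range(NUM_BUCKETS)]
for position, (_, account, _) in enumerate(account_file):
    buckets[h(account)].append((account, position))


def lookup(account):
    """Hash the key, search only that bucket, then follow the pointer."""
    for key, position in buckets[h(account)]:
        if key == account:
            return account_file[position]
    return None
```

A lookup touches one bucket and then one record, which is why hashing suits queries on a specified key value.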
Deficiencies of Static Hashing
• In static hashing, function (h) maps search-key values to a fixed set (B) of bucket addresses.
• Databases grow with time. If the initial number of buckets is too small, performance will degrade due to too many overflows.
• If the file size at some point in the future is anticipated and the number of buckets is allocated accordingly, a significant amount of space will be wasted initially.
• One option is periodic reorganization of the file with a new hash function, but it is very expensive.
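The trade-off can be illustrated with a small calculation. The sketch assumes a fixed number of buckets B, a fixed capacity per bucket, hypothetical integer keys 0..n-1, and the hash function h(key) = key mod B; none of these values come from the slides.

```python
def bucket_counts(num_records, num_buckets):
    """Distribute hypothetical keys 0..n-1 with h(key) = key mod B."""
    counts = [0] * num_buckets
    for key in range(num_records):
        counts[key % num_buckets] += 1
    return counts


def overflow_records(num_records, num_buckets, bucket_capacity):
    """Records that do not fit in their primary bucket."""
    return sum(max(0, c - bucket_capacity)
               for c in bucket_counts(num_records, num_buckets))


def unused_slots(num_records, num_buckets, bucket_capacity):
    """Allocated record slots that hold nothing yet."""
    return sum(max(0, bucket_capacity - c)
               for c in bucket_counts(num_records, num_buckets))


B, CAPACITY = 10, 4     # assumed: 10 buckets of 4 records each
```

With 20 records, half of the 40 slots sit unused; with 100 records, 60 records spill into overflow buckets, since B cannot change without reorganizing the file.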
Comparison of
Ordered Index and Hashing
• Issues to consider:
▪ Cost of periodic re-organization
▪ Relative frequency of insertions and deletions
▪ Is it desirable to optimize average access time at the expense
of worst-case access time?
▪ Expected type of queries:
  – Hashing is generally better at retrieving records having a specified value of the key.
  – If range queries are common, ordered indices are to be preferred
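The query-type point can be sketched with two in-memory stand-ins: a Python dict playing the role of a hash index and a sorted list playing the role of an ordered index. The keys and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

keys = [5, 12, 17, 23, 31, 42, 58, 64]          # hypothetical search keys
hash_index = {k: f"record-{k}" for k in keys}   # stand-in for a hash index
ordered_index = sorted(keys)                    # stand-in for an ordered index


def point_lookup_hash(key):
    """Equality search: one hash computation, no ordering needed."""
    return hash_index.get(key)


def range_ordered(lo, hi):
    """Range search: locate both ends by binary search, return the slice."""
    return ordered_index[bisect_left(ordered_index, lo):
                         bisect_right(ordered_index, hi)]


def range_hash(lo, hi):
    """Range search on a hash index: every entry has to be examined."""
    return sorted(k for k in hash_index if lo <= k <= hi)
```

Both range functions return the same keys, but `range_hash` inspects all entries because hashing scatters neighbouring keys across buckets.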
Comparison of
Different File Organizations
B-Tree Index
• Provide multi-level access structure
• Tree is always balanced
• Space wasted by deletion never becomes excessive
  – Each node is at least half-full
• Each node in a B-tree of order p can have at most p-1 search values
B-tree structures (a) A node in a B-tree with q−1 search values (b) A B-tree of order p=3. The values were inserted in the order 8, 5, 1, 7, 3, 12, 9, 6
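The insertion sequence 8, 5, 1, 7, 3, 12, 9, 6 can be replayed with a minimal B-tree sketch. It assumes the common "insert into a leaf, split on overflow, push the median up" strategy; the class and function names are hypothetical, and data pointers and deletion are left out.

```python
from bisect import bisect_left, insort


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []      # an empty list marks a leaf


class BTree:
    def __init__(self, order):
        self.order = order                  # p: at most p-1 keys per node
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                           # root overflowed: grow one level
            median, right = split
            self.root = Node([median], [self.root, right])

    def _insert(self, node, key):
        if not node.children:               # leaf: place the key in order
            insort(node.keys, key)
        else:                               # internal: descend, absorb a split
            i = bisect_left(node.keys, key)
            split = self._insert(node.children[i], key)
            if split:
                median, right = split
                node.keys.insert(i, median)
                node.children.insert(i + 1, right)
        if len(node.keys) > self.order - 1:
            return self._split(node)
        return None

    @staticmethod
    def _split(node):
        """Keep the left half in place, return (median, new right node)."""
        mid = len(node.keys) // 2
        median = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return median, right


def levels(tree):
    """Return the keys of every node, level by level."""
    out, current = [], [tree.root]
    while current:
        out.append([n.keys for n in current])
        current = [c for n in current for c in n.children]
    return out


tree = BTree(order=3)
for value in [8, 5, 1, 7, 3, 12, 9, 6]:
    tree.insert(value)
```

After the eight insertions the root holds 5 and 8, and the three leaves hold 1 and 3, 6 and 7, 9 and 12, so every leaf sits at the same depth.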
Advantages of B-Tree Index
▪ B-Tree Index speeds up data access
  – Storage engine traverses from root node to leaf node with the help of pointers
▪ Increases performance of the following query patterns:
  – Full Value (e.g. ‘London’, ‘Bristol’)
  – Leftmost Value or Column Prefix (e.g. ‘Lon’ from ‘London’, ‘Mary’ from ‘Mary Hwe’)
  – Range of Value (e.g. 1 to 99, Aaron to Fritz, Aaron to Kei%)
▪ B-Tree structure improves the performance of the ORDER BY clause
B+-Trees
• Data pointers stored only at the leaf nodes
  – Leaf nodes have an entry for every value of the search field, and a data pointer to the record if search field is a key field
  – For a nonkey search field, the pointer points to a block containing pointers to the data file records
• Internal nodes
  – Some search field values from the leaf nodes repeated to guide search
The nodes of a B+-tree (a) Internal node of a B+-tree with q−1 search values (b) Leaf node of a B+-tree with q−1 search values and q−1 data pointers
Difference between B-tree and B+-tree
• In a B-tree, pointers to data records exist at all
levels of the tree
▪ B-Tree
▪ Clustered Index
▪ Secondary Index
Clustered Index B-Tree
[Figure: clustered index B-tree, with a Root Node above the Leaf Nodes]
Building Clustered B-Tree Index
Page 0:1: Row1 to Row10
Building Clustered B-Tree Index
Page 0:1: Row1 to Row10
Page 0:2: Row11 to Row20
Building Clustered B-Tree Index
▪ Assumption: Each page contains 10 rows
Page 1:1 (above pages 0:1 and 0:2)
Page 0:1: Row1 to Row10
Page 0:2: Row11 to Row20
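The build steps can be sketched as follows, keeping the slide's assumption of 10 rows per page and its page labels 0:1, 0:2 and 1:1. What the upper page stores is an assumption here: one (first key on page, page id) entry per leaf page. Rows are represented by their integer keys, and the function names are hypothetical.

```python
ROWS_PER_PAGE = 10   # the slide's assumption


def build_leaf_pages(rows):
    """Pack rows, already sorted on the clustering key, into pages 0:1, 0:2, ..."""
    pages = {}
    for i in range(0, len(rows), ROWS_PER_PAGE):
        page_id = f"0:{i // ROWS_PER_PAGE + 1}"
        pages[page_id] = rows[i:i + ROWS_PER_PAGE]
    return pages


def build_root(pages):
    """Assumed content of page 1:1: (first key on page, page id) per leaf page."""
    return [(page_rows[0], page_id) for page_id, page_rows in pages.items()]


def find(root, pages, key):
    """Pick the last root entry whose first key is <= key, then scan that page."""
    target = None
    for first_key, page_id in root:
        if first_key <= key:
            target = page_id
    if target is None or key not in pages[target]:
        return None
    return target


rows = list(range(1, 21))          # Row1..Row20, identified by key
pages = build_leaf_pages(rows)
root = build_root(pages)
```

Here `find(root, pages, 14)` returns "0:2": one read of the upper page decides which leaf page to fetch.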
Traversing Clustered B-Tree Index
[Figure: traversal of the clustered index B-tree from the Root Node to a Leaf Node]
Building Secondary Index B-Tree
[Figure: secondary index B-tree, with a Root Node above the Leaf Nodes]
FirstName  LastName
Aaron      Skonnard
Fritz      Onion
Keith      Brown
Mike       Woodring
Jeff       Ross
Megan      Russell

SELECT FirstName, LastName
FROM TableName
WHERE FirstName = 'Jeff'
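A rough sketch of how a secondary index on FirstName could answer the query: the index is a sorted list of (FirstName, row position) pairs kept separately from the table, which stays in its original order. A sorted list stands in for the B-tree, and the names `secondary_index` and `select_by_first_name` are hypothetical.

```python
from bisect import bisect_left

# Table rows in storage order: (FirstName, LastName).
table = [
    ("Aaron", "Skonnard"),
    ("Fritz", "Onion"),
    ("Keith", "Brown"),
    ("Mike", "Woodring"),
    ("Jeff", "Ross"),
    ("Megan", "Russell"),
]

# Secondary index: (FirstName, row position) pairs, sorted by FirstName.
secondary_index = sorted((first, pos) for pos, (first, _) in enumerate(table))


def select_by_first_name(name):
    """Binary search the index, then follow each pointer back to the table."""
    i = bisect_left(secondary_index, (name, -1))
    result = []
    while i < len(secondary_index) and secondary_index[i][0] == name:
        result.append(table[secondary_index[i][1]])
        i += 1
    return result
```

`select_by_first_name("Jeff")` returns the single row for Jeff Ross without scanning the whole table.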
Building Hash Index
SELECT Value
FROM TableName
WHERE HashFn() = 4786