
Index and Hashing 2017 Combined

Indexing and hashing are techniques used to improve data retrieval efficiency in databases. Indexes are data structures that allow the database management system to locate records more quickly. There are two main types of indexes: ordered indexes where search keys are stored in sorted order, and hash indexes where keys are distributed uniformly using a hash function. Files can be organized using different indexing techniques like primary, secondary, sparse, and dense indexes to support efficient querying of data.

Uploaded by

munawar

Indexing and Hashing

Week 4
National College of Ireland
Dublin, Ireland.
Indexing and Hashing

• Introduction to Indexing
• Basic Concepts
• Ordered Indices
• Static Hashing
• Dynamic Hashing
• Comparison of Ordered Indexing and Hashing
• Index Definition in SQL
• Tree Structures
Introduction to Indexing
• Computer memory (RAM or ROM) is significantly faster than HDD.

• SSDs are faster than HDDs but have not yet eliminated this difference.

• Databases are stored on HDDs as files.

• Files are stored as collections of BLOCKS on HDD.

• Many blocks can be read into a PAGE or SEGMENT in the RAM.

• For simplicity, we estimate the cost of an operation by counting the number of
blocks that are read from or written to hard disk.

• Possibility of blocked access could significantly lower the cost of I/O.

• But generally, we assume that each relation (table) is stored in a separate file
with B blocks and R records per block.
File Organizations
• Technique for physically arranging records of a file on secondary
storage
• Factors for selecting file organization:
▪ Fast data retrieval and throughput
▪ Efficient storage space utilization
▪ Protection from failure and data loss
▪ Minimizing need for reorganization
▪ Accommodating growth
▪ Security from unauthorized use

• Types of file organizations


i. Heap – no particular order
ii. Sequential
iii. Indexed
iv. Hashed
Sequential file organization
• Records of the file are stored in sequence by the primary key field values.
• Sequential storage: average time to find desired record = log2(n)
• If this were a heap: average time to find desired record = n/2
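
The two averages above can be compared with a minimal Python sketch. The function names are hypothetical, and "time" is counted in records examined, which is an assumption made here for illustration.

```python
import math

def avg_heap_probes(n: int) -> float:
    """Average records examined by a linear scan of an unordered (heap) file."""
    return n / 2

def avg_sequential_probes(n: int) -> float:
    """Approximate records examined by a binary search on a file sorted by key."""
    return math.log2(n)

# Compare the two estimates for a small and a large file.
for n in (1_024, 1_048_576):
    print(n, avg_heap_probes(n), avg_sequential_probes(n))
```

The gap widens quickly: doubling n doubles the heap estimate but adds only one probe to the sorted-file estimate.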
Indexed File Organizations
• Storage of records sequentially or
nonsequentially with an index that
allows software to locate individual
records

• Index: a table or other data structure used to determine in a file the location
of records that satisfy some condition

• Primary keys are automatically indexed

Figure note: uses a tree search. Average time to find desired record is based
on depth of the tree and length of the list.
• Other fields or combinations of fields
can also be indexed; these are called
secondary keys (or nonunique keys)
Introduction to Indexing
• The use of indexes makes the retrieval of data more efficient.
– An index is a data structure that allows the DBMS to
locate particular records in a file more quickly, and
thereby speed up response to user queries.

• Heap File
– No specific structure
– Uses sequential (linear) access to records
• Hash File
– Uses a hash function of a set of hash fields
– Allows direct access if hash field values are known
Introduction to Indexing
• Indexes
– An index access structure is associated with a particular search key
and contains records consisting of the key value and the address of
the logical record in the data file containing the key value
  • Data File: the file containing the logical records
  • Index File: the file containing the index records
– Values in the index are usually sorted (ordered) according to the
indexing field, which is usually based on a single attribute.
– When an index is ordered, we can perform an efficient binary
search on the index.
– NOTE: a file can have more than one index.
Introduction to Indexing
• Indexing mechanisms used to speed up access to desired data.
– e.g., author catalog in library
• Search Key – An attribute or set of attributes used to look
up records in a file.
• An index file consists of records (called index entries) of the form
(search-key, pointer)

• Index files are typically much smaller than the original file
• Two basic kinds of indices:
i. Ordered indices: search keys are stored in sorted order
ii. Hash indices: search keys are distributed uniformly across
“buckets” using a “hash function”.
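
As a minimal sketch of an ordered index, the Python below keeps (search-key, pointer) entries in sorted order and binary-searches them. The small data file and the `lookup` helper are hypothetical, and a "pointer" is assumed to be simply a position in a list.

```python
from bisect import bisect_left

# Hypothetical unordered data file: position -> record (Sno, Lname).
data_file = [("SG37", "Black"), ("SG14", "White"), ("SL21", "Murphy"), ("SG21", "Black")]

# Ordered index on Sno: (search-key, pointer) entries kept in sorted order.
index = sorted((rec[0], pos) for pos, rec in enumerate(data_file))
keys = [k for k, _ in index]

def lookup(sno):
    """Binary-search the index, then follow the pointer into the data file."""
    i = bisect_left(keys, sno)
    if i < len(keys) and keys[i] == sno:
        return data_file[index[i][1]]
    return None

print(lookup("SL21"))
```

The data file itself stays unordered; only the much smaller index is sorted.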
Files without an Index

• Ordered File
• Consider the following example of Staff table:

Sno   Lname   Position   NIN       Bno
SG14  White   Manager    WK4416    B5
SG21  Black   Snr Asst   WL7868    B3
SG24  Ford    Deputy     WL87678   B3
SG36  Brown   Assistant  WF765675  B4
SG37  Black   Assistant  WD7867    B4
SL20  Red     Manager    WG786     B5
SL21  Murphy  Assistant  WF766675  B4
SL37  Whyte   Deputy     WD78167   B3
SL66  Blue    Manager    WG7816    B5

Operations to consider:
• Find Sno = SG36
• Find Sno from SG36 to SL37
• Find Lname of Murphy
• Find Lname of Black
• Find all Managers
• Insert new record
• Delete record
Types of Index
i. Primary Indexes
ii. Clustering Indexes
iii. Secondary Indexes
iv. Multilevel Indexes

Primary Indexes
• If the data file is sequentially
ordered and the indexing field
is a key field of the file (i.e., it
is guaranteed to have a unique
value in each record) then the
index is called a primary
index.
Figure: Primary index on the ordering key field of the file
Clustering Index
• If the data file is sequentially
ordered on a non-key field (values
may be repeated) and the indexing
field corresponds to this non-key
field, the index is called a
clustering index. In SQL Server, an
additional hidden column is added to
make the key unique. A clustered index
is the most common type of table
organization.
Figure: A clustering index on the Dept_number ordering nonkey field of an EMPLOYEE file
Secondary Index

• A Secondary Index is a
data structure that contains a
subset of attributes from a
table, along with an alternate
key to support query operations.

Figure: Dense secondary index (with block pointers) on a nonordering
key field of a file.
Unique and Nonunique Indexes
• Unique (primary) Index
– Typically done for primary keys, but could also apply
to other unique fields

• Nonunique (secondary) index


– Done for fields that are often used to group individual
entities (e.g. zip code, product category)
Single-level Ordered Indexes
• A file can have at most one primary index or one clustering
index, but not both.

• A file can have several secondary indexes


• Secondary indexes do not affect the physical organisation of
records.

• Further, an index can be sparse or dense


• A sparse index has an index record for only some of the search
key values in the file.
• A dense index has an index record for every search key value
in the file.
Sparse Index Files
• Sparse index – index record appears for only some of the
search key values in the file.
• In a sparse index, index records are not created for every search key.
• An index record here contains a search key and an actual pointer to the data on the disk.
• To search for a record, we first follow the index record to reach the actual location of the data.
• If the data we are looking for is not where we directly reach by following the index, then the
system searches sequentially until the desired data is found.

Sparse index    Data file
Brighton     -> Brighton    A-217  750
                Downtown    A-101  500
                Downtown    A-110  600
Miami        -> Miami       A-215  700
                Perryridge  A-102  400
                Perryridge  A-201  900
                Perryridge  A-218  700
Redwood      -> Redwood     A-222  700
                Round Hill  A-305  350
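
The sparse-index search described above can be sketched in Python as follows. This is an illustration under assumptions: the data file is an in-memory list sorted on branch name, a "pointer" is a list position, and the `find` helper is hypothetical.

```python
from bisect import bisect_right

# Data file sorted on branch name (records from the slide).
data_file = [
    ("Brighton", "A-217", 750), ("Downtown", "A-101", 500), ("Downtown", "A-110", 600),
    ("Miami", "A-215", 700), ("Perryridge", "A-102", 400), ("Perryridge", "A-201", 900),
    ("Perryridge", "A-218", 700), ("Redwood", "A-222", 700), ("Round Hill", "A-305", 350),
]
# Sparse index: entries for only some keys; the pointer is a position in data_file.
sparse_index = [("Brighton", 0), ("Miami", 3), ("Redwood", 7)]
index_keys = [k for k, _ in sparse_index]

def find(branch):
    """Return all records for `branch`: index jump first, then a sequential scan."""
    i = bisect_right(index_keys, branch) - 1   # largest index key <= branch
    if i < 0:
        return []
    pos = sparse_index[i][1]
    result = []
    # Scan forward until the keys in the sorted file pass the one we want.
    while pos < len(data_file) and data_file[pos][0] <= branch:
        if data_file[pos][0] == branch:
            result.append(data_file[pos])
        pos += 1
    return result

print(find("Perryridge"))
```

Looking up Perryridge starts at the Miami entry, because Miami is the largest indexed key not greater than Perryridge, and scans forward from there.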
Dense Index Files

• Dense index – index record appears for every search-key value in the file.

Dense index     Data file
Brighton     -> Brighton    A-217  750
Downtown     -> Downtown    A-101  500
                Downtown    A-110  600
Miami        -> Miami       A-215  700
Perryridge   -> Perryridge  A-102  400
                Perryridge  A-201  900
                Perryridge  A-218  700
Redwood      -> Redwood     A-222  700
Round Hill   -> Round Hill  A-305  350
• e.g., index on ID attribute of instructor relation
Single-level Ordered Indexes
Primary Index Performance
• The index file requires significantly fewer blocks than the data
file, because:
i. The index is sparse
ii. An index file record is typically smaller in size than a data file record
• A binary search on the index file requires fewer block accesses
than a binary search on the data file
• Insertion and deletion of records is problematic
– Not only do we have to move records in the data file, we also have to change
some index entries
• Storage overhead is not a serious problem
Single-level Ordered Indexes

• Primary Indexes
• Example:

Index file:
Primary Key Value   Block Address
9666162             Block 1
9684535             Block 2
9716262             Block 3

Data file (columns: sId, Level, Address; only sId values shown):
Block 1: 9666162, 9667145, 9674545
Block 2: 9684535, 9695352, 9706363
Block 3: 9716262, 9723437, 9733255

• The first record in each block of the data file is called the anchor record
• Total number of entries in the index is the same as the number
of disk blocks in the data file
Single-level Ordered Indexes
Clustering Index (Example)

Index file:
Clustering Field Value   Block Address
0                        Block 1
1                        Block 1
2                        Block 2
3                        Block 3

Data file (columns: Level, sId, Address; only Level values shown):
Block 1: 0, 0, 1
Block 2: 1, 1, 2
Block 3: 2, 3, 3
Single-level Ordered Indexes
Secondary Index Performance
• A secondary index is built on a non-ordering field of the data file
• The index file is itself another sorted file whose records are of fixed or
variable length consisting of two fields
– The first field is of the same data type as the indexing field of the data file
– The second field is a pointer to a disk block or a record

Index entry layout: (Indexing Field Value, Block Address / Record Pointer)

• We can consider two cases
i. The index access structure is constructed on a key field
ii. The index access structure is constructed on a non-key field
Secondary Indices Example
Secondary index on salary field of instructor

• Index record points to a bucket that contains pointers to all the
actual records with that particular search-key value.
• Secondary indices have to be dense
Single-level Ordered Indexes
(Summary)
Index Type | Number of Index Entries | Dense / Sparse | Use Block Anchor
-----------|-------------------------|----------------|-----------------
Primary | Equal to the number of blocks in the data file | Sparse | Yes
Clustering | Equal to the number of distinct indexing field values | Sparse | Yes if separate blocks are used for records with different indexing field values. No otherwise
Secondary on a key field | Equal to number of records in the data file | Dense | No
Secondary on a non-key field | Equal to the number of records for option 1. Equal to the number of distinct indexing field values for options 2 and 3 | Dense for option 1. Sparse for options 2 and 3 | No
Multi-level Indexes
• When an index file becomes large and extends over many
pages, the search time for the required index increases

• A binary search requires approximately log2(p) page accesses
for an index with p pages.

• A multi-level index attempts to overcome this problem by
reducing the search range

‒ Treat the index like any other file

‒ Split the index into a number of smaller indexes

‒ Maintain an index to the indexes


Multi-level Indexes
Multi-level Indexes Performance
• Search performance is increased when searching for a record based
on a specified indexing field value

• Problems with insertions and deletions still evident

• To retain the benefits of using multi-level indexing while reducing
index insertion and deletion problems, an approach is taken to
adopt a multi-level structure that leaves some space in each of its
blocks for inserting new entries.

• This is called a dynamic multi-level index and is often
implemented using data structures called B-trees and B+-trees.
Multi-level Indexes

• If an index does not fit in memory, access becomes expensive.

• To reduce number of disk accesses to index records, treat the index
kept on disk as a sequential file and construct a sparse index on it.

• Outer index – a sparse index on main index

• Inner index – the main index file

• If even outer index is too large to fit in main memory, yet another
level of index can be created, and so on.

• Indices at all levels must be updated on insertion or deletion from
the file.
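
A two-level lookup of this kind might be sketched in Python as below. The block size, the synthetic keys and record pointers, and the `lookup` helper are all assumptions made for illustration.

```python
from bisect import bisect_right

BLOCK_SIZE = 3  # assumed number of entries per index block

# Inner index: sorted (search-key, record pointer) entries, split into blocks.
inner_entries = [(k, f"rec{k}") for k in range(10, 100, 10)]  # keys 10, 20, ..., 90
inner_blocks = [inner_entries[i:i + BLOCK_SIZE]
                for i in range(0, len(inner_entries), BLOCK_SIZE)]

# Outer index: sparse, one (first key in block, block number) entry per inner block.
outer_index = [(block[0][0], n) for n, block in enumerate(inner_blocks)]
outer_keys = [k for k, _ in outer_index]

def lookup(key):
    """Search the outer index first, then read only one block of the inner index."""
    i = bisect_right(outer_keys, key) - 1
    if i < 0:
        return None
    for k, ptr in inner_blocks[outer_index[i][1]]:
        if k == key:
            return ptr
    return None

print(lookup(50))
```

Only the outer index needs to stay in memory; each lookup then touches a single inner-index block.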
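Multi-level Indexes
[Figure: two-level index. The outer index points to the blocks of the inner index (Index Block 0, Index Block 1, ...), and the inner index points to the data blocks (Data Block 0, Data Block 1, ...).]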
Index Evaluation Metrics
Index Update: Deletion
• If the deleted record was the only record in the file with its
particular search-key value, the search-key is deleted
from the index also.
Index Update: Insertion
Single-level index insertion:
• Perform a lookup using the search-key value appearing in the
record to be inserted.

• Dense indices – if the search-key value does not appear in
the index, insert it.

• Sparse indices – if the index stores an entry for each block
of the file, no change needs to be made to the index unless a
new block is created. In this case, the first search-key value
appearing in the new block is inserted into the index.

• Multilevel insertion (as well as deletion) algorithms are
simple extensions of the single-level algorithms
Introduction to Hashing
• Hashing is the transformation
of a string of characters into a
usually shorter fixed-length
value or key that represents the
original string.

• Hashing is used to index and retrieve items in a database
because it is faster to find the item using the shorter hashed key
than to find it using the original value.

• It is also used in many encryption algorithms.


Static Hashing
• Hashing is an effective technique to calculate the direct location of a data
record on the disk without using an index structure.

• In static hashing, when a search-key value
is provided, the hash function always
computes the same address.

• For example, if a mod-4 hash function is
used, then it generates only 4 values (0 to 3).
The output address is always the same
for a given key under that function.

• The number of buckets provided remains unchanged at all times.

• A bucket is a unit of storage containing one or more records (a bucket is
typically a disk block).
Operations
(Access, Insertion and Deletion)

• Insertion: When a record is to be entered using static
hashing, the hash function h computes the bucket address for search key
K, where the record will be stored.

• Bucket address = h(K)

• Search: When a record needs to be retrieved, the same hash
function can be used to retrieve the address of the bucket where the
data is stored.

• Delete: This is simply a search followed by a deletion operation.
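
The three operations can be sketched in Python with a mod-4 hash function. This assumes integer search keys and in-memory lists standing in for disk buckets; the function names are hypothetical.

```python
NUM_BUCKETS = 4  # fixed for the lifetime of the file (static hashing)

def h(key: int) -> int:
    """mod-4 hash function: maps a key to the same bucket address (0..3) every time."""
    return key % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, record):
    """Store the record in the bucket at address h(key)."""
    buckets[h(key)].append((key, record))

def search(key):
    """Recompute h(key) and look only in that bucket."""
    return [rec for k, rec in buckets[h(key)] if k == key]

def delete(key):
    """A search followed by removal of the matching records; returns the count."""
    bucket = buckets[h(key)]
    kept = [item for item in bucket if item[0] != key]
    removed = len(bucket) - len(kept)
    bucket[:] = kept
    return removed

insert(10, "a")
insert(14, "b")
print(h(10), h(14), search(14))
```

Keys 10 and 14 both land in bucket 2, which is why `search` still has to compare keys inside the bucket.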
Hash Functions
• The worst hash function maps all search-key values to the same bucket; this
makes access time proportional to the number of search-key values in the file.
• An ideal hash function is:
i. Uniform: Each bucket is assigned the same number of search-key
values from the set of all possible values.
ii. Random: Each bucket will have the same number of records assigned
to it irrespective of the actual distribution of search-key values in the
file.
• Typical hash functions perform computation on the internal binary
representation of the search-key.
– For example, for a string search-key, the binary representations of all
the characters in the string could be added and the sum modulo the number
of buckets could be returned.
Example of
Hash File Organization
• Hash file organization of account file, using branch-name as key.
(See the bucket layout below.)

• There are 10 buckets.

• The binary representation of the ith letter of the alphabet is assumed to be the
integer i.

• The hash function returns the sum of the binary representations
of the characters modulo 10.

bucket 0: (empty)
bucket 1: (empty)
bucket 2: (empty)
bucket 3: Brighton A-217 750; Round Hill A-305 350
bucket 4: Redwood A-222 700
bucket 5: Perryridge A-102 400; Perryridge A-201 900; Perryridge A-218 700; Miami A-215 700
bucket 6: (empty)
bucket 7: (empty)
bucket 8: Downtown A-101 500; Downtown A-110 600
bucket 9: (empty)
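
The hash function described above can be reproduced with a short Python sketch. It assumes that letter case is ignored and that non-letters such as the space in "Round Hill" contribute nothing; the function name `h` is only illustrative.

```python
NUM_BUCKETS = 10

def h(branch_name: str) -> int:
    """Sum the alphabet positions of the letters (a=1 ... z=26), modulo 10."""
    total = sum(ord(c) - ord("a") + 1 for c in branch_name.lower() if c.isalpha())
    return total % NUM_BUCKETS

for name in ["Perryridge", "Miami", "Brighton", "Round Hill", "Redwood", "Downtown"]:
    print(name, "-> bucket", h(name))
```

Under these assumptions Perryridge and Miami both hash to bucket 5, and Brighton and Round Hill both hash to bucket 3, so distinct keys can share a bucket.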
Handling of
Bucket Overflows
• Although the probability of
bucket overflow can be reduced,
it cannot be eliminated; it is
handled by using overflow
buckets.

• Overflow chaining: the overflow buckets of a given bucket
are chained together in a linked list

• Above scheme is called closed hashing. An alternative, called
open hashing, is not suitable for database applications.
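
Overflow chaining might look like the following Python sketch. The bucket capacity of 2, the mod-4 hash function, and the class and helper names are assumptions chosen to keep the example small.

```python
BUCKET_CAPACITY = 2  # assumed number of records that fit in one bucket (disk block)
NUM_BUCKETS = 4

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None  # next bucket in this bucket's overflow chain

def h(key: int) -> int:
    return key % NUM_BUCKETS

table = [Bucket() for _ in range(NUM_BUCKETS)]

def insert(key, record):
    """Walk the chain to the first bucket with free space, extending it if needed."""
    bucket = table[h(key)]
    while len(bucket.records) >= BUCKET_CAPACITY:
        if bucket.overflow is None:
            bucket.overflow = Bucket()
        bucket = bucket.overflow
    bucket.records.append((key, record))

def search(key):
    """Check the primary bucket and every overflow bucket chained to it."""
    bucket, found = table[h(key)], []
    while bucket is not None:
        found.extend(rec for k, rec in bucket.records if k == key)
        bucket = bucket.overflow
    return found

def chain_length(i):
    """Number of buckets (primary plus overflow) reachable from table slot i."""
    n, bucket = 0, table[i]
    while bucket is not None:
        n, bucket = n + 1, bucket.overflow
    return n

for k in (1, 5, 9, 13, 17):   # all five keys hash to bucket 1
    insert(k, f"r{k}")
print(chain_length(1), search(17))
```

Each extra bucket in the chain is one more block to read during a search, which is why long chains degrade performance.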
Hash Indices

• Hashing can be used not only for file organization, but also
for index-structure creation. A hash index organizes the
search keys, with their associated record pointers, into a
hash file structure.

• Hash indices are always secondary indices: if the file itself
is organized using hashing, a separate primary hash index
on it using the same search-key is unnecessary.

• However, the term hash index is used to refer to both
secondary index structures and hash organized files.
Example of Hash Indices

Hash index on account number (each entry points to a record in the data file):
bucket 0: (empty)
bucket 1: A-215, A-305
bucket 2: A-101, A-110
bucket 3: A-217, A-102; overflow bucket: A-201
bucket 4: A-218
bucket 5: (empty)
bucket 6: A-222

Data file:
Brighton    A-217  750
Downtown    A-101  500
Downtown    A-110  600
Miami       A-215  700
Perryridge  A-102  400
Perryridge  A-201  900
Perryridge  A-218  700
Redwood     A-222  700
Round Hill  A-305  350
Deficiencies of Static Hashing
• In static hashing, function (h) maps search-key values to a fixed
set (B) of bucket addresses.
• Databases grow with time. If the initial number of buckets is too small,
performance will degrade due to too many overflows.

• If file size at some point in the future is anticipated and the number of buckets
allocated accordingly, a significant amount of space will be wasted initially.

• If the database shrinks, again space will be wasted.

• One option is periodic reorganization of the file with a new hash function,
but it is very expensive.

• These problems can be avoided by using techniques that allow
the number of buckets to be modified dynamically.
Dynamic Hashing
• Good for a database that grows and shrinks rapidly in size
• Allows the hash function to be modified dynamically
• Extendable hashing – one form of dynamic hashing
• Hash function generates values over a large range, typically b-bit integers,
with b = 32.
• At any time use only the last i bits of the hash function to index into a table of
bucket addresses, where:
  • 0 ≤ i ≤ 32
  • Initially i = 0
  • Value of i grows and shrinks as the size of the database grows and shrinks.
• Actual number of buckets is ≤ 2^i, and this also changes dynamically due to
merging and splitting of buckets.
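
A simplified sketch of extendable hashing in Python is shown below. It covers only insertion and bucket splitting (no merging or shrinking), uses Python's built-in `hash` as a stand-in for the b-bit hash function, and assumes a bucket capacity of 2; the class names are hypothetical.

```python
BUCKET_CAPACITY = 2  # assumed records per bucket

class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of hash bits this bucket actually uses
        self.items = {}      # key -> record

class ExtendableHash:
    def __init__(self):
        self.i = 0                  # number of hash bits currently in use
        self.table = [Bucket(0)]    # bucket address table, size 2**i

    def _slot(self, key):
        return hash(key) & ((1 << self.i) - 1)   # last i bits of the hash value

    def get(self, key):
        return self.table[self._slot(key)].items.get(key)

    def put(self, key, record):
        bucket = self.table[self._slot(key)]
        if key in bucket.items or len(bucket.items) < BUCKET_CAPACITY:
            bucket.items[key] = record
            return
        if bucket.depth == self.i:   # address table too small: double it
            self.table += self.table
            self.i += 1
        # Split the full bucket on one more bit and redistribute its records.
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        high_bit = 1 << (bucket.depth - 1)
        for slot, b in enumerate(self.table):
            if b is bucket and slot & high_bit:
                self.table[slot] = new_bucket
        for k in list(bucket.items):
            if hash(k) & high_bit:
                new_bucket.items[k] = bucket.items.pop(k)
        self.put(key, record)        # retry; the target bucket may split again

    def num_buckets(self):
        return len({id(b) for b in self.table})

eh = ExtendableHash()
for k in range(8):
    eh.put(k, f"r{k}")
print(eh.i, len(eh.table), eh.num_buckets())
```

Several table slots can share one bucket, so the number of distinct buckets never exceeds the table size 2^i. Keys with identical hash values beyond the bucket capacity would need overflow buckets, which this sketch leaves out.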
Use of Hash Structure Example

Comparison of
Ordered Index and Hashing
• Issues to consider:
▪ Cost of periodic re-organization
▪ Relative frequency of insertions and deletions
▪ Is it desirable to optimize average access time at the expense
of worst-case access time?
▪ Expected type of queries:
  ▪ Hashing is generally better at retrieving records having a
  specified value of the key.
  ▪ If range queries are common, ordered indices are to be
  preferred
Comparison of
Different File Organizations
B-Tree Index
• Provide multi-level access structure
• Tree is always balanced
• Space wasted by deletion never becomes excessive
– Each node is at least half-full
• Each node in a B-tree of order p can have at most p-1 search values

Figure: B-tree structures (a) A node in a B-tree with q−1 search values
(b) A B-tree of order p=3. The values were inserted in the order 8, 5, 1, 7, 3, 12, 9, 6
Advantages of B-Tree Index
▪ B-Tree Index speeds up data access
– Storage engine traverses from root node to leaf node with the
help of pointers
▪ Increases performance of the following query patterns:
– Full Value (e.g. ‘London’, ‘Bristol’)
– Leftmost Value or Column Prefix (e.g. ‘Lon’ from ‘London’,
‘Mary’ from ‘Mary Hwe’)
– Range of Value (e.g. 1 to 99, Aaron to Fritz, Aaron to Kei%)
▪ B-Tree structure helps improve the performance of
ORDER BY clauses
B+-Trees
• Data pointers stored only at the leaf nodes
– Leaf nodes have an entry for every value of the search
field, and a data pointer to the record if search field is a
key field
– For a nonkey search field, the pointer points to a block
containing pointers to the data file records
• Internal nodes
– Some search field values from the leaf nodes repeated
to guide search

Figure: The nodes of a B+-tree (a) Internal node of a B+-tree with q−1 search values
(b) Leaf node of a B+-tree with q−1 search values and q−1 data pointers
Difference
between B-tree and B+-tree
• In a B-tree, pointers to data records exist at all
levels of the tree

• In a B+-tree, all pointers to data records exist at
the leaf-level nodes

• A B+-tree can have fewer levels (or higher capacity
of search values) than the corresponding B-tree
Visual Representations

▪ B-Tree

▪ Clustered Index

▪ Secondary Index

Clustered Index B-Tree

Root Node: 1 ...

Intermediate Node: 1-30 | 31 ...

Leaf Node: 1-10 | ... | 21-30 | ...
Building Clustered B-Tree Index

▪ Assumption: Each page contains 10 rows

Page 0:1: Row1, Row2, Row3, Row4, Row5, Row6, Row7, Row8, Row9, Row10
Building Clustered B-Tree Index

▪ Assumption: Each page contains 10 rows

Page 0:1: Row1, Row2, Row3, Row4, Row5, Row6, Row7, Row8, Row9, Row10
Page 0:2: Row11, Row12, Row13, Row14, Row15, Row16, Row17, Row18, Row19, Row20
Building Clustered B-Tree Index
▪ Assumption: Each page contains 10 rows

Page 1:1 (index page):
Row1-10 -> Page 0:1
Row11-20 -> Page 0:2

Page 0:1: Row1, Row2, Row3, Row4, Row5, Row6, Row7, Row8, Row9, Row10
Page 0:2: Row11, Row12, Row13, Row14, Row15, Row16, Row17, Row18, Row19, Row20
Traversing Clustered B-Tree Index

Root Node: 1 ...

Intermediate Node: 1-30 | 31 ...

Leaf Node: 1-10 | ... | 21-30 | ...

The leaf nodes contain the entire data, ordered by the key column.


Building Secondary Index B-Tree

Root Node: 1 ...

Intermediate Node: 1-30 | 31 ...

Leaf Node: 1-10 | ... | 21-30 | ...

Instead of data, the leaf nodes contain pointers to the clustered index.
Building Hash Index

FirstName  LastName
Aaron      Skonnard
Fritz      Onion
Keith      Brown
Mike       Woodring
Jeff       Ross
Megan      Russell

Query:
SELECT FirstName, LastName
FROM TableName
WHERE FirstName = 'Jeff'
Building Hash Index

FirstName  LastName  HashFn()  Value
Aaron      Skonnard  1254      Pointer to row 1
Fritz      Onion     5487      Pointer to row 2
Keith      Brown     6587      Pointer to row 3
Mike       Woodring  6842      Pointer to row 4
Jeff       Ross      4786      Pointer to row 5
Megan      Russell   9587      Pointer to row 6
Building Hash Index

Original query:
SELECT FirstName, LastName
FROM TableName
WHERE FirstName = 'Jeff'

HashFn('Jeff') = 4786

Lookup through the hash index:
SELECT Value
FROM TableName
WHERE HashFn() = 4786

HashFn()  Value             FirstName  LastName
1254      Pointer to row 1  Aaron      Skonnard
5487      Pointer to row 2  Fritz      Onion
6587      Pointer to row 3  Keith      Brown
6842      Pointer to row 4  Mike       Woodring
4786      Pointer to row 5  Jeff       Ross
9587      Pointer to row 6  Megan      Russell
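
The lookup above can be imitated with a small Python sketch. The `hash_fn` here is a hypothetical stand-in and does not reproduce the slide's values such as 4786; row "pointers" are assumed to be list positions.

```python
rows = [
    ("Aaron", "Skonnard"), ("Fritz", "Onion"), ("Keith", "Brown"),
    ("Mike", "Woodring"), ("Jeff", "Ross"), ("Megan", "Russell"),
]

def hash_fn(value: str) -> int:
    """Hypothetical stand-in for HashFn(); not the function behind the slide's values."""
    return sum(ord(c) for c in value) % 10_000

# Hash index: hash value -> list of row positions ("pointers") with that hash.
hash_index = {}
for pos, (first_name, _) in enumerate(rows):
    hash_index.setdefault(hash_fn(first_name), []).append(pos)

def select_by_first_name(name):
    """Hash the search value, follow the pointers, then re-check the column value."""
    candidates = hash_index.get(hash_fn(name), [])
    return [rows[pos] for pos in candidates if rows[pos][0] == name]

print(select_by_first_name("Jeff"))
```

The final comparison against the stored FirstName matters because two different names can produce the same hash value.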
Module Resources
Recommended Book Resources
• Thomas Connolly, Carolyn Begg 2014, Database Systems: A Practical Approach
to Design, Implementation, and Management, 6th Edition, Pearson
Education [ISBN: 1292061189] [Present in our Library]
Supplementary Book Resources
• Gordon S. Linoff, Data Analysis Using SQL and Excel, Wiley [ISBN:
0470099518]
• Eric Redmond, Jim Wilson, Seven Databases in Seven Weeks, Pragmatic
Bookshelf [ISBN: 1934356921]
• Baron Schwartz, Peter Zaitsev, Vadim Tkachenko, High Performance MySQL,
O'Reilly Media [ISBN: 1449314287]
Other Resources
• Website: http://www.thearling.com
• Website: http://www.mongodb.org
• Website: https://app.pluralsight.com
• Website: https://www.tutorialspoint.com/dbms/dbms_hashing.htm