Lecture #07: Hash Tables

15-445/645 Database Systems (Spring 2024)

https://15445.courses.cs.cmu.edu/spring2024/
Carnegie Mellon University
Jignesh Patel

1 Data Structures
A DBMS uses various data structures for many different parts of the system internals. Some examples
include:
• Internal Meta-Data: This is data that keeps track of information about the database and the system
state.
Ex: Page tables, page directories
• Core Data Storage: Data structures are used as the base storage for tuples in the database.
• Temporary Data Structures: The DBMS can build ephemeral data structures on the fly while
processing a query to speed up execution (e.g., hash tables for joins).
• Table Indices: Auxiliary data structures can be used to make it easier to find specific tuples.
There are two main design decisions to consider when implementing data structures for the DBMS:
1. Data organization: We need to figure out how to lay out memory and what information to store
inside the data structure in order to support efficient access.
2. Concurrency: We also need to think about how to enable multiple threads to access the data structure
without causing problems, ensuring that the data remains correct and sound.

2 Hash Table
A hash table implements an associative array abstract data type that maps keys to values. It provides
O(1) operation complexity on average (O(n) in the worst case) and O(n) storage complexity. Note that even
with O(1) operation complexity on average, there are constant-factor optimizations that are important
to consider in the real world.
A hash table implementation consists of two parts:
• Hash Function: This tells us how to map a large key space into a smaller domain. It is used to
compute an index into an array of buckets or slots. We need to consider the trade-off between
fast execution and collision rate. On one extreme, we have a hash function that always returns a
constant (very fast, but everything is a collision). On the other extreme, we have a “perfect” hash
function that produces no collisions but would take extremely long to compute. The ideal design
is somewhere in the middle.
• Hashing Scheme: This tells us how to handle key collisions after hashing. Here, we need to consider
the trade-off between allocating a large hash table to reduce collisions and having to execute
additional instructions when a collision occurs.
Spring 2024 – Lecture #07 Hash Tables

3 Hash Functions
A hash function takes in any key as its input. It then returns an integer representation of that key (i.e., the
“hash”). The function’s output is deterministic (i.e., the same key should always generate the same hash
output).
The DBMS need not use a cryptographically secure hash function (e.g., SHA-256) because we do not need
to worry about protecting the contents of keys. These hash functions are primarily used internally by the
DBMS and thus information is not leaked outside of the system. In general, we only care about the hash
function’s speed and collision rate.
At the time of this lecture (Spring 2024), the state-of-the-art hash function is Facebook XXHash3 (XXH3).
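
To make the "fast, deterministic, non-cryptographic" idea concrete, here is a minimal sketch of one such function. It uses 64-bit FNV-1a, which is not mentioned in the lecture and is chosen only because it fits in a few lines; it is not XXHash3. The helper name slot_for is hypothetical.

```python
# 64-bit FNV-1a: a simple non-cryptographic hash (illustrative choice, not XXHash3).
FNV_OFFSET_BASIS_64 = 0xCBF29CE484222325
FNV_PRIME_64 = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Return the 64-bit FNV-1a hash of data; the same input always gives the same output."""
    h = FNV_OFFSET_BASIS_64
    for byte in data:
        h ^= byte                          # mix in one byte
        h = (h * FNV_PRIME_64) & MASK_64   # multiply and keep the low 64 bits
    return h


def slot_for(key: bytes, num_slots: int) -> int:
    """Hypothetical helper: map a key to a slot index in a table of num_slots slots."""
    return fnv1a_64(key) % num_slots
```

The function does no work to resist adversarial inputs, which is acceptable under the lecture's assumption that hashes are only used inside the DBMS.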

4 Static Hashing Schemes

A static hashing scheme is one where the size of the hash table is fixed. This means that if the DBMS runs
out of storage space in the hash table, then it has to rebuild a larger hash table from scratch, which is very
expensive. Typically the new hash table is twice the size of the original hash table.
To reduce the number of wasteful comparisons, it is important to avoid collisions between hashed keys.
Typically, we use twice as many slots as the number of expected elements.
The following assumptions usually do not hold in reality:
1. The number of elements is known ahead of time.
2. Keys are unique.
3. There exists a perfect hash function.
Therefore, we need to choose the hash function and hashing scheme appropriately.

4.1 Linear Probe Hashing

This is the most basic hashing scheme. It is also typically the fastest. It uses a circular buffer of array slots.
The hash function maps keys to slots.
For insertions, when a collision occurs, we linearly search the subsequent slots until an open one is found,
looping around from the end to the start of the array if necessary. For lookups, we check the slot the
key hashes to and search linearly until we find the desired entry. If we reach an empty slot or have iterated
over every slot in the hash table, then the key is not in the table. Note that this means we have to store
both the key and the value in the slot so that we can check whether an entry is the desired one.
Deletions are trickier. We cannot simply remove the entry from its slot, because the resulting empty slot
would stop future lookups early and prevent them from finding entries that were placed after it. There
are two solutions to this problem:
• The most common approach is to use “tombstones”. Instead of deleting the entry, we replace it with
a “tombstone” entry which tells future lookups to keep scanning. Note that insertions are able to
insert into tombstone indices.
• The other option is to shift the adjacent data after deleting an entry to fill the now empty slot.
However, we must be careful to only move the entries which were originally shifted. This is rarely
implemented in practice as it is extremely expensive when we have a large number of keys.
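
The insert, lookup, and tombstone-delete behaviour described above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it assumes unique keys, a fixed capacity, and Python's built-in hash() as the hash function, and the class and method names are hypothetical.

```python
EMPTY = object()      # slot that has never held an entry
TOMBSTONE = object()  # slot whose entry was deleted; lookups must keep scanning


class LinearProbeTable:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        """Yield slot indices starting at the key's home slot, wrapping around once."""
        n = len(self.slots)
        start = hash(key) % n
        for i in range(n):
            yield (start + i) % n

    def put(self, key, value):
        first_free = None  # first tombstone or empty slot seen on the probe path
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                if first_free is None:
                    first_free = idx
                break  # key cannot be stored beyond an empty slot
            if slot is TOMBSTONE:
                if first_free is None:
                    first_free = idx  # insertions may reuse tombstones
                continue
            if slot[0] == key:
                self.slots[idx] = (key, value)  # overwrite existing key
                return
        if first_free is None:
            raise RuntimeError("table full; a static scheme would rebuild here")
        self.slots[first_free] = (key, value)

    def get(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                return None  # an empty slot ends the search
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
        return None  # scanned every slot without finding the key

    def delete(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                return False
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[idx] = TOMBSTONE  # mark instead of clearing
                return True
        return False
```

With integer keys 1, 9, and 17 in a table of capacity 8, all three share home slot 1. Deleting 9 leaves a tombstone in slot 2, so a lookup for 17 keeps scanning and still finds it in slot 3.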


Non-unique Keys: In the case where the same key may be associated with multiple different values or
tuples, there are two approaches.
• Separate Linked List: Instead of storing the values with the keys, we store a pointer to a separate
storage area that contains a linked list of all the values, which may overflow to multiple pages.
• Redundant Keys: The more common approach is to simply store the same key multiple times in the
table. Everything with linear probing still works even if we do this.
Optimizations: There are several ways to further optimize this hashing scheme:
• Specialized hash table implementations based on the data type or size of keys: These could differ
in the way they store data, perform splits, etc. For example, if we have string keys, we could store
smaller strings in the original hash table and only a pointer or hash for larger strings.
• Storing metadata in a separate array: An example would be to store empty slot/tombstone information
in a packed bitmap as a part of the page header or in a separate hash table, which would help
us avoid looking up deleted keys.
• Maintaining versions for the hash table and its slots: Since allocating memory for a hash table is
expensive, we may want to reuse the same memory repeatedly. To clear out the table and invalidate
its entries, we can increment the version counter of the table instead of marking each slot as
deleted/empty. A slot can be treated as empty if there is a mismatch between the slot version and table
version.
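
The version-counter optimization in the last bullet can be sketched as below. This is an illustrative assumption about one way to lay it out (a parallel array of per-slot versions); the class and method names are hypothetical.

```python
class VersionedSlots:
    """Slot array that can be cleared in O(1) by bumping a table-wide version."""

    def __init__(self, capacity):
        self.table_version = 0
        self.slot_version = [-1] * capacity  # -1 never matches a table version
        self.entries = [None] * capacity

    def is_empty(self, idx):
        # A slot counts as empty when its version does not match the table's.
        return self.slot_version[idx] != self.table_version

    def write(self, idx, entry):
        self.entries[idx] = entry
        self.slot_version[idx] = self.table_version  # stamp with current version

    def clear(self):
        # Invalidate every slot at once without touching the slot memory.
        self.table_version += 1
```

After clear(), stale entries remain in memory but are ignored until they are overwritten, which is what lets the same allocation be reused.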
Google’s absl::flat_hash_map is a state-of-the-art implementation of Linear Probe Hashing.

4.2 Cuckoo Hashing

Instead of using a single hash table, this approach maintains multiple hash tables with different hash
functions. The hash functions are the same algorithm (e.g., XXHash, CityHash); they generate different hashes
for the same key by using different seed values.
When we insert, we check every table and choose one that has a free slot (if multiple have one, we can
compare things like load factor, or more commonly, just choose a random table). If no table has a free slot,
we choose one of the candidate slots (typically at random) and evict its old entry. We then rehash the old
entry into a different table. In rare cases, we may end up in a cycle. If this happens, we can rebuild all of
the hash tables with new hash function seeds (less common) or rebuild the hash tables using larger tables
(more common).
Cuckoo hashing guarantees O(1) lookups and deletions, but insertions may be more expensive.
Professor’s note: The essence of cuckoo hashing is that multiple hash functions map a key to different
slots. In practice, cuckoo hashing is implemented with multiple hash functions that map a key to different
slots in a single hash table. Further, as hashing may not always be O(1), cuckoo hashing lookups and
deletions may cost more than O(1).
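
A minimal sketch of the multi-table variant described above follows. Several details are assumptions made for illustration: two tables, seeding done by hashing the pair (seed, key) with Python's built-in hash(), a fixed eviction limit (max_kicks) as the cycle detector, alternating eviction instead of random choice, and rebuilding with doubled capacity when the limit is hit. All names are hypothetical.

```python
class CuckooTable:
    def __init__(self, capacity=8, seeds=(1, 2), max_kicks=16):
        self.capacity = capacity
        self.seeds = seeds
        self.max_kicks = max_kicks  # assumed cycle detector: give up after this many evictions
        self.tables = [[None] * capacity for _ in seeds]

    def _slot(self, t, key):
        # Same hash algorithm for every table; only the seed differs.
        return hash((self.seeds[t], key)) % self.capacity

    def get(self, key):
        # A lookup inspects exactly one slot per table.
        for t in range(len(self.tables)):
            entry = self.tables[t][self._slot(t, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value):
        # Overwrite in place if the key is already stored.
        for t in range(len(self.tables)):
            idx = self._slot(t, key)
            entry = self.tables[t][idx]
            if entry is not None and entry[0] == key:
                self.tables[t][idx] = (key, value)
                return
        entry = (key, value)
        t = 0
        for _ in range(self.max_kicks):
            # Use any table whose candidate slot is free.
            for u in range(len(self.tables)):
                idx = self._slot(u, entry[0])
                if self.tables[u][idx] is None:
                    self.tables[u][idx] = entry
                    return
            # All candidate slots are taken: evict from table t and carry the victim.
            idx = self._slot(t, entry[0])
            entry, self.tables[t][idx] = self.tables[t][idx], entry
            t = (t + 1) % len(self.tables)
        self._rebuild(entry)  # probable cycle

    def _rebuild(self, pending):
        # Rebuild with larger tables, then reinsert everything plus the homeless entry.
        old = [e for table in self.tables for e in table if e is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity for _ in self.seeds]
        for k, v in old + [pending]:
            self.put(k, v)
```

Note that get never probes more than one slot per table, which is where the constant-time lookup bound comes from; all the variable cost sits in put.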

5 Dynamic Hashing Schemes

The static hashing schemes require the DBMS to know the number of elements it wants to store. Otherwise
it has to rebuild the table if it needs to grow/shrink in size.
Dynamic hashing schemes are able to resize the hash table on demand without needing to rebuild the
entire table. The schemes perform this resizing in different ways that optimize for either reads or writes.


5.1 Chained Hashing

This is the most common dynamic hashing scheme. The DBMS maintains a linked list of buckets for each
slot in the hash table. Keys which hash to the same slot are simply inserted into the linked list for that slot.
To look up an element, we hash to its bucket and then scan for it.
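
As a rough illustration of the scheme just described, the sketch below keeps one chain per slot. It assumes unique keys and uses Python lists to stand in for the linked lists of buckets; the names are hypothetical.

```python
class ChainedHashTable:
    def __init__(self, num_slots=8):
        # One chain per slot; a Python list stands in for the linked list of buckets.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # overwrite existing key
                return
        chain.append((key, value))       # colliding keys simply extend the chain

    def get(self, key):
        # Hash to the slot, then scan its chain.
        for k, v in self._chain(key):
            if k == key:
                return v
        return None
```

The table never fills up, but a chain can grow without bound, so lookup cost depends on chain length.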
Bloom Filters: Lookups can be optimized by additionally storing Bloom filters in the bucket pointer list.
A filter can tell us that a key does not exist in the linked list, which lets us avoid the cost of scanning
the list in such cases. A Bloom filter is a probabilistic data structure, a bitmap, that answers set
membership queries. It allows false positives but never false negatives.
Assume we use bitmaps of size n and we employ k hash functions h_1, h_2, ..., h_k to implement the Bloom
filters. Adding Bloom filters modifies the hashing implementation in the following ways:
• Insertion: When we insert x into a bucket in our hash table, we set the indices
h_1(x) % n, h_2(x) % n, ..., h_k(x) % n of the bucket’s bitmap to 1.
• Lookup: When we look up x, which hashes to a certain bucket, we check whether all indices
h_1(x) % n, h_2(x) % n, ..., h_k(x) % n of the bucket’s bitmap are 1. If they are, the Bloom filter tells us
that x may exist in the bucket linked list. If not, the Bloom filter tells us that x definitely does not
exist in the bucket linked list.
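
The insertion and lookup rules above translate almost directly into code. In this sketch the k hash functions are simulated by hashing the pair (i, x) with Python's built-in hash(), which is an assumption made for brevity; the class and method names are hypothetical.

```python
class BloomFilter:
    def __init__(self, n, k):
        self.n = n            # bitmap size
        self.k = k            # number of hash functions
        self.bits = [0] * n

    def _indices(self, x):
        # Simulate h_1 ... h_k by seeding the built-in hash with i (an assumption).
        return [hash((i, x)) % self.n for i in range(self.k)]

    def add(self, x):
        for idx in self._indices(x):
            self.bits[idx] = 1

    def might_contain(self, x):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[idx] for idx in self._indices(x))
```

In chained hashing, one such filter would sit next to each bucket pointer, and a lookup would scan the chain only when might_contain returns True.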

5.2 Extendible Hashing

Extendible hashing is an improved variant of chained hashing that splits buckets instead of letting chains
grow forever. This approach allows multiple slot locations in the hash table to point to the same bucket chain.
The core idea behind re-balancing the hash table is to move bucket entries on a split and increase the
number of bits to examine to find entries in the hash table. This means that the DBMS only has to move
data within the buckets of the split chain; all other buckets are left untouched.
• The DBMS maintains global and local depth bit counts. These bit counts determine the number of
most significant bits we need to look at to find buckets in the slot array.
• When a bucket is full, the DBMS splits the bucket and re-distributes its elements. If the local depth
of the split bucket is less than the global depth, then the new bucket is just added to the existing slot
array. Otherwise, the DBMS doubles the size of the slot array to accommodate the new bucket and
increments the global depth counter.
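
The split logic can be sketched as follows. One deliberate difference from the text: this sketch indexes the slot array with the least significant bits of the hash, a common variant that makes doubling the array a simple copy, whereas the notes describe most significant bits. It also assumes unique keys and Python's built-in hash(), and all names are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}


class ExtendibleHashTable:
    def __init__(self, bucket_capacity=2):
        self.global_depth = 0
        self.directory = [Bucket(0, bucket_capacity)]  # the slot array

    def _index(self, key):
        # Assumption: use the low global_depth bits of the hash.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._index(key)].items.get(key)

    def put(self, key, value):
        # Note: would loop forever if more than `capacity` keys share one full hash.
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)  # bucket is full: split it and retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # No spare slot points at this bucket: double the slot array.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, bucket.capacity)
        new_bit = 1 << (bucket.local_depth - 1)
        # Repoint the slots whose newly examined bit is 1.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = sibling
        # Redistribute only the entries of the split bucket.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            target = sibling if hash(k) & new_bit else bucket
            target.items[k] = v
```

Only the split bucket's entries move; every other bucket is left untouched, and the slot array doubles only when the split bucket's local depth already equals the global depth.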

5.3 Linear Hashing

Instead of immediately splitting a bucket when it overflows, this scheme maintains a split pointer that keeps
track of the next bucket to split. Regardless of whether this pointer points to the bucket that overflowed,
the DBMS always splits the bucket at the pointer. The overflow criterion is left up to the implementation.
• When any bucket overflows, split the bucket at the pointer location. Add a new slot entry and a new
hash function, and apply this function to rehash the keys in the split bucket.
• If the original hash function maps to a slot that has previously been pointed to by the split pointer,
apply the new hash function to determine the actual location of the key.
• When the pointer reaches the very last slot, delete the original hash function and move the pointer
back to the beginning.
If the highest bucket below the split pointer is empty, we can also remove the bucket and move the split
pointer in the reverse direction, thereby shrinking the size of the hash table.
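
The growth path (without the shrinking step) can be sketched as below. The assumptions here are illustrative: the hash functions are hash(key) mod (n0 * 2^level), a bucket "overflows" when it holds more than bucket_capacity entries, and a Python list per bucket stands in for the bucket plus its overflow pages. All names are hypothetical.

```python
class LinearHashTable:
    def __init__(self, initial_buckets=4, bucket_capacity=2):
        self.n0 = initial_buckets
        self.capacity = bucket_capacity  # assumed overflow criterion
        self.level = 0                   # selects the current "original" hash function
        self.split_ptr = 0               # next bucket to split
        self.buckets = [[] for _ in range(initial_buckets)]

    def _bucket_index(self, key):
        size = self.n0 << self.level
        idx = hash(key) % size              # original hash function
        if idx < self.split_ptr:            # this bucket was already split
            idx = hash(key) % (size * 2)    # so apply the new hash function
        return idx

    def get(self, key):
        for k, v in self.buckets[self._bucket_index(key)]:
            if k == key:
                return v
        return None

    def put(self, key, value):
        bucket = self.buckets[self._bucket_index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))  # extra entries model an overflow page
        if len(bucket) > self.capacity:
            self._split()            # split at the pointer, not necessarily here

    def _split(self):
        size = self.n0 << self.level
        old = self.buckets[self.split_ptr]
        self.buckets[self.split_ptr] = []
        self.buckets.append([])      # new bucket lands at index split_ptr + size
        for k, v in old:
            self.buckets[hash(k) % (size * 2)].append((k, v))
        self.split_ptr += 1
        if self.split_ptr == size:   # every original bucket has been split
            self.level += 1          # the new hash function becomes the original
            self.split_ptr = 0
```

An overflowing bucket may stay over capacity until the pointer reaches it, which is the price paid for splitting in a fixed order.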
