07 Hashtables
1 Data Structures
A DBMS uses various data structures for many different parts of the system internals. Some examples
include:
• Internal Meta-Data: This is data that keeps track of information about the database and the system
state.
Ex: Page tables, page directories
• Core Data Storage: Data structures are used as the base storage for tuples in the database.
• Temporary Data Structures: The DBMS can build ephemeral data structures on the fly while
processing a query to speed up execution (e.g., hash tables for joins).
• Table Indices: Auxiliary data structures can be used to make it easier to find specific tuples.
There are two main design decisions to consider when implementing data structures for the DBMS:
1. Data organization: We need to figure out how to lay out memory and what information to store
inside the data structure in order to support efficient access.
2. Concurrency: We also need to think about how to enable multiple threads to access the data structure
without causing problems, ensuring that the data remains correct and sound.
2 Hash Table
A hash table implements an associative array abstract data type that maps keys to values. It provides
O(1) operation complexity on average (O(n) in the worst case) and O(n) storage complexity. Note that even
with O(1) average operation complexity, constant-factor optimizations still matter in the real world.
A hash table implementation consists of two parts:
• Hash Function: This tells us how to map a large key space into a smaller domain. It is used to
compute an index into an array of buckets or slots. We need to consider the trade-off between
fast execution and collision rate. On one extreme, we have a hash function that always returns a
constant (very fast, but everything is a collision). On the other extreme, we have a “perfect” hash
function that produces no collisions but would take extremely long to compute. The ideal design
is somewhere in the middle.
• Hashing Scheme: This tells us how to handle key collisions after hashing. Here, we need to consider
the trade-off between allocating a large hash table to reduce collisions and having to execute
additional instructions when a collision occurs.
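The two trade-offs above can be made concrete with a small experiment. The following is a minimal sketch, not part of the original notes: constant_hash, byte_mix_hash, and count_collisions are hypothetical names, and the multiply-by-31 mixing function is only an illustrative stand-in for a real hash function. It counts how many keys land in an already-occupied slot for a given hash function and table size.

```python
def constant_hash(key: bytes) -> int:
    # Extreme case from the text: trivially fast, but every key collides.
    return 42


def byte_mix_hash(key: bytes) -> int:
    # Illustrative (assumed) hash: multiply-and-add over the bytes, kept to 32 bits.
    h = 0
    for b in key:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h


def count_collisions(hash_fn, keys, num_slots: int) -> int:
    # A collision is counted whenever a key maps to a slot that is already taken.
    occupied = set()
    collisions = 0
    for key in keys:
        slot = hash_fn(key) % num_slots
        if slot in occupied:
            collisions += 1
        else:
            occupied.add(slot)
    return collisions


keys = [f"key{i}".encode() for i in range(100)]

# Same keys, different hash functions and table sizes.
for name, fn in (("constant", constant_hash), ("byte_mix", byte_mix_hash)):
    for num_slots in (16, 1024):
        print(name, num_slots, count_collisions(fn, keys, num_slots))
```

With the constant function, every key after the first collides regardless of table size. With the mixing function, a larger table spreads the same keys over more slots, at the cost of more allocated memory.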
Spring 2024 – Lecture #07 Hash Tables
3 Hash Functions
A hash function takes in any key as its input. It then returns an integer representation of that key (i.e., the
“hash”). The function’s output is deterministic (i.e., the same key should always generate the same hash
output).
The DBMS need not use a cryptographically secure hash function (e.g., SHA-256) because we do not need
to worry about protecting the contents of keys. These hash functions are primarily used internally by the
DBMS and thus information is not leaked outside of the system. In general, we only care about the hash
function’s speed and collision rate.
The current state-of-the-art hash function is Facebook's XXH3 (part of the xxHash library).
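To illustrate what a fast, deterministic, non-cryptographic hash function looks like, here is a minimal sketch of 32-bit FNV-1a. FNV-1a is swapped in only because it fits in a few lines; it is not the function the notes recommend, and the helper slot_for is a hypothetical name for mapping a hash to a bucket index.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(key: bytes) -> int:
    # 32-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
    h = FNV_OFFSET_BASIS_32
    for byte in key:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the state to 32 bits
    return h


def slot_for(key: bytes, num_slots: int) -> int:
    # Hypothetical helper: reduce the hash to an index into the slot array.
    return fnv1a_32(key) % num_slots


# The same key always produces the same hash, so it always maps to the same slot.
assert fnv1a_32(b"tuple-42") == fnv1a_32(b"tuple-42")
print(hex(fnv1a_32(b"a")), slot_for(b"tuple-42", 16))
```

Nothing here protects the contents of the key, which is acceptable under the reasoning above: the hash is only used inside the DBMS.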
Non-unique Keys: In the case where the same key may be associated with multiple different values or
tuples, there are two approaches.
• Separate Linked List: Instead of storing the values with the keys, we store a pointer to a separate
storage area that contains a linked list of all the values, which may overflow to multiple pages.
• Redundant Keys: The more common approach is to simply store the same key multiple times in the
table. Everything with linear probing still works even if we do this.
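The redundant-keys approach can be sketched as follows. This is a simplified illustration under stated assumptions: a fixed-size, in-memory table with no deletes or resizing, small integer keys (for which Python's built-in hash is deterministic), and a hypothetical class name LinearProbeMultiMap. Inserting never checks for an existing key, and a lookup keeps scanning until it reaches an empty slot so that it collects every copy of the key.

```python
class LinearProbeMultiMap:
    """Fixed-size linear probing table that stores duplicate keys redundantly."""

    def __init__(self, capacity: int = 8):
        self.slots = [None] * capacity  # each slot is None or a (key, value) pair

    def insert(self, key, value) -> None:
        n = len(self.slots)
        start = hash(key) % n
        for step in range(n):
            i = (start + step) % n
            if self.slots[i] is None:
                # No uniqueness check: the same key may appear many times.
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def find_all(self, key) -> list:
        n = len(self.slots)
        start = hash(key) % n
        matches = []
        for step in range(n):
            i = (start + step) % n
            if self.slots[i] is None:
                break  # an empty slot ends the probe sequence
            if self.slots[i][0] == key:
                matches.append(self.slots[i][1])
        return matches


table = LinearProbeMultiMap(capacity=8)
table.insert(1, "a")
table.insert(9, "b")  # 9 % 8 == 1, so this collides with key 1 and probes forward
table.insert(1, "c")  # second entry for key 1
print(table.find_all(1), table.find_all(9))
```

The only change relative to a unique-key table is that a lookup cannot stop at the first match.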
Optimizations: There are several ways to further optimize linear probe hashing:
• Specialized hash table implementations based on the data type or size of keys: These could differ
in the way they store data, perform splits, etc. For example, if we have string keys, we could store
smaller strings in the original hash table and only a pointer or hash for larger strings.
• Storing metadata in a separate array: An example would be to store empty slot/tombstone
information in a packed bitmap as a part of the page header or in a separate hash table, which would
help us avoid looking up deleted keys.
• Maintaining versions for the hash table and its slots: Since allocating memory for a hash table is
expensive, we may want to reuse the same memory repeatedly. To clear out the table and invalidate
its entries, we can increment the version counter of the table instead of marking each slot as
deleted/empty. A slot can be treated as empty if there is a mismatch between the slot version and
table version.
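The versioning idea in the last bullet can be sketched as below. This is a minimal illustration, assuming a fixed-size in-memory table with small integer keys; the class name VersionedTable and its fields are hypothetical, and a real implementation would also need to handle version counter overflow, which this sketch ignores.

```python
class VersionedTable:
    """Linear probing table whose clear() only bumps a version counter."""

    def __init__(self, capacity: int = 8):
        self.version = 1
        self.keys = [None] * capacity
        self.values = [None] * capacity
        self.slot_versions = [0] * capacity  # 0 never matches a live table version

    def _is_empty(self, i: int) -> bool:
        # A slot is live only if it was written under the current table version.
        return self.slot_versions[i] != self.version

    def put(self, key, value) -> None:
        n = len(self.keys)
        start = hash(key) % n
        for step in range(n):
            i = (start + step) % n
            if self._is_empty(i) or self.keys[i] == key:
                self.keys[i] = key
                self.values[i] = value
                self.slot_versions[i] = self.version
                return
        raise RuntimeError("table is full")

    def get(self, key):
        n = len(self.keys)
        start = hash(key) % n
        for step in range(n):
            i = (start + step) % n
            if self._is_empty(i):
                return None  # stale or never-written slot ends the probe
            if self.keys[i] == key:
                return self.values[i]
        return None

    def clear(self) -> None:
        # Invalidate every entry at once without touching any slot.
        self.version += 1


t = VersionedTable()
t.put(1, "a")
t.put(9, "b")
t.clear()
print(t.get(1), t.keys[1])  # the entry is gone logically, but the memory is untouched
```

Clearing costs a single increment instead of a pass over every slot, and stale slots are simply overwritten by later inserts.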
Google’s absl::flat_hash_map is a state-of-the-art implementation of Linear Probe Hashing.