Intro To C - Module 10
An associative array is indexed by keys in some known type K, and returns values in type V. A
resizeable array or linked list of (K, V) pairs will suffice to implement one. In fact, for small
collections, this is the preferred way to do it. The problem is that, when you have millions of
keys, a “linear search” (possible full scan) of the backing array is going to be too slow.
Tree-based data structures improve asymptotic complexity from O(N), where N is the number of
keys stored in the array, to O(log N). We will now discuss a data structure that, under certain
circumstances, improves it further—a hash table.
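To make the baseline concrete, here is a minimal sketch of the list-of-pairs approach with a linear search. The names pair and assoc_find are illustrative assumptions, not part of the course code.

```c
#include <stddef.h>
#include <string.h>

/* One (key, value) pair; both strings are borrowed, not copied. */
typedef struct {
    const char* key;
    const char* value;
} pair;

/* Linear search: returns the value stored for key, or NULL if absent.
   In the worst case every one of the n pairs is inspected: O(N). */
const char* assoc_find(const pair* pairs, size_t n, const char* key) {
    for (size_t i = 0; i < n; i++) {
        if (strcmp(pairs[i].key, key) == 0) {
            return pairs[i].value;
        }
    }
    return NULL;
}
```

For a handful of keys this is hard to beat; the cost only becomes a problem as n grows.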
You will not be required to use hash tables for PSI. Your interpreter does not need to be fast,
and you may use regular lists for every purpose where you might want to use a hash table in a
production-grade interpreter. I am teaching this because you should know how hash tables
work; you don’t, however, need them for the course project.
It’s good practice, when writing C code that others will use, to document one’s interface in a
header file.
// hashtbl.h
#ifndef HASHTBL_H
#define HASHTBL_H

#include <inttypes.h>
#include <stdbool.h>

// Opaque type: the fields are defined in the implementation file.
typedef struct table table;

table* table_new(void);
void table_print(table* t);
bool table_insert(table* t, const char* key, const char* value);
bool table_lookup(table* t, const char* key, const char** value);
bool table_remove(table* t, const char* key);
void table_delete(table* t);

#endif
For this hash table, both K and V are const char*, or immutable strings. Only functions
intended for external users are exposed. The hash function, notably, is not included—it is
assumed that the user doesn’t need to know how keys are hashed. (When, and whether, that is
a valid assumption is debatable.) The table type is declared opaquely—the .h file declares that
it is a struct, but leaves its fields to implementation; this means that implementations can be
changed without affecting the interface. The user doesn’t need to care about its internals; the
table_new function returns a table*—a pointer to a heap-allocated table structure that the
other functions use.
The table_print function is included mostly for our educational purposes; it will print out the
internal state of the table. The table_insert, table_lookup, and table_remove functions
handle the reading and writing of (key, value) pairs; table_delete removes a table object
from the system and deletes all of the information therein.
The type signature of table_lookup might seem odd, with its double-pointer argument. This
function, like the others, uses a bool return to indicate success or failure (key not found), which
means we need to use a different channel, in the case of success, to return the matching value.
We do this by passing in a place where this value, of type const char*, can be stored. Hence, we
need to create a const char**. In the case where the queried key isn’t found, table_lookup shall
return false and the pointer will not be adjusted.
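The calling convention can be sketched in isolation. Here demo_lookup is a hypothetical stand-in with the same shape as table_lookup, backed by a fixed list rather than a hash table, so that the example compiles on its own.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for table_lookup: a bool result plus an
   out-parameter that receives the value on success. */
bool demo_lookup(const char* key, const char** value) {
    static const char* const keys[]   = { "lime", "grape" };
    static const char* const values[] = { "blue", "black" };
    for (size_t i = 0; i < 2; i++) {
        if (strcmp(keys[i], key) == 0) {
            *value = values[i];  /* write through the caller's pointer */
            return true;
        }
    }
    return false;                /* *value is left untouched */
}
```

A caller declares a const char* variable, passes its address, and reads the variable only when the call returns true.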
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "hashtbl.h"
uint32_t hash(const char* str) {
    uint32_t acc = 0;
    while (*str) {
        acc *= 137;
        acc += *str;
        str++;
    }
    return acc;
}
We start with our necessary #includes, and then we write a hash function, which tells us where
to put things when added to the hash table, and where to find them during lookup. Since our
backing array will be an ordinary integer-indexed array—C doesn’t natively offer much else—we
will need integer values. Hash functions must be deterministic—h("cat") must return the same
value every time it is called—but we also want them to be “random” in the sense of avoiding
collisions.
For example, h("cat") = 0x1c8eb0, or 1871536 in decimal. If the size of our backing array is
64, then the natural place to store a pair with key value "cat" is index 1871536 % 64 = 48.
Thus, it’s not enough that collisions be rare with our hash function—we would also prefer that,
with respect to any given modulus—or, at least, any modulus that might be a size of a hash
table’s backing array—its values be relatively uniform. That is, Pr(hash(k) % 64 == 48) should
be 1/64 for a randomly chosen string k. The hash function above is not sophisticated, nor is it
suitable for cryptographic purposes—an adversary could easily “break” it by finding colliding
values—but, for our objectives, it will do.
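The arithmetic in the "cat" example can be checked directly. The sketch below repeats the hash function from above and adds slot_for, a hypothetical helper that reduces a hash value to an index in a backing array of a given capacity.

```c
#include <stdint.h>

/* Same multiply-and-add hash as in the listing above. */
uint32_t hash(const char* str) {
    uint32_t acc = 0;
    while (*str) {
        acc *= 137;
        acc += *str;
        str++;
    }
    return acc;
}

/* Hypothetical helper: index of the slot a key belongs in. */
uint32_t slot_for(const char* key, uint32_t capacity) {
    return hash(key) % capacity;
}
```

With these definitions, hash("cat") is 0x1c8eb0 and slot_for("cat", 64) is 48, matching the figures in the text.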
We create a struct for our (K, V) pairs. We also store the key’s hash value, so we don’t have to
recompute it, and add a pointer that can be used to create linked lists, for reasons we’ll see
shortly.
Next, we create a struct for our table, including a backing array, entries. Since the entries are
heap-allocated, we will be handling them by pointer, and so it is appropriate to use an
entry**—an array of entry*, of length capacity—for the purpose.
The size field corresponds not to the backing array’s length, then, but the number of entries
actually in the table. The upsize_threshold and downsize_threshold are load
factors—ratios between table->size and table->capacity at which the table should be
expanded or contracted.
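The two structs described above might look roughly like this. The field names come from the text and the other listings; the field types (int for size and capacity, double for the thresholds) and the load_factor helper are assumptions made for illustration.

```c
#include <stdint.h>

/* A (key, value) pair plus its cached hash and a chain pointer. */
typedef struct entry {
    const char* key;
    const char* value;
    uint32_t hash_val;      /* stored so it never has to be recomputed */
    struct entry* next;     /* next entry in the same slot, or NULL */
} entry;

typedef struct table table;
struct table {
    entry** entries;            /* backing array of length capacity */
    int size;                   /* number of entries actually stored */
    int capacity;               /* length of the backing array */
    double upsize_threshold;    /* grow when the load factor exceeds this */
    double downsize_threshold;  /* shrink when it drops below this */
};

/* Hypothetical helper: the ratio the thresholds are compared against. */
double load_factor(const table* t) {
    return (double)t->size / t->capacity;
}
```

A table holding 12 entries in 16 slots has a load factor of 0.75, which sits exactly at the default upsize threshold.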
table* table_new(void) {
    table* t = malloc(sizeof(table));
    assert(t);
    t->size = 0;
    t->capacity = 16;
    t->upsize_threshold = 0.75;
    t->downsize_threshold = 0.25;
    t->entries = calloc(t->capacity, sizeof(entry *));
    assert(t->entries);
    return t;
}
The table_new function creates an empty table with capacity 16 and reasonable defaults for
our resizing thresholds. The backing array is created using calloc; the pointers are all NULL,
because nothing has been put in the table yet. We use assert statements to crash (preferring
defined behavior) in the event of allocation failure.
Here is a function we can use to print the state—including its internal values—of a table:
void table_print(table* t) {
    printf("Hash Table @ %p\n", (void*)t);  // %p expects a void*
    printf(" Size = %d\n", t->size);
    printf(" Capacity = %d\n", t->capacity);
    for (int i = 0; i < t->capacity; i++) {
        printf(" [%d]", i);
        entry* ptr = t->entries[i];
        while (ptr) { // possibly, follow a linked list
            printf(" --> ");
            printf("%s (%08x) : %s", ptr->key, ptr->hash_val, ptr->value);
            ptr = ptr->next;
        }
        printf("\n");
    }
}
For example, consider a small table with a backing array of length 8 in which ("lime", "blue")
and ("grape", "black") have, unfortunately, mapped to the same index of the backing array, while
five of the slots are empty. Each entry in this table is stored in the slot for h % 8, where h is the
hash value of the key. Because we’re using linked lists to handle collisions, we can have two
entries mapped to index 7, at a cost to performance.
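The collision between "lime" and "grape" can be reproduced with the hash function from earlier. The helper same_slot below is a hypothetical name used only for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same multiply-and-add hash as in the listing above. */
uint32_t hash(const char* str) {
    uint32_t acc = 0;
    while (*str) {
        acc *= 137;
        acc += *str;
        str++;
    }
    return acc;
}

/* Hypothetical helper: do two keys land in the same slot? */
bool same_slot(const char* a, const char* b, uint32_t capacity) {
    return hash(a) % capacity == hash(b) % capacity;
}
```

Both hash("lime") % 8 and hash("grape") % 8 come out to 7. This is not entirely bad luck: 137 % 8 is 1, so at capacity 8 the slot of an ASCII key depends only on the sum of its character codes, which is one illustration of why uniformity with respect to the modulus matters.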
As hash table performance deteriorates when load factors are high, table_insert uses table_resize
to double the size of the backing array whenever an insertion is about to exceed our maximum load
factor, t->upsize_threshold. In the normal case, however, we don’t have to worry about this.
Instead, we create an entry* appropriate to the key and value, and use backing_array_insert, a
helper function described below, to insert it into the table; t->size must also be incremented.
However, if backing_array_insert fails (returns false) because an entry already exists for that
key, we also return false, and must free the entry we’ve created, since it would otherwise be
unreachable.
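The "about to exceed" test can be sketched as a small predicate. The name needs_upsize is hypothetical, and whether the original compares with > or >=, or uses size or size + 1, is an assumption here.

```c
#include <stdbool.h>

/* Hypothetical helper: would storing one more entry push the load
   factor past the upsize threshold? */
bool needs_upsize(int size, int capacity, double upsize_threshold) {
    return (double)(size + 1) / capacity > upsize_threshold;
}
```

With capacity 16 and a threshold of 0.75, the table accepts its twelfth entry without resizing and doubles before accepting a thirteenth.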
The insertion into the backing array is performed by the appropriately named backing_array_insert.
This function takes a single entry* and decides where in the backing array the entry should go.
If it immediately finds an empty slot—a NULL value—then it puts the entry* there and returns
true. Otherwise, it scans the linked list that is already there—if it finds that a value has already
been inserted for that key, it returns false; if it gets to the end, it appends the inserted entry*
and returns true.
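A minimal sketch of this helper follows. The signature is an assumption (here it takes the backing array and its capacity directly), and the entry struct is repeated so that the block compiles on its own.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct entry {
    const char* key;
    const char* value;
    uint32_t hash_val;
    struct entry* next;
} entry;

/* Assumed signature. Places e in the slot for its hash value.
   Returns false, changing nothing, if the key is already present. */
bool backing_array_insert(entry** entries, int capacity, entry* e) {
    /* link points at the pointer we may overwrite: first the slot
       itself, then each next field along the chain. */
    entry** link = &entries[e->hash_val % (uint32_t)capacity];
    while (*link) {
        if (strcmp((*link)->key, e->key) == 0) {
            return false;        /* duplicate key */
        }
        link = &(*link)->next;
    }
    e->next = NULL;
    *link = e;                   /* empty slot, or end of the chain */
    return true;
}
```

Walking with a pointer-to-pointer lets the empty-slot case and the append-to-chain case share one assignment.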
Our resizing function must (a) allocate a new array, (b) scan the old one for all entries, (c)
“break” linked lists apart (as one hopes, on an upsizing, that fewer entries will sit in congested
slots), (d) place them into the new array, and (e) clean up the old one. The table_resize function
does all this.
Note that this function, when it succeeds, does not realloc the backing array, but creates a
new one—and frees the old. It also breaks apart any linked lists in the old hashtable—we hope,
when upsizing, that previously colliding entries will map to separate indices. All the indices must
be recomputed—but the hash values don’t need to be, because they’re stored in the entry
struct—and then backing_array_insert is used on each one. This function, notably, scans
and reinserts an entire hash table, so it’s expensive—in fact, it’s O(N). We hope to be doing it
rarely.
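The steps above can be sketched as follows. This is an illustration under assumptions, not the original listing: the table struct is trimmed to the fields resizing touches, the signature taking new_capacity is assumed, and the reinsertion is written inline rather than by calling backing_array_insert.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct entry {
    const char* key;
    const char* value;
    uint32_t hash_val;
    struct entry* next;
} entry;

typedef struct table {
    entry** entries;
    int size;
    int capacity;
} table;

/* Rebuilds the backing array at new_capacity. Returns false, leaving
   the table untouched, if the new array cannot be allocated. */
bool table_resize(table* t, int new_capacity) {
    entry** fresh = calloc((size_t)new_capacity, sizeof(entry*));
    if (!fresh) return false;
    for (int i = 0; i < t->capacity; i++) {
        entry* e = t->entries[i];
        while (e) {
            entry* next = e->next;   /* save before breaking the link */
            /* Recompute the index only; the hash itself is cached. */
            entry** link = &fresh[e->hash_val % (uint32_t)new_capacity];
            while (*link) link = &(*link)->next;
            e->next = NULL;
            *link = e;
            e = next;
        }
    }
    free(t->entries);                /* entries were moved, not copied */
    t->entries = fresh;
    t->capacity = new_capacity;
    return true;
}
```

Two entries with hash values 7 and 15 share slot 7 at capacity 8 but separate into slots 7 and 15 at capacity 16, which is the outcome an upsizing hopes for.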
To look up a key, we compute its hash; if an entry exists, it will be found in the list at the
corresponding slot, which we scan. If we find an e such that strcmp(e->key, key) is zero, we set
*value to e->value and return true. Otherwise, we return false, indicating that no match was
found.
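A lookup along these lines might be written as below. The function signature matches the header; the struct definitions are trimmed assumptions repeated so that the sketch is self-contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct entry {
    const char* key;
    const char* value;
    uint32_t hash_val;
    struct entry* next;
} entry;

typedef struct table {
    entry** entries;
    int size;
    int capacity;
} table;

uint32_t hash(const char* str) {
    uint32_t acc = 0;
    while (*str) {
        acc *= 137;
        acc += *str;
        str++;
    }
    return acc;
}

bool table_lookup(table* t, const char* key, const char** value) {
    uint32_t h = hash(key);
    /* Only the chain in the key's own slot needs to be scanned. */
    for (entry* e = t->entries[h % (uint32_t)t->capacity]; e; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            *value = e->value;
            return true;
        }
    }
    return false;   /* *value is not modified */
}
```

The cost is one hash computation plus a scan of a single chain, which stays short as long as the load factor is kept in check.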
Last of all, we need to perform deletion. Deletion from a linked list requires repair: if a2 is
deleted from a1 --> a2 --> a3 --> NULL, then a link a1 --> a3 must be added. For this
reason, we iterate using an entry** that tracks the previous pointer to an entry, enabling this
repair after a deletion. We have to do it this way because C doesn’t offer a way to take an object
and ask, “What pointers point to this?” If we didn’t manually track, in addition to the entry we’re
examining, where (which pointer) we came from, we could not do the repair.
When the table’s load factor drops below t->downsize_threshold, the backing array shrinks.
Finally, when there is no entry in the table corresponding to the queried key in the first place,
false is returned to indicate that nothing was done.
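The entry** technique can be sketched as follows. The struct definitions are trimmed assumptions, and the downsizing step is deliberately left out to keep the focus on the list repair.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct entry {
    const char* key;
    const char* value;
    uint32_t hash_val;
    struct entry* next;
} entry;

typedef struct table {
    entry** entries;
    int size;
    int capacity;
} table;

uint32_t hash(const char* str) {
    uint32_t acc = 0;
    while (*str) {
        acc *= 137;
        acc += *str;
        str++;
    }
    return acc;
}

/* Sketch only: omits the shrink-on-low-load step. */
bool table_remove(table* t, const char* key) {
    /* link is the pointer that currently leads to the entry under
       examination: the slot itself, or a predecessor's next field. */
    entry** link = &t->entries[hash(key) % (uint32_t)t->capacity];
    while (*link) {
        entry* e = *link;
        if (strcmp(e->key, key) == 0) {
            *link = e->next;     /* repair: route around e */
            free(e);
            t->size--;
            return true;
        }
        link = &e->next;
    }
    return false;                /* key was not in the table */
}
```

Because link always names the pointer that leads to the current entry, removing the head of a chain and removing an interior entry are the same operation.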
void table_delete(table* t) {
    for (int i = 0; i < t->capacity; i++) {
        entry* ptr = t->entries[i];
        while (ptr) {
            entry* p = ptr;
            ptr = ptr->next;
            free(p);
        }
    }
    free(t->entries);
    free(t);
}
We scan the backing array, taking apart the linked lists and freeing the entries, then free the
backing array, and finally free the table struct.
As I said, you don’t need to use this for the PSI project, and you can go far in life without ever
having to implement a C hash table. That said, you will use data structures backed by these
things throughout your career as a programmer. Python’s dictionaries, std::unordered_map in
C++, and JavaScript’s objects come to mind.
Hash table insertion, removal, and lookup performance can be described as “O(1)-ish.”
Constants matter, and degradation to O(N) time can occur, for example, if the hashing is poorly
implemented. In practice, hash table performance relies on a number of factors:
There are few data structures that perform as well as hash tables at scale, but one should
understand that they aren’t strictly optimal for all circumstances—there’s a lot of nuance that
goes into getting the best possible performance.
Assignment 9 Questions
9.1. A generic (typeless) hashtable could be built using void* as its key type. In this case, you
could use the identity hash function—the pointer’s address is its hash value. Is this a good
choice? Why or why not?
9.2. To avoid wasting space, might it be a good idea to set downsize_threshold to 0.5? What
change could be made to the implementation to make this a sound thing to do?