Intro To C - Module 10

Uploaded by

Andrew Fu
Copyright
© © All Rights Reserved

Introduction to C: Module 10

Weekly Reading: (no additional readings)

A Hash Table Implementation

Dictionaries in Python, objects in JavaScript, and maps in Clojure are associative
arrays: instead of using an integer index, we can use a key of some other type, for example,
strings. This feature is so useful that nearly every language considered high-level provides it.
C, however, does not offer it; because of the numerous tradeoffs in implementation, C makes
you build your own.

An associative array is indexed by keys in some known type K, and returns values in type V. A
resizeable array or linked list of (K, V) pairs will suffice to implement one. In fact, for small
collections, this is the preferred way to do it. The problem is that, when you have millions of
keys, a “linear search” (possible full scan) of the backing array is going to be too slow.
Tree-based data structures improve asymptotic complexity from O(N), where N is the number of
keys stored in the array, to O(log N). We will now discuss a data structure that, under certain
circumstances, improves it further—a hash table.
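
To make the O(N) baseline concrete, here is a minimal sketch of the array-of-pairs approach described above. The names pair and pairs_lookup are assumptions made for illustration; they are not part of this module's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// A (K, V) pair with string keys and values (hypothetical type for this sketch).
typedef struct pair {
    const char* key;
    const char* value;
} pair;

// Linear search: may compare against all n pairs, so lookup cost grows as O(N).
// On success, stores the matching value in *value and returns true.
bool pairs_lookup(const pair* pairs, size_t n, const char* key,
                  const char** value) {
    assert(pairs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(pairs[i].key, key) == 0) {
            *value = pairs[i].value;
            return true;
        }
    }
    return false; // scanned every pair without finding the key
}
```

For a handful of pairs this is hard to beat; the cost only becomes a problem as the array grows, because a miss always touches every element.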

You will not be required to use hash tables for PSI. Your interpreter does not need to be fast,
and you may use regular lists for every purpose where you might want to use a hash table in a
production-grade interpreter. I am teaching this because you should know how hash tables
work; you don’t, however, need them for the course project.

First, the Header File

It’s good practice, when writing C code that others will use, to document one’s interface in a
header file.

// hashtbl.h
#ifndef HASHTBL_H
#define HASHTBL_H

#include <stdbool.h> // bool, used by the return types below

typedef struct table table;

table* table_new(void);
void table_print(table* t);
bool table_insert(table* t, const char* key, const char* value);
bool table_lookup(table* t, const char* key, const char** value);
bool table_remove(table* t, const char* key);
void table_delete(table* t);

#endif

For this hash table, both K and V are const char*, or immutable strings. Only functions
intended for external users are exposed. The hash function, notably, is not included—it is
assumed that the user doesn’t need to know how keys are hashed. (When, and whether, that is
a valid assumption is debatable.) The table type is declared opaquely—the .h file declares that
it is a struct, but leaves its fields to implementation; this means that implementations can be
changed without affecting the interface. The user doesn’t need to care about its internals; the
table_new function returns a table*—a pointer to a heap-allocated table structure that the
other functions use.

The table_print function is included mostly for our educational purposes; it will print out the
internal state of the table. The table_insert, table_lookup, and table_remove functions
handle the reading and writing of (key, value) pairs; table_delete removes a table object
from the system and deletes all of the information therein.

The type signature of table_lookup might seem odd, with its double-pointer argument. This
function, like the others, uses a bool return to indicate success or failure (key not found), which
means we need a different channel, in the case of success, to return the matching value.
We do this by passing in the address of a place where this value, of type const char*, can be
stored; that address has type const char**. In the case where the queried key isn’t found,
table_lookup returns false and leaves the pointed-to value untouched.

Below is how a user might use such a function:

const char* v = NULL;
bool b = table_lookup(t, "key1", &v); // t is a table*; &v is of type const char**
if (b) {
    printf("Successful lookup: %s\n", v);
} else {
    printf("Not found\n");
}

This discussed, let’s go through the implementation that is in hashtbl.c:

#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "hashtbl.h"

uint32_t hash(const char* str) {
    uint32_t acc = 0;
    while (*str) {
        acc *= 137;                 // unsigned arithmetic wraps modulo 2^32
        acc += (unsigned char)*str; // avoid negative values where char is signed
        str++;
    }
    return acc;
}

We start with our necessary #includes, and then we write a hash function, which tells us where
to put things when added to the hash table, and where to find them during lookup. Since our
backing array will be an ordinary integer-indexed array—C doesn’t natively offer much else—we
will need integer values. Hash functions must be deterministic—h("cat") must return the same
value every time it is called—but we also want them to be “random” in the sense of avoiding
collisions.

For example, h("cat") = 0x1c8eb0, or 1871536 in decimal. If the size of our backing array is
64, then the natural place to store a pair with key value "cat" is index 1871536 % 64 = 48.
Thus, it’s not enough that collisions be rare with our hash function—we would also prefer that,
with respect to any given modulus—or, at least, any modulus that might be a size of a hash
table’s backing array—its values be relatively uniform. That is, Pr(hash(k) % 64 == 48) should
be 1/64 for a randomly chosen string k. The hash function above is not sophisticated, nor is it
suitable for cryptographic purposes—an adversary could easily “break” it by finding colliding
values—but, for our objectives, it will do.
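
The arithmetic in the "cat" example can be checked with a short sketch. The hash function repeats the scheme shown above; slot_for is a hypothetical helper introduced only for this illustration.

```c
#include <assert.h>
#include <inttypes.h>

// Same multiply-and-add scheme as the module's hash function.
uint32_t hash(const char* str) {
    uint32_t acc = 0;
    while (*str) {
        acc *= 137;
        acc += (unsigned char)*str;
        str++;
    }
    return acc;
}

// Hypothetical helper: maps a key to a slot in a backing array of the
// given capacity by reducing its hash value modulo that capacity.
uint32_t slot_for(const char* key, uint32_t capacity) {
    assert(capacity > 0);
    return hash(key) % capacity;
}
```

Working through it by hand: 'c' is 99, 'a' is 97, and 't' is 116, so the accumulator goes 99, then 99 * 137 + 97 = 13660, then 13660 * 137 + 116 = 1871536, which is 0x1c8eb0; reducing modulo 64 gives 48.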

We create a struct for our (K, V) pairs. We also store the key’s hash value, so we don’t have to
recompute it, and add a pointer that can be used to create linked lists, for reasons we’ll see
shortly:

typedef struct entry {
    const char* key;
    const char* value;
    uint32_t hash_val;
    struct entry* next;
} entry;

Next, we create a struct for our table, including a backing array, entries. Since the entries are
heap-allocated, we will be handling them by pointer, and so it is appropriate to use an
entry**—an array of entry*, of length capacity—for the purpose.

typedef struct table {
    int size;
    int capacity;
    double upsize_threshold;
    double downsize_threshold;
    entry** entries;
} table;

The size field corresponds not to the backing array’s length, then, but the number of entries
actually in the table. The upsize_threshold and downsize_threshold are load
factors—ratios between table->size and table->capacity at which the table should be
expanded or contracted.
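
As a rough sketch of how these thresholds are meant to be read, the helpers below restate the comparisons in isolation. The function names are assumptions for illustration; the module's own code performs these checks inline during insertion and removal, and its removal path additionally requires the capacity to be above 16 before shrinking.

```c
#include <assert.h>
#include <stdbool.h>

// Load factor: stored entries per slot of the backing array. With chaining
// it can exceed 1.0, since one slot may hold several entries.
double load_factor(int size, int capacity) {
    assert(capacity > 0);
    return (double)size / capacity;
}

// True when the table has reached the load at which it should grow.
bool should_upsize(int size, int capacity, double upsize_threshold) {
    return size >= upsize_threshold * capacity;
}

// True when the table is sparse enough that it could shrink.
bool should_downsize(int size, int capacity, double downsize_threshold) {
    return size <= downsize_threshold * capacity;
}
```

With a capacity of 16 and an upsize threshold of 0.75, for instance, the check first succeeds at 12 entries.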

Let’s now discuss table creation:

table* table_new(void) {
    table* t = malloc(sizeof(table));
    assert(t);
    t->size = 0;
    t->capacity = 16;
    t->upsize_threshold = 0.75;
    t->downsize_threshold = 0.25;
    t->entries = calloc(t->capacity, sizeof(entry *));
    assert(t->entries);
    return t;
}

The table_new function creates an empty table with capacity 16 and reasonable defaults for
our resizing thresholds. The backing array is created using calloc; the pointers are all NULL,
because nothing has been put in the table yet. We use assert statements to crash (preferring
defined behavior) in the event of allocation failure. Note that assert is compiled out when
NDEBUG is defined, so code built that way would need to check the pointers explicitly.

Here is a function we can use to print the state—including its internal values—of a table:

void table_print(table* t) {
    printf("Hash Table @ %p\n", (void*)t); // %p expects a void*
    printf(" Size = %d\n", t->size);
    printf(" Capacity = %d\n", t->capacity);
    for (int i = 0; i < t->capacity; i++) {
        printf(" [%d]", i);
        entry* ptr = t->entries[i];
        while (ptr) { // possibly, follow a linked list
            printf(" --> ");
            // PRIx32 (from inttypes.h) is the matching format for uint32_t
            printf("%s (%08" PRIx32 ") : %s", ptr->key, ptr->hash_val, ptr->value);
            ptr = ptr->next;
        }
        printf("\n");
    }
}

For example, here is output showing that ("lime", "blue") and ("grape", "black") are
both in a small table and have, unfortunately, mapped to the same index of the backing array;
five of the slots are currently empty.

Hash Table @ 0x6000032c0040
 Size = 4
 Capacity = 8
 [0]
 [1] --> banana (3abffae1) : red
 [2] --> apple (06065c92) : green
 [3]
 [4]
 [5]
 [6]
 [7] --> lime (10abc27f) : blue --> grape (844c516f) : black

As you can see, each entry in this table is stored in the slot for h % 8, where h is the hash value
of the key. Because we’re using linked lists to handle collisions, we can have two entries
mapped to index 7, at a cost to performance.

We show table_insert, which relies on two other helper functions, next:

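bool table_insert(table* t, const char* key, const char* value) {
    // Grow the backing array first if the table has reached its load limit.
    if (t->size >= t->upsize_threshold * t->capacity) {
        int new_capacity = t->capacity * 2;
        table_resize(t, new_capacity);
        t->capacity = new_capacity; // table_resize does not update this itself
    }

    uint32_t hash_val = hash(key);
    entry* e = calloc(1, sizeof(entry)); // calloc leaves e->next NULL
    assert(e);
    e->key = key;
    e->value = value;
    e->hash_val = hash_val;
    if (backing_array_insert(e, t->entries, t->capacity)) {
        t->size += 1;
        return true;
    } else {
        free(e); // duplicate key: the new entry was never linked in
        return false;
    }
}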

As hash table performance deteriorates when load factors are high, we use table_resize to
double the size of the backing array whenever the table has reached its maximum load factor,
t->upsize_threshold. In the normal case, however, we don’t have to worry about
this; instead, we create an entry appropriate to the key and value, and use
backing_array_insert, a helper function we’ll show below, to insert it into the table,
incrementing t->size on success. However, if backing_array_insert fails (returns false)
because an entry already exists for that key, we also return false, and must free the entry
we’ve created, since it would otherwise be unreachable. Because table_insert calls both
helpers, they must be declared or defined before it in hashtbl.c.

The insertion into the backing array is performed using a function named appropriately:

bool backing_array_insert(entry* e, entry** array, int array_size) {
    assert(e->next == NULL);
    int index = e->hash_val % array_size;
    if (array[index] == NULL) {
        array[index] = e; // empty slot: e becomes the head of a new list
    } else {
        entry* last = array[index];
        while (true) {
            if (strcmp(last->key, e->key) == 0) {
                return false; // key already present
            } else if (last->next == NULL) {
                last->next = e; // reached the end: append
                break;
            } else {
                last = last->next;
            }
        }
    }
    return true;
}

This function takes a single entry* and decides where in the backing array the entry should go.
If it immediately finds an empty slot—a NULL value—then it puts the entry* there and returns
true. Otherwise, it scans the linked list that is already there—if it finds that a value has already
been inserted for that key, it returns false; if it gets to the end, it appends the inserted entry*
and returns true.

Our resizing function must (a) allocate a new array, (b) scan the old one for all entries, (c)
“break” linked lists apart (as one hopes, on an upsizing, that fewer entries will sit in congested
slots), (d) place them into the new array, and (e) clean up the old one. The function below does
all this:

void table_resize(table* t, int new_capacity) {
    entry** new_backing_array = calloc(new_capacity, sizeof(entry *));
    assert(new_backing_array);
    for (int i = 0; i < t->capacity; i++) { // t->capacity is still the old size
        entry* e = t->entries[i];
        while (e) {
            entry* ptr = e->next; // remember the rest of the old list
            e->next = NULL;       // detach e before reinserting it
            backing_array_insert(e, new_backing_array, new_capacity);
            e = ptr;
        }
    }
    free(t->entries);
    t->entries = new_backing_array;
}

Note that this function does not realloc the backing array, but creates a
new one and frees the old. It also breaks apart any linked lists in the old hash table; we hope,
when upsizing, that previously colliding entries will map to separate indices. All the indices must
be recomputed (the hash values don’t need to be, because they’re stored in the entry
struct), and then backing_array_insert is used on each entry. The function needs the old
t->capacity to scan the old array and does not change it, so the caller must set the new
capacity afterward. This function, notably, scans and reinserts an entire hash table, so it’s
expensive; in fact, it’s O(N). We hope to be doing it
rarely.
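
Whether a resize actually separates colliding entries depends on the hash values involved. The sketch below recomputes slot indices for the two colliding keys from the sample output at several capacities; slot_index is a hypothetical helper, and hash repeats the scheme used in this module.

```c
#include <assert.h>
#include <inttypes.h>

// Same multiply-and-add scheme as the module's hash function.
uint32_t hash(const char* str) {
    uint32_t acc = 0;
    while (*str) {
        acc *= 137;
        acc += (unsigned char)*str;
        str++;
    }
    return acc;
}

// Slot index for a stored hash value at a given capacity. Only the modulus
// changes on a resize; the hash value itself stays the same.
int slot_index(uint32_t hash_val, int capacity) {
    assert(capacity > 0);
    return (int)(hash_val % (uint32_t)capacity);
}
```

With the hash values from the sample output, "lime" (0x10abc27f) and "grape" (0x844c516f) share slot 7 at capacity 8, still share slot 15 at capacity 16, and only separate at capacity 32 (slots 31 and 15). A single doubling is therefore not guaranteed to break up a chain.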

Next, let’s examine the lookup function:

bool table_lookup(table* t, const char* key, const char** value) {
    uint32_t hash_val = hash(key);
    int index = hash_val % t->capacity;
    entry* e = t->entries[index];
    while (e) {
        if (strcmp(e->key, key) == 0) {
            *value = e->value;
            return true;
        } else {
            e = e->next;
        }
    }
    // no match
    return false;
}

We compute the hash of the key we’re looking for; if an entry exists, it will be found in the
corresponding slot, within the list, which we scan. If we find an e such that strcmp(e->key,
key) is zero, we set *value to e->value and return true. Otherwise, we return false,
indicating that no match was found.

Last of all, we need to perform deletion. Deletion from a linked list requires repair: if a2 is
deleted from a1 --> a2 --> a3 --> NULL, then a link a1 --> a3 must be added. For this
reason, we iterate using an entry** that tracks the previous pointer to an entry, enabling this
repair after a deletion. We have to do it this way because C doesn’t offer a way to take an object
and ask, “What pointers point to this?” If we didn’t manually track, in addition to the entry we’re
examining, where (which pointer) we came from, we could not do the repair.
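
The pointer-to-pointer technique is easier to see on a bare linked list, away from the hashing. The following is a minimal sketch; node, node_push, and node_remove are hypothetical names that stand in for the entry type and are not part of the module's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

// A bare singly linked list node (hypothetical; stands in for entry).
typedef struct node {
    int id;
    struct node* next;
} node;

// Prepends a new node and returns the new head of the list.
node* node_push(node* head, int id) {
    node* n = malloc(sizeof(node));
    assert(n);
    n->id = id;
    n->next = head;
    return n;
}

// Removes the first node with the given id. p always points at the pointer
// that led us to the current node: first the head, then some node's next.
bool node_remove(node** head, int id) {
    node** p = head;
    while (*p) {
        node* n = *p;
        if (n->id == id) {
            *p = n->next; // repair: whoever pointed at n now bypasses it
            free(n);
            return true;
        }
        p = &n->next;
    }
    return false;
}
```

Because p is the address of the link itself, removing the head and removing an interior node are the same operation; no special case is needed for the first element.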

When the table’s load factor drops to t->downsize_threshold or below, and the capacity is
still above the initial 16, the backing array is halved.
Finally, when there is no entry in the table corresponding to the queried key in the first place,
false is returned to indicate that nothing was done.

bool table_remove(table* t, const char* key) {
    uint32_t hash_val = hash(key);
    int index = hash_val % t->capacity;
    entry** p = &(t->entries[index]); // the pointer that leads to the current entry
    while (*p) {
        entry* e = *p;
        if (strcmp(e->key, key) == 0) {
            *p = e->next; // repair the list around e
            free(e);
            t->size -= 1;
            if ((t->size <= t->downsize_threshold * t->capacity)
                && (t->capacity > 16)) {
                int new_capacity = t->capacity / 2;
                table_resize(t, new_capacity);
                t->capacity = new_capacity;
            }
            return true;
        } else {
            p = &(e->next);
        }
    }
    return false;
}

We also need a function for deleting a hash table when we’re done with it. Luckily, this one
isn’t so bad.

void table_delete(table* t) {
    for (int i = 0; i < t->capacity; i++) {
        entry* ptr = t->entries[i];
        while (ptr) {
            entry* p = ptr;
            ptr = ptr->next;
            free(p);
        }
    }
    free(t->entries);
    free(t);
}

We scan the backing array, taking apart the linked lists and freeing the entries, then free the
backing array, and finally free the table struct.

Why Did I Make You Go Through All This?

As I said, you don’t need to use this for the PSI project, and you can go far in life without ever
having to implement a C hash table. That said, you will use data structures backed by these
things throughout your career as a programmer. Python’s dictionaries, std::unordered_map in
C++, and JavaScript’s objects come to mind.

Hash table insertion, removal, and lookup performance can be described as “O(1)-ish.”
Constants matter, and degradation to O(N) time can occur, for example, if the hashing is poorly
implemented. In practice, hash table performance relies on a number of factors:

● How fast is the hash function?
● Does the hash function, in fact, distribute results uniformly, or are certain values/moduli
more common than others?
● What is the ratio of insertions to lookups? How often are entries deleted?
● How often do resizings occur?
● Is our hash table going to be a large global object, or will we be modifying (and
copying) these for local variables?

And we haven’t even discussed cache-related issues, because, well... ouch.
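
The degradation to O(N) mentioned above is easy to provoke on purpose. The sketch below, which uses hypothetical names throughout (hash_fn, constant_hash, longest_chain), counts how long the longest chain would become for a given hash function, without building a real table.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdlib.h>

typedef uint32_t (*hash_fn)(const char*);

// A deliberately bad hash: every key lands in the same slot.
uint32_t constant_hash(const char* str) {
    (void)str; // the key is ignored on purpose
    return 42;
}

// Returns the length of the longest chain that would form if the n keys
// were placed into a chained table with the given capacity and hash.
int longest_chain(hash_fn h, const char** keys, int n, int capacity) {
    assert(capacity > 0);
    int* counts = calloc((size_t)capacity, sizeof(int));
    assert(counts);
    int longest = 0;
    for (int i = 0; i < n; i++) {
        int slot = (int)(h(keys[i]) % (uint32_t)capacity);
        counts[slot]++;
        if (counts[slot] > longest) {
            longest = counts[slot];
        }
    }
    free(counts);
    return longest;
}
```

With constant_hash, the longest chain always equals the number of keys, no matter how large the backing array is, so every lookup becomes a linear scan of a linked list.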

There are few data structures that perform as well as hash tables at scale, but one should
understand that they aren’t strictly optimal for all circumstances—there’s a lot of nuance that
goes into getting the best possible performance.

Assignment 9 Questions

9.1. A generic (typeless) hashtable could be built using void* as its key type. In this case, you
could use the identity hash function—the pointer’s address is its hash value. Is this a good
choice? Why or why not?

9.2. To avoid wasting space, might it be a good idea to set downsize_threshold to 0.5? What
change could be made to the implementation to make this a sound thing to do?

Submit your answers in a PDF; this assignment is due October 31.
