Type-Safe Generic Data Structures in C
The rise of a new generation of low-level programming languages like Rust, Go and Zig
has caused C and its primitive type system to fall into some disrepute. Nonetheless,
with sufficient creativity it is possible to achieve surprisingly sophisticated results in C.
One such result is generic data structures. This post reviews two techniques for
implementing generic data structures in C: unsafely using raw memory and pointer
casts, and safely using code generation through macros.1
Our stack's internal state consists of a length, a capacity, and a pointer to the stack's
heap-allocated data. The length is the number of elements in the stack while the
capacity is the maximum number of elements that the allocated memory could hold
(not the size of the allocated memory in bytes, which is capacity * sizeof(int)).
typedef struct {
    size_t len, capacity;
    int* data;
} IntStack;
Aside: You can find the full code for this post here.
We'll start with the push operation. First, the stack is resized if there is not enough
capacity for an additional element. Then, the stack's length is incremented and the value
is written to the end of the data array:2
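void IntStack_push(IntStack* stck, int value) {
    if (!stck) {
        return;
    }
    if (stck->len + 1 > stck->capacity) {
        size_t new_capacity = stck->capacity * 2;
        int* new_data = realloc(stck->data, new_capacity * sizeof(int));
        if (!new_data) {
            /* TODO: Handle memory error. */
            return;
        }
        stck->capacity = new_capacity;
        stck->data = new_data;
    }
    stck->len++;
    stck->data[stck->len - 1] = value;
}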
Note that realloc may return a null pointer if it cannot re-allocate the memory, so it's
not safe to assign the return value to stck->data without checking if it is null first. Also
note that the use of realloc assumes that stck->data has been previously allocated
with malloc, an assumption that will be upheld by the constructor that we'll write later.
The pop operation decrements the length field and returns the former last element. The
return value is wrapped in an IntResult data structure, since in the case that stck is
null or empty, there is no popped value to return.3
IntResult IntStack_pop(IntStack* stck) {
    if (!stck || stck->len == 0) {
        return IntResult_error();
    }
    stck->len--;
    return IntResult_of(stck->data[stck->len]);
}
typedef struct {
    bool error;
    int result;
} IntResult;
IntResult IntResult_of(int v) {
    IntResult r = { .error = false, .result = v };
    return r;
}
IntResult IntResult_error() {
    IntResult r = { .error = true };
    return r;
}
We'll also provide a constructor for the convenient creation of IntStack objects. It
initializes the stack with a length of 0 and a small initial capacity of heap-allocated
memory.
IntStack IntStack_new() {
    size_t capacity = 8;
    int* data = malloc(capacity * sizeof(int));
    if (!data) {
        /* TODO: Handle memory error. */
    }
    IntStack stck = { .len = 0, .capacity = capacity, .data = data };
    return stck;
}
Since the constructor allocates memory from the heap, we need a corresponding
destructor to free it:
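The destructor can be sketched as follows, modelled on the typename##_free function from the macro version later in this post. To keep the block self-contained it repeats the IntStack definitions from above. IntStack_example is a hypothetical helper added for illustration, and exiting on allocation failure is an assumption of this sketch, not something the post prescribes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    size_t len, capacity;
    int* data;
} IntStack;

typedef struct {
    bool error;
    int result;
} IntResult;

IntStack IntStack_new(void) {
    size_t capacity = 8;
    int* data = malloc(capacity * sizeof(int));
    if (!data) {
        /* Assumption of this sketch: give up if the allocation fails. */
        exit(EXIT_FAILURE);
    }
    IntStack stck = { .len = 0, .capacity = capacity, .data = data };
    return stck;
}

/* Release the heap memory owned by the stack. */
void IntStack_free(IntStack* stck) {
    if (stck) {
        free(stck->data);
    }
}

void IntStack_push(IntStack* stck, int value) {
    if (!stck) {
        return;
    }
    if (stck->len + 1 > stck->capacity) {
        size_t new_capacity = stck->capacity * 2;
        int* new_data = realloc(stck->data, new_capacity * sizeof(int));
        if (!new_data) {
            return;
        }
        stck->capacity = new_capacity;
        stck->data = new_data;
    }
    stck->data[stck->len++] = value;
}

IntResult IntStack_pop(IntStack* stck) {
    if (!stck || stck->len == 0) {
        IntResult err = { .error = true };
        return err;
    }
    stck->len--;
    IntResult ok = { .error = false, .result = stck->data[stck->len] };
    return ok;
}

/* Hypothetical helper: create, fill, pop once, and free a stack. */
int IntStack_example(void) {
    IntStack stck = IntStack_new();
    for (int i = 0; i < 20; i++) {
        IntStack_push(&stck, i); /* grows past the initial capacity of 8 */
    }
    IntResult r = IntStack_pop(&stck);
    assert(!r.error);
    IntStack_free(&stck); /* stck must not be used after this point */
    return r.result;
}
```

Every IntStack_new call should be paired with exactly one IntStack_free call. Freeing the same stack twice, or pushing to it after it has been freed, is undefined behavior.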
The critical insight is that, even though at compile time neither the type nor the size of
the elements is known, the elements of the stack can still be stored as unstructured
binary data as long as we keep track of how much memory each element occupies.
In C, binary data can be stored in an array of chars. To give an example, suppose we
define a Point type:
typedef struct {
    int x, y;
} Point;
We can store a Point object in a char array by using memcpy to copy its bytes into the
array. For a Point with x = 42 and y = 43, on a little-endian machine with 4-byte ints, the array then holds:
data[0] = 42
data[1] = 0
data[2] = 0
data[3] = 0
data[4] = 43
data[5] = 0
data[6] = 0
data[7] = 0
But the individual elements of the array must be treated as opaque because the
memory layout of Point is up to the compiler.
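A minimal sketch of the round trip described above: a Point is copied into a char array with memcpy and then copied back out. The function names Point_round_trip and Point_example are hypothetical and used only for illustration. The code never inspects individual bytes, so it does not depend on the compiler's layout of Point.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    int x, y;
} Point;

/* Store a Point as raw bytes, then rebuild a Point from those bytes. */
Point Point_round_trip(Point p) {
    char data[sizeof(Point)];
    memcpy(data, &p, sizeof(Point));    /* object -> bytes */

    Point copy;
    memcpy(&copy, data, sizeof(Point)); /* bytes -> object */
    return copy;
}

/* Hypothetical helper showing that the value survives the trip. */
void Point_example(void) {
    Point p = { .x = 42, .y = 43 };
    Point q = Point_round_trip(p);
    assert(q.x == 42 && q.y == 43);
}
```

The only facts the array needs to know about Point are its size and that its bytes can be copied, which is exactly what a generic container can rely on.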
This technique is the foundation of our generic stack type, UnsafeStack.4 UnsafeStack
has a char* data field instead of int* data, and an additional objsize field to track
how many bytes each object occupies:
typedef struct {
    size_t len, capacity;
    size_t objsize;
    char* data;
} UnsafeStack;
/* ... */
        if (!new_data) {
            /* TODO: Handle memory error. */
            return;
        }
        stck->capacity = new_capacity;
        stck->data = new_data;
    }
/* ... */
    stck->len--;
    return stck->data + (stck->len * stck->objsize);
}
Errors can be signalled by returning a null pointer, so we don't need a Result object for
UnsafeStack.
UnsafeStack's API is a bit different from IntStack's. When pushing a value onto the
stack, we provide a pointer to it rather than the element itself, and when popping a
value, we cast the return value and then dereference it.
UnsafeStack_push(&unsafe_stack, &i);
int v = *(int*)UnsafeStack_pop(&unsafe_stack);
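For reference, here is one way the whole UnsafeStack could be written so that the two calls above work. This is a sketch under assumptions, not the post's exact code. The constructor signature, the const void* parameter of push, the void* return type of pop, UnsafeStack_free, and UnsafeStack_example are all guesses made for illustration. It also assumes objsize is greater than zero and exits on allocation failure.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t len, capacity;
    size_t objsize;
    char* data;
} UnsafeStack;

/* Assumed constructor: objsize is the size in bytes of one element. */
UnsafeStack UnsafeStack_new(size_t objsize) {
    size_t capacity = 8;
    char* data = malloc(capacity * objsize);
    if (!data) {
        exit(EXIT_FAILURE); /* assumption of this sketch */
    }
    UnsafeStack stck = { .len = 0, .capacity = capacity,
                         .objsize = objsize, .data = data };
    return stck;
}

void UnsafeStack_free(UnsafeStack* stck) {
    if (stck) {
        free(stck->data);
    }
}

/* Copy objsize bytes from *value onto the top of the stack. */
void UnsafeStack_push(UnsafeStack* stck, const void* value) {
    if (!stck || !value) {
        return;
    }
    if (stck->len + 1 > stck->capacity) {
        size_t new_capacity = stck->capacity * 2;
        char* new_data = realloc(stck->data, new_capacity * stck->objsize);
        if (!new_data) {
            return;
        }
        stck->capacity = new_capacity;
        stck->data = new_data;
    }
    memcpy(stck->data + (stck->len * stck->objsize), value, stck->objsize);
    stck->len++;
}

/* Return a pointer to the popped element, or NULL if there is none. */
void* UnsafeStack_pop(UnsafeStack* stck) {
    if (!stck || stck->len == 0) {
        return NULL;
    }
    stck->len--;
    return stck->data + (stck->len * stck->objsize);
}

/* Hypothetical helper mirroring the two calls shown above. */
void UnsafeStack_example(void) {
    UnsafeStack unsafe_stack = UnsafeStack_new(sizeof(int));
    for (int i = 0; i < 20; i++) {
        UnsafeStack_push(&unsafe_stack, &i);
    }
    int* top = UnsafeStack_pop(&unsafe_stack);
    assert(top && *top == 19);
    assert(unsafe_stack.len == 19);
    UnsafeStack_free(&unsafe_stack);
}
```

Two caveats follow from this design. The pointer returned by pop points into the stack's own buffer, so it is only valid until the next push or free, and callers should copy the value out promptly. Nothing stops a caller from casting that pointer to the wrong type, and the compiler cannot detect the mistake.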
Let's return to the fundamental problem. If we want type-safe stacks, then we have to
write a different, though virtually identical, implementation for each type. It's essentially
a problem of code duplication.
Fortunately, C has a mechanism for dealing with code duplication: macros. Rather than
writing out a full stack implementation for each type, we can use a macro to generate it
for us from a template. We will take our IntStack implementation, replace all uses of
int with a type parameter, and then wrap the whole implementation in a macro that is
parameterized on type. Whenever we want to use a stack for a new type, we'll call the
macro to generate the code for the implementation. As far as the C compiler is
concerned, it's as if we wrote a separate implementation for each type with the concrete
types in the code so that the compiler can type-check the code properly.
The macro code is a little hard to read, not the least because each line must end with a
backslash to continue the macro on to the next line, but it is recognizably almost the
same code as for IntStack. The syntax typename##_new, typename##_free, etc., tells
the preprocessor to paste typename, a macro parameter, onto literal tokens like _new or
_free, yielding identifiers like FloatStack_new or StringStack_free. The macro parameter
type is substituted wherever we had int in the original IntStack code.
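#define DECL_STACK(typename, type) \
    typedef struct { \
        size_t len, capacity; \
        type* data; \
    } typename; \
    \
    typedef struct { \
        bool error; \
        type result; \
    } typename##Result; \
    \
    typename typename##_new() { \
        size_t capacity = 8; \
        type* data = malloc(capacity * sizeof(type)); \
        if (!data) { /* TODO: Handle memory error. */ } \
        typename stck = { .len = 0, .capacity = capacity, .data = data }; \
        return stck; \
    } \
    \
    void typename##_free(typename* stck) { \
        if (stck) { \
            free(stck->data); \
        } \
    } \
    \
    size_t typename##_length(typename* stck) { \
        return stck ? stck->len : 0; \
    } \
    \
    void typename##_push(typename* stck, type value) { \
        if (!stck) { \
            return; \
        } \
        \
        if (stck->len + 1 > stck->capacity) { \
            size_t new_capacity = stck->capacity * 2; \
            type* new_data = realloc(stck->data, new_capacity * sizeof(type)); \
            \
            if (!new_data) { \
                return; \
            } \
            \
            stck->capacity = new_capacity; \
            stck->data = new_data; \
        } \
        \
        stck->len++; \
        stck->data[stck->len - 1] = value; \
    } \
    \
    typename##Result typename##_pop(typename* stck) { \
        if (!stck || stck->len == 0) { \
            typename##Result err = { .error = true }; \
            return err; \
        } \
        \
        stck->len--; \
        typename##Result ok = { .error = false, .result = stck->data[stck->len] }; \
        return ok; \
    }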
We then call DECL_STACK to declare a new stack type, either in a header file or at the top
level of a program:
DECL_STACK(SafeIntStack, int)
The resultant API achieves the safety and convenience of IntStack and the generality
of UnsafeStack :
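As an illustration of that API, the block below restates the macro compactly and declares two stack types from it. It is a sketch under assumptions: the tail of push and the whole of pop are assumed to mirror IntStack, exiting on allocation failure is this sketch's choice, and DoubleStack and SafeIntStack_example are hypothetical names added for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define DECL_STACK(typename, type) \
    typedef struct { \
        size_t len, capacity; \
        type* data; \
    } typename; \
    \
    typedef struct { \
        bool error; \
        type result; \
    } typename##Result; \
    \
    typename typename##_new(void) { \
        size_t capacity = 8; \
        type* data = malloc(capacity * sizeof(type)); \
        if (!data) { exit(EXIT_FAILURE); } \
        typename stck = { .len = 0, .capacity = capacity, .data = data }; \
        return stck; \
    } \
    \
    void typename##_free(typename* stck) { \
        if (stck) { free(stck->data); } \
    } \
    \
    void typename##_push(typename* stck, type value) { \
        if (!stck) { return; } \
        if (stck->len + 1 > stck->capacity) { \
            size_t new_capacity = stck->capacity * 2; \
            type* new_data = realloc(stck->data, new_capacity * sizeof(type)); \
            if (!new_data) { return; } \
            stck->capacity = new_capacity; \
            stck->data = new_data; \
        } \
        stck->data[stck->len++] = value; \
    } \
    \
    typename##Result typename##_pop(typename* stck) { \
        typename##Result r = { .error = true }; \
        if (stck && stck->len > 0) { \
            stck->len--; \
            r.error = false; \
            r.result = stck->data[stck->len]; \
        } \
        return r; \
    }

/* Each invocation generates a full, separately type-checked implementation. */
DECL_STACK(SafeIntStack, int)
DECL_STACK(DoubleStack, double)

/* Hypothetical helper: values go in and come out with their real type. */
void SafeIntStack_example(void) {
    SafeIntStack stck = SafeIntStack_new();
    SafeIntStack_push(&stck, 42);          /* takes an int, not a pointer */
    SafeIntStackResult r = SafeIntStack_pop(&stck);
    assert(!r.error && r.result == 42);    /* r.result is an int, no cast */
    assert(SafeIntStack_pop(&stck).error); /* empty stack reports an error */
    SafeIntStack_free(&stck);
}
```

Unlike UnsafeStack, passing a DoubleStack* to SafeIntStack_push is a type mismatch that the compiler can diagnose, because the two generated struct types are distinct.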
Note that SafeIntStack is still only as safe as C's type system, which on many compilers will,
for example, let you pass a string literal where an int is expected with only a warning.
This is a fundamental limitation that cannot be worked around.
The safe stack data structure has no overhead above hand-written code, except that
each new declaration increases the program size by a constant amount (unlike the
unsafe stack, which uses the same code for all data structures). Incidentally, this code
generation technique is essentially how templates are implemented behind the scenes
in C++.
That it is possible to write type-safe generic data structures in a language whose type
system does not natively support them speaks to C's flexibility.5 But the techniques we
used, unchecked pointer casts and unhygienic lexical macros, are themselves quite
unsafe if used improperly. Flexibility, simplicity and safety: a language can attain at
most two out of three. Rust and Haskell choose flexibility and safety. Go chooses safety
and simplicity.6 C chooses flexibility and simplicity. Each choice has its trade-offs. The
tendency for modern languages to prefer safety is a direct consequence of the
innumerable bugs found in C code that could have been prevented by stronger compile-
time guarantees. Nonetheless, as this post has shown, flexibility and simplicity are a
powerful combination.
1. I assume the reader is proficient in C. The classic text on C is The C Programming Language by Brian Kernighan
and Dennis Ritchie, universally known as K&R after the authors' initials. Look for the second edition. Though K&R
has aged remarkably well, best practices have been refined and some language features have changed since
1988. Modern C by Jens Gustedt is a great re-introduction to the modern language.
2. Here and throughout I've added TODOs to mark where a more robust implementation would need to handle an
edge case.
3. Traditionally, error handling in C is done either by returning a special error value, or "out-of-band" by setting the
global errno variable. I find the latter approach inelegant, and the former is impossible because any int value is
a possible legal return value of IntStack_pop.
4. GLib, the utility library for the GNOME desktop environment, uses this technique for its GArray generic array
type.
5. For more evidence, see Daniel Holden's Cello, a framework for high-level programming in C.
6. Debatable, perhaps. Go does allow some unsafe constructs like interface{}, but it largely abandons the reckless
permissiveness of C.