
A (mostly) comprehensive guide to calling C from Scheme and vice versa

Posted on 2014-10-16


When you're writing Scheme code in CHICKEN it's sometimes necessary to make a little excursion to
C. For example, you're trying to call a C library, you're writing extremely performance-critical code, or
you're working on something that's best expressed in C, such as code that requires a lot of bit-twiddling.
This post contains a lot of code, including generated C code. If you get too tired to absorb it, it's
probably best to stop reading and pick it up again later.

A basic example of invoking C code from CHICKEN


This is one of CHICKEN's strengths: the ability to quickly drop down to C for a small bit of code, and
return its result to Scheme:
(import foreign)

(define ilen
(foreign-lambda* long ((unsigned-long x))
"unsigned long y;\n"
"long n = 0;\n"
"#ifdef C_SIXTY_FOUR\n"
"y = x >> 32; if (y != 0) { n += 32; x = y; }\n"
"#endif\n"
"y = x >> 16; if (y != 0) { n += 16; x = y; }\n"
"y = x >> 8; if (y != 0) { n += 8; x = y; }\n"
"y = x >> 4; if (y != 0) { n += 4; x = y; }\n"
"y = x >> 2; if (y != 0) { n += 2; x = y; }\n"
"y = x >> 1; if (y != 0) C_return(n + 2);\n"
"C_return(n + x);"))

(print "Please enter a number")


(print "The length of your integer in bits is " (ilen (read)))

This example is taken from a wonderful little book called "Hacker's Delight", by Henry S. Warren. It
calculates the number of bits required to represent an unsigned integer (its "length"). By the way, this
procedure is provided by the numbers egg as integer-length. The algorithm is implementable in
Scheme, but at least a direct translation to Scheme is nowhere near as readable as it is in C:
(define (ilen x)
(let ((y 0) (n 0))
(cond-expand
(64bit
(set! y (arithmetic-shift x -32))
(unless (zero? y) (set! n (+ n 32)) (set! x y)))
(else))
(set! y (arithmetic-shift x -16))
(unless (zero? y) (set! n (+ n 16)) (set! x y))
(set! y (arithmetic-shift x -8))
(unless (zero? y) (set! n (+ n 8)) (set! x y))
(set! y (arithmetic-shift x -4))
(unless (zero? y) (set! n (+ n 4)) (set! x y))
(set! y (arithmetic-shift x -2))
(unless (zero? y) (set! n (+ n 2)) (set! x y))
(set! y (arithmetic-shift x -1))
(if (not (zero? y)) (+ n 2) (+ n x))))

The performance of the Scheme version is also going to be worse than that of the C version. All in all,
plenty of good reasons to prefer integration with C. There's no shame in that: most fast languages
forego "pure" implementations in favour of C for performance reasons. The only difference is that
calling C in other languages is often a bit more work.

Analysing the generated code


In this section we'll unveil the internal magic which makes C so easily integrated with Scheme. You can
skip this section if you aren't interested in low-level details.
As you might have noticed, the C code in the example above contains one unfamiliar construct: It uses
C_return() to return the result. If you inspect the code generated by CHICKEN after compiling it
via csc -k test.scm, you'll see that it inserts some magic to convert the C number to a Scheme
object. I've added some annotations and indented for readability:
/* Local macro definition to convert returned long to a Scheme object. */
#define return(x) \
C_cblock C_r = (C_long_to_num(&C_a,(x))); goto C_ret; C_cblockend

/* Prototype declaring the stub procedure as static, returning a
 * C_word (Scheme object) and passing arguments through registers.
 * It's not strictly necessary in this case.
 */
static C_word C_fcall stub7(C_word C_buf, C_word C_a0) C_regparm;

/* The stub function: it gets passed a buffer in which Scheme objects get
* allocated (C_buf) and the numbered arguments C_a0, C_a1, ... C_an.
*/
C_regparm static C_word C_fcall stub7(C_word C_buf, C_word C_a0)
{
C_word C_r = C_SCHEME_UNDEFINED, /* Return value, mutated by return() macro */
*C_a=(C_word*)C_buf; /* Allocation pointer used by return() macro */

/* Conversion of input argument from Scheme to C */
unsigned long x = (unsigned long)C_num_to_unsigned_long(C_a0);

/* Start of our own code from the foreign-lambda* body, as-is */
unsigned long y;
long n = 0;
#ifdef C_SIXTY_FOUR
y = x >> 32; if (y != 0) { n += 32; x = y; }
#endif
y = x >> 16; if (y != 0) { n += 16; x = y; }
y = x >> 8; if (y != 0) { n += 8; x = y; }
y = x >> 4; if (y != 0) { n += 4; x = y; }
y = x >> 2; if (y != 0) { n += 2; x = y; }
y = x >> 1; if (y != 0) C_return(n + 2);
C_return(n + x);
C_ret: /* Label for goto in the return() macro */
#undef return
return C_r; /* Regular C return */
}

/* chicken.h contains the following: */
#define C_return(x) return(x)
#define C_cblock do{
#define C_cblockend }while(0)

In the foreign-lambda*, I used C_return for clarity: I could have just used return with
parentheses, which will get expanded by the C preprocessor. This is somewhat confusing: return n
+ x; will bypass the macro (function-like macros only expand when the name is followed by parentheses), returning a raw, unconverted value, whereas return(n+x); will do the same as C_return(n+x);.

The return macro calls C_long_to_num, which will construct a Scheme object, which is either a
fixnum (small exact integer) or a flonum (floating-point inexact number), depending on the platform
and the size of the returned value. Hopefully, in CHICKEN 5 it will be either a fixnum or a bignum -
that way, it'll always be an exact integer.
Because these number objects need to get allocated on the stack to integrate with the garbage collector,
the calling code needs to set aside enough memory on the stack to fit these objects. That's what the
C_buf argument is for: it's a pointer to this area. In CHICKEN, a whole lot of type punning is going
on, so it's passed as a regular C_word rather than as a proper pointer, but let's ignore that for now.

The stub function above is used to do the actual work, but in order to integrate it into CHICKEN's
calling conventions and garbage collector, an additional wrapper function is generated. It corresponds
to the actual Scheme "ilen" procedure, and looks like this:
/* ilen in k197 in k194 in k191 */
static void C_ccall f_201(C_word c, C_word t0, C_word t1, C_word t2)
{
C_word tmp /* Unused */; C_word t3; C_word t4; C_word t5; /* Temporaries */
C_word ab[6], *a=ab; /* Memory area set aside on stack for allocation */

if(c != 3) C_bad_argc_2(c, 3, t0); /* Check argument count is correct */

C_check_for_interrupt; /* Check pending POSIX signals, and thread timeout */

if(!C_stack_probe(&a)) { /* Stack full? Then perform GC and try again. */
C_save_and_reclaim((void*)tr3, (void*)f_201, 3, t0, t1, t2);
}
t3 = C_a_i_bytevector(&a,1,C_fix(4)); /* Needed to have a proper object */
t4 = C_i_foreign_unsigned_integer_argumentp(t2); /* Check argument type */
t5 = t1; /* The continuation of the call to ilen */
/* Call stub7 inline, and pass result to continuation: */
((C_proc2)(void*)(*((C_word*)t5+1)))(2, t5, stub7(t3, t4));
}

The comment at the top indicates the name of the Scheme procedure and its location in the CPS-converted Scheme code. The k197 in k194 etc. indicate the nesting in the generated continuations,
which can sometimes be useful for debugging. These continuations can be seen in the CPS-converted
code by compiling with csc -debug 3 test.scm.
You might recognise much of this code from my article about the CHICKEN garbage
collector: C_stack_probe() corresponds to that post's fits_on_stack(), and
C_save_and_reclaim() combines that post's SCM_save_call() and SCM_minor_GC().

All Scheme procedures get compiled down to C functions which receive their argument count (c) and
the closure from which they're invoked (t0), so they can access local closure variables (not
used here) and re-invoke the closure after a GC. Finally, they receive the
continuation of the call (t1) and any procedure arguments (everything after it, here only t2). If a
procedure has a variable number of arguments, that will use C's varargs mechanism, which is why
passing the argument count to every function is important. If a function is called with too many or too
few arguments, this will "just work", even if the arguments are declared in the function prototype like
here: The function is invoked correctly, but the stack will contain rubbish instead of the expected
arguments. That's why it's important to first check the argument count, and then check whether a GC
needs to be performed; otherwise, this rubbish gets saved by save_and_reclaim and the GC will
attempt to traverse it as if it contained proper Scheme objects, resulting in segfaults or other nasty
business.
The variable t3 will contain the buffer in which the return type is stored. It is wrapped in a byte vector,
because this makes it a first-class object understood by the garbage collector. That's not necessary here,
but this code is pretty generic and is also used in cases where it is necessary. The C_word ab[6]
declaration sets aside enough memory space to hold a flonum or a fixnum, which need at most 4 machine words,
plus 2 words for the bytevector wrapper. I will explain these details later in a separate post, but let's
assume it's OK for now.
The argument type gets checked just before calling the C function. If the argument is not of the correct
type, an error is signalled and the function will be aborted. The returned value is simply the input, so
t4 will contain the same value as t2. Similarly, t1 gets copied as-is to t5. Finally, the continuation
gets cast to the correct procedure type (again: a lot of type punning. I will explain this in another post),
and invoked with the correct argument count (2), the continuation closure itself, and the return value of
the stub function.

Returning complex Scheme objects from C


I've tried to explain above how the basic C types get converted to Scheme objects, but what if we want
to get crazy and allocate Scheme objects in C? A simple foreign-lambda* won't suffice, because
the compiler has no way of knowing how large a buffer to allocate, and the C function will return, so
we'll lose what's on the stack.
To fix that, we have foreign-safe-lambda*, which will allow us to allocate any object on the
stack. Before such a function is invoked, a minor garbage collection is triggered to clean the stack and
ensure we have plenty of allocation room. Let's look at a simple example. This program displays the
list of available network interfaces on a UNIX-like system:
(import foreign)
(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define interfaces
(foreign-safe-lambda* scheme-object ()
"C_word lst = C_SCHEME_END_OF_LIST, len, str, *a;\n"
"struct ifaddrs *ifa, *i;\n"
"\n"
"if (getifaddrs(&ifa) != 0)\n"
" C_return(C_SCHEME_FALSE);\n"
"\n"
"for (i = ifa; i != NULL; i = i->ifa_next) {\n"
" len = strlen(i->ifa_name);\n"
" a = C_alloc(C_SIZEOF_PAIR + C_SIZEOF_STRING(len));\n"
" str = C_string(&a, len, i->ifa_name);\n"
" lst = C_a_pair(&a, str, lst);\n"
"}\n"
"\n"
"freeifaddrs(ifa);\n"
"C_return(lst);\n"))

(print "The following interfaces are available: " (interfaces))

This functionality is not available in CHICKEN because it's not very portable (it's not in POSIX), so it's
a good example of something you might want to use C for. Please excuse the unSchemely way of error
handling by returning #f for now. We'll fix that in a later section.

Looking at our definition, the interfaces procedure has no arguments, and it returns a scheme-
object. This type indicates to CHICKEN that the returned value is not to be converted but simply
used as-is: we'll handle its creation ourselves.
We declare the return value lst, which gets initialised to the empty list, and two temporary variables:
len and str, to keep an intermediate string length and to hold the actual CHICKEN string. The
variable a is an allocation pointer. Then we have the two variables which hold the start of the linked list
of interfaces, ifa, and the current iterator through this list, i.

We retrieve the linked list (returning #f if that fails), and scan through it until we hit the end. For each
entry, we check the length of the interface name string and allocate enough room on the stack to
hold a pair and a CHICKEN string of the same length (C_alloc() is really just alloca()). The
C_SIZEOF... macros are very convenient to help us calculate the size of an object without having to
know its exact representation in memory. We then create the CHICKEN string using C_string,
which is put into the allocated space stored in a, and we create a pair which holds the string in the car
and the previous list as its cdr.

These allocating C_a_pair and C_string functions accept a pointer to the allocated space (which
itself is a pointer). This means they can advance the pointer's value beyond the object, to the next free
position. This is quite nice, because it allows us to call several allocating functions in a row, with the
same pointer, and at the end the pointer points past the object that was allocated last. Finally, we release
the memory used by the linked list and return the constructed list.
Analysing the generated code
Like before, if you're not interested in the details, feel free to skip this section.
The interfaces foreign code itself compiles down to this function:
/* Like before, but no conversion because we "return" a native object: */
#define return(x) C_cblock C_r = (((C_word)(x))); goto C_ret; C_cblockend

/* The prototype _is_ necessary in this case: it declares the function
 * as never returning via C_noret, which maps to __attribute__((noreturn)).
 */
static void C_ccall stub6(C_word C_c, C_word C_self,
C_word C_k, C_word C_buf) C_noret;

/* The arguments to the stub function now include the argument count,
* the closure itself and the continuation in addition to the buffer
* and arguments (none here). This is a truly "native" CHICKEN function!
*/
static void C_ccall stub6(C_word C_c, C_word C_self, C_word C_k, C_word C_buf)
{
C_word C_r = C_SCHEME_UNDEFINED,
*C_a = (C_word *)C_buf;

/* Save callback depth; needed if we want to call Scheme functions */
int C_level = C_save_callback_continuation(&C_a, C_k);

/* Start of our own code, as-is: */
struct ifaddrs *ifa, *i;
C_word lst = C_SCHEME_END_OF_LIST, len, str, *a;

if (getifaddrs(&ifa) != 0)
C_return(C_SCHEME_FALSE);

for (i = ifa; i != NULL; i = i->ifa_next) {
len = strlen(i->ifa_name);
a = C_alloc(C_SIZEOF_PAIR + C_SIZEOF_STRING(len));
str = C_string(&a, len, i->ifa_name);
lst = C_a_pair(&a, str, lst);
}

freeifaddrs(ifa);
C_return(lst);

C_ret:
#undef return

/* Pop continuation off callback stack. */
C_k = C_restore_callback_continuation2(C_level);

C_kontinue(C_k, C_r); /* Pass return value to continuation. */
}

This is not much different from the foreign-lambda* example, but notice that the arguments are
different: this stub looks exactly like the C code generated from an actual Scheme continuation: it gets
passed the argument count, its own closure and its continuation. Instead of ending with a regular return
from C, it invokes a continuation. This is the crucial difference which integrates our code with the
garbage collector: by passing it to the next continuation's C function, the "returned" value is preserved
on the stack. In other words, it is allocated directly in the nursery.
Even though the stub is a "native" Scheme procedure, a wrapper is still generated: if the
foreign-safe-lambda is defined to accept C arguments, it'll still need to convert from Scheme objects, it
needs to check the argument count, and it needs to invoke the GC before the procedure can be called:
/* interfaces in k197 in k194 in k191 */
static void C_ccall f_201(C_word c, C_word t0, C_word t1){
/* This is the function that corresponds to the Scheme procedure.
* This is the first stage of the procedure: we invoke the GC with
* a continuation which will do conversions and call the C stub.
*/
C_word tmp; C_word t2; C_word t3;
C_word ab[3], *a = ab;

/* As before: */
if (c!=2) C_bad_argc_2(c, 2, t0);

C_check_for_interrupt;

if (!C_stack_probe(&a)) {
C_save_and_reclaim((void*)tr2,(void*)f_201,2,t0,t1);
}

/* Create the continuation which will be invoked after GC: */
t2 = (*a = C_CLOSURE_TYPE|2, /* A closure of size two: */
a[1] = (C_word)f_205, /* Second stage function of our wrapper, */
a[2] = t1, /* and continuation of call to (interfaces). */
tmp = (C_word)a, /* Current value of "a" must be stored in t2...*/
a += 3, /* ... but "a" itself gets advanced... */
tmp); /* ... luckily tmp holds original value of a. */

C_trace("test.scm:8: ##sys#gc"); /* Trace call chain */

/* lf[1] contains the symbol ##sys#gc. This invokes its procedure. */
((C_proc3)C_fast_retrieve_symbol_proc(lf[1]))(3, *((C_word*)lf[1]+1),
t2, C_SCHEME_FALSE);
}

/* k203 in interfaces in k197 in k194 in k191 */
static void C_ccall f_205(C_word c, C_word t0, C_word t1)
{
/* This function gets invoked from the GC triggered by the above function,
* and is the second stage of our wrapper function. It is similar to the
* wrapper from the first example of a regular foreign-lambda.
*/
C_word tmp; C_word t2; C_word t3; C_word t4;
/* Enough room for a closure of 2 words (total size 3) and a bytevector
* of 3 words (total size 4). This adds up to 7; the missing 1 is to
* make room for a possible alignment of the bytevector on 32-bit platforms.
*/
C_word ab[8], *a=ab;

C_check_for_interrupt;

if (!C_stack_probe(&a)) {
C_save_and_reclaim((void*)tr2, (void*)f_205, 2, t0, t1);
}

t2 = C_a_i_bytevector(&a, 1, C_fix(3)); /* Room for one pair */

t3 = (*a = C_CLOSURE_TYPE|2, /* Create a closure of size 2: */
a[1] = (C_word)stub6, /* Our foreign-safe-lambda stub function, */
a[2] = ((C_word)li0), /* and static lambda-info for same (unused). */
tmp = (C_word)a, /* Update "a" and return original value, */
a += 3, /* exactly like we did in f_201. */
tmp);

/* Trace procedure name generated by (gensym). Kind of useless :) */
C_trace("test.scm:8: g9");

t4 = t3; /* Compilation artefact; don't worry about it */

/* Retrieve procedure from closure we just created, and call it,
 * with 3 arguments: itself (t4), the continuation of the call
* to "interfaces" (t0[2]), and the bytevector buffer (t2).
*/
((C_proc3)C_fast_retrieve_proc(t4))(3, t4, ((C_word*)t0)[2], t2);
}

Our foreign-lambda's wrapper function now consists of two stages. The first stage first creates a
continuation for the usual wrapper function. Then it calls the garbage collector to clear the stack, after
which this wrapper-continuation is invoked. This wrapper is the second function here, and it
corresponds closely to the wrapper function we saw in the ilen example. However, this wrapper
constructs a closure around the C stub function instead of simply calling it. This closure is then called:
C_fast_retrieve_proc simply extracts the function from the closure object we just created; it is
cast to a 3-argument procedure type and invoked with the continuation of the interfaces call site.

You can see how closures are created in CHICKEN. I will explain this in depth in a future blog post,
but the basic approach is pretty clever: the whole thing is one big C expression which stores successive
words at the free slots in the allocated space a, while ensuring that after the expression a will point at
the next free word. The dance with tmp ensures that the whole expression which allocates the closure
results in the initial value of a. That initial value was the first free slot before we executed the
expression, and afterwards it holds the closure. Don't worry if this confuses you :)

Calling Scheme from C


Now, with the basics out of the way, let's do something funkier: instead of calling C from Scheme, we
call Scheme from C! There is a C API for embedding CHICKEN in a larger C program, but that's not
what you should use when calling Scheme from C code that was itself called from Scheme.

The "easy" way


Our little interfaces-listing program has one theoretical flaw: the list of interfaces could be very long
(or the names could be long), so we could theoretically run out of stack space. We should therefore avoid
allocating unbounded lists directly on the stack without checking for overflow. Instead, let's pass the
allocated objects to a callback procedure which prints the interface, in a "streaming" fashion.
As I explained before, a regular foreign-lambda uses the C stack in the regular way: it doesn't
know about continuations or the Cheney on the MTA garbage collection style, and there's no way to
call CHICKEN functions from there, because the GC would "collect" away the C function by
longjmp()ing past it. However, the foreign-safe-lambda has a special provision for that: it
can "lock" the current live data by putting a barrier between this C function and the Scheme code it
calls:
(import foreign)

(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define interfaces
(foreign-safe-lambda* scheme-object ((scheme-object receiver))
"C_word len, str, *a;\n"
"struct ifaddrs *ifa, *i;\n"
"\n"
"if (getifaddrs(&ifa) != 0)\n"
" C_return(C_SCHEME_UNDEFINED);\n"
"\n"
"for (i = ifa; i != NULL; i = i->ifa_next) {\n"
" len = strlen(i->ifa_name);\n"
" a = C_alloc(C_SIZEOF_STRING(len));\n"
" str = C_string(&a, len, i->ifa_name);\n"
" C_save(str);\n"
" C_callback(receiver, 1);\n"
"}\n"
"\n"
"freeifaddrs(ifa);\n"
"C_return(C_SCHEME_UNDEFINED);\n"))

(print "The following interfaces are available: ")
(interfaces print)

This will display the interfaces one line at a time, by using CHICKEN's print procedure as the
callback.
We won't look at the compiled source code for this implementation, because it is identical to the earlier
one, except for the changed foreign-lambda body. The implementation of C_callback() is of
interest, but it is a little hairy, so I'll leave it to you to explore it yourself.
The basic idea is rather simple, though: it simply calls setjmp() to establish a new garbage
collection trampoline. This means that the foreign-lambda will always remain on the stack. The
callback is then invoked with a continuation which sets a flag to indicate that the callback has returned
normally, in which case its result will be returned to the foreign-lambda. If it didn't return normally, we
arrived at the trampoline because a GC was triggered. This means the remembered continuation will be
re-invoked, as usual.
However, when the callback did return normally, we can simply return the returned value because the
foreign-lambda's stack frame is still available due to the GC barrier we set up.
The C_save macro simply saves the callback's arguments on a special stack which is read by
C_do_apply. It is also used by callback_return_continuation: it saves the value and
triggers a GC to force the returned value into the heap. That way, we can return it safely to the previous
stack frame without it getting clobbered by the next allocation.

A harder way
The above code has another flaw: if the callback raises an exception, the current exception handler will
be invoked with the continuation where it was established. However, that might never return to the
callback, which means we have a memory leak on our hands!
If the callback doesn't return normally, the foreign-lambda will remain on the stack forever. How do we
avoid that little problem? The simplest way is of course to wrap the callback's code in
handle-exceptions or condition-case. However, that's no fun at all.

Besides, in real-world code we want to avoid the overhead of a GC every single time we invoke a C
function, so foreign-safe-lambda is not really suitable for functions that are called in a tight
loop. In such cases, there is only one way: to integrate deeply with CHICKEN and write a completely
native procedure! Because truly native procedures must call a continuation when they want to pass a
result somewhere, we'll have to chop up the functionality into three procedures:
(import foreign)
(use lolevel) ; For "location"

(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define grab-ifa!
(foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
"if (getifaddrs(ifa) != 0)\n"
" *ifa = NULL;\n"))

(define free-ifa!
(foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
"freeifaddrs(*ifa);\n"))

(define next-ifa
(foreign-primitive (((c-pointer (c-pointer "struct ifaddrs")) ifa))
"C_word len, str, *a;\n"
"\n"
"if (*ifa) {\n"
" len = strlen((*ifa)->ifa_name);\n"
" a = C_alloc(C_SIZEOF_STRING(len));\n"
" str = C_string(&a, len, (*ifa)->ifa_name);\n"
" *ifa = (*ifa)->ifa_next;\n"
" C_kontinue(C_k, str);\n"
"} else {\n"
" C_kontinue(C_k, C_SCHEME_FALSE);\n"
"}"))

(define (interfaces)
;; Use a pointer which the C function mutates. We could also
;; return two values(!) from the "next-ifa" foreign-primitive,
;; but that complicates the code flow a little bit more.
;; Sorry about the ugliness of this!
(let-location ((ifa (c-pointer "struct ifaddrs"))
(i (c-pointer "struct ifaddrs")))
(grab-ifa! (location ifa))
(unless ifa (error "Could not allocate ifaddrs"))
(set! i ifa)

(handle-exceptions exn
(begin (free-ifa! (location ifa)) ; Prevent memory leak, and
(signal exn)) ; re-raise the exception
(let lp ((result '()))
(cond ((next-ifa (location i)) =>
(lambda (iface)
(lp (cons iface result))))
(else
(free-ifa! (location ifa))
result))))))

;; We're once again back to constructing a list!
(print "The following interfaces are available: " (interfaces))

This compiles to something very similar to the code behind a foreign-safe-lambda, but it's
obviously going to be a lot bigger due to it being cut up, so I won't duplicate the C code here.
Remember, you can always inspect it yourself with csc -k.

Anyway, this is like the foreign-safe-lambda, but without the implicit GC. Also, instead of "returning"
the value through C_return() we explicitly call the continuation C_k through the C_kontinue()
macro, with the value we want to pass on to the cond. If we wanted to return two values, we could
simply use the C_values() macro instead; we're free to do whatever Scheme can do, so we can even
return multiple values, as long as the continuation accepts them.
If an exception happens anywhere in this code, we won't get a memory leak due to the stack being
blown up. However, like in any C code, we need to free up the memory behind the interface addresses.
So we can't really escape our cleanup duty!
You might think that there's one more problem with foreign-primitive: because it doesn't force
a GC before calling the C function, there's still no guarantee about how much space you still have on
the stack. Luckily, CHICKEN has a C_STACK_RESERVE, which defines how much space is
guaranteed to be left on the stack after each C_demand(). Its value is currently 0x10000 (i.e., 64
KiB), which means you have some headroom to do basic allocations like we do here, but you shouldn't
allocate too many huge objects. There are ways around that, but unfortunately not using the "official"
FFI (that I'm aware of, anyway). For now we'll stick with the official Scheme API.

The die-hard way: calling Scheme closures from C


So far, we've discussed pretty much only things you can find in the CHICKEN manual's section on the
FFI. Let's take a look at how we can do things a little differently, and instead of passing the string or #f
to a continuation, we pass the callback as a procedure again, just like we did for the "easy" way:
(import foreign)
(use lolevel)

(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define grab-ifa!
(foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
"if (getifaddrs(ifa) != 0)\n"
" *ifa = NULL;\n"))

(define free-ifa!
(foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
"freeifaddrs(*ifa);\n"))

(define next-ifa
(foreign-primitive (((c-pointer (c-pointer "struct ifaddrs")) ifa)
(scheme-object more) (scheme-object done))
"C_word len, str, *a;\n"
"\n"
"if (*ifa) {\n"
" len = strlen((*ifa)->ifa_name);\n"
" a = C_alloc(C_SIZEOF_STRING(len));\n"
" str = C_string(&a, len, (*ifa)->ifa_name);\n"
" *ifa = (*ifa)->ifa_next;\n"
" ((C_proc3)C_fast_retrieve_proc(more))(3, more, C_k, str);\n"
;; Alternatively:
;; " C_save(str); \n"
;; " C_do_apply(2, more, C_k); \n"
;; Or, if we want to call Scheme's APPLY directly (slower):
;; " C_apply(5, C_SCHEME_UNDEFINED, C_k, more, \n"
;; " str, C_SCHEME_END_OF_LIST); \n"
"} else {\n"
" ((C_proc2)C_fast_retrieve_proc(done))(2, done, C_k);\n"
;; Alternatively:
;; " C_do_apply(0, done, C_k); \n"
;; Or:
;; " C_apply(4, C_SCHEME_UNDEFINED, C_k, done, C_SCHEME_END_OF_LIST);\n"
"}"))

(define (interfaces)
(let-location ((ifa (c-pointer "struct ifaddrs"))
(i (c-pointer "struct ifaddrs")))
(grab-ifa! (location ifa))
(unless ifa (error "Could not allocate ifaddrs"))
(set! i ifa)

(handle-exceptions exn
(begin (free-ifa! (location ifa))
(signal exn))
(let lp ((result '()))
(next-ifa (location i)
(lambda (iface) ; more
(lp (cons iface result)))
(lambda () ; done
(free-ifa! (location ifa))
result))))))

(print "The following interfaces are available: " (interfaces))


The magic lies in the expression ((C_proc3)C_fast_retrieve_proc(more))(3, more,
C_k, str). We've seen something like it before in generated C code snippets: First, it extracts the C
function pointer from the closure object in more. Then, the function pointer is cast to the correct type;
C_proc3 refers to a procedure which accepts three arguments. This excludes the argument count,
which actually is the first argument in the call. The next argument is the closure itself, which is needed
when the closure has local variables it refers to (like result and lp in the example). The argument
after the closure is its continuation. We just pass on C_k: the final continuation of both more and
done is the continuation of lp, which is also the continuation of next-ifa. Finally, the arguments
following the continuation are the values passed as arguments: iface for the more closure.

The done closure is invoked as C_proc2 with only itself and the continuation, but no further
arguments. This corresponds to the fact that done is just a thunk.

I've shown two alternative ways to call the closure. The first is to call the closure through the
C_do_apply function. This is basically a dispatcher which checks the argument count and uses the
correct C_proc<n> cast and then calls it with the arguments, taken from a temporary stack on which
C_save places the arguments. The implementation behind it is positively insane, and worth checking
out for the sheer madness of it.
The second alternative is to use C_apply, which is the C implementation of Scheme's apply
procedure. It's a bit awkward to call from C, because this procedure is a true Scheme procedure. That
means it accepts an argument count, itself and its continuation and only then its arguments, which are
the closure and the arguments to pass to the closure, with the final argument being a list:
(apply + 1 2 '(3 4)) => 10

In C this would be:


C_apply(6, C_SCHEME_UNDEFINED, C_k, C_closure(&a, 1, (C_word)C_plus),
        C_fix(1), C_fix(2), C_list2(C_fix(3), C_fix(4)));

It also checks its arguments, so if you pass something that's not a list as its final argument, it raises a
nice exception:
(import foreign)

((foreign-primitive ()
   "C_word ab[C_SIZEOF_CLOSURE(1)], *a = ab; \n"
   "C_apply(4, C_SCHEME_UNDEFINED, C_k, "
   "  C_closure(&a, 1, (C_word)C_plus), C_fix(1));"))

This program prints the following when executed:


Error: (apply) bad argument type: 1
Call history:
test.scm:2: g11 <--

And this brings us to our final example, where we go absolutely crazy.


The guru way: Calling Scheme closures you didn't receive
You might have noticed that the error message above appears without us passing the error procedure
to +, and if you had wrapped the call in an exception handler it would've called its continuation,
without us passing it to the procedure. In some situations you might like to avoid boring the user with
passing some procedure to handle some exceptional situation. Let's see if we can do something like that
ourselves!
It turns out to be pretty easy:
(import foreign)
(use lolevel)

(foreign-declare "#include \"sys/types.h\"")
(foreign-declare "#include \"sys/socket.h\"")
(foreign-declare "#include \"ifaddrs.h\"")

(define grab-ifa!
  (foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "if (getifaddrs(ifa) != 0)\n"
    "  *ifa = NULL;\n"))

(define free-ifa!
  (foreign-lambda* void (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "freeifaddrs(*ifa);\n"))

(define (show-iface-name x)
  (print x)
  #t)

(define next-ifa
  (foreign-primitive (((c-pointer (c-pointer "struct ifaddrs")) ifa))
    "C_word len, str, *a, show_sym, show_proc;\n"
    "\n"
    "if (*ifa) {\n"
    "  len = strlen((*ifa)->ifa_name);\n"
    "  a = C_alloc(C_SIZEOF_INTERNED_SYMBOL(15) + C_SIZEOF_STRING(len));\n"
    "  str = C_string(&a, len, (*ifa)->ifa_name);\n"
    "  *ifa = (*ifa)->ifa_next;\n"
    ;; The new bit:
    "  show_sym = C_intern2(&a, C_text(\"show-iface-name\"));\n"
    "  show_proc = C_block_item(show_sym, 0);\n"
    "  ((C_proc3)C_fast_retrieve_proc(show_proc))(3, show_proc, C_k, str);\n"
    "} else {\n"
    "  C_kontinue(C_k, C_SCHEME_FALSE);\n"
    "}"))

(define (interfaces)
  (let-location ((ifa (c-pointer "struct ifaddrs"))
                 (i (c-pointer "struct ifaddrs")))
    (grab-ifa! (location ifa))
    (unless ifa (error "Could not allocate ifaddrs"))
    (set! i ifa)

    (handle-exceptions exn
        (begin (free-ifa! (location ifa))
               (signal exn))
      (let lp ()
        ;; next-ifa now returns true if it printed an interface and is
        ;; ready to get the next one, or false if it reached the end.
        (if (next-ifa (location i))
            (lp)
            (free-ifa! (location ifa)))))))

(print "The following interfaces are available: ")
(interfaces)

This uses C_intern2 to look up the symbol for "show-iface-name" in the symbol table (or
intern it if it didn't exist yet). We store this in show_sym. Then, we look at the symbol's first slot,
where the value is stored for the global variable identified by the symbol. The value slot always exists,
but if the variable is unbound, its value is C_SCHEME_UNDEFINED. Anyway, we assume it's defined and we
call it like we did in the previous example: extract the function pointer from the closure and invoke it.
This particular example isn't very useful, but the technique can be used to invoke hook procedures, and
in fact the core itself uses it from barf() when it invokes ##sys#error-hook to construct and
raise an exception when an error situation occurs in the C runtime.
CHICKEN internals: the garbage collector
One of CHICKEN's coolest features has to be its unique approach to garbage collection. When
someone asked about implementation details (hi, Arthur!), I knew this would make for an interesting
blog post. This post is going to be long and technical, so hold on to your hats! Don't worry if you don't
get through this in one sitting.

Prerequisites
There's a whole lot of stuff that we'll need to explain before we get to the actual garbage collector.
CHICKEN's garbage collection (GC) strategy is deeply intertwined with its compilation strategy, so I'll
start by explaining the basics of that, before we can continue (pun intended) with the actual GC stuff.

A short introduction to continuation-passing style


The essence of CHICKEN's design is a simple yet brilliant idea by Henry Baker, described in his paper
CONS Should Not CONS Its Arguments, Part II: Cheney on the M.T.A. The paper is pretty terse, but
it's well-written, so I recommend you check it out before reading on. If you grok everything in it, you
probably won't get much out of my blog post and you can stop reading now. If you don't grok it, it's
probably a good idea to re-read it later.
Baker's approach assumes a Scheme to C compiler which uses continuation-passing style (CPS) as an
internal representation. This is the quintessential internal representation of Scheme programs, going all
the way back to the first proper Scheme compiler, RABBIT.
Guy L. Steele (RABBIT's author) did not use CPS to make garbage collection easier. In fact, RABBIT
had no GC of its own, as it relied on MacLISP as a target language (which compiled to PDP-10
machine code and had its own garbage collector). Instead, continuations allowed for efficient
implementation of nested procedure calls. It eliminated the need for a stack to keep track of this nesting
by simply returning the "next thing to do" to a driver loop which took care of invoking it. This made it
possible to write down iterative algorithms as a recursive function without causing a stack overflow.
Let's consider a silly program which sums up all the numbers in a list, and shows the result multiplied
by two:
(define (calculate-sum lst result)
  (if (null? lst)
      result
      (calculate-sum (cdr lst) (+ result (car lst)))))

(define (show-sum lst)
  (print-number (* 2 (calculate-sum lst 0))))

(show-sum '(1 2 3))

A naive compilation to C would look something like this (brutally simplified):


void entry_point() {
  toplevel();
  exit(0); /* Assume exit(1) is explicitly called elsewhere in case of error. */
}

void toplevel() {
  /* SCM_make_list() & SCM_fx() allocate memory. "fx" stands for "fixnum". */
  SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3));
  show_sum(lst);
}

SCM_obj* show_sum(SCM_obj *lst) {
  SCM_obj *result = calculate_sum(lst, SCM_fx(0));
  /* SCM_fx_times() allocates memory. */
  return SCM_print_number(SCM_fx_times(SCM_fx(2), result));
}

SCM_obj* calculate_sum(SCM_obj *lst, SCM_obj *result) {
  if (lst == SCM_NIL) { /* Optimised */
    return result;
  } else {
    /* SCM_fx_plus() allocates memory. */
    SCM_obj *tmp = SCM_cdr(lst);
    SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst));
    return calculate_sum(tmp, tmp2); /* Recur */
  }
}

SCM_obj *SCM_print_number(SCM_obj *data) {
  printf("%d\n", SCM_fx_to_integer(data));
  return SCM_VOID;
}

This particular implementation probably can't use a copying garbage collector like CHICKEN uses,
because the SCM_obj pointers which store the Scheme objects' locations would all become invalid.
But let's ignore that for now.
Due to the recursive call in calculate_sum(), the stack just keeps growing, and eventually we'll
get a stack overflow if the list is too long. Steele argued that this is a silly limitation which results in the
proliferation of special-purpose "iteration" constructs found in most languages. Also, he was convinced
that this just cramps the programmer's style: we shouldn't have to think about implementation details
like the stack size. In his time people often used goto instead of function calls as a performance hack.
This annoyed him enough to write a rant about it, which should be required reading for all would-be
language designers!
Anyway, a compiler can transparently convert our Scheme program into CPS, which would look
something like this after translation to C:
/* Set up initial continuation & toplevel call. */
void entry_point() {
  SCM_cont *cont = SCM_make_cont(1, &toplevel, SCM_exit_continuation);
  SCM_call *call = SCM_make_call(0, cont);
  SCM_driver_loop(call);
}

void SCM_driver_loop(SCM_call *call) {
  /* The trampoline to which every function returns its continuation. */
  while(true)
    call = SCM_perform_continuation_call(call);
}

SCM_call *toplevel(SCM_cont *cont) {
  SCM_cont *next = SCM_make_cont(1, &show_sum, cont);
  SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3));
  return SCM_make_call(1, next, lst);
}

SCM_call *show_sum(SCM_cont *cont, SCM_obj *lst) {
  SCM_cont *next = SCM_make_cont(1, &show_sum_continued, cont);
  SCM_cont *now = SCM_make_cont(2, &calculate_sum, next);
  return SCM_make_call(2, now, lst, SCM_fx(0));
}

SCM_call *calculate_sum(SCM_cont *cont, SCM_obj *lst, SCM_obj *result) {
  if (lst == SCM_NIL) { /* Optimised */
    return SCM_make_call(1, cont, result);
  } else {
    SCM_obj *tmp = SCM_cdr(lst);
    SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst));
    SCM_cont *now = SCM_make_cont(2, &calculate_sum, cont);
    return SCM_make_call(2, now, tmp, tmp2); /* "Recur" */
  }
}

SCM_call *show_sum_continued(SCM_cont *cont, SCM_obj *result) {
  SCM_cont *now = SCM_make_cont(1, &SCM_print_number, cont);
  SCM_obj *tmp = SCM_fx_times(SCM_fx(2), result);
  return SCM_make_call(1, now, tmp);
}

SCM_call *SCM_print_number(SCM_cont *cont, SCM_obj *data) {
  printf("%d\n", SCM_fx_to_integer(data));
  return SCM_make_call(1, cont, SCM_VOID);
}

In the above code, there are two new data types: SCM_cont and SCM_call.

An SCM_cont represents a Scheme continuation as a C function's address, the number of arguments
which it expects (minus one) and another continuation, which indicates where to continue after the C
function has finished. This sounds recursive, but as you can see the very first continuation created by
entry_point() is a specially prepared one which will cause the process to exit.

An SCM_call is returned to the driver loop by every generated C function: this holds a continuation
and the arguments with which to invoke it. SCM_perform_continuation_call() extracts the
SCM_cont from the SCM_call and invokes its C function with its continuation and the arguments
from the SCM_call. We won't dwell on the details of its implementation now, but assume this is some
magic which just works.
You'll also note that the primitives SCM_car(), SCM_cdr(), SCM_fx_plus() and
SCM_fx_times() do not accept a continuation. This is a typical optimisation: some primitives can
be inlined by the compiler. However, this is not required: you can make them accept a continuation as
well, at the cost of further splintering the C functions into small sections; the calculate_sum()
function would be split up into 4 separate functions if we did that.
Anyway, going back to the big picture we can see that this continuation-based approach consumes a
more or less constant amount of stack space, because each function returns to driver_loop. Baker's
fundamental insight was that the stack is there anyway (and it will be used by C), and if we don't need
it for tracking function call nesting, why not use it for something else? He proposed to allocate all
newly created objects on the stack. Because the stack would hopefully fit the CPU's cache in its
entirety, this could give quite a performance benefit.

Generational collection
To understand why keeping new data together on the stack can be faster, it's important to know that
most objects are quite short-lived. Most algorithms involve intermediate values, which are accessed
quite a bit during a calculation but are no longer needed afterwards. These values need to be stored
somewhere in memory. Normally you would store them together with all other objects in the main
heap, which may cause fragmentation of said heap. Fragmentation means that memory references may
cross page boundaries. This is slow, because it will clear out the CPU's memory cache and may even
require swapping it in, if the machine is low on memory.
On top of that, generating a lot of intermediate values means generating a lot of garbage, which will
trigger many GCs during which a lot of these temporary objects will be cleaned up. However, during
these GCs, the remaining longer-lived objects must also be analysed before it can be decided they can
stick around.
This is rather wasteful, and it turns out we can avoid doing so much work by categorising objects by
age. Objects that have just been created belong to the first generation and are stored in their own space
(called the nursery - I'm not kidding!), while those that have survived several GC events belong to
older generations, which each have their own space reserved for them. By keeping different generations
separated, you do not have to examine long-lived objects of older generations (which are unlikely to be
collected) when collecting garbage in a younger generation. This can save us a lot of wasted time.

Managing data on the stack


The Cheney on the M.T.A. algorithm as used by CHICKEN involves only two generations; one
generation consisting of newly created objects and the other generation consisting of older objects. In
this algorithm, new objects get immediately promoted (or tenured) to the old generation after a GC of
the nursery (or stack). Such a GC is called a minor GC, whereas a GC of the heap is called a major GC.
This minor GC is where the novelty lies: objects are allocated on the stack. You might wonder how that
can possibly work, considering the lengths I just went through to explain how CPS conversion gets rid
of the stack. Besides, by returning to the trampoline function whenever a new continuation is invoked,
anything you'd store on the stack would need to get purged (that's how the C calling convention works).
That's right! The way to make this work is pretty counter-intuitive: we go all the way back to the first
Scheme to C conversion I showed you and make it even worse. Whenever we want to invoke a
continuation, we just call its function. That means that the example program we started out with would
compile to this:
/* Same as before */
void entry_point() {
  SCM_cont *cont = SCM_make_cont(1, &toplevel, SCM_exit_continuation);
  SCM_call *call = SCM_make_call(0, cont);
  SCM_driver_loop(call);
}

SCM_call *saved_cont_call; /* Set by SCM_save_call, read by driver_loop */
jmp_buf empty_stack_state; /* Set by driver_loop, read by minor_GC */

void SCM_driver_loop(SCM_call *call) {
  /* Save registers (including stack depth and address in this function) */
  if (setjmp(empty_stack_state))
    call = saved_cont_call; /* Got here via longjmp()? Use stored call */

  SCM_perform_continuation_call(call);
}

void SCM_minor_GC() {
  /* ...
     Copy live data from stack to heap, which is a minor GC. Described later.
     ... */
  longjmp(empty_stack_state, 1); /* Restore registers (jump back to driver_loop) */
}

void toplevel(SCM_cont *cont) {
  if (!fits_on_stack(SCM_CONT_SIZE(0) + SCM_CALL_SIZE(1) +
                     SCM_FIXNUM_SIZE * 3 + SCM_PAIR_SIZE * 3)) {
    SCM_save_call(0, &toplevel, cont); /* Mutates saved_cont_call */
    SCM_minor_GC(); /* Will re-invoke this function from the start */
  } else {
    /* The below stuff will all fit on the stack, as calculated in the if() */
    SCM_cont *next = SCM_make_cont(1, &show_sum, cont);
    SCM_obj *lst = SCM_make_list(3, SCM_fx(1), SCM_fx(2), SCM_fx(3));
    SCM_call *call = SCM_make_call(1, next, lst);
    SCM_perform_continuation_call(call);
  }
}

void show_sum(SCM_cont *cont, SCM_obj *lst) {
  if (!fits_on_stack(SCM_CONT_SIZE(0) * 2 +
                     SCM_CALL_SIZE(2) + SCM_FIXNUM_SIZE)) {
    SCM_save_call(1, &show_sum, cont, lst);
    SCM_minor_GC();
  } else {
    SCM_cont *next = SCM_make_cont(1, &show_sum_continued, cont);
    SCM_cont *now = SCM_make_cont(2, &calculate_sum, next);
    SCM_call *call = SCM_make_call(2, now, lst, SCM_fx(0));
    SCM_perform_continuation_call(call);
  }
}

void calculate_sum(SCM_cont *cont, SCM_obj *lst, SCM_obj *result) {
  /* This calculation is overly pessimistic as it counts both arms
     of the if(), but this is acceptable */
  if (!fits_on_stack(SCM_CALL_SIZE(1) + SCM_FIXNUM_SIZE +
                     SCM_CONT_SIZE(1) + SCM_CALL_SIZE(2))) {
    SCM_save_call(2, &calculate_sum, cont, lst, result);
    SCM_minor_GC();
  } else {
    if (lst == SCM_NIL) {
      SCM_call *call = SCM_make_call(1, cont, result);
      SCM_perform_continuation_call(call);
    } else {
      SCM_obj *tmp = SCM_cdr(lst);
      SCM_obj *tmp2 = SCM_fx_plus(result, SCM_car(lst));
      SCM_cont *now = SCM_make_cont(2, &calculate_sum, cont);
      SCM_call *call = SCM_make_call(2, now, tmp, tmp2);
      SCM_perform_continuation_call(call); /* "Recur" */
    }
  }
}

void show_sum_continued(SCM_cont *cont, SCM_obj *result) {
  if (!fits_on_stack(SCM_CONT_SIZE(1) + SCM_CALL_SIZE(1) + SCM_FIXNUM_SIZE)) {
    SCM_save_call(1, &show_sum_continued, cont, result);
    SCM_minor_GC();
  } else {
    SCM_cont *now = SCM_make_cont(1, &SCM_print_number, cont);
    SCM_obj *tmp = SCM_fx_times(SCM_fx(2), result);
    SCM_call *call = SCM_make_call(1, now, tmp);
    SCM_perform_continuation_call(call);
  }
}

void SCM_print_number(SCM_cont *cont, SCM_obj *data) {
  if (!fits_on_stack(SCM_CALL_SIZE(1))) {
    SCM_save_call(1, &SCM_print_number, cont, data);
    SCM_minor_GC();
  } else {
    printf("%d\n", SCM_fx_to_integer(data));
    SCM_call *call = SCM_make_call(1, cont, SCM_VOID);
    SCM_perform_continuation_call(call);
  }
}

Whew! This program is quite a bit longer, but it isn't that different from the second program I showed
you. The main change is that none of the continuation functions return anything. In fact, these
functions, like Charlie in the M.T.A. song, never return. In the earlier version every function ended
with a return statement, now they end with an invocation of
SCM_perform_continuation_call().

To make things worse, allocating functions now also use alloca() to place objects on the stack. That
means that the stack just keeps filling up like the first compilation I showed you, so we're back to
where we started! However, this program is a lot longer due to one important thing: At the start of each
continuation function we first check to see if there's enough space left on the stack to accommodate the
objects this function will allocate.
If there's not enough space, we re-create the SCM_call with which this continuation function was
invoked using SCM_save_call(). This differs from SCM_make_call() in that it will not
allocate on the stack, but will use a separate area to set aside the call object. The pointer to that area is
stored in saved_cont_call.

SCM_save_call() can't allocate on the stack for a few reasons: The first and most obvious reason
is that the saved call wouldn't fit on the stack because we just concluded it is already full. Second, the
arguments to the call must be kept around even when the stack is blown away after the GC has
finished. Third, this stored call contains the "tip" of the iceberg of live data from which the GC will
start its trace. This is described in the next section.
After the minor GC has finished, we can jump back to the trampoline again. We use the setjmp()
and longjmp() functions for that. When the first call to SCM_driver_loop() is made, it will call
setjmp() to save all the CPU's registers to a buffer. This includes the stack and instruction pointers.
Then, when the minor GC finishes, it calls longjmp() to restore those registers. Because the stack
and instruction pointer are restored, this means execution "restarts" at the place in driver_loop()
where setjmp() was invoked. The setjmp() then returns again, but now with a nonzero value (it
was zero the first time). The return value is checked and the call is fetched from the special save area to
get back to where we were just before we performed the GC.
This is half the magic, so please make sure you understand this part!

The minor GC
The long story above served to set up all the context you need to know to dive into the GC itself, so
let's take a closer look at it.

Picking the "live" data from the stack


As we've seen, the GC is invoked when the stack has completely filled up. At this point, the stack is a
complete mess: it has many stack frames from all the function calls that happened between the previous
GC and now. These stack frames consist of return addresses for the C function calls (which we're not
even using), stack-allocated C data (which we don't need) and somewhere among that mess there are
some Scheme objects. These objects themselves also belong to two separate categories: the "garbage"
and the data that's still being used and needs to be kept around (the so-called live data). How on earth
are we going to pick only the interesting bits from that mess?
Like I said before, the saved call contains the "tip of the iceberg" of live data. It turns out this is all we
need to get at every single object which is reachable to the program. All you need to do is follow the
pointers to the arguments and the continuation stored in the call. For each of these objects, you copy
them to the heap and if they are compound objects you follow all the pointers to the objects stored
within them, and so on. Let's take a look at a graphical representation of how this works. In the picture
below I show the situation where a GC is triggered just after the second invocation of calculate-sum
(i.e., its first recursive call, with the list '(2 3)):
[Figure: the stack on the left, the heap on the right and a small static area at the bottom centre. The stack holds the frames of toplevel, show_sum and both invocations of calculate_sum, each containing SCM_call, SCM_cont, fixnum and pair objects; the saved call sits at the top, and arcs connect each pointer to its target.]
After the initial shock from seeing this cosmic horror has worn off, let's take a closer look. It's like a
box of tangled cords: if you take the time to carefully untangle them, it's easy, but if you try to do it all
at once, it'll leave you overwhelmed. Luckily, I'm going to talk you through it. (by the way: this is an
SVG so you can zoom in on details as far as you like using your browser's zooming functionality).
Let's start with the big picture: On the left you see the stack, on the right the heap after copying and in
the bottom centre there's a small area of statically allocated objects, which are not subject to GC. To get
your bearings, check the left margin of the diagram. I have attempted to visualise C stack frames by
writing each function's name above a line leading to the bottom of its frame.
Let's look at the most recently called function, at the top of the stack. This is the function which
initiated the minor GC: the second call to calculate_sum(). The shaded area shows the pointers
set aside by SCM_save_call(), which form the tip of the iceberg of live data. More on that later.

The next frame belongs to the first call to calculate_sum(). It has allocated a few things on the
stack. The topmost element on the stack is the last thing that's allocated due to the way the stack grows
upwards in this picture. This is a pointer to an SCM_call object, marked with "[call]", which is
the name of the variable which is stored there. If you go back to the implementation of
calculate_sum(), you can see that the last thing it does is allocate an SCM_call, and store its
pointer in call. The object itself just precedes the variable on the stack, and is marked with a thick
white border to group together the machine words from which it is composed. From bottom to top,
these are:
• A tag which indicates that this is a call containing 2 arguments,
• a pointer to an SCM_cont object (taken from the now variable),
• a pointer to an SCM_obj object (the cdr of lst, taken from tmp) and
• a pointer to an SCM_obj object (a fixnum, taken from tmp2).

Other compound objects are indicated in the same way.


You'll also have noticed the green, white and dashed arcs with arrow tips. These connect pointers to
their target addresses. The dashed ones on the right hand side of the stack indicate pointers that are used
for local variables in C functions or SCM_call objects. These pointers are unimportant to the garbage
collector. The ones on the left hand side of the stack are pointers from Scheme objects to other Scheme
objects. These are important to the GC. The topmost pointer inside the call object we just looked at has
a big dashed curve all the way down to the cdr of lst, and the one below it points at the value of
result, which is the fixnum 1.

If you look further down the stack, you'll see the show_sum procedure which doesn't really allocate
much: an SCM_call, the initial intermediate result (fixnum 0), and two continuations (next and now
in the C code). The bulk of the allocation happens in toplevel, which contains the call to
show_sum and allocates a list structure. This is on the stack in reverse order: first the pair X = (3 .
()), then the pair Y = (2 . <X>) and the pair Z = (1 . <Y>). The () is stored as SCM_NIL
in the static area, to which the cdr of the bottom-most pair object on the stack points, which is
represented by a long green line which swoops down to the static area.
Copying the live data to the heap
The green lines represent links from the saved call to the live data which we need to copy. You can
consider the colour green "contagious": imagine everything is white initially, except for the saved call.
Then, each line starting at the pointers of the call are painted green. The target object to which a line
leads is also painted green. Then, we recursively follow lines from pointers in that object and paint
those green, etc. The objects that were already in the heap or the static area are not traversed, so they
stay white.
When an object is painted green, it is also copied to the heap, which is represented by a yellow line.
The object is then overwritten by a special object which indicates that this object has been moved to the
heap. This special object contains a forwarding pointer which indicates the new location of the object.
This is useful when you have two objects which point to the same other object, like for example in this
code:
(let* ((a (list 3 2 1))
       (b (cons 4 a))
       (c (cons 4 a)))
  ...)

Here you have two lists (4 3 2 1) which share a common tail. If both lists are live at the moment of
GC, we don't want to copy this tail twice, because that would result in it being split into two distinct
objects. Then, a set-car! on a might only be reflected in b but not c, for example. The forwarding
pointers prevent this from happening by simply adjusting a copied object's constituent objects to point
to their new locations. Finally, after all data has been copied, all the newly copied objects are checked
again for references to objects which may have been relocated after the object was copied.
The precise algorithm that performs this operation is very clever. It requires only two pointers and a
while loop, but it still handles cyclic data structures correctly. The idea is that you do the copying I
described above in a breadth-first way: you only copy the objects stored in the saved call (without
touching their pointers). Next, you loop from the start of the heap to the end, looking at each object in
turn (initially, those are the objects we just copied). For these objects, you check their components, and
see whether they exist in the heap or in the stack. If they exist in the stack, you copy them over to the
end of the heap (again, without touching their pointers). Because they are appended to the heap, the end
pointer gets moved to the end of the last object, so the while loop will also take the newly copied
objects into consideration. When you reach the end of the heap, you're done. In C, that would look
something like this:
SCM_obj *slot;
int i, bytes_copied;
char *scan_start = heap_start;

for(i = 0; i < saved_object_count(saved_call); ++i) {
  obj = get_saved_object(saved_call, i);
  /* copy_object() is called "mark()" in CHICKEN.
     It also sets up a forwarding pointer at the original location */
  bytes_copied = copy_object(obj, heap_end);
  heap_end += bytes_copied;
}

while(scan_start < heap_end) {
  obj = (SCM_obj *)scan_start;
  for(i = 0; i < object_size(obj); ++i) {
    slot = get_slot(obj, i);
    /* Nothing needs to be done if it's in the heap or static area */
    if (exists_in_stack(slot)) {
      if (is_forwarding_ptr(slot)) {
        set_slot(obj, i, forwarding_ptr_target(slot));
      } else {
        bytes_copied = copy_object(slot, heap_end);
        set_slot(obj, i, heap_end);
        heap_end += bytes_copied;
      }
    }
  }
  scan_start += object_size(obj);
}

This algorithm is the heart of our garbage collector. You can find it in runtime.c in the CHICKEN
sources in C_reclaim(), under the rescan: label. The algorithm was invented in 1970 by C.J.
Cheney, and is still used in the most "state of the art" implementations. Now you know why Henry
Baker's paper is called "Cheney on the M.T.A." :)
After the data has been copied to the heap, the longjmp() in minor_GC() causes everything on the
stack to be blown away. Then, the top stack frame is recreated from the saved call. This is illustrated
below:
[Figure: the stack and heap just after the minor GC. The stack now holds only the frames of entry_point, driver_loop and the restored call to calculate_sum; the copied pairs, fixnums and continuations live in the heap; SCM_NIL, SCM_VOID and the exit continuation sit in the irreclaimable static area. Part of the stack is shaded red to mark stale, unreachable data.]
Everything in the shaded red area below the stack frame for driver_loop() is now unreachable
because there are no more pointers from live data pointing into this region of the stack. Any live
Scheme objects allocated here would have been copied to the heap, and all pointers which pointed there
relayed to this new copy. Unfortunately, this stale copy of the data will permanently stick around on the
stack, which means it is forever irreclaimable. It is therefore important that the entry point
consume as little stack space as possible.

The major GC
You might be wondering how garbage on the heap is collected. That's what the major GC is for.
CHICKEN initially only allocates a small heap area. The heap consists of two halves: a fromspace and
a tospace. The fromspace is the heap as we've seen it so far: in normal usage, this is the part that's used.
The tospace is always empty.
When a minor GC is copying data from the stack to the fromspace, it may cause the fromspace to fill
up. That's when a major GC is triggered: the data in the fromspace is copied to the tospace using
Cheney's algorithm. Afterwards, the areas are flipped: the old fromspace is now called tospace and the
old tospace is now called fromspace.
During a major GC, we have a slightly larger set of live data. It is not just the data from the saved call,
because that's only the stuff directly used by the currently running continuation. We also need to
consider global variables and literal objects compiled into the program, for example. These sorts of
objects are also considered live data. Aside from this, a major collection is performed the same way as
a minor collection.
The smart reader might have noticed a small problem here: what if the amount of garbage cleaned up is
less than the data on the stack? Then, the stack data can't be copied to the new heap because it simply is
too small. Well, this is when a third GC mode is triggered: a reallocating GC. This causes a new heap
to be allocated, twice as big as the current heap. This is also split in from- and tospace. Then, Cheney's
algorithm is performed on the old heap's fromspace, using one half of the new heap as tospace. When
it's finished, the new tospace is called fromspace, and the other half of the new heap is called tospace.
Then, the old heap is de-allocated.
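The semispace bookkeeping described above can be sketched in C roughly as follows. All names here are illustrative, not CHICKEN's actual implementation, and a plain memcpy stands in for the Cheney copy:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative semispace heap: two halves of equal size. */
typedef struct {
    char *fromspace, *tospace;
    size_t size;   /* size of each half */
    size_t used;   /* live bytes in fromspace */
} heap_t;

heap_t heap_create(size_t size) {
    heap_t h = { malloc(size), malloc(size), size, 0 };
    return h;
}

/* Major GC: after Cheney's algorithm has copied the live data into
 * tospace, the two halves simply swap roles. */
void heap_flip(heap_t *h, size_t live_bytes) {
    char *tmp = h->fromspace;
    h->fromspace = h->tospace;
    h->tospace = tmp;
    h->used = live_bytes;
}

/* Reallocating GC: allocate a new heap twice as big, copy the live data
 * into one half (memcpy stands in for Cheney's algorithm here), then
 * de-allocate the old heap. */
heap_t heap_grow(heap_t *old, size_t live_bytes) {
    heap_t h = heap_create(old->size * 2);
    memcpy(h.fromspace, old->fromspace, live_bytes);
    h.used = live_bytes;
    free(old->fromspace);
    free(old->tospace);
    return h;
}
```

Note how a flip is just a pointer swap: nothing is freed during a normal major collection, which is what makes it cheap.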

Some practical notes


The above situation is a pretty rough sketch of the way the GC works in CHICKEN. Many details have
been omitted, and the actual implementation is extremely hairy. Below I'll briefly mention how a few
important things are implemented.

Object representation
You might have noticed that the stack grows pretty quickly in the CPS-converted C code I showed you.
That's because the SCM_obj representation requires allocating every object and storing a pointer to it,
so that's a minimum of two machine words per object.
CHICKEN avoids this overhead for small, often-used objects like characters, booleans and fixnums. It
ensures all allocated objects are word-aligned, so all pointers to objects have their lower bits set to zero.
This means you can easily see whether something is a pointer to an object or something else.
All objects in CHICKEN are represented by a C_word type, which is the size of a machine word. So-called immediate values are stored directly inside the machine word, with nonzero lower bits. Non-immediate values are pointers to a C structure containing the type tag and data bits, much like the SCM_obj from the example.
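As a concrete illustration of this tagging trick, here is a minimal sketch in C. The macro names and exact bit assignments are mine and need not match CHICKEN's headers, but the principle is the same: fixnums are shifted left with the low bit set, so any word-aligned pointer (low bits zero) is distinguishable from an immediate:

```c
#include <stdint.h>

typedef intptr_t C_word;  /* one machine word, as in CHICKEN */

/* A fixnum loses one bit of range to the tag, but costs no allocation. */
#define make_fixnum(n)   ((((C_word)(n)) << 1) | 1)
#define fixnum_value(x)  ((x) >> 1)           /* arithmetic shift */
#define is_immediate(x)  (((x) & 1) != 0)     /* pointers are word-aligned */
```

With this encoding, a check like `is_immediate(x)` is a single AND instruction, which is why small objects such as fixnums, characters and booleans carry essentially no GC cost.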
Calls are not represented by objects in CHICKEN. Instead, the C function is simply invoked directly
from the continuation's caller. Continuations are represented as any other object. For didactic reasons, I
used a separate C type to distinguish it from SCM_obj, but in Scheme continuations can be reified as
first-class objects, so they shouldn't be represented in a fundamentally different way.

Closures
You might be wondering how closures are implemented, because this hasn't been discussed at all. The answer is pretty simple: in the example code, an SCM_call object stored a plain C function's address. Instead, we could store a closure: this is a new type of object which holds a C function plus its
local variables. Each C function receives this closure as an extra argument (in the CHICKEN sources
this is called self). When it needs to access a closed-over value, it can be accessed from the closure
object.
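The idea can be sketched in a few lines of C. The layout and names here are illustrative (CHICKEN's real representation differs in detail), but the mechanism is the one described: the closure object is passed as an extra self argument, and the function body fetches closed-over values from it:

```c
/* A closure: a code pointer plus the closed-over variables. */
typedef struct closure {
    int (*fn)(struct closure *self, int arg);
    int captured_x;   /* one closed-over variable */
} closure_t;

/* The function receives the closure as "self" and reads its
 * captured variable from there. */
static int add_captured(closure_t *self, int arg) {
    return self->captured_x + arg;
}
```

Calling `c.fn(&c, 5)` on a closure with `captured_x = 10` then behaves like calling a Scheme closure that captured `x = 10`.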

Mutations
Another major oversight is the assumption that objects can only point from the stack into the heap. If
Scheme was a purely functional language, this would be entirely accurate: new objects can refer to old
objects, but there is no way that a preexisting object can be made to refer to a newly created object. For
that, you need to support mutation.
But Scheme does support mutation! So what happens when you use vector-set! to store a newly
created, stack-allocated value in an old, heap-allocated vector? If we used the above algorithm, the
newly created element would either be part of the live set and get copied, but the vector's pointer would
not be updated, or it wouldn't be part of the live set and the object would be lost in the stack reset.
The answer to this problem is also pretty simple: we add a so-called write barrier. Whenever a value is
written to an object, it is remembered. Then, when performing a GC, these remembered values are
considered to be part of the live set, just like the addresses in the saved call. This is also the reason
CHICKEN always shows the number of mutations when you're asking for GC statistics: mutation may
slow down a program because GCs might take longer.
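A write barrier with a remembered set can be sketched like this. All names are illustrative, not CHICKEN's internals; the point is only that every mutation both performs the write and records the mutated slot for the next GC:

```c
#include <stddef.h>

#define MAX_REMEMBERED 1024

/* Slots that were mutated; the GC scans these as extra roots. */
static void *remembered[MAX_REMEMBERED];
static size_t n_remembered = 0;
static size_t n_mutations = 0;   /* the count GC statistics report */

/* Every mutating primitive (like vector-set!) goes through the barrier,
 * so a heap object pointing at fresh stack data is never missed. */
void barrier_write(void **slot, void *new_value) {
    *slot = new_value;
    if (n_remembered < MAX_REMEMBERED)
        remembered[n_remembered++] = (void *)slot;
    n_mutations++;
}
```

At the next minor GC, the recorded slots are treated as part of the live set, so the freshly created value is copied to the heap and the old object's pointer is relayed to the copy.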

Stack size
How does CHICKEN know when the stack is filled up? It turns out that there is no portable way to
detect how big the stack is, or whether it has a limit at all!
CHICKEN works around this simply by limiting its stack to a predetermined size. On 64-bit systems,
this is 1MB, on 32-bit systems it's 256KB. There is also no portable way of obtaining the address of the
stack itself. On some systems, it uses a small bit of assembly code to check the stack pointer. On other
systems, it falls back on alloca(), allocating a trivial amount of data. The address of the allocated
data is the current value of the stack pointer.
When initialising the runtime, just before the entry point is called, the stack's address is taken to
determine the stack's bottom address. The top address is checked in the continuation functions, and the
difference between the two is the current stack size.
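The stack check can be approximated portably by taking the address of a local variable as the current stack pointer, as in this sketch. The limit and names are illustrative, and the real runtime also uses alloca() or a bit of assembly on some systems; the addresses are only compared, never dereferenced:

```c
#include <stddef.h>

#define STACK_LIMIT (1024 * 1024)   /* e.g. 1MB on 64-bit systems */

static char *stack_bottom;

/* Called once at startup, just before the entry point. */
void record_stack_bottom(void) {
    char marker;
    stack_bottom = &marker;
}

/* Called in each continuation function; returns nonzero while there is
 * still room on the nursery stack. */
int stack_check(void) {
    char marker;                              /* ~ current stack pointer */
    ptrdiff_t used = stack_bottom - &marker;  /* downward-growing stack */
    if (used < 0) used = -used;               /* tolerate upward growth */
    return used < STACK_LIMIT;
}
```

When `stack_check()` fails, the generated code saves the current call and jumps into the minor GC instead of calling the next continuation.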

A small rant
While doing the background research for this post, I wanted to read Cheney's original paper. It was
very frustrating to find so many references to it, all of which led to a paywall on the ACM website.
I think it's absurd that the ACM charges $15 for a paper which is over forty years old, and only two
measly pages long. What sane person would plunk down 15 bucks to read 2 pages, especially if it is
possibly outdated, or not even the information they're looking for?
The ACM's motto is "Advancing Computing as a Science & Profession", but I don't see how putting
essential papers behind a paywall is advancing the profession, especially considering how many
innovations now happen as unpaid efforts in the open source/free software corner of the world. Putting
such papers behind a paywall robs the industry of a sorely needed historical perspective, and it
stifles innovation by forcing us to keep reinventing the wheel.
Some might argue that the ACM needs to charge money to be able to host high-quality papers and
maintain its high quality standard, but I don't buy it. You only need to look at USENIX, which is a
similar association. They provide complete and perpetual access to all conference proceedings, and the
authors maintain full rights to their work. The ACM, instead, has now come up with a new "protection"
racket, requiring authors to give full control of their rights to the ACM, or pay for the privilege of
keeping the rights on their own work, which is between $1,100 and $1,700 per article.
On a more positive note, authors are given permission to post drafts of their papers on their own
website or through their "Author-izer" service. Unfortunately, this service only works when the link is
followed directly from the domain on which the author's website is located (through the Referer
header). This is not how the web works: it breaks links posted in e-mail as well as search engines.
Secondly, the ACM are also allowing their special interest groups to provide full access to conference
papers of the most recent conference. However, this doesn't seem to be encouraged in any way, and
only a few SIGs seem to do this.
Luckily, I found a copy of the Cheney paper on some course website. So do yourself a favour and get it
before it's taken down :(
Update: If you are also concerned about this, please take a small moment to add your name to this
petition.
Update 2: I've become aware of a web site called Sci-Hub that makes papers freely available to all. It
bypasses paywalls through shared full-access accounts. Sadly, this is technically still illegal in many
countries and some of its domains have been seized in attempts at censoring them.

Further reading
Garbage collection is a fascinating subject whose roots can be traced back all the way to the origins of
LISP. There's plenty of information to be found:
• http://www.memorymanagement.org/ is a great one-stop reference.
• Felix Winkelmann, CHICKEN's original author, has presented about Scheme implementation
techniques at FrOSCon, which included a bit about CPS conversion and garbage collection. The
slides are also available for download.
• The original "Lambda papers" are a must-read if you want to study the beginnings of Scheme.
The read scheme website is chock full of other interesting resources, actually.
• Speaking about historical information: Matthias Felleisen teaches a course on the History of
Programming Languages, which includes some notes on the history of garbage collection, and
about concurrent garbage collection as well.
Designing Lispy DSLs, part 1: SCSS
Posted on 2012-07-28
Setting up this blog was a good excuse to try out SCSS, which I'd been meaning to do for quite a long
time. Working with SCSS and exploring its limitations got me thinking about what makes a good Lispy
DSL (domain specific language). This post is the first of a series. Today we'll look at SCSS; in future
installments we'll explore other examples of Lispy DSLs.
The idea behind SCSS isn't unique; by generating CSS from a more powerful language you get to use
the abstraction systems provided by that language. Abstractions are sorely needed when writing
advanced CSS; for example, you often need to use one color in many different situations. In plain CSS,
you need to repeat this color value for every usage and, if a few instances need to change, you must go
find and replace them. You can imagine it's easy to forget one, or to replace too many! Where HTML
makes it easy to write semantically meaningful content (by assigning IDs and classes, for example),
CSS doesn't have any way to indicate how style elements logically relate.
As an interesting side note, one of the creators of CSS, Bert Bos, thinks that "real-language" features
are unnecessary in CSS. He goes as far as saying constants shouldn't even be added to CSS. His main argument basically boils down to "other people are stupid, so you don't get to use advanced features either". Luckily, many people disagree and have written their own server-side preprocessing languages.
Some of these projects (like Less and Sass) take the approach of adding their own syntax extensions to
"plain" CSS, while others (like an older syntax of Sass) design their own custom language that's
inspired by the concepts in CSS but quite different in syntax. All these projects are purely about
generating CSS from another language. But we are smug Scheme weenies, and to us code and data are
one and the same. A typical Schemer would prefer not just to generate CSS from SCSS, but to represent
CSS in a first-class value, so that it can be manipulated at will. And that's exactly what SCSS offers... at
first glance.

The devil is in the many, messy details


When you first look at CSS, it seems like a simple enough language. Indeed, the core syntax is rather
simple. Each rule set has selectors separated by commas followed by declarations between curly
braces, separated by semicolons:
#my-id, p.my-class, div {
  background-color: green;
  width: 10em;
  margin-left: 5px;
  border: 1px solid rgb(0, 128, 0);
}

There are three selectors here: the first one selects any element with the id attribute "my-id", the
second one selects every p element (paragraph) which has "my-class" listed in its class attribute.
The third one simply selects all div elements. The declarations are simple property/value pairs which
determine how the selected elements will be displayed.
In Scheme, we can easily represent this as lists of items, where each item is a list of selectors and
values, and that's exactly what SCSS does:
`(css+
  (((= id "my-id") (p (= class my-class)) div)
   (background-color "#008000") ; Should we use string values
   (color green)                ; for classes and colors, or symbols?
   (width "10em")
   (margin-left "5px")
   (border "1px solid rgb(0, 128, 0)")))

One neat feature that's added by most of these CSS preprocessors is that you can nest items. This places
the full expression of their parent before the sub-item, which means that item will only match the
selector within its parent:
`(css+
  (div
   (border "1px solid rgb(0, 128, 0)")
   (((// (= class "some-child")) (// (= id "some-other-child")))
    (color orange))))

This compiles to the following CSS:


div {
  border: 1px solid rgb(0, 128, 0);
}

div .some-child,
div #some-other-child {
  color: orange
}

When looking at the examples we should start to get a funny feeling. Aside from the fact that the
selector syntax is rather heavy on parens which makes it hard to read even for a Schemer, there are a
few problems. The first problem is the fact that we are representing the property values as flat strings
(or symbols). This means you can't easily, say, find all the elements that have a particular color
somewhere in their values without very heavy additional parsing (in CSS, green, rgb(0,128,0)
and #008000 all mean exactly the same thing). You also can't easily compose declarations with
variables without doing string manipulations, which mostly defeats the point of using a first-class
representation:
(let ((company-color "#008000")
      (page-width 1000)
      (logo-size 20))
  `(css+
    ((= class "menu")
     (border-left ,(sprintf "1px solid ~A" company-color))
     (width ,(sprintf "~Apx" (- page-width logo-size)))
     ((= class "whatever")
      (background ,(sprintf "url(\"img/back.png\") no-repeat 10px 20px ~A"
                            company-color))))))
The second problem is that strings, being directly injected into the CSS, don't get "escaped". This
means you can't take any user input (let's say a font name, or a color value) and use this in a declaration
value; this can destroy your entire layout if it contains a semicolon or curly brace - at best an annoying
bug, at worst, a security issue. You might just put everything in one string for all the difference it
makes:
`(css+
  (.my-class
   (color "#222; list-style-type: circle; margin-left: 5px")))

The third "problem" points us in the right direction. The border property is actually a shorthand property. The border declaration from the first example breaks down into the following full
declarations:
html {
  border-top: 1px solid rgb(0, 128, 0);
  border-right: 1px solid rgb(0, 128, 0);
  border-bottom: 1px solid rgb(0, 128, 0);
  border-left: 1px solid rgb(0, 128, 0);
}

Unfortunately, this decomposition is impossible to do in SCSS without parsing the property's string
values. Besides, even if we were to do that, these properties themselves are shorthands, too! For
example, the border-top declaration itself breaks down into these declarations:
html {
  border-top-width: 1px;
  border-top-style: solid;
  border-top-color: rgb(0, 128, 0);
}

This is similar to how, in Scheme, macros can rewrite convenient notation into a simpler core language. The better approach would be to compile down to the core CSS forms rather than trying to use these complex properties directly.
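As a sketch of what such a rewrite step could look like, here is a hypothetical expansion of the border shorthand into its per-side longhands. The property names come from the CSS spec; the function itself is made up for illustration:

```c
#include <stdio.h>

/* Expand "border: <value>" into the four per-side longhand
 * declarations, the kind of rewriting a CSS DSL compiler could do. */
void expand_border(const char *value, char out[4][128]) {
    static const char *sides[4] = { "top", "right", "bottom", "left" };
    for (int i = 0; i < 4; i++)
        snprintf(out[i], sizeof out[i], "border-%s: %s;", sides[i], value);
}
```

A real compiler would of course expand further (border-top into border-top-width, -style and -color), but the shape of the transformation is the same.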
To get this far, we'd have to decompose everything to its simplest form and assemble more complex
properties in terms of simpler ones. In CSS, each property basically has its own free-form "value"
syntax which can get quite complex. Some examples:
html {
  /**
   * Images can be full URIs (dragging in another pretty large RFC), which can
   * *optionally* be quoted (why all this unnecessary optional stuff?)
   */
  background-image: url("path/to/image.png");

  /* You can use named "counters" (what, there are no variables in CSS?!) */
  content: "Chapter " counter(my-chapter-counter) ". ";
  counter-increment: my-chapter-counter; /* Add 1 to chapter */

  /**
   * Lists of font names (strings), separated by spaces and possibly quoted.
   * Also, a restricted set of specially-defined "generic font families"
   * like serif, fantasy (WTF) and monospace, and even specially-defined
   * "system fonts" like status-bar, small-caption, icon, and menu.
   */
  font-family: Helvetica, "Comic Sans MS", fantasy, small-caption;

  /**
   * Different size types: em, ex, px, pt, in, cm, mm, percentage, unit-less.
   * Margins and paddings take 1, 2, 3 or 4 values which expand into -top,
   * -right, -bottom and -left.
   */
  margin: 1px 2em 30% 0;
}

Seriously, who comes up with this stuff? I'm not saying any of these things are useless, but from a
language design standpoint, this seems rather excessive. CSS 3 is even more extreme; there, "image"
value-types get so complex that they need their own separate document to specify. The background
shorthand property grew in complexity as well. Two examples from these drafts (quick, what visual
effect do these have? No cheating):
html {
  list-style-image:
    radial-gradient(circle, #006, #00a 90%, #0000af 100%, white 100%);
  background: url("chess.png") 40% / 10em gray round fixed border-box;
}

Finally, the CSS3 animations draft spec adds a completely new syntax element for key frames. This is
the only place in plain CSS where curly brace sections are nested inside other curly brace sections.
This highly variable and ever-changing aspect of the syntax means that it's quite an open-ended
language. This makes it quite hard to cover all future extensions. The one point that gives me hope is
the fact that all this complexity is built up out of a set of core "atoms" like length units, URIs and
colors. These atoms do not seem to change too much.
This observation shows us an opportunity for a better CSS DSL; we could try to map these atoms to
suitable Scheme values, possibly ignoring the details of how complex values are composed out of these
atoms. This is basically what the W3C did with their CSS DOM API. Taking a good look at this DOM
API might help to get some inspiration, even if the API itself is unwieldy and un-Lispy (it's very OOP-
ish).
In a language without a small set of well-defined atoms, you will need special parsers and generators
for each separate type. This is very confusing to people. I know, because this is exactly the approach I
took for representing HTTP headers in intarweb. I don't consider intarweb to be a true DSL since it
doesn't really have "native" syntax for its header values. Everything passes through construction
procedures which do accept "native" values. However, it does illustrate the point; I've had several
requests for explanation of how to do common (what I thought were) simple things or "just give me a
way to write out the raw header". That's a DSL failure; DSLs ought to be straightforward and easy to
understand, yet powerful.
I like to think that Intarweb isn't a complete failure, because when working with intarweb, once
everything is parsed, it's often rather nice not to have to deal with parsing anymore. Things like cookies
or authentication attributes are notoriously hard to parse correctly, and if everyone up the entire server-
side HTTP stack needs to roll their own parser, that's a lot of wasted effort, and a lot of inconsistent
implementations with their own bugs. Manipulating these values is also a breeze and never involves
string manipulation.

What might a better SCSS look like?


From our new understanding of the nature of CSS, let's try improving it iteratively. For starters, we
would like to use parenthetical notation for everything. Plain strings should be disallowed except where
they are appropriate and are always quoted and escaped. Making this simple change gives us the
following:
`(scss+
  (((= class "foo") (= class "bar"))
   (border-left-color (rgb 0 128 0))
   (border-left-width (em 1))
   ;; unsure whether we should allow this shorthand..
   (border-right (px 1) solid ,orange)
   (width (px ,(- page-width sidebar-width)))
   ((// p)
    (color green)
    (font-family #("Helvetica" "Comic Sans MS" sans-serif)))))

I've used vectors to describe sequences of things, whereas composite declarations like border-right are
simply expressions with more than two subexpressions. Built-ins like sans-serif and green are
symbols. As you can see, because there are no strings, lengths can be calculated without having to
perform string manipulation. Another valid approach would be having a special "color" object type
with associated procedures that operate on them. If we wanted to do this, SCSS could export variables
with color definitions so that green is simply an alias for (rgb 0 128 0), and you could perform
"color-algebraic" operations:
`(scss+
  (((= class "foo") (= class "bar"))
   (border-left-color ,(rgb 0 128 0)) ; "rgb" is a constructor procedure now
   (border-left-width ,(em 1))        ; So are "em"...
   (border-right ,(px 1) solid ,orange) ; .. and "px"
   ((// p)
    (color ,green)
    ;; A green background which is darker by 50%
    (background-color ,(darken green .5))
    (font-family #("Helvetica" "Comic Sans MS" sans-serif)))))

I can't think of any useful operations on font types, so I've kept sans-serif a symbol here. How far
you want to go depends on your goals, and involves striking a balance between ease of use, safety, and
power. For instance, you could define a separate type for everything, including fonts, but that would
make it harder to use. It would also make it harder to introduce mistakes, especially if the CSS
generator will validate while rendering. However, strict validation also means allowing extensions (like
those from CSS3) becomes harder!
The selector syntax could use some love too, but I'm less critical of that. The basic idea is fine; it can
extend to include arbitrary selectors. It currently supports the + sibling and > child selector as well as
the class and id comparisons. Because these operators are in the operator position of a list, adding new
ones is as simple as adding a new procedure in Scheme. A pseudo-selector like p:first-child for
example could simply be translated to (: p first-child) without breaking anything else.

Right now selectors are simply grouped by adding an extra set of parens around them to put them in a
list. Using a visual cue like and or or to indicate grouping might help for readability, as would getting
rid of the // selector for hierarchical nesting. As long as we make sure all selectors are unused
property names there's no ambiguity in simply nesting a new rule inside another one:
`(scss+
  ((= class "foo")
   (color ,orange)
   (div
    (margin-left ,(px 1))
    ((or (= class (or "foo" "bar"))
         (= id qux))
     (border-left-color ,(rgb 0 128 0))
     (font-family #("Helvetica" "Comic Sans MS" sans-serif))))))

Instead of repeating the class selection, we just put the (or ...) around the class, which is a nice
abbreviation, but overall I'm not too happy about this version, so let's back up a step.
We can't guarantee that the selector symbols will remain unused as property values, because we don't
know what property names the CSS spec might add in the future. We should strive to avoid potential
clashes with future extensions. Also, dropping the // makes it harder to traverse an SCSS tree and
perform manipulations since the traversal code would need a full list of all known selectors. So after
all, it looks like it's better to keep the //. But we can drop some unnecessary parens by taking the
previous example and just putting the // before the selector. Since it's been modified to be one s-
expression, we can do that. We can also allow the = selector to accept any attribute (not just classes).
While we're at it, this selector should also accept multiple values to avoid repetition:
`(scss+
  ((or (~= p class "foo") ; Change to (has-word? p class "foo") ?
       (+ div (= p class "bar" my-attr "qux")))
   (border-left-color ,(rgb 0 128 0))
   (font-family #("Helvetica" "Comic Sans MS" sans-serif))

   (// (= * class (or "foo" "bar"))
       (color ,orange)))

  (div
   (display block)
   (// span
       (text-align left))))

The example above also shows the extensibility of operators by adding the ~= selector (a very
unschemely name...). Let's see the CSS this would compile to:
p[class~="foo"],
div + p.bar[my-attr="qux"] {
  border-left-color: rgb(0, 128, 0);
  font-family: "Helvetica", "Comic Sans MS", sans-serif;
}

p[class~="foo"] *.foo,
p[class~="foo"] *.bar,
div + p.bar[my-attr="qux"] *.foo,
div + p.bar[my-attr="qux"] *.bar {
  color: orange;
}

div {
  display: block;
}

div span {
  text-align: left;
}

That's not too bad! There's a lot of redundancy in the resulting CSS that we abstracted away via the
combination of shortened or-alternatives and hierarchical nesting. The original SCSS also had this
hierarchical nesting, by the way, so this type of redundancy is already avoided even by using a slightly
flawed DSL.
In CSS, the #foo and .bar syntaxes are shorthands for selecting on IDs and classes, because these
are so common. There is no technical need to support these shortcuts, so if this makes your design less
clean, you can always drop them and opt to use the generic selectors everywhere. For IE6 and other
crippled browsers, the renderer could detect class selection and rewrite it to the short syntax. You could
always consider extending the Scheme reader to get the same brevity at a higher level, while keeping
SCSS itself simple (not that I would recommend doing that, but the option exists).

Lessons learned
I will try to wrap up each blog post in this series by listing the general design rules that we can extract
from the DSL under discussion. To wrap an existing language like CSS into a DSL, the following
approach seems useful:
• First, identify the atomic building blocks. If there are many, this may spell trouble.
• Decide which building blocks are essential to be represented "first class" in a structured way,
and which can be unstructured strings or symbols (Lisp's atoms).
• Determine the combination rules of these atoms and how to translate this to s-expressions.
• Think about whether you want to rely on the host language and expose shorthands and
abstractions directly, or if you want to rely on Scheme's abstraction facilities.
• If possible, look in what direction the language evolved, and how it has been extended in the
past. Your design must be able to accommodate changes in these directions.
• Finally, use parentheses and "noise symbols" sparingly, but effectively! Try striking a balance
between notation and manipulation convenience.
I realize that some of the things I've said in this post might be contradictory. I might be too vague and
hand-wavery in some places. Hell; many things are probably bloody obvious to some of you. But the
main point is that it's important to remember that design is hard, and will always involve trade-offs.
I hope that you understand that when designing a DSL you'd better think about what use cases you
want it to support before considering how to answer a particular design question. It's very easy to get
carried away and overdesign a DSL, but another pitfall is to have too little design (like SCSS, in my
opinion). Next time we'll look at a design that's pretty close to ideal, and show that even with that, there
are some problems.
Designing Lispy DSLs, part 2: SXML
Posted on 2012-08-05
After last time's example of SCSS, I'd like to take a look at SXML, another Lispy DSL I'm using in this
blog. It's more successful and more widely-used than SCSS and even has an official specification!
The observation that XML is really an obnoxiously verbose Lisp without parens is common, but the
details are (of course) hairier than that. Let's look at an XHTML example:
<div>
  <span>Hello, <strong>dear</strong> friends.</span>
  <span>This is a &lt;simple&gt; example.</span>
</div>

Converting this HTML fragment to an S-expression is straightforward:


'(div
  (span "Hello, " (strong "dear") " friends.")
  (span "This is a <simple> example."))

It's a bit more cumbersome to type because you have to break up the strings for the "strong" element,
but aside from that it's simpler, shorter, and less error-prone; "special" characters can be written as-is
since they are automatically escaped when the XML document is written. Especially when dealing with
large templates and generated content it can be a big time-saver to represent XML as S-expressions;
doubly so if you're using paredit. Plus, Scheme is your templating language, and Lisps are rather good
at processing lists :)
You might be wondering about XML attributes; S-expressions don't have anything that maps naturally
to these. Some XML-in-Lisp variants use keywords for attributes, others use alternating symbols and
strings to indicate attributes. SXML takes the more interesting approach that attributes in XML were a
mistake; there should only be elements. To compensate, SXML uses a tag name that can't exist in XML
(the "@"-sign) and it has the convention that this element can appear as the first child of any element.
Child element names represent attribute names, their text contents represent values. The "@"-sign is
particularly well-chosen because W3C also uses it elsewhere to indicate attributes (e.g. in XPath and
XSLT).
<div id="welcome" class="section">
  <span>Hello, <strong class="affectionately">dear</strong> friends.</span>
  <span>This is a &lt;simple&gt; example.</span>
</div>

becomes:
'(div (@ (id "welcome") (class "section"))
  (span "Hello, " (strong (@ (class "affectionately")) "dear") " friends.")
  (span "This is a <simple> example."))

Let's look at what makes SXML such a good DSL. First, XML has a hierarchical structure, which maps
well to S-expressions. It is built up out of only a handful of atoms: it has start tags with attributes, end
tags, entities, and textual content in between. In SXML, tag names are mapped to symbols, which can
represent any string, so this naturally extends to all possible XML tag names.
When building websites, the fact that regular HTML is less strict than XML is irrelevant; you don't
need features like, say, omitting an end tag. In fact, end tags don't even exist in SXML; it models the
underlying concept of elements rather than tags; it simply treats tags as artifacts of the serialized textual
representation of an element. S-expressions can be seen as an alternative serialized textual
representation of the same document described by the "angular brackets and tags" notation.
This is another important aspect of good DSLs; they tend to ignore surface syntax. Instead, they map
the underlying tree-like structure to S-expressions. SXML uses elements instead of letting itself get
distracted by tags, and it generalizes attributes to fit the tree structure. By representing the structure in
S-expressions, you know what parts need to be "escaped" in order to preserve this structure. When
writing SXML to XML, all string elements in an SXML document get their angular brackets < and >
converted to &lt; and &gt;. The only angular brackets ending up in the output are those that result
from serialization of elements to start/end tags. When reading XML, all entities are automatically
converted to the characters they represent, so in Scheme you get to work directly with the text contents
at the conceptual level. CDATA sections are also eliminated; they are simply represented by their string
value.
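A minimal sketch of that escaping step, written in C for illustration (SXML serializers do this in Scheme, and handle more cases, such as attribute quoting):

```c
#include <stdio.h>

/* Escape the XML-significant characters in a text node when
 * serializing, so the structure of the document is preserved. */
void xml_escape_into(const char *in, char *out, size_t outlen) {
    size_t n = 0;
    for (; *in && n + 6 < outlen; in++) {   /* leave room for "&amp;" + NUL */
        const char *rep = NULL;
        switch (*in) {
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        case '&': rep = "&amp;"; break;
        }
        if (rep) n += (size_t)sprintf(out + n, "%s", rep);
        else out[n++] = *in;
    }
    out[n] = '\0';
}
```

The only angular brackets the serializer itself emits are the ones delimiting start and end tags, exactly as described above.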

Some complications
XML isn't as simple as you'd think at first glance. Remember the same observation about CSS? This is
a common theme with web technology. Don't even get me started about HTTP! In the words of Oleg
Kiselyov, author of the SXML specification and many tools in the SSAX project:
There exists a myth that parsing of XML is easy. An article "Parsing XML"
in the January 2000 issue of Dr.Dobb's Journal states the ease of parsing
as an alleged fact. The author of that article must have overlooked that
there is more to XML than the grammar presented in the XML Recommendation.
There are attribute normalization rules, well-formedness constraints, let
alone validation constraints. XML Namespaces add another layer of complexity.

You can almost hear his frustration... Here's an example to illustrate some things that so far we have
glossed over. This isn't a fragment, but a full XML document (with thanks to Jim Ursetto):
<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml">
      <div xmlns="http://www.w3.org/1999/xhtml">
        <p>I'm invincible!</p>
      </div>
    </content>
  </entry>
</feed>

There are two new things to notice: The document starts with a "special" syntax to indicate that we're
using XML version 1.0. These so-called processing instructions provide a generic way of passing
(meta-)information to the application, outside the XML document itself. The second new thing is that
the Atom feed in this example holds an XHTML document fragment as a sub-document. It uses a
namespace declaration to indicate that the div and p tags are taken from a different XML schema than
the main document.
Let's see how SXML deals with these new concepts. This document can be represented in several ways,
but the following is arguably the simplest:
(*TOP* (@ (*NAMESPACES* (atom "http://www.w3.org/2005/Atom")
                        (xhtml "http://www.w3.org/1999/xhtml")))
  (*PI* xml "version=\"1.0\"")
  (atom:feed
    (atom:entry
      (atom:content (@ (type "xhtml"))
        (xhtml:div (xhtml:p "I'm invincible!"))))))

If you look carefully at the original XML document, you'll see that while feed is the document's root
node, it still has a sibling: the processing instruction! There's a "virtual" root element that holds these
two, which XML calls the "document entity". SXML generalizes this to an element called *TOP* (for
"top-level"). The *NAMESPACES* element (an "attribute" of *TOP*) stores an association list of
element name prefixes used to indicate a namespace. The *PI* element is SXML's way of
representing the processing instruction (its version "attribute" isn't parsed because it isn't a true
attribute in terms of XML; it just looks like one. Don't ask...).
All three pseudo-elements are represented by symbols that are invalid XML tag names due to the
asterisk. Just like the @ for attributes, this ensures that they can't possibly clash with any tag name that
might occur in a particular document type or future versions of XML.
One disadvantage of encoding namespaces as part of the tag name is that you can't see what namespace
a particular element belongs to without first converting the symbol to a string, and then splitting it at
the colon. This means namespaces aren't really first class. Most likely this is the case because
namespaces were added at a later stage when most of the SXML syntax was already set in stone, and
modifying it in some other way to support namespaces would be too invasive.
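To make the "not first class" point concrete, here is the string surgery involved, sketched in Python (the function name is hypothetical; in Scheme you would do the same dance with `symbol->string` and a split on the colon):

```python
def split_qname(tag):
    # An SXML tag like "atom:feed" carries its namespace prefix inside
    # the symbol itself; recovering it means splitting the name apart.
    prefix, sep, local = tag.partition(":")
    return (prefix, local) if sep else (None, tag)

split_qname("atom:feed")  # → ('atom', 'feed')
split_qname("div")        # → (None, 'div')
```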

Tool set
You only realize the absolute brilliance of SXML when you look at its tool set. The XML ecosystem is
an entire zoo of mini-languages. Most of these languages (for example XPath, XLink, XSLT) have
some kind of corresponding DSL in the SSAX project. This makes it a complete toolbox for anyone
working with XML. Sure, the documentation is incoherent and a little on the "academic" side, and the
SSAX SourceForge project is a random collection of loosely-related tools that aren't exactly idiomatic
Scheme (if there even is such a thing), but go ahead and compare it with tools in other languages.
Most "stock" XML libraries are awkward contraptions. Usually they expose highly verbose object-
oriented APIs based on the W3C DOM specification, where constructing even a small tree takes several
lines of code. It's so awkward that many programmers will tend to prefer generating the XML
manually, by writing out strings.
Dynamic languages tend to do better. First, in Perl, there's XML::Simple. This is a little awkward due
to Perl's hash and array syntax, but other than that it is a lot like SXML. However, this library is
deprecated in favor of one of those awkward OO libraries, XML::LibXML.
Ruby and Python have convenient "builder" objects which can really speed up generation of XML, but
as the name says, these are for building. The format in which you build isn't directly the first-class
representation, which makes the API slightly disparate. For both languages, these are not the default
libraries either, which makes them less likely to be used by people who want to minimize
dependencies.
Finally, even though it's a very static language, Haskell has a pretty good builder-like library too, which
seems quite popular. If you ever need to generate XML (or "just" HTML) with one of these languages,
do yourself a favor and use one of these libraries.
But back to SXML; let's see how you'd read in a document, manipulate it, and write it back out again.
For a change, the code presented is a complete (Chicken) Scheme program. This will read one of the
first XML documents in this post, change it, and send it to standard output.
;; First, run the following from the shell to ensure this program will work:
;; $ chicken-install ssax sxml-modifications sxml-serializer
(use ssax sxml-modifications sxml-serializer)

(define doc (ssax:xml->sxml (current-input-port) '()))

(define change
  (sxml-modify
    `("div/@id" replace (id "good-bye"))
    `("../span[1]/text()[1]" replace "Goodbye, ")
    `("../following::*" replace (span "This was more " (em "complex")))
    `("self::*" insert-into ", don't you think?")))

(serialize-sxml (change doc) output: (current-output-port))

The arguments to sxml-modify comprise a mini-DSL, representing actions to take on the XML
document. Each action is a list; first an XPath expression which selects node(s) from the document,
then the name of the action to perform, followed by the value to use for this action. Each action is
executed in sequence, so the XPath expression is relative to the previous action's node set, a little like
how chaining works in the excellent jQuery library. Actually, I think there are still some lessons in
convenience to learn from jQuery, but that's a different story.
Let's invoke the program and see what happens:
$ cat welcome.xml
<div id="welcome" class="section">
  <span>Hello, <strong class="affectionately">dear</strong>friends.</span>
  <span>This is a &lt;simple&gt; example.</span>
</div>
$ csi -s convert.scm < welcome.xml
<div id="good-bye" class="section">
  <span>Goodbye, <strong class="affectionately">dear</strong>friends.</span>
  <span>This was more <em>complex</em>, don't you think?</span>
</div>
Not too shabby for a 10-line program!
Unfortunately, there's a catch. Originally, I planned to use the Atom feed from the previous section as
input, but it turns out that the modifications sub-language doesn't support passing a namespace map to
the underlying XPath library. Also, I was unable to use an sxpath expression instead of a standard string
XPath expression. This could be bad documentation (the docs for SXML modifications are pretty
sparse), or perhaps it's a lack of support for namespaces. A quick look at the source seems to confirm
the latter. The lack of support for sxpath expressions is also serious and indicates how "random" the
selection of tools in SSAX really is; some of these tools don't even support each other! Luckily, it looks
like both limitations aren't fundamental, and could be addressed by a (small?) change in the tools.
I mentioned earlier that the SSAX project is a loose collection of tools with incoherent documentation.
My failure in figuring out how to combine namespaces with SXML modifications or use the sxpath
DSL from the "SXML modifications" DSL helps point out the importance of a good, robust, and well-
documented tool set. This might possibly be more important than a good DSL; if nobody can use your
DSL, it might just as well not exist.

Wrapping up
The following rules can be distilled from the SXML design:
• Do not slavishly translate surface syntax to S-expressions, but model the structure.
• Eliminate or generalize all features that are strictly unnecessary.
• When generalizations demand new names, pick ones that are invalid in the source language, but
try to borrow familiar conventions from the domain.
• When generating output, ensure structural integrity by escaping all content.
• People will avoid clumsy DSLs, to the point of falling back on string manipulation.
• No matter how well-designed your DSL is, it needs good tools and documentation.
• DSLs within the same domain should be mutually supportive.
There are still many aspects of XML we barely touched upon. However, this post is already long
enough, and my knowledge of XML (and SXML) only goes so far, so we won't go into more detail. Of
course, you can always dig in and find out more yourself; there are plenty of links in this document you
can use to study the subjects.
Designing Lispy DSLs, part 3: SRE Posted on 2012-08-14
Today I'd like to discuss Scheme Regular Expressions (SRE). Originally introduced in a library for the
Scheme Shell, this DSL has recently been gaining some popularity due to the release of Irregex, a pure
R5RS Scheme regex engine with SRE as its native syntax. Irregex has been integrated as the core regex
system in Chicken Scheme and Jazz Scheme, and you can easily use it from any other Scheme due to
its portability.
Back in 1998, the author of the first SRE implementation (Olin Shivers, one of the funniest Schemers
around) posted an announcement to several newsgroups about this new syntax. It's well worth a read;
especially the preamble about 100% and 80% solutions is a very inspiring call to arms which provides
a good insight into the Scheme way of thinking. By the way, if you liked this, you'll also want to read
the classic essay "Worse Is Better", if you haven't already. The bit about "The Right Way" is especially
relevant. Consider yourself warned, Schemer!

Figuring out the rules


The "Discussion and design notes" section from the announcement is particularly interesting as it
discusses the DSL from a point of view similar to this series of blog posts. It's also the largest section,
so we'll just touch upon the important points here. The first thing that really jumps out is the fact that
the author has taken a look at many different regex packages for various languages, and even asked
Tom Lord and Henry Spencer (both wrote their own regex engines) about obscure details. Doing this
kind of in-depth research is a great way to get started when designing a DSL since it provides you with
a nicely broad perspective on the various viewpoints of others who went before you. This will reduce
your "blindness to complexity". Initially every target language seems simple but there are always
pitfalls which, if overlooked, would result in a DSL that's hard to extend or doesn't provide all the
features of the target language which a user would need. By looking at other implementations you see
how they deal with the more complex nooks and crannies of the target language.
The other main point is that he drew a very clear line in the sand of what features would go into SRE
and which wouldn't. The SRE syntax doesn't support any "extended regex" features which would force
a particular implementation strategy. This makes the SRE syntax independent of the underlying regex
engine, which allows for greater portability and generality, but more importantly, it leaves open the
possibility of efficient implementations. This was misunderstood by many people; he had to educate
Richard Stallman about why supporting back references in the general SRE syntax is a bad idea:
My feeling about back references is as follows. Regexps are based on a deep
theory -- regular sets and DFA's -- that has tremendous implications about
the operations you can perform on them and the ways you can implement them.
Back-references completely shatter this framework. They rule out certain
extremely efficient implementations. They rule out certain operations. They
have nothing to do with the idea of a "regular expression." They are not one
with the deep structure of the system.
Repeat that in your mind: They are not one with the deep structure of the system. DSL design notes
don't get any more philosophical than that! This points right to the core design principle of SRE (and
regexes in general). If you are designing a DSL, you can consider yourself very lucky when you find a
guiding principle which is that strong. You should let it inform all your design choices because it will
help you achieve a good, cohesive design. This also makes it easier to defend your choices when users
start complaining about missing features...

Representational issues
After my SCSS post, I was asked why you'd want to represent CSS using first-class values. I think the
SXML DSL example illustrates how powerful a first-class representation can be, but I must admit, I
don't see many valid use-cases for "first-classing" a CSS DSL.
However, one important lesson in programming is that you never know what clever things people are
going to do with your code; clever things you only wish you thought about. You should see first-class
values as an enabler for other people to take your DSL's usefulness to new heights. The "Prime
Clingerism" applies to DSLs as well; without a first-class representation, additional features will appear
necessary to perform useful operations.
One interesting aspect of the design of the original SRE library for SCSH is how it deals with first-class
regexps. It contains a large set of procedures to manipulate the underlying regular expression ADT
(abstract data type). Olin believed a separate ADT was required for easy manipulation of regex objects,
and it would also allow extension of the supported operator set. Directly operating on the SRE
expressions would be harder for programmatic extensions. This distinction allows for a baroque but
user-friendly SRE syntax in which it is possible to write one thing in many different ways, while also
offering ease of manipulation from user code.
Olin is quick to point out that this does cause massive complexity in the code (a point also raised by
Richard Stallman), but says the work is done now and anyone is free to take his code and re-use it (this
fits the "100% solution" ideology mentioned at the beginning). This ADT approach is comparable to
how Lisp/Scheme compilers internally rewrite the full language to a simpler to manipulate "core
language", so it isn't completely unique to SRE.
At first glance, this seems a little hard to defend, especially the fact that there's also a seemingly
unnecessary rx macro in his design, while Irregex gets by fine without these. I've asked Alex Shinn,
the author of Irregex, about this, and he mentioned that the macro and the ADT were needed in SCSH
because it depended on an underlying POSIX regex engine rather than implementing it natively in
Scheme. SCSH first reads SREs at compile-time and the macro tries to compile down to the ADT as
much as possible. Then, at run-time, this ADT is converted to a POSIX regex string which is compiled
by the underlying C regex engine.
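That last step, compiling an S-expression regex down to a POSIX string for a C engine, can be illustrated with a toy translator. This is a Python sketch over a tiny, made-up subset of the operator set; real SRE has many more forms and SCSH's compiler goes through the ADT first:

```python
import re

def sre_to_posix(sre):
    # Strings are literal text; lists are (operator . arguments).
    if isinstance(sre, str):
        return re.escape(sre)
    op, *args = sre
    translated = [sre_to_posix(a) for a in args]
    if op == ":":                      # sequence
        return "".join(translated)
    if op == "+":                      # one or more
        return "(" + "".join(translated) + ")+"
    if op == "or":                     # alternation
        return "(" + "|".join(translated) + ")"
    raise ValueError("unsupported operator: " + str(op))

sre_to_posix([":", "ab", ["+", ["or", "c", "d"]]])  # → 'ab((c|d))+'
```

Irregex skips this translation entirely: since the matcher itself is Scheme, the S-expression is already in a convenient form to match against directly.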
Because Irregex is written natively in Scheme, this extra step is not necessary and Alex decided to get
rid of these distractions and implement only the SRE syntax, without the rx macro and ADT. The
result is a very compact implementation, having about the same size as the SCSH package, but
including a full matching engine! As far as I know there isn't any widely-used extension for SCSH
which makes use of the ADT interface, and the SRE syntax hasn't been extended as much as Olin
foresaw might happen. For these reasons, the choice to drop all that complexity seems like a wise one.
However, only time will tell whether that's really the case.

Wrapping up
Let's see what we have learned from the design of SRE:
• Do your research! Inspect as many libraries and DSLs as possible to gain a broad perspective
and avoid "blindness to complexity". If in doubt, ask a domain guru.
• Relentlessly strip all features that preclude efficient implementation strategies. When users
request them back, resist!
• Design for extensibility and programmability; strive to support a first-class representation.
• When things have settled down, re-evaluate the design and drop unnecessary features.
• You don't always need nitty-gritty details from code examples to analyse a DSL :)
I admit, some of these rules are not for the faint of heart, but they make for a very strong and coherent
DSL which might see wider adoption than just your initial implementation.
Designing Lispy DSLs, part 4: SSQL Posted on 2012-08-20
Today we'll look at an old, experimental DSL of my own design. I've always referred to it as a failed
experiment, but perhaps it's really a successful experiment, because it helped me figure out why this
type of DSL doesn't work too well. Whatever the status, I'll use it as an example of what makes a bad
DSL.
The DSL in question is SSQL, a way of embedding SQL as S-expressions into Scheme code.
Interestingly, it seems I had a bad feeling about it from the start; the initial commit had the following
message:
Add another doomed project - ssql

It turned out not to be completely doomed, because Moritz Heidkamp has kindly taken over
maintenance and has been improving and polishing the library. I might even use it again for my own
projects if I ever get tired of working directly with SQL.

Scoped access
For my day job I used to write a lot of Rails code, and I got tired of the restrictions in
ActiveRecord. I have to mention that this was in the days before Arel, which is a great
improvement in the way you can use custom queries in Rails.
With ActiveRecord, you could write code that would automatically prevent users from accessing
things they shouldn't be able to access with the scoped_access plugin. This allowed you to write things
in your controller like the following:
scoped_access Customer

def method_scoping
  ScopedAccess::ClassScoping.new(Customer, :user => {:id => current_user.id})
end

I don't recall exactly how it worked, but when you had a complex query, this could cause clashes when
the same table was joined in twice, especially if the condition was complex. In different situations,
different queries could be generated. Back then, you also needed to know internally-generated join
aliases in order to scope related tables. Remember, this was quite a while ago, and I was a bit of a
newbie and had been programming Ruby and Rails for only a year or two. There may have been better
ways to do this even then.
In any case, this scoping problem annoyed me no end and I knew there had to be a better way. It was
obvious that if you represent the query in a more complex data structure than a simple string, you can
easily fetch all the references to a particular table (even if it is aliased), and add some scoping to it. This
could be done even if it required the addition of joins, and even if those tables were already joined
under arbitrary names, as long as you would alpha-rename all aliases to avoid clashes with user-created
aliases.
Here's an example of the SSQL syntax. This example is based on a toy data model for an IMDB-clone
with films, actors and their roles in them:
'(select (columns (col actors id firstname lastname)
                  (col roles character movie_id))
         (from actors roles)
         (where (and (= (col actors firstname) "Bruce")
                     (= (col actors lastname) "Campbell")
                     (= (col actors id) (col roles actor_id)))))

The regular SQL equivalent of this:

SELECT actors.id, actors.firstname, actors.lastname,
       roles.character, roles.movie_id
FROM actors, roles
WHERE actors.firstname = 'Bruce'
  AND actors.lastname = 'Campbell'
  AND actors.id = roles.actor_id;

The SSQL for column selection can be a little ugly or verbose, so it's also allowed to specify columns
with a dot instead of the col form (probably a mistake, complicating the DSL design):
'(select (columns actors.id actors.firstname actors.lastname
                  roles.character roles.movie_id)
         (from actors roles)
         (where (and (= actors.firstname "Bruce")
                     (= actors.lastname "Campbell")
                     (= actors.id roles.actor_id))))

The columns "noise word" is still required, because that makes it easier to walk the expression and
programmatically manipulate it. In any case, scoping a table is easy, even for arbitrarily complex cases:
(let ((query
       '(select (columns actors.firstname actors.lastname
                         roles.character movies.title)
                (from (join left
                            (join left actors
                                  (join inner roles (as movies m2)
                                        (on (and (= m2.id roles.movie_id)
                                                 (> m2.year 2000))))
                                  (on (= roles.actor_id actors.id)))
                            movies
                            (on (= movies.id roles.movie_id)))))))
  (scope-table 'movies '(< (col movies year) 2005) query))

;; Results in the following:

(select (columns actors.firstname actors.lastname
                 roles.character movies.title)
        (from (join left
                    (join left actors
                          (join inner roles (as movies m2)
                                (on (and (= m2.id roles.movie_id)
                                         (> m2.year 2000))))
                          (on (= roles.actor_id actors.id)))
                    movies
                    (on (= movies.id roles.movie_id))))
        (where (and (< (col m2 year) 2005)
                    (< (col movies year) 2005))))

The initial query selects all the films in the database, including all actors with the roles they played in
that film. However, the actors are only included for films that were released after the year 2000. Earlier
films are returned without the actors.
Now, the magic happens in the call to scope-table, which returns the same query, but with all
occurrences of the movies table scoped to include only films released before the year 2005. Note that
this scopes both the main query and the joined table m2 even though it's aliased.
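The essential trick behind scope-table is that a first-class query can be walked like any other tree. A stripped-down sketch in Python, with nested lists standing in for S-expressions (`table_refs` is a hypothetical helper, not part of SSQL, and it only collects names rather than rewriting the query):

```python
def table_refs(query, table):
    # Collect every name under which `table` is visible in the query:
    # direct references plus aliases introduced by (as table alias).
    refs = set()

    def walk(node):
        if isinstance(node, list):
            if len(node) == 3 and node[:2] == ["as", table]:
                refs.add(node[2])
            for child in node:
                walk(child)
        elif node == table:
            refs.add(table)

    walk(query)
    return refs

query = ["select",
         ["from", ["join", "left",
                   ["join", "inner", "roles", ["as", "movies", "m2"]],
                   "movies"]]]
table_refs(query, "movies")  # → {'m2', 'movies'}
```

Once you have that set of names, adding a scoping condition per name is straightforward; doing the same over a raw SQL string would require a full parser.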

It's all about the syntax


Okay, so it turns out that this idea works beautifully. Let's look at why I think this DSL was a failure.
One reason is the fact that SQL is a huge language, especially when you consider all the extensions
provided by various implementations.
You could say "but you don't have to support the full language". That's true, but the problem with
a language that maps directly to SQL is that users will expect to be able to do all the things they can do
in regular SQL. For example, when Common Table Expressions were first introduced into PostgreSQL,
I started seeing many places in my code bases at work where those would be useful. The same was true
for Window Functions. These are both extremely useful extensions, and I'm now making regular use of
them. I wouldn't want to miss them, so any SQL DSL really needs to support them for me to take it
seriously.
The thing both extensions have in common is that they introduce completely new syntax. That's
because there are absolutely no common building blocks for language constructs; every feature is a set
of arbitrarily-placed keywords to help a parser make sense of it (with many optional "noise" keywords
to help a human make sense of it). This means each feature has to be taught separately to SSQL,
resulting in a large set of rules on how to convert them to SQL.
The SQL grammar is so complicated that its sheer size has serious performance implications on a
parser, as pointed out by this blog post. Because EXPLAIN is a PostgreSQL extension, they simply
decided to change this command's syntax to make it faster to parse. The old syntax is still supported for
backwards-compatibility, but this change is a great illustration of how much of a moving target the
SQL syntax really is. Other SQL implementations don't generally move as fast as PostgreSQL in
adding features, but as I indicated earlier, I really like these features and use them on a regular basis.

Database independence with SQL-based syntax?


Another complication is supporting multiple databases. SSQL supports ANSI SQL as a baseline, with
optional extensions that are available if the back-end supports it. The nice thing is that this provides a
degree of database independence. All back-ends can automatically quote strings and table names
correctly depending on the database, making SQL injection bugs effectively impossible. For example,
'(select (columns (col actors firstname lastname birth-date))
         (from actors)
         (where (= actors.lastname "O'Neill")))

gets output as the following in PostgreSQL and SQLite:

SELECT actors.firstname, actors.lastname, actors."birth-date"
FROM actors
WHERE actors.lastname = 'O''Neill';

The MySQL back-end outputs the following:

SELECT actors.firstname, actors.lastname, actors.`birth-date`
FROM actors
WHERE actors.lastname = 'O\'Neill';
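The dialect-specific quoting shown above boils down to a pair of small functions. Here is a sketch in Python (the function names and the two-dialect split are illustrative, not SSQL's actual back-end interface):

```python
def quote_string(s, dialect):
    # ANSI SQL (PostgreSQL, SQLite) doubles embedded quotes; MySQL
    # traditionally backslash-escapes them (and backslash itself).
    if dialect == "mysql":
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return "'" + s.replace("'", "''") + "'"

def quote_ident(name, dialect):
    # Identifiers use double quotes in ANSI SQL, backticks in MySQL.
    if dialect == "mysql":
        return "`" + name.replace("`", "``") + "`"
    return '"' + name.replace('"', '""') + '"'

quote_string("O'Neill", "postgresql")  # produces: 'O''Neill'
quote_string("O'Neill", "mysql")       # produces: 'O\'Neill'
quote_ident("birth-date", "mysql")     # produces: `birth-date`
```

Because every value and identifier goes through one of these two choke points, injection bugs become structurally impossible, which is the same "single point of truth" argument made elsewhere on this blog about NUL byte handling.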

These differences are relatively small and don't affect the syntax of the S-expression version. However,
there are other examples that do. For example, MySQL's INSERT statement allows syntax which
mirrors the UPDATE statement, using SET:
INSERT INTO movies SET title = 'Alien', year = 1979;

whereas PostgreSQL only allows the standard syntax (which MySQL also supports):
INSERT INTO movies (title, year) VALUES ('Donnie Darko', 2001);

The question then becomes whether the (unnecessary) syntax with SET should be allowed, and, if so,
whether this should be emulated in PostgreSQL by rewriting it to the standard syntax. There are tens of
such silly examples (CONCAT versus || versus logical OR, case insensitive LIKE versus ILIKE, etc),
but there are a lot of more fundamental differences, too. Finally, using ANSI as a baseline is nice, but
many of ANSI's features aren't widely implemented. Common Table Expressions are a good example;
they're standardized, but neither MySQL nor SQLite support them, and Postgres only started
supporting them very recently. Oh, and fuck proprietary RDBMSes; Oracle long ignored ANSI and
invented more nonstandard extensions than MySQL ever did, and as a result, their users are as clueless
about ANSI SQL as the average MySQL user. Finally, there are many ANSI features that none of these
databases support. This means you have to implement a feature in ANSI, then override it to produce an
error message saying it's unsupported in this dialect for all implementations that don't support it. An
alternative approach is to implement no base but make everything completely implementation-specific.
However, this results in a bigger risk of producing DSL inconsistencies between dialects.

A better approach
Recently, the relational algebra has been gaining some more interest. For example, there's Alf for Ruby
and the UNIX shell, and of course Arel, which I mentioned earlier.
I think this is a better approach; relational algebra has just a handful of concepts and there's no syntax
associated with it, so you can invent your own syntax to best fit your DSL. It also prevents you from
getting distracted by the differences in various SQL implementations. You can see this with Alf already;
it has total abstraction over the DBMS. It can use flat files or SQL, or any other back-end you'd like, as
long as it fits the relational model (to be fair, so can PostgreSQL with SQL/MED foreign data
wrappers). The flip side of such a high level of abstraction is that it will be harder to make use of any
killer features offered by your RDBMS; you get the lowest common denominator in features.
Optimizing queries also becomes hard. You can no longer hand-optimize them when writing them, and
you'd probably end up with an optimizer in your library. This is pretty insane, since there's also an SQL
optimizer and query planner inside your RDBMS, so you're doing twice the work, and there's twice the
opportunity for getting it wrong.
Despite these disadvantages, the "relational algebra DSL" approach is more viable than the "SQL DSL"
approach. ClojureQL also initially took the approach of providing a DSL closely modeled on SQL, but
later completely revised the DSL to be more abstract and closer to relational algebra than to SQL.
I think it's interesting to see what other SQL-like DSL projects will do. For example, Clojure also has
Korma, which is rather close to SQL and looks like it can currently only perform a limited subset of all
possible queries. I wonder what they'll do when users start clamoring for richer back-end support?
Racket used to have SchemeQL, but that project seems to have vanished from the web. The website of
its parent project, Schematics, doesn't mention it at all anymore. The same seems to have happened to a
Common Lisp interface called CL-RDBMS (at least the "homepage" link currently points to a broken
web site).
There's a popular library for Common Lisp called CLSQL. It looks like an enormous amount of
engineering went into it. If that's required to get a useful SQL DSL, it might not be worth it unless the
advantages outweigh the effort required. Note that even after 10 years of development, CLSQL still has
no outer join support. I think that's indicative of how hard it is to properly support SQL from a DSL.

Wrap-up
The lessons I learned from the SSQL experiment are in retrospect rather simple, and seem to echo
earlier blog posts:
• The language you're targeting should be small and have few core concepts.
• The relevant standards should be fully implemented in all back-ends you want to support.
• Back-ends shouldn't have any arbitrary extensions that you're expected to support.
• Look for an underlying theory; this may be a better abstraction than the target language.
• Try to find examples of similar libraries. Did others try, and fail or give up? If so, why? How
complex are existing implementations? Are they complete?
This post will be the last post in this series, at least for a while. There aren't that many other interesting
DSLs with which I'm familiar, and I've exhausted the list of novel design concepts that I'm able to
distill from existing DSLs.

Lessons learned from NUL byte bugs Posted on 2012-12-10
Last time I explained how sloppy representations can cause various vulnerabilities. While doing some
research for that post I stumbled across NUL byte injection bugs in two projects. Because both have
been fixed now, I feel like I can freely talk about them with a clear conscience.
These projects are Chicken Scheme and the C implementation of Ruby. The difference in the way these
systems deal with NUL bytes clearly shows the importance of handling security issues in a structural
way. We'll also see the importance of truly grokking the problem when implementing a fix.

A quick recap
Remember that C uses NUL bytes to delimit strings. Many other languages store the length of the
string instead. In these languages, NUL bytes can occur inside strings. This can cause unintended
reinterpretation when strings cross the language border into C.
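Python stores strings the same way Scheme does (length-prefixed, NUL bytes allowed), so it can demonstrate both the reinterpretation and the kind of guard discussed below. This sketch assumes a Unix-like system, where `ctypes.CDLL(None)` gives access to libc's symbols:

```python
import ctypes

s = b"hello\x00there"
assert len(s) == 11          # the host language sees all 11 bytes

# C's strlen stops at the first NUL byte: silent truncation.
libc = ctypes.CDLL(None)     # Unix-only handle to the process's symbols
assert libc.strlen(s) == 5

# CPython guards its own C-facing APIs against exactly this:
try:
    open("foo\x00bar")
except ValueError as e:
    print(e)                 # embedded null byte
```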
In my previous post I already pointed out how Chicken automatically prevents this reinterpretation in
its foreign function interface (FFI). You just describe to Scheme that your C function accepts a string,
and it will take care of the rest:
(define my-length (foreign-lambda int "strlen" c-string))

;; Prints 12:
(print (my-length "hello, there"))

;; Raises an exception, showing the following message:
;; Error: (##sys#make-c-string) cannot represent string with NUL
;; bytes as C string: "hello\x00there"
(print (my-length "hello\x00there"))

The FFI's feature of automatically checking for NUL bytes in strings before passing them on to C was
only added in late 2010 (Chicken 4.6.0). However, because everything uses this interface, this
mismatch could easily be fixed, in a central location, securing all existing programs in one fell swoop.
Now, you may be thinking "well, that's nothing special; it's good engineering practice that there must
be a single point of truth, and that you Don't Repeat Yourself". And you'd be right! In fact, this is a key
insight: solid engineering is a prerequisite to secure engineering. It can prevent security bugs from
happening, and help to fix them quickly once they are discovered. A core tenet of "structural security"
is that without structure, there can be no security.

When smugness backfires


To drive home the point, let's take a look at what I discovered while writing my previous blog post.
After describing Chicken's Right Way solution and feeling all smug about it, I noticed an embarrassing
problem: for various reasons (some good, others less so), there are places in Chicken where C functions
are called without going through the FFI. Some of these contained hand-rolled string conversions!
It turns out that we overlooked these places when first introducing the NUL byte checks, and as a
consequence several critical procedures (standard R5RS ones like with-input-from-file) were
left vulnerable to exactly this bug:
;; This program outputs "yes" twice in Chickens < 4.8.0
(with-output-to-file "foo\x00bar" (lambda () (print "hai")))
(print (if (file-exists? "foo") "yes" "no"))
(print (if (file-exists? "foo\x00bar") "yes" "no"))

To me, this just validates the importance of approaching security measures in a structural rather than an
ad-hoc way; the bug was only in those parts of the code that didn't use the FFI. Deviation from a rule is
where bugs are often found!
You can also see that we fixed it as thoroughly as possible, especially given the at times awkward
structure of the Chicken code. We commented every special situation extensively, assigned a new error
type C_ASCIIZ_REPRESENTATION_ERROR for this particular error, and added regression tests for
at least each class of functionality (string to number conversion, file port creation, process creation,
environment access, and low-level messaging functionality). There's definitely room for improvement
here, and I hope to one day reduce the special cases to the bare minimum. By documenting special
cases it's easy to avoid introducing new problems. It also makes them easier to find when refactoring.
The tests help there too, of course.
When you run the above program in a Chicken version with the fix, it behaves as expected:
Error: cannot represent string with NUL bytes as C string: "foo\x00bar"

Another approach
The Ruby situation is a little more complicated. It has no FFI but a C API, so it works the other way
around: you write C to interface "up" into Ruby. It has a StringValueCStr() macro, which is
documented as follows (sic):
You can also use the macro named StringValueCStr(). This is just
like StringValuePtr(), but always add nul character at the end of
the result. If the result contains nul character, this macro causes
the ArgumentError exception.

However, this isn't consistently used in Ruby's own standard library:


File.open("foo\0bar", "w") { |f| f.puts "hai" }
puts File.exists?("foo")
puts File.exists?("foo\0bar")

In Ruby 1.9.3p194 and earlier, this shows the following output, indicating it's vulnerable:
true
test.rb:4:in `exists?': string contains null byte (ArgumentError)
from test.rb:4:in `<main>'

It turns out that internally, Ruby strings are stored with a length, but also get a NUL byte tacked onto
the end, to prevent copying when calling C functions. This performance hack undermines the safety of
Ruby to C string conversions, and is the direct cause of these inconsistencies. True, there is a safe
function that extracts the string while checking for NUL bytes, but there are also various ways to
bypass this, and if you accidentally use the wrong macro to extract the (raw) string, your code won't
break. Of course, this is only true for benign inputs...
The complexity of Ruby's implementation makes it hard to ensure that it's safe everywhere. Indeed, the
various places where strings are passed to C all do it differently. For example, the ENV hash for
manipulating the POSIX environment has its own hand-rolled test for NUL, which you can easily
verify; it produces a different error message than the one exists? gave us earlier:
irb(main):001:0> ENV["foo\0bar"] = "test"
ArgumentError: bad environment variable name

There is no reason this couldn't just use StringValueCStr(). So, even though Ruby has this safe
macro, which provides a mechanism to check for poisoned NUL bytes in strings, it's rarely used by
Ruby's own internals. This could be fixed just like Chicken; here too, the best way to do that would be
to generalize and eliminate all special cases. Simpler code is easier to secure.

A fundamental misunderstanding
When I reported the bug in the File class to the Ruby project, they quickly had a fix, but unfortunately
they seemed uninterested in going through Ruby's entire code to fix all string conversions (quoting
from private e-mail conversation):
> I agree that this looks like a good place to fix the File/IO
> class, but there are many other places where strings are passed to C.
> Are all of those secured?
All path names should be converted with "to_path" method if possible.
If any methods don't obey the rule, it is another bug. Please let us
know if you find such case.

In retrospect, there is the possibility that I didn't quite make myself clear enough. Perhaps this person
thought I was referring to other path strings in the code. However, to me it sounds a lot like they made
the same conceptual mistake that the PHP team made when they "fixed" NUL injections.
The PHP solution was to add a special "p" flag for converting path strings. This happens for all PHP
functions declared in C (via zend_parse_parameters()). By the way, notice how this is a new
flag. There probably are tons of PHP extensions out there which aren't using this flag yet. Also, who
can verify that they managed to find all the strings in PHP which represent paths?
The PHP team was completely missing the point here. This fix means that path arguments aren't
allowed to have embedded NUL bytes. Other string type arguments are not checked. They are missing
the fact that this isn't just a path issue. Rather, as I described before, it's a fundamental mismatch at the
language boundary where strings are translated from the host language to C. However, there seems to
be a widespread belief that this can only be exploited in path strings.
I'm not entirely sure why this is, but I can guess. First off, "poisoned NUL byte" attacks have been
popularized by a 1999 Phrack article. This article shows a few attacks, but only the path examples are
really convincing. Of course, another reason is that injecting NUL bytes in path strings really is the
most obvious and practical way to exploit web scripts.
Recently, however, different NUL byte attacks have been documented. For example, they can be used
to truncate LDAP and SQL queries and to bypass regular expression filters on SQL input, but you could
argue these are all examples of failure to escape correctly. I found a more convincing example in the
(excellent!) book The Tangled Web: it contains a one-sentence warning about using HTML sanitation C
libraries from other languages. Also, NUL bytes can sometimes be used to hide attacks from log files.
However, the most impressive recent exploit is without a doubt this common vulnerability in SSL
certificate verification systems. In an attack, an embedded NUL byte causes a certificate to be accepted
for "www.paypal.com", when the CN (Common Name) section (that is, the server's hostname) actually
contains the value "www.paypal.com\0.thoughtcrime.org". Certificate authorities generally just
accepted this as a valid subdomain of "thoughtcrime.org", ignoring the NUL byte. Client programs
(like web browsers) tended to use C string comparison functions, which stop at the NUL byte. Luckily,
this was widely reported, and has been fixed in most programs.
I believe that NUL byte mishandling represents a big and mostly untapped source of vulnerabilities.
High-level languages are gaining popularity over C for client-side programs, but many crucial libraries
are still written in C. This combination means that the problem will grow unless this is structurally
fixed in language implementations.

Structurally fixing injection bugs Posted on 2012-09-23
The two biggest threats to the web are caused by the same underlying mistake. It is time this problem
was fixed at its root. This article will attempt to provide the tools to do so.

Input sanitation, input filtering or output escaping?


The Open Web Application Security Project (OWASP) does a great job at educating people and
suggesting practical solutions to avoid common weaknesses. Unfortunately, most security bloggers
focus on vulnerabilities rather than the prevention of attacks, and those that do often give bad advice.
For example, common advice is to avoid XSS (cross-site scripting) and SQL injection bugs by
"sanitizing" or "validating" input. Now, by itself this is good advice.
Unfortunately, the phrase "sanitize your inputs" is often misunderstood and the advice itself can be
misguided. For example, Chris Shiflett says:
If you reject [anything but alphanumerics], Berners-Lee and O'Reilly will be
rejected, despite being valid last names. However, this problem is easily
resolved. A quick change to also allow single quotes and hyphens is all you
need to do. Over time, your input filtering techniques will be perfected.

I think this advice is a little unhealthy. Those are valid names, and rejecting them will only scare away
customers and reinforce the idea that the "security Nazis" are out to inconvenience people. I wish
people would place less emphasis on filtering and sanitizing. Citing this XKCD comic has become a
cliché, which (while funny) makes it worse.

Validating and sanitizing input is good when it refers to parsing input into meaningful values
immediately when receiving it, so that you don't, say, get a URL when you are expecting an integer.
The horror story of ROBCASHFLOW shows how important input restrictions can be (but see also this
cautionary list. Tl;dr: you're doomed either way).
However, input sanitation will (in general) not prevent XSS or SQL injection. The OWASP XSS
prevention "cheat-sheet" recognizes input validation and sanitation for what it is: a good secondary
security measure in a broader "defense in depth" strategy.
Instead, there are only two correct ways to prevent "injection" bugs. The best, which is often omitted from advice entirely, is to avoid the problem altogether (see below). The other is to escape output.
Unfortunately, advice to escape often seems to imply that you should manually call escaping
procedures; "just use mysql_real_escape_string()". This is a very bad idea; it's tedious, it's
easy to forget, it makes code less readable and it requires everyone working on the code to be equally
informed about security issues (a great idea, but not very realistic).
Let's investigate how we can prevent these vulnerabilities easily and automatically. This will help us
secure applications in a structural rather than an ad-hoc way.

The trouble with strings


The underlying problem of all these vulnerabilities is that a tree structure (e.g., the SQL script's AST or
the HTML DOM tree) is represented as a string, and user input which should be a node in the tree is
inserted into this string. If this includes characters from the meta-language which describes the tree's
structure, it can influence that structure. Here's an example:
<p>{username} said the following: {message}</p>
When message is "So you see, if a<b and c<a, then b>c.", you get output like this (depending on the
browser, HTML version and phase of the moon):
Math teacher said the following: So you see, if ac.

This code is simply incorrect, and this bug will frustrate users like the math teacher. But this can turn
into a security nightmare; any punk can make you look like a fool by making your images dance
around, taking over your users' sessions by stealing cookies, or do much worse. The underlying reason
this nonsense is possible at all is the fact that you are mixing user input strings with HTML.
In other words, you're performing string surgery on the serialized representation of a tree structure.
Just stop and think how insane that really sounds! Why don't we use real data types? While researching
this topic, I found an insightful article called "Safe String Theory for the Web". The author has a good
grasp on the problem and comes close to the solution, but he never transcends the idea of representing
everything as a string.
Many people don't, so despite the flawed concept, there are several solutions that take string splicing as
a given. For instance, some frameworks have a sort of "safe HTML buffers", which automatically
HTML-escape strings. These solutions don't deal with the context problem from "Safe String Theory
for the Web". There's only one built-in string type, and making it context-aware is extremely hard,
maybe even impossible. Strongly typed languages have an advantage here, though!
Representing HTML as a tree helps prevent injection bugs, and has other advantages over automatic
escaping. For example, we need to worry less about generating invalid HTML; our output is always
guaranteed to be well-formed. The essence of an XSS attack is that it breaks up your document
structure and re-shapes it. These are just two sides of the same coin: by taking control of the HTML's shape, XSS is also avoided.
There's another, more insidious problem with splicing HTML strings, which I haven't seen discussed
much either. It's another context problem; if your complex web application contains many "helper"
functions, it becomes very hard to keep track of which helper functions accept HTML and which accept
text. For example, is the following PHP function safe?
function render_latest_topicslist() {
    $out = '';
    foreach (Forum::latestPosts(10) as $topic) {
        $link = render_link('forum/show/'.(int)$topic['id'], $topic['title']);
        $out .= "<li>{$link}</li>";
    }
    return "<ul id=\"latest-topics\">{$out}</ul>";
}

This is (of course) a trick question. Consider:


$dest = htmlspecialchars($dest, ENT_QUOTES, 'UTF-8');
echo render_link($dest_url, "<span>Go to <em>{$dest}</em> directly.</span>");

Either this second example is wrong and the tags will come out literally (i.e., as
&lt;span&gt;...&lt;/span&gt; in the HTML source), or the first example was wrong and you
have an injection bug. You can't tell without consulting render_link's API documentation or
implementation. With many helper procedures, how can you keep track of which accept fully formed
HTML and which escape their input? What happens when a function which auto-encodes suddenly
needs to be changed to accept HTML?
This style of programming results in ad-hoc security. You add escaping in just the right places, decided
on a case-by-case basis. This is unsafe by default; you must remember to escape, which makes it error-
prone. It's also hard to spot mistakes in this style. The alternative to ad-hoc security is structural
security: a style which makes it virtually impossible to write insecure code by accident, thus
eliminating entire classes of vulnerabilities.
For example, in PHP we could use the DOM library to represent an HTML tree:
function get_latest_topicslist($document) {
    $ul = $document->createElement('ul');
    $ul->setAttribute('id', 'latest-topics');

    foreach (Forum::latestPosts(10) as $topic) {
        $title = $document->createTextNode($topic['title']);
        $link = get_link($document, 'forum/show/'.(int)$topic['id'], $title);

        $li = $document->createElement('li');
        $li->appendChild($link);
        $ul->appendChild($li);
    }
    return $ul;
}

And the second example:


$contents = $document->createElement('span');
$contents->appendChild($document->createTextNode('Go to '));
$em = $document->createElement('em');
$em->appendChild($document->createTextNode($dest));
$contents->appendChild($em);
$contents->appendChild($document->createTextNode(' directly.'));
$link = get_link($document, $dest_url, $contents);

Unfortunately, this code is very verbose. The stuff that really matters gets lost in the noise of DOM
manipulation. The advantage is that this is safe; text content cannot influence the tree structure, since
the type of every function argument is enforced to be a DOM object and string contents are
automatically XML-encoded on output.

Language design to the rescue!


Language design can help a great deal to improve security. For example, domain-specific languages
like SXML and SSQL can save the programmer from having to remember to escape while writing most
"normal", day-to-day code. This frees precious brain cycles to think about more essential things, like
the program's purpose. Here's the example again, using SXML:
(define (latest-topics-list)
  `(ul (@ (id "latest-topics"))
       ,(map (lambda (topic)
               `(li ,(make-link `("forum" "show" ,(alist-ref 'id topic))
                                (alist-ref 'title topic))))
             (forum-latest-posts 10))))

And the second example:


(make-link destination-url `(span "Go to " (em ,destination) " directly."))

This code is safe from XSS, like the PHP DOM example. However, this code is (to a Schemer) just as
readable as the naive PHP version. And, most importantly, the safety is achieved without any effort
from the programmer.
This shows the immense safety and security advantages that can be gained from language design. Of
course, this isn't completely foolproof: We still need to ensure URIs used in href attributes have an
allowed scheme like http: or ftp: and not, say, javascript:. Note that input filtering and
sanitation can help in situations like these! Also, just like with automatic escaping, strings in sub-
languages (like JS or CSS) aren't automatically escaped. However, there is less "magic" involved; this
is a representation for HTML, so it's obvious that only HTML meta-characters will be encoded. If we're
also using DSLs for sub-languages, this auto-escaping effect can be nested, solving the "context
problem" in a way automatic escaping cannot.
SXML rewards programmers for writing safe code by making it look clean, concise, and easy to write.
String splicing looks ugly and verbose in Scheme. In plain PHP, string splicing looks clean and simple, while DOM manipulation looks ugly. This subtly guides programmers into writing unsafe code. However,
there are some PHP libraries that make safe code look clean. For example, Drupal has a "Forms API".
It's a little ugly, but it's idiomatic in Drupal, which means code that uses it is considered cleaner than
code that doesn't. Facebook is another attractive target for attackers, so they had to come up with a
structural solution. Their solution is a language extension called XHP which adds native support for
HTML literals.
These solutions are all specific to some codebase, not part of basic PHP. A framework or an existing
codebase has "default" libraries, but when writing from scratch most programmers prefer to use what's
available in the base language. This means a language should only include libraries that are safe by
default. Otherwise, alternative safe libraries have to compete with the standard ones, which is an unfair
disadvantage!

Sidestepping the SQL injection problem entirely


Even though it's possible to write safe code in almost any language if you try hard enough, the basic
design of a language itself subtly influences how people will program in it by default. Consider the
following example, using the Ruby PG gem:
# This code is vulnerable to SQL injection if the variables store user input
res = db.query("SELECT first, last FROM users "
               "WHERE login = '#{login}' "
               "AND customer = '#{customer}' "
               "AND department = '#{department}'")
Here we're using string interpolation, which is the expansion of variable names within a string. We saw
this before, in PHP, but in Ruby you can drop back to the full language, which makes the safe solution a
little easier to write:
# This code is safe
res = db.query("SELECT first, last FROM users "
               "WHERE login = '#{db.escape_string(login)}' "
               "AND customer = '#{db.escape_string(customer)}' "
               "AND department = '#{db.escape_string(department)}'")

Still, it looks uglier than the first example.


The documentation says the escape_string method is considered deprecated. That's because
sidestepping the problem entirely is much smarter than escaping. This is done by passing the user-
supplied values completely separate ("out of band") from the SQL command string. This way, the data
can't possibly influence the structure of the command. They are kept separate even in the network
protocol, so it is enforced all the way up into the server. As an added bonus, this is only slightly more
verbose than the naive version:
# This code is even safer
res = db.query("SELECT first, last FROM users "
               "WHERE login = $1 AND customer = $2 AND department = $3",
               [login, customer, department])

This scales only to about a dozen parameters. With more, it becomes hard to mentally map the correct
parameter to the correct position. A DSL can do this automatically for you. For example, Microsoft's
LinqToSQL language extension seems to do this. SSQL currently auto-escapes, but it could
transparently be changed to use positional parameters.

Pervasive (in)security through (bad) design


I'm not a native English speaker, so I looked up the word "interpolation" on Merriam-Webster:
interpolate, transitive verb:
To alter or corrupt (as a text) by inserting new or foreign matter

To corrupt, indeed!
Interpolation of user-supplied strings is rarely correct, and it puts almost any conceivable safe API at a
disadvantage by making the wrong thing easier and shorter to write than the right thing. Beginners,
unaware of the security risks, will try to use this "neat feature". It's put in there for a reason, right?
Some people are trying to fix string interpolation, which is a noble goal but I wouldn't expect this to be
adopted as the "native" string interpolation mechanism in a language any time soon.
The Ruby examples show the importance of good documentation and library design. The docs pointed
us in the right direction by marking the escape_string method as deprecated. Its good design is
more apparent when contrasting it with the MySQL gem. This has no support for positional arguments
in query, having only escape_string and prepare. The latter allows you to pass parameters
separately, but it conflates value separation with statement caching and has an unwieldy API. Finally,
the docs are quite sparse. Taken together, this all gently nudges developers into the direction of string
interpolation by making that the easiest way to do it. Much of this is due to the design of MySQL's wire
protocol, which dictates the API of the C library, which in turn guides the design of "high-level"
libraries built on top of it.
I think high-level libraries should strive to abstract away unsafe or painful aspects of the lower levels.
For example, the Chicken MySQL-client egg emulates value separation:
(conn (string-append "SELECT first, last FROM users "
                     "WHERE login = '?login' "
                     "AND customer = '?cust' "
                     "AND department = '?dept'")
      `((?login . ,login) (?cust . ,customer) (?dept . ,department)))

Ruby's MySQL gem could easily have done this, but they chose to restrict themselves to making a thin
layer which maps closely to the C library.
Not all is lost with crappy libraries: Abstractions can solve such problems at an even higher level. Rails
can safely pass query parameters via Arel, in a database-independent way, even though MySQL is one
of the back-ends. This is true for SQLAlchemy, PDO and many others.

Other examples
This section will show more examples of the same bug. They can all be structurally solved in two
simple ways: Automatic escaping (by using proper data structures) or passing data separately from the
control language. But let's start with one where this won't work :)

Poisoned NUL bytes


As you may know, strings in the C language are pointers to a character array terminated by a NUL
(ASCII 0) byte. Many other languages represent strings as a pointer plus a length, allowing NUL
"characters" to simply occur in strings, with no special meaning.
This representational mismatch can be a problem when calling C functions from these languages. In
many cases, a C character array of the length of the string plus 1 is allocated, the string contents are
copied from the "native" string to the array and a NUL byte is inserted at the end. This causes a
reinterpretation of the string's value if it contains a NUL byte, which opens up a potential vulnerability
to a "poisoned" NUL byte attack.
Let's look at a toy example in Chicken Scheme:
(define greeting "hello\x00, world!")

(define calculate-length-in-c
(foreign-lambda int "strlen" c-string))

(print "Scheme says: " (string-length greeting))


(print "C says: " (calculate-length-in-c greeting))
As far as Scheme is concerned, the NUL byte is perfectly legal and the string's length is 14, but for C,
the string ends after hello, which makes for a length of 5. There is no way in C to "escape" NUL
bytes, and we can't sidestep it here, either. Our only option is to raise an error:
Scheme says: 14

Error: (##sys#make-c-string) cannot represent string with
NUL bytes as C string: "hello\x00, world!"

This is a good example of structural security; it doesn't matter whether the programmer is caffeine-
deprived, on a tight deadline or simply unaware of this particular vulnerability. He or she is protected
from accidentally making this mistake because it's handled at the boundary between C and Scheme,
which is exactly where it should be handled.

HTTP response splitting/Header injection


HTTP response splitting and HTTP header injection are two closely related attacks, based on a single
underlying weakness.
The idea is simple: HTTP (response) headers are separated by a CRLF combination. If user input ends
up in a header (like in a Location header for a redirect), this can allow an attacker to split a header in
two by putting a separator in it. Let's say that http://example.com/foo gets redirected to http://example.com/blabla?q=foo.

An attacker can trick someone (or their browser) into following this link (%0d%0a is an URI-encoded
CRLF pair):
http://www.example.com/abc%0d%0aSet-Cookie%3a%20SESSION%3dwhatever-i-want

This could cause the victim's session cookie for example.com to be overwritten:
Location: http://www.example.com/blabla?q=abc
Set-Cookie: SESSION=whatever-i-want

This is a session fixation attack. For this particular bug, the real solution is of course to properly
percent-encode the destination URI, but the general solution can be as simple as disallowing newlines
in the header-setting mechanism (e.g., PHP does this since 5.1.2). Doing it in the only procedure which
is capable of emitting headers is a structurally secure approach, but it won't protect against all attacks.
For example, even if we disallow newlines it is still possible to set a parameter (attribute) or a second
value for a header, splitting it with a semicolon or a comma, respectively:
Accept: text/html;q=0.5, text/{user-type}

If this is done unsafely, extra content-types can be added. They can even be given preference:
Accept: text/html;q=0.5, text/plain;q=0.1, application/octet-stream;q=1.0

Protecting against these sorts of attacks can only be done with libraries which know each header's
syntax and use rich objects to represent them. This approach is taken by intarweb and Guile's HTTP
library, and is similar to representing HTML as a (DOM) tree. I'm not aware of any other libraries
which use fully parsed "objects" to represent HTTP header values.

Running subprocesses
For some reason, people often use a procedure like system() to invoke subprocesses. It's the most
convenient way to quickly run a program just like you would from the command line. It will pass this
string to the Unix shell, which expands globs ("wildcards") and runs the program:
(system (sprintf "echo \"~A\"" input)) ;; UNSAFE: byebye files"; rm -rf / "

Several languages have specialized syntax for invoking the shell and putting the output in a string using
backticks, e.g., `echo hi`. The really bad part is that string interpolation is supported within the
backtick operator, e.g., `echo Hi, "{$name}"`. This is dangerous because the shell is yet
another interpreter with its own language, and we've learned by now that we shouldn't embed user input
directly into a sublanguage. Here too, string interpolation makes the wrong thing very convenient,
which increases the risk of abuse and bugs. After all, spaces and quotes are perfectly legal inside
filenames, but when used with unsafely interpolated parameters, they will cause errors.
It is possible to escape shell arguments, but it's very tricky: no two shells provide exactly the same
command language with the same meta-characters. Is your /bin/sh really bash, dash, ash, ksh or
something else? It is even unspecified whether the sh used is /bin/sh.

However, a better approach is often available. Many programming languages offer an interface to one
or more members of the POSIX exec() function family. These allow passing the arguments to the
program in a separate array, and they don't go through the shell to invoke the program at all. This is
faster and a lot more secure.
(use posix)
;; Accepts a list of arguments:
(process "echo" (list "Hello, " first-name " " last-name))

By sidestepping the problem we've made it simpler, shorter than the system call above and safer, which
is our goal. In languages with string interpolation this will probably be slightly more verbose than the
system() version.

There is one small problem: by eliminating a call to the shell, we've also lost the ability to easily
construct pipelines. This can be done by calling several procedures, but this is way more complicated
than it is in the shell language. The obvious solution to that is to design a safe DSL. This is what the
Scheme Shell does with its notation for UNIX pipelines:
;; This will echo back the input, regardless of "special" characters
(define output (run/string (| (echo input) (caesar 5) (caesar 21))))
(display output)

Almost as convenient as the backtick operator, but without its dangers.


Summary
Language design can help us write applications which are structurally secure. We should strive to make
writing the right thing easier than the wrong thing so that even naively written, quick and dirty code has
some chance of being safe. To reach this goal, we can use the following approaches, in roughly
decreasing order of safety:
• "Sidestep" the issue by keeping data separated from commands.
• Represent data in proper data structures, not strings. On output, escape where needed.
• Use "safe buffers" which auto-escape concatenated strings.
• If escaping or separation is impossible, raise an error on bad data.
• If all else fails you can escape manually, but use coding conventions that make unsafe code
stand out.
These approaches are your first line of defense. Besides using these, you should also filter and sanitize
your input. Just don't mistake that as the fix for injection vulnerabilities!
This is the positive advice I can give you. The negative advice is simply to avoid building language or
library features which make unsafe code easier to write than safe code. An example of such a feature is
string interpolation, which causes more harm than good.
