Newsgroups: comp.lang.scheme
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!europa.eng.gtefsd.com!howland.reston.ans.net!EU.net!sunic!news.kth.se!news.ifm.liu.se!liuida!mikpe
From: mikpe@ida.liu.se (Mikael Pettersson)
Subject: Baker's "Cheney on the M.T.A." CPS->C translation
Message-ID: <1994Oct7.023308.5041@ida.liu.se>
Summary: Elegant, but doesn't quite cut it.
Sender: news@ida.liu.se
Organization: Department of Computer Science, University of Linköping
Date: Fri, 7 Oct 1994 02:33:08 GMT
Lines: 60

Early last February, Henry Baker posted an intriguing article on
translating CPS (continuation-passing style) to C. About a month
ago, I mentioned that I had tried this idea in a compiler of mine,
and that it appeared to work ok on the SPARC. Now I have ported
my runtime system to a number of different architectures (SPARC,
MIPS, PowerPC, HP-PA, and Alpha), and can summarize my experiences
to this group.

First some terminology:

Baker called his article "CONS Should Not CONS Its Arguments, Part II:
Cheney on the M.T.A.". Since the basic idea is to use the C stack
as the younger generation of a generational copying collector,
implement allocation by local variables, and use recursion for
tailcalls, I call this the "pushy" scheme.
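
To make the "pushy" idea concrete, here is a minimal sketch in C. It is
not Baker's code nor my runtime system: every name in it (pushy_sum,
call, STACK_BUDGET, and so on) is made up for illustration, the only
"live data" are two integer arguments, and the "collection" merely
copies them into static variables before longjmp() throws the C stack
away. The stack-depth probe compares addresses of locals, which is not
portable C. An optimizing compiler may also turn the calls into jumps,
in which case the stack never fills up and no collection happens.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdint.h>

#define STACK_BUDGET (64 * 1024)  /* hypothetical nursery size, in bytes */

typedef void (*cont_fn)(long long n, long long acc);

static jmp_buf trampoline;        /* re-entry point at the stack bottom */
static char *stack_base;          /* address near the bottom of our stack */
static cont_fn saved_fn;          /* "older generation": survives longjmp */
static long long saved_n, saved_acc, result;
static int minor_collections;

/* Non-portable probe: how far has the C stack moved from stack_base? */
static int stack_exhausted(void) {
    char probe;
    uintptr_t here = (uintptr_t)&probe, base = (uintptr_t)stack_base;
    uintptr_t used = here < base ? base - here : here - base;
    return used > STACK_BUDGET;
}

/* Every "tailcall" goes through here and pushes a new C frame. */
static void call(cont_fn fn, long long n, long long acc) {
    if (stack_exhausted()) {
        /* Minor collection: evacuate the live data, then drop the stack. */
        saved_fn = fn; saved_n = n; saved_acc = acc;
        minor_collections++;
        longjmp(trampoline, 1);
    }
    fn(n, acc);                   /* in this scheme, control never comes back */
}

static void sum_done(long long n, long long acc) {
    (void)n;
    result = acc;
    longjmp(trampoline, 2);       /* final continuation: leave the program */
}

static void sum_loop(long long n, long long acc) {
    if (n == 0) call(sum_done, 0, acc);
    else        call(sum_loop, n - 1, acc + n);
}

/* Driver: sums 1..n using only calls that never return normally. */
long long pushy_sum(long long n) {
    char base;
    assert(n >= 0);
    stack_base = &base;
    minor_collections = 0;
    saved_fn = sum_loop; saved_n = n; saved_acc = 0;
    if (setjmp(trampoline) == 2)
        return result;
    /* Reached on first entry and again after every minor collection. */
    saved_fn(saved_n, saved_acc);
    return result;                /* not reached */
}
```

Calling pushy_sum(1000000) keeps the stack within roughly STACK_BUDGET
bytes no matter how many calls are made. A real implementation would
allocate heap objects as locals in each frame and copy the reachable
ones to the older generation instead of saving two integers.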

The "dispatch" scheme uses the usual interpretive solution: each
lambda becomes a parameterless C function, arguments are passed
via global variables, a dispatch function sits in a tight loop calling
a function pointer, getting back a new function pointer, calling
this one, and so on. A separate heap takes care of dynamic allocation.
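
As an illustration of the "dispatch" scheme, here is a minimal sketch
in C. The names (label, dispatch, dispatch_sum, arg_n, arg_acc) are
hypothetical and not taken from any particular compiler. A function
pointer is wrapped in a struct because a C function type cannot
directly return a pointer to its own type. The sketch leaves out the
separate heap entirely.

```c
#include <assert.h>
#include <stddef.h>

/* A code label: a parameterless function that returns the next label. */
typedef struct label label;
struct label { label (*fn)(void); };

/* Arguments are passed in global variables, not as C parameters. */
static long long arg_n, arg_acc, result;

static label sum_loop(void);
static label sum_done(void);

static label sum_loop(void) {
    if (arg_n == 0)
        return (label){ sum_done };
    arg_acc += arg_n;
    arg_n -= 1;
    return (label){ sum_loop };   /* "tailcall": hand back the next function */
}

static label sum_done(void) {
    result = arg_acc;
    return (label){ NULL };       /* NULL tells the dispatcher to stop */
}

/* The dispatcher: call a label, get a new label back, repeat. */
static void dispatch(label next) {
    while (next.fn != NULL)
        next = next.fn();
}

/* Driver: sums 1..n with a C stack depth that does not depend on n. */
long long dispatch_sum(long long n) {
    assert(n >= 0);
    arg_n = n;
    arg_acc = 0;
    dispatch((label){ sum_loop });
    return result;
}
```

Each generated function returns to the dispatcher before the next one
is entered, so dispatch_sum(1000000) uses the same amount of C stack
as dispatch_sum(1). The price is one indirect call and one return per
tailcall.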

Ok, so how well did the "pushy" scheme do for my benchmarks?

On the SPARC and MIPS, the pushy scheme appears to be slightly
faster (a few % on the SPARC, perhaps 30% on the MIPS) for short-
running code (less than 10-15 seconds). However, for longer-running
code the "dispatch" scheme wins by 10-30%.

On the PowerPC and HP-PA, the dispatching scheme always wins
by about 25-30%.

Finally, on the Alpha, the dispatching scheme consistently wins
by about a factor of 4.

I do not yet fully understand why the pushy scheme is so much slower on
the Alpha, but I note that operating system conventions on the Alpha
(OSF), HP-PA (HP-UX), and PowerPC (PARIX, don't worry if you haven't
heard of it) impose serious overheads on both direct and indirect
function calls. Stack usage was also high due to largish fixed-layout
stack frames; this causes frequent (minor) collections.

While the SPARC did reasonably well, this only happened after using
some fairly hairy inline assembly hacks with the GNU C compiler.
(I also tried GCC's "-mflat" option, and it was a complete disaster,
causing a slow-down by almost a factor of two. Also its
"__attribute__((noreturn))" declaration was next to useless.)

So it appears that the pushy scheme, despite its elegance and
apparent simplicity, doesn't mix well with current HW
and operating systems.

Finally, a question: does anyone know of any other interesting approaches
to the tailcall problem in languages like C? (Apart from "the BIG switch",
which isn't really practical.) I know that the Japanese ICOT KL1->C
compiler KLIC uses "the largish switch" within modules, and a dispatcher
loop to connect modules.
-- 
Mikael Pettersson, Dept of Comp & Info Sci, Linkoping University, Sweden
email: mpe@ida.liu.se or ...!{mcsun,munnari,uunet,unido,...}!sunic!liuida!mpe
