Newsgroups: comp.lang.scheme
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!news.kei.com!hookup!ames!pacbell.com!amdahl!netcomsv!ix.netcom.com!netcom.com!hbaker
From: hbaker@netcom.com (Henry G. Baker)
Subject: Re: Baker's "Cheney on the M.T.A." CPS->C translation
Message-ID: <hbakerCxCzC4.EIB@netcom.com>
Organization: nil
References: <1994Oct7.023308.5041@ida.liu.se>
Date: Sat, 8 Oct 1994 14:38:28 GMT
Lines: 122

In article <1994Oct7.023308.5041@ida.liu.se> mikpe@ida.liu.se (Mikael Pettersson) writes:
>Early last February, Henry Baker posted an intriguing article on
>translating CPS (continuation-passing style) to C. About a month
>ago, I mentioned that I had tried this idea in a compiler of mine,
>and that it appeared to work ok on the SPARC. Now I have ported
>my runtime system to a number of different architectures (SPARC,
>MIPS, PowerPC, HP-PA, and Alpha), and can summarize my experiences
>to this group.
>
>First some terminology:
>
>Baker called his article "CONS Should Not CONS Its Arguments, Part II:
>Cheney on the M.T.A.". Since the basic idea is to use the C stack
>as the younger generation of a generational copying collector,
>implement allocation by local variables, and use recursion for
>tailcalls, I call this the "pushy" scheme.
>
>The "dispatch" scheme uses the usual interpretive solution: each
>lambda becomes a parameterless C function, arguments are passed
>via global variables, a dispatch function sits in a tight loop calling
>a function pointer, getting back a new function pointer, calling
>this one, and so on. A separate heap takes care of dynamic allocation.
>
>Ok, so how well did the "pushy" scheme do for my benchmarks?
>
>On the SPARC and MIPS, the pushy scheme appears to be slightly
>faster (a few % on the SPARC, perhaps 30% on the MIPS) for
>short-running code (less than 10-15 seconds). However, for
>longer-running code the "dispatch" scheme wins by 10-30%.
>
>On the PowerPC and HP-PA, the dispatching scheme always wins
>by about 25-30%.
>
>Finally, on the Alpha, the dispatching scheme consistently wins
>by about a factor of 4.
>
>I do not yet understand fully why the pushy scheme is so much slower on
>the Alpha, but I note that operating system conventions on the Alpha
>(OSF), HP-PA (HP-UX), and PowerPC (PARIX, don't worry if you haven't
>heard of it), impose serious overheads in both direct and indirect
>function calls. Stack usage was also high due to largish fixed-layout
>stack frames; this causes frequent (minor) collections.
>
>While the SPARC did reasonably well, this only happened after using
>some fairly hairy inline assembly hacks with the GNU C compiler.
>(I also tried GCC's "-mflat" option, and it was a complete disaster,
>causing a slow-down by almost a factor of two. Also its
>"__attribute__((noreturn))" declaration was next to useless.)
>
>So it appears that the pushy scheme, despite its elegance and
>apparent simplicity, doesn't mix well with current hardware
>and operating systems.

You didn't mention how big the stack buffer is in your tests.  I found
that the optimum size of the stack buffer varied from architecture to
architecture (which is why my test program provides this as a command
line argument).  Since the stack buffer is where all of the
allocation/consing is done (it is the first generation in a
generational collector), its size relative to any hardware caches is
critical -- e.g., if the size of the stack buffer exceeds that of the
cache, then I would expect a fair amount of thrashing.

On various architectures, I found optimum stack buffer sizes to range
from as low as 6K bytes to as high as 512K bytes.

Also, on larger programs, one might expect more access to non-stack
variables, in which case the stack buffer might have to be further
reduced to allow for more stuff outside the stack to become resident
in the cache.

I also found substantial differences among architectures with regard
to the way that setjmp/longjmp were handled.  In some architectures,
the only way to get any decent speed was to use a lower-level version
called _setjmp/_longjmp (or some such name).  Of course, since this is
used only when the stack buffer is contracted, its relative cost goes
down for larger stack buffers.  However, this cost is still
measurable, showing that longjmp performance is truly abysmal on some
machines.  This may explain the poor performance you found on the
Alpha.  (My measurements on the Alpha were extremely promising once I
had made this change, so I would be interested in understanding your
figures better.)

Another difference is that the 'pushy' scheme may end up doing twice
as many checks for stack overflow as a non-pushy scheme, since stack
overflows may be checked at both entrance _and_ exit from a function.
This can be optimized to some extent by doing checks only at function
entrance (assuming that the function returns only a
compiler-determinable amount of stuff).

You mentioned the problem with the SPARC register windows, which can be
essentially eliminated by making simple changes to the output of the
C compiler before it is assembled.

Finally, one of the optimizations that the 'pushy' scheme allows is
that of 'consing' in 'stack'-allocated variables, a la my 'CONS Should
Not CONS Its Arguments' paper.  Most C compilers and hardware architectures go to a
lot of trouble to optimize access to local variables (as opposed to
global variables), and the 'pushy' scheme tries to take advantage of
this.  This greatly reduces the overhead for extremely short-lived
objects -- e.g., flonum consing, complex num consing, (x,y,z) vector
consing in numerical routines, etc.  If one continues to truly cons
these objects in a traditional fashion, then one gives up a
significant source of optimization.

If you _know_, however, that a cons will live a very long time -- e.g.,
because it is being consed into a hash table -- then it might make more
sense to cons it directly and skip the intermediate step.  Andrew Appel
tells me that his ML 'hash cons' routine was slow, and I attribute this
to the fact that it does twice as much work as it needs to, because it
first conses, then copies to the proper place in the hash table.

Given the relatively small performance differences that you have
found between the two schemes, one would expect architectural
differences to dominate any comparison between them.

Your numbers also show that the penalty for going with the more
elegant 'pushy' scheme is not enormous, so many may opt for it on that
basis alone.

      Henry Baker
      Read ftp.netcom.com:/pub/hbaker/README for info on ftp-able papers.

