Newsgroups: comp.lang.scheme
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!MathWorks.Com!europa.eng.gtefsd.com!howland.reston.ans.net!pipex!sunic!umdac!fizban.solace.mh.se!news.ifm.liu.se!liuida!mikpe
From: mikpe@ida.liu.se (Mikael Pettersson)
Subject: Re: Baker's "Cheney on the M.T.A." CPS->C translation
Message-ID: <1994Oct14.230215.6345@ida.liu.se>
Sender: news@ida.liu.se
Organization: Department of Computer Science, University of Linköping
References: <1994Oct7.023308.5041@ida.liu.se> <hbakerCxCzC4.EIB@netcom.com>
Date: Fri, 14 Oct 1994 23:02:15 GMT
Lines: 84

In article <hbakerCxCzC4.EIB@netcom.com> hbaker@netcom.com (Henry G. Baker) writes:
>You didn't mention how big the stack buffer is in your tests.  I found
>that the optimum size of the stack buffer varied from architecture to
>architecture (which is why my test program provides this as a command
>line argument).  Since the stack buffer is where all of the
>allocation/consing is done (it is the first generation in a
>generational collector), its size relative to any hardware caches is
>critical -- e.g., if the size of the stack buffer exceeds that of the
>cache, then I would expect a fair amount of thrashing.
>
>On various architectures, I found optimum stack buffer sizes to range
>from as low as 6K bytes to has high as 512K bytes.  

The short story is that, assuming the stack buffer in the pushy scheme
is the same size as the youngest generation (nursery) in the dispatcher
scheme, the stack overflows 23-29 TIMES more often than the nursery.
Hence, 23-29 times more frequent minor collections.  There's also a
higher survival rate, since fewer young objects have time to die before
they are caught in a minor collection.

Stack frames have the following sizes (all sizes are in bytes):
machine/compiler	no local vars	32 bytes of local storage
SPARC (SunPro cc)	96		128
SPARC (gcc)		112		144
PowerPC (Moto. cc)	64		96
HP-PA (HP cc)		64		128
HP-PA (gcc)		128		192
Alpha (DEC cc)		16		96
MIPS (DEC cc)		24		56

Software conventions cause C stacks to have an appallingly high `idle' cost.
(In `address space', not cycles.) There are slots for links to previous
frames, callee-saved registers, return addresses, and `mystery' slots
with no apparent use. But they all tend to be there, regardless of whether
the active function needs them or not. (Few of these slots are ever
_written_ to, of course.)

For one benchmark, code compiled using the `dispatcher' scheme performed
15 minor collections (no major) and 1196719 tailcalls during a short run
of 1-3 seconds (depending on the machine). The nursery size was 128K words,
0.5MB on most machines, 1MB on the Alpha; this is probably too large...

The following table shows the behaviour of the same benchmark, compiled
in the pushy style and using a 128K word stack buffer:

machine		#minor gc	(#minor gc)/15	time
SPARC		340		22.7		1.2s	
PowerPC		362		24.1		2.2s
HP-PA		433		28.9		2.1s
Alpha		342		22.8		4.4s
MIPS		344		22.9		2.3s

(Each of these also did one major collection near the end of the run,
since the higher survival rate started filling the second generation.)

This translates to roughly one minor gc every 4-13 milliseconds.
I find it hard to believe that reducing the stack buffer size to have
it fit in some measly cache is going to improve on these figures.

The reason for this is that, although lots of memory is being allocated,
each individual CPS lambda allocates only a few words, if it allocates at all.
(The 15 minor collections in the dispatcher scheme mean that about 7.5MB
was allocated (15MB on the Alpha). This is about 6.6 bytes per tailcall.)
So much C stack space is lost simply due to its high `idle' cost.

I believe this to be typical of highly symbolic languages (in my case,
a strange-looking logic-based language), but I'd like to see similar
measurements for Scheme and SML.

>I also found substantial differences among architectures with regard
>to the way that setjmp/longjmp were handled.

For each machine I've ported the runtime system to, there's a
configuration file that describes, among other things, what the
fastest setjmp/longjmp pair on that machine is.

>Another difference is that the 'pushy' scheme may end up doing twice
>as many checks for stack overflow as a non-pushy scheme, since stack
>overflows may be checked at both entrance _and_ exit from a function.

My `pushy' compiler only emits stack checks at function entries.
-- 
Mikael Pettersson, Dept of Comp & Info Sci, Linkoping University, Sweden
email: mpe@ida.liu.se or ...!{mcsun,munnari,uunet,unido,...}!sunic!liuida!mpe
