Newsgroups: comp.lang.scheme
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!MathWorks.Com!panix!ddsw1!redstone.interpath.net!hilbert.dnai.com!agate!news.ucdavis.edu!csus.edu!netcom.com!hbaker
From: hbaker@netcom.com (Henry G. Baker)
Subject: Re: Baker's "Cheney on the M.T.A." CPS->C translation
Message-ID: <hbakerCxrzJA.439@netcom.com>
Organization: nil
References: <1994Oct7.023308.5041@ida.liu.se> <hbakerCxCzC4.EIB@netcom.com> <1994Oct14.230215.6345@ida.liu.se>
Date: Sun, 16 Oct 1994 17:06:46 GMT
Lines: 84

In article <1994Oct14.230215.6345@ida.liu.se> mikpe@ida.liu.se (Mikael Pettersson) writes:
>In article <hbakerCxCzC4.EIB@netcom.com> hbaker@netcom.com (Henry G. Baker) writes:
>>You didn't mention how big the stack buffer is in your tests.  I found
>>that the optimum size of the stack buffer varied from architecture to
>>architecture (which is why my test program provides this as a command
>>line argument).  Since the stack buffer is where all of the
>>allocation/consing is done (it is the first generation in a
>>generational collector), its size relative to any hardware caches is
>>critical -- e.g., if the size of the stack buffer exceeds that of the
>>cache, then I would expect a fair amount of thrashing.
>>
>>On various architectures, I found optimum stack buffer sizes to range
>>from as low as 6K bytes to as high as 512K bytes.
>
>The short story is that, assuming the stack buffer in the pushy scheme
>is the same size as the youngest generation (nursery) in the dispatcher
>scheme, the stack overflows 23-29 TIMES more often than the nursery.
>Hence, 23-29 times more frequent minor collections.  There's also a
>higher survival rate, since fewer young objects have time to die before
>they are caught in a minor collection.

Due to the significant differences in the amount of 'allocation' and
the differing retention rates, I would think that the optimum size for
the nursery of the dispatcher scheme would be quite different from the
optimum size for the pushy stack buffer -- even on the very same HW
architecture.

>Software conventions cause C stacks to have an appallingly high `idle' cost.
>(In `address space', not cycles.) There are slots for links to previous
>frames, callee-saved registers, return addresses, and `mystery' slots
>with no apparent use. But they all tend to be there, regardless of whether
>the active function needs them or not. (Few of these slots are ever
>_written_ to, of course.)

I have found this, as well.  The KSR is even an order of magnitude
worse than the machines you mentioned.  The key to the pushy scheme is
that 1) the architecture should pay no penalty for large stack frames
which write only a small fraction of their entries, and 2) the C
compiler should optimize away as many of the writes to 'dead'
locations within the stack frame as possible.  Thus, one could march
through the stack buffer at a furious rate, but this doesn't cost very
much if not very much is written or read.

>For one benchmark, code compiled using the `dispatcher' scheme performed
>15 minor collections (no major) and 1196719 tailcalls during a short run
>of 1-3 seconds (depending on the machine). The nursery size was 128K words,
>0.5MB on most machines, 1MB on the Alpha; this is probably too large...
>
>The following table shows the behaviour of the same benchmark, compiled
>in the pushy style and using a 128K word stack buffer:

As I said above, I believe the optimum size for the dispatch nursery and
that for the pushy stack buffer will probably be different.

>I find it hard to believe that reducing the stack buffer size, to have
>it fit in some measly cache, is going to improve on these figures..
>
>The reason for this is that, although lots of memory is being allocated,
>each individual CPS lambda allocates only a few words, if it allocates at all.
>So much C stack space is lost simply due to its high `idle' cost.

So what?  If only a little bit has been allocated, then only a little bit
has to be copied out, so that part is correspondingly cheaper.

>>I also found substantial differences among architectures with regard
>>to the way that setjmp/longjmp were handled.
>
>For each machine I've ported the runtime system to, there's a
>configuration file that describes, among other things, what the
>fastest setjmp/longjmp pair on that machine is.

That would be interesting information to publish, just by itself.

----

As I said in my original posting, there is nothing in the
cheneymta/pushy scheme to preclude doing tail-call optimizations
within the Scheme/Lisp compiler, particularly for 'leaf' loops such as
are found in Fortran & C, which do no allocation.  There is no way
that 'pushy' can compete with optimized loops of this sort.

      Henry Baker
      Read ftp.netcom.com:/pub/hbaker/README for info on ftp-able papers.

