Newsgroups: comp.lang.scheme
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!yeshua.marcam.com!charnel.ecst.csuchico.edu!csusac!csus.edu!netcom.com!hbaker
From: hbaker@netcom.com (Henry G. Baker)
Subject: Re: CPS->C, pushy vs. dispatch-loop
Message-ID: <hbakerCvtM9A.AC3@netcom.com>
Organization: nil
References: <1994Sep7.132923.19702@ida.liu.se>
Date: Thu, 8 Sep 1994 17:07:58 GMT
Lines: 50

In article <1994Sep7.132923.19702@ida.liu.se> mikpe@ida.liu.se (Mikael Pettersson) writes:
>This data may be of interest to some implementors.
>
>I have modified a compiler, which uses CPS internally and emits
>reasonably portable ANSI-C code, to use the "pushy" CPS scheme
>described by Henry G. Baker in Feb. '94 in some comp.* groups.
>
>Summary:
>* Tests done on a single-user SPARCstation 5 with Solaris 2.3
>  and GCC 2.5.7.
>* The standard CPS->C mapping (STANDARD) using global variables for
>  parameter passing (and in my case a simulated stack for continuation
>  records), and a dispatch-loop function for tailcalls works well.
>* The portable "pushy" CPS->C code (PUSHY) is about 3 times _slower_
>  than the STANDARD code. Although parameters are passed in registers
>  instead of memory, and tailcalls go directly to their targets, the
                                                                  ^^^
>  SPARC loses due to the enormous number of register window overflows.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>  (The C compilers do not recognize that the register window is dead
>  after the jump to g below:
>	void f(...) { auto x; ... g(&x,...); return; }		)
>* Using GCC's inline assembly and register allocation declarations to
>  code up macros that do tailcalls without register window growth
>  (PUSHY+GCC), we gain about a factor of 3-4.5 over PUSHY. PUSHY+GCC
>  is typically 10-30% faster than STANDARD.
>
>I have not yet had the opportunity to run the tests on any other
>architectures than the SPARC, so the question "do we use STANDARD
>or PUSHY in our CPS->C compilers" is still left unanswered :-(
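
For readers who have not seen it, the STANDARD mapping summarized above can
be sketched roughly as follows. This is only an illustration, not the code
Mikael's compiler emits: the names (sum, sum_k, halt_k, kstack,
standard_sum) and the fixed-size continuation stack are all my own
assumptions. Arguments travel in global variables, continuation records live
on a simulated stack, and every tailcall returns the next code pointer to a
dispatch loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; a sketch, not actual compiler output. */
typedef void (*generic_fn)(void);
typedef generic_fn (*label_fn)(void);   /* a label returns the next label */

/* Global "registers" used for parameter passing. */
static long long arg_n;     /* argument to sum */
static long long ret_val;   /* value handed to a continuation */

/* Simulated stack of continuation records. */
#define KSTACK_MAX 1024
struct cont { label_fn code; long long saved_n; };
static struct cont kstack[KSTACK_MAX];
static int ksp;

static generic_fn sum_k(void);

/* sum(n) = n + sum(n - 1), sum(0) = 0, written in CPS. */
static generic_fn sum(void) {
    if (arg_n == 0) {
        ret_val = 0;
        return (generic_fn)kstack[ksp - 1].code;  /* invoke continuation */
    }
    assert(ksp < KSTACK_MAX);
    kstack[ksp].code = sum_k;        /* push a continuation record */
    kstack[ksp].saved_n = arg_n;
    ksp++;
    arg_n -= 1;
    return (generic_fn)sum;          /* "tailcall" via the dispatch loop */
}

/* Continuation: pop its own record, add the saved n, pass the value on. */
static generic_fn sum_k(void) {
    ksp--;
    ret_val += kstack[ksp].saved_n;
    return (generic_fn)kstack[ksp - 1].code;
}

/* Initial continuation: stop the dispatch loop. */
static generic_fn halt_k(void) {
    ksp--;
    return NULL;
}

long long standard_sum(long long n) {
    label_fn pc = sum;
    assert(n >= 0);
    ksp = 0;
    kstack[ksp++].code = halt_k;
    arg_n = n;
    while (pc != NULL)               /* the dispatch loop */
        pc = (label_fn)pc();
    return ret_val;
}
```

Calling standard_sum(100) runs the loop and yields 5050. Note that the C
stack never grows beyond one frame above the loop; the price is that every
argument goes through memory and every call is an indirect jump.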

I see that 'Cheney on the MTA' is too obscure.  Oh well, 'pushy' is
short and sweet.  :-)

Readers should be aware that the 3X slowdown is _unique_ to the SPARC
'register window' architecture.  The tests that I and a number of others
ran last spring on other architectures (Alpha, MIPS, 68K, and Intel, among
others) all showed competitive performance _without_ special function-call
hacking.  (The size of the stack buffer does have to be tuned to the
architecture, however, to achieve the best performance.)
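
To make the 'stack buffer' concrete, here is a minimal sketch of the pushy
scheme under some loud assumptions. All names (check_stack, evacuate,
pushy_sum, STACK_BUFFER_BYTES) are made up for illustration, the 16K buffer
size is arbitrary, and the "collector" only copies a linear chain of
continuation closures that are each used exactly once, so heap copies can be
freed after use; a real implementation would do a proper copying collection.
Functions never return: they allocate closures as C locals and call onward,
and when the C stack exceeds the buffer, the live closures are moved to the
heap and a longjmp resets the stack.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical names throughout; a sketch, not code emitted by a compiler. */
#define STACK_BUFFER_BYTES (16 * 1024)   /* the tunable stack buffer size */

struct clos;
typedef void (*code_fn)(struct clos *env, long long arg);
struct clos { code_fn code; long long n; struct clos *next; int on_heap; };

static jmp_buf trampoline;
static uintptr_t stack_base;
static code_fn resume_fn;            /* pending call saved across longjmp */
static struct clos *resume_env;
static long long resume_arg;
static long long result;
long pushy_collections;              /* how often the stack was reset */

/* Copy every closure in the chain that still lives on the C stack. */
static struct clos *evacuate(struct clos *k) {
    struct clos *head = NULL, **link = &head;
    while (k != NULL && !k->on_heap) {
        struct clos *copy = malloc(sizeof *copy);
        if (copy == NULL) abort();
        *copy = *k;
        copy->on_heap = 1;
        *link = copy;
        link = &copy->next;
        k = k->next;
    }
    *link = k;                       /* remainder is already on the heap */
    return head;
}

/* Run on entry to every function: if the buffer is full, save the call,
   evacuate its closures, and throw the whole C stack away. */
static void check_stack(code_fn fn, struct clos *env, long long arg) {
    char marker;
    uintptr_t here = (uintptr_t)&marker;
    uintptr_t used = here > stack_base ? here - stack_base : stack_base - here;
    if (used > STACK_BUFFER_BYTES) {
        resume_fn = fn;
        resume_env = evacuate(env);
        resume_arg = arg;
        pushy_collections++;
        longjmp(trampoline, 1);
    }
}

/* Continuation of sum: add the saved n and pass the value on. */
static void sum_k(struct clos *self, long long v) {
    check_stack(sum_k, self, v);
    long long n = self->n;
    struct clos *next = self->next;
    if (self->on_heap) free(self);   /* used once, so safe in this sketch */
    next->code(next, v + n);
}

/* sum(n) = n + sum(n - 1), sum(0) = 0, written in CPS; never returns. */
static void sum(struct clos *k, long long n) {
    check_stack(sum, k, n);
    if (n == 0) {
        k->code(k, 0);
    } else {
        struct clos frame = { sum_k, n, k, 0 };  /* lives on the C stack */
        sum(&frame, n - 1);
    }
}

/* Initial continuation: record the answer and leave via the trampoline. */
static void halt_k(struct clos *self, long long v) {
    if (self->on_heap) free(self);
    result = v;
    longjmp(trampoline, 2);
}

long long pushy_sum(long long n) {
    char base;
    struct clos halt = { halt_k, 0, NULL, 0 };
    assert(n >= 0);
    stack_base = (uintptr_t)&base;
    resume_fn = sum; resume_env = &halt; resume_arg = n;
    pushy_collections = 0;
    switch (setjmp(trampoline)) {
    case 0:                          /* first entry */
    case 1:                          /* stack was reset: redo pending call */
        resume_fn(resume_env, resume_arg);
        abort();                     /* not reached */
    default:                         /* halt_k delivered the result */
        return result;
    }
}
```

Calling pushy_sum(100000) nests far more than 16K of frames, so the stack is
reset a number of times along the way. Changing STACK_BUFFER_BYTES trades
the frequency of those resets against how much of the stack stays in cache,
which is the tuning knob referred to above.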

The performance measurements reported here are similar to the results from
my hand-compiled Boyer benchmark, so I am gratified to see that my hand
compilation did not rely on optimizations unavailable to a normal compiler.

Thanks very much for your results.

Would it be possible to post the C code emitted for some simple functions
(e.g., factorial or Fibonacci)?

