Newsgroups: comp.lang.scheme
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!news2.near.net!MathWorks.Com!europa.eng.gtefsd.com!howland.reston.ans.net!math.ohio-state.edu!jussieu.fr!univ-lyon1.fr!swidir.switch.ch!newsfeed.ACO.net!Austria.EU.net!EU.net!sunic!umdac!fizban.solace.mh.se!news.ifm.liu.se!liuida!mikpe
From: mikpe@ida.liu.se (Mikael Pettersson)
Subject: CPS->C, pushy vs. dispatch-loop
Message-ID: <1994Sep7.132923.19702@ida.liu.se>
Sender: news@ida.liu.se
Organization: Department of Computer Science, University of Linköping
Date: Wed, 7 Sep 1994 13:29:23 GMT
Lines: 46

This data may be of interest to some implementors.

I have modified a compiler, which uses CPS internally and emits
reasonably portable ANSI-C code, to use the "pushy" CPS scheme
described by Henry G. Baker in Feb. '94 in some comp.* groups.
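
For readers who missed those posts, a rough sketch of the control flow
in the pushy scheme: each CPS function becomes a C function that calls
its successor directly and never returns, so the C stack only grows.
When it gets too deep, the live values are saved and a longjmp back to
a driver throws the stack away. Everything below is illustrative only:
the names (pushy_sum, run_pushy, DEPTH_LIMIT) are made up, a depth
counter stands in for a real stack-size check, and copying two longs
stands in for the garbage collection a real system would do.

```c
#include <assert.h>
#include <setjmp.h>

#define DEPTH_LIMIT 1000   /* assumed bound; a real system checks stack size */

static jmp_buf driver;            /* re-entry point below all pushy frames */
static long saved_n, saved_acc;   /* live values carried across the longjmp */
static long pushy_result;

/* Adds n + (n-1) + ... + 1 to acc.  Never returns normally. */
static void pushy_sum(long n, long acc, int depth)
{
    if (n == 0) {
        pushy_result = acc;
        longjmp(driver, 2);              /* finished: discard all frames */
    }
    if (depth >= DEPTH_LIMIT) {
        saved_n = n;
        saved_acc = acc;
        longjmp(driver, 1);              /* stack "full": flush and resume */
    }
    pushy_sum(n - 1, acc + n, depth + 1);   /* direct call with C parameters */
}

static long run_pushy(long n)
{
    assert(n >= 0);
    saved_n = n;
    saved_acc = 0;
    if (setjmp(driver) == 2)
        return pushy_result;
    /* Reached on first entry (0) and after every flush (1). */
    pushy_sum(saved_n, saved_acc, 0);
    return pushy_result;                 /* not reached */
}
```

Calling run_pushy(10) yields 55; larger arguments go through several
flush-and-resume cycles before finishing.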

Summary:
* Tests done on a single-user SPARCstation 5 with Solaris 2.3
  and GCC 2.5.7.
* The standard CPS->C mapping (STANDARD) works well. It uses global
  variables for parameter passing (and, in my case, a simulated stack
  for continuation records) and a dispatch-loop function for tailcalls.
* The portable "pushy" CPS->C code (PUSHY) is about 3 times _slower_
  than the STANDARD code. Although parameters are passed in registers
  instead of memory, and tailcalls go directly to their targets, the
  SPARC loses due to the enormous number of register window overflows.
  (The C compilers do not recognize that the register window is dead
  after the jump to g below:
	void f(...) { auto x; ... g(&x,...); return; }		)
* Using GCC's inline assembly and register allocation declarations to
  code up macros that do tailcalls without register window growth
  (PUSHY+GCC), we gain about a factor of 3-4.5 over PUSHY. PUSHY+GCC
  is typically 10-30% faster than STANDARD.
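
To make the STANDARD mapping concrete, here is a minimal sketch of a
dispatch loop with parameters passed in global variables. It is not the
compiler's actual output: all names (sum_step, run_standard, arg_n,
arg_acc) are hypothetical, and the simulated stack for continuation
records is left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* A code label: a function that runs one CPS step and returns the next
   label.  The struct wrapper exists only because a C function type
   cannot mention itself directly. */
struct label;
typedef struct label (*code_fn)(void);
struct label { code_fn fn; };

/* Global "registers" used for parameter passing (hypothetical names). */
static long arg_n, arg_acc, std_result;

static struct label make_label(code_fn fn)
{
    struct label l;
    l.fn = fn;
    return l;
}

/* Adds n + (n-1) + ... + 1 to the accumulator.  The tailcall is
   expressed by returning the target label instead of calling it. */
static struct label sum_step(void)
{
    if (arg_n == 0) {
        std_result = arg_acc;
        return make_label(NULL);         /* NULL label: halt */
    }
    arg_acc += arg_n;
    arg_n -= 1;
    return make_label(sum_step);
}

/* The dispatch loop: every step is called from here and returns here,
   so the C stack stays at constant depth. */
static long run_standard(long n)
{
    struct label next;
    assert(n >= 0);
    arg_n = n;
    arg_acc = 0;
    next = make_label(sum_step);
    while (next.fn != NULL)
        next = next.fn();
    return std_result;
}
```

Calling run_standard(10) yields 55. Every tailcall costs a return to
the loop plus an indirect call, and every parameter goes through
memory; that is the overhead the pushy scheme tries to avoid.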

I have not yet had the opportunity to run the tests on any
architecture other than the SPARC, so the question "do we use STANDARD
or PUSHY in our CPS->C compilers?" is still unanswered :-(

Test data:
The application is a specification for the dynamic semantics of a
normal-order (call-by-name) functional language. The compiled spec.
is an interpreter for this language. The test case was hard-coded
(no parsing at runtime): a program computing prime numbers using
the well-known sieve method.
Times below are in seconds. GC-overhead is negligible.
Non-user time is in the 0.15-0.30 seconds range for all timings.

			(# of primes to compute)
Version		10	20	30	40	50
-----------------------------------------------------
STANDARD	0.22	0.68	1.62	3.90	8.60
PUSHY		0.48	1.85	5.30	13.20	25.41
PUSHY+GCC	0.31	0.61	1.35	3.03	5.89
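
The ratios quoted in the summary can be recomputed from the table above
with a few lines of C. This is only a convenience sketch; the array and
function names are made up, and the numbers are copied from the table.

```c
#include <assert.h>

#define NCOLS 5   /* columns: 10, 20, 30, 40, 50 primes */

/* Timings in seconds, copied from the table. */
static const double t_standard[NCOLS]  = { 0.22, 0.68, 1.62,  3.90,  8.60 };
static const double t_pushy[NCOLS]     = { 0.48, 1.85, 5.30, 13.20, 25.41 };
static const double t_pushy_gcc[NCOLS] = { 0.31, 0.61, 1.35,  3.03,  5.89 };

/* How many times longer `slow` takes than `fast` in column i. */
static double slowdown(const double *slow, const double *fast, int i)
{
    assert(i >= 0 && i < NCOLS);
    return slow[i] / fast[i];
}
```

For the 50-prime column, slowdown(t_pushy, t_standard, 4) comes to
about 2.95 and slowdown(t_pushy, t_pushy_gcc, 4) to about 4.31. In the
10-prime column PUSHY+GCC is still slower than STANDARD (0.31 vs. 0.22).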
-- 
Mikael Pettersson, Dept of Comp & Info Sci, Linkoping University, Sweden
email: mpe@ida.liu.se or ...!{mcsun,munnari,uunet,unido,...}!sunic!liuida!mpe
