Newsgroups: comp.lang.scheme
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!gatech!howland.reston.ans.net!pipex!sunic!sunic.sunet.se!liuida!mikpe
From: mikpe@ida.liu.se (Mikael Pettersson)
Subject: Re: Mixing languages
X-Nntp-Posting-Host: sen14.ida.liu.se
Message-ID: <D8DG4o.Dy1@ida.liu.se>
Sender: news@ida.liu.se
Organization: Department of Computer Science, University of Linköping
References: <LORD.95May7224548@x1.cygnus.com> <3onfoa$nb6@nyheter.chalmers.se> <sjfz1Cy00bkM1YSeNs@andrew.cmu.edu>
Date: Wed, 10 May 1995 16:50:47 GMT
Lines: 100

In article <sjfz1Cy00bkM1YSeNs@andrew.cmu.edu> "Daniel C. Wang" <dw3u+@andrew.cmu.edu> writes:
>What I'd like to see is a performance comparison of a bytecoded VM based on a
>virtual RISC architecture which tries to use the hardware registers to
>their maximum potential, compared to native code. I'm sure that such a VM
>based on a virtual RISC machine would outperform a stack-based VM, and
>probably make interpreted code look like a much more viable solution when
>compared to compiled native code.

I don't believe a RISC-like bytecoded VM is going to be a win. Here's why:

RISC
====
Assume a 3-address RISC VM, with instructions like
	<op> reg_src_1, reg_src_2, reg_dst
	load <immediate>, reg_dst
	fetch [reg_addr], reg_dst
	store reg_src, [reg_addr]
	etc.
and a decent number of homogeneous registers, say 16, that
are to be in a one-to-one mapping with the real HW registers.

How does one interpret "add r1,r2,r3" for instance?

Alternative 1
-------------
Fetch the instruction opcode (add), and dispatch.
In the code for add, fetch the two argument register numbers from the
instruction stream, and dispatch on those to fetch the actual HW
register values. Compute the result, fetch the result register number,
and dispatch to find the correct HW register in which to store the result.
(If the VM-to-HW register dispatching code is inlined at every use, the
interpreter becomes quite bulky, stressing the I-cache; if it is called
as subroutines, we lose some speed to call overhead.)

To reduce the bulk of this code (the three VM-reg-to-HW-reg mappings),
one can combine the code for all "<op> r1,r2,r3" instructions
as follows: encode "add r1,r2,r3" as the 5 bytes <ARITH,ADD,r1,r2,r3>.
Fetching ARITH and dispatching takes us to code that first maps the
VM argument registers to their HW counterparts, then dispatches on
the sub-opcode (ADD, SUB, etc.) to compute a value, and finally
dispatches on the VM result register to store the value.
While this will reduce the size of the interpreter (nicer for the
I-cache), we have to add another dispatch for the sub-opcode (ADD, ..).

Alternative 2
-------------
To encode "add r1,r2,r3" we need something like 3*4 = 12 bits for the
registers, and a few bits for the opcode. Assume we encode all instructions
as 16-bit unsigned integers. Each and every one of those will have its
own specialized code in the interpreter. (That's 64K cases!) Now, the
code for, say, "add r1,r2,r3" can proceed at full speed since no
translations need to be done between VM register numbers and HW registers.
Unfortunately, the size of the interpreter is going to be absolutely
monstrous. (Bad for the I-cache.)


STACK/CISC
==========
A stack-based VM, with its mainly implicit addressing modes
(NOT always negates the value in the accumulator (or top stack slot),
ADD always adds the accumulator and top-of-stack, pops the stack,
and places the result in the accumulator, and so on)
is much simpler to interpret. There are many fewer possible instructions,
and many instructions require no internal decoding of arguments.
So the interpreter is quite lean. (Good for the I-cache.) Furthermore,
the active part of the VM stack is likely to reside in the data cache,
since it is accessed so frequently.

There are at least two standard tricks to improve a CISCy VM:
1. Some instructions do take arguments in the instruction stream.
   For instance, accessing the environment might be encoded
   as the 3 bytes <OP_ENV_REF, frame#, offset>. Even if this
   doesn't require _dispatching_ in the code for OP_ENV_REF, we still
   have to fetch the bytes for the frame# and offset.
   By analyzing the typical run-time patterns, we may discover that
   most uses of OP_ENV_REF use frame numbers and offsets in some
   small interval, say 0 <= frame# <= 2, and 0 <= offset <= 4.
   We can now add specialized instructions OP_ENV_REF_<frame#>_<offset>
   to optimize the execution of the most common cases.

2. Some combinations of 2-3 instructions may be quite common.
   Analogously with the previous case, we can introduce new instructions
   for these sequences. This increases the amount of work done per
   instruction, or equivalently, reduces the overhead of the
   fetch/dispatch cycle.


In summary: I believe a stack-like VM wins because it allows us to
make the instructions simple (and fast). I believe a CISCy VM wins
because it has a better work/"interpretive overhead" ratio.


(You could combine a bytecoded VM with on-the-fly macro expansion
to native code, and a cache of such code blocks for the most
heavily used VM procedures, to boost performance. I understand
some hardware emulators use such techniques.)
-- 
Mikael Pettersson                                | Email: mpe@ida.liu.se
Department of Computer and Information Science   | Phone: +46 13282683
Linkoping University, S-581 83 Linkoping, SWEDEN | Fax  : +46 13282666
