The new 6502 family implementation

Introduction

The new 6502 family implementation has been created to reach sub-instruction accuracy in observable behaviour. It is designed with 3 goals in mind:

  • every bus cycle must happen at the exact time it would happen in a real CPU, and every access the real CPU does is done

  • instructions can be interrupted at any time in the middle then restarted at that point transparently

  • instructions can be interrupted even from within a memory handler for bus contention/wait states emulation purposes

Point 1 has been ensured through bisimulation with the gate-level simulation perfect6502. Point 2 has been ensured structurally through a code generator which will be explained in section 8. Point 3 is not done yet due to lack of support on the memory subsystem side, but section 9 shows how it will be handled.

The 6502 family

The MOS 6502 family has been large and productive. A large number of variants exist, varying on bus sizes, I/O, and even opcodes. Some offshoots (g65c816, hu6280) even exist that live elsewhere in the mame tree. The final class hierarchy is this:

                          6502
                           |
        +------+--------+--+--+-------+-------+
        |      |        |     |       |       |
      6510   deco16   6504   6509   rp2a03  65c02
        |                                     |
  +-----+-----+                            r65c02
  |     |     |                               |
6510t  7501  8502                         +---+---+
                                          |       |
                                       65ce02   65sc02
                                          |
                                        4510

The 6510 adds an up to 8 bits I/O port, with the 6510t, 7501 and 8502 being software-identical variants with different pin count (hence I/O count), die process (NMOS, HNMOS, etc.) and clock support.

The deco16 is a Deco variant with a small number of not really understood additional instructions and some I/O.

The 6504 is a pin and address-bus reduced version.

The 6509 adds internal support for paging.

The rp2a03 is the NES variant with the D flag disabled and sound functionality integrated.

The 65c02 is the very first cmos variant with some additional instructions, some fixes, and most of the undocumented instructions turned into nops. The R (Rockwell, but eventually produced by WDC too among others) variant adds a number of bitwise instructions and also stp and wai. The SC variant, used by the Lynx portable console, looks identical to the R variant. The 'S' probably indicates a static-ram-cell process allowing full DC-to-max clock control.

The 65ce02 is the final evolution of the ISA in this hierarchy, with additional instructions, registers, and removals of a lot of dummy accesses that slowed the original 6502 down by at least 25%. The 4510 is a 65ce02 with integrated MMU and GPIO support.

Usage of the classes

All the CPUs are standard modern CPU devices, with all the normal interaction with the device infrastructure. To include one of these CPUs in your driver you need to include "CPU/m6502/<CPU>.h" and then do a MCFG_CPU_ADD("tag", <CPU>, clock).

6510 variants port I/O callbacks are setup through:

MCFG_<CPU>_PORT_CALLBACKS(READ8(type, read_method), WRITE8(type, write_method))

And the pullup and floating lines mask is given through:

MCFG_<CPU>_PORT_PULLS(pullups, floating)

In order to see all bus accesses on the memory handlers it is possible to disable accesses through the direct map (at a CPU cost, of course) with:

MCFG_M6502_DISABLE_DIRECT()

In that case, transparent decryption support is also disabled, everything goes through normal memory-map read/write calls. The state of the sync line is given by the CPU method get_sync(), making implementing the decryption in the handler possible.

Also, as for every executable device, the CPU method total_cycles() gives the current time in cycles since the start of the machine from the point of view of the CPU. Or, in other words, what is usually called the cycle number for the CPU when somebody talks about bus contention or wait states. The call is designed to be fast (no system-wide sync, no call to machine.time()) and is precise. Cycle number for every access is exact at the sub-instruction level.

The 4510 special nomap line is accessible through get_nomap().

Other than these specifics, these are perfectly normal CPU classes.

General structure of the emulations

Each variant is emulated through up to 4 files:

  • <CPU>.h = header for the CPU class

  • <CPU>.c = implementation of most of the CPU class

  • d<CPU>.lst = dispatch table for the CPU

  • o<CPU>.lst = opcode implementation code for the CPU

The last two are optional. They're used to generate a <CPU>.inc file in the object directory which is included by the .c file.

At a minimum, the class must include a constructor and an enum picking up the correct input line ids. See m65sc02 for a minimalist example. The header can also include specific configuration macros (see m8502) and also the class can include specific memory accessors (more on these later, simple example in m6504).

If the CPU has its own dispatch table, the class must also include the declaration (but not definition) of disasm_entries, do_exec_full and do_exec_partial, the declaration and definition of disasm_disassemble (identical for all classes but refers to the class-specific disasm_entries array) and include the .inc file (which provides the missing definitions). Support for the generation must also be added to CPU.mak.

If the CPU has in addition its own opcodes, their declaration must be done through a macro, see f.i. m65c02. The .inc file will provide the definitions.

Dispatch tables

Each d<CPU>.lst is the dispatch table for the CPU. Lines starting with '#' are comments. The file must include 257 entries, the first 256 being opcodes and the 257th what the CPU should do on reset. In the 6502 irq and nmi actually magically call the "brk" opcode, hence the lack of specific description for them.

Entries 0 to 255, i.e. the opcodes, must have one of these two structures:

  • opcode_addressing-mode

  • opcode_middle_addressing-mode

Opcode is traditionally a three-character value. Addressing mode must be a 3-letter value corresponding to one of the DASM_* macros in m6502.h. Opcode and addressing mode are used to generate the disassembly table. The full entry text is used in the opcode description file and the dispatching methods, allowing for per-CPU variants for identical-looking opcodes.

An entry of "." was usable for unimplemented/unknown opcodes, generating "???" in the disassembly, but is not a good idea at this point since it will infloop in execute() if encountered.

Opcode descriptions

Each o<CPU>.lst file includes the CPU-specific opcodes descriptions. An opcode description is a series of lines starting by an opcode entry by itself and followed by a series of indented lines with code executing the opcode.

For instance the asl <absolute address> opcode looks like this:

asl_aba
TMP = read_pc();
TMP = set_h(TMP, read_pc());
TMP2 = read(TMP);
write(TMP, TMP2);
TMP2 = do_asl(TMP2);
write(TMP, TMP2);
prefetch();

First the low part of the address is read, then the high part (read_pc is auto-incrementing). Then, now that the address is available the value to shift is read, then re-written (yes, the 6502 does that), shifted then the final result is written (do_asl takes care of the flags). The instruction finishes with a prefetch of the next instruction, as all non-CPU-crashing instructions do.

Available bus-accessing functions are:

read(adr)

standard read

read_direct(adr)

read from program space

read_pc()

read at the PC address and increment it

read_pc_noinc()

read at the PC address

read_9()

6509 indexed-y banked read

write(adr, val)

standard write

prefetch()

instruction prefetch

prefetch_noirq()

instruction prefetch without irq check

Cycle counting is done by the code generator which detects (through string matching) the accesses and generates the appropriate code. In addition to the bus-accessing functions a special line can be used to wait for the next event (irq or whatever). "eat-all-cycles;" on a line will do that wait then continue. It is used by wai_imp and stp_imp for the m65c02.

Due to the constraints of the code generation, some rules have to be followed:

  • in general, stay with one instruction/expression per line

  • there must be no side effects in the parameters of a bus-accessing function

  • local variables lifetime must not go past a bus access. In general, it's better to leave them to helper methods (like do_asl) which do not do bus accesses. Note that "TMP" and "TMP2" are not local variables, they're variables of the class.

  • single-line then or else constructs must have braces around them if they're calling a bus-accessing function

The per-opcode generated code are methods of the CPU class. As such they have complete access to other methods of the class, variables of the class, everything.

Memory interface

For better opcode reuse with the MMU/banking variants, a memory access subclass has been created. It's called memory_interface, declared in m6502_device, and provides the following accessors:

UINT8 read(UINT16 adr)

normal read

UINT8 read_sync(UINT16 adr)

opcode read with sync active (first byte of opcode)

UINT8 read_arg(UINT16 adr)

opcode read with sync inactive (rest of opcode)

void write(UINT16 adr, UINT8 val)

normal write

UINT8 read_9(UINT16 adr)

special y-indexed 6509 read, defaults to read()

void write_9(UINT16 adr, UINT8 val);

special y-indexed 6509 write, defaults to write()

Two implementations are given by default, one usual, mi_default_normal, one disabling direct access, mi_default_nd. A CPU that wants its own interface (see 6504 or 6509 for instance) must override device_start, intialize mintf there then call init().

The generated code

A code generator is used to support interrupting and restarting an instruction in the middle. This is done through a two-level state machine with updates only at the boundaries. More precisely, inst_state tells you which main state you're in. It's equal to the opcode byte when 0-255, and 0xff00 means reset. It's always valid and used by instructions like rmb. inst_substate indicates at which step we are in an instruction, but it set only when an instruction has been interrupted. Let's go back to the asl <abs> code:


asl_aba
TMP = read_pc();
TMP = set_h(TMP, read_pc());
TMP2 = read(TMP);
write(TMP, TMP2);
TMP2 = do_asl(TMP2);
write(TMP, TMP2);
prefetch();

The complete generated code is:

void m6502_device::asl_aba_partial()
{
switch(inst_substate) {
case 0:
if(icount == 0) { inst_substate = 1; return; }
case 1:
TMP = read_pc();
icount--;
if(icount == 0) { inst_substate = 2; return; }
case 2:
TMP = set_h(TMP, read_pc());
icount--;
if(icount == 0) { inst_substate = 3; return; }
case 3:
TMP2 = read(TMP);
icount--;
if(icount == 0) { inst_substate = 4; return; }
case 4:
write(TMP, TMP2);
icount--;
TMP2 = do_asl(TMP2);
if(icount == 0) { inst_substate = 5; return; }
case 5:
write(TMP, TMP2);
icount--;
if(icount == 0) { inst_substate = 6; return; }
case 6:
prefetch();
icount--;
}
inst_substate = 0;
}

One can see that the initial switch() restarts the instruction at the appropriate substate, that icount is updated after each access, and upon reaching 0 the instruction is interrupted and the substate updated. Since most instructions are started from the beginning a specific variant is generated for when inst_substate is known to be 0:


void m6502_device::asl_aba_full()
{
if(icount == 0) { inst_substate = 1; return; }
TMP = read_pc();
icount--;
if(icount == 0) { inst_substate = 2; return; }
TMP = set_h(TMP, read_pc());
icount--;
if(icount == 0) { inst_substate = 3; return; }
TMP2 = read(TMP);
icount--;
if(icount == 0) { inst_substate = 4; return; }
write(TMP, TMP2);
icount--;
TMP2 = do_asl(TMP2);
if(icount == 0) { inst_substate = 5; return; }
write(TMP, TMP2);
icount--;
if(icount == 0) { inst_substate = 6; return; }
prefetch();
icount--;
}

That variant removes the switch, avoiding a costly computed branch and also an inst_substate write. There is in addition a fair chance that the decrement-test with zero pair is compiled into something efficient.

All these opcode functions are called through two virtual methods, do_exec_full and do_exec_partial, which are generated into a 257-entry switch statement. Pointers-to-methods being expensive to call, a virtual function implementing a switch has a fair chance of being better.

The execute main call ends up very simple:

void m6502_device::execute_run()
{
if(inst_substate)
do_exec_partial();

while(icount > 0) {
if(inst_state < 0x100) {
PPC = NPC;
inst_state = IR;
if(machine().debug_flags & DEBUG_FLAG_ENABLED)
debugger_instruction_hook(this, NPC);
}
do_exec_full();
}
}

If an instruction was partially executed finish it (icount will then be zero if it still doesn't finish). Then try to run complete instructions. The NPC/IR dance is due to the fact that the 6502 does instruction prefetching, so the instruction PC and opcode come from the prefetch results.

Future bus contention/delay slot support

Supporting bus contention and delay slots in the context of the code generator only requires being able to abort a bus access when not enough cycles are available into icount, and restart it when cycles have become available again. The implementation plan is to:

  • Have a delay() method on the CPU that removes cycles from icount. If icount becomes zero or less, having it throw a suspend() exception.

  • Change the code generator to generate this:

void m6502_device::asl_aba_partial()
{
switch(inst_substate) {
case 0:
if(icount == 0) { inst_substate = 1; return; }
case 1:
try {
TMP = read_pc();
} catch(suspend) { inst_substate = 1; return; }
icount--;
if(icount == 0) { inst_substate = 2; return; }
case 2:
try {
TMP = set_h(TMP, read_pc());
} catch(suspend) { inst_substate = 2; return; }
icount--;
if(icount == 0) { inst_substate = 3; return; }
case 3:
try {
TMP2 = read(TMP);
} catch(suspend) { inst_substate = 3; return; }
icount--;
if(icount == 0) { inst_substate = 4; return; }
case 4:
try {
write(TMP, TMP2);
} catch(suspend) { inst_substate = 4; return; }
icount--;
TMP2 = do_asl(TMP2);
if(icount == 0) { inst_substate = 5; return; }
case 5:
try {
write(TMP, TMP2);
} catch(suspend) { inst_substate = 5; return; }
icount--;
if(icount == 0) { inst_substate = 6; return; }
case 6:
try {
prefetch();
} catch(suspend) { inst_substate = 6; return; }
icount--;
}
inst_substate = 0;
}

A modern try/catch costs nothing if an exception is not thrown. Using this the control will go back to the main loop, which will then look like this:

void m6502_device::execute_run()
{
if(waiting_cycles) {
icount -= waiting_cycles;
waiting_cycles = 0;
}

if(icount > 0 && inst_substate)
do_exec_partial();

while(icount > 0) {
if(inst_state < 0x100) {
PPC = NPC;
inst_state = IR;
if(machine().debug_flags & DEBUG_FLAG_ENABLED)
debugger_instruction_hook(this, NPC);
}
do_exec_full();
}

waiting_cycles = -icount;
icount = 0;
}

A negative icount means that the CPU won't be able to do anything for some time in the future, because it's either waiting for the bus to be free or for a peripheral to answer. These cycles will be counted until elapsed and then normal processing will go on. It's important to note that the exception path only happens when the contention/wait state goes further than the scheduling slice of the CPU. That should not usually be the case, so the cost should be minimal.

Multi-dispatch variants

Some variants currently in the process of being supported change instruction set depending on an internal flag, either switching to a 16-bits mode or changing some register accesses to memory accesses. This is handled by having multiple dispatch tables for the CPU, the d<CPU>.lst not being 257 entries anymore but 256*n+1. The variable inst_state_base must select which instruction table to use at a given time. It must be a multiple of 256, and is in fact simply OR-ed to the first instruction byte to get the dispatch table index (aka inst_state).

Current TO-DO:

  • Implement the bus contention/wait states stuff, but that requires support on the memory map side first.

  • Integrate the I/O subsystems in the 4510

  • Possibly integrate the sound subsytem in the rp2a03

  • Add decent hookups for the Apple 3 madness