Playstation Emulation Guide: Lionel Flandrin July 24, 2019
Contents
1 Introduction
1.1 Isn’t emulation complicated?
1.2 Feedback
2.38 Regions
2.39 SH instruction
2.40 SPU registers
2.41 JAL instruction
2.42 ANDI instruction
2.43 SB instruction
2.44 Expansion 2
2.45 JR instruction
2.46 LB instruction
2.47 BEQ instruction
2.48 Expansion 1
2.49 RAM byte access
2.50 MFC0 instruction
2.51 AND instruction
2.52 ADD instruction
2.53 Interrupt Control registers
2.54 BGTZ instruction
2.55 BLEZ instruction
2.56 LBU instruction
2.57 JALR instruction
2.58 BLTZ, BLTZAL, BGEZ and BGEZAL instructions
2.59 SLTI instruction
2.60 SUBU instruction
2.61 SRA instruction
2.62 DIV instruction
2.63 MFLO instruction
2.64 SRL instruction
2.65 SLTIU instruction
2.66 DIVU instruction
2.67 MFHI instruction
2.68 SLT instruction
2.69 Interrupt Control read
2.70 Timer registers
2.71 Exceptions
2.72 SYSCALL instruction
2.73 MTLO instruction
2.74 MTHI instruction
2.75 RFE instruction
2.76 Exceptions and branch delay slots
2.77 ADD and ADDI overflows
2.78 Store and load alignment exceptions
2.79 PC alignment exception
2.80 RAM 16bit store
2.81 DMA registers
2.82 LHU instruction
2.83 SLLV instruction
2.84 LH instruction
2.85 NOR instruction
2.86 SRAV instruction
2.87 SRLV instruction
2.88 MULTU instruction
2.89 GPU registers
2.89.1 GP0: Draw Mode Setting command
2.90 Interrupt Control 16bit access
2.91 Timer registers 32bit access
2.92 GPUSTAT “DMA ready” field
2.93 XOR instruction
2.94 BREAK instructions
2.95 MULT instruction
2.96 SUB instruction
2.97 XORI instruction
2.98 Cop1, cop2 and cop3 opcodes
2.99 Non-aligned reads
2.99.1 LWL instruction
2.99.2 LWR instruction
2.100 Non-aligned writes
2.100.1 SWL instruction
2.100.2 SWR instruction
2.101 Coprocessor loads and stores
2.101.1 LWCn instructions
2.101.2 SWCn instructions
2.102 Illegal instructions
4.15 GP1 Display Range commands
4.16 GP0 Monochrome Quadrilateral command
4.17 Interleaved video deadlock workaround
4.18 GP0 Clear Cache command
4.19 GP0 Load Image command
4.20 DMA image transfer
4.21 GP1 Display Enable command
4.22 GP0 Image Store command
4.23 GP0 Shaded Quadrilateral command
4.24 GP0 Shaded Triangle command
4.25 GP0 Textured Quadrilateral With Color Blending command
4.26 GP1 Acknowledge Interrupt command
4.27 GP1 Reset Command Buffer command
5 The GPU: Basic OpenGL renderer for the boot logo
5.1 Window and OpenGL context creation
5.2 Drawing the primitives
5.3 The vertex shader
5.4 The fragment shader
5.5 Compiling and linking the shaders
5.6 Vertex array objects
5.7 OpenGL rendering and synchronization
5.8 OpenGL debugging
5.9 Drawing quadrilaterals
5.10 Draw Offset emulation
5.11 Handling SDL2 events and exiting cleanly
1 Introduction
This is my attempt at documenting my implementation of a PlayStation emulator
from scratch. I’ll write the document as I go and I’ll try to explain as much as
possible along the way. You can find the complete source of the emulator itself
in my GitHub repository.
Since my favourite pastime is to reinvent the wheel and recode things that
already exist, I decided that this time I might as well document it. This way
maybe something useful will come out of it and it’ll give me the motivation
to finish it.
I will be using the Rust programming language but this is not meant as a
Rust tutorial and knowledge of the language shouldn’t be necessary to follow
this guide, although it won’t hurt.
1.2 Feedback
If some part of this document is unclear, poorly written or incomplete please
submit an issue so that I can fix or complete it. Corrections for grammar, syntax
and typos are very welcome. Thank you!
Ready? Let’s begin!
it directly accesses the system bus. Basically we’re going to implement a Von
Neumann architecture. As we make progress we’ll have to revisit this design to
add the missing bits when they are needed.
The objective of this section is to implement all the instructions and try to
reach the part of the BIOS where it starts to draw on the screen. As we’ll see
there’s a bunch of boring initialization code to run before we get there.
There are 67 opcodes in the PlayStation MIPS CPU. Some take one line to
implement, others will give us more trouble. In order to make the process more
interactive and less tedious we’ll implement them as they’re encountered while
we’re running the original BIOS code. This way we’ll immediately be able to see
our emulator in action.
But first things first, before we start implementing instructions we need to
explain how a CPU works.
2.2 Architecture
A simple Von Neumann architecture looks like this: the CPU only sees a flat
address space: an array of bytes. The PlayStation uses 32bit addresses so the
CPU sees 1 << 32 addresses. In other words it can address 4GB of memory.
That’s why the PlayStation is said to be a 32bit console (that and the fact that
it uses 32bit registers in the CPU as we’ll see in a minute).
This address space contains all the external resources the CPU can access:
the RAM of course but also the various peripherals (GPU, controllers, CD drive,
BIOS...). That’s called memory mapped IO. Note that in this context “memory”
doesn’t mean RAM. Rather it means that you access peripherals as if they were
memory (instead of using dedicated instructions for instance). From the point
of view of the CPU, everything is just a big array of bytes and it doesn’t really
know what’s out there.
Of course we’ll have to figure out how the devices and RAM are mapped in
this address space to make sure the transactions end up at the right location
when the CPU starts reading and writing to the bus. But first we need to
understand how the code is executed.
The Program Counter (henceforth referred to as PC) is one of the most
elementary registers. It exists in one form or another on basically all computer
architectures, although it goes by various names: on x86 for instance it’s called
the Instruction Pointer (IP). Its job is simply to hold the address of the next
instruction to be run.
As we’ve seen, the PlayStation uses 32bit addresses, so the PC register is
32bit wide (as are all other CPU registers for that matter).
A typical CPU execution cycle goes roughly like this:
pub fn run_next_instruction(&mut self) {
    let pc = self.pc;

    // Fetch instruction at PC
    let instruction = self.load32(pc);

    // Increment PC to point to the next instruction.
    self.pc = pc.wrapping_add(4);

    // Decode the instruction we just fetched and run it
    self.decode_and_execute(instruction);
}
In Rust wrapping_add means that we want the PC to wrap back to 0 in case
of an overflow (i.e. 0xfffffffc + 4 => 0x00000000). We’ll see that most CPU
operations wrap on overflow (although some instructions catch those overflows
and generate an exception, we’ll see that later).
If you’re coding in C you don’t need to worry about that if you use uint32_t
since the C standard mandates that unsigned overflow wraps around in this
fashion. Rust however treats an unchecked overflow as a bug and will panic
in debug builds when it detects one (release builds wrap silently by default),
that’s why I need to write pc.wrapping_add(4) instead of pc + 4.
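To make the difference concrete, here is a minimal sketch. The helper names next_pc and next_pc_checked are mine and are not part of the emulator. They only illustrate the two standard library methods: wrapping_add silently wraps around, while checked_add reports the overflow by returning None.

```rust
/// Hypothetical helper: address of the instruction following `pc`.
/// Instructions are 4 bytes long and the PC wraps around at the end
/// of the 32bit address space.
fn next_pc(pc: u32) -> u32 {
    pc.wrapping_add(4)
}

/// Same computation, but reporting the overflow instead of wrapping.
/// `checked_add` returns `None` when the sum doesn't fit in a u32.
fn next_pc_checked(pc: u32) -> Option<u32> {
    pc.checked_add(4)
}
```

With these definitions next_pc(0xfffffffc) gives 0 whereas next_pc_checked(0xfffffffc) gives None. The CPU behaves like the first one.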
We now finally have some code but it doesn’t build yet.
We’re still missing 3 pieces of the puzzle before we can run this piece of code:
// ...
}
KUSEG       KSEG0       KSEG1       Length  Description
0x00000000  0x80000000  0xa0000000  2048K   Main RAM
0x1f000000  0x9f000000  0xbf000000  8192K   Expansion Region 1
0x1f800000  0x9f800000  0xbf800000  1K      Scratchpad
0x1f801000  0x9f801000  0xbf801000  8K      Hardware registers
0x1fc00000  0x9fc00000  0xbfc00000  512K    BIOS ROM
Table 2 shows the last region: KSEG2. It’s a bit different from the others.
It doesn’t mirror the other regions, instead it gives access to a unique set of
registers. As far as I know the only important register there is the cache control
but there might be others I haven’t encountered yet.
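The mirroring between the KUSEG, KSEG0 and KSEG1 columns in the table above can be expressed as a simple bit mask. The following is only a sketch: the function name mask_region is an assumption of mine, and it leaves KSEG2 addresses untouched since that region is not a mirror.

```rust
/// Hypothetical helper: strip the region bits from a CPU address so
/// that the KUSEG, KSEG0 and KSEG1 mirrors collapse to one address.
fn mask_region(addr: u32) -> u32 {
    // The three top bits of the address select the region
    match addr >> 29 {
        // KUSEG: 0x00000000 - 0x7fffffff, nothing to strip
        0..=3 => addr,
        // KSEG0: 0x80000000 - 0x9fffffff
        4 => addr & 0x7fff_ffff,
        // KSEG1: 0xa0000000 - 0xbfffffff
        5 => addr & 0x1fff_ffff,
        // KSEG2: 0xc0000000 - 0xffffffff, not a mirror
        _ => addr,
    }
}
```

For instance the three BIOS ROM addresses in the table (0x1fc00000, 0x9fc00000 and 0xbfc00000) all map to 0x1fc00000.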
play CDs. As a player that’s probably the only time you’d know there was a
BIOS running.
But that’s just the tip of the iceberg! The BIOS remains loaded at all times
and provides a Basic Input/Output System to the running game. That means
that the game can call into the BIOS to do things like allocating memory, reading
the memory card, common libc functions (qsort, memset...) and many other
things.
We won’t be implementing the BIOS ourselves. It’s possible (and it’s been
done) but that’s a lot of work and probably something you’d want to do once
you have a working emulator. It might also hurt compatibility since many games
are known to patch the BIOS at runtime. The Nocash specs have more info.
We could dump the BIOS of a console but that requires access to the actual
hardware and the know-how to access the BIOS memory. Fortunately some nice
people have done it for us and these days it’s easy to find BIOS files on the web.
There are many BIOS versions: they change depending on the region, the
hardware revision and patches. Any good dump should work (after all, they all
do more or less the same thing) but if you’re following this guide it’s probably
better that we use the same file.
Algorithm  Hash
MD5        924e392ed05558ffdb115408c263dccf
SHA-1      10155d8d6e6e832d6ea66db9bc098321fb5e8ebf
I’ve decided to go for the version named SCPH1001.BIN. The file should be
exactly 512KB big. Check table 3 to make sure you got the right one.
impl Bios {
    // ...
        // Load the BIOS
        try!(file.take(BIOS_SIZE).read_to_end(&mut data));
        // ...
        }
    }
}
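As a rough, self-contained illustration of the loading step, here is a sketch written against the current standard library. It is not the guide’s exact code: it uses std::fs::read and the ? operator instead of try! with take and read_to_end, and the len method is a hypothetical addition that only exists to make the example easy to check.

```rust
use std::fs;
use std::io::{Error, ErrorKind, Result};
use std::path::Path;

/// BIOS images are expected to be exactly 512KB long.
const BIOS_SIZE: usize = 512 * 1024;

/// Stand-in for the guide's `Bios` type
pub struct Bios {
    /// Raw contents of the BIOS image
    data: Vec<u8>,
}

impl Bios {
    /// Read the whole file at `path` and reject it if its size is
    /// not exactly `BIOS_SIZE` bytes.
    pub fn new(path: &Path) -> Result<Bios> {
        let data = fs::read(path)?;

        if data.len() == BIOS_SIZE {
            Ok(Bios { data })
        } else {
            Err(Error::new(ErrorKind::InvalidInput, "Invalid BIOS size"))
        }
    }

    /// Number of bytes loaded (hypothetical helper)
    pub fn len(&self) -> usize {
        self.data.len()
    }
}
```

Checking the size up front means the rest of the emulator can index into the BIOS data without worrying about truncated files.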
We also need to be able to read data from the BIOS. The CPU wants to
read 32 bits of data to load the instructions so let’s start by implementing load32:
impl Bios {
    // ...
    /// Fetch the 32bit little endian word at `offset`
    pub fn load32(&self, offset: u32) -> u32 {
        let offset = offset as usize;
        // ...
A few things to note: offset, as its name implies, is not the absolute address
used by the CPU, it’s just the offset in the BIOS memory range. Remember
that the BIOS is mapped in multiple regions so we’ll handle that in the generic
interconnect code. Each peripheral will just handle offsets in its address range.
In the comment I mention that we read the word in little endian. That’s
important. If you’ve never had to worry about endianness issues before let me
give you the gist.
The basic unit of memory is a byte (8 bits in our case). You cannot address
anything smaller than that. However sometimes you need to store data over
multiple bytes. For instance we’ve seen that our instructions are 4 bytes long. We
have multiple ways to store 4-byte words in our “array of bytes”.
Let’s take an example: you have the 32bit word 0x12345678. You have
multiple ways to store that value in 4 consecutive bytes. We can store [0x12,
0x34, 0x56, 0x78] or [0x78, 0x56, 0x34, 0x12] for instance. The former is called
big-endian because we store the most significant byte first. The latter is
little-endian because we store the least significant byte first. There are other endian
types with weirder patterns but they’re not often used in modern computers.
Check Wikipedia if you want more details.
The PlayStation is little-endian so we’re in the 2nd case: when reading or
writing multi-byte values the least significant byte goes first. If we do it the
other way around we’ll end up with garbage.
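Here is a small sketch of what a little-endian 32bit read looks like in practice. The free-standing function load32_le is my own illustration and works on a plain byte slice. It is not necessarily how the Bios method is written.

```rust
/// Hypothetical free-standing version of `load32`: assemble a 32bit
/// little-endian word from the four bytes starting at `offset`.
fn load32_le(data: &[u8], offset: usize) -> u32 {
    let b0 = data[offset] as u32;
    let b1 = data[offset + 1] as u32;
    let b2 = data[offset + 2] as u32;
    let b3 = data[offset + 3] as u32;

    // The first byte in memory is the least significant one
    b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)
}
```

Reading the bytes [0x78, 0x56, 0x34, 0x12] this way yields 0x12345678. Reading [0x12, 0x34, 0x56, 0x78] with the same function yields 0x78563412, which is the kind of garbage you get when the byte order is wrong.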
Now we can implement our interconnect to let the CPU communicate with
the BIOS.
able to dispatch the CPU’s loads and stores to the correct peripheral depending
on the address range.
I’m not quite sure how this is handled on the actual hardware. For simple
buses it’s very possible that the CPU just “broadcasts” the address to all the
peripherals and each of them just checks if it’s within their address range and
simply ignores the transaction if they see it’s not for them. It’s fast in hardware
because all peripherals work in parallel so there’s no delay induced: they can all
receive and decode the address at the same moment.
Unfortunately we can’t really do that in software: the closest equivalent
would be to spawn a thread for each peripheral. The problem is that memory
transactions are very common (several millions per second potentially) and having
to send data and resynchronize across threads would kill our performance.
Multithreading emulators in general is a very tough issue: for threading to be
really efficient you need to reduce data exchange and resynchronization as much
as possible to let each thread live its life. When we emulate however we want to
mimic the original hardware behaviour and speed as much as possible which
requires very frequent resynchronization and we have plenty of shared state.
The two endeavors are somewhat at odds. That’s not to say multithreading
is impossible in emulators, just that it’s hard. We can’t just spawn threads
willy-nilly.
Anyway, back to our interconnect: since threads are out it means we’ll have
to sequentially match the address against each mapping until we get a match.
Then we can let the selected peripheral handle the transaction.
Let’s do just that:
/// Global interconnect
pub struct Interconnect {
    /// Basic Input/Output memory
    bios: Bios,
}

impl Interconnect {
    pub fn new(bios: Bios) -> Interconnect {
        Interconnect {
            bios: bios,
        }
    }
}
I’ve decided to store the BIOS directly in the interconnect struct. We’ll
append the other peripherals there as we implement them. We are going to store
the interconnect inside the struct Cpu which will give us a device tree with the
CPU at the top. It makes the data paths pretty simple: everything goes from
the CPU to the peripherals. It’s easier to reason about than a full “everybody
sees everybody” architecture in my opinion but it might prove limiting as we
progress. We’ll see if we need to revise that later.
Now we can finally implement the load32 function that the CPU will be
using. I don’t like having hardcoded constants all over the place so I’m going to
tie the address ranges to nice symbolic names:
mod map {
    struct Range(u32, u32);

    impl Range {
        /// Return `Some(offset)` if addr is contained in `self`
        pub fn contains(self, addr: u32) -> Option<u32> {
            let Range(start, length) = self;
            // ...
If you’re not familiar with Rust, what this does is create a new type Range
which is a tuple of two values: the start address and length of the mapping.
I also declare a contains method which takes an address and returns
Some(offset) if the address is within the range, None otherwise. You can think
of it as a form of multiple return values with some nice type-safety on top.
Finally I declare our first range for the BIOS.
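For readers who want to experiment, here is a self-contained sketch of such a range type. The body of contains is my own guess at a reasonable implementation (it uses wrapping_sub so that addresses below the start are rejected without a separate comparison). The BIOS constant assumes the KSEG1 mapping at 0xbfc00000 with a length of 512KB, as listed in the memory map.

```rust
/// Address range described by its start address and its length in bytes
#[derive(Clone, Copy)]
pub struct Range(u32, u32);

impl Range {
    /// Return `Some(offset)` if `addr` is inside the range, `None` otherwise
    pub fn contains(self, addr: u32) -> Option<u32> {
        let Range(start, length) = self;

        // If `addr` is below `start` the subtraction wraps around to a
        // huge value which then fails the length test below.
        let offset = addr.wrapping_sub(start);

        if offset < length {
            Some(offset)
        } else {
            None
        }
    }
}

/// The BIOS as seen through KSEG1: 512KB starting at 0xbfc00000
pub const BIOS: Range = Range(0xbfc0_0000, 512 * 1024);
```

With this definition BIOS.contains(0xbfc00004) returns Some(4) while BIOS.contains(0xbfc80000), the first address past the end of the ROM, returns None.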
Now for the load32 function:
impl Interconnect {
    // ...
        panic!("unhandled fetch32 at address {:08x}", addr);
    }
}
The if let syntax is another Rust nicety: if the contains function returns
Some(offset) we enter the body of the if with offset bound to a temporary
variable. If contains returns None on the other hand the pattern doesn’t match,
so we skip the body and go straight to the panic! macro which will make
our emulator crash.
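The sequential dispatch described above can be sketched in isolation as follows. Everything here is a simplified stand-in: the BIOS is a bare Vec<u8> rather than the real Bios struct, and Range is redefined locally so that the example compiles on its own.

```rust
/// Minimal stand-in for the range type: (start address, length)
struct Range(u32, u32);

impl Range {
    fn contains(&self, addr: u32) -> Option<u32> {
        let offset = addr.wrapping_sub(self.0);
        if offset < self.1 { Some(offset) } else { None }
    }
}

const BIOS: Range = Range(0xbfc0_0000, 512 * 1024);

struct Interconnect {
    /// Raw BIOS bytes, standing in for the real `Bios` struct
    bios: Vec<u8>,
}

impl Interconnect {
    /// Load the 32bit word at `addr`, trying each mapping in turn
    fn load32(&self, addr: u32) -> u32 {
        if let Some(offset) = BIOS.contains(addr) {
            // Matched: `offset` is relative to the start of the BIOS
            let o = offset as usize;
            let bytes = [self.bios[o], self.bios[o + 1],
                         self.bios[o + 2], self.bios[o + 3]];
            return u32::from_le_bytes(bytes);
        }

        // No mapping matched: give up loudly
        panic!("unhandled fetch32 at address {:08x}", addr);
    }
}
```

Adding a peripheral then amounts to adding one more if let block before the final panic!.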
impl Cpu {
    pub fn new(inter: Interconnect) -> Cpu {
        Cpu {
            // PC reset value at the beginning of the BIOS
            pc: 0xbfc00000,
            inter: inter,
        }
    }
    // ...
}
We can also implement the load32 function for the CPU which will just call
the interconnect.
impl Cpu {
    // ...
    /// Load 32bit value from the interconnect
    fn load32(&self, addr: u32) -> u32 {
        self.inter.load32(addr)
    }
}
We’re still lacking the decode_and_execute function, let’s use a placeholder
function that just panics for now:
impl Cpu {
    // ...
    fn decode_and_execute(&mut self, instruction: u32) {
        panic!("Unhandled instruction {:08x}",
               instruction);
    }
}
    let inter = Interconnect::new(bios);
    loop {
        cpu.run_next_instruction();
    }
}
I’ve hardcoded the BIOS path for now. It would be better to read it from
the command line, a config file or even some fancy dialog window but it’ll do
nicely for now.
We should now be able to build the code. When I run it, assuming that the
BIOS file was found at the correct location I get:
2.10 Instruction decoding
We’ve now fetched our first instruction from the BIOS: 0x3c080013. What do
we do with this?
In order to be able to run this instruction we need to decode it to figure out
what it means. Instruction encoding is of course CPU dependent so we need
to interpret this value in the context of the PlayStation R3000 processor. Once
again the Nocash specs have our back and list the format of the instruction.
MIPS is a common architecture used outside of the PlayStation and you can find
plenty of resources online describing its instruction set.
Let’s decode this one by hand to see how this works. If we look at the
“Opcode/Parameter Encoding” table in Nocash’s docs we see that we need to
look at the bits [31:26] of the operation to see what type it is. In our case they
are 001111. That means the operation is a LUI or “Load Upper Immediate”.
Immediate means that the value loaded is directly in the instruction, not indirectly
somewhere else in memory. Upper means that it’s loading this immediate value
into the high 16 bits of the target register. The 16 low bits are cleared (set to 0).
But what are the register and the value used by the instruction? Well we
need to finish decoding it to figure it out: for a LUI bits [20:16] give us the
target register: in our case it’s 01000 which means it’s register 8. Finally bits
[15:0] contain the immediate value: 0000 0000 0001 0011 or 19 in decimal.
Bits [25:21] are not used and their value doesn’t matter.
In other words this instruction puts 0x13 in the 16 high bits of the register 8.
In MIPS assembly [1] it would be equivalent to:
lui $8, 0x13
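The manual decoding above can be double-checked with a few shifts and masks. This is a throwaway sketch: the function decode_i_type and its tuple return value are hypothetical and only serve to verify the three fields of 0x3c080013.

```rust
/// Hypothetical helper splitting an instruction such as LUI into
/// (opcode, target register, immediate).
fn decode_i_type(instruction: u32) -> (u32, u32, u32) {
    // Bits [31:26]: the operation
    let opcode = instruction >> 26;
    // Bits [20:16]: the target register index
    let rt = (instruction >> 16) & 0x1f;
    // Bits [15:0]: the immediate value
    let imm = instruction & 0xffff;

    (opcode, rt, imm)
}
```

Calling decode_i_type(0x3c080013) returns (0b001111, 8, 0x13), which matches the opcode, register and immediate found by hand.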
Enough babbling, let’s implement decoding. First I’ll wrap the raw instruction
in a nice interface that will let us extract the fields without doing the bitshifts
and masking everywhere. If you look at the encoding for other MIPS instructions
you’ll see that it’s fairly regular, for instance immediate values are always stored
in the LSBs:
// Copy lets us call several by-value accessors on the same instruction
#[derive(Clone, Copy)]
struct Instruction(u32);

impl Instruction {
    /// Return bits [31:26] of the instruction
    fn function(self) -> u32 {
        let Instruction(op) = self;
        op >> 26
    }

    /// Return register index in bits [20:16]
    fn t(self) -> u32 {
        let Instruction(op) = self;
        (op >> 16) & 0x1f
    }

    /// Return immediate value in bits [15:0]
    fn imm(self) -> u32 {
        let Instruction(op) = self;
        op & 0xffff
    }
}
[1] I’m using the GNU assembler syntax in this guide unless otherwise noted.
The names for the accessor functions match those I’ve seen used in the various
references to name the various fields.
We can now leverage that fancy interface in decode_and_execute:
impl Cpu {
    // ...
    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            0b001111 => self.op_lui(instruction),
            _ => panic!("Unhandled instruction {:x}",
                        instruction.0),
        }
    }

    /// Load Upper Immediate
    fn op_lui(&mut self, instruction: Instruction) {
        panic!("what now?");
    }
}
We’re very close to finally running our first instruction in full but we’re still
missing something: we see that the register field in this instruction is 5 bits wide,
which means it can index 32 registers. But for now we only have one register in our
CPU: the PC. We need to introduce the rest of them.
Table 4 lists the registers in the PlayStation MIPS R3000 CPU (ignoring the
coprocessors for now). They’re all 32bit wide.
You can see that we have 32 registers ($0 to $31) which are the general
purpose registers. They’re all given a mnemonic used when writing assembly.
For instance, by convention, $29 is the stack pointer ($sp) and $30 holds the
frame pointer ($fp).
It’s important to understand that those are just a convention between
developers, in the hardware there’s no difference between $29 and $30. The point
of those calling conventions is to let code generated by different compilers or
written in assembly by different coders remain interoperable. If you write MIPS
assembly and want to call third party functions (like the
BIOS functions for instance) you’ll have to adhere to this convention.
Only two general purpose registers are given a special meaning by the
hardware itself: $zero and $ra.
However this “move” instruction is not actually part of the MIPS instruction
set, it’s just a convenient shorthand understood by the assembler which will
generate the equivalent instruction:
addu $a0, $v0, $zero
We can see that it effectively does the same thing by setting $a0 to the result
of $v0 + 0 but we avoid having to implement a dedicated “move” instruction in
the CPU.
As we’ll soon see we don’t really have to bother with the various roles assigned
to those general purpose registers when writing our emulator (with the exception
of $zero and $ra) but it’s still useful to know the convention when trying to
understand what some emulated code is doing.
Name  Description
PC    Program counter
HI    high 32bits of multiplication result; remainder of division
LO    low 32bits of multiplication result; quotient of division
Table 5 lists the three special purpose CPU registers. We’re already familiar
with the PC used to keep track of the code execution. The two others are HI and
LO which contain the results of multiplication and division instructions. Those
cannot be used as general purpose registers, instead there are special instructions
used to manipulate them. We’ll discover them as we implement them.
The registers are not initialized on reset, so they contain garbage values when
we start up. For the sake of our emulator being deterministic I won’t actually
put random values in the registers however, instead I’m going to use an arbitrary
garbage value 0xdeadbeef. We could as well initialize them to 0 but I prefer
to use a more distinguishable value which can be helpful while debugging. We
must remember to put 0 in $zero however.
impl Cpu {
    pub fn new(inter: Interconnect) -> Cpu {
        // Not sure what the reset values are...
        let mut regs = [0xdeadbeef; 32];
        // ... but R0 is hardwired to 0
        regs[0] = 0;

        Cpu {
            // PC reset value at the beginning of the BIOS
            pc: 0xbfc00000,
            regs: regs,
            inter: inter,
        }
    }
        // Make sure R0 is always 0
        self.regs[0] = 0;
    }
    // ...
}
I’ve also added a getter and a setter. They’re very straightforward but I take
care to always write 0 in $zero in case it gets overwritten. I don’t even bother
checking if the function wrote in this register or another one: writing a 32bit
value is cheap and probably cheaper than adding an if. It’s also important to
note that the BIOS does try to write to $zero. It is believed that this is useful to
discard an I/O result without having to waste a register.
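A cut-down sketch of that getter and setter pair could look like this. The Cpu here only holds the register file, and the method names reg and set_reg follow the calls visible in the instruction handlers. The bodies are my reconstruction of the description above, not a verbatim copy of the emulator.

```rust
/// Cut-down CPU holding only the general purpose registers
struct Cpu {
    regs: [u32; 32],
}

impl Cpu {
    fn new() -> Cpu {
        // Arbitrary garbage value, except for $zero
        let mut regs = [0xdeadbeef; 32];
        regs[0] = 0;
        Cpu { regs }
    }

    /// Read general purpose register `index`
    fn reg(&self, index: u32) -> u32 {
        self.regs[index as usize]
    }

    /// Write `val` to general purpose register `index`
    fn set_reg(&mut self, index: u32, val: u32) {
        self.regs[index as usize] = val;

        // Make sure R0 is always 0, whatever was just written
        self.regs[0] = 0;
    }
}
```

Writing to register 0 through set_reg is therefore harmless: the value is stored and immediately overwritten with 0.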
    // ...
        // Low 16 bits are set to 0
        let v = i << 16;
        self.set_reg(t, v);
    }
}
Note that the low 16bits are set to 0. It’s important as we’ll see with the
next instruction.
The first instruction in the BIOS uses LUI to put 0x13 in the high 16bits of
$8.
In other words, it puts the result of the bitwise or of $8 and 0x243f back into
$8. The previous LUI initialized the high 16bits of $8 and set the rest to 0 so
this one will initialize the low 16bits.
That’s the simplest way to load a constant in a register with the MIPS
instruction set and that’s why it’s important for LUI to set the low 16bits to 0,
otherwise the ORI wouldn’t do the right thing.
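The arithmetic behind this LUI and ORI pair is easy to check with two tiny functions. These are pure illustrations working on plain values, with names of my choosing. They do not touch any CPU state.

```rust
/// LUI: the immediate goes in the high 16 bits, the low 16 bits are cleared
fn lui(imm: u32) -> u32 {
    imm << 16
}

/// ORI: bitwise or of a register value with a 16bit immediate
fn ori(reg: u32, imm: u32) -> u32 {
    reg | imm
}
```

Here ori(lui(0x13), 0x243f) evaluates to 0x0013243f. If the low half of the register held leftover bits instead of zeroes, for example 0x0013ffff, the ORI would leave 0x0013ffff in place and the constant would be wrong.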
The implementation is straightforward:
impl Cpu {
    // ...
    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            0b001111 => self.op_lui(instruction),
            0b001101 => self.op_ori(instruction),
            _ => panic!("Unhandled instruction {:x}",
                        instruction.0),
        }
    }

    /// Bitwise Or Immediate
    fn op_ori(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();
        let s = instruction.s();

        let v = self.reg(s) | i;

        self.set_reg(t, v);
    }
}
After those two instructions the value of $8 should be 0x0013243f. The next
instruction is another LUI which puts 0x1f800000 in $1.
If you’re not familiar with GNU assembly syntax the 0x1010($1) syntax
means “address in $1 plus offset 0x1010”. In this case the full instruction is “store
the 32bits in register $8 at the location $1 + 0x1010”. Given the current values
of the $1 and $8 registers it would store 0x0013243f at the address 0x1f801010.
We can implement the storing to memory by mirroring our load32 code:
impl Cpu {
    // ...
    /// Store 32bit value into the memory
    fn store32(&mut self, addr: u32, val: u32) {
        self.inter.store32(addr, val);
    }

    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            0b001111 => self.op_lui(instruction),
            0b001101 => self.op_ori(instruction),
            0b101011 => self.op_sw(instruction),
            _ => panic!("Unhandled instruction {:x}",
                        instruction.0),
        }
    }
    // ...
    /// Store Word
    fn op_sw(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();
        let s = instruction.s();

        // Target address: base register plus immediate offset
        let addr = self.reg(s).wrapping_add(i);
        // Value to store
        let v = self.reg(t);

        self.store32(addr, v);
    }
}
This code for op_sw is actually subtly broken, I’ll explain why in a moment.
For these values of addr and i it’ll do the right thing though. You can see that
we call into the interconnect’s store32 method that we have yet to implement.
Since the only peripheral we support so far is the BIOS ROM and we can’t write
to it there’s not much we can do at that point, let’s just log the access and panic:
impl Interconnect {
    // ...
    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        panic!("unhandled store32 into address {:08x}", addr);
    }
}
        if addr % 4 != 0 {
            panic!("Unaligned load32 address: {:08x}", addr);
        }
        // ...
    }

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        if addr % 4 != 0 {
            panic!("Unaligned store32 address: {:08x}", addr);
        }
        // ...
    }
}
Looking at the specs we see that registers in this range are for “memory
control”. They’re mainly used to set things like access latencies to the various
peripherals. We’re going to hope we don’t need to emulate those very low level
settings so we’ll ignore the writes to those registers for now.
impl Interconnect {
    // ...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        // ...

        panic!("unhandled store32 into address {:08x}", addr);
    }
}
It’s a bit hackish but at least the store will now go through.
Before we move on to the next instruction we need to address the “subtle
brokenness” in our SW implementation I was talking about earlier.
In other words, if the immediate value of the SW was 0xffff it would give
an offset of -1, not +65535.
In order to support this we don’t need to add any branching, we just need to
sign extend the immediate value. It means that we increase the width of the
16bit value to 32bit but instead of padding with zeroes we pad with the original
MSB (which is sometimes called the sign bit). This way the signed value remains
the same. See table 6 for some examples.
You can see that for values where the sign bit is not set if we simply pad
the 16 high bits with 0s we get the same result in both signed and unsigned
extension. However for values with the MSB set to 1 we have a big difference.
So when we extend values it’s important to know if we’re dealing with signed
or unsigned quantities. We’ll have the same problem with right bit shifts: if
we’re shifting signed quantities we have to pad with the sign bit.
It might sound complicated but it’s very straightforward to implement in
most programming languages. For instance in C, C++ and Rust simply casting
from a 16bit signed integer to a 32bit integer makes the compiler sign-extend
the value. If it didn’t, casting a 16bit variable containing -1 into a 32bit variable
would have the final value be 65535 which is obviously not what we want.
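To make the difference concrete, here is a minimal sketch of both kinds of widening. The helper names zero_extend16 and sign_extend16 are made up for this illustration, they are not part of the emulator:

```rust
/// Widen a 16bit value to 32 bits by padding with zeroes
/// (hypothetical helper, for illustration only).
fn zero_extend16(v: u16) -> u32 {
    v as u32
}

/// Widen a 16bit value to 32 bits by padding with copies of its
/// MSB: going through `i16` is what makes the cast sign-extend.
fn sign_extend16(v: u16) -> u32 {
    (v as i16) as u32
}
```

With these helpers 0x7fff widens to 0x00007fff either way, while 0xffff becomes
0x0000ffff when zero-extended and 0xffffffff (that is, -1) when sign-extended.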
We can’t guess which instructions use signed or unsigned immediate values:
it’s specified in the MIPS instruction set. For instance our ORI instruction
correctly uses an unsigned immediate value.
The nice thing with two’s complement representation is that while we need to
think about the signedness of the value when bitshifting and widening it doesn’t
matter for most arithmetic operations.
For instance the 16 bit addition 0x01ad + 0x84e0 gives the same result
whether the operands are signed or not: 0x01ad is 429, 0x84e0 is either 34016 if
it’s unsigned or -31520 if it’s a two’s complement signed value. 429 + 34016 is
34445 or 0x868d in hexadecimal. 429 - 31520 is -31091 or 0x868d in 16bit two’s
complement hexadecimal.
You can see that doing the calculation with signed or unsigned quantities
doesn’t matter: we end up with the same binary pattern.
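The same check can be written as a short sketch. Both helpers below are hypothetical and only exist to compare the two interpretations of the same 16 bits:

```rust
/// 16bit addition treating both operands as unsigned.
fn add16_unsigned(a: u16, b: u16) -> u16 {
    a.wrapping_add(b)
}

/// 16bit addition treating both operands as two's complement
/// signed values. The result is returned as a raw bit pattern so
/// it can be compared with the unsigned version.
fn add16_signed(a: u16, b: u16) -> u16 {
    let sum = (a as i16).wrapping_add(b as i16);
    sum as u16
}
```

Called with 0x01ad and 0x84e0 both functions return 0x868d, matching the
calculation above.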
Therefore we just need to care about the sign when widening the immediate
from 16 to 32 bits and then we can proceed with our usual “unsigned” addition
and we’ll get the correct result whether the offset is negative or positive:
impl Instruction {
    // ...

    /// Return immediate value in bits [15:0] as a sign-extended 32bit
    /// value
    fn imm_se(self) -> u32 {
        let Instruction(op) = self;

        let v = (op & 0xffff) as i16;

        v as u32
    }
}
Note the order of the casts from u32 to i16 back to u32. They might
look useless but that’s what’s forcing the compiler to generate instructions to
sign-extend v.
2.18 SW instruction
We can now use this function to fix op_sw, we just have to replace instruction.imm()
with the new sign-extending instruction.imm_se():
impl Cpu {
    // ...

    /// Store Word
    fn op_sw(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v    = self.reg(t);

        self.store32(addr, v);
    }
}
Obviously this instruction does absolutely nothing since it stores the result
in $zero. This instruction is just the preferred way to encode a NOP [2]. There are
many instructions in the MIPS architecture that behave like NOPs, for instance
using the opcodes we’ve already encountered we can craft several other NOPs:
    lui $zero, 0
    ori $zero, $zero, 0
    ori $zero, $4, 1234
[2] MIPS assemblers actually feature a nop pseudo-instruction that generates this
And there are many others since almost anything targeting $zero is a NOP [3]. I
think the SLL version is preferred simply because it has this noticeable encoding
of being all 0s.
In this case I can only assume that the NOP is used as a delay, probably
waiting for the previous SW instructions to take effect but I’m not entirely sure
why it’s needed.
In our emulator we won’t special-case this particular instruction, we can just
implement the generic SLL instruction in full. Since NOPs are pretty common
it might make some sense to special-case them but we’ll need to benchmark it
to make sure the cost of the test won’t be greater than computing a useless shift
and storing it in $zero.
Let’s start by implementing the accessors (the shift immediate is only 5bits
since it wouldn’t make sense to shift by more than 31 places and the rest of the
low bits is taken by the “subfunction” part of the instruction):
impl Instruction {
    // ...

    /// Return register index in bits [15:11]
    fn d(self) -> u32 {
        let Instruction(op) = self;

        (op >> 11) & 0x1f
    }

    /// Return bits [5:0] of the instruction
    fn subfunction(self) -> u32 {
        let Instruction(op) = self;

        op & 0x3f
    }

    /// Shift Immediate values are stored in bits [10:6]
    fn shift(self) -> u32 {
        let Instruction(op) = self;

        (op >> 6) & 0x1f
    }
}
Now that we have our fancy getters ready to parse the instruction we can
implement the opcode itself:
impl Cpu {
    // ...

    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            0b000000 => match instruction.subfunction() {
                0b000000 => self.op_sll(instruction),
                _        => panic!("Unhandled instruction {:08x}",
                                   instruction.0),
            },
            0b001111 => self.op_lui(instruction),
            0b001101 => self.op_ori(instruction),
            0b101011 => self.op_sw(instruction),
            _        => panic!("Unhandled instruction {:08x}",
                               instruction.0),
        }
    }

    /// Shift Left Logical
    fn op_sll(&mut self, instruction: Instruction) {
        let i = instruction.shift();
        let t = instruction.t();
        let d = instruction.d();

        let v = self.reg(t) << i;

        self.set_reg(d, v);
    }
}
[3] One exception would be memory loads which can have side effects even if the value is
discarded in $zero.
Obviously in this case it won’t do anything since it’s a NOP but it should
work correctly when we encounter a “real” SLL instruction.
        // ...

        let v = self.reg(s).wrapping_add(i);

        self.set_reg(t, v);
    }
}
You can see another use of the $zero register: this time with the ADDIU
opcode it sets $8 to the immediate value 0xb88. It saves having a dedicated
“Load immediate” opcode.
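As a rough stand-alone sketch of that behaviour (the free function addiu below is a hypothetical helper, the emulator itself works on its register file through reg and set_reg):

```rust
/// Hypothetical stand-alone ADDIU: add the sign-extended 16bit
/// immediate to `s`, wrapping silently on overflow.
fn addiu(s: u32, imm: u16) -> u32 {
    let imm_se = (imm as i16) as u32;

    s.wrapping_add(imm_se)
}
```

Passing 0 for s (the value of $zero) and 0x0b88 as the immediate simply yields
0x00000b88, which is how the BIOS uses ADDIU as a “load immediate”.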
[4] I’ll skip the code in decode_and_execute from now on, I’m sure you can figure it out by
yourself...
2.21 RAM configuration register
This value of 0x00000b88 is then stored at address 0x1f801060.
This register is called RAM_SIZE in the NoCash specs. The exact purpose
of this register remains partially unknown but it seems to be configuring the
memory controller. I assume that this controller is capable of handling various
amounts of RAM and that this register lets the BIOS load the particular
configuration needed by the Playstation hardware.
At any rate, since we’re trying to emulate the Playstation and not some
generic MIPS computer we probably don’t have to handle this register in any
specific way so it’s hopefully safe to ignore it. I just add a new mapping entry,
ignore the store at this address and move along:
/// Register that has something to do with RAM configuration,
/// configured by the BIOS
pub const RAM_SIZE: Range = Range(0x1f801060, 4);
After this instruction we get a few NOPs. I suppose that the RAM size
configuration takes a few cycles to take effect and the BIOS delays a bit before
continuing.
2.22 J instruction
The next instruction is 0x0bf00054 which is a jump instruction (J). This instruction
is used to change the value of the PC and have the CPU execution pipeline jump
to some other location in memory.
Jump behaves like a goto: it sets the PC to the immediate value contained
in the instruction. Since the instruction is 32bit wide and the instruction set
uses 6bits to encode the opcode it can only specify 26bits of the PC at once.
To make the most of those 26bits the target address is stored shifted two places to
the right. It’s not a problem because instructions must be aligned to a 32bit
boundary so the two LSBs of the PC always have to be zero. It means that
the instruction really encodes 28bits of the target address. The remaining 4
high bits are taken from the PC’s current MSBs and remain untouched. In the case
of our current instruction this makes the target address 0xbfc00150.
You can see that this instruction cannot jump anywhere in memory, only to an
address within the current 256MB of addressable memory. If the CPU needs to
jump further away it’ll have to use another instruction like JR which takes a
full 32bit register containing the destination address. But we’ll see that soon
enough.
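The arithmetic for our instruction can be checked with a small sketch. The function jump_target is a hypothetical stand-alone helper, and the PC value used in the note below is only assumed to lie somewhere in the BIOS region:

```rust
/// Compute the target of a J instruction executed while the
/// program counter is `pc` (stand-alone helper for illustration).
fn jump_target(pc: u32, instruction: u32) -> u32 {
    // Jump target stored in bits [25:0]
    let imm = instruction & 0x3ff_ffff;

    // Keep the 4 high bits of the PC and fill the rest with the
    // immediate shifted back two places to the left
    (pc & 0xf000_0000) | (imm << 2)
}
```

For 0x0bf00054 the immediate is 0x3f00054, which becomes 0x0fc00150 once shifted.
Combined with the high bits of any PC in the 0xbxxxxxxx range it gives 0xbfc00150.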
First we need to add an accessor for the 26bit immediate field:
impl Instruction {
    // ...

    /// Jump target stored in bits [25:0]
    fn imm_jump(self) -> u32 {
        let Instruction(op) = self;

        op & 0x3ffffff
    }
}

impl Cpu {
    // ...

    /// Jump
    fn op_j(&mut self, instruction: Instruction) {
        let i = instruction.imm_jump();

        self.pc = (self.pc & 0xf0000000) | (i << 2);
    }
}
Looks simple enough but unfortunately it’s broken. Why, you ask?
behaviour you have to use “.set noreorder” with the GNU assembler.
    j foo
    lui $a0, 0xf00
The LUI instruction gets executed before the code jumps to foo. When the
function is entered $a0 will be equal to 0x0f000000.
Fortunately it’s pretty easy to emulate this behaviour: we just have to do the
same thing the processor does and load the next instruction before we execute
the current one:
/// CPU state
pub struct Cpu {
    /// The program counter register
    pc: u32,
    /// Next instruction to be executed, used to simulate the branch
    /// delay slot
    next_instruction: Instruction,
    // ...
}

impl Cpu {
    pub fn new(/* ... */) -> Cpu {
        Cpu {
            // PC reset value at the beginning of the BIOS
            pc: 0xbfc00000,
            // ...
            next_instruction: Instruction(0x0), // NOP
        }
    }

    pub fn run_next_instruction(&mut self) {
        let pc = self.pc;

        // Use previously loaded instruction
        let instruction = self.next_instruction;

        // Fetch instruction at PC
        self.next_instruction = Instruction(self.load32(pc));

        // Increment PC to point to the next instruction. All
        // instructions are 32bit long.
        self.pc = pc + 4;

        // ...
    }
}
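To see the delay slot in action outside of the emulator, here is a toy sketch built on the same idea. ToyCpu, its flat instruction memory starting at address 0 and the fact that it only understands NOP, J and LUI are all simplifying assumptions made for this example:

```rust
/// Toy machine used only to illustrate the branch delay slot.
/// `mem` holds one instruction per entry, starting at address 0.
struct ToyCpu {
    pc: u32,
    next_instruction: u32,
    regs: [u32; 32],
    mem: Vec<u32>,
}

impl ToyCpu {
    fn new(mem: Vec<u32>) -> ToyCpu {
        ToyCpu { pc: 0, next_instruction: 0, regs: [0; 32], mem }
    }

    fn run_next_instruction(&mut self) {
        // Use previously loaded instruction
        let instruction = self.next_instruction;

        // Fetch instruction at PC, it will run on the next call
        self.next_instruction = self.mem[(self.pc / 4) as usize];
        self.pc = self.pc.wrapping_add(4);

        match instruction >> 26 {
            // J: only changes which instruction gets fetched next
            0b000010 => {
                let imm = instruction & 0x3ff_ffff;
                self.pc = (self.pc & 0xf000_0000) | (imm << 2);
            }
            // LUI: put the immediate in the 16 high bits of $t
            0b001111 => {
                let t = ((instruction >> 16) & 0x1f) as usize;
                self.regs[t] = (instruction & 0xffff) << 16;
            }
            // Everything else is treated as a NOP
            _ => (),
        }
    }
}
```

If the memory starts with a jump to address 0x10 followed by lui $4, 0xf00, stepping
the machine leaves 0x0f000000 in $4 even though the LUI sits after the jump, while
the instruction located right after the delay slot is never executed.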
2.24 OR instruction
After the jump there’s a sequence of LUI/ORI/SW used to store a bunch of
values in the SYS_CONTROL registers that we chose to ignore. We then stumble
upon a new instruction: 0x00000825 which encodes a bitwise or operation:
    or $1, $zero, $zero
Unlike ORI which used an immediate value as a 2nd operand this one takes
two registers and stores the result in a third one. We can see that in this case
the two source registers are $zero so it just clears $1. The implementation is
fairly straightforward:
impl Cpu {
    // ...

    /// Bitwise Or
    fn op_or(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = self.reg(s) | self.reg(t);

        self.set_reg(d, v);
    }
}
The next few instructions use OR to set all the general purpose registers to
0.
impl Cpu {
    // ...

    /// Bitwise Or
    fn op_or(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = s | self.reg(t); // Oops...

        self.set_reg(d, v);
    }
}
This code is broken: instead of OR-ing the value of the register number s it
ORs the index s itself. It’s meaningless and obviously wrong and yet it builds
without any error.
Fortunately with a small modification in our code we can have the compiler
reject such code by wrapping register indexes in a new type incompatible with
u32:
struct RegisterIndex(u32);
Note that this is not like a typedef in C or C++: typedef just creates an
alias which remains compatible (i.e. interchangeable) with the original type.
The equivalent in C would be to wrap the u32 in a struct or something like
that.
Then we just have to update our helpers as well as the Cpu::reg and
Cpu::set_reg methods to use a RegisterIndex instead of a plain u32.
With this modification the compiler will reject the broken op_or implementation
above.
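Here is a minimal sketch of what the updated helpers could look like. The cut-down Cpu with only a register array and the or method taking indexes directly are assumptions made to keep the example self-contained:

```rust
/// Index of one of the 32 general purpose registers
#[derive(Clone, Copy)]
struct RegisterIndex(u32);

/// Cut-down CPU holding only the general purpose registers
struct Cpu {
    regs: [u32; 32],
}

impl Cpu {
    fn reg(&self, index: RegisterIndex) -> u32 {
        self.regs[index.0 as usize]
    }

    fn set_reg(&mut self, index: RegisterIndex, val: u32) {
        self.regs[index.0 as usize] = val;

        // Make sure R0 is always 0
        self.regs[0] = 0;
    }

    /// Bitwise Or of registers `s` and `t`, stored in `d`
    fn or(&mut self, d: RegisterIndex, s: RegisterIndex, t: RegisterIndex) {
        // Writing `s | self.reg(t)` here no longer builds: the `|`
        // operator is not implemented for `RegisterIndex`.
        let v = self.reg(s) | self.reg(t);

        self.set_reg(d, v);
    }
}
```

The only way to get at the raw number is now the explicit `.0` field access, which
is much harder to type by accident.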
Hopefully we shouldn’t have to mess with the GTE until we start encountering
3D code.
2.28 MTC0 instruction
Back to the 0x408c6000 instruction: the opcode (bits [31:26]) is equal to
0b010000 which means that it’s an instruction for the coprocessor 0. The
generic format is 0b0100nn where nn is the coprocessor number.
impl Cpu {
    // ...

    fn decode_and_execute(&mut self, instruction: Instruction) {
        match instruction.function() {
            // ...
            0b010000 => self.op_cop0(instruction),
            _        => panic!("Unhandled instruction {}",
                               instruction),
        }
    }

    /// Coprocessor 0 opcode
    fn op_cop0(&mut self, instruction: Instruction) {
        match instruction.cop_opcode() {
            0b00100 => self.op_mtc0(instruction),
            _       => panic!("unhandled cop0 instruction {}",
                              instruction)
        }
    }
}
The coprocessor register $cop0_12 is very useful: it’s called the “status
register” or SR for short. Among other things it’s used to query and mask the
exceptions and to control the cache behaviour.
At this point the $12 register contains 0x00010000 so this MTC0 instruction
sets bit 16 of SR which is the “isolate cache” bit. It makes all the following reads
and writes target the cache directly instead of going through it towards the main
memory. We’re probably in the middle of the cache initialization sequence.
At any rate since we still haven’t implemented anything cache-related we’ll
just store the value of the SR in our Cpu struct and move along:
/// CPU state
pub struct Cpu {
    // ...
    /// Cop0 register 12: Status Register
    sr: u32,
}

impl Cpu {
    // ...

    pub fn new(/* ... */) -> Cpu {
        Cpu {
            // ...
            sr: 0,
        }
    }

    fn op_mtc0(&mut self, instruction: Instruction) {
        let cpu_r = instruction.t();
        let cop_r = instruction.d().0;

        let v = self.reg(cpu_r);

        match cop_r {
            12 => self.sr = v,
            n  => panic!("Unhandled cop0 register: {:08x}", n),
        }
    }
}
[7] This is actually pseudo-assembly for the sake of clarity. The correct GNU assembler syntax
would be mtc0 $12, $12 but it’s a bit too ambiguous for my taste.
impl Cpu {
    // ...

    /// Store Word
    fn op_sw(&mut self, instruction: Instruction) {
        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("ignoring store while cache is isolated");
            return;
        }

        // ...
    }
}

    bne $10, $11, -36
In other words the instruction will compare the values in $10 and $11 and
if they’re unequal it’ll subtract 36 from the PC. If the values are equal it’ll do
absolutely nothing.
Like jumps, branches have a delay slot. Fortunately our implementation in
section 2.23 already takes care of that without any more work.
I’ve decided to factor the “branching” code itself in a separate function
because we’ll have to use the same logic in the other branch instructions:
impl Cpu {
    // ...

    /// Branch to immediate value `offset`.
    fn branch(&mut self, offset: u32) {
        // Offset immediates are always shifted two places to the
        // right since `PC` addresses have to be aligned on 32bits at
        // all times.
        let offset = offset << 2;

        let mut pc = self.pc;

        pc = pc.wrapping_add(offset);

        self.pc = pc;
    }

    /// Branch if Not Equal
    fn op_bne(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();
        let t = instruction.t();

        if self.reg(s) != self.reg(t) {
            self.branch(i);
        }
    }
}
Since this operation checks for signed overflow I’ll cast the operands to i32
before using the checked_add method provided by Rust’s standard library [9]. For now
I just panic if we encounter an overflow, we’ll change that when we actually
implement exceptions:
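impl Cpu {
    // ...

    /// Add Immediate, checking for signed overflow
    fn op_addi(&mut self, instruction: Instruction) {
        let i = instruction.imm_se() as i32;
        let t = instruction.t();
        let s = instruction.s();

        let s = self.reg(s) as i32;

        let v = match s.checked_add(i) {
            Some(v) => v as u32,
            None    => panic!("ADDI overflow"),
        };

        self.set_reg(t, v);
    }
}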
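The overflow check itself can be tried in isolation with a small sketch. The free function addi is a hypothetical helper that returns None where the emulator currently panics:

```rust
/// Hypothetical stand-alone ADDI: returns `None` when the signed
/// addition overflows, which is where the real CPU would raise an
/// exception.
fn addi(s: u32, imm: u16) -> Option<u32> {
    let s = s as i32;
    let i = (imm as i16) as i32;

    s.checked_add(i).map(|v| v as u32)
}
```

Adding 1 to 0x7fffffff overflows the signed range and gives None, while adding
1 to 0xffffffff is just -1 + 1 and gives Some(0).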
We can reuse the load32 method to fetch the data from memory:
impl Cpu {
    // ...

    /// Load Word
    fn op_lw(&mut self, instruction: Instruction) {
        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore load
            println!("Ignoring load while cache is isolated");
            return;
        }

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        let v = self.load32(addr);

        self.set_reg(t, v);
    }
}
find plenty of examples online. Welcome to the 1970s. Be careful with your implementation
though because signed integer overflow is undefined behaviour in C.
The first MOVE instruction [11] is in the load delay slot of the previous LW.
That means that at that point the register $1 does not yet contain the value
loaded into it. So after these two instructions $2 contains the value of $1 before
the load. The 2nd MOVE however takes place after the load delay slot so $3
will contain the final, post-load value of $1.
But it gets worse. Consider the value of $1 after these two instructions:
    lw    $1, 0($zero)     /* Load $1 with the value at address 0 */
    addiu $1, $zero, 42    /* Put 42 in $1 */
We first use LW to load something in $1 and then, while the load takes place,
we change the value of $1 with an ADDIU instruction. Who wins?
You might think that since the LW finishes after the load delay slot its fetched
value will override the one set by the ADDIU. It turns out that it’s not the case
however: after those two instructions $1 will contain 42, no matter what the LW
fetched.
It’s a bit of bad news for us emulator writers. It means we can’t execute
the load before the delay slot because the instruction must see the previous
value of the loaded register (otherwise the first example code above won’t work)
and we can’t just execute it afterwards because it would make the load take
priority over the delay slot (thus breaking our 2nd example).
One way to see it is that the loaded value ends up in the target register after
the next instruction has fetched the input register values but before the next
instruction updates the target register values. In our first example $1 is an input
register to both MOVEs while in the 2nd it’s the output (destination) register
of the ADDIU.
We could implement it exactly that way by splitting each instruction in two:
• The first part would take the pre-load register values, compute the result
(adding $zero and 42 in the 2nd example above),
• Then it would execute any pending load,
• Finally it would store the result of the computation in the target register
($1 in the ADDIU). That way the ADDIU will write last.
[10] This behaviour is part of the MIPS I architecture. Later revisions (starting with MIPS II)
don’t have load delay slots, only branch delay slots.
[11] As I mentioned earlier MOVE is actually a pseudo-instruction that the assembler will
I don’t really like this solution however because we’ll have to handle load
delays explicitly in all instructions which seems inelegant and error-prone.
Instead I’m going to use two sets of general purpose registers: one will be
the input set and the other the output set. Each instruction will read its input
values from the former set and will write to the latter. Once the instruction is
finished we copy the output set into the input set for the next instruction.
This way we can update the output register set with the load value before
we execute the instruction and it will still see the old value from the input set.
And if the instruction writes to the same register it will overwrite the value in
the output set.
Hopefully it should be clearer in code. First let’s add a 2nd set of registers
and a (register, value) tuple containing the pending load:
/// CPU state
pub struct Cpu {
    // ...

    /// 2nd set of registers used to emulate the load delay slot
    /// accurately. They contain the output of the current
    /// instruction.
    out_regs: [u32; 32],
    /// Load initiated by the current instruction
    load: (RegisterIndex, u32),
}

impl Cpu {
    pub fn new(/* ... */) -> Cpu {
        Cpu {
            // ...
            out_regs: regs,
            load: (RegisterIndex(0), 0),
        }
    }

    // ...
}

impl Cpu {
    // ...

    fn set_reg(&mut self, index: RegisterIndex, val: u32) {
        self.out_regs[index.0 as usize] = val;

        // Make sure R0 is always 0
        self.out_regs[0] = 0;
    }
}
Since all our instructions so far use this helper method to update the register
values we won’t have to modify their code at all.
The next step is to update run_next_instruction to handle pending loads
and copy the output registers between instructions:
impl Cpu {
    // ...

    pub fn run_next_instruction(&mut self) {
        // ...

        // Execute the pending load (if any, otherwise it will load
        // $zero which is a NOP). `set_reg` works only on
        // `out_regs` so this operation won't be visible by
        // the next instruction.
        let (reg, val) = self.load;
        self.set_reg(reg, val);

        // We reset the load to target register 0 for the next
        // instruction
        self.load = (RegisterIndex(0), 0);

        self.decode_and_execute(instruction);

        // Copy the output registers as input for the
        // next instruction
        self.regs = self.out_regs;
    }
}
You can see that we’re copying 128 bytes worth of registers for each instruction
which might not be great performance-wise but at this point I don’t really care
about that.
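To convince ourselves that the scheme handles both examples, here is a self-contained sketch of it. This cut-down Cpu is not the emulator’s: the step method takes the instruction as a closure instead of fetching and decoding it, and lw receives the fetched value directly instead of reading memory. Both are shortcuts assumed for the example:

```rust
/// Index of a general purpose register
#[derive(Clone, Copy)]
struct RegisterIndex(u32);

/// Cut-down CPU with just enough state to show the load delay
struct Cpu {
    /// Input set: what the current instruction reads
    regs: [u32; 32],
    /// Output set: what the current instruction writes
    out_regs: [u32; 32],
    /// Load initiated by the previous instruction
    load: (RegisterIndex, u32),
}

impl Cpu {
    fn new() -> Cpu {
        Cpu {
            regs: [0; 32],
            out_regs: [0; 32],
            load: (RegisterIndex(0), 0),
        }
    }

    fn reg(&self, index: RegisterIndex) -> u32 {
        self.regs[index.0 as usize]
    }

    fn set_reg(&mut self, index: RegisterIndex, val: u32) {
        self.out_regs[index.0 as usize] = val;
        self.out_regs[0] = 0;
    }

    /// Run one instruction, passed as a closure standing in for
    /// the fetch and decode steps.
    fn step<F: FnOnce(&mut Cpu)>(&mut self, execute: F) {
        // Complete the pending load in the output set only
        let (reg, val) = self.load;
        self.set_reg(reg, val);
        self.load = (RegisterIndex(0), 0);

        execute(self);

        self.regs = self.out_regs;
    }

    /// LW with the memory access replaced by a ready-made value
    fn lw(&mut self, t: RegisterIndex, fetched: u32) {
        self.load = (t, fetched);
    }

    fn addiu(&mut self, t: RegisterIndex, s: RegisterIndex, imm: u32) {
        let v = self.reg(s).wrapping_add(imm);
        self.set_reg(t, v);
    }
}
```

Stepping a load into $1 followed by two moves from $1 leaves the old value in the
first destination and the loaded value in the second. Stepping a load into $1
followed by addiu $1, $zero, 42 leaves 42 in $1, as on the real hardware.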
2.33 LW instruction
We can now write the correct, load-delay friendly implementation of LW:
impl Cpu {
    // ...

    /// Load Word
    fn op_lw(&mut self, instruction: Instruction) {
        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore load
            println!("Ignoring load while cache is isolated");
            return;
        }

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        let v = self.load32(addr);

        // Put the load in the delay slot
        self.load = (t, v);
    }
}
impl Ram {
    /// Instantiate main RAM with garbage values
    pub fn new() -> Ram {
        // Default RAM contents are garbage
        let data = vec![0xca; 2 * 1024 * 1024];

        // ...
    }

    /// Fetch the 32bit little endian word at `offset`
    pub fn load32(&self, offset: u32) -> u32 {
        let offset = offset as usize;

        // ...
    }

    /// Store the 32bit little endian word `val` into `offset`
    pub fn store32(&mut self, offset: u32, val: u32) {
        let offset = offset as usize;

        let b0 = val as u8;
        let b1 = (val >> 8) as u8;
        let b2 = (val >> 16) as u8;
        let b3 = (val >> 24) as u8;

        // ...
    }
}
I arbitrarily chose 0xca as the poison value on startup. It’s pretty strange
that the BIOS attempts to fetch data from the RAM before writing anything to
it (and effectively reading garbage) but if you look at the following instructions
it repeatedly reads the same address (the first word in RAM) and does nothing
with it. I’m not sure what this code does but it probably initializes something.
Let’s hope it’s not too important. . .
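For reference, a complete RAM along those lines could look like the sketch below. The struct layout (a plain Vec<u8> field named data) and the constructor are assumptions, only the little endian byte order and the 0xca poison value come from the text:

```rust
/// RAM (sketch: assumes a plain byte buffer behind the scenes)
pub struct Ram {
    data: Vec<u8>,
}

impl Ram {
    /// Instantiate 2MB of RAM filled with the poison value
    pub fn new() -> Ram {
        Ram { data: vec![0xca; 2 * 1024 * 1024] }
    }

    /// Fetch the 32bit little endian word at `offset`
    pub fn load32(&self, offset: u32) -> u32 {
        let offset = offset as usize;

        let b0 = self.data[offset] as u32;
        let b1 = self.data[offset + 1] as u32;
        let b2 = self.data[offset + 2] as u32;
        let b3 = self.data[offset + 3] as u32;

        b0 | (b1 << 8) | (b2 << 16) | (b3 << 24)
    }

    /// Store the 32bit little endian word `val` into `offset`
    pub fn store32(&mut self, offset: u32, val: u32) {
        let offset = offset as usize;

        self.data[offset] = val as u8;
        self.data[offset + 1] = (val >> 8) as u8;
        self.data[offset + 2] = (val >> 16) as u8;
        self.data[offset + 3] = (val >> 24) as u8;
    }
}
```

A freshly created Ram returns 0xcacacaca for any word, which makes reads of
uninitialized memory easy to spot in the logs.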
We can then plug our brand new RAM in the interconnect as usual:
pub const RAM: Range = Range(0xa0000000, 2 * 1024 * 1024);
• $cop0_5 is BDA, the data breakpoint. It’s like BPC except it breaks when
a certain address is accessed on a data load/store instead of a PC value.
• $cop0_6: I couldn’t find much information on this register or what it
does, the consensus seems to be that it’s basically useless.
• $cop0_7 is DCIC, used to enable and disable the various hardware
breakpoints.
• $cop0_9 is BDAM, it’s a bitmask applied when testing for BDA above.
That way we could trigger on a range of addresses instead of a single one.
• $cop0_11 is BPCM, like BDAM but for masking the BPC breakpoint.
You can see that most of those registers (except SR and CAUSE) deal with
hardware breakpoints. That’s generally used for debugging so we shouldn’t need
to emulate those for most games. It’s probably safe to ignore them for now. You can
see that the BIOS loads $zero into all of them which disables them.
For now we’re just going to ignore writes to these registers when the value is
0. If at some point some game writes something else we’ll catch it and see what
we need to implement:
impl Cpu {
    // ...

    /// Move To Coprocessor 0
    fn op_mtc0(&mut self, instruction: Instruction) {
        let cpu_r = instruction.t();
        let cop_r = instruction.d().0;

        let v = self.reg(cpu_r);

        match cop_r {
            3 | 5 | 6 | 7 | 9 | 11 => // Breakpoints registers
                if v != 0 {
                    panic!("Unhandled write to cop0r{}", cop_r)
                },
            12 => self.sr = v,
            13 => // Cause register
                if v != 0 {
                    panic!("Unhandled write to CAUSE register.")
                },
            _  => panic!("Unhandled cop0 register {}", cop_r),
        }
    }
}
This instruction compares the value of two registers ($2 and $3 in this case)
and sets the value of a third one ($1) to either 0 or 1 depending on the result of
the “less than” comparison:
impl Cpu {
    // ...

    /// Set on Less Than Unsigned
    fn op_sltu(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = self.reg(s) < self.reg(t);

        self.set_reg(d, v as u32);
    }
}
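The “unsigned” part matters as soon as the MSB of an operand is set. The sketch below contrasts the comparison with a signed one. Both free functions are hypothetical helpers, and slt is only there for the sake of comparison:

```rust
/// Set on Less Than Unsigned: compare the raw 32bit values
fn sltu(s: u32, t: u32) -> u32 {
    (s < t) as u32
}

/// Signed variant shown for contrast: the same bits are compared
/// as two's complement values
fn slt(s: u32, t: u32) -> u32 {
    ((s as i32) < (t as i32)) as u32
}
```

With 0xffffffff and 1 as operands sltu returns 0 (4294967295 is not less than 1)
while slt returns 1 (-1 is less than 1).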
You can see that with $zero as the third operand it simply moves $29 in $30,
so in this case it’s really a MOVE instruction.
The instruction is implemented like ADDIU except that we add two registers
instead of a register and an immediate value:
impl Cpu {
    // ...

    /// Add Unsigned
    fn op_addu(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();

        let v = self.reg(s).wrapping_add(self.reg(t));

        self.set_reg(d, v);
    }
}
2.38 Regions
Our next problem is an unhandled access at address 0x00000060. If we look at
the memory map1 we see that it’s the RAM. But we’ve already added the RAM
in our interconnect in section 2.34!
The problem is that currently we mapped the RAM at 0xa0000000, in the
KSEG1 region. But this time the BIOS attempts to access it through another
region: KUSEG. We could add multiple mappings for each peripheral in each
region but that would be a waste of code and performance.
Let’s take a closer look at how those regions are specified by the MIPS architecture:
All that sounds rather complicated. Fortunately for us since we’re targeting
the Playstation and not some generic MIPS architecture we’ll be able to make
some simplifications:
• The Playstation hardware does not have an MMU and therefore no virtual
memory. We won’t have to deal with memory translation.
• The Playstation CPU has 1KB of data cache and 4KB of
instruction cache. However the data cache is not used, instead its memory
is mapped as the “scratchpad” at a fixed location. In other words we don’t
need to implement the data cache.
• As far as I can tell the Playstation software doesn’t seem to use the
kernel/user privilege separation and runs everything in kernel mode.
In other words the only time we’ll need to worry about which region is in use
is when we’ll implement the cache instruction and only for KSEG1 since that’s
the only non-cached region. For everything else it doesn’t matter through which
region the peripherals are accessed.
In order to solve our issue of having multiple mappings at different addresses
for the same peripherals in different regions we want to compute the unique
physical address corresponding to a memory access and map that through our
interconnect code.
By the descriptions above you see that we should mask a different number
of bits depending on the region. Since KSEG2 doesn’t share anything with the
other regions we won’t touch the address here (otherwise we would allow access
to the RAM through KSEG2 for instance and that wouldn’t be accurate). In
order to avoid branches we can use a nice mask lookup table:
/// Mask array used to strip the region bits of the address. The
/// mask is selected using the 3 MSBs of the address so each entry
/// effectively matches 512MB of the address space. KSEG2 is not
/// touched since it doesn't share anything with the other
/// regions.
const REGION_MASK: [u32; 8] = [
    // KUSEG: 2048MB
    0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
    // KSEG0: 512MB
    0x7fffffff,
    // KSEG1: 512MB
    0x1fffffff,
    // KSEG2: 1024MB
    0xffffffff, 0xffffffff,
];
We can now use this mask_region function in our interconnect’s load and
store functions to convert any address coming from the CPU into a unique
physical address used to identify the target peripheral.
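As a sketch of how the lookup can work: the name mask_region and the table come from the text, but the function body below is my reconstruction under the assumption that the table is simply indexed with the 3 MSBs of the address:

```rust
/// Mask array indexed by the 3 MSBs of the address
const REGION_MASK: [u32; 8] = [
    // KUSEG: 2048MB
    0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
    // KSEG0: 512MB
    0x7fffffff,
    // KSEG1: 512MB
    0x1fffffff,
    // KSEG2: 1024MB
    0xffffffff, 0xffffffff,
];

/// Mask a CPU address to remove the region bits.
pub fn mask_region(addr: u32) -> u32 {
    // Index address space in 512MB chunks
    let index = (addr >> 29) as usize;

    addr & REGION_MASK[index]
}
```

With this, 0x00000060, 0x80000060 and 0xa0000060 all map to the physical address
0x00000060, the BIOS entry point 0xbfc00000 maps to 0x1fc00000 and KSEG2
addresses like 0xfffe0130 go through untouched.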
We also have to change all our current address map declarations to use
physical addresses:
pub const RAM: Range = Range(0x00000000, 2 * 1024 * 1024);
/// Cache control register. Full address since it's in KSEG2
pub const CACHE_CONTROL: Range = Range(0xfffe0130, 4);
2.39 SH instruction
The next unhandled instruction is 0xa5200180 which encodes “store halfword”
(SH). It’s used to write 16bits (a halfword) to the memory:
    sh $zero, 0x180($9)
The implementation is very similar to the “store word” instruction except
we truncate the register to 16bits and we’ll have to implement a new store16
method on our interconnect:
impl Cpu {
    // ...

    /// Store 16bit value into the memory
    fn store16(&mut self, addr: u32, val: u16) {
        self.inter.store16(addr, val);
    }

    /// Store Halfword
    fn op_sh(&mut self, instruction: Instruction) {
        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("Ignoring store while cache is isolated");
            return;
        }

        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v    = self.reg(t);

        self.store16(addr, v as u16);
    }
}
// / S t o r e 16 b i t h a l f w o r d ‘ v a l ‘ i n t o ‘ addr ‘
pub f n s t o r e 1 6 (&mut s e l f , addr : u32 , v a l : u16 ) {
i f addr % 2 != 0 {
p a n i c ! ( ” U n a l i g n e d s t o r e 1 6 a d d r e s s : { : 0 8 x} ” , addr ) ;
}
p a n i c ! ( ” unhandled s t o r e 1 6 i n t o a d d r e s s { : 0 8 x} ” , addr ) ;
}
}
I start with an empty function instead of copying the store32 code because
different devices react differently when we change the transaction width. Some
will pad the value to 32bits with zeroes, others may just set 16bits in the register
and leave the others untouched. For this reason I’ll be conservative and add
them only when needed.
2.40 SPU registers
The halfword is being written to a Sound Processing Unit (SPU) hardware
register. At this point we don’t really care for sound so we’re going to
ignore writes to these addresses for the time being:
impl Interconnect {
    // ...
    /// Store 16bit halfword `val` into `addr`
    pub fn store16(&mut self, addr: u32, _: u16) {
        if addr % 2 != 0 {
            panic!("Unaligned store16 address: {:08x}", addr);
        }
        let abs_addr = map::mask_region(addr);
        if let Some(offset) = map::SPU.contains(abs_addr) {
            println!("Unhandled write to SPU register {:x}", offset);
            return;
        }
        panic!("unhandled store16 into address {:08x}", addr);
    }
}
/// SPU registers
pub const SPU: Range = Range(0x1f801c00, 640);
2.41 JAL instruction
Using the “jump and link” (JAL) instruction it’s easy to implement function
calls: the function is called with JAL and can return to the caller by jumping
to the value in $ra. Then the control returns to the calling function. The $ra
register is the link between the caller and the callee.
We can reuse the regular J opcode implementation and simply add the code
to store the return address in $31:
impl Cpu {
    // ...
    /// Jump And Link
    fn op_jal(&mut self, instruction: Instruction) {
        let ra = self.pc;
        // Store return address in $31 ($ra)
        self.set_reg(RegisterIndex(31), ra);
        self.op_j(instruction);
    }
}
2.42 ANDI instruction
We continue with instruction 0x308400ff which is a “bitwise and immediate”
(ANDI):
andi $4, $4, 0xff
We can simply copy the implementation of ORI and replace the | with an &:
impl Cpu {
    // ...
    /// Bitwise And Immediate
    fn op_andi(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();
        let s = instruction.s();
        let v = self.reg(s) & i;
        self.set_reg(t, v);
    }
}
2.43 SB instruction
After the word and halfword store instructions we now meet 0xa1c42041 which
is a “store byte” (SB) instruction. We have to implement a third path for
accessing the memory like we did for store32 and store16:
impl Cpu {
    // ...
    /// Store 8bit value into the memory
    fn store8(&mut self, addr: u32, val: u8) {
        self.inter.store8(addr, val);
    }
    /// Store Byte
    fn op_sb(&mut self, instruction: Instruction) {
        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("Ignoring store while cache is isolated");
            return;
        }
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();
        // Target address is the base register plus the signed offset
        let addr = self.reg(s).wrapping_add(i);
        let v = self.reg(t);
        // Truncate the register to 8 bits
        self.store8(addr, v as u8);
    }
}
2.44 Expansion 2
The address being written to is 0x1f802041 which falls in the expansion 2
memory map. As far as I can tell this expansion is only used for debugging on
development boards and doesn’t do anything useful on real hardware. Therefore
we’ll just ignore writes to this expansion:
impl Interconnect {
    // ...
    /// Store byte `val` into `addr`
    pub fn store8(&mut self, addr: u32, _: u8) {
        let abs_addr = map::mask_region(addr);
        if let Some(offset) = map::EXPANSION_2.contains(abs_addr) {
            println!("Unhandled write to expansion 2 register {:x}", offset);
            return;
        }
        panic!("unhandled store8 into address {:08x}", addr);
    }
}
/// Expansion region 2
pub const EXPANSION_2: Range = Range(0x1f802000, 66);
2.45 JR instruction
A few steps later we encounter 0x03e00008 which is the “jump register” (JR)
instruction. It simply sets the PC to the value stored in one of the general
purpose registers:
jr $31
Since JAL stores the return address in $31 we can return from a subroutine
by calling jr $ra which is exactly what the BIOS is doing here.
impl Cpu {
    // ...
    /// Jump Register
    fn op_jr(&mut self, instruction: Instruction) {
        let s = instruction.s();
        self.pc = self.reg(s);
    }
}
2.46 LB instruction
The next unhandled instruction is 0x81efe288 which encodes “load byte” (LB).
As you can guess it’s like LW except that it only loads 8bits from the memory13 :
lb $15, -7544($15)
13 Note the use of a negative offset, if we hadn’t implemented proper sign extension earlier
Since the general purpose registers are always 32bit LB only loads the low
8bits of the register. The byte is treated like a signed value so it’s sign extended
to the full 32bits. Of course like LW there’s a load delay of one instruction. We
can implement it like this14 :
impl Cpu {
    // ...
    /// Load Byte (signed)
    fn op_lb(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();
        let addr = self.reg(s).wrapping_add(i);
        // Cast as i8 to force sign extension
        let v = self.load8(addr) as i8;
        // Put the load in the delay slot
        self.load = (t, v as u32);
    }
}
impl Interconnect {
    // ...
    /// Load byte at `addr`
    pub fn load8(&self, addr: u32) -> u8 {
        let abs_addr = map::mask_region(addr);
        if let Some(offset) = map::BIOS.contains(abs_addr) {
            return self.bios.load8(offset);
        }
        panic!("unhandled load8 at address {:08x}", addr);
    }
}
impl Bios {
    // ...
    /// Fetch byte at `offset`
    pub fn load8(&self, offset: u32) -> u8 {
        self.data[offset as usize]
    }
}
14 Note the cast from u8 to i8 and finally u32 to force the sign extension.
2.47 BEQ instruction
We then get a new branch instruction: 0x11e0000c is “branch if equal” (BEQ):
beq $15, $zero, +48
impl Cpu {
    // ...
    /// Branch if Equal
    fn op_beq(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();
        let t = instruction.t();
        if self.reg(s) == self.reg(t) {
            self.branch(i);
        }
    }
}
2.48 Expansion 1
After that the BIOS attempts to read a byte at 0x1f000084. This is where the
first expansion port is mapped. This expansion goes to the parallel port on the
back of the early Playstation models.
If you look at the byte read by the first LB instruction above you’ll see it’s
the first byte in a C-string: “Licensed by Sony Computer Entertainment Inc”.
Apparently in order to detect and validate the expansion the BIOS compares this
hardcoded string with the values stored starting at offset 0x84 in the expansion.
We don’t really have any reason to implement an expansion at this point
so we’ll return the default value when no expansion is present. Looking at
mednafen’s source code it seems to be all ones (see footnote 15):
impl Interconnect {
    // ...
    /// Load byte at `addr`
    pub fn load8(&self, addr: u32) -> u8 {
        let abs_addr = map::mask_region(addr);
        // ...
        // `EXPANSION_1` is the map entry for the first expansion port
        if let Some(_) = map::EXPANSION_1.contains(abs_addr) {
            // No expansion implemented
            return 0xff;
        }
        panic!("unhandled load8 at address {:08x}", addr);
    }
}
15 I’m actually not sure how to test that easily since I need to have an expansion plugged in
the parallel connector to be able to run code on my console. Maybe I could start the code and
unplug it but that doesn’t sound too great. . . A better way would be to burn the test code on
a CD and run it on a modchipped console.
2.49 RAM byte access
Now the BIOS wants to store a byte to the RAM but we haven’t implemented
that yet, let’s fix that by implementing store8 and let’s add load8 while we’re
at it:
impl Interconnect {
    // ...
    /// Store byte `val` into `addr`
    pub fn store8(&mut self, addr: u32, val: u8) {
        let abs_addr = map::mask_region(addr);
        // ...
    }
    /// Load byte at `addr`
    pub fn load8(&self, addr: u32) -> u8 {
        let abs_addr = map::mask_region(addr);
        // ...
    }
}
impl Ram {
    // ...
    /// Store the byte `val` into `offset`
    pub fn store8(&mut self, offset: u32, val: u8) {
        self.data[offset as usize] = val;
    }
    /// Fetch the byte at `offset`
    pub fn load8(&self, offset: u32) -> u8 {
        self.data[offset as usize]
    }
}
2.50 MFC0 instruction
There’s one important thing to note however: MFC instructions behave like
memory loads and have a delay slot before the value is finally stored in the target
register.
Fortunately we can simply re-use our load delay slots infrastructure:
16 I’m using pseudo-assembly again. The proper GNU assembler syntax would be
mfc0 $2, $12
impl Cpu {
    // ...
    /// Move From Coprocessor 0
    fn op_mfc0(&mut self, instruction: Instruction) {
        let cpu_r = instruction.t();
        let cop_r = instruction.d().0;
        let v = match cop_r {
            12 => self.sr,
            13 => // Cause register
                panic!("Unhandled read from CAUSE register"),
            _ =>
                panic!("Unhandled read from cop0r{}", cop_r),
        };
        self.load = (cpu_r, v);
    }
}
2.51 AND instruction
We’ve already implemented OR so we can reuse the code for AND, only changing
the operator:
impl Cpu {
    // ...
    /// Bitwise And
    fn op_and(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();
        let v = self.reg(s) & self.reg(t);
        self.set_reg(d, v);
    }
}
2.52 ADD instruction
ADD adds the value of two registers (like ADDU) but generates an exception on
signed overflow (like ADDI):
impl Cpu {
    // ...
    /// Add and generate an exception on overflow
    fn op_add(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();
        let s = self.reg(s) as i32;
        let t = self.reg(t) as i32;
        let v = match s.checked_add(t) {
            Some(v) => v as u32,
            None => panic!("ADD overflow"),
        };
        self.set_reg(d, v);
    }
}
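To make the difference between the trapping and wrapping additions concrete, here is a small standalone sketch. The helper names add_checked and add_wrapping are hypothetical and not part of the emulator; None stands for "the CPU would raise an overflow exception":

```rust
/// Signed 32bit addition as performed by ADD: `None` means the CPU
/// would raise an overflow exception instead of writing a result.
/// (Hypothetical helper, for illustration only.)
fn add_checked(s: u32, t: u32) -> Option<u32> {
    (s as i32).checked_add(t as i32).map(|v| v as u32)
}

/// ADDU never traps, it simply wraps around.
/// (Hypothetical helper, for illustration only.)
fn add_wrapping(s: u32, t: u32) -> u32 {
    s.wrapping_add(t)
}
```

For instance 0x7fffffff + 1 overflows as a signed addition but wraps to 0x80000000 when unsigned, while 0xffffffff + 1 is simply -1 + 1 = 0 for ADD and does not trap.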
2.53 Interrupt Control registers
impl Interconnect {
    // ...
    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        // ...
        panic!("unhandled store32 into address {:08x}", addr);
    }
}
/// Interrupt Control registers (status and mask)
pub const IRQ_CONTROL: Range = Range(0x1f801070, 8);
2.54 BGTZ instruction
The next unhandled instruction is 0x1ca00003 which is a “branch if greater
than zero” (BGTZ):
bgtz $5, +12
It’s similar to the BEQ and BNE we’ve already encountered but instead of
comparing two registers it compares a single general purpose register to 0.
The comparison is done using signed integers. For unsigned integers the test
would only ever be false if the register contained 0 and we can already test that
with BNE:
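bne $5, $zero, +12
impl Cpu {
    // ...
    /// Branch if Greater Than Zero
    fn op_bgtz(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();
        let v = self.reg(s) as i32;
        if v > 0 {
            self.branch(i);
        }
    }
}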
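The cast to i32 is what makes the comparison signed. As a quick illustration (the two helper functions below are hypothetical and only exist for this example), a register holding 0x80000000 is negative for BGTZ even though it is a large positive number when read as unsigned:

```rust
/// The test BGTZ performs: signed comparison against zero.
fn bgtz_taken(reg: u32) -> bool {
    (reg as i32) > 0
}

/// What the same test would look like on unsigned values:
/// it is only false when the register is 0, like BNE against $zero.
fn unsigned_gtz(reg: u32) -> bool {
    reg > 0
}
```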
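2.55 BLEZ instruction
impl Cpu {
    // ...
    /// Branch if Less than or Equal to Zero
    fn op_blez(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();
        let v = self.reg(s) as i32;
        if v <= 0 {
            self.branch(i);
        }
    }
}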
2.56 LBU instruction
After that we meet instruction 0x90ae0000 which is a “load byte unsigned”
(LBU):
lbu $14, 0($5)
It’s exactly like LB but without sign extension, the high 24 bits of the target
register are set to 0:
impl Cpu {
    // ...
    /// Load Byte Unsigned
    fn op_lbu(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();
        let addr = self.reg(s).wrapping_add(i);
        let v = self.load8(addr);
        // Put the load in the delay slot
        self.load = (t, v as u32);
    }
}
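The only difference between LB and LBU is the chain of casts applied to the loaded byte. A minimal sketch with two hypothetical helpers (not part of the emulator) shows what ends up in the target register:

```rust
/// LB: going through i8 forces Rust to sign extend to 32 bits.
fn lb_extend(byte: u8) -> u32 {
    byte as i8 as u32
}

/// LBU: a direct cast zero extends, the high 24 bits are 0.
fn lbu_extend(byte: u8) -> u32 {
    byte as u32
}
```

A byte with its MSB set such as 0x80 becomes 0xffffff80 with LB but stays 0x00000080 with LBU. Bytes below 0x80 give the same result with both.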
2.57 JALR instruction
JALR is implemented like JR except that it also stores the return address in a
general purpose register. Unlike JAL, JALR can store the return address in any
general purpose register, not just $ra:
impl Cpu {
    // ...
    /// Jump And Link Register
    fn op_jalr(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let ra = self.pc;
        // Store return address in `d`
        self.set_reg(d, ra);
        self.pc = self.reg(s);
    }
}
2.58 BLTZ, BLTZAL, BGEZ and BGEZAL instructions
• “branch if less than zero” (BLTZ):
bltz $4, +12
impl Cpu {
    // ...
    /// Various branch instructions: BGEZ, BLTZ, BGEZAL, BLTZAL
    fn op_bxx(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();
        let instruction = instruction.0;
        let is_bgez = (instruction >> 16) & 1;
        let is_link = (instruction >> 17) & 0xf == 8;
        let v = self.reg(s) as i32;
        // Test "less than zero"
        let test = (v < 0) as u32;
        // If the test is "greater than or equal to zero" we need
        // to negate the comparison above since
        // ("a >= 0" <=> "!(a < 0)"). The xor takes care of that.
        let test = test ^ is_bgez;
        if is_link {
            let ra = self.pc;
            // Store return address in R31
            self.set_reg(RegisterIndex(31), ra);
        }
        if test != 0 {
            self.branch(i);
        }
    }
}
Instead of testing bit 16 directly I save a branch by xoring the value of test
(which is a boolean 0 or 1) with it.
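The decoding and the xor trick can be checked in isolation. The sketch below uses a hypothetical standalone function (not the emulator's method) that takes the raw instruction word and the register value and returns whether the branch is taken and whether it links. The encodings in the comment are built from the standard MIPS layout with $4 as the source register:

```rust
/// Decode a BLTZ/BGEZ/BLTZAL/BGEZAL word for a given register value.
/// Returns `(branch_taken, is_link)`. Hypothetical helper.
///
/// Example encodings (rs = $4, offset = +12):
///   bltz   0x04800003    bgez   0x04810003
///   bltzal 0x04900003    bgezal 0x04910003
fn bxx_decode(instruction: u32, reg_value: u32) -> (bool, bool) {
    let is_bgez = (instruction >> 16) & 1;
    let is_link = (instruction >> 17) & 0xf == 8;
    // Test "less than zero" as a 0/1 integer
    let test = ((reg_value as i32) < 0) as u32;
    // Xoring with bit 16 negates the test for the BGEZ variants
    let taken = (test ^ is_bgez) != 0;
    (taken, is_link)
}
```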
2.59 SLTI instruction
impl Cpu {
    // ...
    /// Set if Less Than Immediate (signed)
    fn op_slti(&mut self, instruction: Instruction) {
        let i = instruction.imm_se() as i32;
        let s = instruction.s();
        let t = instruction.t();
        let v = (self.reg(s) as i32) < i;
        self.set_reg(t, v as u32);
    }
}
impl Cpu {
    // ...
    /// Subtract Unsigned
    fn op_subu(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();
        let v = self.reg(s).wrapping_sub(self.reg(t));
        self.set_reg(d, v);
    }
}
There are two versions of the shift right instruction: arithmetic and logical.
The arithmetic version considers that the value is signed and uses the sign bit to
fill the missing MSBs in the register after the shift.
In Rust, C and C++ we can achieve the same behavior by casting the register
value to a signed integer before doing the shift:
impl Cpu {
    // ...
    /// Shift Right Arithmetic
    fn op_sra(&mut self, instruction: Instruction) {
        let i = instruction.shift();
        let t = instruction.t();
        let d = instruction.d();
        let v = (self.reg(t) as i32) >> i;
        self.set_reg(d, v as u32);
    }
}
Multiplications and divisions are a bit peculiar on the MIPS architecture: for
one, the result is not stored in general purpose registers but in two dedicated
32bit registers: HI and LO.
For a division LO will contain the quotient and HI the remainder of the
euclidean division.
The reason for this is that divisions and multiplications are typically much
slower than the other instructions we’ve implemented so far (with the exception
of loads and stores potentially, due to the memory latency). While a simple
ADD or SRA can be executed in a single CPU cycle, DIV can take as much as
36 cycles to get the result.
In order to try and hide this delay when the CPU executes a division
instruction it does not stall the pipeline waiting for the instruction to finish.
Rather it continues executing the following instructions and when the code
decides to fetch the result of the division (using dedicated instructions to load HI
or LO) the CPU only stalls if it didn’t have the time to finish doing the division
in the background. This way if you craft your assembly cleverly you can hide
the division delay by doing some other work while the division is finishing.
For now we haven’t bothered implementing accurate timings at all so we
won’t worry about these details and consider the division takes one cycle to
execute. Later on when we implement proper timings we’ll have to revisit that
code.
An important thing to consider is what happens when we encounter a division
by zero. Perhaps surprisingly the CPU does not generate an exception, it just
gives bogus values (1 or -1 depending on the sign of the dividend).
Another bogus behaviour would be to divide 0x80000000 (-2147483648) by
0xffffffff (-1) which would yield 2147483648 which does not fit in a 32bit
signed integer. Table 7 gives a summary of those special cases.
Numerator    Denominator   Quotient (LO)     Remainder (HI)
≥0           0             -1 (0xffffffff)   numerator
<0           0             +1                numerator
0x80000000   0xffffffff    0x80000000        0
We should now have all we need to implement the instruction, let’s start by
adding the HI and LO registers to our Cpu:
/// CPU state
pub struct Cpu {
    // ...
    /// HI register for division remainder and multiplication high
    /// result
    hi: u32,
    /// LO register for division quotient and multiplication low
    /// result
    lo: u32,
}
impl Cpu {
    pub fn new(inter: Interconnect) -> Cpu {
        // ...
        Cpu {
            // ...
            hi: 0xdeadbeef,
            lo: 0xdeadbeef,
        }
    }
    // ...
}
impl Cpu {
    // ...
    /// Divide (signed)
    fn op_div(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let n = self.reg(s) as i32;
        let d = self.reg(t) as i32;
        if d == 0 {
            // Division by zero, results are bogus
            self.hi = n as u32;
            if n >= 0 {
                self.lo = 0xffffffff;
            } else {
                self.lo = 1;
            }
        } else if n as u32 == 0x80000000 && d == -1 {
            // Result is not representable in a 32bit
            // signed integer
            self.hi = 0;
            self.lo = 0x80000000;
        } else {
            self.hi = (n % d) as u32;
            self.lo = (n / d) as u32;
        }
    }
}
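The special cases are easier to verify when the division is isolated from the CPU state. Here is a minimal sketch using a hypothetical free function div_signed that returns the (HI, LO) pair. Note that in Rust the i32::MIN / -1 case has to be handled explicitly anyway since the plain division would panic:

```rust
/// Signed division as described in table 7.
/// Returns `(hi, lo)`, i.e. `(remainder, quotient)`.
/// Hypothetical helper, for illustration only.
fn div_signed(n: i32, d: i32) -> (u32, u32) {
    if d == 0 {
        // Division by zero: bogus quotient, remainder is the numerator
        let lo = if n >= 0 { 0xffffffff } else { 1 };
        (n as u32, lo)
    } else if n == i32::MIN && d == -1 {
        // Quotient (2147483648) is not representable in an i32
        (0, 0x80000000)
    } else {
        ((n % d) as u32, (n / d) as u32)
    }
}
```

The regular path follows Rust's truncating division, so dividing -7 by 2 gives a quotient of -3 and a remainder of -1.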
impl Cpu {
    // ...
    /// Move From LO
    fn op_mflo(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let lo = self.lo;
        self.set_reg(d, lo);
    }
}
“Shift right logical” (SRL) is very similar to SRA except that the instruction
treats the value as unsigned and fills the missing MSBs with 0 after the shift.
In Rust, C and C++ we can achieve this behavior by shifting unsigned values:
impl Cpu {
    // ...
    /// Shift Right Logical
    fn op_srl(&mut self, instruction: Instruction) {
        let i = instruction.shift();
        let t = instruction.t();
        let d = instruction.d();
        let v = self.reg(t) >> i;
        self.set_reg(d, v);
    }
}
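To see the two right shifts side by side, here is a small sketch with hypothetical helpers sra and srl operating on raw 32bit values. The shift amount is assumed to be below 32, which is always true for the 5bit shift field of the instruction:

```rust
/// Arithmetic shift: the sign bit is replicated in the new MSBs.
fn sra(value: u32, shift: u32) -> u32 {
    ((value as i32) >> shift) as u32
}

/// Logical shift: the new MSBs are filled with 0.
fn srl(value: u32, shift: u32) -> u32 {
    value >> shift
}
```

Shifting 0x80000000 right by 4 gives 0xf8000000 with the arithmetic version and 0x08000000 with the logical one. For values with a clear sign bit both give the same result.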
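2.65 SLTIU instruction
After that we meet 0x2c410045 which is “set if less than immediate unsigned”
(SLTIU):
sltiu $1, $2, 0x45
impl Cpu {
    // ...
    /// Set if Less Than Immediate Unsigned
    fn op_sltiu(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let s = instruction.s();
        let t = instruction.t();
        let v = self.reg(s) < i;
        self.set_reg(t, v as u32);
    }
}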
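The sign extension of the immediate followed by an unsigned comparison has a slightly surprising consequence for immediates with the MSB set. A minimal sketch, where imm_se and sltiu are hypothetical free functions standing in for the emulator's methods:

```rust
/// Sign extend a 16bit immediate to 32 bits (what the instruction
/// decoder's sign-extended immediate is assumed to return).
fn imm_se(imm: u16) -> u32 {
    imm as i16 as u32
}

/// SLTIU: sign extend the immediate, then compare as unsigned.
fn sltiu(reg: u32, imm: u16) -> u32 {
    (reg < imm_se(imm)) as u32
}
```

An immediate of 0xffff becomes 0xffffffff, so the comparison is true for every register value except 0xffffffff itself.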
Since the unsigned version of the division (DIVU) uses unsigned operands we only
have one special case: the division by zero (the first line in table 7). Thus the
implementation is slightly shorter than DIV:
impl Cpu {
    // ...
    /// Divide Unsigned
    fn op_divu(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let n = self.reg(s);
        let d = self.reg(t);
        if d == 0 {
            // Division by zero, results are bogus
            self.hi = n;
            self.lo = 0xffffffff;
        } else {
            self.hi = n % d;
            self.lo = n / d;
        }
    }
}
18 Note that the immediate is still sign extended even though it’s then used as an unsigned
value.
2.67 MFHI instruction
We already implemented MFLO, now we meet instruction 0x0000c810 which
encodes “move from HI” (MFHI):
mfhi $25
Like MFLO it should be able to stall if the operation has not yet finished
but we’ll implement that later:
impl Cpu {
    // ...
    /// Move From HI
    fn op_mfhi(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let hi = self.hi;
        self.set_reg(d, hi);
    }
}
impl Cpu {
    // ...
    /// Set on Less Than (signed)
    fn op_slt(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();
        let s = self.reg(s) as i32;
        let t = self.reg(t) as i32;
        let v = s < t;
        self.set_reg(d, v as u32);
    }
}
impl Interconnect {
    // ...
    /// Load 32bit word at `addr`
    pub fn load32(&self, addr: u32) -> u32 {
        // ...
        if let Some(offset) = map::IRQ_CONTROL.contains(abs_addr) {
            println!("IRQ control read {:x}", offset);
            return 0;
        }
        panic!("unhandled load32 at address {:08x}", addr);
    }
    /// Store 16bit halfword `val` into `addr`
    pub fn store16(&mut self, addr: u32, _: u16) {
        // ...
        panic!("unhandled store16 into address {:08x}", addr);
    }
}
2.71 Exceptions
The next unhandled instruction is 0x0000000c which encodes a “system call”
(SYSCALL):
syscall 0
• The current value of the PC is stored in $cop0_14, the EPC (Exception
PC) register (see footnote 19),
• The cause of the exception (syscall, overflow, interrupt...) is recorded in
$cop0_13, the CAUSE register,
19 This is not entirely accurate when the exception occurs in a branch delay slot. We’ll review
Unlike regular jumps and branches exceptions don’t have a branch delay slot:
the CPU jumps to the exception handler right after the current instruction.
The problem is that with my current architecture we fetch an instruction
ahead of time to emulate the branch delay slot. When an exception is triggered
we’d have to replace that instruction by the first one in the exception handler.
It’s possible of course but it’s a bit messy and I think it was a bad idea after all.
Instead I’m going to use two variables for the PC: one will hold the current
instruction and one will hold the “next PC”. Normally next_pc is always 4 bytes
ahead but when a branch occurs we’ll set the PC to the instruction in the delay
slot and next_pc to the branch target. In case of an exception however we’ll set
the PC to the exception handler address directly.
Let’s change our CPU state to reflect that change:
/// CPU state
pub struct Cpu {
    /// The program counter register: points to the
    /// next instruction
    pc: u32,
    /// Next value for the PC, used to simulate the
    /// branch delay slot
    next_pc: u32,
    // ...
}
impl Cpu {
    pub fn new(inter: Interconnect) -> Cpu {
        // ...
        Cpu {
            pc: pc,
            next_pc: pc.wrapping_add(4),
            // ...
        }
    }
    // ...
}
We can then (once again) rework run_next_instruction to use our PC pair:
impl Cpu {
    // ...
    pub fn run_next_instruction(&mut self) {
        let pc = self.pc;
        // Fetch instruction at PC
        let instruction = Instruction(self.load32(pc));
        // Increment next PC to point to the next instruction.
        self.pc = self.next_pc;
        self.next_pc = self.next_pc.wrapping_add(4);
        // Execute the pending load (if any, otherwise it will load
        // `R0` which is a NOP). `set_reg` works only on `out_regs`
        // so this operation won't be visible by the next
        // instruction.
        let (reg, val) = self.load;
        self.set_reg(reg, val);
        // We reset the load to target register 0 for the next
        // instruction
        self.load = (RegisterIndex(0), 0);
        self.decode_and_execute(instruction);
        // Copy the output registers as input for the next
        // instruction
        self.regs = self.out_regs;
    }
}
Then we just need to modify our branch and jump functions to set next_pc
instead of pc to set the target address.
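To check that the PC pair really gives us the delay slot for free, here is a stripped-down sketch that only tracks addresses and does no decoding at all. PcPair and its methods are hypothetical names made up for this example, not the emulator's actual code:

```rust
/// Minimal model of the pc/next_pc pair (addresses only).
struct PcPair {
    pc: u32,
    next_pc: u32,
}

impl PcPair {
    fn new(pc: u32) -> PcPair {
        PcPair { pc, next_pc: pc.wrapping_add(4) }
    }

    /// Execute one instruction and return its address. `branch_to`
    /// is `Some(target)` if that instruction is a taken branch.
    fn step(&mut self, branch_to: Option<u32>) -> u32 {
        let current = self.pc;
        self.pc = self.next_pc;
        self.next_pc = self.next_pc.wrapping_add(4);
        if let Some(target) = branch_to {
            // A branch only changes next_pc: the instruction in the
            // delay slot (now in pc) still runs before the target.
            self.next_pc = target;
        }
        current
    }

    /// Exceptions have no delay slot: both values are replaced.
    fn exception(&mut self, handler: u32) {
        self.pc = handler;
        self.next_pc = handler.wrapping_add(4);
    }
}
```

Starting at 0x100 with a branch to 0x200 located at 0x104, the executed addresses are 0x100, 0x104, 0x108 (the delay slot) and then 0x200. After a call to exception the very next address executed is the handler's.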
After that we can implement our exception infrastructure. On top of pc and
next_pc we’ll also need to store the address of the current instruction to store
it in the EPC register ($cop0_14). We also need to add the CAUSE register to
store the exception code:
/// CPU state
pub struct Cpu {
    // ...
    /// Address of the instruction currently being executed. Used for
    /// setting the EPC in exceptions.
    current_pc: u32,
    /// Cop0 register 13: Cause Register
    cause: u32,
    /// Cop0 register 14: EPC
    epc: u32,
}
impl Cpu {
    // ...
    pub fn run_next_instruction(&mut self) {
        // Fetch instruction at PC
        let instruction = Instruction(self.load32(self.pc));
        // Save the address of the current instruction to save in
        // `EPC` in case of an exception.
        self.current_pc = self.pc;
        // ...
    }
}
Now that we’ve added the EPC and CAUSE registers for cop0 we can also
add them to our implementation of MFC0:
impl Cpu {
    // ...
    /// Move From Coprocessor 0
    fn op_mfc0(&mut self, instruction: Instruction) {
        let cpu_r = instruction.t();
        let cop_r = instruction.d().0;
        let v = match cop_r {
            12 => self.sr,
            13 => self.cause,
            14 => self.epc,
            _ =>
                panic!("Unhandled read from cop0r{}", cop_r),
        };
        self.load = (cpu_r, v);
    }
}
impl Cpu {
    // ...
    /// Trigger an exception
    fn exception(&mut self, cause: Exception) {
        // Exception handler address depends on the `BEV` bit:
        let handler = match self.sr & (1 << 22) != 0 {
            true => 0xbfc00180,
            false => 0x80000080,
        };
        // Shift bits [5:0] of `SR` two places to the left.
        // Those bits are three pairs of Interrupt Enable/User
        // Mode bits behaving like a stack 3 entries deep.
        // Entering an exception pushes a pair of zeroes
        // by left shifting the stack which disables
        // interrupts and puts the CPU in kernel mode.
        // The original third entry is discarded (it's up
        // to the kernel to handle more than two recursive
        // exception levels).
        let mode = self.sr & 0x3f;
        self.sr &= !0x3f;
        self.sr |= (mode << 2) & 0x3f;
        // Update `CAUSE` register with the exception code (bits
        // [6:2])
        self.cause = (cause as u32) << 2;
        // Save current instruction address in `EPC`
        self.epc = self.current_pc;
        // Exceptions don't have a branch delay, we jump directly
        // into the handler
        self.pc = handler;
        self.next_pc = self.pc.wrapping_add(4);
    }
    /// System Call
    fn op_syscall(&mut self, _: Instruction) {
        self.exception(Exception::SysCall);
    }
}
/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    /// System call (caused by the SYSCALL opcode)
    SysCall = 0x8,
}
Our op_syscall method ends up being a one liner. All the logic is in the
generic exception method.
With this SYSCALL instruction the BIOS enters the exception handler. The
NoCash specs tell us that we have to look at the contents of register $4 to know
what the BIOS is supposed to do. In this case $4 contains 1 so it’s supposed
to run “EnterCriticalSection”. This function is apparently supposed to disable
all interrupts. Once this is done if everything works well the exception handler
should return to the caller using an RFE instruction, let’s continue and see if we
find it as expected.
As its name implies “move to LO” (MTLO) just moves the value from a general
purpose register into the LO register. Be careful though because the instruction
encoding is different from MFLO:
impl Cpu {
    // ...
    /// Move to LO
    fn op_mtlo(&mut self, instruction: Instruction) {
        let s = instruction.s();
        self.lo = self.reg(s);
    }
}
It might seem surprising to encounter this instruction: why would the BIOS
want to move something into the LO register? After all this register is for the
result of divisions and multiplications, you can’t do anything with it besides
reading it back.
The answer is that exception handlers are supposed to restore all register
values before returning to the “normal” code flow. The reason is obvious:
exceptions can be triggered by asynchronous interrupts so they can basically
happen at any time. If the exception handler changes the value of any register
before giving back the control to the interrupted code it could lead to bogus
behaviour.
For instance some game code could start a division and be interrupted before
it reads the result in LO. Then the interrupt handler needs to compute another
division but does not restore the original value of the register before returning
the control to the game. At that point the game reads LO expecting to get the
result of its computation but instead it gets some garbage value left there by
the handler. Obviously that would be problematic.
To avoid this the prologue of the exception handler saves the value of the
registers it might modify (including HI and LO) to the RAM and then loads
them back in the epilogue.
There are two exceptions though: registers $26 and $27 are reserved for the
BIOS and are not preserved by the exception handler. In other words no code
should use those registers when exceptions can occur because their content could
change at any moment.
impl Cpu {
    // ...
    /// Move to HI
    fn op_mthi(&mut self, instruction: Instruction) {
        let s = instruction.s();
        self.hi = self.reg(s);
    }
}
All the “return from exception” (RFE) instruction does is shift the Interrupt
Enable/User Mode bits two places back to the right. This effectively undoes the
opposite shift done when entering the handler and therefore puts the CPU back in
the mode it was in when the exception triggered (unless SR itself has been
modified in the handler).
It does not reset the PC however, it’s up to the BIOS to fetch the address in
EPC, increment it by 4 to point at the next instruction and jump to it. The
RFE instruction is typically in the final jump delay slot (and that’s exactly what
the Playstation BIOS handler does in this case).
The instruction encoding for RFE is a bit annoying: as usual we begin by
checking bits [31:26] which are 0b010000 and introduce a coprocessor opcode.
Then we check bits [25:21] to figure out which one it is. For RFE it’s 0b10000.
But it’s not over! There can be multiple instructions with this coprocessor
encoding, although RFE is the only one implemented on the Playstation hardware
(the others have to do with virtual memory). To make sure the requested
instruction is the one we expect we must check bits [5:0] which must be equal to
0b010000:
impl Cpu {
    // ...

    /// Coprocessor 0 opcode
    fn op_cop0(&mut self, instruction: Instruction) {
        match instruction.cop_opcode() {
            0b00000 => self.op_mfc0(instruction),
            0b00100 => self.op_mtc0(instruction),
            0b10000 => self.op_rfe(instruction),
            _       => panic!("unhandled cop0 instruction {}",
                              instruction)
        }
    }

    /// Return From Exception
    fn op_rfe(&mut self, instruction: Instruction) {
        // There are other instructions with the same encoding but all
        // are virtual memory related and the Playstation doesn't
        // implement them. Still, let's make sure we're not running
        // buggy code.
        if instruction.0 & 0x3f != 0b010000 {
            panic!("Invalid cop0 instruction: {}", instruction);
        }

        // Restore the pre-exception mode by shifting the Interrupt
        // Enable/User Mode stack back to its original position.
        let mode = self.sr & 0x3f;
        self.sr &= !0x3f;
        self.sr |= mode >> 2;
    }
}
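To make the three-step RFE decoding and the status register shift easier to play with, here is a minimal standalone sketch. The function names is_rfe and rfe_status are made up for this example and work on raw u32 values instead of the Cpu and Instruction types; the status update simply mirrors what op_rfe does above.

```rust
/// Returns true if `instruction` encodes RFE: bits [31:26] are 0b010000
/// (coprocessor 0), bits [25:21] are 0b10000 and bits [5:0] are 0b010000.
fn is_rfe(instruction: u32) -> bool {
    let primary = instruction >> 26;
    let cop_opcode = (instruction >> 21) & 0x1f;
    let low = instruction & 0x3f;

    primary == 0b010000 && cop_opcode == 0b10000 && low == 0b010000
}

/// Same status register update as `op_rfe`: the six low bits (the
/// Interrupt Enable/User Mode stack) are shifted right by two places,
/// the other bits are left untouched.
fn rfe_status(sr: u32) -> u32 {
    let mode = sr & 0x3f;

    (sr & !0x3f) | (mode >> 2)
}
```

For instance rfe_status(0b001100) gives 0b000011: the "previous" mode bits become the current ones again.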
In this case the CPU will put the address of the jr $ra instruction in EPC
before entering the exception handler. In order to signal this condition to the
handler the CPU also sets bit 31 of the CAUSE register.
[20] This is only for branch delay slots; load delay slots behave normally exception-wise.
In order to implement this behaviour we first need to keep track of whether
or not we’re in a branch delay slot. It’s tempting to just check whether or not the
next instruction is 4 bytes ahead of the current one but it’s technically possible
to branch 4 bytes ahead, even though it wouldn’t be very useful. Instead I’m
going to play it safe and add new variables:
pub struct Cpu {
    // ...
    /// Set by the current instruction if a branch occurred and the
    /// next instruction will be in the delay slot.
    branch: bool,
    /// Set if the current instruction executes in the delay slot
    delay_slot: bool,
}

impl Cpu {
    // ...
        Cpu {
            // ...
            branch: false,
            delay_slot: false,
        }
    }

    pub fn run_next_instruction(&mut self) {
        // ...
        let instruction = Instruction(self.load32(self.pc));
        // ...
    }
    // ...
}
Now we can simply modify (once again) all the branch and jump instructions
to set self.branch = true. In the next cycle run_next_instruction will copy
this variable to self.delay_slot.
Now that we keep track of delay slots we can modify our exception code to
handle them accurately:
impl Cpu {
    // ...

    /// Trigger an exception
    fn exception(&mut self, cause: Exception) {
        // ...

        // Save current instruction address in `EPC`
        self.epc = self.current_pc;

        if self.delay_slot {
            // When an exception occurs in a delay slot `EPC` points
            // to the branch instruction and bit 31 of `CAUSE` is set.
            self.epc = self.epc.wrapping_sub(4);
            self.cause |= 1 << 31;
        }
        // ...
    }
}
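The delay slot rule is easy to check in isolation. The sketch below is only an illustration: exception_epc is a hypothetical free function (it is not part of the Cpu code) that returns the value to store in EPC together with the state of bit 31 of CAUSE.

```rust
/// Hypothetical helper: computes what `exception` stores in `EPC` and
/// whether bit 31 of `CAUSE` must be set, given the address of the
/// faulting instruction and whether it sits in a branch delay slot.
fn exception_epc(current_pc: u32, delay_slot: bool) -> (u32, bool) {
    if delay_slot {
        // Point at the branch instruction right before the delay slot
        (current_pc.wrapping_sub(4), true)
    } else {
        (current_pc, false)
    }
}
```

Using wrapping_sub mirrors the code above and avoids a panic in debug builds if the PC happens to be smaller than 4.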
With our exception handling infrastructure in place we can take the opportunity
to review some exception conditions we’ve ignored so far and implement
them accurately.
    /// Add and check for signed overflow
    fn op_add(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();

        let s = self.reg(s) as i32;
        let t = self.reg(t) as i32;

        match s.checked_add(t) {
            Some(v) => self.set_reg(d, v as u32),
            None    => self.exception(Exception::Overflow),
        }
    }

        // ...
        let s = self.reg(s) as i32;

        match s.checked_add(i) {
            Some(v) => self.set_reg(t, v as u32),
            None    => self.exception(Exception::Overflow),
        }
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    // ...
    /// Arithmetic overflow
    Overflow = 0xc,
}
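If you want to convince yourself that checked_add really catches the cases we care about, here is a small standalone sketch. add_signed is a made-up name for this example; it takes the raw register values and returns None where the real code would raise the Overflow exception.

```rust
/// Hypothetical stand-in for the overflow test in `op_add`: both
/// operands are reinterpreted as signed 32bit integers and `None` is
/// returned when the sum does not fit.
fn add_signed(a: u32, b: u32) -> Option<u32> {
    (a as i32).checked_add(b as i32).map(|v| v as u32)
}
```

Note that 0xffffffff + 1 is not an overflow here: as signed values that’s -1 + 1, which is simply 0. That’s exactly the difference between ADD and ADDU.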
    /// Load Word
    fn op_lw(&mut self, instruction: Instruction) {
        // ...

        // Address must be 32bit aligned
        if addr % 4 == 0 {
            let v = self.load32(addr);

            // Put the load in the delay slot
            self.load = (t, v);
        } else {
            self.exception(Exception::LoadAddressError);
        }
    }

    /// Store Halfword
    fn op_sh(&mut self, instruction: Instruction) {
        // ...

        // Address must be 16bit aligned
        if addr % 2 == 0 {
            self.store16(addr, v as u16);
        } else {
            self.exception(Exception::StoreAddressError);
        }
    }

    /// Store Word
    fn op_sw(&mut self, instruction: Instruction) {
        // ...

        // Address must be 32bit aligned
        if addr % 4 == 0 {
            self.store32(addr, v);
        } else {
            self.exception(Exception::StoreAddressError);
        }
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    // ...
    /// Address error on load
    LoadAddressError = 0x4,
    /// Address error on store
    StoreAddressError = 0x5,
}
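The three functions above all follow the same pattern, which can be summed up in a few lines. This is only a sketch: AddressError and check_alignment are names invented for the example, and the real code raises a CPU exception instead of returning a Result.

```rust
/// Hypothetical error type mirroring the two address error exceptions
#[derive(Debug, PartialEq)]
enum AddressError {
    Load,
    Store,
}

/// Checks that `addr` is a multiple of the access `width` in bytes
/// (2 for halfwords, 4 for words).
fn check_alignment(addr: u32, width: u32, is_store: bool) -> Result<(), AddressError> {
    if addr % width == 0 {
        Ok(())
    } else if is_store {
        Err(AddressError::Store)
    } else {
        Err(AddressError::Load)
    }
}
```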
2.79 PC alignment exception
We should also generate an exception if the PC address is not correctly aligned
when we attempt to fetch an instruction. This can happen if a JR or JALR
instruction jumped to an address that was not 32bit aligned [21]:
impl Cpu {
    // ...

    pub fn run_next_instruction(&mut self) {
        // Save the address of the current instruction to save in
        // `EPC` in case of an exception.
        self.current_pc = self.pc;

        if self.current_pc % 4 != 0 {
            // PC is not correctly aligned!
            self.exception(Exception::LoadAddressError);
            return;
        }

        // Fetch instruction at PC
        let instruction = Instruction(self.load32(self.pc));
        // ...
    }
}
[21] It might be more efficient to add the test in the branch and jump instructions capable of
setting an invalid PC but I don’t really care about performance at that point and that would
make the code more complicated.

    /// Store 16bit halfword `val` into `addr`
    pub fn store16(&mut self, addr: u32, val: u16) {
        let abs_addr = map::mask_region(addr);
        // ...
    }
}

    /// Store the 16bit little endian halfword `val` into `offset`
    pub fn store16(&mut self, offset: u32, val: u16) {
        let offset = offset as usize;

        let b0 = val as u8;
        let b1 = (val >> 8) as u8;

        self.data[offset + 0] = b0;
        self.data[offset + 1] = b1;
    }
}
        panic!("unhandled load32 at address {:08x}", addr);
    }
}

/// Direct Memory Access registers
pub const DMA: Range = Range(0x1f801080, 0x80);
You’ll notice that I ignore all loads from any DMA register, not just the
control. Let’s hope we’ll be able to keep the smoke screen up for a little longer.
Soon after that we encounter a SW targeting the DMA control register with
the value 0x000b0000. This value configures the DMA SPU channel priority
and enables it. This probably means the BIOS is getting ready to play some
sound. Since we don’t care about the SPU or the DMA at that point let’s ignore
those writes as well:
impl Interconnect {
    // ...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        // ...
        panic!("unhandled store32 into address {:08x}: {:08x}",
               addr, val);
    }
}
[22] Although on the Playstation the CPU is seriously gimped while the DMA is running as
Hopefully we should be able to ignore the DMA for a while and keep focusing
on the CPU.
It’s the 16bit counterpart to LBU and it’s our first 16bit load instruction:
impl Cpu {
    // ...
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // Address must be 16bit aligned
        if addr % 2 == 0 {
            let v = self.load16(addr);

            // Put the load in the delay slot
            self.load = (t, v as u32);
        } else {
            self.exception(Exception::LoadAddressError);
        }
    }
}

    /// Load 16bit halfword at `addr`
    pub fn load16(&self, addr: u32) -> u16 {
        let abs_addr = map::mask_region(addr);
        panic!("unhandled load16 at address {:08x}", addr);
    }
}

    /// Load 16bit halfword at `addr`
    pub fn load16(&self, addr: u32) -> u16 {
        // ...
        panic!("unhandled load16 at address {:08x}", addr);
    }

    /// Fetch the 16bit little endian halfword at `offset`
    pub fn load16(&self, offset: u32) -> u16 {
        let offset = offset as usize;

        let b0 = self.data[offset + 0] as u16;
        let b1 = self.data[offset + 1] as u16;

        b0 | (b1 << 8)
    }
}
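To check that the little endian halfword accessors agree with each other, here is a tiny self-contained sketch. The Ram struct below is a simplified stand-in for the real one (it only holds a byte vector and the two 16bit accessors), so treat the names as assumptions.

```rust
/// Simplified stand-in for the RAM struct: just a buffer of bytes
struct Ram {
    data: Vec<u8>,
}

impl Ram {
    fn new(size: usize) -> Ram {
        Ram { data: vec![0; size] }
    }

    /// Store the 16bit little endian halfword `val` into `offset`
    fn store16(&mut self, offset: u32, val: u16) {
        let offset = offset as usize;

        // Least significant byte goes first
        self.data[offset] = val as u8;
        self.data[offset + 1] = (val >> 8) as u8;
    }

    /// Fetch the 16bit little endian halfword at `offset`
    fn load16(&self, offset: u32) -> u16 {
        let offset = offset as usize;

        let b0 = self.data[offset] as u16;
        let b1 = self.data[offset + 1] as u16;

        b0 | (b1 << 8)
    }
}
```

The standard library’s u16::to_le_bytes and u16::from_le_bytes would do the same job; the manual version just makes the byte order explicit.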
It’s like SLL except the shift amount is stored in a register instead of an
immediate value.
The implementation is quite simple but there’s something to consider: so far
the shift amount was always a 5bit immediate value but this time it’s a 32bit
register. What happens when the register value is greater than 31?
It’s also important to figure out because shifting a 32bit value by 32 or more is
undefined behavior in C, and in Rust it panics in debug builds, so we have to be
careful not to introduce surprising behavior in our emulator.
Shifting by more than 31 places would mean shifting the 32bit value completely
out of range. Intuitively you might say that it sets it to 0 (all significant bits get
shifted outside the register) but it turns out it’s not accurate.
In reality on the R3000 CPU the shift amount is always implicitly masked
with 0x1f to only keep the low 5 bits. It means that a shift amount of 32 behaves
like 0 (i.e. it’s a NOP) while 130 behaves like 2:
impl Cpu {
    // ...

    /// Shift Left Logical Variable
    fn op_sllv(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        // Shift amount is truncated to 5 bits
        let v = self.reg(t) << (self.reg(s) & 0x1f);

        self.set_reg(d, v);
    }
}
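The masking rule can be verified with the two examples given above (32 behaves like 0, 130 behaves like 2). In this sketch sllv is a hypothetical free function working on plain u32 values rather than on the CPU registers.

```rust
/// Hypothetical standalone version of the SLLV computation: only the
/// low 5 bits of `amount` are used.
fn sllv(value: u32, amount: u32) -> u32 {
    value << (amount & 0x1f)
}
```

As it happens u32::wrapping_shl applies the same mask to its argument, so 1u32.wrapping_shl(130) also gives 4. I find the explicit & 0x1f clearer though, since it spells out what the hardware does.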
2.84 LH instruction
We implemented LHU not long ago and now we meet 0x87a30018 which is “load
halfword” (LH):
lh $3, 24($29)
It’s implemented like LHU but it sign-extends the 16bit value to fit the 32bit
target register:
impl Cpu {
    // ...

    /// Load Halfword (signed)
    fn op_lh(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // Cast as i16 to force sign extension
        let v = self.load16(addr) as i16;

        // Put the load in the delay slot
        self.load = (t, v as u32);
    }
}
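The only difference between LH and LHU is the cast chain, which is easy to get wrong. Here is a minimal sketch of both conversions; lh_value and lhu_value are names made up for the example.

```rust
/// LH: go through `i16` first so that the cast to `u32` sign-extends
fn lh_value(halfword: u16) -> u32 {
    halfword as i16 as u32
}

/// LHU: a direct cast zero-extends
fn lhu_value(halfword: u16) -> u32 {
    halfword as u32
}
```

So a halfword of 0x8000 becomes 0xffff8000 with LH but stays 0x00008000 with LHU.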
nor $25, $2, $zero

    /// Bitwise Not Or
    fn op_nor(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = !(self.reg(s) | self.reg(t));

        self.set_reg(d, v);
    }
}
We’ve already implemented SRA and SLLV so this one shouldn’t give us any
trouble:
impl Cpu {
    // ...

    /// Shift Right Arithmetic Variable
    fn op_srav(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        // Shift amount is truncated to 5 bits
        let v = (self.reg(t) as i32) >> (self.reg(s) & 0x1f);

        self.set_reg(d, v as u32);
    }
}
[23] Note that in this context ! in Rust does the same thing as ~ in C: it’s the bitwise NOT
operator.
It’s implemented like SRAV without sign extension (or like SRL with a
register holding the shift amount, if you prefer):
impl Cpu {
    // ...

    /// Shift Right Logical Variable
    fn op_srlv(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        // Shift amount is truncated to 5 bits
        let v = self.reg(t) >> (self.reg(s) & 0x1f);

        self.set_reg(d, v);
    }
}
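Since SRAV and SRLV only differ by a cast, it’s worth seeing them side by side. The two free functions below are hypothetical standalone versions of the computations in op_srav and op_srlv.

```rust
/// Arithmetic shift: the sign bit is replicated into the vacated bits
fn srav(value: u32, amount: u32) -> u32 {
    ((value as i32) >> (amount & 0x1f)) as u32
}

/// Logical shift: the vacated bits are filled with zeroes
fn srlv(value: u32, amount: u32) -> u32 {
    value >> (amount & 0x1f)
}
```

With a value of 0x80000000 and a shift amount of 4, srav gives 0xf8000000 while srlv gives 0x08000000.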
It’s our first multiplication opcode. The CPU does the multiplication using
64bit arithmetic and stores the result across the HI and LO registers:
impl Cpu {
    // ...

    /// Multiply Unsigned
    fn op_multu(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();

        let a = self.reg(s) as u64;
        let b = self.reg(t) as u64;

        let v = a * b;

        self.hi = (v >> 32) as u32;
        self.lo = v as u32;
    }
}
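Here is the same computation as a standalone sketch, handy to check the HI/LO split on extreme values. multu is a hypothetical free function returning the (HI, LO) pair instead of writing to the CPU state.

```rust
/// Hypothetical standalone MULTU: returns `(hi, lo)`
fn multu(a: u32, b: u32) -> (u32, u32) {
    // Widening to 64 bits first means the product can never overflow
    let v = (a as u64) * (b as u64);

    ((v >> 32) as u32, v as u32)
}
```

The largest possible product, 0xffffffff times 0xffffffff, is 0xfffffffe00000001, so HI ends up with 0xfffffffe and LO with 1.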
Well, let’s not get ahead of ourselves, for now we have zero GPU emulation
code so we’re going to use the usual deception and have the BIOS read zeroes
when it attempts to access the GPU register space. That’s easy, there are only
two registers in the GPU [24]:
impl Interconnect {
    // ...
        panic!("unhandled load32 at address {:08x}", addr);
    }
}

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        // ...
        }

        panic!("unhandled store32 into address {:08x}: {:08x}",
               addr, val);
    }
}

    /// Load 16bit halfword at `addr`
    pub fn load16(&self, addr: u32) -> u16 {
        // ...
Unsurprisingly it’s followed by a 16 bit write to the same address with the
value 1. This means that the BIOS wants to use the first interrupt which is the
vertical blanking interrupt generated by the GPU’s video output. As usual let’s
ignore that:
impl Interconnect {
    // ...

    /// Store 16bit halfword `val` into `addr`
    pub fn store16(&mut self, addr: u32, val: u16) {
        // ...

impl Interconnect {
    // ...

    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        // ...
After that the BIOS writes 0x148 to 0x1f801114 which sets the timer 1 mode.
Bit 0x8 clears the counter (resets it to 0), bit 0x40 sets the timer interrupt to
repeat mode which means that it will fire periodically when the counter reaches
the target. Finally bit 0x100 sets the clock source as “horizontal blanking”.
It means that the timer increments when the display reaches the horizontal
blanking period.
This doesn’t set bit 0x10 however, which would actually enable the interrupt,
and the BIOS hasn’t attempted to unmask the interrupt in the Interrupt Mask
register either. Not sure where the BIOS is going with this.
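To make the bit-by-bit reading of 0x148 concrete, here is a small decoding sketch. It only covers the four bits discussed above, with the meaning given in the text; the TimerMode struct, its field names and decode_timer_mode are invented for the example and are not part of the emulator yet.

```rust
/// Hypothetical view of the timer mode bits discussed in the text
#[derive(Debug, PartialEq)]
struct TimerMode {
    /// Bit 0x8: counter reset
    reset_counter: bool,
    /// Bit 0x10: interrupt enabled
    irq_enabled: bool,
    /// Bit 0x40: interrupt in repeat mode
    irq_repeat: bool,
    /// Bit 0x100: clock source is the horizontal blanking
    hblank_clock: bool,
}

fn decode_timer_mode(mode: u16) -> TimerMode {
    TimerMode {
        reset_counter: mode & 0x8 != 0,
        irq_enabled: mode & 0x10 != 0,
        irq_repeat: mode & 0x40 != 0,
        hblank_clock: mode & 0x100 != 0,
    }
}
```

Decoding 0x148 this way gives every field set except irq_enabled, which matches the observation that the interrupt is configured but never actually enabled.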
After that the BIOS tries to change the value of the Interrupt Mask and
enables interrupt 0x8 which is the DMA’s.
There are more things in the loop but that’s the important part. We can see
that the BIOS loads GPUSTAT, masks bit 28 and loops if it’s 0.
If we look at the specs we can see that bit 28 of GPUSTAT tells if the GPU
is ready to receive a DMA block. So it seems that the BIOS is polling this bit in
GPUSTAT because it’s about to initiate a DMA transfer between the RAM and
the GPU.
Let’s modify our GPUSTAT handling code to return 0x10000000 when read:
impl Interconnect {
    // ...
            println!("GPU read {}", offset);
            return match offset {
                // GPUSTAT: set bit 28 to signal that the GPU is ready
                // to receive DMA blocks
                4 => 0x10000000,
                _ => 0,
            }
        }
        // ...
    }
}
    /// Bitwise Exclusive Or
    fn op_xor(&mut self, instruction: Instruction) {
        let d = instruction.d();
        let s = instruction.s();
        let t = instruction.t();

        let v = self.reg(s) ^ self.reg(t);

        self.set_reg(d, v);
    }
}
With this instruction implemented the BIOS then goes on to write a bunch of
DMA registers and then gets stuck in another infinite loop, polling GPUSTAT
once again.
We could look at what the BIOS is doing once again to try and figure out
the right value to return to let it continue but that would be a bit pointless at
this point. We’ve almost implemented all the CPU instructions anyway and
we’ve reached the part of the BIOS where the bootup logo is drawn. We need to
implement the DMA to send the commands to the GPU and then emulate the
GPU itself to accept those commands and draw on the screen.
Before we move on though let’s implement the handful of CPU opcodes we
haven’t yet encountered. At this point we’ve implemented 48 opcodes and 19
remain. Fortunately most of those are variations of instructions we’ve
already implemented so let’s get this over with.
code for debugging purposes but I imagine some games might abuse it for other
purposes.
This instruction is encoded by setting bits [31:26] of the instruction to zero
and bits [5:0] to 0xd.
impl Cpu {
    // ...

    /// Break
    fn op_break(&mut self, _: Instruction) {
        self.exception(Exception::Break);
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    // ...
    /// Breakpoint (caused by the BREAK opcode)
    Break = 0x9,
}
    /// Multiply (signed)
    fn op_mult(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();

        // Go through i32 first so that the widening cast sign-extends
        let a = (self.reg(s) as i32) as i64;
        let b = (self.reg(t) as i32) as i64;

        let v = (a * b) as u64;

        self.hi = (v >> 32) as u32;
        self.lo = v as u32;
    }
}
All those casts are a bit ugly but they’re necessary to get the proper sign
extension.
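To see why the casts matter, here is a standalone sketch of the signed multiplication. mult is a hypothetical free function returning the (HI, LO) pair; the cast chain is the one the paragraph above refers to.

```rust
/// Hypothetical standalone MULT: returns `(hi, lo)`
fn mult(a: u32, b: u32) -> (u32, u32) {
    // u32 -> i32 reinterprets the bits as signed, i32 -> i64 then
    // sign-extends to 64 bits
    let a = (a as i32) as i64;
    let b = (b as i32) as i64;

    let v = (a * b) as u64;

    ((v >> 32) as u32, v as u32)
}
```

With 0xffffffff (that is -1) times 2 we get -2, so HI is 0xffffffff and LO is 0xfffffffe. An unsigned multiplication of the same register values would put 1 in HI instead.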
impl Cpu {
    // ...

    /// Subtract and check for signed overflow
    fn op_sub(&mut self, instruction: Instruction) {
        let s = instruction.s();
        let t = instruction.t();
        let d = instruction.d();

        let s = self.reg(s) as i32;
        let t = self.reg(t) as i32;

        match s.checked_sub(t) {
            Some(v) => self.set_reg(d, v as u32),
            None    => self.exception(Exception::Overflow),
        }
    }
}
    /// Bitwise eXclusive Or Immediate
    fn op_xori(&mut self, instruction: Instruction) {
        let i = instruction.imm();
        let t = instruction.t();
        let s = instruction.s();

        let v = self.reg(s) ^ i;

        self.set_reg(t, v);
    }
}
    /// Coprocessor 1 opcode (does not exist on the Playstation)
    fn op_cop1(&mut self, _: Instruction) {
        self.exception(Exception::CoprocessorError);
    }

    /// Coprocessor 3 opcode (does not exist on the Playstation)
    fn op_cop3(&mut self, _: Instruction) {
        self.exception(Exception::CoprocessorError);
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    // ...
    /// Unsupported coprocessor operation
    CoprocessorError = 0xb,
}

    /// Coprocessor 2 opcode (GTE)
    fn op_cop2(&mut self, instruction: Instruction) {
        panic!("unhandled GTE instruction: {}", instruction);
    }
}
After this sequence $1 contains the 4byte little endian value at the address
stored in $2 regardless of its alignment.
You can see that the LWL instruction is given an offset of 3. If the address
was correctly aligned we remain within the same aligned 32bit word, otherwise
we’ve moved to the next one.
Okay, that might sound a bit complicated, hopefully everything will be clearer
when we see the code of the implementation.
Before that however it’s important to note a peculiarity of these unaligned
word instructions: you’ll notice that in my asm snippet above I run the two
instructions back-to-back without delay. That’s because those instructions can
merge their data with that of a pending load without having to wait for the load
to finish.
For other load instructions it wouldn’t make a lot of sense (why would you
want to load twice to the same target register without doing anything with the
first value?) but since LWL and LWR are meant to be used together to load a
single value it makes sense to spare a cycle there [25].
    /// Load Word Left (little-endian only implementation)
    fn op_lwl(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // This instruction bypasses the load delay restriction: this
        // instruction will merge the new contents with the value
        // currently being loaded if need be.
        let cur_v = self.out_regs[t.0 as usize];

        // Next we load the *aligned* word containing the first
        // addressed byte
        let aligned_addr = addr & !3;
        let aligned_word = self.load32(aligned_addr);

        // Depending on the address alignment we fetch the 1, 2, 3 or
        // 4 *most* significant bytes and put them in the target
        // register.
        let v = match addr & 3 {
            0 => (cur_v & 0x00ffffff) | (aligned_word << 24),
            1 => (cur_v & 0x0000ffff) | (aligned_word << 16),
            2 => (cur_v & 0x000000ff) | (aligned_word << 8),
            3 => (cur_v & 0x00000000) | (aligned_word << 0),
            _ => unreachable!(),
        };

        // Put the load in the delay slot
        self.load = (t, v);
    }
}
[25] Interesting bit of trivia: apparently the LWL and LWR instructions were patented. The
patent expired in 2006 and some people claimed that it might also have covered software
implementations. If that’s true it means one could not have distributed our emulator without
a license from MIPS Computer Systems.
Hopefully the comments are clear enough to follow what the code is doing.
You can see that LWL updates one, two, three or all four bytes in the target
register depending on the address alignment.
Note the direct reference to self.out_regs instead of our usual helper to
make sure we ignore the load delay when the two instructions are used in
sequence.
        // ...
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);

        // This instruction bypasses the load delay restriction: this
        // instruction will merge the new contents with the value
        // currently being loaded if need be.
        let cur_v = self.out_regs[t.0 as usize];

        // Next we load the *aligned* word containing the first
        // addressed byte
        let aligned_addr = addr & !3;
        let aligned_word = self.load32(aligned_addr);

        // Depending on the address alignment we fetch the 1, 2, 3 or
        // 4 *least* significant bytes and put them in the target
        // register.
        let v = match addr & 3 {
            0 => (cur_v & 0x00000000) | (aligned_word >> 0),
            1 => (cur_v & 0xff000000) | (aligned_word >> 8),
            2 => (cur_v & 0xffff0000) | (aligned_word >> 16),
            3 => (cur_v & 0xffffff00) | (aligned_word >> 24),
            _ => unreachable!(),
        };

        // Put the load in the delay slot
        self.load = (t, v);
    }
}
You can see that like LWL we update from one to four bytes depending on
the alignment, however this time it’s the least significant bytes.
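To check that the two halves really fit together, here is a self-contained sketch that runs the LWL/LWR pair against a small byte buffer. Everything here is simplified for the example: lwl, lwr and load_unaligned are hypothetical free functions, the memory is a plain slice and there’s no load delay emulation, but the masks and shifts are the same as in the code above.

```rust
/// Little endian 32bit read at `addr`, which must be word aligned
fn load32(mem: &[u8], addr: u32) -> u32 {
    let a = addr as usize;
    u32::from_le_bytes([mem[a], mem[a + 1], mem[a + 2], mem[a + 3]])
}

/// Merge the most significant bytes of the result into `cur_v`
fn lwl(mem: &[u8], cur_v: u32, addr: u32) -> u32 {
    let aligned_word = load32(mem, addr & !3);

    match addr & 3 {
        0 => (cur_v & 0x00ff_ffff) | (aligned_word << 24),
        1 => (cur_v & 0x0000_ffff) | (aligned_word << 16),
        2 => (cur_v & 0x0000_00ff) | (aligned_word << 8),
        _ => aligned_word,
    }
}

/// Merge the least significant bytes of the result into `cur_v`
fn lwr(mem: &[u8], cur_v: u32, addr: u32) -> u32 {
    let aligned_word = load32(mem, addr & !3);

    match addr & 3 {
        0 => aligned_word,
        1 => (cur_v & 0xff00_0000) | (aligned_word >> 8),
        2 => (cur_v & 0xffff_0000) | (aligned_word >> 16),
        _ => (cur_v & 0xffff_ff00) | (aligned_word >> 24),
    }
}

/// The usual sequence: LWL at `addr + 3` followed by LWR at `addr`
fn load_unaligned(mem: &[u8], addr: u32) -> u32 {
    let v = lwl(mem, 0, addr + 3);
    lwr(mem, v, addr)
}
```

With a buffer containing the bytes 0x00, 0x11, 0x22 and so on up to 0x77, load_unaligned at address 1 returns 0x44332211, which is the little endian word made of bytes 1 to 4.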
2.100 Non-aligned writes
Naturally the MIPS instruction set doesn’t only support loading non-aligned
words, it can also store them using “store word left” (SWL) and “store word
right” (SWR).
The concept is the same: to store a 32bit integer at an unaligned address one
would call SWR and SWL in sequence to update the entire word.
    /// Store Word Left (little-endian only implementation)
    fn op_swl(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v = self.reg(t);

        let aligned_addr = addr & !3;
        // Load the current value for the aligned word at the target
        // address
        let cur_mem = self.load32(aligned_addr);

        // ...

        self.store32(aligned_addr, mem);
    }
}

    /// Store Word Right (little-endian only implementation)
    fn op_swr(&mut self, instruction: Instruction) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        let addr = self.reg(s).wrapping_add(i);
        let v = self.reg(t);

        let aligned_addr = addr & !3;
        // Load the current value for the aligned word at the target
        // address
        let cur_mem = self.load32(aligned_addr);

        // ...

        self.store32(aligned_addr, mem);
    }
}
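The listings above don’t show how the merged value mem is computed, so here is one way it can be done, as a hedged sketch rather than the definitive implementation. It’s the mirror image of the LWL/LWR logic for a little endian machine; swl, swr and store_unaligned are hypothetical free functions working on a plain byte slice.

```rust
/// Little endian 32bit read at `addr`, which must be word aligned
fn load32(mem: &[u8], addr: u32) -> u32 {
    let a = addr as usize;
    u32::from_le_bytes([mem[a], mem[a + 1], mem[a + 2], mem[a + 3]])
}

/// Little endian 32bit write at `addr`, which must be word aligned
fn store32(mem: &mut [u8], addr: u32, val: u32) {
    let a = addr as usize;
    mem[a..a + 4].copy_from_slice(&val.to_le_bytes());
}

/// Store the most significant bytes of `v` at the end of the word
fn swl(mem: &mut [u8], v: u32, addr: u32) {
    let aligned_addr = addr & !3;
    let cur_mem = load32(mem, aligned_addr);

    let merged = match addr & 3 {
        0 => (cur_mem & 0xffff_ff00) | (v >> 24),
        1 => (cur_mem & 0xffff_0000) | (v >> 16),
        2 => (cur_mem & 0xff00_0000) | (v >> 8),
        _ => v,
    };

    store32(mem, aligned_addr, merged);
}

/// Store the least significant bytes of `v` at the start of the word
fn swr(mem: &mut [u8], v: u32, addr: u32) {
    let aligned_addr = addr & !3;
    let cur_mem = load32(mem, aligned_addr);

    let merged = match addr & 3 {
        0 => v,
        1 => (cur_mem & 0x0000_00ff) | (v << 8),
        2 => (cur_mem & 0x0000_ffff) | (v << 16),
        _ => (cur_mem & 0x00ff_ffff) | (v << 24),
    };

    store32(mem, aligned_addr, merged);
}

/// The usual sequence: SWR at `addr` followed by SWL at `addr + 3`
fn store_unaligned(mem: &mut [u8], addr: u32, v: u32) {
    swr(mem, v, addr);
    swl(mem, v, addr + 3);
}
```

Storing 0xaabbccdd at address 1 in a buffer full of 0x11 bytes only touches bytes 1 to 4, which end up holding 0xdd, 0xcc, 0xbb and 0xaa.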
    /// Load Word in Coprocessor 0
    fn op_lwc0(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }

    /// Load Word in Coprocessor 1
    fn op_lwc1(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }

    /// Load Word in Coprocessor 2
    fn op_lwc2(&mut self, instruction: Instruction) {
        panic!("unhandled GTE LWC: {}", instruction);
    }

    /// Load Word in Coprocessor 3
    fn op_lwc3(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }
}

    /// Store Word in Coprocessor 0
    fn op_swc0(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }

    /// Store Word in Coprocessor 1
    fn op_swc1(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }

    /// Store Word in Coprocessor 2
    fn op_swc2(&mut self, instruction: Instruction) {
        panic!("unhandled GTE SWC: {}", instruction);
    }

    /// Store Word in Coprocessor 3
    fn op_swc3(&mut self, _: Instruction) {
        // Not supported by this coprocessor
        self.exception(Exception::CoprocessorError);
    }
}
        match instruction.function() {
            0b000000 => match instruction.subfunction() {
                0b000000 => self.op_sll(instruction),
                0b000010 => self.op_srl(instruction),
                0b000011 => self.op_sra(instruction),
                0b000100 => self.op_sllv(instruction),
                0b000110 => self.op_srlv(instruction),
                0b000111 => self.op_srav(instruction),
                0b001000 => self.op_jr(instruction),
                0b001001 => self.op_jalr(instruction),
                0b001100 => self.op_syscall(instruction),
                0b001101 => self.op_break(instruction),
                0b010000 => self.op_mfhi(instruction),
                0b010001 => self.op_mthi(instruction),
                0b010010 => self.op_mflo(instruction),
                0b010011 => self.op_mtlo(instruction),
                0b011000 => self.op_mult(instruction),
                0b011001 => self.op_multu(instruction),
                0b011010 => self.op_div(instruction),
                0b011011 => self.op_divu(instruction),
                0b100000 => self.op_add(instruction),
                0b100001 => self.op_addu(instruction),
                0b100010 => self.op_sub(instruction),
                0b100011 => self.op_subu(instruction),
                0b100100 => self.op_and(instruction),
                0b100101 => self.op_or(instruction),
                0b100110 => self.op_xor(instruction),
                0b100111 => self.op_nor(instruction),
                0b101010 => self.op_slt(instruction),
                0b101011 => self.op_sltu(instruction),
                _        => self.op_illegal(instruction),
            },
            0b000001 => self.op_bxx(instruction),
            0b000010 => self.op_j(instruction),
            0b000011 => self.op_jal(instruction),
            0b000100 => self.op_beq(instruction),
            0b000101 => self.op_bne(instruction),
            0b000110 => self.op_blez(instruction),
            0b000111 => self.op_bgtz(instruction),
            0b001000 => self.op_addi(instruction),
            0b001001 => self.op_addiu(instruction),
            0b001010 => self.op_slti(instruction),
            0b001011 => self.op_sltiu(instruction),
            0b001100 => self.op_andi(instruction),
            0b001101 => self.op_ori(instruction),
            0b001110 => self.op_xori(instruction),
            0b001111 => self.op_lui(instruction),
            0b010000 => self.op_cop0(instruction),
            0b010001 => self.op_cop1(instruction),
            0b010010 => self.op_cop2(instruction),
            0b010011 => self.op_cop3(instruction),
            0b100000 => self.op_lb(instruction),
            0b100001 => self.op_lh(instruction),
            0b100010 => self.op_lwl(instruction),
            0b100011 => self.op_lw(instruction),
            0b100100 => self.op_lbu(instruction),
            0b100101 => self.op_lhu(instruction),
            0b100110 => self.op_lwr(instruction),
            0b101000 => self.op_sb(instruction),
            0b101001 => self.op_sh(instruction),
            0b101010 => self.op_swl(instruction),
            0b101011 => self.op_sw(instruction),
            0b101110 => self.op_swr(instruction),
            0b110000 => self.op_lwc0(instruction),
            0b110001 => self.op_lwc1(instruction),
            0b110010 => self.op_lwc2(instruction),
            0b110011 => self.op_lwc3(instruction),
            0b111000 => self.op_swc0(instruction),
            0b111001 => self.op_swc1(instruction),
            0b111010 => self.op_swc2(instruction),
            0b111011 => self.op_swc3(instruction),
            _        => self.op_illegal(instruction),
        }
    }
    /// Illegal instruction
    fn op_illegal(&mut self, instruction: Instruction) {
        println!("Illegal instruction {}!", instruction);
        self.exception(Exception::IllegalInstruction);
    }
}

/// Exception types (as stored in the `CAUSE` register)
enum Exception {
    // ...
    /// CPU encountered an unknown instruction
    IllegalInstruction = 0xa,
}
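The big match above relies on a handful of field accessors. As a reminder of where each field lives in the 32bit instruction word, here is a simplified sketch. The method names follow the ones used throughout the code, but this version returns plain u32 values instead of a dedicated register index type, so take it as an assumption-laden illustration rather than the real Instruction type.

```rust
/// Simplified instruction wrapper (register fields returned as `u32`)
#[derive(Clone, Copy)]
struct Instruction(u32);

impl Instruction {
    /// Primary opcode: bits [31:26]
    fn function(self) -> u32 {
        self.0 >> 26
    }

    /// Secondary opcode: bits [5:0]
    fn subfunction(self) -> u32 {
        self.0 & 0x3f
    }

    /// Register index in bits [25:21]
    fn s(self) -> u32 {
        (self.0 >> 21) & 0x1f
    }

    /// Register index in bits [20:16]
    fn t(self) -> u32 {
        (self.0 >> 16) & 0x1f
    }

    /// Sign-extended 16bit immediate in bits [15:0]
    fn imm_se(self) -> u32 {
        (self.0 & 0xffff) as i16 as u32
    }
}
```

Taking the LH instruction 0x87a30018 seen earlier, function() gives 0b100001, s() gives 29, t() gives 3 and imm_se() gives 24, which matches the lh $3, 24($29) disassembly.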
That’s quite a milestone but it’s only the beginning. While implementing all
those instructions and stepping through the BIOS we’ve seen that it tries to use
many peripherals: the SPU, the timers, the DMA and the GPU in particular.
At this point my first objective is to display an image to the screen so I want
to start implementing the GPU as soon as possible. But we won’t be able to do
anything useful with the GPU without the DMA, so let’s start with that.
• Channel 4 is connected to the SPU
• Channel 5 is connected to the extension port
• Channel 6 is only connected to the RAM and is used to clear an “ordering
table”
Implementing complete and accurate DMA support can be quite tricky. The
main problem is that in certain modes the DMA sporadically gives back the
control to the CPU. For instance while the GPU is busy processing a command
and won’t accept any new input the DMA has to wait. Instead of wasting time
it gives back control to the CPU to give it the opportunity to do something else.
In order to emulate this behaviour correctly we need to emulate the GPU
command FIFO, DMA timings and CPU timings correctly. Then we need to
set up the state machine to switch between the CPU and DMA when needed.
That would require quite some work to get right and we only have the BIOS
boot logo to test it at this point.
To avoid having to implement all that we’re going to make a simplifying
assumption for now: when the DMA runs it does all the transfer at once without
giving back control to the CPU. This won’t be exactly accurate but it should
suffice to run the BIOS and hopefully some games.
The reason I feel confident doing this simplification is that PCSX-R seems to
do it that way and it can run quite a few games, although some comments hint
that it breaks with certain titles and it uses some hacks to improve compatibility.
Mednafen on the other hand implements a much more accurate DMA and actually
emulates the DMA giving back the control to the CPU in certain situations;
we’ll probably want to do something similar later on.
For now let’s take a few steps back and revisit all the DMA register reads
and writes done by the BIOS so that we can emulate them correctly.
impl Dma {
    pub fn new() -> Dma {
        Dma {
            // Reset value taken from the Nocash PSX spec
            control: 0x07654321,
        }
    }

    /// Retrieve the value of the control register
    pub fn control(&self) -> u32 {
        self.control
    }
}
We can then add an instance of this struct Dma in our interconnect and
glue our new control method when the register is accessed:
/// Global interconnect
pub struct Interconnect {
    // ...
    /// DMA registers
    dma: Dma,
}

impl Interconnect {
    pub fn new(bios: Bios) -> Interconnect {
        Interconnect {
            // ...

    /// DMA register read
    fn dma_reg(&self, offset: u32) -> u32 {
        match offset {
            0x70 => self.dma.control(),
            _ => panic!("unhandled DMA access")
        }
    }
}
The BIOS then writes back 0x076f4321 to the same register, which means
that it enables channel 4 (the SPU) and sets its priority to 7. Let’s implement
write support for the control register:
impl Interconnect {
    // ...
    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        // ...
            return self.set_dma_reg(offset, val);
        }
    }

    /// DMA register write
    fn set_dma_reg(&mut self, offset: u32, val: u32) {
        match offset {
            0x70 => self.dma.set_control(val),
            _ => panic!("unhandled DMA write access")
        }
    }
}

impl Dma {
    // ...
    /// Set the value of the control register
    pub fn set_control(&mut self, val: u32) {
        self.control = val
    }
}
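To make the two control values above concrete, here is a small stand-alone sketch that splits the register into one nibble per channel. The layout (three priority bits plus one enable bit per channel, channel 0 in the lowest nibble) is my reading of the Nocash spec, and `channel_config` is a hypothetical helper that is not part of the emulator code:

```rust
/// Hypothetical helper: returns `(priority, enabled)` for `channel` (0 to 6),
/// assuming each channel owns one nibble of the DMA control register:
/// bits [2:0] of the nibble hold the priority and bit 3 is the enable flag.
fn channel_config(control: u32, channel: u32) -> (u8, bool) {
    assert!(channel < 7, "the DMA only has 7 channels");

    // Isolate the 4 bits that belong to this channel
    let nibble = (control >> (channel * 4)) & 0xf;

    ((nibble & 7) as u8, nibble & 8 != 0)
}
```

With the reset value 0x07654321 channel 4 decodes to priority 5 with the enable bit clear; after the BIOS writes 0x076f4321 the same call gives priority 7 with the enable bit set.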
    /// master IRQ enable
    irq_en: bool,
    /// IRQ enable for individual channels
    channel_irq_en: u8,
    /// IRQ flags for individual channels
    channel_irq_flags: u8,
    /// When set the interrupt is active unconditionally (even if
    /// `irq_en` is false)
    force_irq: bool,
26 You’ll notice that I split the register into individual variables; I prefer to do that when I know
I’ll have to manipulate the fields individually. It makes the code clearer and less error prone in
my experience. It has a small cost however: it takes up a little more memory and we have to
pack/unpack them when handling register reads/writes.
    /// Bits [0:5] of the interrupt registers are RW but I don't know
    /// what they're supposed to do so I just store them and send them
    /// back untouched on reads
    irq_dummy: u8,
}
impl Dma {
    // ...
    /// Return the status of the DMA interrupt
    fn irq(&self) -> bool {
        let channel_irq = self.channel_irq_flags & self.channel_irq_en;
        // ...

    /// Retrieve the value of the interrupt register
    pub fn interrupt(&self) -> u32 {
        let mut r = 0;
        // ...
        r
    }

    /// Set the value of the interrupt register
    pub fn set_interrupt(&mut self, val: u32) {
        // Unknown what bits [5:0] do
        self.irq_dummy = (val & 0x3f) as u8;
        self.force_irq = (val >> 15) & 1 != 0;
        self.channel_irq_en = ((val >> 16) & 0x7f) as u8;
        self.irq_en = (val >> 23) & 1 != 0;
        // Writing 1 to a flag resets it (one flag per channel, so
        // 7 bits like `channel_irq_en`)
        let ack = ((val >> 24) & 0x7f) as u8;
        self.channel_irq_flags &= !ack;
    }
}
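The read side is the mirror image of `set_interrupt`. The sketch below is a stand-alone illustration rather than the emulator's actual code: `IrqState` is a hypothetical copy of the relevant fields, and the meaning of bit 31 (a master flag that is set when the interrupt is forced, or when the master enable is set and at least one enabled channel has its flag raised) is an assumption taken from my reading of the Nocash spec.

```rust
/// Stand-alone copy of the interrupt-related fields, for illustration only.
struct IrqState {
    irq_en: bool,
    channel_irq_en: u8,
    channel_irq_flags: u8,
    force_irq: bool,
    irq_dummy: u8,
}

impl IrqState {
    /// Assumed master flag: forced, or enabled with at least one
    /// flagged channel whose interrupt is enabled.
    fn irq(&self) -> bool {
        let channel_irq = self.channel_irq_flags & self.channel_irq_en;

        self.force_irq || (self.irq_en && channel_irq != 0)
    }

    /// Pack the fields using the same bit positions as the unpacking code.
    fn interrupt(&self) -> u32 {
        let mut r = 0;

        r |= self.irq_dummy as u32;
        r |= (self.force_irq as u32) << 15;
        r |= (self.channel_irq_en as u32) << 16;
        r |= (self.irq_en as u32) << 23;
        r |= (self.channel_irq_flags as u32) << 24;
        // Assumption: bit 31 mirrors the master interrupt status
        r |= (self.irq() as u32) << 31;

        r
    }
}
```

Keeping the packing next to the unpacking makes it easy to check that every field comes back at the bit position it was written to.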
Field bits   Description
0            Transfer direction: device-to-RAM (0) or RAM-to-device (1)
1            Address increment (0) or decrement (1) mode
2            Chopping mode
[10:9]       Synchronization type: Manual (0), Request (1) or Linked List (2)
[18:16]      Chopping DMA window
[22:20]      Chopping CPU window
24           Enable
28           Manual trigger
[30:29]      Unknown
    /// Unknown 2 RW bits in configuration register
    dummy: u8,
}

impl Channel {
    fn new() -> Channel {
        Channel {
            enable: false,
            direction: Direction::ToRam,
            step: Step::Increment,
            sync: Sync::Manual,
            trigger: false,
            chop: false,
            chop_dma_sz: 0,
            chop_cpu_sz: 0,
            dummy: 0,
        }
    }

        // ...
        r
    }

        // ...
        self.enable = (val >> 24) & 1 != 0;
        self.trigger = (val >> 28) & 1 != 0;
        self.dummy = ((val >> 29) & 3) as u8;
    }
}
/// DMA transfer direction
pub enum Direction {
    ToRam = 0,
    FromRam = 1,
}

/// DMA transfer step
pub enum Step {
    Increment = 0,
    Decrement = 1,
}

/// DMA transfer synchronization mode
pub enum Sync {
    /// Transfer starts when the CPU writes to the Trigger bit and
    /// transfers everything at once
    Manual = 0,
    /// Sync blocks to DMA requests
    Request = 1,
    /// Used to transfer GPU command lists
    LinkedList = 2,
}
We can then put an array of 7 Channel instances in our struct Dma with
some methods to access them in the interconnect:
/// Direct Memory Access
pub struct Dma {
    // ...
    /// The 7 channel instances
    channels: [Channel; 7],
}

impl Dma {
    // ...
    /// Return a reference to a channel by port number.
    pub fn channel(&self, port: Port) -> &Channel {
        &self.channels[port as usize]
    }

    /// Return a mutable reference to a channel by port number.
    pub fn channel_mut(&mut self, port: Port) -> &mut Channel {
        &mut self.channels[port as usize]
    }
}

/// The 7 DMA ports
pub enum Port {
    /// Macroblock decoder input
    MdecIn = 0,
    /// Macroblock decoder output
    MdecOut = 1,
    /// Graphics Processing Unit
    Gpu = 2,
    /// CD-ROM drive
    CdRom = 3,
    /// Sound Processing Unit
    Spu = 4,
    /// Extension port
    Pio = 5,
    /// Used to clear the ordering table
    Otc = 6,
}

impl Port {
    pub fn from_index(index: u32) -> Port {
        match index {
            0 => Port::MdecIn,
            1 => Port::MdecOut,
            2 => Port::Gpu,
            3 => Port::CdRom,
            4 => Port::Spu,
            5 => Port::Pio,
            6 => Port::Otc,
            n => panic!("Invalid port {}", n),
        }
    }
}
That’s quite a lot of code to parse one register but it should make our life
easier later on.
Since the 7 channels have the same register layout we can rewrite our
Interconnect methods to be a little more generic:
impl Interconnect {
    // ...
    /// DMA register read
    fn dma_reg(&self, offset: u32) -> u32 {
        let major = (offset & 0x70) >> 4;
        let minor = offset & 0xf;

        match major {
            // Per-channel registers
            0..=6 => {
                let channel = self.dma.channel(Port::from_index(major));

                match minor {
                    8 => channel.control(),
                    _ => panic!("Unhandled DMA read at {:x}", offset)
                }
            },
            // Common DMA registers
            7 => match minor {
                0 => self.dma.control(),
                4 => self.dma.interrupt(),
                _ => panic!("Unhandled DMA read at {:x}", offset)
            },
            _ => panic!("Unhandled DMA read at {:x}", offset)
        }
    }

    /// DMA register write
    fn set_dma_reg(&mut self, offset: u32, val: u32) {
        let major = (offset & 0x70) >> 4;
        let minor = offset & 0xf;

        match major {
            // Per-channel registers
            0..=6 => {
                let port = Port::from_index(major);
                let channel = self.dma.channel_mut(port);

                match minor {
                    8 => channel.set_control(val),
                    _ => panic!("Unhandled DMA write {:x}: {:08x}",
                                offset, val)
                }
            },
            // Common DMA registers
            7 => {
                match minor {
                    0 => self.dma.set_control(val),
                    4 => self.dma.set_interrupt(val),
                    _ => panic!("Unhandled DMA write {:x}: {:08x}",
                                offset, val),
                }
            }
            _ => panic!("Unhandled DMA write {:x}: {:08x}",
                        offset, val),
        };
    }
}
    /// DMA start address
    base: u32,
}

impl Channel {
    // ...
            base: 0,
        }
    }

    /// Retrieve the channel's base address
    pub fn base(&self) -> u32 {
        self.base
    }

    /// Set channel base address. Only bits [0:23] are significant so
    /// only 16MB are addressable by the DMA
    pub fn set_base(&mut self, val: u32) {
        self.base = val & 0xffffff;
    }
}
• In Manual sync mode only the low 16 bits are used and they contain the
number of words to transfer.
• In Request sync mode the low 16 bits contain the block size in words while
the upper 16 bits contain the number of blocks to transfer. The DMA will
transfer a block at a time and wait for the device to assert the “request”
flag before starting a new block.
• In Linked List mode this register is not used.
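The three cases above boil down to a small function. This is a minimal stand-alone sketch: `SyncMode` and `transfer_words` are hypothetical names used only for this illustration, and the raw register value is passed in directly instead of being split into fields first.

```rust
/// Hypothetical stand-in for the three synchronization modes.
#[derive(Clone, Copy)]
enum SyncMode {
    Manual,
    Request,
    LinkedList,
}

/// Total number of words to transfer for a raw Block Control value, or
/// `None` when the size isn't known ahead of time.
fn transfer_words(mode: SyncMode, block_control: u32) -> Option<u32> {
    let block_size = block_control & 0xffff;
    let block_count = block_control >> 16;

    match mode {
        // Only the low 16 bits matter: they hold the word count
        SyncMode::Manual => Some(block_size),
        // `block_count` blocks of `block_size` words each; both are
        // 16 bit values so the product always fits in a u32
        SyncMode::Request => Some(block_size * block_count),
        // The register is not used in linked list mode
        SyncMode::LinkedList => None,
    }
}
```

For example a Manual transfer with a register value of 0x400 moves 1024 words, while a Request transfer with 0x00100020 moves 16 blocks of 32 words.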
    /// Size of a block in words
    block_size: u16,
    /// Block count, Only used when `sync` is `Request`
    block_count: u16,
}

impl Channel {
    // ...
    /// Retrieve value of the Block Control register
    pub fn block_control(&self) -> u32 {
        let bs = self.block_size as u32;
        let bc = self.block_count as u32;

        (bc << 16) | bs
    }

    /// Set value of the Block Control register
    pub fn set_block_control(&mut self, val: u32) {
        self.block_size = val as u16;
        self.block_count = (val >> 16) as u16;
    }
}
We can see that the BIOS initialized a base address and block size for channel
6, so it’s no surprise that it then writes 0x11000002 to the channel control register.
The configuration is Manual sync mode, towards the RAM, with decreasing
addresses and it sets the enable and trigger bits to start the transfer.
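To double-check that reading of 0x11000002, here is a stand-alone sketch that pulls the fields out of a channel control word using the same encodings as the `Direction`, `Step` and `Sync` enums. `ChannelControl` and `decode_channel_control` are hypothetical names for this illustration; the emulator itself stores the fields in `Channel`.

```rust
/// Decoded view of a channel control word, for illustration only.
#[derive(Debug, PartialEq)]
struct ChannelControl {
    /// Bit 0: false = towards the RAM, true = from the RAM
    from_ram: bool,
    /// Bit 1: false = increment, true = decrement
    decrement: bool,
    /// Bit 2: chopping mode
    chopping: bool,
    /// Bits [10:9]: 0 = Manual, 1 = Request, 2 = Linked List
    sync: u8,
    /// Bit 24
    enable: bool,
    /// Bit 28
    trigger: bool,
}

fn decode_channel_control(val: u32) -> ChannelControl {
    ChannelControl {
        from_ram: val & 1 != 0,
        decrement: (val >> 1) & 1 != 0,
        chopping: (val >> 2) & 1 != 0,
        sync: ((val >> 9) & 3) as u8,
        enable: (val >> 24) & 1 != 0,
        trigger: (val >> 28) & 1 != 0,
    }
}
```

Feeding it 0x11000002 gives a transfer towards the RAM with decrementing addresses, Manual sync, and both the enable and trigger bits set.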
We can now implement the DMA copy itself but before we do so we must
understand what this channel does exactly.
linked list. As you know inserting an entry between two elements in a linked list
is very cheap: you just rewrite the element’s list pointers and you’re done.
So here’s how a depth ordering table is implemented: each command is
stored in a “packet”, somewhere in RAM. A packet starts with a 32bit “header”
word. The low 24bits of that word are the address of the next packet in RAM
or 0xffffff if it’s the last item and the high 8bits are the number of words in
the packet.
You start with an empty table: you create an array of empty packets in RAM
(only 32bit headers with the high 8bits set to 0 to indicate they’re empty) and
you make each entry point to the address of the previous one and the last one
set to 0xffffff. So you have a linked list of empty elements stored in an array
in reverse order. Sounds silly but it’s actually very handy.
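As a rough illustration of that layout, the sketch below builds the header words of an empty table. It is a stand-alone model, not emulator code: `empty_ordering_table` is a hypothetical helper, `base` is assumed to be the word-aligned byte address of the first array entry, and the result is returned in array order (lowest address first).

```rust
/// End-of-list marker stored in the low 24 bits of a header.
const END_OF_LIST: u32 = 0x00ff_ffff;

/// Build the headers of an empty ordering table of `len` entries whose
/// first entry lives at byte address `base`.
fn empty_ordering_table(base: u32, len: u32) -> Vec<u32> {
    (0..len)
        .map(|i| match i {
            // The first array entry is the last element of the list
            0 => END_OF_LIST,
            // Every other entry points to the previous array entry. The
            // high byte stays 0 because the packet holds no command words
            _ => (base + (i - 1) * 4) & END_OF_LIST,
        })
        .collect()
}
```

A four-entry table at address 0x1000 comes out as 0xffffff, 0x1000, 0x1004, 0x1008: walking the list from the last array entry visits every slot and ends on the marker.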
Now when the CPU wants to render a primitive it computes its distance to
the camera, normalizes it over the size of the ordering table and uses it as an
index. It can then take the value of the header at that location in the table and insert
the draw command in the list at that point. This way it doesn’t have to iterate
through the entire list to figure out where the primitive goes, the ordering table
effectively works like a lookup table.
No matter the size of the scene, no matter how many elements have already
been inserted in the list you can always insert a new draw command by creating
a packet in RAM, figuring out the depth index and updating two headers to insert
yourself in the right order. The computing cost is constant.
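The “two headers” update can be sketched as follows. This is a simplified model under stated assumptions: RAM is a plain word slice indexed by `address / 4`, and `insert_packet` is a hypothetical helper, not something the BIOS or the emulator exposes.

```rust
/// Insert the packet whose header lives at `packet_addr` right after the
/// ordering table entry at `slot_addr`. `len` is the number of command
/// words that follow the packet header.
fn insert_packet(ram: &mut [u32], slot_addr: u32, packet_addr: u32, len: u8) {
    let slot = (slot_addr / 4) as usize;
    let packet = (packet_addr / 4) as usize;

    let next = ram[slot] & 0x00ff_ffff;

    // Header 1: the new packet takes over the slot's old "next" pointer
    ram[packet] = ((len as u32) << 24) | next;
    // Header 2: the slot now points at the new packet and keeps its own
    // word count in the high byte
    ram[slot] = (ram[slot] & 0xff00_0000) | (packet_addr & 0x00ff_ffff);
}
```

Inserting a second packet in the same slot puts it in front of the first one, which is exactly the collision behaviour described next.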
Of course, there can be collisions. Since there are only a finite number of
positions in the depth ordering table two or more packets can end up sharing
the same slot. When that happens the newer element will point to the previous
one and will therefore be drawn first (regardless of whether it’s actually in front
or behind). The smaller the table the coarser the granularity. That explains
some of the visual glitches you can see in a lot of 3D games on the console, it’s
just a limitation of the hardware.
Once the game has finished projecting and sorting the scene’s draw commands
it can send them to the GPU by starting from the last entry in the depth ordering
table and then iterating through the linked list until it reaches the 0xffffff
end-of-list marker.
    /// DMA register write
    fn set_dma_reg(&mut self, offset: u32, val: u32) {
        let major = (offset & 0x70) >> 4;
        let minor = offset & 0xf;

        let active_port =
            match major {
                // Per-channel registers
                0..=6 => {
                    let port = Port::from_index(major);
                    let channel = self.dma.channel_mut(port);

                    match minor {
                        0 => channel.set_base(val),
                        4 => channel.set_block_control(val),
                        8 => channel.set_control(val),
                        _ =>
                            panic!("Unhandled DMA write {:x}: {:08x}",
                                   offset, val)
                    }

                    if channel.active() {
                        Some(port)
                    } else {
                        None
                    }
                },
                // Common DMA registers
                7 => {
                    // ...
                    None
                }
                _ => panic!("Unhandled DMA write {:x}: {:08x}",
                            offset, val),
            };

        if let Some(port) = active_port {
            self.do_dma(port);
        }
    }
}

impl Channel {
    // ...
        self.enable && trigger
    }
}
Now the Interconnect’s do_dma method will be called when a transfer must
take place.
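The check that decides whether a channel has been started could look like the sketch below. It is a stand-alone guess at the logic, not a copy of the emulator's method: `ChannelState` and `SyncMode` are hypothetical types, and the rule that only Manual sync needs the trigger bit (the other modes start as soon as the channel is enabled) is an assumption on my part.

```rust
#[derive(Clone, Copy, PartialEq)]
enum SyncMode {
    Manual,
    Request,
    LinkedList,
}

/// Minimal stand-in for the channel fields involved in the check.
struct ChannelState {
    enable: bool,
    trigger: bool,
    sync: SyncMode,
}

impl ChannelState {
    /// Return true if the channel has been started
    fn active(&self) -> bool {
        // Assumption: only Manual sync waits for the CPU to set the
        // trigger bit
        let trigger = match self.sync {
            SyncMode::Manual => self.trigger,
            _ => true,
        };

        self.enable && trigger
    }
}
```

With this rule a write that sets only the enable bit starts a Request or Linked List transfer immediately, while a Manual transfer also needs bit 28.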
The Manual and Request modes both copy blocks of data from/to the RAM.
Linked List mode is a bit different since it hops around the RAM following the
pointers in the headers. For this reason making a generic function to handle all
three modes will be a bit tricky, I prefer to handle linked list separately:
impl Interconnect {
    // ...
    /// Execute DMA transfer for a port
    fn do_dma(&mut self, port: Port) {
        // DMA transfer has been started, for now let's
        // process everything in one pass (i.e. no
        // chopping or priority handling)
        match self.dma.channel(port).sync() {
            Sync::LinkedList => panic!("Linked list mode unsupported"),
            _ => self.do_dma_block(port),
        }
    }
}
    fn do_dma_block(&mut self, port: Port) {
        let channel = self.dma.channel_mut(port);

        let increment = match channel.step() {
            Step::Increment => 4,
            Step::Decrement => (-4i32) as u32,
        };

        let mut addr = channel.base();

        // Transfer size in words
        let mut remsz = match channel.transfer_size() {
            Some(n) => n,
            // Shouldn't happen since we shouldn't be reaching this
            // code in linked list mode
            None =>
                panic!("Couldn't figure out DMA block transfer size"),
        };

        while remsz > 0 {
            // Not sure what happens if address is
            // bogus... Mednafen just masks addr this way, maybe
            // that's how the hardware behaves (i.e. the RAM
            // address wraps and the two LSB are ignored, seems
            // reasonable enough)
            let cur_addr = addr & 0x1ffffc;

            match channel.direction() {
                Direction::FromRam => panic!("Unhandled DMA direction"),
                Direction::ToRam => {
                    let src_word = match port {
                        // Clear ordering table
                        Port::Otc => match remsz {
                            // Last entry contains the end
                            // of table marker
                            1 => 0xffffff,
                            // Pointer to the previous entry
                            _ => addr.wrapping_sub(4) & 0x1fffff,
                        },
                        _ => panic!("Unhandled DMA source port {}",
                                    port as u8),
                    };

                    self.ram.store32(cur_addr, src_word);
                }
            }

            // Move on to the next word
            addr = addr.wrapping_add(increment);
            remsz -= 1;
        }

        channel.done();
    }
}
impl Channel {
    // ...
        match self.sync {
            // For manual mode only the block size is used
            Sync::Manual => Some(bs),
            // In DMA request mode we must transfer `bc` blocks
            Sync::Request => Some(bc * bs),
            // In linked list mode the size is not known ahead of
            // time: we stop when we encounter the "end of list"
            // marker (0xffffff)
            Sync::LinkedList => None,
        }
    }

    /// Set the channel status to "completed" state
    pub fn done(&mut self) {
        self.enable = false;
        self.trigger = false;
        // XXX Need to set the correct value for the other fields
        // (in particular interrupts)
    }
}
Note the conditional that writes 0xffffff in the last iteration. It matters
because otherwise the DMA won’t find the end-of-table marker and will
start jumping randomly in RAM, sending crap to the GPU in the process. Even
the “Sony Logo” won’t render correctly without it: if you’re not receiving
GP0(38h), this is probably why.
When the copy is done I call the channel.done() method which clears the
trigger and enable flags. It should probably do more than that eventually, in
particular it should trigger the interrupt if it’s enabled. We’ll leave that for later.
We can now finally run our first DMA transfer in full! The BIOS sets the
base address to 0x000eb8d4 and the block size to 1024 before starting channel 6
and we then initialize an empty ordering table.
After that the BIOS enters an infinite loop on the GPUSTAT register. This
time it’s waiting for bit 26 which is “ready to receive command word”. We are
going to set this bit by default and while we’re at it we’re also going to add bit
27 which is “ready to send VRAM to CPU”. This way we should avoid locking
the BIOS on this register in the future:
impl Interconnect {
    // ...
With this modification the BIOS goes a little further and configures DMA
channel 2 to send a Linked List to the GPU.
3.9 DMA Linked Lists
Navigating the linked list is pretty straightforward: the BIOS puts the address
of the first list header in the DMA channel’s base address. We read the high byte
of the header to know the size of the packet (in words, not counting the header).
Packets are contiguous in RAM so the data follows the header word directly.
Once the packet data has been sent to the device we look at the low 24 bits of
the header. If it’s 0xffffff then we’re done, otherwise it contains the address
of the next header and we loop.
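The traversal described above can be modelled in a few lines. This is a stand-alone sketch, not the emulator's method: RAM is a plain word slice indexed by `address / 4`, `walk_linked_list` is a hypothetical name, and instead of sending words to a device it collects them in a vector. It trusts the list to be well formed, so a list without an end marker would loop forever.

```rust
/// Walk a DMA linked list starting at byte address `base` and return the
/// packet words in the order they would be sent to the device.
fn walk_linked_list(ram: &[u32], base: u32) -> Vec<u32> {
    let mut sent = Vec::new();
    let mut addr = base & 0x00ff_fffc;

    loop {
        let header = ram[(addr / 4) as usize];

        // High byte: number of words in the packet, header excluded
        let len = header >> 24;

        // The packet data directly follows the header
        for i in 1..=len {
            sent.push(ram[(addr / 4 + i) as usize]);
        }

        // Low 24 bits: address of the next header or end marker
        let next = header & 0x00ff_ffff;

        if next == 0x00ff_ffff {
            break;
        }

        addr = next & 0x00ff_fffc;
    }

    sent
}
```

Empty ordering table entries (word count 0) are simply skipped over, which is why the same loop works for a table that is only partially filled.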
I’m not sure whether linked list mode is supported only by channel 2 (the
GPU) or if it’s available for other ports. As far as I can tell it’s only ever used
to send commands to the GPU however; I’ll have to remember to test that.
By the way, interesting bit of information for us emulator writers: it seems
that while the DMA offers a great deal of flexibility with a lot of options and flags,
only a handful of configs are ever used for each channel. PCSX-R hardcodes
those configs and simply ignores more exotic flag combinations (even though
they’re technically possible) and mednafen, while supporting most options, has
an optimized fast path for the common configs. The Nocash docs also list
those common configs (and the few odd variations in some games). It means
that we can probably go a long way even if we don’t support some obscure
configurations.
Here’s what my simple linked list synchronization mode implementation looks
like:
impl Interconnect {
    // ...
    /// Execute DMA transfer for a port
    fn do_dma(&mut self, port: Port) {
        // DMA transfer has been started, for now let's
        // process everything in one pass (i.e. no
        // chopping or priority handling)
        match self.dma.channel(port).sync() {
            Sync::LinkedList => self.do_dma_linked_list(port),
            _ => self.do_dma_block(port),
        }
    }

        // ...
        if channel.direction() == Direction::ToRam {
            panic!("Invalid DMA direction for linked list mode");
        }

        loop {
            // In linked list mode, each entry starts with a
            // "header" word. The high byte contains the number
            // of words in the "packet" (not counting the header
            // word)
            let header = self.ram.load32(addr);
            // ...
            while remsz > 0 {
                addr = (addr + 4) & 0x1ffffc;
                // ...
                remsz -= 1;
            }
            // ...
            addr = header & 0x1ffffc;
        }

        channel.done();
    }
}
Since we haven’t implemented the GPU yet I just display the command word
without further processing. We’ll have to hook our GPU rendering code here
when it’s done. Let’s get a bit further in our DMA implementation before we
start working on the GPU; we don’t have anything interesting to display yet.
        while remsz > 0 {
            // ...
            match channel.direction() {
                Direction::FromRam => {
                    let src_word = self.ram.load32(cur_addr);

                    match port {
                        Port::Gpu => println!("GPU data {:08x}", src_word),
                        _ => panic!("Unhandled DMA destination port {}",
                                    port as u8),
                    }
                }
                // ...
        }

        channel.done();
    }
}
We still can’t do much more than printing the raw GPU data but at least the
DMA part seems to work as intended. If we try to interpret the GPU commands
sent through the linked list we can guess what it’s doing30 :
• First it displays a black quadrilateral that takes the whole screen (command
0x28000000). It does this several times.
• Then it appears to load a texture (maybe the background with the text?)
• Then it draws the same quadrilateral again but with a dark-grey color
(command 0x28030303 where 0x030303 is a 24bit BGR colour)
• Then it draws it again repeatedly, slowly changing the colour to a lighter
grey; it looks like the “fade-in” effect at the very beginning of the boot
animation (commands 0x28060606, 0x28090909 etc. up to 0x28b4b4b4).
• Then it adds three more draw commands: 0x380000b2 which draws a
shaded quadrilateral and two 0x300000b2 commands which draw shaded
triangles.
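The command words listed above all follow the same pattern: opcode in the high byte, colour in the low 24 bits. Here is a small stand-alone sketch that splits them apart. `split_gp0_colour` is a hypothetical helper, and it assumes that “24bit BGR” means blue sits in the most significant colour byte and red in the least significant one.

```rust
/// Split a GP0 command word into its opcode (high byte) and the colour
/// stored in the low 24 bits, returned as `(red, green, blue)`.
fn split_gp0_colour(word: u32) -> (u8, (u8, u8, u8)) {
    let opcode = (word >> 24) as u8;

    // Assumed BGR layout: 0xBBGGRR
    let r = word as u8;
    let g = (word >> 8) as u8;
    let b = (word >> 16) as u8;

    (opcode, (r, g, b))
}
```

Under that assumption 0x28030303 is opcode 0x28 with a very dark grey, and 0x380000b2 is opcode 0x38 whose first colour has a red component of 0xb2 and no green or blue.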
• We want to create a reasonably accurate internal representation of the PSX
GPU. Mainly we want to update the register values to reflect the current
GPU state instead of our current hardcoded values. This will lay out a
basic GPU state machine that we’ll improve later when we implement
video timings, interrupts and other delicacies.
• We’ll also implement a very simple and inaccurate OpenGL renderer.
That’ll give us the opportunity to implement some of the very boring low
level OpenGL boilerplate and we’ll have some visual feedback for debugging
the rest of the emulator.
In order to do this we’ll start back from the beginning, review all the GPU
register accesses (both from the CPU and DMA) and attempt to implement
them as best as we can.
    /// output must use external assets (pre-rendered textures, MDEC,
    /// etc...)
    display_depth: DisplayDepth,
    /// Output interlaced video signal instead of progressive
    interlaced: bool,
    /// Disable the display
    display_disabled: bool,
    /// True when the interrupt is active
    interrupt: bool,
    /// DMA request direction
    dma_direction: DmaDirection,
}
/// Depth of the pixel values in a texture page
#[derive(Clone, Copy)]
enum TextureDepth {
    /// 4 bits per pixel
    T4Bit = 0,
    /// 8 bits per pixel
    T8Bit = 1,
    /// 15 bits per pixel
    T15Bit = 2,
}

/// Video output horizontal resolution
#[derive(Clone, Copy)]
struct HorizontalRes(u8);

impl HorizontalRes {
    /// Create a new HorizontalRes instance from the 2 bit field `hr1`
    /// and the one bit field `hr2`
    fn from_fields(hr1: u8, hr2: u8) -> HorizontalRes {
        let hr = (hr2 & 1) | ((hr1 & 3) << 1);

        HorizontalRes(hr)
    }

    /// Retrieve value of bits [18:16] of the status register
    fn into_status(self) -> u32 {
        let HorizontalRes(hr) = self;

        (hr as u32) << 16
    }
}

/// Video output vertical resolution
#[derive(Clone, Copy)]
enum VerticalRes {
    /// 240 lines
    Y240Lines = 0,
    /// 480 lines (only available for interlaced output)
    Y480Lines = 1,
}

/// Video Modes
#[derive(Clone, Copy)]
enum VMode {
    /// NTSC: 480i60Hz
    Ntsc = 0,
    /// PAL: 576i50Hz
    Pal = 1,
}

/// Display area color depth
#[derive(Clone, Copy)]
enum DisplayDepth {
    /// 15 bits per pixel
    D15Bits = 0,
    /// 24 bits per pixel
    D24Bits = 1,
}

/// Requested DMA direction.
#[derive(Clone, Copy)]
enum DmaDirection {
    Off = 0,
    Fifo = 1,
    CpuToGp0 = 2,
    VRamToCpu = 3,
}
            dma_direction: DmaDirection::Off,
        }
    }
}
For the time being we can implement the GPUSTAT register read. It’s a
read-only register since writes to the GPUSTAT register address end up in the
GP1 register. We’ll see how the GPU config is modified in a minute.
impl Gpu {
    // ...
    /// Retrieve value of the status register
    pub fn status(&self) -> u32 {
        let mut r = 0u32;

        r |= (self.page_base_x as u32) << 0;
        r |= (self.page_base_y as u32) << 4;
        r |= (self.semi_transparency as u32) << 5;
        r |= (self.texture_depth as u32) << 7;
        r |= (self.dithering as u32) << 9;
        r |= (self.draw_to_display as u32) << 10;
        r |= (self.force_set_mask_bit as u32) << 11;
        r |= (self.preserve_masked_pixels as u32) << 12;
        r |= (self.field as u32) << 13;
        // Bit 14: not supported
        r |= (self.texture_disable as u32) << 15;
        r |= self.hres.into_status();
        r |= (self.vres as u32) << 19;
        r |= (self.vmode as u32) << 20;
        r |= (self.display_depth as u32) << 21;
        r |= (self.interlaced as u32) << 22;
        r |= (self.display_disabled as u32) << 23;
        r |= (self.interrupt as u32) << 24;

        // For now we pretend that the GPU is always ready:
        // Ready to receive command
        r |= 1 << 26;
        // Ready to send VRAM to CPU
        r |= 1 << 27;
        // Ready to receive DMA block
        r |= 1 << 28;

        r |= (self.dma_direction as u32) << 29;

        // Bit 31 should change depending on the currently drawn
        // line (whether it's even, odd or in the vblank
        // apparently). Let's not bother with it for now.
        r |= 0 << 31;

        // Not sure about that, I'm guessing that it's the signal
        // checked by the DMA when sending data in Request
        // synchronization mode. For now I blindly follow the
        // Nocash spec.
        let dma_request =
            match self.dma_direction {
                // Always 0
                DmaDirection::Off => 0,
                // Should be 0 if FIFO is full, 1 otherwise
                DmaDirection::Fifo => 1,
                // Should be the same as status bit 28
                DmaDirection::CpuToGp0 => (r >> 28) & 1,
                // Should be the same as status bit 27
                DmaDirection::VRamToCpu => (r >> 27) & 1,
            };

        r |= dma_request << 25;

        r
    }
}
You can see that I don’t support bit 14: the Nocash spec says that setting this
bit on the real hardware just messes up the display in a weird way. We
can probably assume that it’s not a commonly used feature for the moment.
As before I hardcode the “ready” bits to 1 since we have a long way to go
before we have the necessary infrastructure to emulate them accurately. We’ll
need to emulate the various internal FIFOs and the rate at which they empty
for instance. That will come later.
In general I’m not entirely sure how the DMA state machine synchronizes
with the GPU. We’ll have to hope it’s not too critical for now. As we progress if
we start to notice that our emulator seems to misbehave because of a broken
GPU DMA we’ll have to investigate further.
impl Gpu {
    // ...
        match opcode {
            0xe1 => self.gp0_draw_mode(val),
            _ => panic!("Unhandled GP0 command {:08x}", val),
        }
    }

        // ...
        self.texture_depth =
            match (val >> 7) & 3 {
                0 => TextureDepth::T4Bit,
                1 => TextureDepth::T8Bit,
                2 => TextureDepth::T15Bit,
                n => panic!("Unhandled texture depth {}", n),
            };
We can now call our new gp0 method from the interconnect:
impl Interconnect {
    // ...
    /// Store 32bit word `val` into `addr`
    pub fn store32(&mut self, addr: u32, val: u32) {
        // ...
        // ...
    }
}
31 I’ve tried quickly disassembling the surrounding code but I couldn’t really figure out what
it’s trying to do. I’ll have to take the time to dig deeper at some point. . .
    /// Handle writes to the GP0 command register
    pub fn gp0(&mut self, val: u32) {
        let opcode = (val >> 24) & 0xff;

        match opcode {
            0x00 => (), // NOP
            0xe1 => self.gp0_draw_mode(val),
            _ => panic!("Unhandled GP0 opcode {:08x}", val),
        }
    }
}
I tried to get the reset value from my console; unfortunately some of the
values like display_horiz_* and display_line_* cannot be read directly from
any register as far as I can tell so I’m going to use the values given by the Nocash
spec instead.
impl Gpu {
    // ...
        match opcode {
            0x00 => self.gp1_reset(val),
            _ => panic!("Unhandled GP1 command {:08x}", val),
        }
    }

    /// GP1(0x00): soft reset
    fn gp1_reset(&mut self, _: u32) {
        self.interrupt = false;
        // ...
        self.dma_direction = DmaDirection::Off;
        // XXX should also invalidate GPU cache if we ever
        // implement it
    }
}
The reset command is supposed to flush the command FIFO and the texture
cache but we don’t emulate those yet so I just added a note to remember to
modify the function when we add support for one of those.
The texture_window_* parameters are used to crop a texture. The drawing_area_*
parameters are used to describe a drawing window; the GPU won’t draw anything
outside of this area.
The drawing_offset_* parameters are a constant offset that’s added to all
the vertices. It lets you translate a scene in VRAM without having to recompute
all the coordinates on the CPU.
The display_vram_*, display_horiz_* and display_line_* parameters are
used to describe which portion of the VRAM is drawn on the screen. If you’re
not familiar with the wonderful world of analog video it might not be immediately
obvious what those parameters do so let me give a quick overview of the GPU’s
video output.
4.6 GPUREAD register placeholder
After those commands the BIOS reads from the register at offset 0 in the GPU
(the same address where GP0 commands are written). This register is GPUREAD
and is used to retrieve data generated by certain commands, typically to read
parts of the framebuffer back in RAM. The problem is that so far no such
command has been issued so I’m not sure why the BIOS attempts to read from
there. For now let’s return 0 and we’ll implement it properly later:
impl Gpu {
    // ...
    /// Retrieve value of the "read" register
    pub fn read(&self) -> u32 {
        // Not implemented for now...
        0
    }
}
        self.hres = HorizontalRes::from_fields(hr1, hr2);
        self.vres = match val & 0x4 != 0 {
            false => VerticalRes::Y240Lines,
            true => VerticalRes::Y480Lines,
        };
        self.interlaced = val & 0x20 != 0;
        if val & 0x80 != 0 {
            panic!("Unsupported display mode {:08x}", val);
        }
    }
}
4.8 GP1 DMA direction command
After that the BIOS issues the GP1 command 0x04000000. Opcode 0x04 simply
sets the DMA direction (to Off in this case):
impl Gpu {
    // ...
        loop {
            // ...
            while remsz > 0 {
                addr = (addr + 4) & 0x1ffffc;
                remsz -= 1;
            }
            // ...
        }
        // ...
    }
}
framebuffer it won’t write anything outside of the drawing area even if a draw
command clips outside.
impl Gpu {
    // ...
You see that the drawing_area_top value can range from 0 to 1023. It’s
strange because the GPU VRAM only has 512 lines so anything beyond that
value won’t be rendered. The horizontal coordinate, drawing_area_left, has
the same resolution but this one is normal: the VRAM has 2048 bytes per line
but since the GPU draws 16 bits per pixel (15bit RGB + mask bit) you can only
fit 1024 pixels per VRAM line.
Unsurprisingly the next command is 0xe403c27f which sets the bottom-right
corner of the drawing area, the parameter packing is the same:
impl Gpu {
    // ...
After those two commands the top-left corner is at [0, 1] while the
bottom-right is at [639, 240]. The coordinates are inclusive so the drawing
area resolution is 640x240 which looks like a standard NTSC field resolution.
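To double-check those numbers by hand, here is a minimal sketch of the decoding. It assumes the X coordinate sits in bits [9:0] and the Y coordinate in bits [19:10] of the command word, which is consistent with the 0 to 1023 range mentioned above; the function names are made up for this example.

```rust
/// Decode the two 10bit fields of a drawing area command (top-left or
/// bottom-right). Assumed layout: X in bits [9:0], Y in bits [19:10].
fn decode_drawing_area(val: u32) -> (u16, u16) {
    let x = (val & 0x3ff) as u16;
    let y = ((val >> 10) & 0x3ff) as u16;
    (x, y)
}

/// Size in pixels of an area whose two corners are both inclusive.
fn area_size(top_left: (u16, u16), bottom_right: (u16, u16)) -> (u16, u16) {
    (bottom_right.0 - top_left.0 + 1,
     bottom_right.1 - top_left.1 + 1)
}
```

Feeding 0xe403c27f to decode_drawing_area gives (639, 240), and area_size with the [0, 1] top-left corner gives the 640x240 resolution quoted above.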
shift to 16bits in order to get the correct sign in an i16, then we can shift back to 11bits.
}
}
This particular command sets the offset to [0, 1] which matches the drawing
area top-left corner so everything is coherent so far. I’m not sure why the BIOS
doesn’t start at [0, 0] but I guess wasting one line doesn’t matter much for
displaying the boot logo.
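The sign extension trick used for the offset (shift up to 16 bits, reinterpret as an i16, shift back down) can be tested in isolation. This is only a sketch: the field positions (X in bits [10:0], Y in bits [21:11]) are my assumption and the function name is hypothetical.

```rust
/// Decode the two signed 11bit fields of the drawing offset command.
/// Assumed layout: X offset in bits [10:0], Y offset in bits [21:11].
fn decode_drawing_offset(val: u32) -> (i16, i16) {
    let x = (val & 0x7ff) as u16;
    let y = ((val >> 11) & 0x7ff) as u16;
    // Move the 11bit value to the top of a 16bit word so that its sign
    // bit lands in bit 15, reinterpret it as signed, then shift back
    // down. The right shift on an i16 is arithmetic so the sign is kept.
    let x = ((x << 5) as i16) >> 5;
    let y = ((y << 5) as i16) >> 5;
    (x, y)
}
```

With this layout a Y field of 1 and an X field of 0 decode to the [0, 1] offset described above, while an X field of 0x7ff decodes to -1.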
The mask bit behaves a bit like OpenGL’s stencil masks, it prevents the
GPU from overwriting a pixel if its mask bit is set and masking is enabled.
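As a rough illustration of that rule, here is a sketch of a pixel write that honors the mask. It assumes the mask bit is the top bit (bit 15) of each 16bit VRAM pixel; write_pixel and check_mask are names I made up for the example, not part of the emulator so far.

```rust
/// Return the pixel value that ends up in VRAM when the GPU attempts
/// to draw `new` over `old`.
/// Assumption: the mask bit is bit 15 of the 16bit pixel.
fn write_pixel(old: u16, new: u16, check_mask: bool) -> u16 {
    const MASK_BIT: u16 = 0x8000;
    if check_mask && (old & MASK_BIT) != 0 {
        // Masking is enabled and the target pixel is protected:
        // the write is dropped
        old
    } else {
        new
    }
}
```

A protected pixel only survives when masking is enabled; with check_mask set to false the write always goes through.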
The current command sets the start coordinates to [0, 241] which is immedi-
ately below the drawing area we configured before. I assume it’s because the
BIOS will use a form of double buffering and won’t draw directly to the displayed
area.
Note that those commands use a different packing format for their parameters.
impl CommandBuffer {
    fn new() -> CommandBuffer {
        CommandBuffer {
            buffer: [0; 12],
            len: 0,
        }
    }
    /// Clear the command buffer
    fn clear(&mut self) {
        self.len = 0;
    }
        self.len += 1;
    }
}
        &self.buffer[index]
    }
}
33 Those coordinates are not in VRAM but rather in the output’s video signal system of
coordinates.
It’s just a glorified array which can contain up to 12 words and keeps the
count of how many words have been pushed into it. The std::ops::Index
mumbo jumbo just overloads the [] operator to let us access CommandBuffer
elements like a regular array.
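Put together, such a buffer could look like the following sketch. The push_word name, the u8 length type and the bounds check in index are my own choices for the example, so treat them as assumptions.

```rust
use std::ops::Index;

/// Buffer holding the words of a multi-word GP0 command
struct CommandBuffer {
    /// Command storage, 12 words at most
    buffer: [u32; 12],
    /// Number of words queued in `buffer`
    len: u8,
}

impl CommandBuffer {
    fn new() -> CommandBuffer {
        CommandBuffer { buffer: [0; 12], len: 0 }
    }

    /// Clear the command buffer
    fn clear(&mut self) {
        self.len = 0;
    }

    /// Append a word (hypothetical name). Panics if the buffer is full
    /// since the array access is bounds-checked.
    fn push_word(&mut self, word: u32) {
        self.buffer[self.len as usize] = word;
        self.len += 1;
    }
}

impl Index<usize> for CommandBuffer {
    type Output = u32;

    /// Lets us write `command[i]`, only for words that were pushed
    fn index(&self, index: usize) -> &u32 {
        if index >= self.len as usize {
            panic!("Command buffer index out of range: {} ({})",
                   index, self.len);
        }
        &self.buffer[index]
    }
}
```

After two calls to push_word the expressions command[0] and command[1] return the queued words, while command[2] panics instead of silently returning stale data.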
We can add an instance of this CommandBuffer to our GPU state and we’ll
also add a counter of the number of remaining parameters and a function pointer
to the method which implements the command (it will save us having to match
the opcode twice):
pub struct Gpu {
    // ...
    /// Buffer containing the current GP0 command
    gp0_command: CommandBuffer,
    /// Remaining words for the current GP0 command
    gp0_command_remaining: u32,
    /// Pointer to the method implementing the current GP0 command
    gp0_command_method: fn(&mut Gpu),
}
We can now modify our GP0 register handler to use this new infrastructure:
impl Gpu {
    // ...
            let (len, method) =
                match opcode {
                    0x00 =>
                        (1, Gpu::gp0_nop as fn(&mut Gpu)),
                    0x28 =>
                        (5, Gpu::gp0_quad_mono_opaque as fn(&mut Gpu)),
                    0xe1 =>
                        (1, Gpu::gp0_draw_mode as fn(&mut Gpu)),
                    0xe2 =>
                        (1, Gpu::gp0_texture_window as fn(&mut Gpu)),
                    0xe3 =>
                        (1, Gpu::gp0_drawing_area_top_left as fn(&mut Gpu)),
                    0xe4 =>
                        (1, Gpu::gp0_drawing_area_bottom_right as fn(&mut Gpu)),
                    0xe5 =>
                        (1, Gpu::gp0_drawing_offset as fn(&mut Gpu)),
                    0xe6 =>
                        (1, Gpu::gp0_mask_bit_setting as fn(&mut Gpu)),
                    _ => panic!("Unhandled GP0 command {:08x}", val),
                };
            self.gp0_command.clear();
        }
    /// GP0(0x00): No Operation
    fn gp0_nop(&mut self) {
        // NOP
    }
}
We’re still missing the implementation of the gp0_quad_mono_opaque function
that’s supposed to render the primitive in the framebuffer. We could start
drawing to the screen right away but since we only have a black rectangle so far
it wouldn’t be very interesting. Let’s put a placeholder for now and continue a
little further before we fire up OpenGL:
impl Gpu {
    // ...
}
    /// Retrieve value of the status register
    pub fn status(&self) -> u32 {
        let mut r = 0u32;
        // ...
        r |= self.hres.into_status();
        // XXX Temporary hack: if we don't emulate bit 31 correctly
        // setting `vres` to 1 locks the BIOS:
        // r |= (self.vres as u32) << 19;
        r |= (self.vmode as u32) << 20;
        // ...
    }
}
This is not very satisfactory of course but that should allow us to keep going
with our first GPU implementation. Soon after that we’ll start working on
accurate timings and we’ll be able to emulate bit 31 properly.
    /// GP0(0x01): Clear Cache
    fn gp0_clear_cache(&mut self) {
        // Not implemented
    }
}
    /// Current mode of the GP0 register
    gp0_mode: Gp0Mode,
}
impl Gpu {
    pub fn new() -> Gpu {
        Gpu {
            // ...
            // ...
        }
/// Possible states for the GP0 command register
enum Gp0Mode {
    /// Default mode: handling commands
    Command,
    /// Loading an image into VRAM
    ImageLoad,
}
I also renamed gp0_command_remaining to gp0_words_remaining since it
will also count the remaining number of image words to load.
We can then tweak the gp0 method to handle this new mode:
impl Gpu {
    // ...
            let (len, method) =
                match opcode {
                    // ...
                    0xa0 =>
                        (3, Gpu::gp0_image_load as fn(&mut Gpu)),
                    // ...
                };
            self.gp0_command.clear();
        }
                self.gp0_words_remaining -= 1;
                if self.gp0_words_remaining == 0 {
                    // We have all the parameters, we can run
                    // the command
                    (self.gp0_command_method)(self);
                }
            }
            Gp0Mode::ImageLoad => {
                // XXX Should copy pixel data to VRAM
                if self.gp0_words_remaining == 0 {
                    // Load done, switch back to command mode
                    self.gp0_mode = Gp0Mode::Command;
                }
            }
        }
    }
}
I added the gp0_image_load command which I consider to be 3 words long.
The method uses those parameters to compute the number of words we must
expect as part of the image data and puts it back in gp0_words_remaining while
switching gp0_mode to ImageLoad:
impl Gpu {
    // ...
        let width = res & 0xffff;
        let height = res >> 16;
        // Size of the image in 16bit pixels
        let imgsize = width * height;
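To see how the two modes interact, the fragments above can be assembled into a small standalone state machine. This is a simplified sketch, not the emulator’s code: the Gp0 struct and image_words helper are hypothetical, only the “Image Load” opcode is handled, and I assume that an odd number of 16bit pixels is padded up to a whole 32bit word.

```rust
/// Possible states for the GP0 command register
#[derive(Debug, PartialEq)]
enum Gp0Mode {
    Command,
    ImageLoad,
}

/// Stripped-down GP0 front end (hypothetical, for illustration only)
struct Gp0 {
    mode: Gp0Mode,
    words_remaining: u32,
    command: Vec<u32>,
}

/// Number of 32bit words needed to transfer `width` x `height` 16bit
/// pixels. Assumption: an odd pixel count is rounded up to a whole word.
fn image_words(width: u32, height: u32) -> u32 {
    let imgsize = width * height;
    ((imgsize + 1) & !1) / 2
}

impl Gp0 {
    fn new() -> Gp0 {
        Gp0 {
            mode: Gp0Mode::Command,
            words_remaining: 0,
            command: Vec::new(),
        }
    }

    /// Handle one word written to GP0
    fn write(&mut self, val: u32) {
        match self.mode {
            Gp0Mode::Command => {
                if self.words_remaining == 0 {
                    // Start of a new command
                    let opcode = (val >> 24) & 0xff;
                    self.words_remaining = match opcode {
                        0xa0 => 3,
                        _ => panic!("Unhandled GP0 command {:08x}", val),
                    };
                    self.command.clear();
                }
                self.command.push(val);
                self.words_remaining -= 1;
                if self.words_remaining == 0 {
                    // We have all the parameters, run the command
                    self.image_load();
                }
            }
            Gp0Mode::ImageLoad => {
                // XXX Should copy pixel data to VRAM
                self.words_remaining -= 1;
                if self.words_remaining == 0 {
                    // Load done, switch back to command mode
                    self.mode = Gp0Mode::Command;
                }
            }
        }
    }

    /// GP0(0xa0): the third word holds the resolution
    fn image_load(&mut self) {
        let res = self.command[2];
        let words = image_words(res & 0xffff, res >> 16);
        if words > 0 {
            self.words_remaining = words;
            self.mode = Gp0Mode::ImageLoad;
        }
    }
}
```

Writing the three command words for a 3x2 image switches the sketch to ImageLoad with 3 words remaining; after three more writes it falls back to Command mode.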
        while remsz > 0 {
            // ...
            match channel.direction() {
                Direction::FromRam => {
                    let src_word = self.ram.load32(cur_addr);
                    match port {
                        Port::Gpu => self.gpu.gp0(src_word),
                        _ =>
                            panic!("Unhandled DMA destination port {}",
                                   port as u8),
                    }
                }
                // ...
            }
            // ...
        }
        // ...
    }
}
Our emulator now loads textures to the GPU and then discards them imme-
diately. Beautiful.
4.21 GP1 Display Enable command
After that the BIOS issues GP1 command 0x03000000 which is used to set the
value of our display_disabled field:
impl Gpu {
    // ...
        let width = res & 0xffff;
        let height = res >> 16;
We don’t have to do anything more: after this command the BIOS will expect
the image data to be available through the GPUREAD register. Right now our
implementation of this register always returns 0 so it will read that as many
times as it wants.
missing a few commands before we can proceed to implement the OpenGL
renderer itself.
The first one is 0x380000b2 which draws a shaded quadrilateral. It means
that unlike the previous quad command this one takes one color per vertex and
fills the shape with Gouraud shading, which creates a gradient between those
values. We’ll see that this type of shading is trivial to implement in OpenGL.
This command takes 8 parameters: 4 vertex positions and their associated colors.
As for the other drawing commands let’s put a placeholder for the moment:
impl Gpu {
    // ...
            let (len, method) =
                match opcode {
                    // ...
                    0x38 =>
                        (8, Gpu::gp0_quad_shaded_opaque as fn(&mut Gpu)),
                    // ...
                };
            // ...
        }
    // ...
}
            let (len, method) =
                match opcode {
                    // ...
                    0x30 =>
                        (6, Gpu::gp0_triangle_shaded_opaque as fn(&mut Gpu)),
                    // ...
                };
            // ...
        }
    // ...
}
            let (len, method) =
                match opcode {
                    // ...
                    0x2c =>
                        (9, Gpu::gp0_quad_texture_blend_opaque as fn(&mut Gpu)),
                    // ...
                };
            // ...
        }
    // ...
}
4.27 GP1 Reset Command Buffer command
And we finish this sequence with GP1 command 0x01000000 which clears the
command FIFO. We don’t implement the FIFO itself yet but we can at least
reset the GP0 state machine to a default state:
impl Gpu {
    // ...
I take the opportunity to add a call to this function in gp1_reset since it
should also clear the command buffer.
And that’s it! We have our entire GPU command sequence to display the
boot logo. Now we can implement a basic OpenGL renderer to visualize it all.
    sdl_context: sdl2::sdl::Sdl,
    window: sdl2::video::Window,
    gl_context: sdl2::video::GLContext,
}
impl Renderer {
        sdl2::video::gl_set_attribute(GLContextMajorVersion, 3);
        sdl2::video::gl_set_attribute(GLContextMinorVersion, 3);
        let gl_context = window.gl_create_context().unwrap();
        gl::load_with(|s|
                      sdl2::video::gl_get_proc_address(s).unwrap()
                      as *const c_void);
        Renderer {
            sdl_context: sdl_context,
            window: window,
            gl_context: gl_context,
        }
    }
}
The function sdl2::init calls the global SDL2 initialization routine. For
now we’re only using the VIDEO subsystem. In the SDL2 C API this function
doesn’t return anything but the Rust bindings return an object that’s used
to call SDL_Quit automatically when it’s destroyed. In C you have to call
SDL_Quit explicitly when your program exits (or whenever you don’t need the
SDL anymore).
After that the two gl_set_attribute calls say that we’re going to use
OpenGL 3.3 [34].
Then Window::new creates the window itself with a resolution of 1024x512
(the resolution of the VRAM) and OpenGL support. I named the window “PSX”
because I don’t have any imagination.
We can retrieve the window’s OpenGL context with the gl_create_context
method and then we must load the OpenGL function pointers. You don’t really
need to understand that part in detail, it’s some glue between the OpenGL
and SDL libraries, you just need to make sure it’s done before we start calling
OpenGL commands.
Finally we store the SDL context, window and OpenGL context in the newly
created Renderer object. We need to put an instance of this struct in our GPU:
pub struct Gpu {
    // ...
    /// OpenGL renderer
    renderer: opengl::Renderer,
}
impl Gpu {
    pub fn new() -> Gpu {
        Gpu {
            // ...
            renderer: opengl::Renderer::new(),
        }
    }
}
34 At the time of writing OpenGL 4.5 is the latest version but 3.3 is more widely supported
and should suffice for what we’re doing although we may end up using a couple extensions.
If everything works well our emulator should now create a window when
starting up. The window’s contents are garbage however (on my system it
contains a chunk of the screen). We can clear it by issuing the following calls:
impl Renderer {
        // Clear the window
        unsafe {
            gl::ClearColor(0., 0., 0., 1.0);
            gl::Clear(gl::COLOR_BUFFER_BIT);
        }
        Renderer {
            // ...
        }
    }
}
The unsafe keyword is there because as far as Rust is concerned all OpenGL
calls go through a C foreign function interface and are therefore potentially
memory unsafe. The ClearColor [35] function sets the clear color (duh): the
first three parameters are the red, green and blue components and the fourth is
the alpha parameter. They are all floating point values in the range [0.0, 1.0].
In this case all the color components are 0.0 so the color is black and alpha is
set to 1.0 which means it’s fully opaque.
The Clear function then applies this color to the entire color buffer. You’ll
notice that we just give the type of buffer we want to clear as parameter, not a
handle to a specific buffer. That’s the way most of the OpenGL API works: you
“bind” various types of object to an implicit global context and the subsequent
function calls act on the currently bound object of a given type. In this case
we haven’t bound anything ourselves, by default the color buffer will be the
window’s framebuffer.
The gl_swap_window call forces a window update and displays the result of
the previous commands. With this addition the window should now appear
completely black. Progress!
35 The OpenGL C API concatenates the “gl” prefix to symbols (“GL_” for macros) so in
C ClearColor would be glClearColor and COLOR_BUFFER_BIT would be GL_COLOR_BUFFER_BIT.
When searching for an OpenGL symbol online it’s sometimes better to use the C form.
5.2 Drawing the primitives
Now let’s do something more interesting: drawing the primitives. This is the
part where we’ll have to write a whole lot of OpenGL glue so take a deep breath
and dive in.
Let’s choose a primitive to start with, I’ve decided to use GP0(0x30), the
Gouraud shaded triangle. It’s a simple shape with some basic shading. It has
three vertices, each having a position in VRAM and a color. Let’s create structs
to hold those attributes in a shader-friendly fashion:
/// Position in VRAM.
#[derive(Copy, Clone, Default, Debug)]
pub struct Position(pub GLshort, pub GLshort);
impl Position {
    /// Parse position from a GP0 parameter
    pub fn from_gp0(val: u32) -> Position {
        let x = val as i16;
        let y = (val >> 16) as i16;
        Position(x as GLshort, y as GLshort)
    }
}
/// RGB color
#[derive(Copy, Clone, Default, Debug)]
pub struct Color(pub GLubyte, pub GLubyte, pub GLubyte);
impl Color {
    /// Parse color from a GP0 parameter
    pub fn from_gp0(val: u32) -> Color {
        let r = val as u8;
        let g = (val >> 8) as u8;
        let b = (val >> 16) as u8;
        let colors = [
            Color::from_gp0(self.gp0_command[0]),
            Color::from_gp0(self.gp0_command[2]),
            Color::from_gp0(self.gp0_command[4]),
        ];
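Since the parsing is pure bit twiddling it can be exercised without any OpenGL. The sketch below uses plain i16 and u8 in place of GLshort and GLubyte, and a hypothetical parse_shaded_triangle helper. The colors sit in words 0, 2 and 4 as in the listing above; putting the positions in words 1, 3 and 5 is my assumption from the 6-word command length.

```rust
/// Position in VRAM (i16 stands in for GLshort here)
#[derive(Copy, Clone, Default, Debug, PartialEq)]
struct Position(i16, i16);

impl Position {
    /// X is in the low 16 bits of the parameter, Y in the high 16 bits
    fn from_gp0(val: u32) -> Position {
        Position(val as i16, (val >> 16) as i16)
    }
}

/// RGB color (u8 stands in for GLubyte here)
#[derive(Copy, Clone, Default, Debug, PartialEq)]
struct Color(u8, u8, u8);

impl Color {
    /// Red is in the low byte, then green, then blue. The top byte is
    /// ignored (in the first word it holds the opcode).
    fn from_gp0(val: u32) -> Color {
        Color(val as u8, (val >> 8) as u8, (val >> 16) as u8)
    }
}

/// Split the 6 words of a shaded triangle command into attributes.
/// Assumed layout: color, position, color, position, color, position.
fn parse_shaded_triangle(words: &[u32; 6]) -> ([Position; 3], [Color; 3]) {
    let positions = [
        Position::from_gp0(words[1]),
        Position::from_gp0(words[3]),
        Position::from_gp0(words[5]),
    ];
    let colors = [
        Color::from_gp0(words[0]),
        Color::from_gp0(words[2]),
        Color::from_gp0(words[4]),
    ];
    (positions, colors)
}
```

For instance the word 0x300000ff parses as a pure red Color(0xff, 0, 0) since the opcode byte is dropped, and 0x00100020 parses as Position(32, 16).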
Now we need to implement this push_triangle method that will put the
attributes in a list of vertices to render. That’s where the fun begins.
First we need to set up some buffers to hold the data. There are several
ways to send data to the GPU, I’ve decided to go with persistently mapped
buffers. The idea is that we’re going to ask OpenGL to allocate some memory
that will be shared between the GPU and us. We’ll fill it with our data and
when we’re ready we’ll tell the GPU to use it to draw the scene. Easy.
To avoid duplicating a bunch of code let’s make a generic Buffer struct
holding an attribute buffer and its mapping:
// Write only buffer with enough size for VERTEX_BUFFER_LEN
// elements
pub struct Buffer<T> {
    /// OpenGL buffer object
    object: GLuint,
    /// Mapped buffer memory
    map: *mut T,
}
        unsafe {
            // Generate the buffer object
            gl::GenBuffers(1, &mut object);
            // Bind it
            gl::BindBuffer(gl::ARRAY_BUFFER, object);
            // Compute the size of the buffer
            let element_size = size_of::<T>() as GLsizeiptr;
            let buffer_size = element_size * VERTEX_BUFFER_LEN as GLsizeiptr;
            // Allocate buffer memory
            gl::BufferStorage(gl::ARRAY_BUFFER,
                              buffer_size,
                              ptr::null(),
                              access);
            // Remap the entire buffer
            memory = gl::MapBufferRange(gl::ARRAY_BUFFER,
                                        0,
                                        buffer_size,
                                        access) as *mut T;
            // Reset the buffer to 0 to avoid hard-to-reproduce bugs
            // if we do something wrong with uninitialized memory
            let s = slice::from_raw_parts_mut(memory,
                                              VERTEX_BUFFER_LEN as usize);
        Buffer {
            object: object,
            map: memory,
        }
    }
    /// Set entry at `index` to `val` in the buffer.
    pub fn set(&mut self, index: u32, val: T) {
        if index >= VERTEX_BUFFER_LEN {
            panic!("buffer overflow!");
        }
        unsafe {
            let p = self.map.offset(index as isize);
            *p = val;
        }
    }
}
That’s a lot of code to simply allocate a buffer! Let’s walk through it:
• First GenBuffers creates a new buffer object. That doesn’t allocate the
buffer memory, it basically just creates a handle.
• This handle is then bound with BindBuffer, from then on the commands
targeting ARRAY_BUFFER will use this buffer.
• We must then compute the size of the buffer in bytes. I’ve decided to
hardcode the length of the buffer in VERTEX_BUFFER_LEN, ideally it should
be big enough to hold an entire scene (otherwise we’ll have to make several
draw calls per frame), but not too big in order not to waste memory. We’ll
probably want to better tune that constant later.
• Once we know how much room we need we can ask OpenGL to allocate
it for us. We request MAP_WRITE_BIT since we want write-only access
to the buffer and MAP_PERSISTENT_BIT to be able to hold the mapping
persistently (instead of having to remap it for each frame).
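To get a feel for the amount of memory involved, the size computation from the third step can be run on its own. The value of VERTEX_BUFFER_LEN below is only a placeholder I picked for the example, and the two structs are plain stand-ins for the attribute types.

```rust
use std::mem::size_of;

/// Placeholder capacity, chosen arbitrarily for this example
const VERTEX_BUFFER_LEN: usize = 64 * 1024;

/// Stand-in for the position attribute: two 16bit coordinates
#[repr(C)]
struct Position(i16, i16);

/// Stand-in for the color attribute: three 8bit components
#[repr(C)]
struct Color(u8, u8, u8);

/// Size in bytes of a buffer holding VERTEX_BUFFER_LEN elements of `T`
fn buffer_size_bytes<T>() -> usize {
    size_of::<T>() * VERTEX_BUFFER_LEN
}
```

With this capacity the position buffer takes 4 bytes per vertex (256KiB in total) and the color buffer 3 bytes per vertex (192KiB), so there is plenty of headroom to raise the constant if a scene needs it.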
We can add our two buffers to the Renderer right now but creating buffers
without having any shaders to render them isn’t very useful.
out vec3 color;
void main() {
    // Convert VRAM coordinates (0;1023, 0;511) into
    // OpenGL coordinates (-1;1, -1;1)
    float xpos = (float(vertex_position.x) / 512) - 1.0;
    // VRAM puts 0 at the top, OpenGL at the bottom,
    // we must mirror vertically
    float ypos = 1.0 - (float(vertex_position.y) / 256);
OpenGL shader language, also called GLSL, looks a bit like C but don’t
let that fool you, it’s actually quite different. For one you can see that the
parameters and return values are not given in the main prototype, instead they’re
given at the global scope as in and out parameters.
We have two in parameters: the vertex position (a pair of signed integers)
and its color (a triplet of unsigned integers). The main function is called once
for each vertex. Our triangle has three vertices so it’ll be called 3 times.
The shader sets two output variables: color (a triplet of floats) and
gl_Position which is a builtin GLSL variable, a vector of four floats. The last
two components of gl_Position are the z (depth) coordinate which is always
0 for us since we’re drawing in 2D and the w parameter (the homogeneous
component) which should be 1.0 for a position. This last parameter is used for
perspective correct projection [36].
You can see that the OpenGL horizontal and vertical screen coordinates go
from -1.0 to 1.0 (no matter the actual resolution of the screen) and that the
vertical coordinates go in the opposite direction to the Playstation VRAM
addressing. OpenGL colors are also floats in the range [0.0, 1.0].
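The same mapping can be written on the CPU side to sanity-check the corner cases. This is just a Rust mirror of the arithmetic in the vertex shader; the function names are hypothetical and the division by 255 for the colors is my assumption about how the 8bit components get normalized.

```rust
/// Convert VRAM coordinates (0..1023, 0..511) into OpenGL coordinates
/// (-1.0..1.0 on both axes), mirroring vertically since VRAM puts line
/// 0 at the top while OpenGL puts -1.0 at the bottom.
fn vram_to_gl(x: i16, y: i16) -> (f32, f32) {
    let xpos = (x as f32 / 512.0) - 1.0;
    let ypos = 1.0 - (y as f32 / 256.0);
    (xpos, ypos)
}

/// Convert 8bit color components to floats in [0.0, 1.0]
/// (assumed normalization: divide by 255)
fn color_to_gl(r: u8, g: u8, b: u8) -> (f32, f32, f32) {
    (r as f32 / 255.0, g as f32 / 255.0, b as f32 / 255.0)
}
```

The top-left VRAM pixel (0, 0) maps to (-1.0, 1.0) and the middle of the VRAM (512, 256) maps to the OpenGL origin.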
You can see that our vertex shader does all the work of converting coordinates
and colors from the Playstation internal representation to the OpenGL format.
In general we’ll want to offload as much computation as possible to the GPU
since I’m expecting the emulation bottleneck to be on the CPU.
After all Playstation graphics are extremely simple compared to modern
games, for instance modern GPUs have gigabytes worth of video RAM compared
to the Playstation’s puny 2MB. Even if we enhance the graphics significantly
our graphics cards shouldn’t break a sweat if we’re careful not to write
extremely poorly optimized shader code.
in vec3 color;
out vec4 frag_color;
void main() {
    frag_color = vec4(color, 1.0);
}
Pretty straightforward: the output color frag_color (the name is arbitrary)
takes the value of the input attribute color and a fourth value which is the alpha
channel to handle transparent pixels. In our case the pixels are fully opaque so
it’s hardcoded to 1.0.
If you’re not familiar with OpenGL you’re probably puzzled, what’s the value
of this color parameter exactly? A triangle has three vertices, potentially each
with a different color, so which one do we get here?
36 If you’re not familiar with homogeneous coordinates don’t worry, all you have to know for
now is that you have to set the w component to 1.0 for a position and 0.0 for a vector.
37 There are actually a couple more stages before that in modern OpenGL like the tesselation
What happens is that in this case OpenGL tells the GPU to interpolate the
value of the color based on its distance to the three vertices and their respective
color. That means that we’ll get a smooth gradient which is exactly what we
need for the Gouraud shading. OpenGL does all the hard work for us!
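To make the idea concrete, here is a CPU sketch of that kind of interpolation using barycentric weights. It only illustrates the principle for a flat 2D triangle; it is not what the driver literally runs, and all the names are made up for the example.

```rust
type Point = (f32, f32);

/// Twice the signed area of the triangle (a, b, p)
fn edge(a: Point, b: Point, p: Point) -> f32 {
    (b.0 - a.0) * (p.1 - a.1) - (b.1 - a.1) * (p.0 - a.0)
}

/// Interpolate the vertex colors `c` at point `p` inside the triangle
/// `v`. Each vertex gets a weight proportional to the area of the
/// sub-triangle opposite to it, so the closer `p` is to a vertex the
/// more that vertex's color dominates.
fn interpolate_color(v: [Point; 3], c: [[f32; 3]; 3], p: Point) -> [f32; 3] {
    let area = edge(v[0], v[1], v[2]);
    let w0 = edge(v[1], v[2], p) / area;
    let w1 = edge(v[2], v[0], p) / area;
    let w2 = edge(v[0], v[1], p) / area;

    let mut out = [0.0; 3];
    for i in 0..3 {
        out[i] = w0 * c[0][i] + w1 * c[1][i] + w2 * c[2][i];
    }
    out
}
```

On a triangle with one red, one green and one blue vertex, a point sitting on a vertex gets exactly that vertex’s color and the midpoint of the red-green edge gets half red, half green.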
    /// Vertex shader object
    vertex_shader: GLuint,
    /// Fragment shader object
    fragment_shader: GLuint,
    /// OpenGL Program object
    program: GLuint,
    /// OpenGL Vertex array object
    vertex_array_object: GLuint,
    /// Buffer containing the vertex positions
    positions: Buffer<Position>,
    /// Buffer containing the vertex colors
    colors: Buffer<Color>,
    /// Current number of vertices in the buffers
    nvertices: u32,
}
impl Renderer {
        // "Slurp" the contents of the shader files. Note: this is
        // a compile-time thing.
        let vs_src = include_str!("vertex.glsl");
        let fs_src = include_str!("fragment.glsl");
        // Compile our shaders...
        let vertex_shader = compile_shader(vs_src,
                                           gl::VERTEX_SHADER);
        let fragment_shader = compile_shader(fs_src,
                                             gl::FRAGMENT_SHADER);
        // ...Link our program...
        let program = link_program(&[vertex_shader,
                                     fragment_shader]);
        // ...And use it.
        unsafe {
            gl::UseProgram(program);
        }
        // Generate our vertex attribute object that will hold our
        // vertex attributes
        let mut vao = 0;
        unsafe {
            gl::GenVertexArrays(1, &mut vao);
            // Bind our VAO
            gl::BindVertexArray(vao);
        }
        // Setup the "position" attribute. First we create
        // the buffer holding the positions (this call also
        // binds it)
        let positions = Buffer::new();
        unsafe {
            // Then we retrieve the index for the attribute in the
            // shader
            let index = find_program_attrib(program,
                                            "vertex_position");
            // Enable it
            gl::EnableVertexAttribArray(index);
        unsafe {
            let index = find_program_attrib(program,
                                            "vertex_color");
            gl::EnableVertexAttribArray(index);
        Renderer {
            sdl_context: sdl_context,
            window: window,
            gl_context: gl_context,
            vertex_shader: vertex_shader,
            fragment_shader: fragment_shader,
            program: program,
            vertex_array_object: vao,
            positions: positions,
            colors: colors,
            nvertices: 0,
        }
    }
    // ...
}
Quite a lot of code to go through here. I put the code for our two shaders
described earlier in two files named “vertex.glsl” and “fragment.glsl” respectively.
I retrieve their contents here using Rust’s include_str! macro. Then I ask
OpenGL to compile both shaders using the compile_shader helper function:
pub fn compile_shader(src: &str, shader_type: GLenum) -> GLuint {
    let shader;
    unsafe {
        shader = gl::CreateShader(shader_type);
        // Attempt to compile the shader
        let c_str = CString::new(src.as_bytes()).unwrap();
        gl::ShaderSource(shader, 1, &c_str.as_ptr(), ptr::null());
        gl::CompileShader(shader);
        // Extra bit of error checking in case we're not using a
        // DEBUG OpenGL context and check_for_errors can't do it
        // properly:
        let mut status = gl::FALSE as GLint;
        gl::GetShaderiv(shader, gl::COMPILE_STATUS, &mut status);
        if status != (gl::TRUE as GLint) {
            panic!("Shader compilation failed!");
        }
    }
    shader
}
    unsafe {
        program = gl::CreateProgram();
        for &shader in shaders {
            gl::AttachShader(program, shader);
        }
        gl::LinkProgram(program);
        // Extra bit of error checking in case we're not using a
        // DEBUG OpenGL context and check_for_errors can't do it
        // properly:
        let mut status = gl::FALSE as GLint;
        gl::GetProgramiv(program, gl::LINK_STATUS, &mut status);
        if status != (gl::TRUE as GLint) {
            panic!("OpenGL program linking failed!");
        }
    }
    program
}
Once the program is linked UseProgram activates it. We can then setup our
position and color attributes.
    let index = unsafe { gl::GetAttribLocation(program, cstr) };
    if index < 0 {
        panic!("Attribute \"{}\" not found in program", attr);
    }
    index as GLuint
}
• The fourth parameter is the “stride”, the byte offset between consecutive
values in the buffer. Since we don’t have any padding in our buffer we set
it to 0, which tells OpenGL that the values are tightly packed.
• The last parameter is the offset of the first value inside the currently
bound buffer, passed as a pointer. Our data starts at the very beginning of
the buffer so we set it to NULL.
After this call our position buffer will be ready for use!
We then go through the same sequence for our color buffer, the only difference
being the parameters to the VertexAttribIPointer call: this time we have
three values per vertex and the type is UNSIGNED_BYTE.
Finally I put it all in the Renderer struct along with an nvertices variable
that will hold the current number of vertices ready to be drawn in the vertex
buffers.
In order to clean everything up properly when we exit we need a destructor
to release the resources:
impl Drop for Renderer {
    fn drop(&mut self) {
        unsafe {
            gl::DeleteVertexArrays(1, &self.vertex_array_object);
            gl::DeleteShader(self.vertex_shader);
            gl::DeleteShader(self.fragment_shader);
            gl::DeleteProgram(self.program);
        }
    }
}
    /// Add a triangle to the draw buffer
    pub fn push_triangle(&mut self,
                         positions: [Position; 3],
                         colors: [Color; 3]) {
        for i in 0..3 {
            // Push
            self.positions.set(self.nvertices, positions[i]);
            self.colors.set(self.nvertices, colors[i]);
            self.nvertices += 1;
        }
    }
}
The draw command itself is not very complicated but we need to be careful
to synchronize ourselves properly with the GPU. That means flushing our buffers
before we ask the GPU to start drawing and then waiting for the rendering to
finish before we touch the buffers again:
impl Renderer {
    // ...
            gl::DrawArrays(gl::TRIANGLES,
                           0,
                           self.nvertices as GLsizei);
        }
        // Wait for GPU to complete
        unsafe {
            let sync = gl::FenceSync(gl::SYNC_GPU_COMMANDS_COMPLETE,
                                     0);
            loop {
                let r = gl::ClientWaitSync(
                    sync,
                    gl::SYNC_FLUSH_COMMANDS_BIT,
                    10000000);
                if r == gl::ALREADY_SIGNALED ||
                   r == gl::CONDITION_SATISFIED {
                    // Drawing done
                    break;
                }
            }
        }
        // Reset the buffers
        self.nvertices = 0;
    }
}
The call to MemoryBarrier makes sure the data written to the mapped buffer
is visible by the GPU instead of, say, stuck in a CPU cache. We could avoid this
call by mapping the buffer with the MAP_COHERENT_BIT access flag set but that
might make writing to the buffers slower so it’s not necessarily better.
The DrawArrays function is where the magic happens: it tells the GPU to
draw nvertices as triangles. Once this command is issued the GPU will start
working asynchronously so we must be careful: if we start pushing new data to
the buffers before the GPU is done we might overwrite attributes that are still
in use which may cause glitches.
To avoid that we simply wait for the GPU to finish by using a fence:
FenceSync creates a fence waiting for the current commands to complete and
ClientWaitSync is used to wait for completion.
Finally we reset nvertices to 0 to start anew.
This method is actually pretty suboptimal: we stall our emulator completely
when the GPU is working. We could improve this by using double buffering
for our attributes but let’s leave that for later.
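The bookkeeping around nvertices can be tried out without any OpenGL at all. The sketch below replaces the mapped buffers with plain Vecs and the draw call with a counter; VertexQueue, the tiny VERTEX_BUFFER_LEN and the policy of forcing an early draw when the buffers are full are all assumptions made for the example.

```rust
/// Deliberately tiny capacity so that the "buffers full" path is easy
/// to reach (hypothetical value, a real one would be much larger)
const VERTEX_BUFFER_LEN: u32 = 6;

type Position = (i16, i16);
type Color = (u8, u8, u8);

/// OpenGL-free stand-in for the renderer's attribute buffers
struct VertexQueue {
    positions: Vec<Position>,
    colors: Vec<Color>,
    /// Current number of vertices waiting to be drawn
    nvertices: u32,
    /// Total number of vertices handed to the fake draw call
    drawn: u32,
}

impl VertexQueue {
    fn new() -> VertexQueue {
        VertexQueue {
            positions: vec![(0, 0); VERTEX_BUFFER_LEN as usize],
            colors: vec![(0, 0, 0); VERTEX_BUFFER_LEN as usize],
            nvertices: 0,
            drawn: 0,
        }
    }

    /// Add a triangle to the draw buffers
    fn push_triangle(&mut self,
                     positions: [Position; 3],
                     colors: [Color; 3]) {
        // One possible policy: if there isn't enough room left for
        // three more vertices, force an early draw
        if self.nvertices + 3 > VERTEX_BUFFER_LEN {
            self.draw();
        }
        for i in 0..3 {
            let n = self.nvertices as usize;
            self.positions[n] = positions[i];
            self.colors[n] = colors[i];
            self.nvertices += 1;
        }
    }

    /// Stand-in for the real draw call: count the vertices, then
    /// reset the buffers
    fn draw(&mut self) {
        self.drawn += self.nvertices;
        self.nvertices = 0;
    }
}
```

With room for only two triangles, pushing a third one triggers a draw of the first six vertices and leaves three vertices queued.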
This draw command will render everything but it won’t display anything
until we swap the window’s buffer. We can add a display command to do just
that:
impl Renderer {
    // ...
Now we need to figure out when to call this method. Normally we’d want
to call it at each VSYNC, so 60 or 50 times per second depending on the video
mode but we don’t support GPU timings yet. Instead for the time being we can
find a command that the BIOS calls once per frame and put the display call in
there. One such command seems to be “Set Drawing Offset” so let’s put our
call to display in there:
impl Gpu {
    // ...
We should now finally be ready to draw our first triangles. If you restart the
emulator you should end up with the image in figure 2.
The two triangles start back-to-back and then move and shrink to their final
position. Since we don’t yet draw the background quad they’re all drawn on top
of each other which gives this color smearing effect. Note that the image has
a weird aspect ratio (2:1) and that the logo is not centered, it’s because we’re
displaying the entire VRAM framebuffer instead of just the 640x480 portion
configured in the video output.
Figure 2: First output of our OpenGL renderer
messages38 :
/// Check for OpenGL errors using `gl::GetDebugMessageLog`. If a
/// severe error is encountered this function panics. If the OpenGL
/// context doesn't have the DEBUG attribute this *probably* won't do
/// anything.
pub fn check_for_errors() {
    let mut fatal = false;

    loop {
        let mut buffer = vec![0; 4096];

        let count =
            unsafe {
                gl::GetDebugMessageLog(1,
                                       buffer.len() as GLsizei,
                                       &mut source,
                                       &mut mtype,
                                       &mut id,
                                       &mut severity,
                                       &mut message_size,
                                       buffer.as_mut_ptr() as *mut GLchar)
            };

        if count == 0 {
            // No messages left
            break;
        }

        buffer.truncate(message_size as usize);

        let message =
            match str::from_utf8(&buffer) {
                Ok(m) => m,
                Err(e) => panic!("Got invalid message: {}", e),
            };

        println!("OpenGL [{:?}|{:?}|{:?}|0x{:x}] {}",
                 severity, source, mtype, id, message);

        if severity.is_fatal() {
            // Something is very wrong, don't die just yet in order to
            // display any additional error message
            fatal = true;
        }
    }

    if fatal {
        panic!("Fatal OpenGL error");
    }
}
38 I’m leaving out the definition of the various Debug* types which are just thin wrappers
around the OpenGL values, as always check the repository if you want to see the entire code.
We can then call the check_for_errors method after critical sections: in
`draw` for instance, to check for errors in the past frame, but also at the end of
`new` to make sure the initialization went well. There’s one caveat though: the
debug extension only works when we use a debug OpenGL context. We can get
one by setting the CONTEXT_DEBUG attribute before we create the window:
sdl2::video::gl_set_attribute(
    GLAttr::GLContextFlags,
    sdl2::video::GL_CONTEXT_DEBUG.bits());
A debug context might be slower than a normal one though, so we’ll probably
want to only activate this for troubleshooting (via a command line flag or
something like that). For now performance doesn’t matter in the least so we can
leave it enabled at all times.
The error messages themselves are vendor specific but hopefully they should
be helpful. For instance with my Radeon card if I mess up my vertex shader by
replacing vec3 with vec4 in the color assignment I get the following message:
OpenGL [High|ShaderCompiler|Error|0x1] 0:19(10):
error: too few components to vec4
We can emulate that behavior in a push_quad method:
impl Renderer {
    // ...
        // Push the first triangle
        for i in 0..3 {
            self.positions.set(self.nvertices, positions[i]);
            self.colors.set(self.nvertices, colors[i]);
            self.nvertices += 1;
        }

        // Push the 2nd triangle
        for i in 1..4 {
            self.positions.set(self.nvertices, positions[i]);
            self.colors.set(self.nvertices, colors[i]);
            self.nvertices += 1;
        }
    }
}
We must duplicate the two vertices shared by the two triangles along one of
the quad’s diagonals, so we end up with 6 vertices for a single quad. It’s possible
to avoid that duplication (for instance by using indexed rendering) but at this
point it would be premature optimization.
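The duplication described above can be sketched in isolation. This is only an illustration, not the renderer's actual code: the `Vertex` type and the `quad_to_triangles` helper are hypothetical names introduced for this example, and the triangle indexes [0, 1, 2] and [1, 2, 3] simply mirror the two loops of push_quad.

```rust
/// Hypothetical vertex type used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Vertex {
    pub x: i16,
    pub y: i16,
}

/// Expand the four vertices of a quad into the six vertices of the
/// two triangles [0, 1, 2] and [1, 2, 3]. Vertices 1 and 2 sit on the
/// shared diagonal and are therefore emitted twice.
pub fn quad_to_triangles(quad: [Vertex; 4]) -> [Vertex; 6] {
    [quad[0], quad[1], quad[2], quad[1], quad[2], quad[3]]
}
```

With indexed rendering the vertex buffer would keep only the four original entries and the six indexes would live in a separate index buffer instead.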
Now all that’s left to do is to use push_quad to draw the monochrome and
shaded quadrilaterals:
impl Gpu {
    // ...
        // Only one color repeated 4 times
        let colors = [Color::from_gp0(self.gp0_command[0]); 4];

        self.renderer.push_quad(positions, colors);
    }

    fn gp0_quad_shaded_opaque(&mut self) {
        let positions = [
            Position::from_gp0(self.gp0_command[1]),
            Position::from_gp0(self.gp0_command[3]),
            Position::from_gp0(self.gp0_command[5]),
            Position::from_gp0(self.gp0_command[7]),
        ];

        let colors = [
            Color::from_gp0(self.gp0_command[0]),
            Color::from_gp0(self.gp0_command[2]),
            Color::from_gp0(self.gp0_command[4]),
            Color::from_gp0(self.gp0_command[6]),
        ];

        self.renderer.push_quad(positions, colors);
    }
}
Even though we use per-vertex colors it’s easy to draw monochrome primitives
by repeating the same color. We have encountered a third quad command,
gp0_quad_texture_blend_opaque, but since we don’t support textures we can’t
implement that correctly yet. In the meantime we can use a solid color instead.
It won’t look right but at least we’ll see something:
impl Gpu {
    // ...
        self.renderer.push_quad(positions, colors);
    }
}
Lo and behold, we should now have something that looks very much like the
“Sony Computer Entertainment” boot logo, minus the text which is contained in
the textures. Figure 3 shows the expected output.
As before the black area at the right and bottom of the image is due to the
fact that we display the entire framebuffer instead of just the part configured
in the video output. You can see that a single 640x480 image already takes
more than half of the entire VRAM and we’re only displaying a very simple
logo. Game developers back then had to be very careful with VRAM usage
(and memory usage in general). This is also one of the reasons most games are
rendered at lower resolutions like 640x240, but we’ll see that later.
Note that there are two ways to split a quadrilateral into two triangles, by
cutting along either diagonal. The choice is significant: figure 4 shows the result
of splitting along the other diagonal40 . You can see that the main “tilted square”
behind the two triangles is shaded differently. If your emulator’s output looks
like this it means that you’re not rendering the quads in the right order: you
need to split along the other diagonal.
Figure 3: Playstation boot logo without textures
40 I modified push_quad: instead of rendering triangles with vertex indexes [0, 1, 2] and
// Drawing offset
uniform ivec2 offset;

void main() {
    ivec2 position = vertex_position + offset;

    // Convert VRAM coordinates (0;1023, 0;511) into
    // OpenGL coordinates (-1;1, -1;1)
    float xpos = (float(position.x) / 512) - 1.0;
    // VRAM puts 0 at the top, OpenGL at the bottom,
    // we must mirror vertically
    float ypos = 1.0 - (float(position.y) / 256);
    // ...
}
Uniforms are inputs that are shared across all the instances of the shader. So
instead of having an offset vertex attribute with one entry per vertex we can
have a single variable that will be used for an entire batch of primitives.
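To sanity-check the arithmetic done by the shader, the same conversion can be reproduced on the CPU. The sketch below is not part of the renderer: `vram_to_ndc` is a hypothetical helper that mirrors the shader's formulas, with the drawing offset passed as a plain tuple instead of a uniform.

```rust
/// CPU-side mirror of the vertex shader arithmetic (illustration only).
/// Adds the drawing offset, then maps VRAM coordinates
/// (x in 0..1024, y in 0..512) to OpenGL coordinates in [-1, 1].
pub fn vram_to_ndc(x: i16, y: i16, offset: (i16, i16)) -> (f32, f32) {
    // Widen to i32 first so the addition cannot overflow an i16
    let px = (x as i32 + offset.0 as i32) as f32;
    let py = (y as i32 + offset.1 as i32) as f32;

    let xpos = px / 512.0 - 1.0;
    // VRAM puts 0 at the top, OpenGL at the bottom: mirror vertically
    let ypos = 1.0 - py / 256.0;

    (xpos, ypos)
}
```

The top-left corner of VRAM (0, 0) lands on (-1, 1) and the center (512, 256) lands on (0, 0), which matches the vertical mirroring described in the shader comments.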
To be able to modify the value of the uniform from our code we must retrieve
its index like we did for the vertex attributes. We can then set its value using
Uniform2i41 :
pub struct Renderer {
    // ...
    /// Index of the "offset" shader uniform
    uniform_offset: GLint,
}

impl Renderer {
        // Retrieve and initialize the draw offset
        let uniform_offset = find_program_uniform(program, "offset");

        unsafe {
            gl::Uniform2i(uniform_offset, 0, 0);
        }

        Renderer {
            // ...
            // ...
        }
We can now add a method to set the value of the uniform. We need to be
careful to draw the currently buffered primitives before we change the offset
since those were supposed to be drawn with the previous value and might end
up located at the wrong place:
41 The 2i part means that the function works on ivec2s, there are other Uniform* functions
impl Renderer {
    // ...
    /// Set the value of the uniform draw offset
    pub fn set_draw_offset(&mut self, x: i16, y: i16) {
        // Force draw for the primitives with the current offset
        self.draw();

        // Update the uniform value
        unsafe {
            gl::Uniform2i(self.uniform_offset,
                          x as GLint,
                          y as GLint);
        }
    }
}
Finally we can get rid of our drawing_x_offset and drawing_y_offset
member variables in the GPU and call set_draw_offset directly instead.
The fact that we have to force a partial draw every time the offset is changed
means that in pathological cases this might end up being slower. For instance
if a game draws thousands of triangles, changing the offset between each one,
we’ll issue thousands of partial draw commands. In this case it would probably
be faster to simply add the offset before we push the Positions in the attribute
buffer.
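The alternative mentioned above, adding the offset on the CPU before pushing the positions, could look roughly like the following. This is a sketch under assumptions: `Position` is reduced to a pair of `i16`, `OffsetBuffer` is a hypothetical stand-in for the attribute buffer, and the wrapping addition is just a convenient choice for the example, not a statement about how the hardware treats overflow.

```rust
/// Simplified stand-in for the renderer's Position type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Position(pub i16, pub i16);

/// Hypothetical buffer that applies the drawing offset when a vertex
/// is pushed, so changing the offset never forces a partial draw.
pub struct OffsetBuffer {
    offset: (i16, i16),
    positions: Vec<Position>,
}

impl OffsetBuffer {
    pub fn new() -> OffsetBuffer {
        OffsetBuffer { offset: (0, 0), positions: Vec::new() }
    }

    /// Changing the offset only affects the vertices pushed afterwards
    pub fn set_draw_offset(&mut self, x: i16, y: i16) {
        self.offset = (x, y);
    }

    /// Add the current offset and store the resulting position
    pub fn push(&mut self, p: Position) {
        let x = p.0.wrapping_add(self.offset.0);
        let y = p.1.wrapping_add(self.offset.1);
        self.positions.push(Position(x, y));
    }

    pub fn positions(&self) -> &[Position] {
        &self.positions
    }
}
```

The trade-off is one extra addition per vertex on the CPU in exchange for keeping every primitive in a single batch, whatever the game does with the offset.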
fn main() {
    let bios = Bios::new(&Path::new("roms/SCPH1001.BIN")).unwrap();

    // We must initialize SDL before the interconnect is created since
    // it contains the GPU and the GPU needs to create a window
    let sdl_context = sdl2::init(::sdl2::INIT_VIDEO).unwrap();

    loop {
        for _ in 0..1_000_000 {
            cpu.run_next_instruction();
        }

        // See if we should quit
        for e in event_pump.poll_iter() {
            match e {
                Event::KeyDown { keycode: KeyCode::Escape, .. } => return,
                Event::Quit { .. } => return,
                _ => (),
            }
        }
    }
}
When the Quit event is encountered (window closes, received SIGINT etc...)
we return from main, effectively exiting the program. For convenience I also quit
when the Escape key is pressed in the window.
The inner for loop is needed because checking for events before every
instruction slows everything down very significantly, so I only check once for
every million instructions executed.
/// Trait representing the attributes of a primitive addressable
/// memory location.
pub trait Addressable {
    /// Retrieve the width of the access
    fn width() -> AccessWidth;
    /// Build an Addressable value from an u32. If the Addressable is 8
    /// or 16 bits wide the MSBs are discarded to fit.
    fn from_u32(u32) -> Self;
    /// Retrieve the value of the Addressable as an u32. If the
    /// Addressable is 8 or 16 bits wide the MSBs are padded with 0s.
    fn as_u32(self) -> u32;
}
We can then implement this trait for u8, u16 and u32:
impl Addressable for u8 {
    fn width() -> AccessWidth {
        AccessWidth::Byte
    }

    fn from_u32(v: u32) -> u8 {
        v as u8
    }

impl Addressable for u16 {
    fn width() -> AccessWidth {
        AccessWidth::Halfword
    }

impl Addressable for u32 {
    fn width() -> AccessWidth {
        AccessWidth::Word
    }

    /// Memory read
    fn load<T: Addressable>(&self, addr: u32) -> T {
        self.inter.load(addr)
    }

    /// Memory write
    fn store<T: Addressable>(&mut self, addr: u32, val: T) {
        if self.sr & 0x10000 != 0 {
            // Cache is isolated, ignore write
            println!("Ignoring store while cache is isolated");
            return;
        }

        self.inter.store(addr, val);
    }
}
We can then replace the various load* and store* functions used in the
CPU code by the generic versions. Most of the time the compiler can’t infer the
type properly (since we’re casting all over the place to get the correct width and
sign extension) so we have to explicitly tell it which of u8, u16 or u32 to use.
For instance our LB implementation becomes:
impl Cpu {
    // ...
    /// Load Byte (signed)
    fn op_lb(&mut self,
             instruction: Instruction,
             debugger: &mut Debugger) {
        let i = instruction.imm_se();
        let t = instruction.t();
        let s = instruction.s();

        // Cast as i8 to force sign extension
        let v = self.load::<u8>(addr, debugger) as i8;

        // Put the load in the delay slot
        self.load = (t, v as u32);
    }
}
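The two casts at the end of LB are what perform the sign extension, so they are worth checking on their own. The helpers below are hypothetical names introduced for this illustration. They rely only on Rust's documented cast semantics: converting `u8` to `i8` keeps the bit pattern, and widening a signed integer to `u32` sign-extends it.

```rust
/// Sign-extend a byte to 32 bits, as LB does: go through i8 so the
/// widening cast replicates bit 7 into the upper 24 bits.
pub fn sign_extend_byte(b: u8) -> u32 {
    (b as i8) as u32
}

/// Zero-extend a byte to 32 bits, as LBU does: widening an unsigned
/// value pads the upper bits with zeroes.
pub fn zero_extend_byte(b: u8) -> u32 {
    b as u32
}
```

A byte with its top bit set, such as 0x80, becomes 0xffffff80 through the signed path and stays 0x00000080 through the unsigned one. This is why the explicit `::<u8>` and `as i8` annotations cannot be left to type inference.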
    /// Interconnect: load value at `addr`
    pub fn load<T: Addressable>(&self, addr: u32) -> T {
        let abs_addr = map::mask_region(addr);

        if let Some(offset) = map::DMA.contains(abs_addr) {
            return self.dma_reg(offset);
        }

        panic!("unhandled load at address {:08x}", addr);
    }
}
You can see that the Addressable::from_u32 function can be used to return
a literal value without having to know the real type being used.
The store function is pretty straightforward:
impl Interconnect {
    // ...
    /// Interconnect: store `val` into `addr`
    pub fn store<T: Addressable>(&mut self, addr: u32, val: T) {
        let abs_addr = map::mask_region(addr);

            return;
        }

            return;
        }

        panic!("unhandled store into address {:08x}: {:08x}",
               addr, val.as_u32());
    }
}
    /// Fetch the little endian value at `offset`
    pub fn load<T: Addressable>(&self, offset: u32) -> T {
        let offset = offset as usize;

        let mut v = 0;

        for i in 0..T::width() as usize {
            v |= (self.data[offset + i] as u32) << (i * 8)
        }

    /// Store the 32 bit little endian word `val` into `offset`
    pub fn store<T: Addressable>(&mut self, offset: u32, val: T) {
        let offset = offset as usize;

        for i in 0..T::width() as usize {
            self.data[offset + i] = (val >> (i * 8)) as u8;
        }
    }
}
The BIOS doesn’t have a store method since it’s read-only and we can reuse
the RAM’s load code without any change.
This looping and bit fiddling might seem a little under-optimized but LLVM
seems to handle it well and generates code which looks almost exactly like the
previous non-generic version. And we have less code duplication, so all is good.
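Since the listings above are shown piecemeal, here is a compact, self-contained sketch of how the pieces fit together. It is an approximation rather than the repository's code: the discriminant values of `AccessWidth` (1, 2 and 4 bytes) are an assumption inferred from the `T::width() as usize` cast, `as_u32` takes `&self` for convenience, and `Ram::new` with an arbitrary size is only there to make the example usable.

```rust
/// Access width in bytes (discriminant values assumed for this sketch)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AccessWidth {
    Byte = 1,
    Halfword = 2,
    Word = 4,
}

pub trait Addressable {
    fn width() -> AccessWidth;
    fn from_u32(v: u32) -> Self;
    fn as_u32(&self) -> u32;
}

impl Addressable for u8 {
    fn width() -> AccessWidth { AccessWidth::Byte }
    fn from_u32(v: u32) -> u8 { v as u8 }
    fn as_u32(&self) -> u32 { *self as u32 }
}

impl Addressable for u16 {
    fn width() -> AccessWidth { AccessWidth::Halfword }
    fn from_u32(v: u32) -> u16 { v as u16 }
    fn as_u32(&self) -> u32 { *self as u32 }
}

impl Addressable for u32 {
    fn width() -> AccessWidth { AccessWidth::Word }
    fn from_u32(v: u32) -> u32 { v }
    fn as_u32(&self) -> u32 { *self }
}

pub struct Ram {
    data: Vec<u8>,
}

impl Ram {
    pub fn new(size: usize) -> Ram {
        Ram { data: vec![0; size] }
    }

    /// Fetch the little endian value at `offset`
    pub fn load<T: Addressable>(&self, offset: u32) -> T {
        let offset = offset as usize;
        let mut v = 0u32;

        // Least significant byte is stored first
        for i in 0..T::width() as usize {
            v |= (self.data[offset + i] as u32) << (i * 8);
        }

        T::from_u32(v)
    }

    /// Store the little endian value `val` at `offset`
    pub fn store<T: Addressable>(&mut self, offset: u32, val: T) {
        let offset = offset as usize;
        let val = val.as_u32();

        for i in 0..T::width() as usize {
            self.data[offset + i] = (val >> (i * 8)) as u8;
        }
    }
}
```

Storing the word 0x12345678 at offset 0 and then loading a byte from offset 0 returns 0x78, which is a quick way to convince yourself that the byte order is little endian.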
        let r =
            match offset {
                0 => self.read(),
                4 => self.status(),
                _ => unreachable!(),
            };

        match offset {
            0 => self.gp0(val),
            4 => self.gp1(val),
            _ => unreachable!(),
        }
    }
}

    /// DMA register read
    fn dma_reg<T: Addressable>(&self, offset: u32) -> T {
        // ...

    /// DMA register write
    fn set_dma_reg<T: Addressable>(&mut self, offset: u32, val: T) {
        if T::width() != AccessWidth::Word {
            panic!("Unhandled {:?} DMA store", T::width());
        }
        // ...
    }
}
Now our code should build and behave exactly like it did before. On my
system the performance is the same as far as I can tell. This
more generic infrastructure will show its usefulness soon enough.
more easily (GPU debugging comes to mind). For this reason I’m just going to
describe the low level debugging interface in this guide, you’ll decide what kind
of frontend you want to build on top.
7.2 Breakpoints
Breakpoints are triggered when a certain instruction gets executed. The
instruction is identified by its memory address. We can store the breakpoint
addresses in a vector:
pub struct Debugger {
    /// Vector containing all active breakpoint addresses
    breakpoints: Vec<u32>,
}

    fn del_breakpoint(&mut self, addr: u32) {
        self.breakpoints.retain(|&a| a != addr);
    }
}
Finally we can implement the method pc_change that will be called before
every instruction to look for a breakpoint at the current address. Needless to
say this code is in a very critical path and must be as fast as possible:
impl Debugger {
// . . .
The debug method is where the debugging frontend should be notified that
the execution stopped and wait for the user to resume the execution.
Using a vector to store the breakpoints might seem sub-optimal since it has
linear lookup time. A tree-based collection could theoretically work in logarithmic
time. We have to consider two things however: we want to optimize for the
common case where no debugging is taking place and no breakpoint is set, and
even when we’re debugging we probably won’t be using thousands of breakpoints
simultaneously.
Iterating over an empty vector should be very cheap: a simple test of the
length of the vector and we exit the loop immediately. And even for small
non-empty vectors it will probably be faster than a more complex structure
(strong cache locality, no cache thrashing, no indirections, easy prefetching).
For these reasons I don’t think it’s necessary to bother using anything more
complicated than a good old vector, the constant cost probably matters more
than the linear complexity for our usage.
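To make the argument concrete, here is a minimal vector-backed breakpoint store. It is a simplified sketch, not the guide's Debugger: `new` and `add_breakpoint` are assumed to look like this, and `pc_change` returns a boolean instead of calling into the debugging frontend so that the example stays self-contained.

```rust
pub struct Debugger {
    /// Vector containing all active breakpoint addresses
    breakpoints: Vec<u32>,
}

impl Debugger {
    pub fn new() -> Debugger {
        Debugger { breakpoints: Vec::new() }
    }

    /// Add a breakpoint at `addr`. Does nothing if one is already set.
    pub fn add_breakpoint(&mut self, addr: u32) {
        if !self.breakpoints.contains(&addr) {
            self.breakpoints.push(addr);
        }
    }

    /// Delete the breakpoint at `addr`. Does nothing if none was set.
    pub fn del_breakpoint(&mut self, addr: u32) {
        self.breakpoints.retain(|&a| a != addr);
    }

    /// Called before every instruction. Returns true when execution
    /// should stop (simplification: the real method would hand control
    /// over to the debugging frontend instead).
    pub fn pc_change(&self, pc: u32) -> bool {
        // Linear scan: with an empty vector this is a single length test
        self.breakpoints.contains(&pc)
    }
}
```

In the common case where no breakpoint is set, `contains` returns immediately, which is the cheap path the paragraphs above are optimizing for.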
Finally we can plug pc_change in our CPU:
impl Cpu {
    // ...
        // Save the address of the current instruction to save in
        // `EPC` in case of an exception.
        self.current_pc = self.pc;
        // ...
    }

    pub fn pc(&self) -> u32 {
        self.pc
    }
}
I pass the debugger object from the main function in order to be able to start
a debugging session at the press of a key:
fn main() {
    // ...
    loop {
        for _ in 0..1_000_000 {
            cpu.run_next_instruction(&mut debugger);
        }

        // See if we should quit
        for e in event_pump.poll_iter() {
            match e {
                Event::KeyDown { keycode: KeyCode::Pause, .. } =>
                    debugger.debug(&mut cpu),
                Event::KeyDown { keycode: KeyCode::Escape, .. } => return,
                Event::Quit { .. } => return,
                _ => (),
            }
        }
    }
}
7.3 Watchpoints
Being able to break on a specific instruction is useful but sometimes we want to
know when a certain location in memory is loaded or modified. In order to do
that we can implement read and write watchpoints that will respectively check
each load and store address and trigger the debugger when a watched address is
encountered.
As for breakpoints we’ll store the watchpoint addresses in vectors:
pub struct Debugger {
    /// Vector containing all active read watchpoints
    read_watchpoints: Vec<u32>,
    /// Vector containing all active write watchpoints
    write_watchpoints: Vec<u32>,
}
The methods for adding, removing and testing the watchpoints will therefore
look very similar to the breakpoint implementation:
impl Debugger {
    // ...
    /// Delete write watchpoint at `addr`. Does nothing if there was no
    /// watchpoint set for this address.
    fn del_write_watchpoint(&mut self, addr: u32) {
        self.write_watchpoints.retain(|&a| a != addr);
    }
}
You can see that I put a few comments about unaligned access and regions,
I’m not entirely sure what’s the right thing to do here. I guess we’ll see how we
want the debugger to behave as we’re using it.
Now we just have to plug the memory_read and memory_write methods in our
generic load and store functions in the CPU:
impl Cpu {
    // ...
    /// Memory read
    fn load<T: Addressable>(&mut self,
                            addr: u32,
                            debugger: &mut Debugger) -> T {
        debugger.memory_read(self, addr);

        self.inter.load(&mut self.tk, addr)
    }

    /// Memory write
    fn store<T: Addressable>(&mut self,
                             addr: u32,
                             val: T,
                             debugger: &mut Debugger) {
        debugger.memory_write(self, addr);

        if self.sr.cache_isolated() {
            self.cache_maintenance(addr, val);
        } else {
            self.inter.store(&mut self.tk, addr, val);
        }
    }
}

        // Fetch instruction at PC
        let pc = self.current_pc;
        let instruction = Instruction(self.inter.load(pc));
        // ...
    }
}
Another problem is that you might be using this CPU load method in your
debugger to read the memory’s contents. Obviously you don’t want to recursively
trigger the debugger when you use it to read some memory location where a
watchpoint happens to live. Instead we can create another method used for
loading data for debugging purposes42 . I named this method examine:
impl Cpu {
    // ...
    /// Debugger memory read
    pub fn examine<T: Addressable>(&mut self, addr: u32) -> T {
        self.inter.load(&mut self.tk, addr)
    }
}
section 7.1.
is valid or not. When fetching an instruction if the tag is mismatched or the
entry is not valid it’ll have to be fetched from main memory, otherwise we can
directly use the cached value.
Let’s take a concrete example shown in table 9. Suppose the CPU wants to
run code from address 0x80005384. First we need to figure out which cacheline
matches this address. For that we need to shift the address four bits to the right
(since each cacheline holds four 32bit words, so 16 bytes) and then take the
8 LSBs (since we have 256 cachelines in total). In this case we end up in
cacheline number 56 (0x38).
Now that we have identified the cacheline we need to see if it already contains
data for the current address, after all any address ending in 0x38X will match
the same cache location. In order to do that we compare the tag stored in the
line with bits [31:12] of the instruction address, in this case 0x80005. If the tag
doesn’t match we consider it invalid and we have to fetch it from RAM.
If the tag is the one we’re looking for however we just have to check the valid
bit for the instruction we’re looking for. Bits [3:2] give us the location in the
4-word cacheline, bits [1:0] are always 0 since all instructions are word-aligned43
so in this case we’re looking for the 2nd word in the cacheline. If the valid bit is
set we can use it directly, otherwise the instruction is invalid and we must fetch
it from main RAM.
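The decomposition of the example address can be written down as a few shifts and masks. The sketch below only covers the address arithmetic, not the cache itself. `CacheAddress` and `decompose` are hypothetical names introduced for this illustration, and the bit positions follow from the layout described above: 16 bytes per cacheline, 256 cachelines, and the tag in bits [31:12].

```rust
/// Fields of an instruction address as seen by the instruction cache
#[derive(Debug, PartialEq)]
pub struct CacheAddress {
    /// Bits [31:12], compared against the tag stored in the cacheline
    pub tag: u32,
    /// Bits [11:4], selects one of the 256 cachelines
    pub line: usize,
    /// Bits [3:2], selects one of the 4 words in the cacheline
    pub word: usize,
}

pub fn decompose(addr: u32) -> CacheAddress {
    CacheAddress {
        tag: addr >> 12,
        line: ((addr >> 4) & 0xff) as usize,
        word: ((addr >> 2) & 3) as usize,
    }
}
```

For 0x80005384 this gives tag 0x80005, cacheline 0x38 (56) and word index 1, which is the 2nd word of the line since indexes start at 0.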
List of Tables
1 Playstation memory map
2 KSEG2 memory map
3 SCPH1001.BIN BIOS checksums
4 R3000 CPU general purpose registers
5 R3000 CPU special purpose registers
6 16 to 32bit conversion: influence of sign extension
7 Special cases in divisions
8 DMA Channel Control register description
9 Anatomy of cached address 0x80005384
List of Figures
1 OpenGL shaded RGB triangle
2 First output of our OpenGL renderer
3 Playstation boot logo without textures
4 Playstation boot logo with bad quad rendering