Design of a Minimal Processor on an FPGA
Andrey Akhmetov
October 1, 2015
Abstract
This document describes the development of a small processor using
the Verilog hardware description language, a software-based simulator, and
FPGA hardware. This processor was made as a learning exercise and is
not necessarily efficient, reliable, or compliant with best practices.
I would additionally like to thank the many members of the ##fpga
channel on the Freenode IRC network for their helpful tips and
expertise.
1 Disclaimer
This project was a learning exercise. The decisions made here
should not be considered best practice. The implementation is not
necessarily coherent, efficient, or useful for anything other than a small
introductory learning exercise.
2 Introduction
Much of the development of this device is based around a rather vague goal
of being able to execute machine code, and more specific constraints of the
available hardware and its properties.
3 Overall architecture
The processor is an eight-bit, non-pipelined design with in-order execution
only. While features such as out-of-order execution, superscalar pipelining,
and register renaming are useful for performance, they are considerably more
difficult to understand and implement than the more basic design here, and
consume more FPGA resources, such as circuitry to detect register access
conflicts.
Figure 1: A display of memory accesses on one of the two ports of any given
block RAM.
Due to instruction and addressing limitations (such as a code memory that is
bigger than the addressable data memory space) a Harvard architecture is used,
meaning that code is stored separately from data. Instructions that write to
program memory, which would break this separation, are under consideration.
However, it is more likely that new code will be uploaded through an external
interface such as GPIOs.
4 Memory accesses
An eight-bit architecture was chosen due to the estimated number of
implemented opcodes, and the available memory access patterns. While the actual
memory width is nine bits, the fashion in which memory is initialized in
Verilog separates one of the bits from the rest and packs the ninth bits of
different words into a single source code character, making programming by hand
significantly more difficult. The memory access pattern of 8-bit data by 2048
different possible addresses allows for a reasonable 16 Kbit (2048 bytes) of
program memory, more than sufficient for the small programs necessary for the
learning purposes of this project.
The block RAMs on the FPGA are dual-ported, synchronous SRAMs, where
the widths of the two ports need not be the same. On each clock cycle, each
port performs either a read or a write, selected by its write-enable input,
as shown in Figure 1. When the RAM is being used in dual-ported mode, two
memory accesses (reads, writes, or one of each) can happen on each clock cycle.
Simultaneous writes to the same address from the two ports risk corrupting the
data, and are thus avoided in the design.
The behavior of the read output when writing can be configured in software,
to either pass through the new value to the output, return the stored value (get
and set), or continue returning the result of the previous read operation. This
is unimportant since there are no get-and-set scenarios in the current design.
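The three configurable read-during-write behaviors can be illustrated with a small behavioral model. This is only a sketch: the class name and the mode names `write_first`, `read_first`, and `no_change` are labels chosen here for illustration, and the model ignores port widths and parity bits.

```python
class BlockRamPort:
    """Behavioral sketch of one synchronous block RAM port (not vendor-accurate)."""

    MODES = ("write_first", "read_first", "no_change")

    def __init__(self, memory, mode="read_first"):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.memory = memory  # list of words, may be shared with a second port
        self.mode = mode
        self.dout = 0         # registered read output

    def clock(self, addr, write_enable=False, din=0):
        """Simulate one rising clock edge and return the read output."""
        if write_enable:
            old = self.memory[addr]
            self.memory[addr] = din
            if self.mode == "write_first":
                self.dout = din   # new value passes through to the output
            elif self.mode == "read_first":
                self.dout = old   # previously stored value (get and set)
            # "no_change": the output keeps the result of the last read
        else:
            self.dout = self.memory[addr]
        return self.dout
```

Two `BlockRamPort` objects built on the same list approximate dual-ported use, one access per port per clock.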
4.1 Program memory
Due to the dual-ported nature of the program memory, it is possible to perform
two 8-bit reads at the same time. This lends itself especially well to fetching an
8-bit instruction and an 8-bit parameter (written in assembly languages
in a form similar to INSTR imm8). This would not have been guaranteed with a
single-ported 16-bit memory, since instructions are aligned on 8 bits, so an
instruction and its parameter may cross a 16-bit boundary.
The availability of a very efficient hardware adder that can be used on the
address lines of the second port allows for one register, PC (short for Program
Counter), to be used to determine the address of the next instruction to fetch,
as well as the address of its parameter. The first port reads the opcode into a set
of 8 FPGA registers, and the second port simultaneously reads the parameter
into a second set of 8 FPGA registers, on the clock rising edge. If the parameter
is not needed, later logic will ignore it.
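The dual-port fetch can be sketched as follows. The function name is hypothetical, and wrapping PC + 1 around at the end of the 2048-entry memory is an assumption; the text does not say what happens at the last address.

```python
ADDR_BITS = 11                     # 2048 program memory addresses
ADDR_MASK = (1 << ADDR_BITS) - 1


def fetch(program, pc):
    """Return (opcode, parameter) as read by the two ports on one clock edge."""
    opcode = program[pc & ADDR_MASK]        # first port: address PC
    param = program[(pc + 1) & ADDR_MASK]   # second port: address PC + 1
    return opcode, param
```

Both values are always read; an instruction without a parameter simply ignores the second one.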
4.3 Registers
The processor contains 16 registers (the reason being discussed in the Opcodes
section), named by hexadecimal 0 through F. Some of these registers are paired
together for a specific feature, but may be used as general purpose registers.
Registers 0 and 1 are paired to be the accumulator, A (0 being the low byte), 2
and 3 are the product/comparison operand, X, and 4 and 5 are an address, AD
(used for jumps and memory addressing). Due to addresses being 11 bits, only
the 11 lowest bits of register pair 4-5 are considered when jumping or
addressing. See Table 1.
The hardware implementation is currently a register file in a block RAM that
is dual-ported, and 16 bits wide (18 bits including two unused parity bits). Up to
two register fetches can occur on a single clock cycle, and a single register write
can occur on the next cycle (the second bank’s write enable is always off since
there are no opcodes requiring more than one write, but this is flexible). The
data width is 16 bits so that operations that need to fetch wide registers such
as A, X, or AD, can do so using one of the ports, and fetch a parameter using
the other. If an 8-bit register needs to be fetched, the 16-bit block containing it
is fetched and connected to a multiplexer switching on the LSB of the register
number. The half being used is connected combinatorially to the ALU’s logic
to calculate the result, and the other half is written back “as is” (if it is not,
then the neighboring register will be zeroed on any write).
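The byte selection and write-back described above can be sketched as below. The function names are hypothetical. The assumption that the even-numbered register occupies the low byte of each 16-bit word is generalized from the accumulator, where register 0 is the low byte.

```python
def read_reg8(words, reg):
    """Fetch the 16-bit word holding `reg` and select one half by the LSB."""
    word = words[reg >> 1]
    return (word >> 8) & 0xFF if reg & 1 else word & 0xFF


def write_reg8(words, reg, value):
    """Write an 8-bit result, writing the other half of the word back as is."""
    word = words[reg >> 1]
    if reg & 1:
        word = ((value & 0xFF) << 8) | (word & 0x00FF)  # keep the low half
    else:
        word = (word & 0xFF00) | (value & 0xFF)         # keep the high half
    words[reg >> 1] = word
```

Without the masked write-back, each 8-bit write would zero the neighboring register.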
Table 1: Registers and their features
ID     Feature
0, 1   Accumulator
2, 3   Product/comparison
4, 5   Address
6      General-purpose only
7      General-purpose only
8      General-purpose only
9      General-purpose only
A      General-purpose only
B      General-purpose only
C      General-purpose only
D      General-purpose only
E      General-purpose only
F      Hardware write
To increase performance (at the expense of FPGA resources) the module can
be easily replaced with one that stores values in distributed RAM (presumably
within SLICEM resources on the FPGA), using multiplexers for fetching and a
4-to-16 decoder connected to the write-enable of one or more of the registers.
This eliminates the need to wait for a clock edge to fetch inputs (as is needed
with SRAM), but also eliminates the possibility of writing with a second port
unless a set of multiplexers is added to the write circuitry, further increasing
utilization of the FPGA resources.
4.4 PTR
Due to memory timing constraints and the limited number of read/write ports
on the block RAMs, a separate 8-bit-long pointer (denoted PTR) can be used
for addressing memory dynamically.
5 Opcodes
The choice of available operations and how they map to opcodes was based
largely around the features that needed to be made available, and the widths
of the opcodes and parameters. One clock cycle is required to fetch an opcode,
which can be decoded during the clock period (as it is available at the start
of the clock period). Illegal instructions cause the processor to halt, at which
point the debug switch can be used to inspect the current opcode and program
counter. This decision was made as it is not known whether a currently-unused
opcode might take a parameter in a future version, and simply incrementing the
program counter and continuing would cause drastic issues even if the corrupt
or skipped instruction itself was not at fault.
Many of the choices surrounding assigning operations to opcodes were made
for efficiency. Where possible, single bits of the opcode are used to multiplex
between various possible outputs or results, rather than having to use more
expensive lookup table or matching logic.
Table 2: LDC opcode
Instruction   Destination
0 0 0 1       A B C D
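The LDC layout in Table 2 can be decoded with a shift and a mask. The sketch below only splits the opcode into its two fields; the function name is hypothetical.

```python
def decode_ldc(opcode):
    """Return the destination register of an LDC opcode, or None otherwise."""
    if (opcode >> 4) != 0b0001:   # upper four bits select the instruction
        return None
    return opcode & 0x0F          # lower four bits (A B C D) name the register
```

Since four bits name the destination, the encoding covers all 16 registers with one comparison of the upper nibble.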
Some instructions that perform such operations reference the accumulator,
a 16-bit representation of registers 0 and 1 (with register 0 being the low byte).
For example, instruction ADDA (0x41) adds the value of an 8-bit register to the
16-bit accumulator. The register file memory is configured so that the old value
is still maintained on the output during a write, so either of the accumulator’s
component registers can be referenced without risk of data corruption due to
the value of the adder’s combinatorial output changing on the clock edge.
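As a worked example of ADDA, the sketch below adds an 8-bit register value to the 16-bit accumulator. The text does not describe how the carry flag is set, so carry out of bit 15 is an assumption here, as is the function name.

```python
def adda(acc, reg_value):
    """Add an 8-bit value to the 16-bit accumulator; return (result, carry)."""
    total = (acc & 0xFFFF) + (reg_value & 0xFF)
    return total & 0xFFFF, total > 0xFFFF   # carry rule is an assumption
```

For example, adding 0x01 to an accumulator of 0x00FF carries into register 1 (the high byte) and gives 0x0100.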
Table 4 (continued)
Opcode (hex)  Parameter (hex)  Mnemonic  Action
4B            lreg,rreg        OR        Performs lreg OR rreg, writes result to lreg.
4C            lreg,rreg        MOV       Writes value of rreg to lreg.
4D-4F         none             -         Reserved
50            lreg,rreg        MUL       Sets register X to register lreg × register rreg.
51-5F         none             -         Reserved
60            none             JP        Jumps unconditionally.
61            none             JPN       Reserved, with no effect.
62            none             JPC       Jump if carry flag set.
63            none             JPNC      Jump if carry flag not set.
64            none             JPEQ      Jump if A = X.
65            none             JPNEQ     Jump if A ≠ X.
66            none             JPZ       Jump if A = 0.
67            none             JPNZ      Jump if A ≠ 0.
68-6F         none             -         Reserved
70            ignore,reg       PSR       Switch memory page to 3 lowest bits of register reg
71            8-bit parameter  PSC       Switch memory page to 3 lowest bits of param
72            8-bit parameter  PTRC      Set PTR to param
73            ignore,reg       PTRR      Set PTR to value of reg
74            ignore,reg       WRPTR     Write value of reg to RAM address at PTR
75            ignore,reg       RDPTR     Read from RAM address at PTR into reg
80            ignore,reg       INCR      Increment reg
81            ignore,reg       DECR      Decrement reg
82            none             INCP      Increment PTR
83            none             DECP      Decrement PTR
90            none             CALL      Call unconditionally.
91            none             CALLN     Reserved, with no effect.
92            none             CALLC     Call if carry flag set.
93            none             CALLNC    Call if carry flag not set.
94-95         none             -         Reserved (footnote 3)
96            none             CALLZ     Call if A = 0.
97            none             CALLNZ    Call if A ≠ 0.
98-9F         none             -         Reserved
A0            none             RET       Return (call stack)
A1-FD         none             -         Reserved
FE            none             WRLED     No effect, illegal instruction
FF            none             WRSEG     Write register F to 7-segment display
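The jump opcodes 60 to 67 show how single opcode bits can act as multiplexer selects. The decoding below is inferred from the opcode numbering in the table, not taken from the Verilog source: bits 2:1 pick the condition and bit 0 inverts it. The function name and argument names are hypothetical.

```python
def jump_taken(opcode, carry, a, x):
    """Decide whether a jump opcode in the range 0x60-0x67 is taken."""
    if not 0x60 <= opcode <= 0x67:
        raise ValueError("not a jump opcode")
    select = (opcode >> 1) & 0b11   # 0: always, 1: carry, 2: A = X, 3: A = 0
    invert = bool(opcode & 1)       # odd opcodes negate the condition
    condition = (True, bool(carry), a == x, a == 0)[select]
    return condition != invert
```

Under this reading, opcode 61 (JPN) is the inverse of "always" and so never jumps, which matches its description as having no effect.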
Figure 2: The timing diagram describing the operation of the FSM in the ALU.
6 ALU/CU
The key component of this project is the ALU (arithmetic and logic unit),
implemented as a Verilog module. Currently the program, data, and register
memory modules are included inside the ALU at the Verilog source level, but
they can be easily moved outside if I decide to interface with external
memory using a simple synchronous address/data bus. The ALU contains its
own control circuitry, as a result of my misunderstanding the organization and
naming of parts of modern processors when initially implementing it.
The timings of the memory accesses and computation were key in planning
out the ALU, which is implemented as a state machine. Each operation is
executed in three states (see 3). Each state corresponds to one clock cycle. The
ALU begins in an initialization state, with the program counter set to zero.
On a rising clock edge, as the opcode is made available on the RAM read port,
the ALU transitions into state DECODE, where the opcode is examined in a
series of if-else statements. A case statement is not used, since large blocks
of opcodes (for example, LDC) are treated the same way and a comparison of the
upper 4 bits suffices for that check. The register IDs and RAM addresses that
need to be fetched are written to the appropriate address lines on the block
RAMs.
At the following transition, state EXEC begins. The program counter is
updated and the register values are read and processed by a combinatorial circuit
based on the opcode, and the output is connected to the write data lines on the
block RAM. Write-enable is activated as appropriate, and on the next clock edge,
the data is written and the WRITE state is entered. At this point, write-enable
is deactivated, and the write is finished. For an unknown reason the synthesis
tool does not respect setup times on the PC and PC+1 values set on the address
lines of the program memory, so this clock cycle also serves to satisfy that
setup time. At the next clock edge, the ALU transitions back into the DECODE
state for the next instruction.
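The state sequence can be sketched as a table-driven state machine with one transition per clock edge. The state name INIT is a label chosen here for the initialization state; halting on illegal instructions and the extra workaround cycles are left out.

```python
from enum import Enum


class State(Enum):
    INIT = 0
    DECODE = 1
    EXEC = 2
    WRITE = 3


# One transition per rising clock edge.
NEXT_STATE = {
    State.INIT: State.DECODE,
    State.DECODE: State.EXEC,
    State.EXEC: State.WRITE,
    State.WRITE: State.DECODE,
}


def run(cycles):
    """Return the names of the states entered over `cycles` clock edges."""
    state = State.INIT
    trace = []
    for _ in range(cycles):
        state = NEXT_STATE[state]
        trace.append(state.name)
    return trace
```

After the first edge out of initialization, every instruction occupies one DECODE, one EXEC, and one WRITE cycle.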
1 This is implemented by not incrementing the PC value. A button used to resume execution
address.
3 For an optimization to better utilize both block RAM ports. Unexpected results (compares
AD to A instead)
7 Call stack
The opcodes within the 9Xh block are used to call a subroutine, using a hardware
stack. The primary deciding factor for its size was the register file and the block
RAM that it needed to fit in. Because the 16 eight-bit registers take up 128
bits of space, and 16384 bits are available in a single block RAM without
additional write-enable and multiplex logic, the call stack is
16384 / 128 = 128 levels deep. In the case of a stack overflow, the processor
will stop for debugging. To keep calls down to only a few clock cycles,
registers are not cleared on a call, so the values last written at a given
stack depth remain there. This is most evident in calling a routine that writes
to a register, returning from it, and calling a routine that reads that same
register without first writing anything to it.
This is not an issue as the RAM can be used to pass parameters and
return values. Additionally, existing programming languages already treat
certain memory locations as holding garbage until initialized: C leaves
uninitialized local variables indeterminate, and Java refuses to compile a read
of a local variable that has not been assigned. Notably, CPU architectures
such as Itanium have special values describing uninitialized or garbage values.[4]
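The sizing argument and the stale-value behavior can be sketched with one register bank per stack level. This is one reading of the description, not the actual implementation: the class and method names are hypothetical, return-address storage is omitted, and counting the top level as one of the 128 banks is an assumption.

```python
BLOCK_RAM_BITS = 16384
REGISTERS = 16
REGISTER_BITS = 8
BANK_BITS = REGISTERS * REGISTER_BITS       # 128 bits per register set
STACK_DEPTH = BLOCK_RAM_BITS // BANK_BITS   # 128 levels


class WindowedRegisters:
    """Sketch of a register file with one bank per call depth."""

    def __init__(self):
        self.banks = [[0] * REGISTERS for _ in range(STACK_DEPTH)]
        self.depth = 0

    def call(self):
        if self.depth + 1 >= STACK_DEPTH:
            raise OverflowError("call stack overflow: processor halts")
        self.depth += 1   # nothing is cleared, so old values remain

    def ret(self):
        if self.depth == 0:
            raise IndexError("call stack underflow: processor halts")
        self.depth -= 1

    def read(self, reg):
        return self.banks[self.depth][reg]

    def write(self, reg, value):
        self.banks[self.depth][reg] = value & 0xFF
```

A routine called at a given depth sees whatever the previous routine at that depth left behind, which is the behavior described above.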
8 Code upload
Code upload was done from a Raspberry Pi due to limited available hardware,
and was implemented as a serial protocol. Currently, the protocol is a very
simple bit-banged protocol with minimal error correction:
An acknowledgement is sent over the data line after each byte, and acts
as a response to the parity sent by the FPGA. If the parity is incorrect or a
framing issue occurred, a low value (NACK) is sent. If the parity and framing
are correct, a high value (ACK) is sent instead.
The READY signal acts as a minimal form of framing, to allow a misalign-
ment error to be handled by sending low signals until the READY signal goes
low itself, followed by another low signal (NACK) to prevent the byte from
being written.
The parity allows for basic detection of single-bit errors (in which case a
negative acknowledgement is sent and the byte is retransmitted).
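The acknowledgement decision can be sketched as below. The parity convention is not stated in the text, so even parity over the eight data bits is an assumption, and the function names are hypothetical.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value (assumed convention)."""
    return bin(byte & 0xFF).count("1") & 1


def acknowledge(byte, received_parity, framing_ok=True):
    """Return 1 (ACK) if parity and framing are correct, else 0 (NACK)."""
    if framing_ok and parity_bit(byte) == received_parity:
        return 1
    return 0
```

Flipping any single data bit changes the computed parity, so the receiver answers with a NACK and the sender retransmits. Two flipped bits would go unnoticed, which is why the scheme is minimal.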
9 Practical implementation result
9.1 Resource utilization and statistics
The processor, when implemented, has requirements that are easily satisfied by
the Spartan-3E FPGA, which is a fairly small and low-cost device. These figures
do not include tangential functionality such as the code upload circuitry.
                        My implementation                                     Picoblaze
Architecture            8-bit                                                 8-bit
Instruction store size  2048                                                  1024
ALU width               Mixed (8-bit with 16-bit accumulator and comparison)  8-bit
Registers               16*8-bit                                              16*8-bit
IOs                     8 status LEDs, byte display (footnote 4)              256 in, 256 out
Call stack              None (footnote 5)                                     31-position
Cycles per instruction  2 theoretical (footnote 6)                            2
FPGA slices             501                                                   96
• Timing: Reads from the block RAM were found to fail under some
circumstances.
• Debuggability: No provision has so far been implemented for a debugger
interface other than LEDs for internal flags and the 7-segment display
through opcode FF.
• Missing opcodes and features: Certain opcodes and features were
missing in the first version of this learning exercise.
9.3 Timing
In implementing opcodes that read from both register file ports in the read
stage (specifically MOV), it was found that the FPGA implementation was
failing to read the register file using port B, while port A would succeed.
Furthermore, this issue could not be reproduced in behavioral and post-synthesis
simulation. It appears from discussions with other FPGA programmers that
this may be a bug in the ISE software used. For this reason, an extra clock
cycle was added as a workaround to a few different operations.
The current clock frequency is 50 MHz for testing, but timing constraints
with frequencies as high as 100 MHz were satisfied, although specialized profiles
such as physical synthesis and lengthy place-and-route runs had to be used to
reach this goal. The highest clock frequency reached was 101.033 MHz.
9.4 Debuggability
Due to limited I/O resources, debugging is done by using 8 LEDs and a pair of
7-segment displays that are wired to display one byte. By using multiplexers
and a few switches, the seven-segment display can be toggled between displaying
program output (using opcode FF), the 8 lowest bits of the program counter,
or the current opcode executing.
Additionally, a separate switch was used as a clock-enable to pause and
resume execution. The LEDs display state such as operation, paused operation,
jump flag, and carry flag.
In case of an invalid opcode, or stack overflow/underflow during call/return,
respectively, the processor “locks up” by remaining in the same state with an
unchanging program counter. This allows for easy debugging by using some
of these features to probe the processor’s state. Additionally, with a VGA
controller implemented as a separate project, it is possible for an error to cause
a jump into debugging code in a separate block RAM (using a multiplexer)
or a separate location in the current block RAM, in which case the debugger
can print the state of the processor including registers, type of error, and other
information. However, extra instructions will need to be added to fully support
debugging, including ones to read the program counter and possibly format a
byte to ASCII hex in hardware.
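The display multiplexing described at the start of this section amounts to selecting one of three byte sources. In the sketch below the selector values are hypothetical names; the actual switch encoding is not given in the text.

```python
def display_byte(select, program_output, pc, opcode):
    """Choose the byte shown on the pair of 7-segment displays."""
    sources = {
        "output": program_output & 0xFF,  # value written with opcode FF
        "pc": pc & 0xFF,                  # 8 lowest bits of the program counter
        "opcode": opcode & 0xFF,          # opcode currently executing
    }
    return sources[select]
```

Because the program counter is 11 bits wide, the display shows only part of it: a counter value of 0x7AB appears as AB.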
References
[1] PicoBlaze 8-bit Embedded Microcontroller User Guide. UG129. Rev. 2.0.
Xilinx Inc. June 2011.
[2] Spartan-3E FPGA Family. DS312. Rev. 4.1. Xilinx Inc. July 2013.
[3] Spartan-3 Generation FPGA User Guide. UG331. Rev. 1.8. Xilinx Inc. June
2011.