Simple CPU
Modern CPUs are complex beasts, highly optimised and tricky to understand. This makes it very difficult to see why they were constructed in the way they were. Part of the
problem is the requirement for backwards compatibility i.e. a new processor has to be able to run code from the previous generations. This incremental development
can result in a very confusing instruction set and 'cluttered' hardware architecture. Also, when you do start to dig down into the literature a lot of things are not
fully disclosed i.e. to protect those industrial secrets. However, at their hearts all processors are simple machines and in some respects have not changed that much
since the 1940s i.e. they still use instructions and perform a Fetch-Decode-Execute cycle. Therefore, to prove that processors are actually very simple to build and
understand I developed a simple 8bit CPU architecture that can be implemented in an FPGA. To be honest it wasn't really designed, it evolved, therefore, the
hardware could be optimised quite a bit. However, the aim was to break the processor down into its fundamental building blocks i.e. Boolean logic gates, then
combine these to form more complex components e.g. adders, multiplexers, flip-flops, counters etc, which are at the heart of any computer. The basic block
diagram of this computer is shown in figure 1, a very simple machine, made from registers, multiplexers and an adder. The operation of this machine and its
components was discussed in Lectures from a top level view point. To give a different point of view I'm now going to explain its operation from the bottom up. This
processor will be implemented in a Spartan 3 FPGA, its hardware defined in schematics. Each schematic can be downloaded and simulated using the Xilinx ISE
ISim tool.
Logic: every block within the computer can be considered to be made from Boolean logic gates, however, this category refers to specific, larger logic blocks e.g.
adders, address decoders, instruction decoders etc.
Multiplexers: from one point of view a computer just moves information from one point to another. Controlling the path taken by this information are
multiplexers, switching junctions, allowing information to be passed between functional blocks.
Registers: fast, short term memory. As part of the Fetch-Decode-Execute cycle a computer needs to remember its state, the instruction to be processed and any
results generated.
Memory: this computer uses a classic Von Neumann architecture i.e. one memory, storing both the program (instructions) to be executed and the data to be
processed in the same memory device.
This processor has three multiplexers (MUX) controlling the data and address buses. Multiplexers are switches allowing the processor to select information from
multiple data sources and route it to a single destination. To select which data source should be used a multiplexer has one or more control lines, as shown in figure 2.
This 2:1 MUX has two data inputs (A, B), one output (Z) and a select input (SEL), choosing which one of the two inputs is connected to its output. This
hardware's operation is defined by the truth table shown in figure 2. When SEL=0 input A is connected to the output Z. When SEL=1 input B is connected to the
output Z.
Figure 2 : 2:1 bit multiplexer, circuit diagram (top), truth table (bottom)
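The behaviour of this 2:1 multiplexer can also be written as a few lines of Python. This is only a rough software model, assuming the usual gate-level form Z = (A AND NOT SEL) OR (B AND SEL); the function name mux2_1 is made up for this sketch and the schematic itself may be wired differently.

```python
def mux2_1(a: int, b: int, sel: int) -> int:
    """2:1 one-bit multiplexer: Z = A when SEL=0, Z = B when SEL=1."""
    not_sel = sel ^ 1                  # inverter on the SEL line
    return (a & not_sel) | (b & sel)   # two AND gates feeding an OR gate (assumed wiring)


# Print the truth table: SEL, A, B -> Z
for sel in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            print(sel, a, b, "->", mux2_1(a, b, sel))
```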
https://www-users.cs.york.ac.uk/~mjf/simple_cpu/index.html 1/19
12/26/2017 Simple CPU
This multiplexer can select between two bits, however, within the processor we need to select between 8bit buses. To achieve this the multiplexer is replicated eight
times i.e. one per bit in the bus, as shown in figure 3. Note, each mux2_1_1 rectangle (symbol) contains the circuit shown in figure 2. To save space I have only
shown the first three multiplexers, a full circuit diagram is available here: (Link ). The circuit symbol for this 8bit multiplexer is shown in figure 4, its interface has
three 8bit buses (thick lines) and one signal (thin line):
Figure 3 : 2:1 byte multiplexer circuit diagram (first three stages only)
In addition to the 2:1 multiplexers shown in figure 1 the ALU also needs a 4:1 multiplexer (discussed later) i.e. a multiplexer that has four inputs and one output. This
can be constructed from three 2:1 byte multiplexers, as shown in figure 5. Note, each mux2_1_8_v1 rectangle (symbol) contains the circuit shown in figure 3. The
Xilinx schematics and symbols for these multiplexers can be downloaded here: (Link ).
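As a rough Python model of the same idea, the byte multiplexer below repeats a one-bit 2:1 multiplexer eight times, and the 4:1 version is built from three of these. The function names and the input ordering (SEL1 SEL0 = 00 selects A, 01 selects B, 10 selects C, 11 selects D) are assumptions made for this sketch, not taken from the schematics.

```python
def mux2_1_8(a: int, b: int, sel: int) -> int:
    """2:1 byte multiplexer: eight one-bit multiplexers sharing one SEL line."""
    z = 0
    for bit in range(8):
        a_bit = (a >> bit) & 1
        b_bit = (b >> bit) & 1
        z_bit = (a_bit & (sel ^ 1)) | (b_bit & sel)   # one-bit 2:1 mux
        z |= z_bit << bit
    return z


def mux4_1_8(a: int, b: int, c: int, d: int, sel0: int, sel1: int) -> int:
    """4:1 byte multiplexer built from three 2:1 byte multiplexers."""
    low = mux2_1_8(a, b, sel0)         # first stage: choose between A and B
    high = mux2_1_8(c, d, sel0)        # first stage: choose between C and D
    return mux2_1_8(low, high, sel1)   # second stage: choose between the two


print(mux4_1_8(0x11, 0x22, 0x33, 0x44, sel0=0, sel1=1))   # selects input C
```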
All number crunching is performed in the Arithmetic and Logic Unit (ALU), which implements the processor's main arithmetic functions. To implement an ADD function, one of the
core instructions of any computer, there are a number of different hardware solutions, each having advantages and disadvantages. However, to avoid an in-depth
discussion of their relative merits I'm going to stick with the basic half adder, as shown in figure 6. This circuit adds together two bits A and B, producing a Sum
and Carry. Working through the truth table you can see the addition process 0+0=0, 0+1=1, 1+0=1. As this is a binary (base 2) machine, the maximum value any
digit can store is 1, therefore, when A=B=1 the Sum output cannot represent the result of 2, so a Carry is generated. This Carry would then be added to the next digit in
the number.
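The half adder's truth table can be reproduced with a very small Python sketch, using the standard form in which the Sum is A XOR B and the Carry is A AND B. The function name half_adder simply mirrors the schematic symbol; the code is an illustration, not the schematic itself.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two one-bit values, returning (sum, carry)."""
    return a ^ b, a & b   # XOR gate gives the Sum, AND gate gives the Carry


# Print the truth table: A, B -> Carry, Sum
for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print(a, b, "->", c, s)
```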
On its own a half adder is not that useful as we need to add together 8bit numbers, but it can be used as the building block for a full adder, as shown in figure 7.
Note, each half_adder rectangle (symbol) contains the circuit shown in figure 6. This circuit can add together three bits: A,B and Cin. Another way to think about
this circuit is that it counts the number of 1's:
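A minimal Python sketch of this idea is given below, assuming the common construction of two half adders plus an OR gate to merge their carries (the exact wiring in figure 7 may differ). Reading Cout and Sum together as a 2bit number gives the count of inputs that are 1.

```python
def half_adder(a: int, b: int) -> tuple[int, int]:
    """Add two one-bit values, returning (sum, carry)."""
    return a ^ b, a & b


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Add three one-bit values, returning (sum, cout)."""
    s1, c1 = half_adder(a, b)      # first half adder: A + B
    s2, c2 = half_adder(s1, cin)   # second half adder: partial sum + Cin
    return s2, c1 | c2             # OR gate merges the two carries (assumed wiring)


# A, B, Cin -> Cout, Sum and the number of 1s on the inputs
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            print(a, b, cin, "->", cout, s, "ones:", 2 * cout + s)
```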
Again, at first glance the ability to add three bits together may not seem that useful, however, if we replicate this hardware eight times, connecting the Carry out
(Cout) of the previous stage to the Carry in (Cin) of the next, we can build a Ripple adder, as shown in figure 8. Note, again each rectangle (symbol) contains a
sub-circuit, in this case full_adder as shown in figure 7. To save space I have only shown the first three full adders, a full circuit diagram is available here: (Link ). An
alternative way of considering this hardware is the pseudo code shown in figure 9. Each full adder is a modulo 2 adder. Conceptually the addition process starts with
the least significant digit (LSD) and 'ripples' through the hardware to the most significant digit (MSD) i.e. bits X0 and Y0 are added together to produce a Sum Z0
and a Carry C1, this Carry is then added to the next significant digits X1 and Y1 etc. This sequential behaviour does limit the hardware's performance, but don't
forget that the hardware associated with each digit's addition is all working in parallel. Using decimal numbers as an analogy, best case performance: 987+12=999, no carries would be generated, the
additions of 9+0, 8+1 and 7+2 would all be performed in parallel and complete in one unit of time. However, worst case performance: 999+1=1000, this would take
four units of time owing to the carries having to ripple through the hardware from the LSD to the MSD. Note, it's silly to say, but important to remember, that
hardware is not software. When analysing hardware some elements may have sequential behaviour, but ALL logic gates will be working in parallel.
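In the same spirit as the pseudo code, the ripple adder can be sketched in Python as a loop over the eight bit positions, passing each stage's Carry out to the next stage's Carry in. This is a software model only, so the loop runs sequentially, whereas in hardware all eight full adders exist at once. The function names are chosen for this sketch.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Add three one-bit values, returning (sum, cout)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_adder(x: int, y: int, cin: int = 0, width: int = 8) -> tuple[int, int]:
    """Add two unsigned values bit by bit, returning (z, cout)."""
    z = 0
    carry = cin
    for i in range(width):   # bit 0 is the least significant digit
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        z |= s << i
    return z, carry


print(ripple_adder(123, 45))   # sum fits in 8 bits, no final carry
print(ripple_adder(255, 1))    # carry ripples through every stage
```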
The top level circuit symbol for the ripple adder is shown in figure 10 (containing the circuit shown in figure 8), this symbol's interface has three 8bit buses (thick
lines) and two signals (thin lines):
The complete ALU circuit diagram is shown in figure 11. With a few small modifications the ripple adder can be used to implement a subtraction function. The
subtraction hardware can be implemented using 2s complement i.e. subtraction by the addition of negative numbers. To generate a negative number each bit is
inverted and then 1 is added to the result, as shown in the example below.
To invert each bit an array of XOR logic gates is used, bitwise_inv_v1, as shown in figure 12. This circuit has an 8bit input bus (A), each bit of which is XORed with the
signal EN. XORing a bit with 0 returns the same value. XORing a bit with 1 returns the inverse of that value. To add 1 the Carry-In (Cin) signal to the ripple adder
is set to 1, incrementing the final result. In the ALU this is controlled using signals S2 and S3. The ADD and SUB functions can be simulated using the Xilinx ISim
software tools, as shown in figure 13. In this waveform diagram the calculations 123+45 and 123-45 are performed. Remember, when you use a 2s complement
representation in a calculation the final Carry-out is ignored. The Xilinx schematics, symbols and VHDL testbench for this ALU can be downloaded here: (Link ).
A B Z
0 0 0
0 1 1
1 0 1
1 1 0

A XOR 0 = A
A XOR 1 = NOT A
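The 2s complement subtraction described above can be checked with a short Python sketch. The function bitwise_inv follows the XOR-with-EN behaviour of bitwise_inv_v1; add8 is a stand-in for the ripple adder that uses Python's + operator as a shortcut, and sub8 is a name invented for this example.

```python
def bitwise_inv(a: int, en: int) -> int:
    """XOR every bit of the 8bit value A with EN: EN=0 passes A, EN=1 inverts A."""
    return a ^ (0xFF if en else 0x00)


def add8(a: int, b: int, cin: int) -> tuple[int, int]:
    """Stand-in for the 8bit ripple adder, returning (z, cout)."""
    total = a + b + cin
    return total & 0xFF, total >> 8


def sub8(a: int, b: int) -> tuple[int, int]:
    """A - B as A + (NOT B) + 1, i.e. inverted B with Cin set to 1."""
    return add8(a, bitwise_inv(b, 1), 1)


print(add8(123, 45, 0))   # 123 + 45
print(sub8(123, 45))      # 123 - 45, the final carry-out is ignored
```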
The Carry-in input can also be used to perform an increment function i.e. Z=A+0+1. This is quite a common requirement within a program e.g. incrementing a
counter, or the processor's program counter. To achieve this function the B input of the adder needs to be set to zero. To do this the replicate_v1 and bitwise_and_v1
circuits shown in figures 14 and 15 are used. The replicate_v1 component uses buffers to drive the same signal onto each bit of its output bus Z. These signals are
then ANDed with the data on the B input of the ALU. If they are ANDed with 1, the value on the B input of the adder is unaffected. If they are ANDed with 0, the
value on the B input of the adder is set to zero. In the ALU this is controlled using signals S2 and S4. The INC function can be simulated using the Xilinx ISim
software tools, as shown in figure 16. In this waveform diagram the calculations 123+1 and 45+1 are performed.
Figure 14 : replicate
The bitwise_and_v1 component is also used in the ALU to perform the bitwise logical AND function i.e. Z=A AND B, as shown in the example below.
A B Z
0 0 0
0 1 0
1 0 0
1 1 1

A AND 0 = 0
A AND 1 = A

  10101010
& 11110000
----------
  10100000
----------
To select the ALU's function the final 4:1 multiplexer is used, control lines S0 and S1 selecting either the adder's output, the bitwise AND output, input A or input B.
The full set of control signals is shown below.
S4 S3 S2 S1 S0   Z
0  0  0  0  0    ADD (A+B)
0  0  0  0  1    BITWISE AND (A&B)
0  0  0  1  0    INPUT A
0  0  0  1  1    INPUT B
0  1  1  0  0    SUBTRACT (A-B)
1  0  1  0  0    INCREMENT (A+1)
1  0  0  0  0    INPUT A
0  0  1  0  0    ADD (A+B)+1
0  1  0  0  0    SUBTRACT (A-B)-1
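To see how the five control lines combine, the table above can be reproduced with a small Python model of the ALU. This is a sketch under some assumptions: S4=1 forces the adder's B input to zero, S3 drives the XOR inverter, S2 drives Cin, and S1 S0 select the final multiplexer input in the order adder, AND, A, B. The order in which B is zeroed and inverted is a guess, but it does not affect any row of the table.

```python
def alu(a: int, b: int, s4: int, s3: int, s2: int, s1: int, s0: int) -> int:
    """Return the 8bit ALU output Z for inputs A, B and control lines S4..S0."""
    b_adder = b & (0x00 if s4 else 0xFF)    # S4=1: force the adder's B input to zero
    b_adder ^= 0xFF if s3 else 0x00         # S3=1: invert every bit (XOR array)
    adder_out = (a + b_adder + s2) & 0xFF   # S2 drives Cin, final carry-out ignored
    and_out = a & b                         # bitwise AND path
    # Final 4:1 multiplexer, selected by S1 S0 (assumed input order)
    return (adder_out, and_out, a, b)[(s1 << 1) | s0]


print(alu(123, 45, 0, 0, 0, 0, 0))   # ADD
print(alu(123, 45, 0, 1, 1, 0, 0))   # SUBTRACT
print(alu(123, 45, 1, 0, 1, 0, 0))   # INCREMENT
```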
Computers execute instructions using the Fetch-Decode-Execute cycle, therefore, the processor must remember what phase it is in so that it can progress to the next.
This temporary memory is implemented using flip-flops, each storing 1 bit of data, as defined by the state table shown in figure 18. A flip-flop has an input pin D
and an output pin Q, the value on D is written to Q when there is a change from a logic 0 to a logic 1 on the CLK pin. Another way to think about the CLK pin is
that it is the write, or update, control signal i.e. the CLK line is pulsed to store a value. Owing to electronic reasons, which I will quickly skip over, all CLK lines must
be connected to the same system clock i.e. a square wave signal that determines the operating speed of the processor. This would mean that every flip-flop would
update its output every clock cycle, which would not be very useful. Therefore, to control when different flip-flops update their outputs we use the clock enable
input pin CE. If CE=0 the CLK pin is ignored. If CE=1 the CLK pin is used.
Figure 18 : D-type flip-flop
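The flip-flop's behaviour can be modelled in a few lines of Python: Q only takes the value of D on a rising edge of CLK, and only if CE=1. The class and method names are invented for this sketch, and the CLR input is left out as its exact behaviour is not described here.

```python
class DFlipFlop:
    """D-type flip-flop with clock enable: Q changes only on a rising CLK edge when CE=1."""

    def __init__(self) -> None:
        self.q = 0
        self.prev_clk = 0

    def update(self, d: int, clk: int, ce: int) -> int:
        """Apply new pin values and return the resulting Q output."""
        rising_edge = self.prev_clk == 0 and clk == 1
        if rising_edge and ce == 1:
            self.q = d            # store D on the 0 to 1 transition
        self.prev_clk = clk       # remember CLK to detect the next edge
        return self.q


ff = DFlipFlop()
print(ff.update(d=1, clk=0, ce=1))   # no edge yet, Q unchanged
print(ff.update(d=1, clk=1, ce=1))   # rising edge with CE=1, Q takes D
print(ff.update(d=0, clk=0, ce=0))   # falling edge, Q holds
print(ff.update(d=0, clk=1, ce=0))   # rising edge but CE=0, Q holds
```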
Within the processor temporary values are stored in registers. Within a register each bit is stored in a flip-flop, as shown in figures 19, 20 and 21. To make larger
registers multiple smaller registers are grouped together. Note, the rectangular components register_4 and register_8 contain the circuit diagrams shown in figures 19
and 20 respectively. The operation of the 8bit register can be simulated using the Xilinx ISim software tools, as shown in figure 22. In this waveform diagram the
values 123 and 45 are stored in the register, using the CLK, CLR and CE lines to control when these values are updated. The Xilinx schematics, symbols and VHDL
testbench for these registers can be downloaded here: (Link ).
Flip-flops are also used to generate the sequence of control signals needed to perform the functions defined by each instruction. These are contained within the
decoder block (figure 1), the circuit diagram symbol for this component is shown in figure 23. Inside this component are the instruction decoding logic and
sequence generators needed to control the processor, as shown in figure 24. A high resolution diagram can be downloaded here: (Link ).
This processor has a slightly modified Fetch-Decode-Execute cycle: to save an adder I added an additional phase, so it now has a Fetch-Decode-Execute-Increment
cycle, the final phase incrementing the PC to the address of the next instruction (when needed). To identify which phase the processor is in a sequence generator is
used, as shown in figures 25 and 26. This is a simple ring counter, using a one-hot encoded value to indicate the processor's state, as shown in figure 27. Initially the
value 1000 is loaded into the counter (fetch code), on each clock pulse the one-hot bit is then moved along the flip-flop chain, looping back to the start after four
clock cycles. To determine the processor's state you simply identify which bit position is set to a logic 1.
1000 : Fetch
0100 : Decode
0010 : Execute
0001 : Increment
Figure 27 : Sequence generator simulation
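The ring counter's behaviour can be sketched in Python as a 4bit rotate: on each clock pulse the single 1 moves one place to the right and wraps back round to the left-hand bit. The names PHASE_NAMES and next_state are made up for this example.

```python
PHASE_NAMES = {
    0b1000: "Fetch",
    0b0100: "Decode",
    0b0010: "Execute",
    0b0001: "Increment",
}


def next_state(state: int) -> int:
    """Rotate the 4bit one-hot value right by one position."""
    return (state >> 1) | ((state & 1) << 3)


state = 0b1000                 # initial value loaded into the counter
for _ in range(8):             # two complete instruction cycles
    print(f"{state:04b} : {PHASE_NAMES[state]}")
    state = next_state(state)
```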
During the Fetch phase the current instruction, pointed to by the program counter (PC) is loaded into the instruction register (IR). Then in the decode phase the high
byte of the 16bit instruction is decoded by the instruction decoder shown in figure 28. A high resolution diagram can be downloaded here: (Link ). As discussed in
Lectures this processor only has a very limited instruction set:
The top 4-6 bits define the opcode, where X=Not used, K=Constant, A=Instruction Address, P=Data Address. For those of you who know your machine code you
will recognise these instructions are based on the original PicoBlaze machine code (Link), as this is the next processor architecture we will look at in Lectures. The
instruction decoder converts the unique 8bit opcode into a one-hot value, the bits of which are then used during the Decode and Execute phases to control the processor's
hardware. To ensure these signals are not active during the Fetch and Increment phases they are ANDed with the logical OR of the Decode and Execute signals from
the sequence generator, as shown in figure 29.
The processor supports unconditional and conditional JUMP instructions. The conditional JUMP instructions are based on the result of the last ALU operation i.e.
the zero and carry bits for the ADD, SUB and AND instructions. These are stored in a 2bit register, as shown in figure 30. Note, the zero bit is generated by an 8bit
NOR gate connected to the ALU's output bus i.e. a NOR only produces a 1 when all of its inputs are 0.
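The zero bit can be expressed in Python as an 8 input NOR: OR together all eight bits of the ALU result and invert the answer. The function name zero_flag is invented for this sketch.

```python
def zero_flag(z: int) -> int:
    """8 input NOR of the ALU output bus: 1 only when every bit of Z is 0."""
    any_bit_set = 0
    for bit in range(8):
        any_bit_set |= (z >> bit) & 1   # OR of all eight bits
    return any_bit_set ^ 1              # invert to give the NOR


print(zero_flag(0x00))   # all bits 0, so the zero bit is set
print(zero_flag(0x80))   # one bit is 1, so the zero bit is clear
```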
The sequence_generator, instruction_decoder and status_register form the core elements of the processor's control unit (decoder block in figure 1). The signals from
these units are used to generate the control signals for the system MUXs, ALU and REGs, as shown in figure 31. This table defines the state of each control signal,
for each phase, to implement that instruction's function. This table also defines the control signals needed for the Fetch and Increment phases. The logic that drives
these control signals (right hand side of figure 24) is derived from this table.
Most of this control logic is quite intuitive, a slightly more complex bit is the Jump logic shown in figure 32. If the processor is in the Execute phase, the instruction
decoder and status signals determine if the program counter (PC) should be updated i.e. should the jump address be loaded into the PC. If a JUMP instruction is
taken, then the system does not need to increment the PC, as it already contains the address of the next instruction. Therefore, when the processor is in the Increment
phase it checks to see if a jump has been taken. If it has, the PC is not enabled i.e. the result PC+1 is not stored in the program counter.
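The jump behaviour can be summarised with the rough Python sketch below. It is only a behavioural guess at what the logic in figure 32 achieves, not a description of its gates: the function name, the phase strings, the jump_requested input (standing for the combined instruction decoder and status signals) and the jump_taken flag are all hypothetical, and an 8bit PC is assumed.

```python
def update_pc(pc: int, phase: str, jump_requested: bool,
              jump_addr: int, jump_taken: bool) -> tuple[int, bool]:
    """Return (new_pc, new_jump_taken) after one clock cycle."""
    if phase == "Execute":
        if jump_requested:
            return jump_addr & 0xFF, True    # load the jump address into the PC
        return pc, False
    if phase == "Increment":
        if jump_taken:
            return pc, False                 # PC not enabled, PC+1 is discarded
        return (pc + 1) & 0xFF, False        # normal case: move to next instruction
    return pc, jump_taken                    # Fetch and Decode leave the PC alone


pc, taken = update_pc(5, "Execute", True, 0x20, False)
pc, taken = update_pc(pc, "Increment", False, 0x00, taken)
print(hex(pc))   # PC still holds the jump address
```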
The memory used in the system (figure 33) stores both instructions and data i.e. a Von Neumann architecture. This could be constructed from the built in memory
components used in the FPGA, but they are a real pain to configure i.e. initialise with the machine code and data values needed. Therefore, I decided to cheat and use
a bit of VHDL. This is a Hardware Description Language (HDL) representation of what the memory should do, abstracting away from the low level logic gates it's
actually made from. This allows me to simply type in the binary values, as shown in figure 34. Note, I have only shown the first few values. This description is then
synthesised (converted) by the Xilinx tools into the required hardware components. The complete computer system is shown in figure 35. A high resolution diagram
can be downloaded here: (Link ).
Figure 33 : RAM
The test program shown in figure 34 (machine code, comments on the right in green) loads the value stored in memory location 10 and adds 10 to it. If this does not
generate an overflow i.e. a value larger than 255, the result is written back to memory location 10. If it does generate an overflow i.e. 250+10, the value is saturated
to the maximum value i.e. 255. Therefore, the program counts from 0 to 250 and then stops, as shown in the simulation in figure 36. Look at the bottom row, this
shows the contents of memory location 10 as a hexadecimal value. The Xilinx schematics, symbols and VHDL testbench for the complete system can be
downloaded here: (Link ). Have a play, modify the data values or write your own program, then re-run the simulation (top_level_v1_tb).
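The arithmetic at the heart of this test program is a saturating add, which can be sketched in Python as below. This models only the add and overflow check, not the machine code itself; the function name saturating_add is made up, and the carry test stands in for the processor's conditional jump on the carry bit.

```python
def saturating_add(value: int, step: int = 10) -> int:
    """Add step to an 8bit value, clamping to 255 if the addition overflows."""
    total = value + step
    carry = total > 0xFF              # equivalent to the adder's carry-out
    return 0xFF if carry else total


print(saturating_add(240))   # no overflow, result is stored as normal
print(saturating_add(250))   # 250+10 overflows, result is saturated
```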
A requirement when writing your first program on any new processor is to print "HELLO WORLD" to the screen. Normally a simple task, however, for a processor
with no screen and only seven basic instructions it is a little more of a challenge. The simplest way to add a screen is to use a serial terminal (Link)(Link), then bit-bang
out serial data packets (Link). However, you first need an output port, as shown in figure 37. This is simply a flip-flop whose CE pin is only enabled for a specific
address, in this example address 0xFF (255). When data is written to this address, RAM is updated as normal, but data bit 0 is also stored in this flip-flop, its Q
output being connected to the TX line of a serial bus.
Each character in the "HELLO WORLD" message string is stored in memory, locations 0xF0 to 0xFD, as ASCII values (Link). To simplify the program they are
actually stored in their inverted forms e.g. H = 0x48 (01001000), inverted = 0xB7 (10110111). The program reads each character's bits and outputs these on the
serial port. The serial data packet format for the character 'K' (0x4B = 01001011) is shown in figure 38 (Link). Each bit is allocated a time slice on the serial port's
line. The default speed for the serial port is 9600 bits per second i.e. each bit is valid for 104 us (1/9600). Packets start with a start bit (1) and finish with a stop bit
(0). These packets are received by a terminal program running on a remote computer and displayed on its screen. The serial packets and terminal display are shown
in figures 39 - 41.
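The packet format can be sketched in Python as a list of line levels, one per time slice. The start bit (1), stop bit (0), inverted character bits and 9600 bits per second come from the description above. Sending the least significant bit first is an assumption, as that is the usual convention for serial ports, and the function name serial_frame is made up.

```python
BAUD = 9600
BIT_TIME_US = 1_000_000 / BAUD   # length of one time slice in microseconds


def serial_frame(char: str) -> list[int]:
    """Return the ten line levels for one character: start, 8 data bits, stop."""
    inverted = ord(char) ^ 0xFF                           # characters are stored inverted
    data_bits = [(inverted >> i) & 1 for i in range(8)]   # LSB first (assumed order)
    return [1] + data_bits + [0]                          # start bit (1) ... stop bit (0)


print(f"{BIT_TIME_US:.1f} us per bit")
print(hex(ord("H") ^ 0xFF))   # inverted form of 'H'
print(serial_frame("K"))
```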
The program to send the message string "HELLO WORLD" to the serial terminal is shown below. Hopefully most of the code is self-explanatory :). The next
character to be transmitted is read from memory location 0xE0, a bit mask is applied using a bitwise AND to select the desired bit. Based on this result a conditional
jump then selects a 0 or a 1, which is stored to memory location 0xFF i.e. the serial port. This is repeated seven more times until all eight character bits have been
output. The program below could have been substantially reduced if a SHIFT instruction had been included. The final twist is how the program scans through the
message string. As this processor does not support indirect addressing modes, self modifying code is used to access the next character in the string i.e. instructions
from 0x4F to 0x58. The program reads an INPUT instruction from memory, it then adds 1 to this instruction. As the INPUT instruction's address field is in the lower
byte this changes the address to the address of the next character in the string. This modified instruction is then written back to memory and executed by the
program. It goes without saying that self modifying code i.e. a program that rewrites itself is not a good idea, however, it's very useful in this case. The program
then fetches the next character, storing it to memory location 0xE0. If this is a NULL the program finishes, otherwise the program jumps to memory location 0x02
and repeats the TX code. Note, from a software structure point of view it would have been nice if the processor had supported subroutines, perhaps for version 2.
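The self modifying trick can be illustrated with a short Python sketch. The 16bit instruction layout (opcode in the high byte, address in the low byte) follows the description above, but the opcode value 0xA0 is purely hypothetical and the function names are invented for this example.

```python
INPUT_OPCODE = 0xA0   # hypothetical opcode byte, for illustration only


def make_input(address: int) -> int:
    """Build a 16bit INPUT instruction: opcode in the high byte, address in the low byte."""
    return (INPUT_OPCODE << 8) | (address & 0xFF)


def point_to_next_char(instruction: int) -> int:
    """Add 1 to the whole instruction word, which bumps the address in the low byte."""
    return (instruction + 1) & 0xFFFF


instr = make_input(0xF0)       # first character of the message string
for _ in range(3):
    print(f"{instr:04X}")      # opcode byte unchanged, address byte counts up
    instr = point_to_next_char(instr)
```

Note, this only works while the address stays below 0xFF, otherwise the carry would spill into the opcode byte.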
I had to make a few small changes to the schematics and VHDL files to minimise the hardware size e.g. in its present form the memory is implemented by the
software tools as a 256:1 16bit multiplexer, which takes up quite a bit of space. Adding a clock allows the memory to be mapped to a Block RAM i.e. the default
RAM on the FPGA. The final project that prints "HELLO WORLD" can be downloaded here: (Link).
What next for this computer? I want to see if I can get it to fit into a Xilinx 9572 CPLD, which we use for teaching hardware design. It would need to use external
RAM/ROM, however, the main limitation is that this programmable device only has a very small amount of hardware e.g. 72 flip-flops.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.