Experiments in Computer System Design: Technical Report
2010
Table of Contents
PART 1
Introduction and Perspective
The Tiny Stack Machine (TSM)
The Tiny Register Machine TRM-1
The Tiny Register Machine TRM-2
The Tiny Register Machine TRM-3
The Implementation of the TRM-3
The data processing unit
The shifter
The multiplier
The divider
The local data memory
The control unit
Input and output
Stalling the processor
Interrupts
PART 2
The environment and the Top Module
A transmitter for the RS-232 serial line
A receiver for the RS-232 serial line
A floating-point unit
Implementation of floating-point operations
PART 3
About memories
A DDR memory as an external device
Introducing a Direct Memory Access Channel (DMA)
Initializing and refreshing the SDRAM memory
TRM-0: Architecture and instruction set
The implementation of TRM-0
PART 4
Multiprocessor-systems and interconnects
Point-to-point connection: The buffered channel
The ring structure
The implementation of a token ring
A software driver
A test setup
Broadcast
Discussion
PART 5
The principle of cache memories
The direct mapped cache
Implementing the cache
Acknowledgement
Technical Report 2. 8. 2010
PART 1
is presented in a second step. It includes a direct memory access channel
(DMA). Of course a final solution will be the use of a cache memory.
This small, initial set of hardware components can be augmented freely. Other
processors may be added, more sophisticated links, and drivers for other
devices, several of which are available on the development board currently used
(ML-505). On the host computer reside a compiler (in our case for the
programming language Oberon) and a system for downloading the bitstream
file (configuration file) onto the target FPGA.
This configurable system actually emerged from a project whose ultimate goal
was an application suitable to demonstrate the power of multi-processor
systems. Its somewhat grandiose title was Supercomputer in a Pocket. The
target application was a surveillance system for heart diseases. Signal analysis
required reasonably large computing power, while the fact that the device is
carried by patients required reasonably low power consumption and a size small
enough to fit into a pocket. The
system elements described below were used in this application.
Many instructions of a stack-oriented architecture fit into a single byte. Even
instructions containing a short, immediate operand fit into a byte. Instructions
containing a larger literal, such as an address, are placed in two (or 3) bytes.
Therefore, a variable-length instruction scheme is required.
As is to be expected, the price for an expression stack and a variable-length
instruction fetch machinery is an increase in complexity of the decoding circuitry.
The longer signal paths and propagation times result in a longer clock
cycle and thus a decrease in speed. This is particularly noticeable if pipelining is
desired to speed up instruction interpretation. This complexity proved to be such
that it was decided to abandon the idea of a stack architecture and to return to a
register array in place of the register stack. The price is lower code density:
programs become longer and hit the available memory limits sooner.
op d b 0 n n = 5-bit literal
op d b 1 a d, b, a = 4-bit regno
An unpleasant property of the short instruction (18 bits) is that the field for
immediate operands and addresses is very short. The decision to use not only
16, but all 18 bits of the FPGA’s block memories alleviated this problem to some
degree. However, the necessity to use a separate instruction each time a
constant or an address longer than 5 bits is present, was felt to be too
detrimental to code density and efficiency, and a better solution was sought,
resulting in the TRM-2.
op d 0 n n = 10-bit literal
op d 1 s d, s = 3-bit regno
We can now turn to the selection of instructions to be represented. This selection
is essentially determined by the programming language envisaged, i.e. the
operators used in expressions. However, in this respect most general-purpose
languages feature the same requirements: Arithmetic and logical instructions. In
detail, they are the following:
op operation
0 MOV R.d := R.s (in place of R.s may stand the literal imm)
1 NOT R.d := ~R.s
2 ADD R.d := R.d + R.s
3 SUB R.d := R.d - R.s
4 AND R.d := R.d & R.s
5 BIC R.d := R.d & ~R.s
6 OR R.d := R.d | R.s
7 XOR R.d := R.d xor R.s
8 MUL R.d := R.d * R.s
9 DIV R.d := R.d div R.s
10 ROR R.d := R.d ror R.s (rotate right)
11 BR PC := R.s (see below)
In addition there are the Load and Store instructions providing access to
memory. Their format is slightly different. The actual address is computed as the
sum of base address R.s and an offset:
op d off s
op operation
12 LD R.d := Mem[R.s + off]
13 ST Mem[R.s + off] := R.d
The only remaining instructions are branch instructions used for implementing
conditional and repetitive statements, i.e. if, while, repeat and for statements.
They are executed conditionally, i.e. when a condition is satisfied. These
conditions are the result of preceding register instructions, and they are held in 4
condition registers N, Z, C, V, defined as shown below. The branch and link
instruction is unconditional. It is used to implement procedure calls. It stores the
current value of the program counter PC in R7.
Branch instructions
    4     4     10       (field widths in bits)
    14    cond  off
    15    off            BL
code mnemonic condition
0000 EQ equal (zero) Z
0001 NE not equal ~Z
0010 CS carry set C
0011 CC carry clear ~C
0100 MI negative (minus) N
0101 PL positive (plus) ~N
0110 VS overflow set V
0111 VC overflow clear ~V
1000 HI high ~(~C|Z)
1001 LS less or same ~C|Z
1010 GE greater or equal ~(N≠V)
1011 LT less than N≠V
1100 GT greater than ~((N≠V)|Z)
1101 LE less or equal (N≠V)|Z
1110 true T
1111 false F
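The table can be read as eight base conditions, each paired with its negation in the lowest bit of the code. The following Python sketch spells out that reading. The function name condition and the use of Python are choices made for this illustration only; they are not part of the TRM sources.

```python
def condition(code, N, Z, C, V):
    """Evaluate a 4-bit TRM condition code against the flags N, Z, C, V (0 or 1)."""
    S = N != V                          # the term N≠V of the table
    base = (
        Z,                              # 000x  EQ / NE
        C,                              # 001x  CS / CC
        N,                              # 010x  MI / PL
        V,                              # 011x  VS / VC
        not ((not C) or Z),             # 100x  HI / LS
        not S,                          # 101x  GE / LT
        not (S or Z),                   # 110x  GT / LE
        True,                           # 111x  true / false
    )[code >> 1]
    return bool(base) != bool(code & 1)  # the low bit inverts the sense
```

For example, condition(0b1010, N=1, Z=0, C=0, V=1) evaluates GE and yields True, because N and V agree.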
Special instructions
The TRM furthermore features some special instructions. The first to be
mentioned is an instruction to obtain the high part of a product. Multiplications
generate a result of 64 bits. The high-order part is usually ignored, but it is stored
in a special register H. The LDH instruction fetches this value.
MOV d 1 1 0 LDH
The instruction to return from a procedure is BR, branching to the address
taken from register R.s. A variant, BLR, additionally stores the current PC+1
in R.d. It is used for calling procedures which are a formal
parameter or are represented by a variable (methods).
BR d 1 10 s BLR
The TRM also features an interrupt facility. There are 2 external signals that can
cause an interrupt. An interrupt functions like a procedure call. As the place in a program
where an interrupt may be triggered is unknown, the state of the machine must
be preserved in order to be recovered after the interrupt has been handled. Thus the
TRM switches to interrupt mode, in which it uses a second bank of registers and
stores the PC and the condition bits in R7 of this bank. Further interrupts are
immediately disabled.
The return from interrupt to normal mode is caused by an RTI instruction, a slight
variant of the BR. It restores PC and the condition bits, and re-enables interrupts.
The interrupt facility requires a processor status register (PSR) indicating the
processor mode and whether or not interrupts are enabled.
BR - 1 01 s RTI
BR - 0 stat LDPSR
The LDPSR instruction loads the processor status register (PSR) from its literal field
with the following bit assignments:
0 interrupt 0 enable
1 interrupt 1 enable
2 processor mode (0 = normal; 1 = interrupted)
3 cache enable (if available)
The hardware interface of the TRM module follows the example of typical
microprocessors. The inputs are:
clk the processor clock (116 MHz)
rst reset, active low
stall if high, causes the processor to stall
irq0, irq1 interrupt signals, active high
inbus 32-bit bus
The output signals are
iord, iowr read and write enable
ioadr 6-bit I/O address
outbus 32-bit bus
The iord signal is included, because some read commands may not only read
data, but also change the state of the device, such as moving a buffer pointer
ahead for sequential access.
The processor essentially consists of two sections, the data processing unit,
computing the results of single instructions, and the control unit, controlling the
sequence of instructions.
But not only. The second factor is the resources available. The early RISC
designs held to the principle that every instruction should be executed in a single
clock tick. This is readily possible for addition and the logical operations. But
already a shifter may cause difficulties, let alone multiplication and division. They
are inherently more complex than the former. There are three solutions to the
dilemma: The first is to provide more and faster circuitry (possible only within
limits), the second is to give up the principle, i.e. to allow some operations to
take more than a single clock cycle, and the third is to omit the operation
altogether. Indeed, early RISC designs left out multiplication and division
instructions, as these are relatively rare operations, division in particular.
In our context, we let multiplication and division take 32 cycles. This requires that
the control unit can be stopped from progressing to the next instruction. The
signal indicating such delay is called stall. Both the multiplication and the division
units have a stall signal as output. Fortunately, it proved to be possible to
implement a full barrel shifter operating within a single cycle.
The processing unit consists of the Arithmetic/Logic Unit (ALU) and a set of 8
registers. The ALU is – apart from multiplier and divider - a purely combinational
circuit yielding results of arithmetic or logical operations. The main data path of
the processor forms a loop from selected source registers (A, B) through the
ALU to a multiplexer (aluRes) back to a destination register. The multiplexer in
the A-path determines, whether the A-operand is a register or a literal, i.e. a
constant in the instruction IR. The additional registers H and CC store the high-
order part of a product, and the conditions N, Z, C, V respectively.
[Figure: block diagram of the data processing unit, showing Registers, IR, B, H, ALU, and CC]
wire [3:0] op;
wire [9:0] imm;
wire [2:0] ird, irs, dst;
wire [7:0] off;
wire [31:0] AA, A, B, s1, s2, s3, divRes, remRes;
wire [32:0] aluRes;
wire [63:0] mulRes;
wire MOV, NOT, ADD, SUB, MUL, DIV, AND, BIC, OR, XOR, ROR;
wire BR, LDR, ST, Bc, BL, ADSB;
assign op = IR[17:14];
assign ird = IR[13:11];
assign irs = IR[2:0];
assign imm = IR[9:0]; // 10-bit literal, zero-extended where A is formed
assign off = IR[10:3]; // 8-bit offset of LD and ST
assign MOV = (op == 0);
assign NOT = (op == 1);
assign ADD = (op == 2);
assign SUB = (op == 3);
assign AND = (op == 4);
assign BIC = (op == 5);
assign OR = (op == 6);
assign XOR = (op == 7);
assign MUL = (op == 8);
assign DIV = (op == 9);
assign ROR = (op == 10);
assign BR = (op == 11);
assign LDR = (op == 12);
assign ST = (op == 13);
assign Bc = (op == 14);
assign BL = (op == 15);
assign ADSB = (IR[17:15] == 1); // ADD | SUB
assign A = (IR[10]) ? AA : {22'b0, imm};
assign regwr = (~ST & ~ … );
assign aluRes =
(MOV) ? A :
(NOT) ? ~A :
(ADD) ? {B[31], B} + {A[31], A} :
(SUB) ? {B[31], B} - {A[31], A} :
(AND) ? B & A :
(BIC) ? B & ~A :
(OR) ? B | A :
(MUL) ? mulRes[31:0] :
(DIV) ? divRes : B ^ A; // XOR
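The field extraction and the choice between register and literal operand in the assignments above can be mirrored in software. The following Python sketch is only an illustration: the names decode and operand_a and the dictionary keys are hypothetical, and regs stands for the eight registers of the current bank.

```python
def decode(ir):
    """Split an 18-bit TRM instruction word into the fields used by the data path."""
    return {
        "op":      (ir >> 14) & 0xF,    # IR[17:14]
        "ird":     (ir >> 11) & 0x7,    # IR[13:11], destination register
        "use_reg": (ir >> 10) & 0x1,    # IR[10]: 1 = register operand, 0 = literal
        "imm":     ir & 0x3FF,          # IR[9:0], 10-bit literal
        "off":     (ir >> 3) & 0xFF,    # IR[10:3], offset of LD and ST
        "irs":     ir & 0x7,            # IR[2:0], source register
    }

def operand_a(ir, regs):
    """Model of the A multiplexer: register R.s if IR[10] is set, else the literal."""
    f = decode(ir)
    return regs[f["irs"]] if f["use_reg"] else f["imm"]
```

With this model, the word (2 << 14) | (1 << 11) | 5 decodes as an ADD with destination R1 and the literal 5 as operand A.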
The register bank is implemented by 32 1-bit LUT RAM-slices, expressed in
Verilog by a generate statement. There are 2 addresses (register numbers). The
first is dst (stemming from instruction field ird), controlling RAM input D (regmux)
and RAM output B, and the second is irs, controlling RAM output AA.
genvar i;
generate //dual port register file
for (i = 0; i < 32; i = i+1)
begin: rf32
RAM16X1D_1 # (.INIT(16'h0000))
rfa(
.DPO(AA[i]), // data out
.SPO(B[i]),
.A0(dst[0]), // R/W address, controls D and SPO
.A1(dst[1]),
.A2(dst[2]),
.A3(intMd),
.D(regmux[i]), // data in
.DPRA0(irs[0]), // read-only adr, controls DPO
.DPRA1(irs[1]),
.DPRA2(irs[2]),
.DPRA3(intMd),
.WCLK(~clk),
.WE(regwr));
end
endgenerate
Apart from the 8 32-bit registers the data processing unit contains four 1-bit
registers: N, Z, C and V. Together they form the condition code. It is set by the
general instructions and tested by conditional branch instructions. N indicates
whether a result is negative, and Z whether it is zero. C and V hold the carry and
overflow bits of additions and subtractions. There is also the 32-bit register H
holding the high order part of products or the remainder of divisions.
always @ (posedge clk)
if (regwr) begin
N <= aluRes[31];
Z <= (aluRes == 0);
C <= (ADSB) ? aluRes[32] : (ROR) ? s3[0] : C;
V <= (ADSB) ? (aluRes[32] ^ aluRes[31]) : V;
H <= (MUL) ? mulRes[63:32] : (DIV) ? remRes : H;
end
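To see how the 33-bit result yields the flags, the statements above can be replayed for ADD and SUB in a small Python model. This is a sketch under the assumption that the operands are given as unsigned 32-bit values; the names sext33 and add_sub_flags are made up for this illustration.

```python
MASK32 = 0xFFFFFFFF

def sext33(x):
    """Extend a 32-bit value to 33 bits by repeating bit 31: {x[31], x}."""
    return x | (((x >> 31) & 1) << 32)

def add_sub_flags(b, a, subtract=False):
    """Return (result, N, Z, C, V) as the always block computes them for ADD/SUB."""
    wide = sext33(b) - sext33(a) if subtract else sext33(b) + sext33(a)
    wide &= (1 << 33) - 1           # aluRes is a 33-bit wire
    n = (wide >> 31) & 1            # N <= aluRes[31]
    z = int(wide == 0)              # Z <= (aluRes == 0), compared over all 33 bits
    c = (wide >> 32) & 1            # C <= aluRes[32]
    v = c ^ n                       # V <= aluRes[32] ^ aluRes[31]
    return wide & MASK32, n, z, c, v
```

Adding 1 to 7FFFFFFFH, for instance, gives the result 80000000H with N = 1 and V = 1: the sum no longer fits into a signed 32-bit word.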
The Shifter
The TRM has only a single shift instruction. It rotates to the right. The rotate
mode was chosen, because it does not lose any information; all bits are still
present unchanged, albeit at another position. Hence, all other shift modes can
be derived from rotation with the help of masking. The shifter is a barrel shifter.
This implies that any amount of shift is possible with one instruction, i.e. the shift
count ranges from 0 to 31.
Typically, shifters are built from a series of multiplexers, the first shifting by 0 or
1, the second by 0 or 2, etc. the fifth by 0 or 16. Here, we use 4-input
multiplexers (a number favored by Xilinx FPGA cells), and thus can reduce the
series from 5 to 3, denoted by s1, s2, and s3. Now the first multiplexer shifts by
0, 1, 2, or 3, the second by 0, 4, 8, or 12, and the third by 0 or 16. A generate
statement is used to build the 32 multiplexers for each stage. The shift count is
A[4:0]. The output s3 goes to regmux instead of aluRes. (Note: “%” denotes
modulo in Verilog).
wire [1:0] sc1, sc0;
wire [31:0] s1, s2, s3;
assign sc0 = A[1:0];
assign sc1 = A[3:2];
generate
for (i = 0; i < 32; i = i+1)
begin: rotblock
assign s1[i] = (sc0 == 3) ? B[(i+3)%32] : (sc0 == 2) ? B[(i+2)%32] :
(sc0 == 1) ? B[(i+1)%32] : B[i];
assign s2[i] = (sc1 == 3) ? s1[(i+12)%32] : (sc1 == 2) ? s1[(i+8)%32] :
(sc1 == 1) ? s1[(i+4)%32] : s1[i];
assign s3[i] = A[4] ? s2[(i+16)%32] : s2[i];
end
endgenerate
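The three multiplexer stages can be checked against a plain rotation with a short Python model. The sketch also shows how logical shifts may be derived from rotation by masking, as mentioned above. All function names (stage, barrel_ror, ror, lsr, lsl) are assumptions of this illustration and do not occur in the TRM sources.

```python
MASK32 = 0xFFFFFFFF

def stage(x, k):
    """One multiplexer layer: out[i] = x[(i + k) % 32], a rotation right by k."""
    return sum(((x >> ((i + k) % 32)) & 1) << i for i in range(32))

def barrel_ror(b, a):
    """Model of s1, s2, s3: rotate b right by the count a (0..31)."""
    s1 = stage(b, a & 3)                 # by 0, 1, 2 or 3
    s2 = stage(s1, 4 * ((a >> 2) & 3))   # by 0, 4, 8 or 12
    return stage(s2, 16 * ((a >> 4) & 1))  # by 0 or 16

def ror(x, n):
    """Reference rotation used for comparison."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def lsr(x, n):
    """Logical shift right: rotate, then clear the n high-order bits."""
    return barrel_ror(x, n) & (MASK32 >> n)

def lsl(x, n):
    """Logical shift left: rotate right by 32-n, then clear the n low-order bits."""
    return barrel_ror(x, (32 - n) % 32) & (MASK32 << n) & MASK32
```

The design choice visible here is that the stage counts 1, 4 and 16 are weights of a base-4 digit expansion of the shift count, which is why three stages suffice.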
The Multiplier
The multiplier is declared as a separate module, instantiated by the following
statement:
Multiplier mulUnit (.CLK(clk), .mul(MUL),
.A(A), .B(B),
.stall(stallM), .mulRes(mulRes));
The multiplier described below follows the traditional algorithm of n add-shift
steps, where n is the word length, here 32.
s := 0; (*x is the multiplier, y the multiplicand*)
REPEAT
IF ODD(x) THEN z := z+y END ;
x := x DIV 2; z := z DIV 2; INC(s) (*right shift*)
UNTIL s = 32
This implies that the multiplier is a state machine. Its state is a counter S running
from 0 to 32. We use a double-length register, here called Hi (initialized to 0) and
Lo (initialized with the multiplier A). In each step, the multiplicand B is added to
the high part, if the least significant bit of the multiplier is 1. Then the register
is shifted one bit to the right.
[Figure: adder feeding the double-length register Hi, Lo]
introduce Hix and Bx as extended versions of Hi and B. This is necessary,
because the carry bit of addition must not be lost. It enters register Hi with the
right shift.
module Multiplier(
input CLK, mul,
output stall,
input [31:0] A, B,
output [63:0] mulRes);
reg [5:0] S; // state
reg [31:0] Hi, Lo; // high and low parts of partial product
wire [32:0] p, Hix, Bx;
assign stall = mul & ~S[5];
assign Hix = {Hi[31], Hi};
assign Bx = {B[31], B};
assign p = (S == 0) ? (A[0] ? Bx : 0) :
Lo[0] ? ((S == 31) ? (Hix - Bx) : (Hix + Bx)) : Hix;
assign mulRes = {Hi, Lo};
always @ (posedge(CLK)) begin
if (mul & stall) begin
Hi <= p[32:1];
Lo <= (S == 0) ? {p[0], A[31:1]} : {p[0], Lo[31:1]};
S <= S + 1; end
else if (mul) S <= 0;
end
endmodule
The parameter mul indicates a multiplication in progress. The stall signal is
asserted, when mul is 1 and S has not yet reached the value 32.
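The 32 add-shift steps of the module can be traced in software. The following Python sketch follows the assignments to p, Hi and Lo step by step, including the subtraction in the last step that accounts for a negative multiplier. The function name multiply is hypothetical, and operands are assumed to be passed as unsigned 32-bit patterns of two's complement numbers.

```python
M32, M33 = 0xFFFFFFFF, 0x1FFFFFFFF

def sext33(x):
    """{x[31], x}: repeat the sign bit to obtain a 33-bit value."""
    return x | (((x >> 31) & 1) << 32)

def multiply(a, b):
    """Return the 64-bit product {Hi, Lo} of the signed 32-bit patterns a and b."""
    bx = sext33(b & M32)
    hi, lo = 0, a & M32              # state seen in step S = 0: Hi = 0, Lo = A
    for s in range(32):
        hix = sext33(hi)
        if lo & 1:                   # current multiplier bit
            p = (hix - bx if s == 31 else hix + bx) & M33
        else:
            p = hix
        hi = (p >> 1) & M32          # Hi <= p[32:1]
        lo = ((p & 1) << 31) | (lo >> 1)   # Lo <= {p[0], Lo[31:1]}
    return (hi << 32) | lo
```

For instance, multiply(3, 5) returns 15, and multiply(0xFFFFFFFD, 5) returns the 64-bit two's complement pattern of -15.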
The FPGA used in this project features (a large number of) DSPs (digital signal
processors). A DSP can be used to speed up multiplication, because it can
multiply two 18-bit numbers in a single clock tick. Thus, we need only 4 (instead
of 32) cycles for a multiplication of two 32-bit arguments. We refrain from
presenting this solution here, because it is rather complicated and highly
dependent on the particular DSP design.
The Divider
The divider is declared as a separate module, instantiated by the following
statement:
Divider divUnit(.clk (clk), .div(DIV),
.x(B), .y(A),
.stall(stallD),
.quot(divRes), .rem(remRes));
The divider described below follows the traditional algorithm with n shift-subtract
steps, where n is the wordlength.
s := 0; r := x; q := 0; (*x is the dividend, y the divisor*)
REPEAT (*q*y + r = x*)
r := 2*r; q := 2*q; INC(s); (*left shift*)
IF r >= Y THEN r := r - Y; INC(q) END ; (*Y = 2^32 * y*)
UNTIL s = 32
(*q is the quotient, r the remainder*)
[Figure: subtractor feeding the double-length register R, Q]
This implies that the divider, too, is a state machine. Its state is represented by
the counter S running from 0 to 32.
module Divider(
input clk, div,
output stall,
input [31:0] x, y,
output [31:0] quot, rem);
reg [5:0] S; // state
reg [31:0] R, Q; // remainder, quotient
wire [31:0] xa, rsh, qsh, d;
assign stall = div & ~S[5];
assign xa = (x[31]) ? -x : x;
assign rsh = (S == 0) ? 0 : {R[30:0], Q[31]};
assign qsh = (S == 0) ? {xa[30:0], ~d[31]} : {Q[30:0], ~d[31]};
assign d = rsh - y;
assign quot = (~x[31]) ? Q : (R == 0) ? -Q : -Q-1;
assign rem = (~x[31]) ? R : (R == 0) ? 0 : y - R;
always @ (posedge(clk)) begin
if (div & stall) begin
R <= (~d[31]) ? d : rsh; Q <= qsh; S <= S + 1; end
else if (div) S <= 0;
end
endmodule
The dividend is taken as the absolute value of x. In case of a negative x, a
correction is made after the computation of quotient and remainder:
IF x < 0 THEN
IF r = 0 THEN q := -q ELSE q := -q-1; r := y-r END
END
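The shift-subtract steps and the correction for a negative dividend can be replayed in Python. The sketch below follows the signals rsh, d, qsh of the module. It assumes a positive divisor below 2^31 and a dividend other than -2^31 (whose absolute value does not fit into 31 bits); the function name divide is made up for this illustration.

```python
M32 = 0xFFFFFFFF

def divide(x, y):
    """Return (quot, rem) for the 32-bit pattern x and the positive divisor y."""
    negative = (x >> 31) & 1
    xa = (-x) & M32 if negative else x       # absolute value of the dividend
    r, q = 0, xa                             # state seen in step S = 0
    for s in range(32):
        rsh = 0 if s == 0 else ((r << 1) | (q >> 31)) & M32
        d = (rsh - y) & M32
        fits = ((d >> 31) & 1) == 0          # ~d[31]: the subtraction succeeded
        r = d if fits else rsh
        q = ((q << 1) | int(fits)) & M32     # shift in the new quotient bit
    if negative:                             # correction after the computation
        if r == 0:
            q = (-q) & M32
        else:
            q = (-q - 1) & M32
            r = y - r
    return q, r
```

With the correction, the remainder is never negative: divide(0xFFFFFFF9, 2), i.e. -7 divided by 2, returns the pattern of -4 as quotient and 1 as remainder.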
assign dmin = B;
dbram32 DM (.wda(dmin), //write port
.aa (dmadr[10:0]),
.wea (dmwr),
.clka (clk),
.rdb (dmout), //read port
.ab (dmadr[10:0]),
.enb (1'b1),
.clkb (clk));
assign regmux =
(LDR) ? dmout : // read for LDR instructions only
(ROR) ? s3 :
aluRes;
(Other terms will be added to regmux, the registers’ input, later).
[Figure: the control unit, showing PC, the incrementer (+1), the program memory P-mem (4K x 18), and decode]
.clkb (clk));
assign IR = (PC[0]) ? pmout[35:18] : pmout[17:0];
assign nxpc = PC + 1;
assign pcmux =
(~rst) ? 0 :
(stall0) ? PC :
(BL) ? IR[13:0] + nxpc :
(Bc & cond) ? {IR[9], IR[9], IR[9:0]} + nxpc :
(BR & IR[10]) ? A[11:0] : nxpc;
The sequencing of instructions is finally achieved by the statement
always @ (posedge clk) PC <= pcmux;
This rather straightforward scheme was used for the TRM-1.
Unfortunately, reading data from local memory is slow compared to functions
implemented by the normal logic cells (LUT). It required the use of a clock rate
not greater than 58.3 MHz. A simple measure called (single-stage) pipelining
allows the clock rate to be doubled to 116.6 MHz. It requires two incarnations of IR
and PC. An instruction is first fetched with address PCf into IRf, and thereafter
moved from IRf to IR. While it is interpreted from IR, the next instruction is
fetched into IRf. This sequential flow is broken by branch instructions. In their
case, a NOP instruction must be inserted, causing a hiccup, i.e. a delay of one
tick. The pipelining machinery is described as follows:
localparam NOP = 18'b111011110000000000; // never jump
reg [11:0] PCf, PC;
reg [17:0] IR;
reg stall1;
wire [17:0] IRf; //36-bit register IRf is contained in module pbram
wire [11:0] pcmux, nxpcF, nxpc;
always @ (posedge clk) begin
PCf <= pcmux;
if (~rst) begin PC <= 0; IR <= NOP; end
else if (stall0) begin PC <= PC; IR <= IR; end
else if ((Bc & cond) | BL | BR & IR[10])
begin PC <= pcmux; IR <= NOP; end
else begin PC <= PCf; IR <= IRf; end
end
The signal cond determines, whether a branch is taken or not. It is derived from
the various condition code registers. Bit IR[10] inverts the sense of the condition.
assign cond = IR[10] ^ // xor
((ird == 0) & Z | // EQ, NE
(ird == 1) & C | // CS, CC
(ird == 2) & N | // MI, PL
(ird == 3) & V | // VS, VC
(ird == 4) & ~(~C|Z) | // HI, LS
(ird == 5) & ~S | // GE, LT
(ird == 6) & ~(S|Z) | // GT, LE
(ird == 7)); // T, F
Input and Output
Input and output is handled in the conventional way by including the memory
data buses in the processor’s interface, and by reserving a (small) portion of the
address space for external devices (address-mapped I/O). Addresses 0FC0H –
0FFFH are designated for devices. This is a range of 64 addresses. If such an
address is generated, the signal ioenb becomes active.
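The address decoding described here amounts to testing the six high-order address bits. A minimal Python sketch follows, assuming a 12-bit data address as suggested by the range 0FC0H to 0FFFH; the function name decode_io is hypothetical, and the returned pair merely models the signals ioenb and ioadr.

```python
def decode_io(adr):
    """Return (ioenb, ioadr) for a 12-bit data address."""
    ioenb = (adr >> 6) == 0x3F      # addresses 0FC0H .. 0FFFH: upper six bits all ones
    ioadr = adr & 0x3F              # 6-bit device address
    return ioenb, ioadr
```

Address 0FC4H thus selects device address 4, whereas 07FFH lies in ordinary data memory and leaves ioenb inactive.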
[Figure: block diagram of the TRM, showing Registers, A, B, ALU, H, CC, aluRes, regmux, IR, decode, PC, pcmux, +1, P-mem (4K x 18), D-mem (2K x 32), inbus, and outbus]
The problem is solved by introducing a facility to stall the instruction fetch when
the mentioned cases occur. The necessary additions to the TRM circuit are listed
below; stall is an input to the TRM.
reg stall1;
wire stall0, stallM, stallD;
assign stall0 = (LDR & ~stall1) | stallM | stallD | stall;
assign pcmux =
(~rst) ? 0 :
(stall0) ? PC : ... : nxpc;
always @ (posedge clk) begin // stall generation
if (~rst) stall1 <= 0;
else stall1 <= (LDR & ~stall1);
end
[Figure: timing of the signals stall0, stall1, and stallM]
Interrupts
An interrupt facility is necessary, if the processor needs to be able to respond
quickly to signals from external devices, i.e. where (occasional) polling of such
signals is inadequate. Interrupts are based on letting (external) signals determine
the choice of the next instruction at any time, i.e. by directly letting them control
pcmux. Our TRM features two distinct interrupt signals, irq0 and irq1:
assign pcmux =
(~rst) ? 0 :
(irq0 & intAck) ? 2 :
(irq1 & intAck) ? 3 : ... : nxpc;
Of course, it is mandatory to preserve the processor state upon interrupt,
because the interrupt may occur at any arbitrary point in the program. The
interrupt somewhat resembles a procedure call, and the response to an interrupt
that of executing a procedure (called interrupt handler). In the first place, the
processor stores its current PC value in a link register, from where it can be
recovered after the interrupt has been serviced. Then a fixed value according to
the interrupt source is forced to the PC (2 or 3 in our case). Typically an interrupt
handler would save the values of all other registers, or at least those which the
handler itself makes use of. This saving and later restoring of all registers is
time-consuming and not acceptable if hard real-time constraints have to be met.
The TRM therefore features a second bank of 8 registers. Upon interrupt, the
processor switches to interrupt mode by setting intMd, and to the use of the
alternate bank (address bit 3). It thereby disables further interrupts, and then
deposits the PC and the flag registers N, Z, C, V in the link register of the
alternate bank. For all this, an extra cycle must be inserted. It is marked by the
signal intAck.
As a consequence, a special return instruction must be provided which, in
addition to restoring the PC also switches back to the normal register bank and
restores N, Z, C, V. This is done by a BR instruction with IR[8] being set.
It is of course necessary to be able to disable interrupt signals. Thus we introduce state
registers intEnb0 and intEnb1. Evidently, a special instruction is required to set
these registers, and we abuse a form of the BR instruction for this purpose (with bit
10 being zero). We call this instruction Set Processor Status (PSR).
The additions necessary for the interrupt system are listed below, and there are
remarkably few of them.
reg intEnb0, intEnb1, intMd, intAck;
wire irq0e, irq1e;
assign irq0e = irq0 & intEnb0;
assign irq1e = irq1 & intEnb1;
always @ (posedge clk) begin // interrupt and mode handling
if (~rst) begin intEnb0 <= 0; intEnb1 <= 0; intMd <= 0; intAck <= 0; end
else if ((irq0e | irq1e) & ~intMd & ~stall0 & ~(IR == NOP)) begin
intAck <= 1; intMd <= 1; end
else if (BR & IR[10] & IR[8]) intMd <= 0; // return from interrupt
else if (BR & ~IR[10]) begin // SetPSR
intEnb0 <= IR[0]; intEnb1 <= IR[1]; intMd <= IR[2]; end
if (intAck & ~stall0) intAck <= 0;
end
Furthermore, we must provide an additional case in the code governing the PC:
else if ((irq0e | irq1e) & intAck) begin PC <= PCf; IR <= NOP; end
For regmux the additional case intAck must be included, bringing it to its final
form:
assign pcmux =
(~rst) ? 0 :
(stall0) ? PCf :
(irq0e & intAck) ? 2 :
(irq1e & intAck) ? 3 :
(BL) ? IR[11:0] + nxpc :
(Bc & cond) ? {IR[9], IR[9], IR[9:0]} + nxpc :
(BR & IR[10]) ? A[11:0] : nxpcF;
This concludes the description of the TRM processor implementation.
Technical Report 8. 8. 2010
PART 2
TRMTop
The principal purpose of the top module is to connect signals of one module with
signals of another module (or with external signals). This connecting occurs
under control of the TRM, i.e. according to the TRM’s interface signals ioadr,
iowr, and iord. Hence, the main components to be found in the top module are
multiplexers and decoders driven by ioadr. This is shown by the diagram, in
which the boxes in the middle represent individual devices, which can be either
implemented by other modules or (exceptionally) in the top module itself, as in
the cases of dip switches and LEDs.
The I/O addresses driving the decoders and multiplexers in this top module are:
adr   input            output
4     data Rx          data Tx                        RS-232
5     status           --                             bit 0: RxRdy, bit 1: TxRdy
6     millisec timer   reset timer interrupt (tick)
7     8 dip switches   10 LEDs
[Figure: the top module, showing the TRM, ioadr driving a decoder and a multiplexer, outbus, and inbus]
In this sample top module one instance each of TRM, FPU, RS232R, and
RS232T is created (imported). Furthermore 8 dip switches are made available
as inputs and 10 LEDs as outputs. They are represented as signals swi and leds
in the top module’s interface (heading). And so are the serial input RxD and
output TxD. Signals CLKBN and CLKBP stem from an oscillator, and rstIn from a
push button. We refrain from presenting the clock generation circuitry in detail,
but emphasize that the entire design is synchronous, i.e. driven by the single
clock clk.
Another feature of this top module is a timer (cnt1) counting elapsed
milliseconds. It is driven by another counter (cnt0) which counts, according to the
clock rate of 116.6 MHz, up to 116600 and then advances cnt1 and sets the tick
register to 1. The tick signal is fed to the TRM’s irq0 input, and thus may cause
an interrupt every millisecond, if enabled.
Hint: The FPU can be deleted by simply dropping its instantiation.
module TRM3Top(
input CLKBN,
input CLKBP,
input rstIn,
input RxD,
input [7:0] swi,
output TxD,
output [9:0] leds);
wire ClockIn;
wire PLLBfb;
wire pllLock;
wire clk, CLKx;
reg rst, tick;
wire[5:0] ioadr;
wire iord, iowr, stall, io4, io5, io6, io7, io16;
wire[31:0] inbus, outbus, fpubus;
wire [7:0] dataTx, dataRx;
wire rdyRx, doneRx, startTx, rdyTx;
reg [9:0] Lreg; // for LEDs
reg [17:0] cnt0; //driver of the millisecond counter
reg [31:0] cnt1; // millisecond counter
TRM trmx(.clk(clk), .rst(rst), .stall(stall), .irq0(tick), .irq1(1'b0),
.inbus(inbus), .ioadr(ioadr), .iord(iord), .iowr(iowr), .outbus(outbus));
FPU fpu(.clk(clk), .rst(rst), .stall(stall), .iowr(iowr & io16),
.ioadr(ioadr[1:0]), .inbus(outbus), .outbus(fpubus));
RS232R receiver(.clk(clk), .rst(rst), .RxD(RxD), .done(doneRx), .data(dataRx), .rdy(rdyRx));
RS232T transmitter(.clk(clk), .rst(rst), .start(startTx), .data(dataTx), .TxD(TxD), .rdy(rdyTx));
assign io4 = (ioadr == 4);
assign io5 = (ioadr == 5);
assign io6 = (ioadr == 6);
assign io7 = (ioadr == 7);
assign io16 = (ioadr[5:2] == 4'b0100);
assign inbus = io4 ? {24'b0, dataRx} :
io5 ? {30'b0, rdyTx, rdyRx} :
io6 ? cnt1 :
io7 ? swi : fpubus;
assign dataTx = outbus[7:0];
assign startTx = iowr & io4;
assign doneRx = iord & io4;
assign leds = Lreg;
always @(posedge clk)
if (~rst) begin tick <= 0; cnt0 <= 0; Lreg <= 0; end
else begin
if (iowr & io6) tick <= 0;
if (iowr & io7) Lreg <= outbus[9:0];
if (cnt0 == 116600) begin // the timer advances independently of LED writes
cnt1 <= cnt1 + 1; cnt0 <= 0; tick <= 1; end
else cnt0 <= cnt0 + 1;
end
always @(posedge clk) rst <= rstIn & pllLock;
endmodule
Fig. 2.3. The RS-232 data format (start bit, data bits d0 to d7, stop bit)
module RS232T(
input clk, rst,
input start, // request to accept and send a byte
input [7:0] data,
output rdy,
output TxD);
module RS232R(
input clk, rst,
input done, // "byte has been read"
input RxD,
output rdy,
output [7:0] data);
reg [8:0] tick;
reg [3:0] bitcnt;
reg [7:0] shreg;
A Floating-point Unit
Scientific computation is almost without exception based on floating-point
arithmetic. Fractional numbers (type REAL) are represented by a pair mantissa-
exponent, i.e.
x = m × B^e,   1.0 ≤ m < B
where B is a fixed base. The universally adopted, single-precision IEEE Standard
defines B = 2 and
x = <s, m’, e’> m = 1.m’, e = e’ - 127, and 1.0 ≤ m < 2.0
with a sign bit s, an exponent e’ of 8 bits, and a mantissa m’ of 23 bits. The
leading 1 of m is suppressed. A few examples of real numbers and their
representation in hexadecimal form are:
x       e    m        hexadecimal
0.5     -1   1.0      3F000000
1.0     0    1.0      3F800000
1.5     0    1.5      3FC00000
1.75    0    1.75     3FE00000
2.0     1    1.0      40000000
10.0    3    1.25     41200000
100.0   6    1.5625   42C80000
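The entries of the table can be reproduced with Python's standard struct module, which packs a number into the single-precision format. The helper names float_bits and fields are assumptions of this sketch; fields is valid for normalized numbers only.

```python
import struct

def float_bits(x):
    """Return the 32-bit single-precision pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def fields(x):
    """Return (s, e, m) with x = (-1)^s * m * 2^e for a normalized number x."""
    w = float_bits(x)
    s = w >> 31
    e = ((w >> 23) & 0xFF) - 127          # e = e' - 127
    m = 1 + (w & 0x7FFFFF) / 2**23        # m = 1.m'
    return s, e, m
```

For example, format(float_bits(10.0), "08X") gives 41200000, and fields(10.0) gives (0, 3, 1.25).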
[Figure: single-precision word format: sign s in bit 31, exponent in bits 30 to 23, mantissa in bits 22 to 0]
Floating-point division is not discussed here. It is a rather rare operation, too
rare to warrant a complex circuit. It is better implemented by an iterative method
in software. The following algorithm computes x = 1/a.
x := 1.0; z := 1.0 - a;
REPEAT x := x*(1.0+z); z := z*z UNTIL z = 0.0
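The iteration can be tried out in Python. Note the assumptions of this sketch: Python floats are double precision rather than the single-precision REAL of the text, the argument must satisfy 0 < a < 2 so that z shrinks, and the bound max_steps is a safeguard added here, not part of the algorithm above.

```python
def reciprocal(a, max_steps=64):
    """Approximate 1/a for 0 < a < 2 by the product (1+z)(1+z^2)(1+z^4)..."""
    x, z = 1.0, 1.0 - a
    for _ in range(max_steps):
        x = x * (1.0 + z)
        z = z * z                # the error term is squared in every step
        if z == 0.0:
            break
    return x
```

Since z is squared in each step, the number of correct digits roughly doubles per iteration; reciprocal(0.5) reaches 2.0 after a handful of steps.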
module FPU(
input clk, rst, iowr,
input [1:0] ioadr,
input [31:0] inbus,
output stall,
output [31:0] outbus);
reg [31:0] X, Y; // arguments
reg [26:0] s; // pipe reg
reg mulR;
wire io0, io1, io2, io3;
wire [27:0] x0, y0;
wire [36:0] x1, y1;
wire [40:0] x2, y2;
wire [26:0] x3, y3;
wire [7:0] xe, ye;
wire [8:0] dx, dy, e0, e1;
wire [7:0] sx, sy; // shift counts
wire [1:0] sx0, sx1, sy0, sy1;
wire sxh, syh;
wire [26:0] ss;
wire [31:0] Sum;
begin: shiftblk0
assign x1[i] = (sx0 == 3) ? x0[i+3] : (sx0 == 2) ? x0[i+2] : (sx0 == 1) ? x0[i+1] : x0[i];
assign y1[i] = (sy0 == 3) ? y0[i+3] : (sy0 == 2) ? y0[i+2] : (sy0 == 1) ? y0[i+1] : y0[i];
end
for (i = 0; i < 25; i = i+1)
begin: shiftblk1
assign x2[i] = (sx1 == 3) ? x1[i+12] : (sx1 == 2) ? x1[i+8] : (sx1 == 1) ? x1[i+4] : x1[i];
assign y2[i] = (sy1 == 3) ? y1[i+12] : (sy1 == 2) ? y1[i+8] : (sy1 == 1) ? y1[i+4] : y1[i];
end
for (i = 0; i < 25; i = i+1)
begin: shiftblk2
assign x3[i] = sxh ? 0 : (sx[4]) ? x2[i+16] : x2[i];
assign y3[i] = syh ? 0 : (sy[4]) ? y2[i+16] : y2[i];
end
endgenerate
assign ss = (X[31] ? -x3 : x3) + (Y[31] ? -y3 : y3); // add or subtract
always @ (posedge(clk)) s <= ss[26] ? -ss : ss;
assign z24 = ~s[25] & ~ s[24];
assign z22 = z24 & ~s[23] & ~s[22];
assign z20 = z22 & ~s[21] & ~s[20];
assign z18 = z20 & ~s[19] & ~s[18];
assign z16 = z18 & ~s[17] & ~s[16];
assign z14 = z16 & ~s[15] & ~s[14];
assign z12 = z14 & ~s[13] & ~s[12];
assign z10 = z12 & ~s[11] & ~s[10];
assign z8 = z10 & ~s[9] & ~s[8];
assign z6 = z8 & ~s[7] & ~s[6];
assign z4 = z6 & ~s[5] & ~s[4];
assign z2 = z4 & ~s[3] & ~s[2];
assign u[4] = z10; // u = shift count of post normalization
assign u[3] = z18 & (s[17] | s[16] | s[15] | s[14] | s[13] | s[12] | s[11] | s[10])
| z2;
assign u[2] = z22 & (s[21] | s[20] | s[19] | s[18])
| z14 & (s[13] | s[12] | s[11] | s[10])
| z6 & (s[5] | s[4] | s[3] | s[2]);
assign u[1] = z24 & (s[23] | s[22])
| z20 & (s[19] | s[18])
| z16 & (s[15] | s[14])
| z12 & (s[11] | s[10])
| z8 & (s[7] | s[6])
| z4 & (s[3] | s[2]);
assign u[0] = ~s[25] & s[24]
| z24 & ~s[23] & s[22]
| z22 & ~s[21] & s[20]
| z20 & ~s[19] & s[18]
| z18 & ~s[17] & s[16]
| z16 & ~s[15] & s[14]
| z14 & ~s[13] & s[12]
| z12 & ~s[11] & s[10]
| z10 & ~s[9] & s[8]
| z8 & ~s[7] & s[6]
| z6 & ~s[5] & s[4]
| z4 & ~s[3] & s[2];
assign e1 = e0 - u + 1;
assign u0 = u[1:0]; // u = shift count
assign u1 = u[3:2];
assign t0 = {s[25:0], 16'b0};
generate // normalize, shift left
for (i = 16; i < 42; i = i+1)
begin: shiftblk4
assign t1[i] = (u0 == 3) ? t0[i-3] : (u0 == 2) ? t0[i-2] : (u0 == 1) ? t0[i-1] : t0[i];
end
for (i = 16; i < 42; i = i+1)
begin: shiftblk5
assign t2[i] = (u1 == 3) ? t1[i-12] : (u1 == 2) ? t1[i-8] : (u1 == 1) ? t1[i-4] : t1[i];
end
for (i = 16; i < 42; i = i+1)
begin: shiftblk6
assign t3[i] = u[4] ? t2[i-16] : t2[i];
end
endgenerate
assign t4 = t3[41:17] + 1; // rounding
assign Sum = (xe == 0) ? Y : (ye == 0) ? X :
((t3[41:17] == 0) | e1[8]) ? 0 : {ss[26], e1[7:0], t4[23:1]};
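The chain z24 ... z2 and the equations for u can be read as a leading-zero count of the 26-bit magnitude s. The following Python model is a sketch for checking this reading and is not part of the design. The function names are chosen here for illustration, and the saturation at 24 is simply what the equations yield when at most s[1] or s[0] are set.

```python
def bit(s, i):
    """Return bit i of the integer s."""
    return (s >> i) & 1

def any_set(s, hi, lo):
    """Return 1 if any of the bits s[hi:lo] is set, else 0."""
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return int((s & mask) != 0)

def shift_count(s):
    """Evaluate the equations for u[4:0] on a 26-bit value s."""
    # z[k] = 1 when all bits s[25:k] are zero (z24, z22, ..., z2)
    z = {26: 1}
    for k in range(24, 0, -2):
        z[k] = z[k + 2] & (1 - bit(s, k + 1)) & (1 - bit(s, k))
    u4 = z[10]
    u3 = (z[18] & any_set(s, 17, 10)) | z[2]
    u2 = ((z[22] & any_set(s, 21, 18))
          | (z[14] & any_set(s, 13, 10))
          | (z[6] & any_set(s, 5, 2)))
    u1 = 0
    for k in (24, 20, 16, 12, 8, 4):
        u1 |= z[k] & any_set(s, k - 1, k - 2)
    u0 = 0
    for k in range(26, 2, -2):
        u0 |= z[k] & (1 - bit(s, k - 1)) & bit(s, k - 2)
    return (u4 << 4) | (u3 << 3) | (u2 << 2) | (u1 << 1) | u0

def leading_zeros(s):
    """Reference: leading zeros of a 26-bit value, saturated at 24."""
    return min(26 - s.bit_length(), 24)
```

Comparing shift_count with leading_zeros over a range of operands confirms that the two agree, which is why u can serve directly as the left-shift amount.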
It is remarkable that the program of the FPU is almost as long as that for the
entire processor TRM. It is therefore of interest to compare its performance with
that of a solution implementing real arithmetic by software. The result of a
comparison indicates that the hardware solution performs between 10 and 30
times faster than the software implementation. The extreme case is that of
subtraction with almost identical operands. This leads to a long post-
normalization shift, which is done in a loop in software. This case is a weak point
of floating-point arithmetic in general. It implies a loss of precision and is called
cancellation.
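The effect of cancellation can be observed numerically. The sketch below uses the host's IEEE 754 single-precision format as a stand-in for the 32-bit format of the FPU (an assumption; the two need not agree in every detail), and the helper names are ours. Double precision serves as the reference value.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def cancellation_error(a, b):
    """Relative error of the single-precision difference a - b."""
    exact = a - b                               # double precision as reference
    computed = to_f32(to_f32(a) - to_f32(b))    # operands rounded first
    return abs(computed - exact) / abs(exact)

# Nearly identical operands: the rounding error of the operand dominates.
near = cancellation_error(1.0000001, 1.0)
# Well separated operands: the difference is exact.
far = cancellation_error(3.0, 1.0)
```

With the operands 1.0000001 and 1.0 the relative error of the difference is roughly 19 percent, although each operand is itself accurate to about 7 decimal digits.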
Technical Report 8. 8. 2010
PART 3
About memories
In the early years of computers, memories had been considered as an integral
part of the central computing unit. This remained so through the eras of magnetic
drum memories, magnetic core memories, and static semiconductor memories
(SRAM). A change came with the RISCs (Reduced Instruction Set Computer),
which more strongly decouple memory and processing unit. Whereas the speed
of processors increased dramatically, the speed of memories also increased, but
at a lesser pace. But their capacity increased substantially, mainly due to
dynamic random access memories (DRAM). Cells in static RAMs consist of two
transistors and have 2 stable states. Thus they hold a bit (until given a new
value), and therefore they are called static. The dynamic RAM holds a bit in a
small capacitor coupled with a single transistor. This cell requires less space on a
die and therefore is dominant for large capacity devices.
The DRAM has, however, a few drawbacks. The most prominent is that
capacitors leak and discharge through the transistor. Therefore the charge must
be refreshed. This is achieved by reading the cell and restoring the old value
(through recharge). Refreshing requires additional circuitry, which must not
interfere with normal data access. DRAMs are typically refreshed at least every
millisecond.
Memory chips of the latest generation have capacities on the order of a gigabyte
and therefore require large multiplexers for reading and decoders for writing. As
a consequence, access is slower than for smaller devices. In the last decades,
the speeds of processors and of memories have increasingly diverged. Two
remedies are in use: 1. Data in memory are accessed in larger portions than
single words or bytes. 2. Buffers are placed in the data path between memory
and processor. These buffers are fast memories, called caches. Modern
processors feature cache memories on-chip. Naturally, caches further complicate
memory access, leading to more complex circuits. Such cache mechanisms are
commonly required to be invisible (transparent) to the computer user and to the
software. We will here first show how a large DDR memory is interfaced with the
TRM.
and it was designed by Ch. Thacker of Microsoft Research in Mountain View.
Thereby we obtain some freedom to ignore details of this particular type of DDR.
In addition to being periodically refreshed, the DDR memory must be initialized at
startup. This involves the loading of certain constants. Also, once the RAMs have
been configured, the individual delay lines associated with the FPGA data pins
must be adjusted to center the strobe in the "data valid" window.
These complicated tasks appear to require a substantial amount of circuitry. This
can be avoided by employing a dedicated, simple, programmed processor for
these tasks. The design of such a processor is described below. It is called TRM-
0. Once the system is running, the TRM-0 controls the periodic refresh of the
RAMs. Note that calibration can fail. The signal CalFailed is available to
programs as a status bit of the DDR interface. The TRM-0 will be presented at
the end of this Part.
Let us now describe the top module that connects to the DDR-Controller as an
external device of TRM. We start by showing the heading (interface) of this Top
module. In addition to the signals of the top module described in Part-1 of this
Report, it contains all signals leading to the memory chip on the ML-505 board.
They are directly passed on to the DDR-Controller module.
module TRM3DTop(
input CLKBN,
input CLKBP,
input rstIn,
input RxD,
input [7:0] swi,
output TxD,
output [9:0] leds,
inout [63:0] DQ, //the 64 DQ pins, signals to the memory chip
inout [7:0] DQS, //the 8 DQS pins
inout [7:0] DQS_L,
output [1:0] DIMMCK, //differential clock to the DIMM
output [1:0] DIMMCKL,
output [12:0] A, //addresses to DIMMs
output [7:0] DM,
output [1:0] BA, //bank address to DIMMs
output RAS, CAS, WE, ODT, ClkEn, S0);
The connections between the various modules are best sketched by the following
block diagram: The registers and signals, in addition to those present in the basic
version of TRM3Top are:
reg Read, Write; // DDR commands
reg RBempty1, Write1; // delayed DDR signals
reg RDrdy, shiftRD;
reg [22:0] Address;
reg [255:0] RD; // read data buffer from DDR
reg [255:0] WD; // write data buffer to DDR
wire AFfull, WBfull, RBempty, WriteAF, ReadRB, WriteWB;
wire [127:0] ReadData, WriteData;
Data are read and written in blocks of 256 bits (8 words). The memory can be
considered as consisting of 256-bit elements. When writing, first 8 words are
deposited into the WD buffer by 8 consecutive IO commands (with address 10).
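The relation between 32-bit words, the 256-bit block and the 128-bit data path (ReadData, WriteData) can be illustrated by a small Python model. This is a sketch only: the order in which words enter the buffer is not specified here, so placing word 0 in the least significant bits is an assumption, as are the function names.

```python
WORD_MASK = 0xFFFFFFFF

def pack_block(words):
    """Pack 8 32-bit words into one 256-bit integer.

    Assumption: word 0 occupies the least significant 32 bits.
    """
    if len(words) != 8:
        raise ValueError("a block holds exactly 8 words")
    block = 0
    for i, w in enumerate(words):
        block |= (w & WORD_MASK) << (32 * i)
    return block

def unpack_block(block):
    """Split a 256-bit integer back into 8 32-bit words."""
    return [(block >> (32 * i)) & WORD_MASK for i in range(8)]

def halves(block):
    """Split a 256-bit block into the two 128-bit transfers (low, high)."""
    mask128 = (1 << 128) - 1
    return block & mask128, (block >> 128) & mask128
```

A block thus crosses the 128-bit interface in two transfers of 4 words each.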
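[Figure: block diagram of the top module. Labels: TRM signals Addr, ioadr, iowr, iord; decode and sequencing logic; DDR controller signals Read, WriteAF, ReadRB, WriteWB, AFfull, RBempty, WBfull; memory signals RAS, CAS, WE, ODT, CkEnb, …; TRM-0 with D0 – D2 and injectCmd.]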
CONST A0 = 0FFFFFFCAH; A1 = 0FFFFFFCBH;
TYPE Block = ARRAY 8 OF INTEGER;
Introducing a Direct Memory Access Channel (DMA)
The solution presented above is the simplest, as far as hardware is concerned.
Its drawback is, however, low speed. This can be remedied by avoiding the use
of one instruction for each word transferred, and instead to transfer an entire
block through a single instruction. This solution introduces an important concept
of computer architecture: The direct memory access channel. It postulates that
not only the processor, but also other agents may obtain direct memory access.
The two driver procedures are then simplified to:
PROCEDURE Write(dst: INTEGER; VAR B: Block);
BEGIN
REPEAT UNTIL ~BIT(A1, 1); (*write buffer not full*)
PUT(A0, ADR(B) + 2000000H); (*DMA transfer of 8 words from B*)
REPEAT UNTIL ~BIT(A1, 2); (*command buffer not full?*)
PUT(A1, 2000000H + dst); (*write DDR*)
END Write;
assign dmadr = (dmaenb) ? dmaAdr : (((irs == 7) ? 0 : AA[11:0]) + off);
assign dmwr = (dmaenb) ? dmawr : ST & ~ioenb;
assign dmin = (dmaenb) ? dmain : B;
assign dmaout = dmout;
The only further change to the TRM logic is the dmaenb signal stalling the
processor:
assign stall0 = (LDR & ~stall1) | stallM | stallD | stall | dmaenb;
The major addition to the top module is the state machine. It controls the block
transfer between the TRM output to the write buffer WD, and the block transfer
between the read buffer RD and the TRM input. The state machine is triggered
by a PUT statement with I/O address 10. The command word contains the
address of local memory in bits 0 – 10, and either a read in bit 24 or a write in bit
25.
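The layout of the command word can be made concrete with a few lines of Python. The sketch follows the description above (local address in bits 0 – 10, read in bit 24, write in bit 25); the constant and function names are chosen here for illustration only.

```python
DMA_READ = 1 << 24     # bit 24 requests a read
DMA_WRITE = 1 << 25    # bit 25 requests a write (2000000H)
ADR_MASK = 0x7FF       # bits 0..10 hold the local memory address

def dma_command(local_adr, write):
    """Build the command word for a block transfer."""
    if not 0 <= local_adr <= ADR_MASK:
        raise ValueError("local address must fit into 11 bits")
    return local_adr | (DMA_WRITE if write else DMA_READ)

def decode_dma_command(word):
    """Extract the fields of a command word."""
    return {
        "adr": word & ADR_MASK,
        "read": bool(word & DMA_READ),
        "write": bool(word & DMA_WRITE),
    }
```

For a write, the result equals the value ADR(B) + 2000000H that the driver procedure Write passes to PUT.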
[Figure: state diagram of the DMA state machine, with the transition condition wcnt = 7.]
wire [127:0] ReadData, WriteData; // data from/to DDR
assign WriteAF = Read | Write1;
assign WriteWB = Write | Write1; // commands to DDR controller
assign ReadRB = ~RBempty;
assign WriteData = (Write1) ? WB[127:0] : WB[255:128];
assign dmaenb = ~(state == 0);
assign dmawr = (state == 1);
assign dmain = RB[31:0];
The state machine is expressed in Verilog by the following statements:
always @(posedge clk)
begin
if ((ioadr == 11) & iowr) begin // DDR command
ddradr <= outbus[22:0];
Read <= outbus[24]; RDrdy <= 0;
Write <= outbus[25];
end
else begin Read <= 0; Write <= 0;
if (~RBempty1 & RBempty) RDrdy <= 1;
end
PROCEDURE refresh;
BEGIN prechargeall; wait(1); (*unit of delay = 32ns*) refreshall; wait(3);
END refresh;
Start: inhibitDDR; wait(6000); toggleDDR; setDIMMclk;
InitMem: wait(12); prechargeall; wait(1);
bank2; bank3; bank1; MRS1;
wait(49); (*wait for DLL to lock*) refresh; refresh;
MRS2; MRS3; MRS4; wait(11); prechargeall; wait(1);
Calibrate: inhibitDDR; set Force;
wait(1); … ; wait(3); WriteCmd; toggle(StartDQcal); n := 0;
REPEAT ReadCmd; DEC(n) UNTIL n = 0;
wait(16); refresh; enableDDR; clear Force;
REPEAT wait(768); disableDDR; refresh; enableDDR END
[Figure: block diagram of the TRM-0. Labels: register bank (8 x 12), program memory (2K x 18), IR, ALU with N and Z, decode, PC, +1.]
op  d  0  n        n = 11-bit literal
op  d  1  s        d, s = 3-bit regno
op  operation
0   BIS   R.d := R.d | R.s    (in place of R.s may stand the literal n)
1   BIC   R.d := R.d & ~R.s   (bit clear)
2   ADD   R.d := R.d + R.s
3   SUB   R.d := R.d - R.s
4   MOV   R.d := R.s
5   not used
If s = 7, Din instead of R.7 is used as source
Branch instructions
7  cond  1  n
op  cond  operation
6         CL    R.d := PC+1; PC := n
7   0     BEQ   PC := n, if Z
7   1     BNE   PC := n, if ~Z
7   2     BLT   PC := n, if N
7   3     BGE   PC := n, if ~N
7   4     BLE   PC := n, if N|Z
7   5     BGT   PC := n, if ~(N|Z)
7   6     B     PC := n
7   7     NOP
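The two tables can be restated as a small executable model. The Python sketch below covers the register operations on 12-bit values and the branch conditions. How the flags are derived is not spelled out in the tables, so the definitions N = bit 11 of the result and Z = result is zero are assumptions, as are the function names.

```python
MASK12 = 0xFFF   # registers are 12 bits wide

def alu(op, d_val, s_val):
    """Return the 12-bit result of register operation op (0..4)."""
    if op == 0:
        r = d_val | s_val        # BIS
    elif op == 1:
        r = d_val & ~s_val       # BIC (bit clear)
    elif op == 2:
        r = d_val + s_val        # ADD
    elif op == 3:
        r = d_val - s_val        # SUB
    elif op == 4:
        r = s_val                # MOV
    else:
        raise ValueError("op 5 is not used; 6 and 7 are branches")
    return r & MASK12

def flags(result):
    """Assumed flag definition: N = sign bit (bit 11), Z = zero."""
    return bool(result & 0x800), result == 0

def branch_taken(cond, n, z):
    """Evaluate the condition field of a branch instruction (op = 7)."""
    table = {
        0: z,               # BEQ
        1: not z,           # BNE
        2: n,               # BLT
        3: not n,           # BGE
        4: n or z,          # BLE
        5: not (n or z),    # BGT
        6: True,            # B
        7: False,           # NOP
    }
    return table[cond]
```

For example, subtracting 1 from 0 yields 0FFFH with N set, so that a subsequent BLT is taken and a BGE is not.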
reg N, Z, T;
reg [10:0] PC;
reg [11:0] R0, R1, R2, R3, R4, R5, R6, R7;
assign D0 = R4;
assign D1 = R5;
assign D2 = R6;
assign D3 = R7;
assign trig = T;
assign A = (~IR[11]) ? {IR[10], IR[10:0]} :
(src == 0) ? R0 :
(src == 1) ? R1 :
(src == 2) ? R2 :
(src == 3) ? R3 :
(src == 4) ? R4 :
(src == 5) ? R5 :
(src == 6) ? R6 : Din;
assign B = (dst == 0) ? R0 :
(dst == 1) ? R1 :
(dst == 2) ? R2 :
(dst == 3) ? R3 :
(dst == 4) ? R4 :
(dst == 5) ? R5 :
(dst == 6) ? R6 : R7;
assign nxpc = PC + 1;
assign pcmux = ((op == 6) | (op == 7) & cond) ? A : nxpc;
Technical Report 12. 8. 2010
PART 4
reg loaded;
reg [31:0] Buf;
always @ (posedge clk)
if (~rst) loaded <= 0;
else if (wreq) begin Buf <= indata; loaded <= 1; end
else if (rdreq) loaded <= 0;
endmodule
If a higher degree of decoupling between the sending and the receiving nodes is
required, a buffer with several slots must be provided. In the following example,
16 entries are provided. The buffer is implemented by 32 LUT slices with the
macro RAM16X1D_1. The buffer is organized circularly; the counters are modulo
16 due to the fact that they consist of 4 bits.
module Channel6(
input clk, rst,
input wreq, rdreq,
output empty, full,
input [31:0] indata,
output [31:0] outdata);
genvar i;
generate //dual port register file
for (i = 0; i < 32; i = i+1)
begin: rf32
RAM16X1D_1 # (.INIT(16'h0000))
rfa(
.DPO(outdata[i]), // bit i of the read port
.SPO(),
.A0(out[0]), // R/W address, controls D and SPO
.A1(out[1]),
.A2(out[2]),
.A3(out[3]),
.D(indata[i]), // data in, bit i
.DPRA0(in[0]), // read-only adr, controls DPO
.DPRA1(in[1]),
.DPRA2(in[2]),
.DPRA3(in[3]),
.WCLK(~clk),
.WE(wreq));
end
endgenerate
It is noteworthy and important that the interfaces of the two versions of channels
are identical, and the two are therefore easily interchangeable. The interfacing of such
channels and TRMs occurs in the same way as that between RS-232 lines and
TRM, as presented in Part 2 of this Report.
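The behaviour of the 16-entry circular buffer can be sketched in Python. The pointers are kept modulo 16, as 4-bit counters would be. How empty and full are derived is not shown in the listing above; the common convention used here (empty when the pointers are equal, full when the write pointer is one step behind the read pointer, leaving 15 usable slots) is an assumption, as is the class name.

```python
class Channel16:
    """Behavioural model of a 16-slot circular buffer with 4-bit pointers."""

    def __init__(self):
        self.buf = [0] * 16
        self.inp = 0   # write pointer
        self.out = 0   # read pointer

    @property
    def empty(self):
        return self.inp == self.out

    @property
    def full(self):
        # assumed convention: one slot stays unused to tell full from empty
        return ((self.inp + 1) & 0xF) == self.out

    def write(self, x):
        """Store a 32-bit word; return False if the buffer is full."""
        if self.full:
            return False
        self.buf[self.inp] = x & 0xFFFFFFFF
        self.inp = (self.inp + 1) & 0xF    # modulo 16
        return True

    def read(self):
        """Return the oldest word, or None if the buffer is empty."""
        if self.empty:
            return None
        x = self.buf[self.out]
        self.out = (self.out + 1) & 0xF    # modulo 16
        return x
```

The sender would test full before writing and the receiver would test empty before reading, exactly as with the single-slot version.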
Point-to-point connections are less suitable in a multi-processor system, where
no pairs of processors are known a priori to communicate particularly frequently.
In this case, a system is required that potentially connects every node with every
other node. The traditional solution in this case is a bus. It inherently carries the
problems of delays, of access priorities, and of bottlenecks. Also, since buses are
usually implemented with tri-state gates, it is not easily practicable on FPGAs, as
they do not contain tri-state gates.
The most general solution is a crossbar switch, a matrix of gates. Each row
represents an input, each column an output. Crossbar switches are fast, but
require many resources. FPGAs are not particularly suitable for their
implementation, mostly because of the relative scarcity of long wires.
A likely alternative is the ring structure, where every processor is included as a
ring node. The ring has technically the advantage that it consists of
unidirectional, point-to-point connections only. It is therefore simple to implement
and simple to operate. However, depending on the number of nodes lying
between source and destination, there may be delays involved. Also, long
messages may monopolize the ring, thus inducing longer waits for nodes also
requesting access.
Nevertheless, we present here a basic implementation of a ring node as an
example of how processors may be connected in a simple way on an FPGA chip.
Each node contains a register between the ring input and ring output. This
register holds one data element and introduces a latency (delay of the data
traveling through the ring) of a single clock cycle. The node also contains two
buffers, one for the received data, and one for the data to be sent. Their purpose
is to decouple the nodes in time and thereby to increase the efficiency of the
connections. We postulate that the data are always sequences of bytes, and they
are called messages.
[Figure: ring node with buffer A (output outdata), buffer B (input indata), and the register slot between ringin and ringout.]
The figure shows that if a node is sending a message over the ring, buffer B is
fed to the ring output. If a message is received, the ring input is fed to buffer A.
Otherwise the input is transmitted to the output, with a single cycle’s delay.
We have chosen the elements of messages to be bytes. The ring and its
registers, called slots, are 10 bits wide, 8 for data and 2 for a tag to distinguish
between data and control bytes.
We postulate the following conventions: Messages are sequences of (tagged)
bytes. The first element of a message is a header. It indicates the destination and
the source number of the nodes engaged in the transmission. We assume a
maximum of 16 nodes, resulting in 4-bit node numbers. The last element is the
trailer. In between lie an arbitrary number of data bytes.
tag   data                  element
10    source, destination   header
00    xxxxxxxx              data byte
01    00000000              trailer
11    00000000              token
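The message format can be restated as a few Python helpers that build the 10-bit ring elements. This is a sketch with names of our choosing. It places the source in the upper and the destination in the lower 4 bits of the header, which agrees with the order given in the table and with the comparisons made by the node logic presented later in this Part.

```python
HEADER, DATA, TRAILER, TOKEN = 0b10, 0b00, 0b01, 0b11

def element(tag, data):
    """Combine a 2-bit tag and an 8-bit data field into a 10-bit element."""
    return (tag << 8) | (data & 0xFF)

def header(src, dst):
    """Header element: source in bits 7..4, destination in bits 3..0."""
    if not (0 <= src < 16 and 0 <= dst < 16):
        raise ValueError("node numbers are 4 bits wide")
    return element(HEADER, (src << 4) | dst)

def message(src, dst, payload):
    """Header, followed by the data bytes, followed by the trailer."""
    body = [element(DATA, b) for b in payload]
    return [header(src, dst)] + body + [element(TRAILER, 0)]
```

A message with two data bytes from node 1 to node 3 thus occupies four ring elements; the token that follows it on the ring is element(TOKEN, 0).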
When no messages are to be transferred, the ring is said to be idle. When a
node is ready to send a message, it must be granted permission in order to avoid
collisions with messages sent by other nodes. One may imagine a central
agency to rotate a pointer among the nodes, and the node so designated having
the permission to send its message. As we wish to avoid a central agency, we
instead insert a special element into the ring which takes over the role of the
pointer. It is called the token. In the idle state, only the token is in the ring. Such a
scheme is called a token ring.
Only a single message can be in the ring. When a node is ready to send a
message, it waits until the token arrives, and then replaces the token by the
message. The token is reinserted after the message. The header contains the
number of the destination node which triggers the receiver to become active.
When the message header arrives at the destination, that node feeds the
message into its receiver buffer. It removes the header from the ring by replacing
the slot with a zero data item.
The buffers are implemented as LUT memories with 64 elements, 10 bits wide.
Registers inA and outA are the 6-bit indices of the input buffer A, inB, outB those
of the output buffer B.
There are 4 separate activities proceeding concurrently:
1. A byte is fed from indata to the output buffer B, and the pointer inB is
advanced (incremented modulo buffer size). This is triggered by the input strobe
wreq with ioadr = 2 (0FC2H).
2. A byte is transferred from buffer B to the ring, and pointer outB is advanced.
This happens in the sending state, which is entered when the buffer contains a
message and the token appears in the slot. The sending state is left when a
trailer is transmitted.
3. A byte is transferred from the slot (ring input) to the input buffer A and the
pointer inA is advanced. This is triggered by the slot containing a message
header with the receiver’s number. A zero is fed to the ring output.
4. A byte is transferred from buffer A to outdata. This happens when the TRM
reads input. Thereafter the TRM must advance pointer outA by applying a wreq
signal and ioadr = 1.
The four concurrent activities are expressed in Verilog as shown below in one
block clocked by the input signal clk. The input of buffer A is ringin, that of buffer
B is indata (input from processor).
wire startsnd, startrec, stopfwd;
wire [9:0] A, B; // buffer outputs
assign startsnd = ringin[9] & ringin[8] & rdyS; //token here and ready to send
assign startrec = ringin[9] & ~ringin[8] & ((ringin[3:0] == mynum) | (ringin[3:0] == 15));
assign stopfwd = ringin[9] & ~ringin[8] & ((ringin[3:0] == mynum) | (ringin[7:4] == mynum));
assign outdata = A;
assign status = {mynum, sending, receiving, (inB == outB), (inA == outA)};
assign ringout = slot;
always @(posedge clk)
if (~rst) begin // reset and initialization
sending <= 0; receiving <= 0; inA <= 0; outA <= 0; inB <= 0; outB <= 0;
rdyS <= 0;
if (mynum == 0) slot <= 10'b1100000000; else slot <= 0; end
else begin
slot <= (startsnd | sending) ? B : (stopfwd) ? 10'b0 : ringin;
if (sending) begin // send data
outB <= outB + 1;
if (B[9] & B[8]) sending <= 0; end // send token
else if (startsnd) begin // token here, send header
outB <= outB + 1; sending <= 1; rdyS <= 0; end
else if (wreq) begin
inB <= inB + 1; // msg element into sender buffer
if (indata[9] & indata[8]) rdyS <= 1; end
if (receiving) begin
inA <= inA + 1;
if (ringin[8]) receiving <= 0; end // trailer: end of msg
else if (startrec) begin // receive msg header
inA <= inA + 1; receiving <= 1; end
if (rreq) outA <= outA + 1; // advancing the read pointer
end
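To follow a message around the ring cycle by cycle, the node logic can be transcribed into a small Python model. This is a sketch under several assumptions: the write enables of the two buffers are not part of the listing above and are modelled here in the obvious way; the processor side (wreq, rreq) is reduced to the helper methods put and received, which act between clock edges; and all names other than those of the Verilog signals are ours.

```python
TOKEN = 0x300

def tag(x):
    """Return the 2-bit tag of a 10-bit ring element."""
    return (x >> 8) & 3

class RingNode:
    def __init__(self, mynum):
        self.mynum = mynum
        self.slot = TOKEN if mynum == 0 else 0   # node 0 inserts the token at reset
        self.sending = self.receiving = self.rdyS = False
        self.bufA, self.bufB = [0] * 64, [0] * 64
        self.inA = self.outA = self.inB = self.outB = 0

    def put(self, x):
        """Processor write into buffer B (simplified model of wreq)."""
        self.bufB[self.inB] = x
        self.inB = (self.inB + 1) % 64
        if tag(x) == 3:                # token marks the end of the message
            self.rdyS = True

    def received(self):
        """Drain buffer A (simplified model of reading with rreq)."""
        items = []
        while self.outA != self.inA:
            items.append(self.bufA[self.outA])
            self.outA = (self.outA + 1) % 64
        return items

    def clock(self, ringin):
        """One clock edge; all right-hand sides use the old state."""
        B = self.bufB[self.outB]
        is_header = tag(ringin) == 2
        dst, src = ringin & 15, (ringin >> 4) & 15
        startsnd = tag(ringin) == 3 and self.rdyS
        startrec = is_header and dst in (self.mynum, 15)
        stopfwd = is_header and self.mynum in (dst, src)
        # slot: own message, zero (header removed), or pass-through
        self.slot = B if (startsnd or self.sending) else (0 if stopfwd else ringin)
        if self.sending:
            self.outB = (self.outB + 1) % 64
            if tag(B) == 3:            # token sent: message complete
                self.sending = False
        elif startsnd:
            self.outB = (self.outB + 1) % 64
            self.sending, self.rdyS = True, False
        if self.receiving or startrec:
            self.bufA[self.inA] = ringin      # assumed write enable of buffer A
            self.inA = (self.inA + 1) % 64
            if self.receiving:
                if ringin & 0x100:            # trailer: end of message
                    self.receiving = False
            else:
                self.receiving = True

def run(nodes, cycles):
    """Clock all nodes; node i reads the slot of node i-1."""
    for _ in range(cycles):
        inputs = [nodes[i - 1].slot for i in range(len(nodes))]  # sample first
        for node, ringin in zip(nodes, inputs):
            node.clock(ringin)

# Example: node 1 sends two data bytes to node 3 in a ring of 4 nodes.
nodes = [RingNode(i) for i in range(4)]
for x in (0x213, 0x041, 0x042, 0x100, 0x300):   # header, data, trailer, token
    nodes[1].put(x)
run(nodes, 6)
result = nodes[3].received()
```

In this run node 1 picks up the token in the first cycle, and after six cycles node 3 holds the header, the two data bytes and the trailer in its buffer A, while exactly one token is again circulating.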
A software driver
The pertinent driver software is described in Oberon. It is responsible for the
maintenance of the prescribed protocol and message format, and it is therefore
presented as a module. This module alone contains references to the hardware
through procedures PUT, GET, and BIT. Clients are supposed not to access the
hardware interface directly.
The module encapsulates and exports procedures Send and Rec, a predicate
Avail indicating whether any input had been received, and a function MyNum
yielding the node number.
MODULE Ring;
END Ring.
A test setup
For testing and demonstrating the Ring with 12 nodes we use a simple test
setup. It involves the program TestTRMRing for node 11, and the identical
program Mirror for all nodes 0 – 10. The former is connected via the RS-232 link
to a host computer running a general test program TestTRM for sending and
receiving numbers. The main program TestTRMRing (running on TRM) accepts
commands (via RS-232) for sending and receiving messages to any of the 12
nodes. Program Mirror then receives the sent message and returns it to the
sender (node 11), which buffers it until requested by a read message command.
Communication over the link is performed by module RS, featuring procedures
for sending and receiving integers and other items. The following are examples
of commands:
TestTRM.SR 1 3 10 20 30 40 50 0 0~
TestTRM.SR 1 8 0 0~
TestTRM.SR 1 3 10 0 4 11 12 0 5 13 14 0 7 15 16 17 0 0~
TestTRM.SR 2~ receive message
The first command sends to node 3 the sequence of numbers 10, 20, 30, 40, 50.
The second sends the empty message to node 8, and the third sends to node 3
the number 10, to node 4 the items 11 12, to node 5 the numbers 13, 14, and to
node 7 the numbers 15, 16, 17.
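The grouping of the numbers following command 1 can be restated in Python. The sketch mirrors the loop structure of the program TestTRMRing: a destination, then items up to a terminating 0, repeated until a destination of 0 is read. The function name is ours.

```python
def parse_send_command(numbers):
    """Split the integers after command 1 into (destination, items) pairs."""
    it = iter(numbers)
    messages = []
    dst = next(it)
    while True:                   # REPEAT ... UNTIL dst = 0
        items = []
        x = next(it)
        while x != 0:             # items up to the terminating 0
            items.append(x)
            x = next(it)
        messages.append((dst, items))
        dst = next(it)
        if dst == 0:
            break
    return messages

# The third command shown above, without the leading command number 1:
third = parse_send_command([3, 10, 0, 4, 11, 12, 0, 5, 13, 14, 0, 7, 15, 16, 17, 0, 0])
```

Since 0 terminates both the item list and the list of destinations, neither the value 0 nor node 0 as a further destination can be expressed within one command.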
[Figure: test setup. The PC is connected through the RS-232 link to the TRM at node 11; the ring comprises nodes 11, 10, ..., 0.]
MODULE TestTRMRing;
IMPORT RS, Ring;
VAR cmd, dst, src, x, len, typ, s, i: INTEGER;
buf: ARRAY 16 OF INTEGER;
BEGIN
REPEAT RS.RecInt(cmd);
IF cmd = 0 THEN RS.SendInt(Ring.MyNum())
ELSIF cmd = 1 THEN (*send msg*)
RS.RecInt(dst);
REPEAT len := 0; RS.RecInt(x);
WHILE x # 0 DO buf[len] := x; INC(len); RS.RecInt(x) END ;
Ring.Send(dst, 0, len, buf); RS.RecInt(dst)
UNTIL dst = 0;
RS.SendInt(len)
ELSIF cmd = 2 THEN (*receive msg*)
IF Ring.Avail() THEN
Ring.Rec(src, typ, len, buf);
RS.SendInt(src); RS.SendInt(len); i := 0;
WHILE i < len DO RS.SendInt(buf[i]); INC(i) END
END
ELSIF cmd = 3 THEN RS.SendInt(ORD(Ring.Avail()))
END ;
RS.End
UNTIL FALSE
END TestTRMRing.
MODULE Mirror;
IMPORT Ring;
VAR src, len, typ: INTEGER;
buf: ARRAY 16 OF INTEGER;
BEGIN
REPEAT Ring.Rec(src, typ, len, buf); Ring.Send(src, 0, len, buf) UNTIL FALSE
END Mirror.
Broadcast
The design presented here was, as already mentioned, intentionally kept simple
and concentrated on the essential, the transmission of data from a source to a
destination node. A single extension was made, first because it is useful in many
applications, and second in order to show that it was easy to implement thanks to
a sound basis. This is the facility of broadcasting a message, that is, to send it to
all nodes. The ring is ideal for this purpose. If the message passes once around
the ring, all nodes must simply be activated as receivers. We postulate that
address 15 signals a broadcast. Only two small additions to the circuit are
necessary, namely the term ringin[3:0] = 15 in the expression for startrec,
and the term ringin[7:4] = mynum in that of stopfwd.
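The effect of the two added terms can be checked with a Python transcription of the two expressions. The sketch is for illustration only; the function names follow the Verilog signals, and the constant BROADCAST is ours.

```python
BROADCAST = 15

def is_header(ringin):
    """A header carries the tag 10 in bits 9..8."""
    return ((ringin >> 8) & 3) == 0b10

def startrec(ringin, mynum):
    """Receive if the header names this node or the broadcast address."""
    dst = ringin & 15
    return is_header(ringin) and (dst == mynum or dst == BROADCAST)

def stopfwd(ringin, mynum):
    """Remove the header at its destination, or back at its source."""
    dst, src = ringin & 15, (ringin >> 4) & 15
    return is_header(ringin) and (dst == mynum or src == mynum)
```

For a broadcast header sent by node 2, every node starts receiving, but only node 2 itself removes the header once it has travelled around the ring.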
Discussion
The presented solution is remarkably simple and the Verilog code therefore brief
and the circuit small. This is most essential for tutorial purposes, where the
essence must not be encumbered by and hidden in a myriad of secondary
concerns, although in practice they may be important too.
Attractive properties of the implementation presented here are that there is no
central agency, that all nodes are perfectly identical, that no arbitration of any
kind is necessary, and that the message length is not a priori bounded. No length
counters are used; instead, explicit trailers are used to designate the message
end. All this results in simple and compact hardware.
The data path of the ring is widened by 2 bits, a tag for distinguishing data from
control bytes, which are token, message header, and message trailer. Actually, a
single bit would suffice for this purpose. Two are used here in order to retain an
8-bit data field also for headers containing 4-bit source and destination
addresses.
The simplicity has also been achieved by concentrating on the basic essentials,
that is, by omitting features of lesser importance, or features whose function can
be performed by software, by protocols between partners. The circuit does not,
for example, check for the adherence to the prescribed message format with
header and trailer. We rely on the total “cooperation” of the software, which
simply belongs to the design. In this case, the postulated invariants can be
established and safeguarded by packing the relevant drivers into a module,
granting access to the ring by exported procedures only.
A much more subtle point is that this hardware does not check for buffer
overflow. Although such overflow would not cause memory beyond the buffers to
be affected, it would overwrite messages, because the buffers are circular. We
assume that overflow of the sending buffer would be avoided by consistent
checking against pending overflow before storing each data element, for
example, by waiting for the buffer not being full before executing any PUT
operation:
REPEAT UNTIL ~BIT(adr, 1)
In order to avoid blocking the ring when a message has partially been stored in
the sending buffer, message sending is not initiated before the message end has
been put into the buffer (signal rdyS). This effectively limits the length of
messages to the buffer size (64), although several (short) messages might be
put into the buffer, and messages being picked from the buffer one after the
other.
A much more serious matter is overflow of the receiving buffer. In this case, the
overflowing receiver would have to refuse accepting any further data from the
ring. This can only be done by notifying the sender, which is not done by the
presented hardware. For such matters, communication protocols on a higher
level (of software) would be the appropriate solution rather than complicated
hardware.
We consider it essential that complicated tasks, such as avoiding overflow, or of
guaranteeing proper message formats, can be left to the software. Only in this
way can the hardware be kept reasonably simple. A proper module structure
encapsulating a driver for the ring is obviously necessary.
Technical Report 9. 9. 2010
PART 5
associative cache. Such memories, however, are unpopular, because they
require a lot of circuitry, essentially a comparator for every element of the table of
tags. Simpler, but still effective solutions exist.
[Figure: memory M, tags T, cache C.]
deal with words rather than bytes (4 per word) a word address consists of n = 26
bits. The DDRAM’s access path is 256 bits, i.e. 8 words wide. Hence k = 3. The
cache memory is implemented as a single block RAM with 1K = 2^10 words.
Hence m = 10-k = 7. The main memory then consists of 2^16 blocks of 1K words,
and an address a consists of 3 fields:
Madr   16 bits   a[25:10]   block in memory
Tadr    7 bits   a[9:3]     line in block
Wadr    3 bits   a[2:0]     word in line
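The decomposition of a word address into its three fields is easily expressed in Python. The sketch follows the table above; the function name is ours.

```python
def split_address(a):
    """Split a 26-bit word address into (Madr, Tadr, Wadr)."""
    if not 0 <= a < (1 << 26):
        raise ValueError("word address must fit into 26 bits")
    madr = (a >> 10) & 0xFFFF   # a[25:10], block in memory
    tadr = (a >> 3) & 0x7F      # a[9:3], line in block
    wadr = a & 0x7              # a[2:0], word in line
    return madr, tadr, wadr
```

Two addresses map to the same cache line exactly when their Tadr fields agree; they belong to the same line contents only if their Madr fields agree as well.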
the stack. The software stack is frequently accessed, which justifies its exclusion
from the cache.
[Figure: organization of the cache, with lines 0 to 127 selected by Tadr and the labels wadr, blockAdr and w.]
reg [2:0] wcnt; // DMA word count
reg [11:0] dmaAdr;
reg [22:0] DDadr;
reg [26:0] adrR;
wire [26:0] adr;
wire Twr, Twr1, miss, missE, modif, dbit, adrHi, dmEnb;
wire [6:0] Tadr;
wire [15:0] Madr;
wire [16:0] Tin, Tout;
Twr (and Twr1) are the write enables for the tags. The signal miss indicates a
cache miss; it is active when the address part Madr does not match the
corresponding tag entry (Tout) and the uppermost 1K block of memory is not
addressed (~adrHi). The table of tags is defined by
genvar i;
generate // tags for cache 128 x (16+1)
for (i = 0; i < 17; i = i+1)
begin: tags
RAM128X1D #(.INIT(128'h00000000000000000000000000000000))
TAG(
.A(Tadr), // r/w adr, controls D, SPO
.D(Tin[i]),
.SPO(),
.DPRA(Tadr), // read only adr, controls DPO
.DPO(Tout[i]),
.WCLK(clk),
.WE((i == 16) ? Twr1 : Twr));
end
endgenerate
The signal adr is now extended from 12 bits to 26 bits. The DMA signals are
taken over from the DMA implementation.
assign adr = ((irs == 7) ? 0 : AA[26:0]) + {19'b0, off};
assign dmadr = (dmEnb) ? dmaAdr : {1'b0, adrHi, adr[9:0]};
assign dmwr = (dmEnb) ? dmawr : ST & ~miss;
assign dmin = (dmEnb) ? dmain : B;
assign ddradr = DDadr;
assign dmaout = dmout;
assign adrHi = (Madr == 16'hffff);
assign miss = ~(Madr == Tout[15:0]) & ~adrHi;
assign missE = miss & caEnb;
assign Tadr = adr[9:3];
assign Madr = adr[25:10];
assign Tin = {dbit, Madr};
assign modif = Tout[16];
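The interplay of tag comparison, miss and modif can be tried out with a behavioural Python model of the tag table. This is a sketch under assumptions: the modif bit is taken to be set by every store and cleared when a line is replaced, the tags start at zero as given by INIT, and the class name and the result strings are ours. Data are not modelled.

```python
class DirectMappedCache:
    """Tag table of a direct-mapped cache with 128 lines of 8 words."""

    def __init__(self):
        self.tags = [0] * 128        # Madr of the block held in each line
        self.modif = [False] * 128   # line modified since it was loaded

    def access(self, adr, store=False):
        madr = (adr >> 10) & 0xFFFF  # adr[25:10]
        tadr = (adr >> 3) & 0x7F     # adr[9:3]
        if madr == 0xFFFF:           # adrHi: uppermost block is never a miss
            return "local"
        if self.tags[tadr] == madr:
            result = "hit"
        else:
            # a modified line must be written back before it is replaced
            result = "miss+writeback" if self.modif[tadr] else "miss"
            self.tags[tadr] = madr
            self.modif[tadr] = False
        if store:                    # assumption: every store sets modif
            self.modif[tadr] = True
        return result
```

Accessing line 5 of block 1 twice gives a miss followed by a hit; a store to line 5 of block 2 then replaces the line, and a later access to line 5 of block 3 requires a write-back first.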
The heart of the cache system is the state machine controlling data transfers
between SDRAM (DDR2) and cache. It is triggered out of the idle state whenever
a cache miss occurs. We chose the one-hot form of state machine with states Q0
– Q12. The – after many considerations – obvious solution is to extend the
already present rudimentary state machine, which stalls the LDR instruction for
one cycle, from 2 to 13 states with the following associated actions:
Q0 idle
Q1 extend memory access
Q2 initialize DMA
Q3, Q4 transfer 8 words from cache to buffer
Q5 wait until SDRAM ready
Q6 write buffer to SDRAM
Q7 initialize DMA
Q8 wait until SDRAM ready
Q9 read buffer from SDRAM
Q10, Q11 wait until data ready
Q12 transfer 8 words from buffer to cache
The state machine is described by the following diagram (MEM = LDR | ST).
[Figure: state diagram with states 0 to 12. Transition labels: MEM, ~MEM, ~miss, miss & ~modif, wcnt = 7 (twice), ~RDrdy, RDrdy, ddrd, cmd rdy.]
Q10 <= Q9 | Q10 & DDstat[0];
Q11 <= Q10 & ~DDstat[0] | Q11 & ~DDstat[0];
Q12 <= Q11 & DDstat[0] | Q12 & ~wc7; // dmawr
end
And this concludes the introduction of a direct cache store. It is not obvious that
the direct cache method would prove efficient. After all, it seems likely that cache
misses are frequent with 2^16 lines mapping from SDRAM to the same line in the
cache. But in fact the direct-mapped cache proved quite satisfactory, considering
its relative simplicity. An intermediary method between fully associative and
direct mapped cache is the n-way associative cache. Here n tag tables and n
cache memories coexist, and if any one of the tags in corresponding lines
matches the desired address, the associated cache yields the word to be
accessed. Only n comparators are needed. In present commercial processors up
to 8-way associative caches are provided. A much simpler and hardly less
effective solution is to double or quadruple the size of the cache.
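The difference between the direct-mapped and the n-way associative scheme shows up when two blocks compete for the same line. The Python sketch below counts misses for both. The replacement of the least recently used entry is an assumption (no replacement policy is prescribed here), and the class and function names are ours; with ways = 1 the model degenerates into a direct-mapped cache.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Tag store of an n-way associative cache (data not modelled)."""

    def __init__(self, ways, lines=128):
        self.ways = ways
        # per line: the blocks currently held, least recently used first
        self.sets = [OrderedDict() for _ in range(lines)]

    def access(self, block, line):
        """Return True on a hit, False on a miss."""
        entries = self.sets[line]
        if block in entries:
            entries.move_to_end(block)     # mark as most recently used
            return True
        if len(entries) == self.ways:
            entries.popitem(last=False)    # assumed policy: evict LRU entry
        entries[block] = True
        return False

def count_misses(cache, trace):
    """Number of misses for a sequence of (block, line) accesses."""
    return sum(0 if cache.access(b, l) else 1 for b, l in trace)

# Blocks 1 and 2 are accessed alternately, both through line 5.
trace = [(1, 5), (2, 5)] * 4
direct = count_misses(SetAssociativeCache(ways=1), trace)
two_way = count_misses(SetAssociativeCache(ways=2), trace)
```

For this access pattern the direct-mapped cache misses on every one of the 8 accesses, whereas the 2-way cache misses only on the first two.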
Typically, separate caches are provided for data and program access. Here we
have shown only a data cache. A program cache is simpler, because instructions
are read only. No modif condition and no write-back are needed.
Acknowledgement
My sincere thanks go to Ling Liu for her help, encouragement and drive in this
project. Without her advice and support the author would never have mustered
the patience to overcome the difficulties and aggravations caused by the
necessary tools, in particular the Verilog compiler and the Xilinx placer and
router. They were a great disappointment, as they proved to be rather unhelpful
in locating mistakes, and instead provide innumerable pitfalls through their
misguided efforts to “correct” programmers’ mistakes. In addition, huge lists of
“warnings” are utterly unattractive to find those warnings that actually may point
out mistakes.