40 EN - Computer Architecture Complexity and Correctness
40 EN - Computer Architecture Complexity and Correctness
This book owes much to the work of the following students and postdocs:
P. Dell, G. Even, N. Gerteis, C. Jacobi, D. Knuth, D. Kroening, H. Leister,
P.-M. Seidel.
March 2000
Silvia M. Mueller
Wolfgang J. Paul
Contents
1 Introduction 1
2 Basics 7
2.1 Hardware Model . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Components . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Cycle Times . . . . . . . . . . . . . . . . . . . . 9
2.1.3 Hierarchical Designs . . . . . . . . . . . . . . . . 10
2.1.4 Notations for Delay Formulae . . . . . . . . . . . 10
2.2 Number Representations and Basic Circuits . . . . . . . . 12
2.2.1 Natural Numbers . . . . . . . . . . . . . . . . . . 12
2.2.2 Integers . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Basic Circuits . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Trivial Constructions . . . . . . . . . . . . . . . . 17
2.3.2 Testing for Zero or Equality . . . . . . . . . . . . 19
2.3.3 Decoders . . . . . . . . . . . . . . . . . . . . . . 19
2.3.4 Leading Zero Counter . . . . . . . . . . . . . . . 21
2.4 Arithmetic Circuits . . . . . . . . . . . . . . . . . . . . . 22
2.4.1 Carry Chain Adders . . . . . . . . . . . . . . . . 22
2.4.2 Conditional Sum Adders . . . . . . . . . . . . . . 24
2.4.3 Parallel Prefix Computation . . . . . . . . . . . . 27
2.4.4 Carry Lookahead Adders . . . . . . . . . . . . . . 28
2.4.5 Arithmetic Units . . . . . . . . . . . . . . . . . . 30
2.4.6 Shifter . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.1 School Method . . . . . . . . . . . . . . . . . . . 34
2.5.2 Carry Save Adders . . . . . . . . . . . . . . . . . 35
2.5.3 Multiplication Arrays . . . . . . . . . . . . . . . . 36
2.5.4 4/2-Trees . . . . . . . . . . . . . . . . . . . . . . 37
2.5.5 Multipliers with Booth Recoding . . . . . . . . . 42
2.5.6 Cost and Delay of the Booth Multiplier . . . . . . 47
2.6 Control Automata . . . . . . . . . . . . . . . . . . . . . . 50
2.6.1 Finite State Transducers . . . . . . . . . . . . . . 50
2.6.2 Coding the State . . . . . . . . . . . . . . . . . . 51
2.6.3 Generating the Outputs . . . . . . . . . . . . . . . 51
2.6.4 Computing the Next State . . . . . . . . . . . . . 52
2.6.5 Moore Automata . . . . . . . . . . . . . . . . . . 54
2.6.6 Precomputing the Control Signals . . . . . . . . . 55
2.6.7 Mealy Automata . . . . . . . . . . . . . . . . . . 56
2.6.8 Interaction with the Data Paths . . . . . . . . . . . 58
2.7 Selected References and Further Reading . . . . . . . . . 61
2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Bibliography 543
Index 549
Chapter
1
Introduction
In this book we develop at the gate level the complete design of a pipelined
RISC processor with delayed branch, forwarding, hardware interlock, pre-
cise maskable nested interrupts, caches, and a fully IEEE-compliant float-
ing point unit.
The educated reader should immediately ask “So what? Such designs
obviously existed in industry several years back. What is the point of
spreading out all kinds of details?”
The point is: the complete design presented here is modular and clean.
It is certainly clean enough to be presented and explained to students. This
opens the way to covering the following topics, both in this text and in the
class room.
To begin with the obvious: we determine cost and and cycle times of
designs. Whenever a new technique is introduced, we can evaluate
its effects and side effects on the cycle count, the hardware cost,
and the cycle time of the whole machine. We can study tradeoffs
between these very real complexity measures.
As the design is modular, we can give for each module a clean and
precise specification, of what the module is supposed to do.
Because at all stages of the design we use modules with well defined
behavior, the process of putting them all together is in this text completely
precise.
Again, we begin with the obvious: one can try to learn the material
by reading the book alone. Because the book is completely self con-
tained this works. A basic understanding of programming, knowl-
edge of high school math, and some familiarity with proofs by in-
duction suffices to understand and verify (or falsify!) each and every
statement in this book.
a b2 a2 2ab b2
says the very same, but it is much easier to understand. Learning the
formalism of algebra is an investment one makes in high school and
which costs time. It pays off, if the time saved during calculations
with the formalism exceeds the time spent learning the formalism.
#
In this book we use mathematical formalism in exactly this way. It
I NTRODUCTION is the very reason why we can cover so much material so quickly.
We have already stated it above: at the very least the reader can
take the correctness proofs in this book as a highly structured and
formalized explanation as to why the authors think the designs work.
But this is not all. Over the last years much effort has been invested
in the development of computer systems which allow the formula-
tion of theorems and proofs in such a precise way, that proofs can
actually be verified by the computer. By now proofs like the ones
in this book can be entered into computer-aided proof systems with
almost reasonable effort.
Indeed, at the time of this writing (February 2000) the correctness
of a machine closely related to the machine from chapter 4 (with a
slightly different more general forwarding mechanism) has been ver-
ified using the system PVS [CRSS94, KPM00]. This also includes
the verification of all designs from chapter 2 used in chapter 4. Ver-
ification of more parts of the book including the floating point unit
of chapter 8 is under way and progressing smoothly (so far).
$%
There are three key concepts, which permit us to develop the material of
this book very quickly and at the same time in a completely precise way.
We conclude the introduction by highlighting some results from the chap-
ters of this book. In chapter 2 we develop many auxiliary circuits for later
use: various counters, shifters, decoders, adders including carry lookahead
adders, and multipliers with Booth recoding. To a large extent we will
specify the control of machines by finite state diagrams. We describe a
simple translation of such state diagrams into hardware.
In chapter 3 we specify a sequential DLX machine much in the spirit
of [PH94] and prove that it works. The proof is mainly bookkeeping. We
have to go through the exercise because later we establish the correctness
of pipelined machines by showing that they simulate sequential machines
whose correctness is already established.
In section 4 we deal with pipelining, delayed branch, result forwarding,
and hardware interlock. We show that the delayed branch mechanism can
be replaced by a mechanism we call “delayed PC” and which delays all
instruction fetches, not just branches.3 We partition machines into data
paths, control automaton, forwarding engine, and stall engine. Pipelined
machines are obtained from the prepared machines mentioned above by an
almost straightforward transformation.
Chapter 5 deals with a subject that is considered tricky and which has not
been treated much in the literature: interrupts. Even formally specifying
what an interrupt mechanism should do turns out to be not so easy. The
reason is, that an interrupt is a kind of procedure call; procedure calls in
turn are a high level language concept at an abstraction level way above
the level of hardware specifications.
Achieving preciseness turns out to be not so bad. After all preciseness
is trivial for sequential machines, and we generate pipelined machines by
transformation of prepared sequential machines. But the interplay of in-
terrupt hardware and forwarding circuits is nontrivial, in particular when it
comes to the forwarding of special purpose registers like, e.g., the register,
which contains the masks of the interrupts.
3 We are much more comfortable with the proof since it has been verified in PVS.
'
Chapter 6 deals with caches. In particular we specify a bus protocol by
I NTRODUCTION which data are exchanged between CPU, caches, and main memory, and
we specify automata, which (hopefully) realize the protocol. We explain
the automata, but we do not prove that the automata realize the protocol.
Model checking [HQR98] is much better suited to verify a statement of
that nature.
Chapter 7 contains no designs at all. Only the IEEE floating point stan-
dard is rephrased in mathematical language and theorems about rounding
are proven. The whole chapter is theory. It is an investment into chap-
ter 8 where we design an entire fully IEEE-compatible floating point units
with denormals, and exceptions, dual precision adder, multiplier, iterative
division, format conversion, rounding. All this on only 120 pages.
In chapter 9 we integrate the pipelined floating point unit into the DLX
machine. As one would expect, the control becomes more complicated,
both because instructions have variable latency and because the iterative
division is not fully pipelined. We invest much effort into a very com-
fortable forwarding mechanism. In particular, this mechanism will permit
the rounding mode of floating point operations to be forwarded. This, in
turn, permits interval arithmetic to be realized while maintaining pipelined
operation of the machine.
(
Chapter
2
Basics
In the model there are five types of basic components, namely: gates,
flipflops, tristate drivers, RAMs and ROMs. Cost and delay of the basic
components are listed in table 2.1. They are normalized relative to the cost
and delay of a 1-bit inverter. For the basic components we use the symbols
from figure 2.1.
Clock enable signals ce of flipflops and registers, output enable signals
oe of tristate drivers and write signals w of RAMs are always active high.
RAMs have separate data input and data output ports. All flipflops are
assumed to be clocked in each cycle; thus there is no need to draw clock
inputs.
A RAM with A addresses and d-bit data has cost
sl 0 1
Ad Din
oe w RAM
Dout
Din Ad
ce ROM
Dout Dout
00111100
a
0110 b
00111100 c
00111100
s
c’
)
Read and write times of registers and RAMs; d ram denotes the access H ARDWARE M ODEL
time of the RAM.
register RAM
read 0 dram
write ∆ Df f δ dram δ
and delay
log d A4 ; A 64
Dram A d
3 log A 10 ; A 64
The circuit in figure 2.2 has cost C FA and delay DFA , with
CFA 2 Cxor 2 Cand Cor
DFA Dxor maxDxor Dand Dor
%
In the computation of cycle times, we charge for reads and writes in regis-
ters and RAMs the times specified in table 2.2. Note that we start and end
counting cycles at the point in time, when the outputs of registers have new
values. The constant δ accounts for setup and hold times; we use δ 1.
Suppose circuit S has delay d S and RAM R has access time dram . The four
schematics in figure 2.3 then have cycle times
dS ∆ in case a)
dram dS ∆ in case b)
τ
dS dram δ in case c)
dS 2 dram δ in case d)
*
# +
BASICS
It is common practice to specify designs in a hierarchical or even recursive
manner. It is also no problem to describe the cost or delay of hierarchi-
cal designs by systems of equations. For recursive designs one obtains
recursive systems of difference equations. Section 2.3 of this chapter will
contain numerous examples.
Solving such systems of equations in closed form is routine work in the
analysis of algorithms if the systems are small. Designs of entire proces-
sors contain dozens of sheets of schematics. We will not even attempt
to solve the associated systems of equations in closed form. Instead, we
translate the equations in a straightforward way into C programs and let
the computer do the work.
Running a computer program is a particular form of experiment. Scien-
tific experiments should be reproducible as easily as possible. Therefore,
all C programs associated with the designs in this book are accessible at our
web site1 . The reader can easily check the analysis of the designs, analyze
modified designs, or reevaluate the designs with a new set of component
costs and delays.
DS I ; O DS I
DS I; O DS O
DS DS I; O
Circuits S do not exist in isolation; their inputs and outputs are connected
to registers or RAMs, possibly via long paths. We denote by AS I ; O the
maximum delay of a path which starts in a register or RAM, enters S via I
and leaves S via O . We call AS I ; O an accumulated delay. If all inputs
I are directly connected to registers, we have
AS I ; O DS I ; O
d) RAM to RAM
I’
inputs I
P’’ P’
P
outputs O
O’
Schematic Sc
with
TS1 d3 AS1 d3 ∆ DS1 d3 D f f δ
TS2 d1 AS2 d1 ∆ DS2 d1 D f f δ
TS2 d2 AS2 d2 ∆ AS1 d2 DS2 d2 D f f δ
! - !-
s a b c
c 1 abc 2
c s a b c
Thus, even the sum 1 can be represented with n 1 bits. The standard
algorithm for adding the binary numbers an 1 : 0 and bn 1 : 0 as well
as a carry in cin is inductively defined by
c 1
cin
ci si ci 1 ai bi
(2.2)
sn cn 1
1 2 sn 1 : 0
n
an bn cn
.
a 2n 1
2n 1
1
twon x a
a 0 an
1 1
The leading bit of a two’s complement number is therefore called its sign
bit. The basic properties of two’s complement numbers are summarized in
The first two equations are obvious. An easy calculation shows, that
a a an 12 ;
n
a b a 0b
a 1b 1
a b 1 mod 2n
The salient point about two’s complement numbers is that addition al-
gorithms for the addition of n-bit binary numbers work just fine for n-bit
two’s complement numbers as long as the result of the addition stays in the
range Tn . This is not completely surprising, because the last n 1 bits of n-
bit two’s complement numbers are interpreted exactly as binary numbers.
The following theorem makes this precise.
Let a an 1 : 0, b bn 1 : 0 and let cin 0 1. Let sn : 0
an 1 : 0 bn 1 : 0 cin and let the bits ci and si be defined as in
the basic addition algorithm for binary numbers. Then
a b cin Tn cn 1 cn2
a b cin 2n
1
an 1 bn1 an 2 : 0 bn 2 : 0 cin
2 n1
an 1 bn1 cn2 sn 2 : 0
2 n1
an 1 bn1 cn2 2 cn2 sn 2 : 0
2 n1
cn 1 sn 1 2 cn
2 sn 2 : 0
2 cn
n
1 cn2 sn 1 : 0
One immediately verifies
2n cn 1 cn2 sn 1 : 0 Tn cn 1 cn2
N THIS section a number of basic building blocks for processors are con-
structed.
a Æ b an Æ b a Æ bn 1 a Æ b0
The circuit in figure 2.6 (c) has inputs a bn 1 : 0 and outputs c a Æ b
The circuit consists of an n-bit Æ-gate where all inputs ai are tied to the
same bit a.
For Æ AND , OR, a balanced tree of n 1 many Æ-gates has inputs
an 1 : 0 and output b an 1 Æ Æ a0 . It is called an n-input Æ-tree.
The cost and the delay of the above trivial constructions are summarized
in table 2.4. The symbols of these constructions are depicted in figure 2.7.
/
BASICS a) b) c)
a[n-1] a[0] a[n-1] b[n-1] a[0] b[0] a b[n-1] b[0]
... ...
11
00 ...
Circuits of an n-bit inverter (a) and of an n-bit Æ-gate. The circuit (c)
computes a Æ bn 1 : 0.
n n n n
ce sl 0 1 oe
n n n
b[n-1:0] c[n-1:0] b[n-1:0]
Symbols of an n-bit register (a), an n-bit mux (b), an n-bit tristate driver
(c), an n-bit Æ-gate (d, e), and an n-input Æ-tree (f). In (e), all the inputs a i are tied
to one bit a.
Cost and delay of the basic n-bit components listed in figure 2.7.
n-bit n-input
register mux driver Æ-gate Æ-tree
cost n Cf f n Cmux n Cdriv n C Æ n 1 C Æ
)
#
# 0 12- %
BASIC C IRCUITS
An n-zero tester is a circuit with input an 1 : 0 and output
b an 1 a0
The obvious realization is an n-bit OR-tree, where the output gate is re-
placed by a NOR gate. Thus,
Since ai bi is equivalent to ai bi 0, the equality test can also be
expressed as
## +
Y 2k i j 1 V i 1 U j 1
xn 1 : k i xk 1 : 0 j
xn 1 : kxk 1 : 0 2k i j
*
n=1 n>1
BASICS
x[0] x[k-1 : 0]
k
dec(k)
K=2k U[K-1 : 0] 0110
Cdec 1 Cinv
Cdec n Cdec n2 Cdec n2 2n Cand
Ddec 1 Dinv
Ddec n Ddec n2 Dand
Thus, input x turns on the x low order bits of the output of the half de-
coder.
Let L denote the lower half and H the upper half of the index range
2n 1 : 0:
L 2n 1
1 : 0 H 2n 1 : 2n 1
Chdec 1 0
Chdec n Chdec n 1 2n 1
Cand Cor
Dhdec 1 0
Dhdec n Dhdec n 1 maxDand Dor
#
n=1 n>1 x[n-2 : 0]
BASIC C IRCUITS
0 x[0]
hdec(n-1)
U[L]
2n-1
x[n-1]
Y[1] Y[0]
2n-1 2n-1
Y[H] Y[L]
In the induction step of the correctness proof the last xn 2 : 0 bits
of U are set to one by induction hypothesis. If xn 1 0, then
If xn
1 1, then
x 2n 1
xn 2 : 0
yH U and
2n 1
yL 1
H n 1 : n2
L n2 1 : 0
yH lz xH and
yL lz xL
m=0 m>0
y m-1
BASICS x[0] L
x[L] lz(n/2) m 1
y m-1
H
x[H] lz(n/2) m 0 y[m:0]
y[0] 0
Thus,
lz xH if lz xH 2m 1
lz xH xL m 1
2
lz xL if lz xH 2m 1
0yH m 1 : 0 if yH m 1 0
z if yH m 1 1
where
z 10m 1 yL m 1 : 0
01yL m 2 : 0 if yL m 1 0
10yL m 2 : 0 if yL m 1 1
yL m 1 yL m 1 yL m 2 : 0
Cost and delay of this circuit are
Clz 1 Cinv
Clz n 2 Clz n2 Cmux m 1 Cinv
Dlz 1 Dinv
Dlz n Dlz n2 Dinv Dmux
Full adders implement one step of the basic addition algorithm for binary
numbers as illustrated in table 2.3 of section 2.2. The circuit in figure 2.2
of section 2.1 happens to be a full adder with the following cost and delay
An n-adder is a circuit with inputs an 1 : 0, bn 1 : 0, cin and outputs
sn : 0 satisfying
a b cin s
The most obvious adder construction implements directly the basic ad-
dition algorithm: by cascading n full adders as depicted in figure 2.11, one
obtains a carry chain adders. Such adders are cheap but slow, and we
therefore do not use them.
A half adder is a circuit with inputs a c and outputs c s satisfying
c s a c
s a c and c ac
#
a c
BASICS
01 01
c’ s
a0 cin
a1 HA
c0
HA s0
an-1 cn-1
...
c1 s1
HA
sn sn-1
the obvious realization of half adders consists of one AND gate and one OR
gate, as depicted in figure 2.12.
An n-incrementer is a circuit with inputs an 1 : 0 cin and outputs
sn : 0 satisfying
a cin s
By cascading n half adders as depicted in figure 2.13 (b), one obtains a
carry chain incrementer with the following cost and delay:
CCCI n n Cxor Cand
DCCI n n 1 Dand maxDxor Dand
The correctness proof for this construction follows exactly the lines of the
correctness proof for the basic addition algorithm.
The most simple construction for conditional sum adders is shown in figure
2.14. Let m n2 and k n2 and write
sn : 0 sn : m sm 1 : 0
&
&
b[n-1:m] a[n-1:m] b[m-1:0] a[m-1:0]
0011 1 0011 0
cin
A RITHMETIC
C IRCUITS
adder(k) adder(k) adder(m)
s1[n:m] s0[n:m] m
k+1
1 0
cm-1
s[n:m] s[m-1:0]
k n2 .
then
sn : m an 1 : m bn 1 : m cm 1
Thus, the high order sum bits in figure 2.14 are computed twice: the sum
bits s0 n : m are for the case cm 1 0 and bits s1 n : m are for the case
nlog 3 c 1 n1 57 c 1
This is too expensive.
For incrementers things look better. The high order sum bits of incre-
menters are
an 1 : m if cm 1 0
an 1 : m 1 if cm 1 1
This leads to the very simple construction of figure 2.15. Our incrementer
of choice will be constructed in this way using carry chain incrementers
'
11
00
a[n-1:m] a[m-1:0]
BASICS
inc(k) inc(m)
s1[n:m] 0 s0[n:m] m
k+1
1 0
cm-1
s[n:m] s[m-1:0]
Note that in figure 2.15, the original problem is reduced to only two
problems of half the size of the original problem. Thus, this construction
could be applied recursively with reasonable cost (see exercise 2.1). One
then obtains a very fast conditional sum incrementer CSI.
Indeed, a recursive construction of simple conditional sum adders turns
out to be so expensive because disjoint circuits are used for the computa-
tion of the candidate high order sum bits s0 n : m and s1 n : m. This flaw
can be remedied if one constructs adders which compute both, the sum and
the sum +1 of the operands a and b.
An n-compound adder is a circuit with inputs an 1 : 0 bn 1 : 0 and
outputs s0 n : 0 s1 n : 0 satisfying
s0 a b
s 1
a b 1
A recursive construction of the n-compound adders is shown in figure
2.16. It will turn out to be useful in the rounders of floating point units.
Note that only two copies of hardware for the half sized problem are used.
Cost and delay of the construction are
Cadd2 1 Cxor Cxnor Cand Cor
Cadd2 n Cadd2 k Cadd2 m 2 Cmux k 1
Dadd2 1 maxDxor Dxnor Dand Dor
Dadd2 n Dadd2 m Dmux k 1
(
&
n=1 n>1 a[n-1:m] b[n-1:m] a[m-1:0] b[m-1:0]
a b A RITHMETIC
add2(k) add2(m) C IRCUITS
mux(k+1) mux(k+1)
S1[1:0] S0[1:0]
S1[n:m] S0[n:m] S1[m-1:0] S0[m-1:0]
X n-1 X n-2 X3 X2 X1 X0
...
PP (n/2)
...
Yn-1 Yn-2 Y2 Y1 Y0
y1 yn with yi x1 Æ Æ xi .
A recursive construction of efficient parallel prefix circuits based on Æ-
gates is shown in figure 2.17 for the case that n is even. If n is odd, then
one realizes PP n 1 by the construction in figure 2.17 and one computes
Æ
it follows
Yi Xi Æ Æ X0 X2i1 Æ Æ X0 Y2i1
The computation of the outputs
Y2i X2i Æ Y2i 1
/
is straightforward. For cost and delay, we get
BASICS
CPP 1
Æ 0
CPP n
Æ CPP n2 n 1 C
Æ Æ
DPP 1
Æ 0
DPP n DPP n2 2 D
Æ Æ Æ
pi j a b 1
a j : i b j : i 1 j i1
gi j a b 1
a j : i b j : i 10 j i1
g0 j a b cin 1
a j : 0 b j : 0 cin 10 j 11
Obviously, we have
pi i
ai bi
gi i
ai bi for i0
g0 0
a0 b0 cin a0 b0
Suppose one has already computed the generate and propagate signals
for the adjacent intervals of indices i : j and j 1 : k, where i j k.
The signals for the combined interval i : k can then be computed as
pi k
pi j p j1 k
gi k
g j1 k gi j p j1 k
g p g2 p2 Æ g1 p1
g2 g1 p2 p1 p2 M
)
&
g2 p2 g1 p1
A RITHMETIC
C IRCUITS
g p
a0 b0
...
g n-1 p n-1 g1 p1 g0 p0
PP (n)
G n-1 G n-2 G1 G 0 P0 cin
...
sn s n-1 s1 s0
A simple exercise shows that the operation Æ defined in this way is asso-
ciative (for details see, e.g., [KP95]).
Hence, figure 2.18 can be substituted as a Æ-gate in the parallel prefix
circuits of the previous subsections. The point of this construction is that
the i-th output of the parallel prefix circuit computes
Gi Pi gi pi Æ Æ g0 p0 gi 0 pi 0
ci pi 0
g g2 g1 p2 g2 g1 p2
The cost and the delay of the whole CLA adder are
if sub 0
op
if sub 1
a op b 0. This flag has to be correct even in the presence of an over-
flow. With the help of this flag one implements for instance instructions
of the form “branch if a b”. In this case one wants to know the sign of
a b even if a b is not representable with n bits.
Figure 2.20 shows an implementation of an n-bit arithmetic unit. The
equation
b b 1
translates into
b b sub and cin sub
The flag neg is the sign bit of the sum a b Tn1 . By the argument
at the end of section 2.2, the desired flag is the sum bit sn of the addition
#
&
It can be computed as
A RITHMETIC
neg sn cn 1 an 1 bn 1 cn 1 pn 1 C IRCUITS
adder, all the carry bits are available, whereas the conditional sum adder
only provides the final carry bit cn 1 . Since the most significant sum bit
ov f cn 1 pn 1 cn 2
pn 1
sn 1 neg
Let add denote the binary adder of choice; the cost and the delay of the
arithmetic unit, then be expressed as
&(
The function cls is called a cyclic left shift, the function crs is called a cyclic
right shift, and the function lrs is called logic right shift. We obviously
have
crs a i cls a n i mod n
cls a i if s 1
r
a otherwise
0 1 0 1 0 1 0 1
s
rn-1 ... ri ri-1 ... r0
a[n-1:0]
cls(n, 20 ) b[0]
r0
cls(n, 21 ) b[1]
...
b[m-1:0]
1
a[n-1:0]
inc(m)
m
CLS(n)
r[n-1:0]
#
an-1 ai ai-1 ... a0 &
...
A RITHMETIC
0 0 C IRCUITS
0 1 0 1 0 1 0 1
s
rn-1 ... ri ri-1 ... r0
r crs a b
lrs a i if s 1
r
a otherwise
##
BASICS
ET a an 1 : 0 and b bm 1 : 0, then
a b 2n 1 2m 1 2nm 1 (2.3)
Thus, the product can be represented with n m bits.
An n m-multiplier is a circuit with an n-bit input a an 1 : 0, an
m-bit input b bm 1 : 0, and an n m-bit output p pn m 1 : 0
such that a b p holds.
' 5
Obviously, one can write the product a b as a sum of partial products
m1
a b ∑ a bt 2t
t 0
with
a bt 2t an 1 bt a0 bt 0t
Thus, all partial products can be computed with cost n m Cand and delay
Dand . We denote by
j k1
∑t j a bt 2
Sj k t
(2.4)
a b j k 1 : j 2 j 2nk j
the sum of the k partial products from position j to position j k 1.
Because S j k is a multiple of 2 j it has a binary representation with j trailing
S j kh
S j k S jk h
S0 t
S0 t 1 St 11
Let x be a natural number and suppose the two binary numbers sn 1 : 0
and t n 1 : 0 satisfy
s t x
i.e., the outputs s and t are a carry save representation of the sum of the
numbers represented at the inputs. As carry save adders compress the sum
of three numbers to two numbers, which have the same sum, they are also
called n-3/2-adders. Such adders are realized, as shown in figure 2.26,
simply by putting n full adders in parallel. This works, because
n1
a b c ∑ ai bi ci 2i
i 0
n1
∑ ti1 si 2i
i 0
n1
∑ 2 ti1 si 2i s t
i 0
The cost and the delay of such a carry save adder are
C3 2add n n CFA
D3 2add n DFA
The point of the construction is, of course, that the delay of carry save
adders is independent of n.
#'
a[n-1] b[n-1] a[1] b[1] a[0] b[0]
BASICS c[n-1] c[1] c[0]
FA ... FA FA 0
0m-1 0m-1-j 0
j
0m-1
save representation of S0 t 1 are fed into an n-carry save adder. The result
many n-carry save adders as suggested above, one obtains an addition tree
which is also called a multiplication array because of its regular structure.
If the final addition is performed by an n m-carry lookahead adder,
one obtains an n m-multiplier with the following cost and delay
#(
'
a) b)
S0,1
S1,1 S0,t-1
M ULTIPLIERS
S2,1 St-1,1 0 ... 0
S0,3 0 S0,t
Generating a carry save representation of the partial sums S 0 3 (a) and
S0 t (b).
C4 2add n 2 C3
2add n 2 n CFA
D4 2add n 2 D3 2add n 2 DFA
a) b) S2K-1 SK SK-1 S0
S3 S2 S1 S0 ... ...
T(K/2) T(K/2)
4/2-adder
4/2-adder
4a 3 M 4 a m
hence
a m 3 M 4
Note that for i 0 1 , the partial products Si 1 are entered into the
tree from right to left and that in the top level of the tree the 3/2-adders are
arranged left of the 4/2-adders. For the delay of a multiplier constructed
with such trees one immediately sees
Estimating the cost of the trees is more complicated. It requires to estimate
the cost of all 3/2-adders and 4/2-adders of the construction. For this es-
timate, we view the addition trees as complete binary trees T in the graph
theoretic sense. Each 3/2-adder or 4/2-adder of the construction is a node
v of the tree. The 3/2-adders and 4/2-adders at the top level of the addition
tree are then the leaves of T .
The cost of the leaves is easily determined. Carry save representations of
the sums Si 3 are computed by n-3/2-adders in a way completely analogous
#*
Si,1
Si+1,1
BASICS
Si+2,1
Si,3 0
Si+3,1
Si,4 0
Partial compression of S i 4
c v 2 n CFA
Si kh Si k Sik h
3
Let T be a complete binary tree with depth µ. We number the levels from
the leaves to the root from 0 to µ. Each leaf u has weight W u. For some
natural number k, we have W u k k 1 for all leaves, and the weights
are nondecreasing from left to right. Let m be the sum of the weights of
3 Formally, figure 2.32 can be viewed as a simplified n 3-4/2-adder
&
h n k i
'
0 ... 0 Si,k M ULTIPLIERS
0 ... 0
0 ... 0 Si+k,h
0 ... 0
4/2-adder(n+h)
0 ... 0 Si,k+h
0 ... 0
n+h k i
the leaves. For µ 4, m 53, and k 3, the leaves would, for example,
have the weights 3333333333344444. For each subtree t of T , we define
W t ∑ W u
u leaf of t
where u ranges over all leaves of t. For each interior node v of T we define
L v and R v as the weight of the subtree rooted in the left or right son of
v, respectively. We are interested in the sums
µ
H ∑ L v and H ∑H
level 1
where v ranges over all nodes of level . The cost H then obeys
µ m2 2 µ1
H µ m2
By induction on the levels of T one shows that in each level weights are
nondecreasing from left to right, and their sum is m. Hence,
2H ∑ L v ∑ R v ∑ W v m
level level level
h v L v R v2 L v R v L v2
&
Observe that all nodes in level , except possibly one, have weights in
BASICS k 2 , k 1 2 . Thus, in each level there is at most one node v with
h v k 1 2
1
k2 1
2
2 2
h W L u
E 2 H µ m
2 2µ 1
2 05 M 4 m3
Thus, the upper bound is quite tight. A very good upper bound for the cost
of 4/2-trees is therefore
A good upper bound for the cost of multipliers built with 4/2-trees is
C4
2mul n m n m Cand C4 2tree n m CCLA n m
t s
0
add(n+m)
p[n+m-1:0]
becomes more expensive and slower. One therefore has to show, that the
savings in the addition tree outweigh the penalty in the partial product
generation.
Figure 2.34 depicts the structure of a 4/2-tree multiplier with Booth re-
coding. Circuit Bgen generates the m Booth recoded partial products S2 j 2 ,
which are then fed into a Booth addition tree. Finally, an ordinary adder
produces from the carry save result of the tree the binary representation of
the product. Thus, the cost and delay of an n m-multiplier with 4/2-tree
and Booth recoding can be expressed as
C4 2Bmul n m CBgen n m C4 2Btree n m CCLA n m
D4 2Bmul n m DBgen n m D4 2Btree n m DCLA n m
7 8
In the simplest form (called Booth-2) the multiplier b is recoded as sug-
gested in figure 2.35. With bm1 bm b 1 0 and m m 12,
one writes
m¼ 1
b 2b b ∑ B2 j 4 j
j 0
where
B2 j 2 b2 j b2 j 1 2 b2 j 1 b2 j
2 b2 j1 b2 j b2 j 1
Bm Bm-2 Bi B2 B0
Booth digits B2 j
With
E2 j C2 j 3 2n1
E0 C0 4 2n1
e2 j binn3 E2 j
e0 binn4 E0
j 0 3
a b
j 0
e2 j 1s2 j d2 j s2 j s2 j
e0 s0 s0 s0 d0 s0 s0
&&
'
1
M ULTIPLIERS
1 1 0 0 0 0
E0 :
+ 00B
11
- 00 0
11
d0 = < a >
1 1 0 0 0 0
E2 :
+ d2 = < a > 0
1B2
-
1 1 0 0 0 0
E4 :
+
- 00B4
11
d4 = < a >
0
1
00 1
11 0
00
1100
11
00
11
1 1 0 0 0 0
E2m’-2 :
11B
00
+
- d
2m’-2
00
11
= <a>
2m’-2
Summation of the E 2 j
f2 j 1 s2 j d2 j s2 j
f0 s0 s0 s0 d0 s0
&'
g0 s0 s0 s0 d0 s 0 0 0
BASICS s2 d2 s 2 s0
g2 1 0
g4 1 s4 d4 s 4 0 s
2
1
0
1
0
11
00
g 2m’ 1 s2m’ d2m’ s2m’ 0 s 2m’-2
g2 j f2 j 0 s2 j 2 0 1n5
g0 f0 00 0 1n6
then
g2 j 4 f2 j s2 j 2 4 F2 j s2 j 2
m¼ 1 m¼ 1
∑ j 0 4 F2 j s2 j 2 4 j ∑j
1
0
g2 j 4 j1
We define
j k1
S2 j 2k
∑ g2 j 4 j 1
t j
then
S2 j 2
g2 j 4 j 1
f2 j 0s2 j 2 4
j 1
S2 j 2kh
S2 j 2k S2 j2k 2h
and it holds:
S2 j 2k
S2 j 2k 1 g2 jk 1 4 jk 2
g2 j 1s2 j d2 j s2 j 0s2 j 2
g0 s0 s0 s0 d0 s0 00
shifted by 2 j 2 bit positions. The d2 j binn1 a B2 j are easily
determined from B2 j and a by
0 0 if B2 j 0
0 a if B2 j 1
d2 j
a 0 if B2 j 2
For this computation, two signals indicating B2 j 1 and B2 j 2 are
necessary. We denote these signals by
1 if B2 j 1 1 if B2 j 2
b12 j b22 j
0 otherwise 0 otherwise
and calculate them by the Booth decoder logic BD of figure 2.38 (a). The
decoder logic BD can be derived from table 2.6 in a straightforward way.
It has the following cost and delay:
CBD Cxor Cxnor Cnor Cinv
DBD maxDxor Dxnor Dnor
The selection logic BSL of figure 2.38 (b) directs either bit ai, bit ai
1, or 0 to position i 1. The inversion depending on the sign bit s2 j then
yields bit g2 j i 3. The select logic BSL has the following cost and delay:
CBSL 3 Cnand Cxor
DBSL 2 Dnand Dxor
&/
BASICS Representation of the Booth digits
b2 j 1 : 2 j 1 B2 j b2 j 1 : 2 j 1 B2 j
000 0 100 -2
001 1 101 -1
010 1 110 -1
011 2 111 -0
s2j
d2j [i+1]
/s2j s2j b22j b12j
g2j [i+3]
The Booth decoder BD (a) and the Booth selection logic BSL (b)
The select logic BSL is only used for selecting the bits g2 j n 2 : 2; the
remaining bits g2 j 1 : 0 and g2 j n 5 : n 3 are fixed. For these bits, the
selection logic is replaced by the simple signal of a sign bit, its inverse, a
zero, or a one. Thus, for each of the m partial products n 1 many select
circuits BSL are required. Together with the m Booth decoders, the cost of
the Booth preprocessing runs at
¼
Redundant Partial Product Addition Let M 2 log m be the smallest
34 M m M
E 4H
E 2 µ m
Thus, the delay and the cost of the 4/2-tree multiplier with Booth-2 recod-
ing can be expressed as
C4 2Btree n m 2 E CFA
n m 2 2 µ m CFA
D4 2Btree 2 µ 1 DFA
Let C and D denote the cost and delay of the Booth multiplier but with-
out the n m-bit CLA adder, and let C and D denote the corresponding
cost and delay of the multiplier without Booth recoding:
C C4 2Bmul n m CCLA n m
D D4 2Bmul n m DCLA n m
C C4 2mul n m CCLA n m
D D4 2mul n m DCLA n m
C C 4524655448 816%
D D 5562 887%
C C 1023412042 849%
D D 4350 860%
η : Z In Out
If the automaton is in state z and reads input symbol in, it then out-
puts symbol η z in and goes to state δ z in.
If the output function does not depend on the input in, i.e., if it can be
written as
η : Z Out
then the automaton is called a Moore automaton. Otherwise, it is called a
Mealy automaton.
Obviously, the input of an automaton which controls parts of a com-
puter will come from a certain number σ of input lines inσ 1 : 0, and
it will produce outputs on a certain number γ of output lines out γ 1 : 0.
Formally, we have
{0, 1}2
state z is possible. The edge z z is labeled with all input symbols in that
take the automaton from state z to state z. For Moore automata, we write
into the rectangle depicting state z the outputs signals which are active in
state z.
Transducers play an important role in the control of computers. There-
fore, we specify two particular implementations; their cost and delay can
easily be determined if the automaton is drawn as a graph. For a more
general discussion see, e.g., [MP95].
(
Let k #Z be the number of states of the automaton. Then the states can be
numbered from 0 to k 1, and we can rename the states with the numbers
from 0 to k 1:
Z 0 k 1
for all i. This means that, if the automaton is in state i, bit Si is turned on
and all other bits S j with j i are turned off. The initial state always gets
number 0. The cost of storing the state is obviously k Cf f .
(# 9 - -
For each output signal outj Out, we define the set of states
(& - !
z to z occurs for all inputs in, then δz z¼ 1 and the disjunctive normal
Let M z z be the set of monomials in D z z and let
M M z z 1
z z¼
E
'
(
be the set of all nontrivial monomials occurring in the disjunctive forms
D z z . The next state vector N k 1 : 1 can then be computed in three C ONTROL
steps: AUTOMATA
Note that we do not compute N 0 yet. For each monomial m, its length
l m denotes the number of literals in m; lmax and lsum denote the maximum
and the sum of all l m:
lmax maxl m m M lsum ∑l m
mM
The computation of the monomials then adds the following cost and delay:
CCM σ Cinv lsum #M Cand
DCM Dinv log lmax Dand
For each node z, let
f anin z ∑ #M z z
z¼ z
E
f aninmax max f anin z 1 z k 1
k1
f aninsum ∑ f anin z
z 1
N[0]
zero(k-1) 0k-1 1
k
N[k-1:1] 11
00 01 clr
00
11 0 1
ce
σ m NS S γ
in CM CN O out
(' 5 -
i.e., the code of the initial state 0 into register S. As long as the clear signal
is inactive, the next state N is computed by circuit NS and a zero tester. If
none of the next state signals N k 1 : 1 is active, then the output of the
zero tester becomes active and next state signal N 0 is turned on.
This construction has the great advantage that it works even if the tran-
sition function is not completely specified, i.e., if δ z in is undefined for
some state z and input in. This happens, for instance, if the input in codes
an instruction and a computer controlled by the automaton tries to execute
an undefined (called illegal) instruction.
In the Moore automaton of figure 2.39, the transition function is not
specified for z1 and in 10. We now consider the case, that the automa-
ton is in state z1 and reads input 10. If all next state signals including
signal N 0 are computed according to equation (2.5) of the previous sub-
section, then the next state becomes 0k . Thus, the automaton hangs and
'&
(
N[0] k 0k-1 1
zero(k-1)
C ONTROL
N[k-1:1] 11
00 clr O
00
11 0
11
00
1
γ AUTOMATA
σ NS
in CM
m
CN S 01 Rout
ce clr out
can only be re-started by activating the clear signal clr. However, in the
construction presented here, the automaton falls gracefully back into its
initial state. Thus, the transition δ z1 10 z0 is specified implicitly in
this automaton.
Let A in and A clr ce denote the accumulated delay of the input sig-
nals in and of the signals clr and ce. The cost, the delay and the cycle time
of this realization can then be expressed as
This will be our construction of choice for Moore automata. The choice
is not completely obvious. For a formal evaluation of various realizations
of control automata see [MP95].
''
{01, 10, 11}
BASICS z2
z1 {00, 01}
z0 out = (101) out = (011)
out = (010) {00} out[3] if in[0] {11} out[3] if /in[1]
{0, 1}2
Representation of a Mealy automaton with tree states, inputs in1 : 0,
and outputs out 3 : 0; out 3 is the only Mealy component of the output.
(/ 5% -
Zj z f z j
0
we can visualize Mealy outputs outj in the following way. Let z be a state
– visualized as a rectangle – in Zj ; we then write inside the rectangle:
out j if F z i
MF
MF z j 1
γ1
j 0 zZ ¼j
zZ j
A Mealy automaton computes the next state in the same way as a Moore
automaton, i.e., as outlined in section 2.6.4. The only difference is that in
'/
N[0] 0k-1 1
zero(k-1)
BASICS k
01 11
00 clr
N[k-1:1] 10 0 1
ce
σ M NS S
in CN 01 γ
CM MF 10 O out
Let A in and A clr ce denote the accumulated delay of the input sig-
nals in and of the signals clr and ce. A Mealy automaton (figure 2.43) can
then be realized at the following cost and delay:
Tristate drivers are used in the data paths but not in the control au-
tomata.
Note, for all of our processor designs, it must be checked that the control
automata only generate admissible control signals.
For Moore automata, such a partitioning of the data paths and of the
automaton is unnecessary, since the control signals do not depend on the
current state. However, in a Mealy automaton, the partitioning is essential.
The signals out 1 then form the Moore component of the output.
" 5% -
The output signals of a Mealy automaton tend to be on the time critical
path. Thus, it is essential for a good performance estimate, to provide the
accumulated delay of every output circuit O i respectively the accumu-
lated delay of every subset out i of output signals:
AOi A out i
Let νmax i denote the maximal frequency of all output signals in out i,
and let l fmax i denote the maximal length of the monomials in the dis-
junctive normal forms F z j, with j out i. Thus:
A out i A in i DCM MF i DOi
DOi Dand log νmax i Dor
DCM MF i Dinv log l fmax i Dand
Table 2.8 summarizes all the parameters which must be determined from
the specification of the automaton in order to determine the cost and the
delay of a Mealy automaton.
(
/
Parameters of a Mealy control automaton with p output levels S ELECTED
O1 O p R EFERENCES AND
F URTHER R EADING
Symbol Meaning
σ # input signals in j of the automaton
γ # output signals out j of the automaton
k # states of the automaton
fansum accumulated fanin of all states z z0
fanmax maximal fanin of all states z z0
#M # monomials m M MF M of the automaton
lsum accumulated length of all monomials m M
l fmaxi , maximal length of the monomials of output level O i
lmax and of the monomials m M of the next state circuit
νsum accumulated frequency of all control signals
νmaxi maximal frequency of the signals out i of level O i
Aclr ce
accumulated delay of the clear and clock signals
Aini , accumulated delay of the inputs in i of circuit O i
Ain and of the inputs in of the next state circuit
$ %&
Let m n2 for any n 1. The high order sum bits of
an n-bit incrementer with inputs an 1 : 0, cin and output sn : 0 can be
expressed as
sn : m an 1 : m cm 1
an 1 : m if cm 1 0
an 1 : m 1 if cm 1 1
(
where cm 1 denotes the carry from position m 1 to position m. This
BASICS suggests for the circuit of an incrementer the simple construction of fig-
ure 2.15 (page 26); the original problem is reduced to only two half-sized
problems. Apply this construction recursively and derive formulae for the
cost and the delay of the resulting incrementer circuit CSI.
Derive formulae for the cost and the delay of an n m-mul-
tiplier which is constructed according to the school method, using carry
chain adders as building blocks.
34 M m M with M
2 log m
This exercise deals with the construction of the tree T m for the remain-
ing cases, i.e., for M 2 m 34 M. The bottom portion of the tree is
still a completely regular and balanced 4/2-tree T M 4 with M 4 many
pairs of inputs and M 8 many 4/2-adders as leaves. In the top level, we
now have a many 3/2-adders and M 4 a many pairs of inputs which are
directly fed to the 4/2-tree T M 4. Here, a is the solution of the equation
3a 2 M 4 a m
hence
a m M 2
For i 0 1 , the partial products Si 1 are entered into the tree from right
to left and that in the top level of the tree the 3/2-adders are placed at the
right-hand side.
(
Chapter
3
A Sequential DLX Design
We will be able to reuse almost all designs from this chapter. The design
process will be – almost – strictly top down.
Load and store operations move data between the general purpose reg-
isters and the memory M. There is a single addressing mode: the effective
address ea is the sum of a register and an immediate constant. Except for
shifts, immediate constants are always sign extended.
# 6 5 5 16
6 26
J-type opcode PC offset
The three instruction formats of the DLX fixed point core. RS1 and
RS2 are source registers; RD is the destination register. SA specifies a special
purpose register or an immediate shift amount; f unction is an additional 6-bit
opcode.
All three instruction formats (figure 3.1) have a 6-bit primary opcode and
specify up to three explicit operands. The I-type (Immediate) format spec-
ifies two registers and a 16-bit constant. That is the standard layout for
instructions with an immediate operand. The J-type (Jump) format is used
for control instructions. They require no explicit register operand and profit
from a larger 26-bit immediate operand. The third format, R-type (Regis-
ter) format, provides an additional 6-bit opcode (function). The remaining
20 bits specify three general purpose registers and a field SA which spec-
ifies a 5-bit constant or a special purpose register. A 5-bit constant, for
example, is sufficient for a shift amount.
Since the DLX description in [HP90] does not specify the coding of the
instruction set, we adapt the coding of the MIPS R2000 machine ([PH94,
KH92]) to the DLX instruction set. Tables 3.1 through 3.3 list for each
DLX instruction its effect and its coding; the prefix “hx” indicates that the
number is represented as hexadecimal. Taken alone, the tables are almost
but not quite a mathematical definition of the semantics of the DLX ma-
chine language. Recall that mathematical definitions have to make sense if
taken literally.
So, let us try to take the effect
RD RS1 imm ? 1 : 0
(&
#
I-type instruction layout. All instructions except the control instruc- I NSTRUCTION S ET
tions also increment the PC by four; sxt a is the sign-extended version of a. A RCHITECTURE
The effective address of memory accesses equals ea GPRRS1 sxt imm,
where imm is the 16-bit intermediate. The width of the memory access in bytes is
indicated by d. Thus, the memory operand equals m M ea d 1 M ea.
('
#
A S EQUENTIAL R-type instruction layout. All instructions execute PC += 4. SA denotes
DLX D ESIGN the 5-bit immediate shift amount specified by the bits IR10 : 6.
((
#
of instruction in table 3.1 literally: the 5-bit string RS1 is compared
with the 16-bit string imm using a comparison “” which is not defined I NSTRUCTION S ET
for such pairs of strings. The 1-bit result of the comparison is assigned to A RCHITECTURE
the 5-bit string RD.
This insanity can be fixed by providing five rules specifying the abbre-
viations and conventions which are used everywhere in the tables.
1. RD is a shorthand for GPRRD. Strictly speaking, it is actually a
shorthand for GPRRD. The same holds for R1 and R2.
4. All integer arithmetic is modulo 232 . This includes all address cal-
culations and, in particular, all computations involving the PC.
By lemma 2.2 we know that a a mod 232 for 32-bit addresses a.
Thus, the last convention implies that it does not matter whether we in-
terpret addresses as two’s complement numbers or as binary numbers.
The purpose of abbreviations and conventions is to turn long descrip-
tions into short descriptions. In the tables 3.1 through 3.3, this has been
done quite successfully. For three of the DLX instructions, we now list the
almost unabbreviated semantics, where sxt imm denotes the 32-bit sign
extended version of imm.
1. Arithmetic instruction :
GPRRD GPRRS1 imm mod 232
GPRRS1 sxt imm
or, equivalently
The crucial property of this storage scheme is, that half words, words
and instructions stored in memory never cross word boundaries (see figure
3.2). For word boundaries e, we define the memory word with address e as
Let a31 : 0 be a memory address, and let e be the word boundary e
a31 : 2 00. Then
1. the byte with address a is stored in byte a1 : 0 of the memory
word with address e:
M a byte a1:0 Mword a31 : 2 00
()
#
a) 1-bank desing
byte
H IGH L EVEL DATA
addr half word word
PATHS
<a> b) 4-bank design
: : :
bank address a[1:0]
11 10 01 00 addr
<a’>
: : : :
e+4 e+4
e+3 b3 b3 b2 b1 b0 e
e+2 b2
: : : :
e+1 b1 0
e b0
e-1 4 bytes
: : :
0
bits 31 24 23 16 15 8 7 0
2. The piece of data which is d bytes wide and has address a is stored
in the bytes a1 : 0 to a1 : 0 d 1 of the memory word with
address e:
byte a1:0 d
1 : a1:0 Mword a31 : 200
3.4 presents a high level view of the data paths of the machine.
IGURE
It shows busses, drivers, registers, a zero tester, a multiplexer, and the
environments. Environments are named after some major unit or a register.
They contain that unit or register plus some glue logic that is needed to
adapt that unit or register to the coding of the instruction set. Table 3.4
gives a short description of the units used in figure 3.4. The reader should
copy the table or better learn it by heart.
(*
#
MDout
A S EQUENTIAL C MDRr
DLX D ESIGN
SH4Lenv
C’
PCenv GPRenv IRenv
00111100
PC AEQZ A’ B’ co
zero 0110
A
00111100
B
4 0
a
b
Menv
ALUenv SHenv
D
MDin
MAR MDRw
fetch 1 0 MA
1. Clock enable signals for register R are called Rce. Thus, IRce is the
clock enable signal of the instruction register.
3. We show that the machine interprets the instruction set, i.e., that the
hardware works correctly.
/
##
Units and busses of the sequential DLX data paths E NVIRONMENTS
Large Units, Environments
GPRenv environment of the general purpose register file GPR
ALUenv environment of the arithmetic logic unit ALU
SHenv environment of the shifter SH
SH4Lenv environment of the shifter for loads SH4L
PCenv environment of the program counter PC
IRenv environment of the instruction register IR
Menv environment of the memory M
Registers
A, B output registers of GPR
MAR memory address register
MDRw memory data register for data to be written to M
MDRr memory data register for data read from M
Busses
A’, B’ input of register A and register B
a, b left/right source operand of the ALU and the SH
D internal data bus of the CPU
MA memory address
MDin Input data of the memory M
MDout Output data of the memory M
Inputs for the control
AEQZ indicates that the current content of register A equals zero
IR[31:26] primary opcode
IR[5:0] secondary opcode
GPRRS2 if RS2 0
B
0 if RS2 0
Let Cad be the address to which register C is written. This address is
usually specified by RD. In case of jump and link instructions (Jlink 1),
however, the PC must be saved into register 31. Writing should only occur
if the signal GPRw is active:
RD if Jlink 0
Cad
31 if Jlink 1
GPRCad : C if GPRw 1
The remaining equations specify simply the positions of the fields RS1,
RS2 and RD; only the position of RD depends on the type of the instruction:
CCAddr 2 Cmux 5
DDAddr 2 Dmux 5
0011 0011
1 0 Jlink GPRw
A’ B’
The register file performs two types of accesses; it provides data A and B ,
or it writes data C back. The read access accounts for the delay
DGPR read
DGPRenv IR GPRw; A B
maxDram3 32 32 Dzero 5 Dinv Dand
DGPR write
DCAddr Dram3 32 32
IR : MDout if IRce 1
27 SA if shi f tI 1
co31 : 0
sxt imm if shi f tI 0
/#
# MDout
A S EQUENTIAL
IR
DLX D ESIGN IRce
[31:26] [25] [24:16] [15:5] [4:0]
This environment is controlled by the reset signal and the clock enable
signal PCce of the PC. If the reset signal is active, then the start address
032 of the boot routine is clocked into the PC register:
D if PCce reset
PC :
032 if reset
This completes the specification of the PC environment. The design in fig-
ure 3.7 implements PCenv in a straightforward manner. Let DPCenv In; PC
denote the delay which environment PCenv adds to the delay of the inputs
of register PC. Thus:
CPCenv C f f 32 Cmux 32 Cor
DPCenv In; PC maxDmux 32 Dor
/&
##
032 D
E NVIRONMENTS
reset 1 0
reset
PC
PCce
f2 0 0 1 1 1 1
f1 0 1 0 0 1 1
f0 * * 0 1 0 1
The coding of conditions from table 3.5 is frequently used. The obvious
implementation proceeds in two steps. First, one computes the auxiliary
signals l e g (less, equal, greater) with
l1 ab ab 0
e1 ab ab 0
g1 ab ab 0
and then, one generates
t a b f f2 l f1 e f0 g
Figure 3.8 depicts a realization along these lines using an arithmetic unit
from section 2.4. Assuming that the subtraction signal sub is active, it
holds
l neg
e1 s31 : 0 032
g e l
/(
##
a[31:0] b[31:0] sub
E NVIRONMENTS
AU(32)
ovf neg s
0011 s[31:0]
zero(n)
f2 01 f0
f1 0011
comp
t
01 01 11
00
10 10 00
11
a 32
b[15:0]
b 32 01 11
00 11
00
sub 016
AU(32)
ovf neg s
01 0 1 f0 0 1 f0
10
f [2:0] comp(32) 0 1 f1 LU
t 0 1 f2
031
test al
1 0
alu
3
The coding of the arithmetic/logic functions in table 3.6 translates in a
straightforward way into figure 3.9. Thus, the cost and the delay of the
logic unit LU and of this ALU run at
CLU 32 Cand 32 Cor 32 Cxor 32 3 Cmux 32
DLU 32 maxDand Dor Dxor 2 Dmux
CALU CAU 32 CLU 32 Ccomp 32 2 Cmux 32
DALU maxDAU 32 Dcomp 32 DAU 32 Dmux
DLU 32 Dmux Dmux
//
#
test IR[28:26] IR[2:0]
A S EQUENTIAL f[1] 000 0 1 Rtype
DLX D ESIGN
1 0 add
sub f[2:0]
9- 3
Figure 3.10 suggests how to generate the signals sub and f 2 : 0 from
control signals add and Rtype. The mux controlled by signal Rtype selects
between primary and secondary opcode. The mux controlled by add can
force f 2 : 0 to 000, that is the code for addition.
The arithmetic unit is only used for tests and arithmetic operations. In
case of an arithmetic ALU operation, the operation of the AU is an addition
(add or addi) if f1 0 and it is a subtraction (sub or subi) if f1 1. Hence,
the subtraction signal can be generated as
sub test f1
The environment ALUenv consists of the ALU circuit and the ALU glue
logic. Thus, for the entire ALU environment, we get the following cost and
delay:
Recall that the memory M is byte addressable. Half words are aligned at
even (byte) addresses; instructions and words are aligned at word bound-
aries, i.e., at (byte) addresses divisible by 4. Due to the alignment, memory
data never cross word boundaries. We therefore organize the memory in
such a way that for every word boundary e the memory word
Mword e M e 3 : e
If the read operation accesses the d-byte data X, by lemma 3.1, X is then
the subword
X byte MA1:0 d
1 : MA1:0 MDout
/*
#
MDout
[31:24] [23:16] [15:8] [7:0]
A S EQUENTIAL
do do do do
DLX D ESIGN
mr bank mr bank mr bank mr bank
MB[3] MB[2] MB[1] MB[0]
mbw[3] mbw[2] mbw[1] mbw[0]
di a di a di a di a
[31:24] [23:16] [15:8] [7:0]
MDin
MA[31:2]
Connecting the memory banks to the data and address busses
mbw[3:0]
Memory control MC. Circuit GenMbw generates the bank write sig-
nals according to Equation 3.1
The bank write signals are then generated in a brute force way by
mbw0 mw B0
mbw1 mw W B0 H B0 B B1
(3.1)
mbw2 mw W B0 H B2 B B2
mbw3 mw W B0 H B2 B B3
When reusing common subexpressions, the cost and the delay of the
memory control MC (figure 3.12) runs at
Let dmem be the access time of the memory banks. The memory environ-
ment then delays the data MDout by
We do not elaborate on the generation of the mbusy signal. This will only
be possible when we built cache controllers.
The shifter environment SHenv is used for two purposes: for the execution
of the explicit shift operations sll (shift left logical), srl (shift right logical)
and sra (shift right arithmetic), and second, for the execution of implicit
shifts. An implicit shifted is only used during the store operations sb and
sw in order to align the data to be stored in memory. The environment
SHenv is controlled by a single control signal
shi f t4s, denoting a shift for a store operation.
)
#
A S EQUENTIAL Coding of the explicit shifts
DLX D ESIGN
IR[1:0] 00 10 11
type sll srl sra
1
We formally define the three explicit shifts. Obviously, left shifts and right
shifts differ by the shift direction. Logic shifts and arithmetic shifts differ
by the fill bit. This bit fills the positions which are not covered by the
shifted operand any more. We define the explicit shifts of operand an 1 :
0 by distance bm 1 : 0 in the following way:
b
sll a b an b 1 a0 f ill
b
srl a b f ill an 1 ab
b
sra a b f ill an 1 ab
where
0 for logic shifts
f ill
an 1
for arithmetic shifts
Thus, arithmetic shifts extend the sign bit of the shifted operand. They
probably have their name from the equality
sra a b a2 b
.
Implicit left shifts for store operation are necessary if a byte or half word
– which is aligned at the right end of a31 : 0 – is to be stored at a byte
address which is not divisible by 4. The byte address is provided by the
memory address register MAR. Measured in bits, the shift distance (moti-
vated by lemma 3.1) in this case equals
8 MAR1 : 0 MAR1 : 0000
The operand a is shifted cyclically by this distance. Thus, the output sh of
the shifter environment SHenv is
shi f t a b IR1 : 0 if shi f t4s 0
sh
cls a MAR1 : 0000 if shi f t4s 1
)
a b
##
MAR[1:0]
2
5 E NVIRONMENTS
32
CLS(32) Dist
r
32 fill
Fill
Scor mask
Mask
32
32
sh
f ill if maski 1
shi
ri if maski 0
b[4:0] 1
inc(5)
right 0 1
MAR[1:0] 000
shift4s 0 1
dist[4:0]
Thus, in the distance circuit Dist of figure 3.15, the mux controlled by
signal right selects the proper left shift distance of the explicit shift. Ac-
cording to table 3.8, bit IR1 can be used to distinguish between explicit
left shifts and explicit right shifts. Thus, we can set
right IR1
The additional mux controlled by signal shi f t4s can force the shift distance
to MAR1 : 0000, i.e., the left shift distance specified for stores. The cost
and the delay of the distance circuit Dist are
CDist Cinv 5 Cinc 5 2 Cmux 5
DDist b Dinv 5 Dinc 5 2 Dmux 5
DDist MAR Dmux 5
,
The fill bit is only different from 0 in case of an arithmetic shift, which is
coded by IR1 : 0 11 (table 3.8). In this case, the fill bit equals the sign
bit a31 of operand a, and therefore
f ill IR1 IR0 a31
The cost and the delay of the fill bit computation run at
CFill 2 Cand
DFill 2 Dand
)&
##
032 1
Flip(32) 1 mask[31:0] E NVIRONMENTS
hdec(5)
0
b[4:0] 0
shift4s
right
Circuit Mask generating the mask for the shifter SH.
8 5
During an explicit left shift, the least significant b bits of the intermediate
result r have to be replaced. In figure 3.16, a half decoder generates from
b the corresponding mask 032 b 1 b . During an explicit right shift, the
This environment consists of the shifter for loads SH4L and a mux; it is
controlled by a single control signal
shi f t4l denoting a shift for load operation.
If signal shi f t4l is active, the result R of the shifter SH4L is provided to
the output C of the environment, and otherwise, input C is passed to C :
R if shi f t4l 1
C :
C if shi f t4l 0
Figure 3.17 depicts the top level schematics of the shifter environment
SH4Lenv; its cost and delay can be expressed as
CSH4Lenv CSH4L Cmux 32
DSH4Lenv DSH4L Dmux 32
)'
#
MDRr
A S EQUENTIAL shifter SH4L R 1
DLX D ESIGN MAR[1:0] C’
0
C
32 shift4l
The shifter SH4L is only used in load operations. The last three bits
IR28 : 26 of the primary opcode specify the type of the load operation
(table 3.9). The byte address of the data, which is read from memory on
a load operation, is stored in the memory address register MAR. If a byte
or half word is loaded from a byte address which is not divisible by 4, the
loaded data MDRr has to be shifted to the right such that it is aligned at the
right end of the data bus D31 : 0. A cyclic right shift by MAR1 : 0000
bits (the distance is motivated by lemma 3.1) will produce an intermediate
result
r crs MDRr MAR1 : 0000
where the loaded data is already aligned at the right end. Note that this
also covers the case of a load word operation, because words are stored at
addresses with MAR1 : 0 00. After the loaded data has been aligned,
the portion of the output R not belonging to the loaded data are replaced
with a fill bit:
f ill 24 r7 : 0 for lb, lbu
f ill 16 r15 : 0
R31 : 0
r31 : 0
for lw, lwu
for lw
In an unsigned load operation, the fill bit equals 0, whereas in signed
load operations, the fill bit is the sign bit of the shifted operand. This is
summarized in table 3.9 which completes the specification of the shifter
SH4L.
Figure 3.18 depicts a straightforward realization of the shifter SH4L. The shift distance is always a multiple of 8. Thus, the cyclic right shifter only comprises two stages for the shift distances 8 and 16. Recall that for 32-bit data, a cyclic right shift by 8 (16) bits equals a cyclic left shift by 24 (16) bits.
The first half word r[31:16] of the intermediate result is replaced by the fill bit in case that a byte or half word is loaded. During loads, this is recognized by IR[27] = 0. Byte r[15:8] is only replaced when loading a single byte. During loads, this is recognized by IR[27:26] = 00. This explains the multiplexer construction of figure 3.18.
Fill bit of the shifts for load

IR[28]   IR[27:26]   type                   MAR[1:0]   fill
0        00          byte, signed           00         MDRr[7]
                                            01         MDRr[15]
                                            10         MDRr[23]
                                            11         MDRr[31]
0        01          halfword, signed       00         MDRr[15]
                                            10         MDRr[31]
--       11          word (R = MDRr[31:0])  00         --
1        00          byte, unsigned         --         0
1        01          halfword, unsigned     --         0

(Figure 3.18: realization of the shifter SH4L; two cyclic shift stages CSR(32,8) = CSL(32,24) and CSR(32,16) = CSL(32,16), controlled by MAR[0] and MAR[1], are followed by multiplexers controlled by IR[27] and IR[26] which replace r[31:16] and r[15:8] by the fill bit provided by circuit LFILL.)
The circuit LFILL of figure 3.19 is a brute force realization of the fill bit function specified in table 3.9. The cost and the delay of the shifter SH4L and of circuit LFILL follow directly from the constructions of figures 3.18 and 3.19.
(Figure 3.19: circuit LFILL computes the fill bit for the shifter SH4L; multiplexers controlled by MAR[0], MAR[1], IR[27:26] and IR[28] select among MDRr[7], MDRr[15], MDRr[23], MDRr[31] and the constant 0.)
Figure 3.20 depicts the graph of a finite state diagram. Only the names of the states and the edges between them are presently of interest. In order to complete the design, one has to specify the functions δ_(z,z') for all states z with more than one successor state. Moreover, one has to specify for each state z the set of control signals active in state z.
We begin with an intermediate step and specify for each state z a set of register transfer language (RTL) instructions rtl(z) to be executed in that state (table 3.10). The abbreviations and the conventions are those of the tables 3.1 to 3.3. In addition, we use M(PC) as a shorthand for the memory word addressed by PC. Also note that the functions op, shift, sh4l and rel have hidden parameters.
We also specify for each type t of DLX instruction the intended path path(t) through the diagram. All such paths begin with the states fetch and decode. The succeeding states on the path depend on the type t as indicated in table 3.11. One immediately obtains
(Figure 3.20: finite state diagram of the sequential DLX control; all paths start in state fetch and proceed to state decode, from which the disjunctive normal forms D1, ..., D13 select the successor states.)
1. for each type of instruction t, the path path(t) is taken, and that
2. in each state s along this path, the RTL instructions rtl(s) are executed.

Table 3.10 (excerpt): RTL instructions of the sequential DLX design

state s   rtl(s)
fetch     IR = M(PC)
decode    A = GPR[RS1], B = GPR[RS2], PC = (PC + 4) mod 2^32
aluI      C = A op imm
wbI       GPR[RD] = C
Paths path(t) through the FSD for each type t of DLX instruction

DLX instruction type              path through the FSD
arithmetic/logical, I-type        fetch, decode, aluI, wbI
arithmetic/logical, R-type        fetch, decode, alu, wbR
test set, I-type                  fetch, decode, testI, wbI
test set, R-type                  fetch, decode, test, wbR
shift immediate                   fetch, decode, shiftI, wbR
shift register                    fetch, decode, shift, wbR
load                              fetch, decode, addr, load, sh4l
store                             fetch, decode, addr, sh4s, store
jump register                     fetch, decode, jreg
jump immediate                    fetch, decode, jimm
jump & link register              fetch, decode, savePC, jalR, wbL
jump & link immediate             fetch, decode, savePC, jalI, wbL
taken branch                      fetch, decode, branch, btaken
untaken branch                    fetch, decode, branch
It is that easy and boring. Keep in mind however, that with literal appli-
cation of the abridged semantics, this simple exercise would end in com-
plex and exciting insanity. Except for loads and stores, the proofs for all
cases follow exactly the above pattern.
Hence,

X = byte_(MAR[1:0]+d-1 : MAR[1:0])(MDRw),

and the effect of the store operation is

M[ea + d - 1 : ea] = X.

state s   rtl(s)
addr      MAR = (A + imm) mod 2^32
load      MDRr = M_word(MAR[31:2]00)
sh4l      GPR[RD] = sh4l(MDRr, MAR[1:0]000)

For loads, the bytes of interest are

X = byte_(MAR[1:0]+d-1 : MAR[1:0])(M_word(MAR[31:2]00))
  = byte_(MAR[1:0]+d-1 : MAR[1:0])(MDRr).

With the fill bit fill defined as in table 3.9, one concludes

GPR[RD] = sh4l(MDRr, MAR[1:0]000)
        = fill^(32-8d) X
        = sxt(X)          for signed loads
        = 0^(32-8d) X     for unsigned loads.
The design is now easily completed. Table 3.12 is an extension of table 3.10. It lists for each state s not only the RTL instructions rtl(s) but also the control signals activated in that state. One immediately obtains:
For all states s, the RTL instructions rtl(s) are executed in state s.
For all states except addr and btaken, this follows immediately from the specification of the environments. In state s = addr, the ALU environment performs the address computation

MAR = (A + sxt(imm)) mod 2^32 = A + imm,

where the second equality interprets imm as a two's complement number. The branch target computation of state s = btaken is handled in a completely analogous way.
It only remains to specify the disjunctive normal forms Di for figure 3.20 such that the following holds:
For each instruction type t, the sequence path(t) of states specified by table 3.11 is followed.
Each Di has to test for certain patterns in the primary and secondary opcodes IR[31:26] and IR[5:0], and it possibly has to test signal AEQZ as well. These patterns are listed in table 3.13. They have simply been copied from the tables 3.1 to 3.3. Disjunctive normal form D8, for instance, tests if the actual instruction is a jump register instruction, coded by

IR[31:26] = 16_hex = 010110.

It can be realized by the single monomial

D8 = /IR[31] * IR[30] * /IR[29] * IR[28] * IR[27] * /IR[26].

In general, testing for a single pattern with k zeros and ones can be done with a monomial of length k. This completes the specification of the whole machine. Lemmas 3.2 to 3.4 imply:
The design correctly implements the instruction set.
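As a small aside, the pattern tests behind these monomials are easy to model in software. The following Python sketch (ours, purely illustrative) checks an opcode field against a pattern of table 3.13, with '*' marking don't-care positions; a pattern with k fixed bits corresponds to a monomial of length k.

def matches(bits, pattern):
    # bits and pattern are strings of equal length; '*' is a don't-care position
    return all(p == '*' or p == b for b, p in zip(bits, pattern))

def monomial_length(pattern):
    return sum(p != '*' for p in pattern)

ir31_26 = "010110"                      # primary opcode of a jump register instruction
print(matches(ir31_26, "010110"))       # D8 fires: True
print(monomial_length("010110"))        # 6, as listed in table 3.13
print(matches(ir31_26, "00010*"))       # D12 (branch) does not fire: False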
RTL instructions and their active control signals
state    RTL instruction                                    active control signals
fetch    IR = M(PC)                                         fetch, mr, IRce
decode   A = GPR[RS1]                                       Ace, Bce, PCce,
         B = GPR[RD] (I-type) / GPR[RS2] (R-type)           PCadoe, 4bdoe, add,
         PC = PC + 4                                        ALUDdoe, shiftI
         co = 0^27 SA (shiftI) / sxt(imm) (otherwise)
alu      C = A op B                                         Aadoe, Bbdoe, ALUDdoe, Cce, Rtype
test     C = (A rel B ? 1 : 0)                              like alu, test
shift    C = shift(A, B[4:0])                               Aadoe, Bbdoe, SHDdoe, Cce, Rtype
aluI     C = A op co                                        Aadoe, cobdoe, ALUDdoe, Cce
testI    C = (A rel co ? 1 : 0)                             like aluI, test
shiftI   C = shift(A, co[4:0])                              Aadoe, cobdoe, SHDdoe, Cce, shiftI, Rtype
wbR      GPR[RD] = C (R-type)                               GPRw, Rtype
wbI      GPR[RD] = C (I-type)                               GPRw
addr     MAR = A + co                                       Aadoe, cobdoe, ALUDdoe, add, MARce
load     MDRr = M_word(MAR[31:2]00)                         mr, MDRrce
sh4l     GPR[RD] = sh4l(MDRr, MAR[1:0]000)                  shift4l, GPRw
sh4s     MDRw = cls(B, MAR[1:0]000)                         Badoe, SHDdoe, shift4s, MDRwce
store    d bytes of MDRw are written to memory              mw
branch   (no update)
btaken   PC = PC + co                                       PCadoe, cobdoe, add, ALUDdoe, PCce
jimm     PC = PC + co                                       like btaken, Jjump
jreg     PC = A                                             Aadoe, 0bdoe, add, ALUDdoe, PCce
savePC   C = PC                                             PCadoe, 0bdoe, add, ALUDdoe, Cce
jalR     PC = A                                             like jreg
jalI     PC = PC + co                                       like jimm
wbL      GPR[31] = C                                        GPRw, Jlink
Nontrivial disjunctive normal forms (DNF) of the DLX finite state diagram and the corresponding monomials

Nontrivial DNF   Target State   Monomial m ∈ M (IR[31:26], IR[5:0])   Length l(m)
D1 shift 000000 0001*0 11
000000 00011* 11
D2 alu 000000 100*** 9
D3 test 000000 101*** 9
D4 shiftI 000000 0000*0 11
000000 00001* 11
D5 aluI 001*** ****** 3
D6 testI 011*** ****** 3
D7 addr 100*0* ****** 4
10*0*1 ****** 4
10*00* ****** 4
D8 jreg 010110 ****** 6
D9 jalR 010111 ****** 6
D10 jalI 000011 ****** 6
D9 D10 savePC like D9 and D10
D11 jimm 000010 ****** 6
D12 branch 00010* ****** 5
D13 sh4s **1*** ****** 1
/D13 load **0*** ****** 1
bt btaken AEQZ IR26 2
/AEQZ IR26 2
Accumulated length of the monomials: ∑_(m ∈ M) l(m) = 115
by the instruction register IR at zero delay. Thus, the input signals of the
automaton have the accumulated delay:
According to section 2.6, the cost and the delay of such a Moore automa-
ton only depend on a few parameters (table 3.14). Except for the fanin of
the states/nodes and the frequency of the control signals, these parameters
can directly be read off the finite state diagram (figure 3.20) and table 3.13.
State fetch serves as the initial state z0 of the automaton. Recall that our realization of a Moore automaton has the following peculiarity: whenever the next state is not specified explicitly, a zero tester forces the automaton into its initial state. Thus, in the next state circuit NS, transitions to state fetch can be ignored.
For each edge (z, z') ∈ E with z' ≠ fetch, we refer to the number #M(z, z') of monomials in D(z, z') as the weight of the edge. For edges with nontrivial monomials, the weight can be read off table 3.13; all the other edges have weight 1. The fanin of a node z equals the sum of the weights of all edges ending in z. Thus, state wbR has the highest fanin of all states different from fetch, namely fanin_max = 4, and all the states together have an accumulated fanin of fanin_sum = 31.
Control signals of the DLX architecture and their frequency. Signals printed in italics are used in several environments.

Top level:  PCadoe, Aadoe, Badoe, Bbdoe, 0bdoe, SHDdoe, cobdoe, 4bdoe, ALUDdoe, Ace, Bce, Cce, MARce, MDRrce, MDRwce, fetch
IRenv:      Jjump, shiftI, IRce
GPRenv:     GPRw, Jlink, Rtype
PCenv:      PCce
ALUenv:     add, test, Rtype
Menv:       mr, mw, fetch
SHenv:      shift4s
SH4Lenv:    shift4l
A register R is now controlled by two signals: the signal Rce, which requests the update, and the update enable signal Rue, which enables the requested update (figure 3.21). The register is only updated if both signals are active, i.e., if Rce = Rue = 1. Thus, the actual clock enable signal of register R, which is denoted by Rce', equals

Rce' = Rce ∧ Rue.

(Figure 3.21: Controlling the update of registers and RAMs. The control automaton provides the request signals Rce, Kw; the stall engine provides the enable signals Rue, Kue.)

The clock request signal Rce is usually provided by the control automaton, whereas the signals Rue and Rce' are generated by a stall engine.
In analogy, the update of a RAM R is requested by signal Rw and enabled by signal Rue. Both signals are combined into the actual write signal

Rw' = Rw ∧ Rue.

Note that the read and write signals Mr and Mw of the memory M are not masked by signal UE.
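As a tiny illustration (the helper below is ours, not part of the design), the gating of all request signals by a common update enable can be modelled as follows.

def gated_controls(requests, ue):
    # requests maps signal names (e.g. 'PCce', 'GPRw') to 0/1;
    # only the conjunction with the update enable actually clocks or writes
    return {name: req & ue for name, req in requests.items()}

print(gated_controls({"PCce": 1, "MARce": 1, "GPRw": 0}, ue=0))
# -> {'PCce': 0, 'MARce': 0, 'GPRw': 0}: a stalled cycle suppresses every update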
According to table 3.15, the control automaton provides 8 clock request signals and 1 write request signal. Together with the clock of the Moore automaton, the stall engine therefore has to gate 10 clock and write signals, which determines the cost and the delay of this simple stall engine.
The hardware consists of the data paths and of the sequential control. If
not stated otherwise, we do not consider the memory M itself to be part of
the DLX hardware.
The data paths DP (figure 3.4) of the sequential DLX fixed-point core
consist of six registers, nine tristate drivers, a multiplexer and six environ-
ments: the arithmetic logic unit ALUenv, the shifters SHenv and SH4Lenv,
and the environments of the instruction register IR, of the general purpose registers GPR and of the program counter PC. Thus, the cost of the 32-bit data paths equals

C_DP = 6 * C_ff(32) + 9 * C_driv(32) + C_mux(32) + C_ALUenv + C_SHenv + C_SH4Lenv + C_IRenv + C_GPRenv + C_PCenv.

Cost of the DLX fixed-point core and of all its environments

ALUenv     1691     IRenv      301     DP     10846
SHenv       952     GPRenv    4096     CON     1105
SH4Lenv     380     PCenv      354     DLX    11951
For the cycle time, we have to consider the four types of transfers illus-
trated in figure 2.3 (page 11). This requires to determine the delay of each
paths which start in a register and end in a register, in a RAM, or in the
memory. In this regard, the sequential DLX design comprises the follow-
ing types of paths:
1. the paths which only pass through the data paths DP and the Moore
control automaton,
2. the paths of a memory read or write access, and
3. the paths through the stall engine.
These paths are now discussed in detail. For the paths of type 1 and 2, the
impact of the global update enable signal UE is ignored.
All these paths are governed exclusively by the output signals of the Moore automaton; these standard control signals, denoted by Csig, have zero delay:

A(Csig) = A_pMoore(out) = 0.
One type of paths is responsible for the update of the Moore automaton.
A second type of paths is used for reading from or writing into the register
file GPR. All the remaining paths pass through the ALU or the shifter SH.
-
The time TpMoore denotes the cycle time of the Moore control automaton,
as far as the computation of the next state and of the outputs is concerned.
According to section 2.6, this cycle time only depends on the parameters
of table 3.14 and on the accumulated delay A in A clr ce of its input,
clear and clock signals.
8 ,
For the timing, we distinguish between read and write accesses. During
a read access, the two addresses come directly from the instruction word
IR. The data A and B are written into the registers A and B. The control
signals Csig switch the register file into read mode and provide the clock
signals Ace and Bce. The read cycle therefore requires time:
During write back, the value C , which is provided by the shifter envi-
ronment SH4Lenv, is written into the multiport RAM of the GPR register
file. Both environments are governed by the standard control signals Csig.
Since the register file has a write delay of DGPRw , the write back cycle takes
As soon as the operands become valid, they are processed in the ALU and
the shifter SHenv. From the data bus D, the result is then clocked into
#
a register (MAR, MDRw or C) or it passed through environment PCenv
A S EQUENTIAL which adds delay DPCenv IN; PC. Thus, the ALU and shift cycles require
DLX D ESIGN a cycle time of
delays after the start of each cycle, all the inputs of the memory system
are valid, and the memory access can be started. The status flag mbusy
therefore has an accumulated delay of
On a read access, the memory data arrive on the bus MDout dmem delays
after the inputs of the memory are stable, and then, the data are clocked
into the instruction register IR or into register MDRr. Thus, the read cycle
time is
TM TMread AMC dmem ∆
Since mbusy has a much longer delay than the standard control signals of
the Moore automaton, the stall engine provides the write and clock enable
signals at an accumulated delay of
The cycle time TM of the memory environment only has an indirect im-
pact on the cycle time τDLX . If the memory cycle time is less than τDLX ,
memory accesses can be performed in a single machine cycle. In the other
case, TM τDLX , the cycle time of the machine must be increased to TM or
memory accesses require TM τDLX cycles. Our designs use the second
approach.
Selected References and Further Reading

The DLX instruction set is from the classical textbook [HP90]. The design presented here is partly based on designs from [HP90, PH94, KP95, MP95]. A formal verification of a sequential processor is reported in [Win95].
Chapter
4
Basic Pipelining
(Figure: Partitioning of the FSD of the sequential DLX design into the five stages of table 4.1; stage IF contains state fetch, ID contains decode, EX contains the execute states selected by D1, ..., D12, M contains load and store, and WB the write back states.)
We write I(k, T) = i when instruction I_i is processed in stage k during cycle T. For the undisturbed pipelined execution,

if I(0, T) = i, then I(0, T+1) = i+1, and
if I(k, T) = i and k < 4, then I(k+1, T+1) = i,

and hence I(k, T) = i iff T = k + i.
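A two-line Python sketch of this schedule (purely illustrative):

def I(k, T):
    # index of the instruction in stage k during cycle T, or None while the stage is empty
    i = T - k
    return i if i >= 0 else None

print([I(k, 4) for k in range(5)])   # cycle 4: [4, 3, 2, 1, 0]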
(Figure: pipelined execution of the instructions I0, I1, I2, ...; in every cycle a new instruction enters stage IF, and the stages IF, ID, EX, M and WB work on five consecutive instructions in parallel.)
1. The adder is used in stage decode for incrementing the PC, and in stage execute it is either used for ALU operations or for branch target computations. The jump and link instructions even use the adder twice in the execute stage, namely for the target computation and for passing the PC to the register file. Thus, we at least have to provide an extra incrementer for incrementing the PC during decode and an ALU bypass path for saving the PC.

The next instruction I_(i+1) then has to be fetched from location PC_i with

PC_i = btarget_i        if bjtaken_i = 1
       PC_(i-1) + 4     otherwise.
The way out of this difficulty is by very brute force: one changes the semantics of the branch instruction by two rules, which say:
1. A branch taken in instruction I_i affects only the PC computed in the following instruction, i.e., PC_(i+1). This mechanism is called delayed branch.
Computations are started with PC_(-1) = 0 and bjtaken_(-1) = 0, and the PC is updated by

PC_(i+1) = btarget_i    if bjtaken_i = 1
           PC_i + 4     otherwise.
)
&
Observe that the definition of branch targets PC 4 imm instead of the
much more obvious branch targets PC imm is motivated by the delayed D ELAYED B RANCH
branch mechanism. After a control operation Ii , one always executes the AND D ELAYED PC
instruction IM PCi 1 4 in the delay slot of Ii (because Ii does not occupy
a delay slot and hence, b jtakeni 1 0). With a branch target PC imm,
instead of
PCi1 PCi immi1
The delayed branch semantics is, for example, used in the MIPS [KH92],
the SPARC [SPA92] and the PA-RISC [Hew94] instruction set.
Delayed PC

Instead of delaying the effect of taken branches, one could opt for delaying the effect of all PC calculations. A program counter PC is updated according to the trivial sequential semantics

PC_i = PC_(i-1) + imm_i   if bjtaken_i = 1
       PC_(i-1) + 4       otherwise,

and a delayed program counter DPC simply trails one instruction behind,

DPC_(i+1) = PC_i.

The delayed program counter DPC is used for fetching instructions from IM, namely I_i = IM(DPC_i). Computations are started with

PC_(-1) = 4 and DPC_(-1) = 0.

We call this uniform and easy to implement mechanism delayed PC. The two mechanisms will later turn out to be completely equivalent.
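The following small simulation sketch (with a made-up program representation) illustrates the delayed PC mechanism: the effect of a taken branch at address 4 becomes visible only after its delay slot at address 8 has been fetched, and the effective target is the branch address plus 4 plus the immediate.

def delayed_pc_trace(program, steps):
    # program maps an address to (bjtaken, imm); returns the fetch addresses
    pc, dpc = 4, 0                      # PC_{-1} = 4, DPC_{-1} = 0
    fetched = []
    for _ in range(steps):
        fetched.append(dpc)             # instructions are fetched via DPC
        bjtaken, imm = program.get(dpc, (0, 0))
        pc, dpc = (pc + imm if bjtaken else pc + 4), pc
    return fetched

print(delayed_pc_trace({4: (1, 20)}, 5))   # -> [0, 4, 8, 28, 32]; target 28 = 4 + 4 + 20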
Ii1 IM PCi
is the instruction in the delay slot of Ii . The jump and link instruction Ii
should therefore save
GPR31i PCi 1 4
2. and if Ii is a jump and link instruction, then the value GPR31i saved
into register 31 during instruction Ii is identical for both machines.
PCi1 PCi
&
Since DPCi 1 PCi 1 , the same instruction Ii is fetched with delayed
branch and delayed PC, and in both cases, the variable b jtakeni has the P REPARED
same value. S EQUENTIAL
If b jtakeni 0, it follows M ACHINES
PCi PCi 1 4
PCi 2 4 immi
because b jtakeni 1 0
PCi 1 4 immi
by the induction hypothesis for i 2
btargeti
PCi1 because b jtakeni 1
If Ii is of type " or ", then
PCi RS1i 1
btargeti
PCi1 because b jtakeni 1
and part one of the induction hypothesis follows.
For the second part, suppose Ii is a jump and link instruction. With
delayed branch, PCi 1 8 is then saved. Because Ii is not in a delay slot,
we have
PCi 18 PCi 4
DPCi 4 by induction hypothesis
PCi 1 4
by definition of delayed PC
This is exactly the value saved in the delayed PC version.
Table 4.2 illustrates for both mechanisms, delayed branch and delayed
PC, how the PCs are updated in case of a jump and link instruction.
2. The data paths and the control of the machine are arranged in a 5-
stage pipeline, but
Most of the environments can literally be taken from the sequential DLX
designs. Only two environment undergo nontrivial changes: the PC envi-
ronment and the execute environment EXenv. The PC environment has to
be adapted for the delayed PC mechanism. For store instructions, the ad-
dress calculation of state addr and the operand shift of state sh4s have now
to be performed in a single cycle. This will not significantly slow down the
&
IMenv
IF P REPARED
S EQUENTIAL
IR.1
ID M ACHINES
IRenv CAddr
PCenv 12
A, B PC’, link, DPC co IR.2 Cad.2
EX
5
D EXenv sh
Inputs and outputs of each stage k of the prepared DLX data paths
stage in k out k
0 IF DPC, IM IR
1 ID GPR, PC’, IR A, B, PC’, link, DPC,
co, Cad.2
2 EX A, B, link, co, Cad.2, IR MAR, MDRw, Cad.3
3 M MAR, MDRw, DM, Cad.3, IR DM, C, MDRr, Cad.4
4 WB C, MDRr, Cad.4, IR GPR
cycle time, because only the last two bits of the address influence the shift
distance, and these bits are known early in the cycle. Trivially, the memory
M is split into an instruction memory IM and a data memory DM.
There is, however, a simple but fundamental change in which we clock
the output registers of the stages. Instead of a single update enable signal
UE (section 3.4.3), we introduce for every stage k a distinct update enable
signal uek. An output register R of stage k is updated iff its clock request
#
&
R1 R1ce ... Rs Rsce
BASIC P IPELINING
ue.k
signal Rce and the update enable signal of stage k are both active (figure
4.4). Thus, the clock enable signal Rce of such a register R is obtained as
As before, the read and write signals of the main memory M are not masked
by the update enable signal ue3 but by the full bit f ull 3 of the memory
stage.
1 .8
of the instruction register is still controlled by the signals J jump (J-type
jump), shi f tI and the clock signal IRce. The functionality is virtually the
same as before. On IRce 1, the output IMout of the instruction memory
is clocked into the instruction register
IR IMout
1 &3
is controlled by signal shi f t4l which requests a shift in case of a load
instruction. The only modification in this environment is that the memory
address is now provided by register C and not by register MAR. This has
an impact on the functionality of environment SH4Lenv but not on its cost
and delay.
Let sh4l(a, dist) denote the function computed by the shifter SH4L as it was defined in section 3.3.7. The modified SH4Lenv environment then provides the result

C' = sh4l(MDRr, C[1:0]000)   if shift4l = 1
C' = C                       if shift4l = 0.
1 9"8
As in the sequential design, circuit CAddr generates the address Cad of the
destination register based on the control signals Jlink (jump and link) and
Itype. However, the address Cad is now precomputed in stage ID and is
then passed down stage by stage to the register file environment GPRenv.
For later use, we introduce the notation
Environment GPRenv (figure 4.5) itself has still the same functionality.
It provides the two register operands
GPRRS1 GPRIR25 : 21 if RS1 0
A
0 otherwise
Since circuit CAddr is now an environment of its own, the cost of the
register file environment GPRenv run at
Due to the precomputed destination address Cad 4, the update of the reg-
ister file becomes faster. Environment GPRenv now only delays the write
access by
DGPR write Dram3 32 32
Let ACON csW B denote the accumulated delay of the control signals
which govern stage WB; the cycle time of the write back stage then runs at
The delay DGPR read of a read access, however, remains unchanged; it adds
A’ B’
5 % 1
The DLX design which is prepared for pipelined execution comprises two
memories, one for instructions and one for the actual data accesses.
The instruction memory also provides a signal ibusy indicating that the
access cannot be finished in the current clock cycle. We expect this signal
to be valid dIstat time units after the start of an IM memory access.
TM TDMenv read
ACON csM DDMC dDmem ∆
ADMenv dbusy ACON csM DDMC dDstat
" 1
The environment PCenv of figure 4.6 is governed by seven control signals,
namely:
jump which denotes one of the four jump instructions "# "# " and
",
jumpR which denotes an absolute jump instruction ("# "),
Based on these signals, its glue logic PCglue generates the clock sig-
nal of the registers PC and DPC. They are clocked simultaneously when
signal PCce is active or on reset, i.e., they are clocked by
PCce reset
11
00
A’
00
11
nextPC jumpR 1 0
bjtaken 0
11
00
0 1
4
0 1 reset 0 1 reset
Let ACON csID denote the accumulated delay of the control signals
which govern stage ID. The cost of the glue logic and the delay of the
signals AEQZ and b jtaken then run at
PCenv also provides a register link which is updated under the control
of signal linkce. On linkce 1, it is set to
link PC 4
D sh
The execute environment EXenv of figure 4.7 comprises the ALU environment and the shifter SHenv and connects them to the operand and result busses. Since on a store instruction, the address computation and the operand shift are performed in parallel, three operand and two result busses are needed.
Register A always provides the operand a. The control signals bmuxsel and amuxsel select the data to be put on the busses b and a':

b  = B if bmuxsel = 1, and b = co otherwise;
a' = B if amuxsel = 1, and a' = A otherwise.

The data on the result bus D are selected among the register link and the results of the ALU and the shifter. This selection is governed by three output enable signals:

D = link   if linkDdoe = 1
    alu    if ALUDdoe = 1
    sh     if SHDdoe = 1.

Note that at most one of these signals should be active at a time.
*
&
ALU Environment Environment ALUenv is governed by the same con-
BASIC P IPELINING trol signals as in the sequential design, and the specification of its results
alu and ov f remains unchanged. However, it now provides two additional
bits s1 : 0 which are fed directly to the shifter. These are the two least
significant bits of the result of the arithmetic unit AU 32. Depending on
signal sub, which is provided by the ALU glue logic, the AU computes the
sum or the difference of the operands a and b modulo 232 :
The cost of the ALU environment and its total delay DALUenv remain the
same, but the bits s1 : 0 have a much shorter delay. For all the adders
introduced in chapter 2, the delay of these bits can be estimated based on
the delay of a 2-bit AU
Figure 4.8 depicts an FSD for the prepared data paths; the tables 4.4 to
4.6 list the corresponding RTL instructions and their active control signals.
(Figure 4.8: FSD for the prepared data paths. Stage IF contains state fetch and stage ID contains decode; stage EX contains the states addrL, alu, aluI, shift, shiftI, test, testI, savePC, addrS and noEX, selected by D1, ..., D9 and else; stage M contains load, passC, store and noM; stage WB contains sh4l, wb and noWB.)
The nontrivial DNFs are listed in table 4.7. Except for the clocks Ace, Bce,
PCce and linkce, all the control signals used in the decode stage ID are
Mealy signals. Following the pattern of section 3.4, one shows
1. for each type of instruction, the path specified in table 4.8 is taken,
2. and for each state s, the set of RTL instructions rtl s is executed.
If every memory access takes only one cycle, then the machine interprets
the DLX instruction set with delayed PC semantics.
We derive from the above FSD and the trivial stall engine a new control
and stall engine with exactly the same behavior. This will complete the
design of the prepared sequential machine DLXσ .
RTL instructions of the memory and write back stage

stage   state   RTL instruction                                      control signals
M       passC   C = MAR, Cad.4 = Cad.3                               Cce, Cad4ce
        load    MDRr = M_word(MAR[31:2]00), C = MAR, Cad.4 = Cad.3   Dmr, MDRrce, Cce, Cad4ce
        store   d bytes of MDRw are written to memory                Dmw
        noM     (no update)
WB      sh4l    GPR[Cad.4] = sh4l(MDRr, MAR[1:0]000)                 shift4l, GPRw
        wb      GPR[Cad.4] = C                                       GPRw
        noWB    (no update)
We begin with a stall engine which clocks all stages in a round robin fashion. It has a 5-bit register full[4:0], where for all stages i, signal ue.i enables the update of the registers in out(i). Since memory accesses can take several cycles, the update of the data memory DM is enabled by full.3 and not by ue.3. Register full is updated by

full[4:0] := 00001        if reset
             full[4:0]    if busy ∧ /reset
             cls(full)    otherwise.

Since the design comprises two memories, we compute the busy signal as

busy = ibusy ∨ dbusy.

With signals full defined in this way, we can obviously keep track of the stage which processes the instruction, namely: the instruction is in stage i iff full.i = 1. In particular, the instruction is in stage IF iff full.0 = 1, and it is in stage ID iff full.1 = 1.
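A small software model of this round robin stall engine (a sketch; busy = ibusy ∨ dbusy as above) shows how the full bits travel through the stages:

def next_full(full, reset, ibusy, dbusy):
    # full is a list of 5 bits, full[0] for stage IF ... full[4] for stage WB
    if reset:
        return [1, 0, 0, 0, 0]
    if ibusy or dbusy:                   # busy: freeze the full register
        return full
    return [full[-1]] + full[:-1]        # cls(full): advance to the next stage

full = [1, 0, 0, 0, 0]
for _ in range(6):
    print(full)
    full = next_full(full, reset=0, ibusy=0, dbusy=0)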
We proceed to transform the FSD by the following four changes:
Nontrivial disjunctive normal forms (DNF) of the FSD corresponding to the prepared data paths

Paths path(t) through the FSD for each type t of DLX instruction

DLX instruction type              path through the FSD
arithmetic/logical, I-type        fetch, decode, aluI, passC, wb
arithmetic/logical, R-type        fetch, decode, alu, passC, wb
test set, I-type                  fetch, decode, testI, passC, wb
test set, R-type                  fetch, decode, test, passC, wb
shift immediate                   fetch, decode, shiftI, passC, wb
shift register                    fetch, decode, shift, passC, wb
load                              fetch, decode, addrL, load, sh4l
store                             fetch, decode, addrS, store, noWB
jump & link                       fetch, decode, savePC, passC, wb
others                            fetch, decode, noEX, noM, noWB
4. Finally observe that in figure 4.8 only state decode has a fanout
greater than one. In stage ID, we can therefore precompute the con-
trol signals of all stages that follow and clock them into a register
R2 out 1. Table 4.9 lists for each state the signals to be clocked
into that register. The inputs of register R2 are computed in every
cycle, but they are only clocked into register R2 when
(a) signals x to be used in the next cycle only control stage EX,
(b) signals y to be used in the next two cycles control the stages
EX and M, and
(c) signals z to be used in the next three cycles control the stages
EX, M and WB.
(Figure 4.10: Structure of the data paths DP and the precomputed control CON of the DLXσ machine. Register R.k = out(k-1) feeds stage k; the precomputed control signal groups x, y, z are clocked into R.2 by ue.1, the groups y, z into R.3 by ue.2, and the group z into R.4 by ue.3.)

Parameters of the two control automata; one precomputes the Moore signals (ex) and the other generates the Mealy signals (id).
/
&
The new precomputed control generates the same control signals as the
BASIC P IPELINING FSD, but the machine now has a very regular structure: Control signals
coming from register Rk out k 1 control stage k of the data paths for
all k 1. Indeed, if we define R0 and R1 as dummy registers of length
0, the same claim holds for all k. The structure of the data paths and the
precomputed control of machine DLXσ is illustrated in figure 4.10.
The hardware generating the inputs of register R2 is a Moore automaton
with the 10 EX states, precomputed control signals and the parameters ex
from table 4.10. The state noEX serves as the initial state of the automaton.
The next state only depends on the input IR but not on the current state.
Including the registers R3 and R4, the control signals of the stages EX to
WB can be precomputed at the following cost and cycle time:
The Mealy signals which govern stage ID are generated by a Mealy au-
tomaton with a single state and the parameters id of table 4.10. All its
inputs are provided by register IR at zero delay. According to section 2.6,
the cost of this automaton and the delay of the Mealy signals can be esti-
mated as
We do not bother to analyze cost and cycle time of the stall engine.
T 0 1
)
&
For such cycles T and signals R, we denote by RT the value of R during
cycle T ; R can also be the output of a register. We abbreviate with P REPARED
S EQUENTIAL
Iσ k T i M ACHINES
if Iσ 4 T i, then Iσ 0 T 1 i 1.
and for any cycle T 0, stage k is full ( f ullT k 1) iff Iσ k T is defined.
Recall that we denote by Ri the value of R after execution of instruction
Ii . By R 1 we denote the initial value of R, i.e., the value of R just after
Ri if t k
This very intuitive formulation of the lemma is the reason why in figure
4.3 we have drawn the general purpose register file at the bottom of the
pipeline and not – as is usual – in the middle of stage ID. A formal proof
uses the fact that
Ri 1 R5i
...
1111
0000 k
...
Ri-1
the machine around the equator (with stage k east of stage k 1 mod 5).
Now imagine that we process one instruction per day and that we clock the
pipeline stages at the dateline, i.e., the border between today and yesterday.
Then the lemma states that east of the dateline we already have today’s data
whereas west of the dateline we still have yesterdays data.
Let I 4 T i, then I 0 T 1 i 1, and the dateline lemma applies
for all R
RT 1 Ri1 1 Ri
PCce reset
2. The stall engine from figure 4.13 is used. For all i, signal uei enables
the update of registers and RAM cells in out i. The update of the
data memory DM is now enabled by
f ull3 reset
#
&#
IF
00111100 dpc IM
P IPELINING AS A
[31:2] [1:0] co 0 1 reset
0011
T RANSFORMATION
Inc(30) Add(32)
A’ 0
01
jumpR 1 0
ID
0 1 bjtaken
4
0 1 reset
nextPC
link PC’
CE ue.0
CE 1 full.1
reset
ue.1
CE full.2
ue.2
CE full.3
ue.3
CE full.4
ue.4
At reset, the signals ue4 : 0 are initialized with 00001. When only
counting cycles T with an inactive busy signal, the update enable sig-
nals ue become active successively as indicated in table 4.11. Note
that we now assume the reset signal to be active during cycle T 0.
&#
+- 8
We generally assume that the reset signal is active long enough to permit
an instruction memory access.
R 1 R0σ R1π
M 1 Mσ0 Mπ1
Note that we make no such assumption for the remaining registers. This
will be crucial when we treat interrupts. The mechanism realizing the jump
to the interrupt service routine (JISR) will be almost identical to the present
reset mechanism.
The schedule for the execution of instructions I_i by machine DLXπ is defined by

Iπ(k, T) = i  iff  T = k + i.
This should hold in particular for the signals which are clocked into the
output registers R out k of stage k at the end of cycle T . This would
permit us to conclude for these registers
RπT 1 RσT 1 Ri
¼
(4.2)
This almost works. Indeed it turns out that equations 4.1 and 4.2 hold
after every invisible register has been updated at least once. Until this has
happened, the invisible registers in the two machines can have different
values because they can be initialized with different values.
Thus, we have to formulate a weaker version of equations 4.1 and 4.2.
We exploit the fact, that invisible registers are only used to hold intermedi-
ate results (that is why they can be hidden from the programmer). Indeed,
if the invisible register R is an input register of stage k, then the pipelined
machine uses this register in cycle T only if it was updated at the end of
the previous cycle. More formally, we have
Let Iπ k T i, and let R be an invisible input register of stage k that was
not updated at the end of cycle T 1, then:
1. The set of output registers R of stage k which are updated at the end
of cycle T is independent of RT .
This can be verified by inspection of the tables 4.4 to 4.6 and 4.8.
##
&
Therefore, it will suffice to prove equation 4.2 for all visible registers as
BASIC P IPELINING well as for all invisible registers which are clocked at the end of cycle T . It
will also suffice to prove equation 4.1 for the input signals S of all registers
which are clocked at the end of cycle T .
Under the above assumptions and with a hypothesis about data depen-
dencies in the program executed we are now able to prove that the ma-
chines DLXπ and DLXσ produce the same sequence of memory accesses.
Thus, the CPUs simulate each other in the sense that they have the same
input/output behavior on the memory. The hypotheses about data depen-
dencies will be removed later, when we introduce forwarding logic and a
hardware interlock.
0, the instructions Ii 3 Ii 1 do
Suppose that for all i 0 and for all r
2. For all registers and R out k which are visible or updated at the
end of cycle T :
RTπ 1 Ri
Proof by induction on the cycles T of the pipelined execution. Let T 0.
We have Iπ 0 0 0 Iσ 0 0, i.e., instruction 0 is in stage 0 of machine
DLXπ during cycle T 0 and in stage 0 of machine DLXσ during cycle
T 0. The only input of stage 0 is the address for the instruction mem-
ory. This address is the output of register DPC for machine DLXσ and
signal d pc for machine DLXπ. By construction, both signals have in the
corresponding cycles T 0 and T 0 the same value, namely
DPCσ0 d pc0π 0
As stages 0 are for both machines identical, we have
Sπ0 Sσ0
for all internal signals S of stage 0 and claim 1 follows. In particular in
both machines IM 0 is clocked into the instruction register at the end of
cycle T T 0. Hence, claim 2 follows because of
IR1π IR1σ IR0
#&
&#
Illustration of the scheduling function I π for the stages k 1 and k. P IPELINING AS A
T RANSFORMATION
stage s Iπ s T Iπ s T 1
k-1 i
k i i-1
R out k 1
Hence, except for invisible input registers R which were not updated
after cycle T 1, stage k of machine DLXπ has in cycle T the same
inputs as stage k of machine DLXσ in cycle T . Stage k is identical in
both machines (this is the point of the construction of the prepared
machine !). By lemma 4.4, the set of output registers R of stage k
which are updated after cycle T or T , respectively, is identical for
both machines, and the input signals S of such registers have the
same value:
¼
SπT SσT
If follows that at the end of these cycles T and T identical values
are clocked into R :
T 1 T ¼ 1
Rπ Rσ
Ri by lemma 4.3
(Figure: Data flow between the pipeline stages of the DLXσ design: IF (IM), ID (NextPC), EX, M (DM) and WB (GPR), connected through the register sets out(0) to out(4).)
Iπ 3 T 1 i 1
Iπ 3 T i T 3
#(
&#
Illustration of the scheduling function I π for the stages 0 and 1. P IPELINING AS A
T RANSFORMATION
stage s Iπ s T Iπ s T 1
0 i
1 i-1 i-2
Iπ 4 T 1 i 4
GPRrTπ GPRri
4 by induction hypothesis
T¼
GPRrσ
GPRri 1 GPRri 4
PC if reset 0
d pc
0 if reset 1
The cycle time TPCenv of the PC environment remains unchanged, but the
address d pc of the instruction memory has now a longer delay. Assuming
that signal reset has zero delay, the address is valid at
This delay adds to the cycle time of the stage IF and to the accumulated
delay of signal ibusy of the instruction memory:
For each register R out i and memory M out i, the stall engine
then combines the clock/write request signal and the update signal and
turns them into the clock/write signal:
The update of the data memory DM is only enabled if stage 3 is full, and
if there is no reset:
The write signal Dmw of the data memory has now a slightly larger accu-
mulated delay. However, an inspection of the data memory control DMC
(page 81) indicates that signal Dmw is still not time critical, and that the
accumulated delay of DMC remains unchanged.
For the DLXπ design and the DLXσ design, the top level schematics of the
data paths DP are the same (figure 4.3), and so do the formula of the cost
CDP .
The control unit CON comprises the stall engine, the two memory con-
trollers IMC and DMC, and the two control automata of section 4.2.3. The
cost CCON moore already includes the cost for buffering the Moore sig-
nals up to the write back stage. The cost of the control and of the whole
DLXπ core therefore sum up to
Table 4.15 lists the cost of the DLX core and of its environments for
the sequential design (chapter 3) and for the pipelined design. The execute
environment of the sequential design consists of the environments ALUenv
and SHenv and of the 9 drivers connecting them to the operand and result
busses. In the DLXπ design, the busses are more specialized so that EXenv
only requires three drivers and two muxes and therefore becomes 20%
cheaper.
In order to resolve structural hazards, the DLXπ design requires an ex-
tended PC environment with adder and conditional sum incrementer. That
accounts for the 358% cost increase of PCenv and of the 12% cost increase
of the whole data paths.
Under the assumption that the data and control hazards are resolved in
software, the control becomes significantly cheaper. Due to the precompu-
tation and buffering of the control signals, the automata generate 19 instead
of 29 signals. In addition, the execution scheme is optimized, cutting the
total frequency νsum of the control signals by half. The constant, for exam-
ple, is only extracted once in stage ID, and not in every state of the execute
stage.
%
In order to determine the cycle time of the DLX design, we distinguish
three types of paths, those through the control, through the memory system
and through the data paths.
Control Unit CON The automata of the control unit generate Mealy
and Moore control signals. The Mealy signals only govern the stage ID;
they have an accumulated delay of 13 gate delays. The Moore signals are
precomputed and therefore have zero delay:
The cycle time of the control unit is the maximum of the times required by
the stall engine and by the automata
Compared to the sequential design, the automata are smaller. The maximal
frequency of the control signals and the maximal fanin of the states are cut
by 25% reducing time Tauto by 24% (table 4.16). The cycle time of the
whole control unit, however, is slightly increased due to the stall engine.
Memory Environments The cycle time TM models the read and write
time of the memory environments IMenv and DMenv. Pipelining has no
impact on the time tM which depends on the memory access times dImem
and dDmem :
TM maxTIMenv TDMenv
Data Paths DP The cycle time TDP is the maximal time of all cycles in
the data paths except those through the memories. This involves the stages
decode, execute and write back:
Table 4.16 lists all these cycle times for the sequential and the pipelined
DLX design. The DLXπ design already determines the constant and the
destination address during decode. That saves 4 gate delays in the execute
and write back cycle and improves the total cycle time by 6%.
&
&&
The cycle time of stage ID is dominated by the updating of the PC. In the
sequential design, the ALU environment is used for incrementing the PC R ESULT
and for the branch target computation. Since environment PCenv has now F ORWARDING
its own adder and incrementer, the updating of the becomes 20% faster.
Pipelining has the following impact on the cost and the cycle time of the $&
DLX fixed-point core, assuming that the remaining data and control haz-
ards can be resolved in software:
The data paths are about 12% more expensive, but the control be-
comes cheaper by roughly 30%. Since the control accounts for 5%
of the total cost, pipelining increases the cost of the core by about
8%.
The cycle time is reduced by 6%.
In order to analyze the impact which pipelining has on the quality of
the DLX fixed-point core, we have to quantify the performance of the
two designs. For the sequential design, this was done in [MP95]. For
the pipelined design, the performance strongly depends on how well the
data and control hazards can be resolved. This is analyzed in section 4.6.
Suppose that for all i ≥ 0 and r ≠ 0, the instructions I_(i-1) and I_(i-2) are not load operations with destination GPR[r], where GPR[r] is a source operand of instruction I_i. The following two claims then hold for all cycles T and T', for all stages k and for all instructions I_i with

Iσ(k, T') = Iπ(k, T) = i:
1. For all signals S in stage k which are inputs to a register R ∈ out(k) that is updated at the end of cycle T:

Sπ^T = Sσ^(T').

2. For all registers R ∈ out(k) which are visible or updated at the end of cycle T:

Rπ^(T+1) = R_i.
&#
&
&& > ,
BASIC P IPELINING
We first introduce three new precomputed control signals v4 : 2 for the
prepared sequential machine DLXσ . The valid signal v j indicates that the
data, which will be written into the register file at stage 4 (write back), is
already available in the circuitry of stage j. For an instruction Ii , the valid
signals are defined by
0 if instruction Ii is a load
v4 1; v3 v2 Dmri
1 otherwise
where Dmri is the read signal of the data memory for Ii . Together with the
write signal GPRw of the register file and some other precomputed control
signals, the signals v4 : 2 are pipelined in registers R2, R3 and R4 as
indicated in figure 4.15. For any stage k 2 3 4, the signals GPRwk
and vkk are available in stage k. At the end of stage k, the following
signals C k are available as well:
C 2 which is the input of register MAR,
( For all i, for any stage k 2, and for any cycle T with Iσ k T i, it
holds:
1. Ii writes the register GPRr iff after the sequential execution of Ii ,
the address r, which is different from 0, is kept in the registers Cad k
and the write signals GPRwk are turned on, i.e.:
Structure of the data paths DP and of the precomputed control CON
of the extended DLXπ machine
In the decode stage, the valid signals are derived from the memory read
signal Dmr, which is precomputed by the control automata. The generation
and buffering of the valid signals therefore requires the following cost and
cycle time:
CVALID 3 2 1 C f f Cinv
TVALID Tauto Dinv
This extension effects the cost and cycle time of the precomputed control
of the pipelined DLX design.
and it has an output Dout feeding data into stage 1. The data Din are fed
into stage 1 whenever forwarding is impossible.
(Figure: forwarding circuits; part a) shows a single circuit Forw with address input ad, data inputs Din and the C'(j), hit output top and data output Dout; part b) shows the two instances producing the operands Ain and Bin for the read addresses Aad and Bad, with hit signals topA and topB.)
For the stages j ∈ {2, 3, 4}, we specify the following signals: Signal hit_j is supposed to indicate that the register accessed by the instruction in stage 1 is modified by the instruction in stage j. Except for the first four clock cycles T = 0, ..., 3, all pipeline stages are full (table 4.13), i.e., they process regular instructions. However, during the initial cycles, an empty stage is prevented from signaling a hit by its full flag. Signal

top_j = hit_j ∧ /hit_2 ∧ ... ∧ /hit_(j-1)

indicates, moreover, that no hit occurs in the stages above stage j. The data output Dout is then chosen from the data C'(j) of the topmost hitting stage j, and Dout = Din if no stage signals a hit.

(Figure 4.16: Top level schematics of the DLXπ data paths with forwarding; shown are the WB stage with SH4Lenv, the registers C'.4 and Cad.4, and the register file environment GPRenv with outputs GPRoutA, GPRoutB and read addresses Aad, Bad. For clarity's sake, the address and control inputs of the stall engine are dropped.)
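The operand selection performed by the forwarding circuits can be sketched as follows (illustrative Python; the tuple layout and the function name are ours, not the book's):

def forward(read_ad, din, stages):
    # stages: list of (full, gprw, cad, c_prime) for the stages 2, 3 and 4,
    # ordered with stage 2 first; din is the value read from the register file
    if read_ad == 0:
        return 0                              # GPR[0] is always 0
    for full, gprw, cad, c_prime in stages:   # the topmost hit wins
        if full and gprw and cad == read_ad:
            return c_prime                    # forward the youngest pending result
    return din                                # no hit: use the register file output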
This forwarding engine provides the output Dout and the signals top j at
the following delays
the delay is largely due to the address check. The actual data Din and C j
are delayed by no more than
Let A C Din denote the accumulated delay of the data inputs C i and
Din. Since the addresses are directly taken from registers, the forwarding
engine can provide the operands Ain and Bin at the accumulated delay
A Ain Bin; this delay only impacts the cycle time of stage ID.
&&#
Iπ k T 1
Iπ 1 α T i α
&)
&&
We consider the time T , when instruction Ii α is in stage 1 α of the
prepared sequential machine: R ESULT
F ORWARDING
Iσ 1 α T i α
In cycle T , the prepared sequential machine has not yet updated register
GPRr. The following lemma states, where we can find a precomputed
version of GPRri α in the sequential machine.
Suppose the hypothesis of theorem 4.5 holds, Ii reads GPRr, instruction )
Ii α writes GPRr, and Iσ 1 α T i α, then
¼
C 1 αTσ GPRri α
v ji α 1
struction is not a load instruction. For the valid bits this implies
v ji α 1
&*
&
for all stages j. Application of the induction hypothesis to the instruction
BASIC P IPELINING register gives IRi IRTπ . Since Ii reads GPRr, it follows for signal Aadr:
r Aadi AadπT
Since Ii α writes register GPRr, it follows by lemma 4.8 for any stage
j 2 that
GPRw ji α Cad ji α r r 0
For stage j 1 α, the pipelining schedule implies (table 4.14, page 138)
Iπ j T Iπ 1α T iα
Note that none of the stages 0 to i α is empty. By the induction hypothesis
it therefore follows that
hit 1 αTπ f ull 1 αTπ GPRw 1 αTπ
r 0 Cad 1 αTπ r
1 GPRw 1 α1 α
r 0 Cad 1 αi α r
1
Let Ii α be the last instruction before Ii which writes GPRr. Then no
hit l Tπ 0
In the simple second case, the stronger hypothesis of theorem 4.5 holds
for Ii . For any i 4, this means that none of the instructions Ii 1 Ii 2 Ii 3
For i 3, the DLXπ pipeline is getting filled. During these initial cycles
(T 3), either stage k 1 is empty or instruction Ij with Iπ k T j 2
does not update register GPR[r]. As above, one concludes that for any j
hit jTπ 0
and that
DoutπT DinTπ GPRr 1
Based on this signal, we define the two clocks, the clock CE1 of the stages
0 and 1, and the clock CE2 of the stages 2 to 4:
set and its encoding. When stalling a different pipeline, the corresponding
part of the hardware has to be modified. A much more uniform method is
the following:
The following equations define a stall engine which uses this mechanism. It is clocked by CE2. A hardware realization is shown in figure 4.19. For k ≥ 2,

ue.0 = CE1
full.1 = 1
ue.1 = CE1 ∧ /reset
ue.k = CE2 ∧ /reset ∧ full.k
full.k := ue.(k-1),

where Dmr and Dmw are the read and write request signals provided by the precomputed control.
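The following Python sketch models this stall engine; the clock definitions CE2 = /busy and CE1 = CE2 ∧ /dhaz are our assumptions, since the clock equations themselves are not repeated here.

def stall_step(full, reset, busy, dhaz):
    # full: dict with the bits full[2], full[3], full[4]; returns (ue, full')
    ce2 = not busy
    ce1 = ce2 and not dhaz                    # a data hazard freezes only IF and ID
    ue = {0: ce1, 1: ce1 and not reset}
    for k in (2, 3, 4):
        ue[k] = ce2 and not reset and full[k]
    new_full = dict(full)
    if ce2:                                   # full.k is clocked by CE2
        new_full[2], new_full[3], new_full[4] = ue[1], ue[2], ue[3]
    return ue, new_full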
(Figure 4.19: hardware realization of the stall engine. full.1 is tied to 1; the flipflops full.2, full.3 and full.4 are clocked by CE2 and initialized via reset; the update enable signals ue.0, ..., ue.4 are derived from CE1, CE2, reset and the full bits.)
With the forwarding engine and the hardware interlock, it should be pos-
sible to prove a counterpart of theorem 4.7 with no hypothesis whatsoever
about the sequence of instructions.
Before stating the theorem, we formalize the new scheduling function
Iπ k T . The cycles T under consideration will be CE2 cycles. Intuitively,
the definition says that a new instruction is inserted in every CE1 cycle into
the pipe, and that subsequently it trickles down the pipe together with its
f ull k signals. We assume that cycle 0 is the last cycle in which the reset
signal is active.
The execution still starts in cycle 0 with Iπ 0 0 0. The instructions
are always fetched in program order, i.e.,
i if ue0T 0
Iπ 0 T i Iπ 0 T 1 (4.3)
i1 if ue0T 1
Any instructions makes a progress of at most one stage per cycle, i.e., if
Iπ k T i, then
Iπ k T 1 if uekT 0
i (4.4)
Iπ k 1 T 1 if uekT 1 and k 1 4
We assume that the reset signal is active long enough to permit an access
with address 0 to the instruction memory. With this assumption, activation
of the reset signal has the following effects:
CE2 1
ue0 CE1
ue1 ue2 ue3 ue4 0
read accesses to the data memory are disabled (DMr 0), and thus,
busy dhaz 0
'&
&'
When the first access to IM is completed, the instruction register holds
H ARDWARE
IR IM 0 I NTERLOCK
This is the situation in cycle T 0. From the next cycle on, the reset signal
is turned off, and a new instruction is then fed into stage 0 in every CE1
cycle. Moreover, we have
Iπ 0 T i Iπ 1 T i 1 (4.5)
This means that the instructions wander in lockstep through the stages 0
and 1. For T 1 and 1 k 3, it holds that
Once an instruction is clocked into stage 2, it passes one stage in each CE2
clock cycle. Thus, an instruction cannot be stalled after being clocked into
stage 2, i.e., it holds for k 2 3
Iπ k T i Iπ k 1 T 1 i (4.6)
1) Since the instructions are always fetched in-order (equation 4.3), in-
struction Ii enters stage 0 after instruction Ii 1 . Due to the lockstep behav-
ior of the first two stages (equation 4.5), there exists a cycle T with
Iπ 0 T i Iπ 1 T i 1
Let T T be the next cycle with an active CE1 clock. The stages 0 and 1
are both clocked at the end of cycle T ; by equation 4.4 it then follows that
both instructions move to the next stage:
Iπ 1 T 1 i Iπ 2 T 1 i 1
''
&
Instruction Ii 1 now proceeds at full speed (equation 4.6), i.e., it holds for
BASIC P IPELINING a 1 2 that
Iπ 2 a T 1 a i 1
Instruction Ii can pass at most one stage per cycle (equation 4.4), and up
to cycle T 1 a it therefore did not move beyond stage 1 a. Thus, Ii
cannot overtake Ii 1 . This proves the first statement.
CE1T 1
CE2T 1
1
Thus, the clock CE1 is disabled (CE1 0) during at most two consecutive
CE2 cycles, and all instructions therefore reach all stages of the pipeline.
Note that none of the above arguments hinges on the fact, that the pipelined
machine simulates the prepared sequential machine.
'(
&'
&'# -
H ARDWARE
We can now show the simulation theorem for arbitrary sequences of in- I NTERLOCK
structions:
2. for all registers and R out k which are visible or updated at the
end of cycle T :
RπT 1 Ri
We have argued above that IM 0 is clocked into register IR at the end
of CE2 cycle 0, and that the PC is initialized properly. Thus, the theorem
holds for T 0. For the induction steps, we distinguish four cases:
2. k 2 4. In the data paths, there exists only downward edges into
stage k, and the instructions pass the stages 2 to 4 at full speed. The
reasoning therefore remains unchanged.
Iπ 3 t i 1
for the last cycle t T such that I 3 t is defined, i.e., such that a
non-dummy instruction was in stage 3 during cycle t. Since dummy
instructions do not update the data memory cell M, it then follows
that
and that any stage between stage 1 and l is either empty or processes
an instruction with a destination address different from r. By the
construction of circuit Forw, it then follows that
topAkπT 1
With reasoning similar to the one of the previous case it then follows
that
AinTπ GPRrTπ GPRrlast i r GPRri 1
Table 4.17 lists the cost of the different DLX designs. Compared to the
sequential design of chapter 3, the basic pipeline increases the total gate
count by 8%, and result forwarding adds another 7%. The hardware inter-
lock engine, however, has virtually no impact on the cost. Thus, the DLXπ
design with hardware interlock just requires 16% more hardware than the
sequential design.
Note that pipelining only increases the cost of the data paths; the control
becomes even less expensive. This even holds for the pipelined design
with forwarding and interlocking, despite the more complex stall engine.
'*
&
BASIC P IPELINING Cycle time of the DLX core for the sequential and the pipelined de-
signs. The cycle time of CON is the maximum of the two listed times.
According to table 4.18, the result forwarding slows down the PC envi-
ronment and the register operand fetch dramatically, increasing the cycle
time of the DLX core by 40%. The other cycle times stay virtually the
same. The hardware interlocks make the stall engine more complicated
and increase the cycle time of the control, but the time critical paths re-
mains the same.
The significant slow down caused by result forwarding is not surprising.
In the design with a basic pipeline, the computation of the ALU and the
update of the PC are time critical. With forwarding, the result of the ALU
is forwarded to stage ID and is clocked into the operand registers A1 and
B1. That accounts for the slow operand fetch. The forwarded result is
also tested for zero, and the signal AEQZ is then fed into the glue logic
PCglue of the PC environment. PCglue provides the signal b jtaken which
governs the selection of the new program counter. Thus, the time critical
path is slowed down by the forwarding engine (6d), by the zero tester (9d),
by circuit PCglue (6d), and by the selection of the PC (6d).
With the fast zero tester of exercise 4.6, the cycle time can be reduced
by 4 gate delays at no additional cost. The cycle time (89d) is still 35%
higher than the one of the basic pipeline. However, without forwarding
and interlocking, all the data hazards must be resolved at compile time by
rearranging the code or by insertion of NOP instructions. The following
sections therefore analyze the impact of pipelining and forwarding on the
instruction throughput and on the performance-cost ratio.
CC = IC * CPI    (4.7)
The CPI ratio depends on the workload and on the hardware design. The
execution scheme of the instruction set Is defines how many cycles CPII
an instruction I requires on average. On the other hand, the workload to-
gether with the compiler defines an instruction count ICI for each machine
instruction, and so the CPI value can be expressed by
CPI = ∑_(I ∈ Is) (IC_I / IC) * CPI_I = ∑_(I ∈ Is) ν_I * CPI_I    (4.8)
In analogy to formula (4.8), the following term is treated as the CPI ratio of the pipelined design:

CPI = 1 + ∑_(hazard h) ν_h * CPH_h    (4.9)
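A quick numeric illustration of formula (4.9); the frequencies and penalties below are placeholders, not the measured values used later in this chapter.

def pipelined_cpi(hazards):
    # hazards: list of (frequency nu_h, penalty CPH_h in stall cycles)
    return 1.0 + sum(nu * cph for nu, cph in hazards)

example = [(0.10, 1),   # empty branch delay slots
           (0.05, 1),   # load interlocks
           (0.30, 2)]   # memory accesses with two wait states
print(pipelined_cpi(example))   # -> 1.75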
Instruction mix % of the SPECint92 programs normalized to 100%.
instructions compress eqntott espresso gcc li AV
load 19.9 30.7 21.1 23.0 31.6 25.3
store 5.6 0.6 5.1 14.4 16.9 8.5
compute 55.4 42.8 57.2 47.1 28.3 46.2
call (jal, jalr) 0.1 0.5 0.4 1.1 3.1 1.0
jump 1.6 1.4 1.0 2.8 5.3 2.4
branch, taken 12.7 17.0 9.1 7.0 7.0 10.6
, untaken 4.7 7.0 6.1 4.6 7.8 6.0
2- +
For the sequential DLX design, table 4.21 specifies the number of CPU
cycles and the number of memory accesses required by any machine in-
struction I. This table is derived from the finite state diagram of figure
3.20 (page 90). Let a memory access require W S wait states, on average.
The CPII value of an instruction I then equals the number of its CPU cy-
cles plus W S 1 times the number of memory accesses. When combined
with the instruction frequencies from table 4.20, that yields the following
CPI ratio for the sequential DLX design:
" + .
Even with result forwarding, the pipelined DLX design can be slowed
down by three types of hazards, namely by empty branch delay slots, by
hardware interlocks due to loads, and by slow memory accesses.
Branch Delay Slots The compiler tries to fill the delay slot of a branch
with useful instructions, but about 59% of the delay slots cannot be filled
(table 4.19). In comparison to perfect pipelining, such an empty delay slot
stalls the pipeline for CPHNopB 1 cycles. This hazard has the following
frequency:
Since these control hazards are resolved in software, every empty delay
slot also causes an additional instruction fetch.
Since the branch hazards are resolved in software by inserting a NOP, they
cause νNopB additional instruction fetches. Load hazards are resolved by
a hardware interlock and cause no additional fetches. Thus, the frequency
of instruction fetches equals
Summing up the stall cycles of all the hazards yields the following CPI
ratio for the pipelined DLX design with forwarding:
ν f orw 039
('
&
BASIC P IPELINING Hardware cost, cycle time and CPI ratio of the DLX designs (sequen-
tial, basic pipeline, pipeline with interlock)
The simulation assumed that the additional hazards are resolved by in-
serting a NOP. Thus, every branch, load or forwarding hazard causes an
additional instruction fetch. The frequency of fetches then runs at
Thus, the CPI ratio of the pipelined DLX design without forwarding is:
" - %
According to table 4.22, pipelining and result forwarding improve the CPI
ratio, but forwarding also increases the cycle time significantly. The CPI
ratio of the three designs grows with the number of memory wait states.
Thus, the speedup caused by pipelining and forwarding also depends on
the speed of the memory system (figure 4.20).
Result forwarding and interlocking have only a minor impact (3%) on
the performance of the pipelined DLX design, due to the slower cycle time.
However, both concept disburden the compiler significantly because the
hardware takes care of the data hazards itself.
The speedup due to pipelining increases dramatically with the speed
of the memory system. In combination with an ideal memory system
((
&(
3
DLXs/DLXpb
DLXs/DLXp
C OST
DLXpb/DLXp P ERFORMANCE
2.5
A NALYSIS
speedup
2
1.5
Q P1
q
Cq
(4.10)
1.8
1.6
1.4
1.2
0.8
0 0.2 0.4 0.6 0.8 1
quality paramter: q
Quality ratio of the pipelined designs relative to the sequential design
(DLXs: sequential, DLXpb: basic pipeline, DLXp: pipeline with interlock)
that a design A which is twice as fast as design B has the same quality
as B if it is four times as expensive.
For a realistic quality metric, the quality parameter should be in the range
02 05: Usually, more emphasis is put on the performance than on the
cost, thus q 05. For q 02, doubling the performance already allows
for a cost ratio of 16; a higher cost ratio would rarely be accepted.
$ %&
In case of a data hazard, the interlock engine of section 4.5
stalls the stages IF and ID. The forwarding circuit Forw signals a hit of
stage j 2 3 4 by
dummy instructions could also activate the hazard flag, and that the inter-
lock engine could run into a deadlock.
Prove for the interlock engine of section 4.5 and the corre-
sponding scheduling function the claim 2 of lemma 4.10: for any stage k
and any cycle T 0, the value Iπ k T is defined iff f ull kT 1.
Fast Zero Tester. The n-zero tester, introduced in section
2.3, uses an OR-tree as its core. In the technology of table 2.1, NAND / NOR
gates are faster than OR gates. Based on the equality
/
Chapter
5
Interrupt Handling
They can be maskable, i.e., they can be ignored under software con-
trol, or non maskable.
– repeat instruction I,
– continue with the instruction I which would follow I in the
uninterrupted execution of the program,
– abort the program.
The first step turns out to be not so easy. Recall that interrupts are a
kind of procedure calls, and that procedure call is a high level language
concept. On the other hand, our highest abstraction level so far is the as-
sembler/machine language level. This is the right level for stating what the
hardware is supposed to do. In particular, it permits to define the meaning
of instructions like ", which support procedure call. However, the mean-
ing of the call and return of an entire procedure cannot be defined like the
meaning of an assembler instruction.
There are various way to define the semantics of procedure call and re-
turn in high level languages [LMW86, Win93]. The most elementary way
– called operational semantics – defines the meaning of a procedure by
1 Priority 1 is urgent, priority 31 is not.
/
'
Classifications of the interrupts ATTEMPTING A
R IGOROUS
index j symbol external maskable resume
T REATMENT OF
0 reset yes no abort I NTERRUPTS
1 ill no no abort
2 mal no no abort
3 pff no no repeat
4 pfls no no repeat
5 trap no no continue
6 ovf no yes continue/abort
6i exi yes yes continue
prescribing how a certain abstract machine should interpret calls and re-
turns. One uses a stack of procedure frames. A call pushes a new frame
with parameters and return address on the stack and then jumps to the body
of the procedure. A return pops a frame from the stack and jumps to the
return address.
The obvious choice of the ‘abstract machine’ is the abstract DLX ma-
chine with delayed branch/delayed PC semantics defined by the DLXσ in-
struction set and its semantics. The machine has, however, to be enriched.
There must be a place where interrupt masks are stored, and there must be
a mechanism capable of changing the PC as a reaction to event signals. We
will also add mechanisms for collecting return addresses and parameters,
that are visible at the assembler language level.
We will use a single interrupt service routine ISR which will branch
under software control to the various exception handlers H j. We denote
by SISR the start address of the interrupt service routine.
We are finally able to map out the rest of the chapter. In section 5.2, we
will define at the abstraction level of the assembler language
Note that this is a nontrivial equation. It states that for instruction Ii , causes
are masked with the masks valid after instruction Ii 1 . Thus, if Ii happens
/'
'
to be a + instruction with destination SR, the new masks have no
I NTERRUPT affect on the MCA computation of Ii .
H ANDLING
<- .8
From the masked cause MCA, the signal JISR (jump to interrupt service
routine) is derived by
31
JISRi MCA j
j 0
Activation of signal JISR triggers the jump to the interrupt service routine.
Formally we can treat this jump either as a new instruction Ii1 or as a part
of instruction Ii . We chose the second alternative because this reflects more
closely how the hardware will work. However, for interrupted instructions
Ii and registers or signals X, we have now to distinguish between
Xi , which denotes the value of X after the (interrupted) execution of
instruction Ii , i.e., after JISR, and
Interrupt il has the highest priority among all those interrupts which were
not masked during Ii and whose event signals ev j were caught. Interrupt
il can be of type continue, repeat or abort. If it is of type repeat, no register
file and no memory location X should be updated, except for the special
purpose registers. For any register or memory location X, we therefore
define
Xi 1 if ili is of type repeat
Xi
Xiu otherwise
By SISR, we denote the start address of the interrupt service routine. The
jump to ISR is then realized by
The return addresses for the interrupt service routine are saved as
DPC PC i 1 if ili is of type repeat
DPC PC ui
EDPC EPC i
if ili is of type continue
if ili is of type abort,
/(
'#
i.e., on an interrupt of type abort, the return addresses do not matter. The
exception data register stores a parameter for the exception handler. For I NTERRUPT
traps this is the immediate constant of the trap instruction. For page fault S ERVICE ROUTINES
and misalignment during load/store this is the memory address of the faulty F OR N ESTED
access: I NTERRUPTS
For page faults during fetch, the address of the faulty instruction memory
access is DPCi 1, which is saved already. Thus, there is no need to save it
twice.
The exception cause register ECA stores the masked interrupt cause
ECAi MCAi
SR 0
ISP GPR30
//
'
I NTERRUPT Extensions to the DLX instruction set. Except for rfe and trap, all
H ANDLING instructions also increment the PC by four. SA is a shorthand for the special
purpose register SPRSA; sxt(imm) is the sign-extended version of the immediate.
the exception handling registers. For each frame F of the interrupt stack
and for any register R, we denote by FR the portion of F reserved for reg-
ister R. We denote by FEHR the portion of the frame reserved for copies
of the exception handling registers. We denote by ISEHR the portions
of all frames of the stack, reserved for copies of the exception handling
registers.
The interrupt service routine, which is started after an JISR, has three
phases:
1. S AVE (save status):
il min j ECA j 1
SR 031 il il
1
SR GPR0
/*
'
' )
I NTERRUPT
H ANDLING E INTEND interrupts to behave like procedure calls. The mechanism
of the previous section defines the corresponding call and return
mechanism. Handlers unfortunately are not generated by compilers and
thus, the programmer has many possibilities for hacks which make the
mechanism not at all behave like procedure calls. The obvious point of
attack are the fields ISEHR. Manipulation of IST OPEDPC obviously
allows to jump anywhere.
If the interrupt stack is not on a permanent memory page, each interrupt,
including page fault interrupts, can lead to a page fault interrupt, and so
on. One can list many more such pitfalls. The interesting question then
obviously is: have we overlooked one?
In this section we therefore define an interrupt service routine to be ad-
missible if it satisfies a certain set of conditions (i.e., if it does not make
use of certain hacks). We then prove that with admissible interrupt service
routines the mechanism behaves like a procedure call and return.
The code segments S AVE and R ESTORE can be interpreted as left and right
brackets, respectively. Before we can establish that admissible interrupt
service routines behave in some sense like procedures we have to review
some facts concerning bracket structures.
)
'
For sequences S S1 St of brackets ‘(’ and ‘)’ we define
I NTERRUPT
H ANDLING l S the number of left brackets in S
r S the number of right brackets in S
l S r S and
(5.1)
l Q r Q for all prefixes Q of S
i.e., the number of left brackets equals the number of right brackets, and in
prefixes of S there are never more right brackets than left brackets.
Obviously, if S and T are bracket structures, then S and ST are bracket
structures as well. In bracket structures S one can pair brackets with the
following algorithm:
2. L k exists, and
il j during S AVE1
il j during H
il 2 during S AVE2 , H and R ESTORE
Executions of admissible interrupt service routines are properly nested.
We will first establish properties of perfectly nested executions of in-
terrupt service routines in lemma 5.3. In lemma 5.4 we will prove by
)#
'
induction the existence of the bracket structure. In the induction step, we
I NTERRUPT will apply lemma 5.3 to portions of the bracket structure, whose existence
H ANDLING is already guaranteed by the induction hypothesis. In particular, we will
need some effort to argue that R ESTORE s are never interrupted.
The theorem then follows directly from the lemmas 5.3 and 5.4.
SRa 2
if j is a repeat interrupt
SRd
SRua 1
if j is a continue interrupt
Proof by induction on the number n of interrupts which interrupt the exe-
cution of an ISR j.
n 0. The execution of ISR j is uninterrupted. Since interrupt j is
not aborting, S AVE allocates a new frame on the stack IS, and R ESTORE
removes one frame. The handler H j itself does not update the stack
pointer (constraint 1), and thus
ISPa 1 ISPd
According to constraint 1, the EHR fields on the interrupt stack IS are only
written by S AVE . However S AVE just modifies the top frame of IS which
is removed by R ESTORE . Thus
ISEHRa 1 ISEHRd
)&
'&
and claim 1 follows. With respect to claim 2, we only show the preciseness
of the masks SR; the preciseness of the PCs can be shown in the same way. A DMISSIBLE
I NTERRUPT
SRd ESRd 1 by definition of , S ERVICE ROUTINES
IST OPESRc 1 by definition of R ESTORE
where IST OP denotes the top frame of the stack IS. Since the handler
itself does not update the stack pointer ISP nor the EHR fields on the stack
IS (constraint 1), it follows
SRa 2
if j is a repeat interrupt
SRd ESRa 1
SRua 1
if j is a continue interrupt
Ia Ib Ia1 Id1 Ia2 Id2 Iam Idm Ic Id
S AVE H j R ESTORE
Each of the ISR jr is interrupted at most n times, and due to the induction
hypothesis, they return the pointer ISP and the EHR fields on the stack
unchanged:
Since the instructions of the handler H j do not update these data, it fol-
lows for the pointer ISP that
The same holds for the EHR fields of the interrupt stack:
Since R ESTORE removes the frame added by S AVE , and since S AVE only
)'
'
updates the EHR fields of the top frame, the claim 1 follows for n 1. The
I NTERRUPT preciseness of the ISR j can be concluded like in the case n 0, except
H ANDLING for the equality
ISTOPEHRb ISTOPEHRc 1
Let the interrupt mechanism obey the software constraints. Then, non
aborting executions of the interrupt service routine are properly nested.
SR : GPR0 0
According to lemma 5.4, the codes S AVE and R ESTORE can only be in-
terrupted by reset. Thus, we focus on the interrupt handlers. For any non-
maskable interrupt j 6, claim two follows directly by constraint 2. For
the maskable interrupts j 6, we prove the claims by induction on the
number n of interrupts which interrupt the handler H j.
)/
'
n 0: The ISR is always started with SR 0, due to signal JISR.
I NTERRUPT The ISR only updates the masks by a special move + or by an
H ANDLING , instruction. Since , is only used as the last instruction of an ISR
(constraint 2), it has no impact on the masks used by the ISR itself.
In case of a special move SR : R, the bit Ri must be zero for any
i j. Thus, the maskable interrupts are masked properly. Due to the
definition of the masked interrupt cause of instruction Il
Let LHR be a non aborting execution of ISR j, where L is a save sequence
and R is a restore sequence. By theorem 5.2, the sequence of S AVE s and
R ESTORE s in LHR is an initial segment of a bracket structure. If the brack-
ets L and R are paired, then the S AVE and R ESTORE sequences in H form
a bracket structure. Hence, the brackets in LHR form a bracket structure
and LHR is perfectly nested.
))
'&
Assume R is paired with a left bracket L right of L:
A DMISSIBLE
L L R I NTERRUPT
S ERVICE ROUTINES
ISR j
Completeness Let the interrupt mechanism obey the software constraints.
Every internal interrupt j which occurs in instruction Ii and which is not
masked receives service in instruction Ii1 , or instruction Ii is repeated
after the ISR which starts with instruction Ii1 .
Let instruction Ii trigger the internal interrupt j, i.e., ev ji 1. The cause
bit CA ji is then activated as well. Under the assumption of the lemma, j
is either non-maskable or it is unmasked (SR ji 1 1). In either case, the
The enhanced ISA requires changes in the data paths and in the control
(section 5.5.6). The data paths get an additional environment CAenv which
collects the interrupt event signals and determines the interrupt cause (sec-
tion 5.5.5). Except for the PC environment, the register file environment
RFenv and circuit Daddr, the remaining data paths undergo only minor
changes (section 5.5.4). Figure 5.1 depicts the top level schematics of the
enhanced DLX data paths. Their cost can be expressed as
Note that without interrupt hardware, reset basically performs two tasks,
it brings the hardware in a well defined state (hardware initialization) and
3 Registers A and B play this role for register file GPR
*
''
IMenv
I NTERRUPT
EPCs IR.1
H ARDWARE
Ain, Bin IRenv Daddr
Sin PCenv
S A, B link, PCs co
CAenv
D EXenv sh
buffers:
MAR MDRw IR.j
Cad.j
DMenv PCs.j
Sad.j
C.4 MDRr
1111
0000 SH4Lenv
C’
Sout
SR RFenv
Data paths of the prepared sequential designs with interrupt support
restarts the instruction execution. In the DLXΣ design with interrupt hard-
ware, the reset signal itself initializes the control and triggers an interrupt.
The interrupt mechanism then takes care of the restart, i.e., with respect to
restart, signal JISR takes the place of signal reset.
0110 00111100
Inc / +4
H ANDLING EPC
Add(32) EDPC
Ain
rfe 0 1 jumpR 1 0 0 1 rfe
bjtaken 0 1 SISR
SISR+4
0 1 JISR
0 1 JISR
on JISR, instead:
SISR SISR 4 if JISRi 1
DPCi PCi
DPCiu PC ui otherwise.
Except for an , instruction, the values PC ui and DPCiu are computed as
before:
if Ii ,
u
EPCi 1
PCi 1 immi
PCi 1 4 otherwise
EDPCi 1 if Ii ,
DPCiu
PCi 1
otherwise
Thus, the new PC computation just requires two additional muxes con-
trolled by signal r f e. The two registers link and DDPC are only updated
on an active clock signal PCce, whereas PC and DPC are also updated on
a jump to the ISR:
These modifications have no impact on register link nor on the glue logic
PCglue which generates signal b jtaken. The cost of the environment now
are
The two exception PCs are provided by environment RFenv. Let csID
denote the control signals which govern stage ID, including signal JISR;
*
''
IR[20:11] IR[10:6] 00001 IR[10:6] 00000
Jlink Saddr
I NTERRUPT
Caddr 0 1 rfe.1 0 1 H ARDWARE
Rtype
Cad Sas Sad
Circuit Daddr
and let ACON csID denote their accumulated delay. Environment PCenv
then requires a cycle time of
'' - +
Circuit Daddr consists of the two subcircuits Caddr and Saddr. As before,
circuit Caddr generates the destination address Cad of the general purpose
register file GPR. Circuit Saddr (figure 5.3) provides the source address
Sas and the destination address Sad of the special purpose register file
SPR.
The two addresses of the register file SPR are usually specified by the
bits SA IR10 : 6. However, on an , instruction, the exception status
ESR is copied into the status register SR. According to table 5.3, the reg-
isters ESR and SR have address 1 and 0, respectively. Thus, circuit Saddr
selects the source address and the destination address of the register file
SPR as
SA SA if r f e 0
Sas Sad
00001 00000 if r f e 1
Circuit Daddr provides the three addresses Cad, Sas and Sad at the
following cost and delay:
8 ,
An K n special register file SF comprises K registers, each of which
is n bits wide. The file SF can be accessed like a regular two-port register
file:
the flag w specifies, whether a write operation should be performed
the addresses adr and adw specify the read and write address of the
register file, and
Din and Dout specify the data input and output of the register file.
In addition, the special register file SF provides a distinct write and read
port for each of its registers. For any register SF r,
Dor specifies the output of its distinct read port, and
Dir specifies the data to be written into register SF r on an active
write flag wr.
In case of an address conflict, such a special write takes precedence over
the regular write access specified by address adw. Thus, the data d r to be
written into SF r equals
Dir if wr 1
d r
Din otherwise.
The register is updated in case of wr 1 and in case of a regular write to
address r:
cer wr w adw r (5.3)
*&
''
Di[K-1] Di[0]
00111100
Din adr adw w w[ ]
I NTERRUPT
w[K-1] 1 0 w[0] 1 0 AdDec H ARDWARE
K
00111100
... ce
ce[K-1] SF[K-1] ce[0] SF[0] sl
01 DataSel
k-dec k-dec
K K
sl[K-1 : 0] ce[K-1 : 0]
We do not specify the output Dout of the special purpose register file if a
register is updated and read simultaneously.
8:
Figure 5.4 depicts an example realization of a special register file SF of
size (K n). The multiplexer in front of register SF r selects the proper
input depending on the special write flag wr.
The address decoder circuit AdDec in figure 5.5 contains two k-bit de-
coders (k log K ). The read address adr is decoded into the select bits
sl K 1 : 0. Based on this decoded address, the select circuit DataSel se-
lects the proper value of the standard data output Dout. For that purpose,
the data Dor are masked by the select bit sl r. The masked data are then
combined by n-OR-trees in a bit sliced manner:
K 1
Dout j Dor j sl r
r 0
The write address adw is decoded into K select bits. The clock signals of
the K registers are generated from these signals according to equation 5.3.
Thus, the cost of the whole register file SF runs at
The distinct read ports have a zero delay, whereas the standard output Dout
is delayed by the address decoder and the select circuit:
DSF Dor 0;
DSF Dout Ddec log K Dand Dor Dtree K
On a write access, the special register file has an access time of DSFw , and
the write signals w and w delay the clock signals by DSF w; ce:
1 "8
The core of the special purpose register environment SPRenv (figure 5.6) is
a special register file of size 6 32. The names of these registers SPR[5:0]
are listed in table 5.3. The environment is controlled by the write signals
SPRw and SPRw5 : 0, and by the signals JISR, repeat, and sel.
The standard write and read ports are only used on the special move
instructions + and + and on an , instruction. The standard
data output of the register file equals
Souti SPRSasi 1
SPRwr JISR
where
C4i if SPRwi Sadi 0
SRui
SRi 1 otherwise.
C4i if seli 1
Di1i
SRi 1 otherwise,
with
sel repeat SPRw Sad 0
According to the specification of JISR, if instruction Ii is interrupted, the
two exception PCs have to be set to
whereas on an abort interrupt, the values of the exception PCs do not mat-
ter. Environment PCenv generates the values PCiu , DPCiu , and
DDPCiu DPCi 1
*/
'
which are then passed down the pipeline together with instruction Ii . Ex-
I NTERRUPT cept on an , instruction,
H ANDLING
DPCiu PCi 1
but due to the software constraints, , can only be interrupted by reset
which aborts the execution. Thus, the inputs of the two exception PCs can
be selected as
PC4 DPC4 if repeat 0
Di3 Di4
DPC4 DDPC4 if repeat 1
Environment SPRenv consists of a special register file, of circuit SPRsel
which selects the inputs of the distinct read ports, and of the glue logic
which generates signal sel. Thus, the cost run at
All the data inputs are directly provided by registers at zero delay. Let its
control inputs have a delay of ACON csSPR. The output Sin and the inputs
Di then have an accumulated delay of
The decode stage ID gets the new output register S. The two opcodes
IR[31:26] and IR[5:0] and the destination address Cad of the general pur-
pose register file are provided by stage ID, but they are also used by later
stages. As before, these data are therefore passed down the pipeline, and
they are buffered in each stage. Due to the interrupt handling, stage WB
now also requires the three PCs and the address Sad of the register file
SPR. Like the opcodes and the address Cad, these data wander down the
pipeline together with the instruction. That requires additional buffering
(figure 5.7); its cost runs at
Buffering
1- 1
The execute environment EXenv of figure 5.8 still comprises the ALU en-
vironment and the shifter SHenv and connects them to the operand and
result busses. The three operand busses are controlled as before, and the
outputs sh and ov f also remain the same.
The only modification is that the result D is now selected among six
values. Besides the register value link and the results of the ALU and the
shifter, environment EXenv can also put the constant co or the operands S
or A on the result bus:
link if linkDdoe 1
if ALUDdoe 1
alu
sh if SHDdoe 1
D
co if coDdoe 1
A
S
if ADdoe 1
if SDdoe 1
The result D co is used in order to pass the trap constant down the
pipeline, whereas the result D A is used on the special move instruction
+. D S is used on , and +.
The selection of D now requires two additional tristate drivers, but that
has no impact on the delay of the environment. The cost of EXenv are
A 0 1 bmuxsel 0 1 a’muxsel
01 0011
H ANDLING a’
a b
ALUenv SHenv
ovf alu s[1:0] sh
D sh
CIMC Cor
AIMenv f lags maxDor dIstat
Memory Control DMC In addition to the bank write signals, the mem-
ory controller DMC now provides signal dmal which indicates a mis-
aligned access.
The bank write signals Dmbw[3:0] are generated as before (page 81). In
addition, this circuit DMbw provides the signals B (byte), H (half word),
and W (word) which indicate the width of the memory access, and the
signals B[3:0] satisfying
B j 1 s1 : 0 j
A byte access is always properly aligned. A word access is only aligned,
if it starts in byte 0, i.e., if B0 1. A half word access is misaligned, if it
starts in byte 1 or 3. Flag dmal signals that an access to the data memory
is requested, and that this access is misaligned (malAc 1):
dmal Dmr3 Dmw3 malAc
malAc W B0 H B1 B3
The cost CDMC of the memory controller is increased by some gates, but
the delay DDMC of the controller remains unchanged:
CDMC CDMbw Cinv 3 Cand 3 Cor
Cdec 2 3 Cinv 15 Cand 8 Cor
Let ACON csM denote the accumulated delay of the signals Dmr and
Dmw, the cycle time of the data memory environment and the delay of
its flags can then be expressed as
TM ACON csM DDMC dDmem ∆
ADMenv f lags ACON csM DDMC dDstat
'
CAcol ipf, imal
I NTERRUPT [3, 2]
ue.0 CA.1
H ANDLING
[6] ovf
ue.1 CA.2
[5, 1] ovf?
trap, ill
ue.2 CA.3
reset dmal
dpf
ev[31:7] [0] [2] [4] cause processing
CA.3’ CApro
CA4ce MCA, jisr.4, repeat
The cause environment CAenv (figure 5.9) performs two major tasks:
Its circuit CAcol collects the interrupt events and clocks them into
the cause register.
It processes the caught interrupt events and initiates the jump to the
ISR. This cause processing circuit CApro generates the flags jisr
and repeat, and provides the masked interrupt cause MCA.
Since the interrupt event signals are provided by several pipeline stages,
the cause register CA cannot be assigned to a single stage. Register CA is
therefore pipelined: CAi collects the events which an instruction triggers
up to stage i. That takes care of internal events. External events could be
caught at any stage, but for a shorter response time, they are assigned to
the memory stage.
The control signals of the stage EX are precomputed. The cycle time of
the cause collection CAcol, the accumulated delay of its output CA3 , and
its cost can be expressed as:
TCAcol maxAIMenv f lags AALUenv ov f Dand ∆
ACAcol CA3 maxADMenv f lags ADMC Dor
CCAcol Cand Cor 9 C f f
- "
(figure 5.10) The masked cause mca is obtained by masking the maskable
interrupt events CA3 with the corresponding bits of the status register SR.
The flag jisr is raised if mca is different from zero, i.e., if at least one bit
mcai equals one.
CA3 i SRi if i 6
mcai
CA3 i otherwise
31
jisr mcai
i 0
The cost and cycle time of the whole cause environment CAenv run at
As in the previous designs, the control unit basically comprises two cir-
cuits:
The control automaton generates the control signals of the data paths
based on an FSD. These signals include the clock and write request
signals of the registers and RAMs.
The control automaton must be adapted to the extended instruction set, but
the new instructions have no impact on the stall engine. Nevertheless, the
DLXΣ design requires a new stall engine, due to the ISR call mechanism.
&
''
CE ue.1
CE full.2
CE ue.2
CE full.3
reset
/reset ue.3
CE
CE full.4
CE ue.4
Stall engine of the sequential DLX design with interrupt hardware
The update enable bit uei enables the update of the of the output registers
of stage i. During reset, all the update enable flags are inactive
A jump to the interrupt service routine is only initiated, if the flag jisr4
is raised and if the write back stage is full:
The interrupt mechanism requires that the standard write to a register file
or to the memory is canceled on a repeat interrupt. Since the register files
GPR and SPR belong to stage WB, their protection is easy. Thus, the write
signals of the two register files are set to
For the data memory, the protection is more complicated because the
memory DM is accessed prior to the cause processing. There are only two
kinds of repeat interrupts, namely the two page faults p f f and p f ls; both
interrupts are non-maskable. Since the interrupt event p f ls is provided by
the memory DM, the memory system DM itself must cancel the update if
it detects a page fault. The other type of page fault (ev2 p f f ) is already
detected during fetch. We therefore redefine the write signal Dmw as
Let T be the last machine cycle in which the reset signal is active. In (
the next machine cycle, the DLXΣ design then signals a reset interrupt and
performs a jump to the ISR:
the DLXΣ design is clocked whenever the reset signal is active, and espe-
cially in cycle T . Due to reset, the flags f ull 4 : 0 get initialized
and the clock enable signal for the output registers of CApro is
Hence, the output registers of the cause processing circuit are updated at
the end of cycle T with the values
Consequently,
-
The control automaton is constructed as for the DLXσ design without in-
terrupt handling (section 4.2.3). The automaton is modeled by a sequential
FSD which is then transformed into precomputed control:
The control signals of stage IF and the Moore signals of ID are al-
ways active, whereas the Mealy signals of stage ID are computed in
every cycle.
/
'
The control signals of the remaining stages are precomputed during
I NTERRUPT ID. This is possible because all their states have an outdegree of one.
H ANDLING There are three types of signals: signals x are only used in stage EX,
signals y are used in stage EX and M, and signals z are used in all
three stages.
However, there are three modifications. The automaton must account for
the 8 new instructions (table 5.4). It must check for an illegal opcode,
i.e., whether the instruction word codes a DLX instruction or not. Unlike
the DLXσ design, all the data paths registers invisible to the assembler
programmer (i.e., all the registers except for PC’, DPC, and the two register
files) are now updated by every instruction. For all these registers, the
automaton just provides the trivial clock request signal 1.
The invisible registers of the execute stage comprise the data registers
MAR and MDRw and the buffers IR.3, Cad.3, Sad.3, PC.3, DPC.3, and
DDPC.3. By default, these registers are updated as
Besides the buffers, the invisible registers of the memory stage comprise
the data registers C.4 and MDRr. Their default update is the following:
The automaton is modeled by the FSD of figure 5.12. The tables 5.6
and 5.7 list the RTL instruction; the update of the invisible registers is
only listed if it differs from the default. Note that in the stages M and
WB, an , is processed like a special move +. Table 5.8 lists the
nontrivial disjunctive normal forms, and table 5.10 lists the parameters of
the automaton.
In stage ID, only the selection of the program counters and of the con-
stant got extended. This computation requires two additional Mealy sig-
nals r f e1 and Jimm. In stage EX, the automaton now also has to check
for illegal instructions; in case of an undefined opcode, the automaton gets
into state Ill. Since this state has the largest indegree, Ill serves as the new
initial state. State noEX is used for all legal instructions which already
finish their actual execution in stage ID, i.e., the branches and %
and the two jumps " and ".
)
''
IF fetch
I NTERRUPT
ID
decode H ARDWARE
EX
*
'
I NTERRUPT
H ANDLING
RTL instructions of the stages EX, M, and WB. The update of the
invisible registers is only listed if it differs from the default.
''
Nontrivial disjunctive normal forms of the DLX Σ control automaton I NTERRUPT
H ARDWARE
stage DNF state/signal IR31 : 26 IR5 : 0 length
EX D1 alu 000000 1001** 10
000000 100**1 10
D2 aluo 000000 1000*0 11
D3 aluI 0011** ****** 4
001**1 ****** 4
D4 aluIo 0010*0 ****** 5
D5 shift 000000 0001*0 11
000000 00011* 11
D6 shiftI 000000 0000*0 11
000000 00001* 11
D7 test 000000 101*** 9
D8 testI 011*** ****** 3
D9 savePC 010111 ****** 6
000011 ****** 6
D10 addrS 10100* ****** 5
1010*1 ****** 5
D11 addrL 100*0* ****** 4
1000*1 ****** 5
10000* ****** 5
D12 mi2s 000000 010001 12
D13 ms2i 000000 010000 12
D14 trap 111110 ****** 6
D15 rfe 111111 ****** 6
D16 noEX 00010* ****** 5
000010 ****** 6
010110 ****** 6
ID D17 Rtype 000000 ****** 6
D6 shiftI 000000 0000*0 (10)
000000 00001* (10)
D9 Jlink 010111 ****** (6)
000011 ****** (6)
D18 jumpR 01011* ****** 5
D19 jump 00001* ****** 5
01011* ****** (5)
D20 branch 00010* ****** (5)
D21 bzero *****0 ****** 1
D15 rfe.1 111111 ****** (6)
D22 Jimm 00001* ****** (5)
111110 ****** (6)
Accumulated length of all nontrivial monomials 206
'
I NTERRUPT Control signals to be precomputed during stage ID
H ANDLING
EX M WB type x signals (stage EX only)
y shift4s, Dmw trap, ADdoe ovf?
amuxsel coDdoe SDdoe add?
z Dmr shift4l linkDdoe Rtype ill
SPRw ALUDdoe bmuxsel test
GPRw SHDdoe
''
Parameters of the two control automata; one precomputes the Moore I NTERRUPT
signals (ex) and the other generate the Mealy signals (id). H ARDWARE
# states # inputs # and frequency of outputs
k σ γ νsum νmax
ex 17 12 16 48 11
id 1 12 9 13 2
The stage EX, M and WB are only controlled by Moore signals, which
are precomputed during decode. All their states have an outdegree of one.
It therefore suffices to consider the states of stage EX in order to generate
all these control signals. For any of these signals, the table 5.9 list its type
(i.e., x, y, or z) and the EX states in which it becomes active.
+
Along the lines of section 3.4 it can be show that the DLXΣ design interprets
the extended DLX instruction set of section 5.2 with delayed PC semantics.
In the sequential DLX design without interrupt handling, any instruction
which has passed a stage k only updates output registers of stages k k
(lemma 4.3). In the DLXΣ design, this dateline criterion only applies for
the uninterrupted execution. If an instruction Ii gets interrupted, the two
program counters PC’ and DPC get also updated when Ii is in the write
back stage. Furthermore, in case of a repeat interrupt, the update of the data
memory is suppressed. Thus, for the DLXΣ design, we can just formulate
a weak version of the dateline criterion:
Let IΣ k T i. For any memory cell or register R out t different from )
PC’ and DPC, we have
¼ Ri 1 if t k
RT
Ri if t k
¼ Ri 1 if k 0 1
RT
Rui if k 2
#
' ¼
If the execution of instruction Ii is not interrupted, i.e., if JISRT 0 with
I NTERRUPT IΣ 4 T i, then Ri Rui for any register R.
H ANDLING
If IΣ 4 T i, then IΣ 0 T 1 i 1 and lemma 5.9 implies for all
R
RT 1
¼
Ri
S IN the basic DLX design (chapter 4), the same three modifications
are sufficient in order to transform the prepared sequential design
DLXΣ into the pipelined design DLXΠ . Except for
a modified PC environment,
extensive hardware for result forwarding and hazard detection, and
a different stall engine,
the DLXΣ hardware can be used without changes. Figure 5.13 depicts the
top-level schematics of the DLXΠ data paths. The modified environments
are now described in detail.
Figure 5.14 depicts the PC environment of the DLXΠ design. The only
modification over the DLXΣ design is the address provided to the instruc-
tion memory IM. As for the transformation of chapter 4, memory IM is
now addressed by the input d pc of register DPC and not by its output.
SISR if JISR 1
EDPC if JISR 0 r f e1 1
d pc
PC otherwise
However, the delayed program counter must be buffered for later use, and
thus, register DPC cannot be discarded.
The cost of environment PCenv and most of its delays remain the same.
The two exception PCs are now provided by the forwarding circuit FORW .
Thus,
APCenv d pc maxAJISR AFORW EDPC 2 Dmux 32
TPCenv maxDinc 30 AIRenv co Dadd 32 AGPRenv Ain
AFORW EPCs A b jtaken ACON csID
3 Dmux 32 ∆
&
'(
P IPELINED
IMenv I NTERRUPT
H ARDWARE
EPCs IR.1
Ain, Bin IRenv Daddr
Sin PCenv
S A, B link, PCs co
CAenv
D EXenv sh
buffers:
Forwarding Engine FORW MAR MDRw IR.j
SR Cad.j
DMenv PCs.j
Sad.j
C.4 MDRr
SH4Lenv
C’
RFenv
C’, Aout, Bout, Sout
Data paths of the pipelined design DLX Π with interrupt support
0110 11
00
Inc / +4
Add(32) rfe.1
00
11
EPC JISR
Ain
0
rfe.1 0 1 jumpR 1 0
1 0
1 dpc
bjtaken 0 1
SISR+4
EDPC SISR
0 1 JISR
'
'
The modified PC environment also impacts the functionality and delay
I NTERRUPT of the instruction memory environment. On a successful read access, the
H ANDLING instruction memory now provides the memory word
The cycle time of IMenv and the accumulated delay of its flags are
The data paths comprise two register files, GPR and SPR. Both are up-
dated during write back. Since their data are read by earlier stages, result
forwarding and interlocking is required. The two register files are treated
separately.
9 "- 8
During + instructions, data are copied from register file SPR via reg-
ister S and the Ck registers into the register file GPR. The forwarding
circuits to S have to guarantee that the uninterrupted execution of Ii , i.e.,
IΠ 2 T IΠ 3 T 1 IΠ 4 T 2 i
down the Ck registers like the result of an ordinary fixed point operation.
Thus we do not modify the forwarding circuits for registers A and B at all.
"- 8
Data from the special purpose registers are used in three places, namely
on an , instruction, the two exception PCs are read during decode.
Therefore, one only needs to forward data from the inputs of the Ck reg-
isters with destinations in SPR specified by Sad.
Circuit SFor is slightly faster than the forwarding circuit Forw for the GPR
operands.
/
' Dout
ad
I NTERRUPT SPRw.2
hit[2]
full.2 0 1
H ANDLING equal
Sad.2
SPRw.3 C’.2
hit[3]
full.3 0 1
equal
Sad.3
SPRw.4 C’.3
hit[4]
full.4 0 1
equal
Sad.4
Din C’.4
a) 011 EPC’ b)
000 SR’
Forwarding of EPC into register PC’ (a) and of register SR into the
memory stage (b)
but this would increase the instruction fetch time. Therefore, forwarding
of EDPC to d pc is omitted. The data hazards caused by this can always
be avoided if we update in the R ESTORE sequence of the interrupt service
routine register EDPC before register EPC.
If this precaution is not taken by the programmer, then a data hazard
signal
dhaz EDPC hit 2 hit 3 hit 4
is generated by the circuit in figure 5.18. Note that this circuit is obtained
from circuit SFor 3 by the obvious simplifications. Such a data hazard is
only of interest, if the decode stage processes an , instruction. That is the
only case in which a SPR register requests an interlock:
Cost and Delay The hazard signal dhazS is generated at the following
cost and delay
The address and control inputs of the forwarding circuits SFor are directly
taken from registers. The input data are provided by the environment EX-
env, by register C.4 and by the special read ports of the SPR register file.
Thus,
The stall engine of the DLXΠ design is very similar to the interlock engine
of section 4.5 except for two aspects: the initialization is different and there
are additional data hazards to be checked for. On a data hazard, the upper
two stages of the pipeline are stalled, whereas the remaining three stages
proceed. The upper two stages are clocked by signal CE1, the other stages
are clocked by signal CE2.
A data hazard can now be caused by one of the general purpose operands
A and B or by a special purpose register operand. Such a hazard is signaled
by the activation of the flag
dhaz dhazA dhazB dhazS
,- >
The full vector is initialized on reset and on every jump to the ISR. As in
the DLXΣ design, a jump to the ISR is only initiated if the write back stage
is not empty
JISR jisr4 f ull 4
On JISR, the write back stage is updated and stage IF already fetches the
first instruction of the ISR. The update enable signals ue4 and ue0 must
therefore be active. The instructions processed in stages 1 to 3 are canceled
on a jump to the ISR; signal JISR disables the update enable signals ue3 :
1. In the cycle after JISR, only stages 0 and 1 hold a valid instruction, the
other stages are empty, i.e., they process dummy instructions.
Like in the DLXΣ design, an active reset signal is caught immediately
and is clocked into register MCA even if the memory stage is empty. In
order to ensure that in the next cycle a jump to the ISR is initiated, the reset
signal forces the full bit f ull 4 of the write back stage to one.
The following equations define such a stall engine. A hardware realiza-
tion is depicted in figure 5.19.
ue0 CE1
ue1 CE1 JISR f ull 1 1
ue2 CE2 JISR f ull 2 f ull 2 : ue1
ue3 CE2 JISR f ull 3 f ull 3 : ue2
ue4 CE2 f ull 4 f ull 4 : ue3 reset
CE1 ue.0
'(
1 full.1 P IPELINED
CE1 I NTERRUPT
ue.1 H ARDWARE
JISR CE2 full.2
CE2
ue.2
CE2 full.3
reset ue.3
CE2 full.4
ue.4
CE2
Like in the pipelined design DLXπ without interrupt handling, there are two
clock signals. Signal CE1 governs the upper two stages of the pipeline, and
signal CE2 governs the remaining stages.
CE1 busy dhaz JISR Ibusy
busy dhaz JISR NOR Ibusy
CE2 busy JISR NOR Ibusy reset
Both clocks are inactive if one of the memories is busy; CE1 is also inactive
on a data hazard. However, on JISR both clocks become active once the
instruction memory is not busy. In order to catch an active reset signal
immediately, the clock CE2 and the clock CA4ce of the cause processing
circuit must be active on reset
CA4ce ue3 reset
In order to avoid unnecessary stalls, the busy flags are only considered in
case of a successful memory access. Since the memories never raise their
flags when they are idle, the busy flags are generated as
Ibusy ibusy imal NOR ip f
Dbusy dbusy dmal NOR d p f
busy Ibusy NOR Dbusy
The interrupt mechanism requires that the standard write to a register
file or memory is canceled on a repeat interrupt. The register files GPR
'
and SPR are protected as in the sequential design. A special write to the
I NTERRUPT SPR register file is enabled by signal ue4. The write signals of the register
H ANDLING files are therefore generated as
For the data memory, the protection becomes more complicated. Like in
the sequential design DLXΣ, the memory system DM itself cancels the
update if it detects a page fault, and in case of a page fault on fetch, the
write request signal is disabled during execute
However, the access must also be disabled on JISR and on reset. Thus,
signal Dmw3 which is used by the memory controller DMC in order to
generate the bank write signals is set to
The remaining clock and write signals are enabled as in the pipelined de-
sign DLXπ without interrupt handling: the data memory read request is
granted if stage M is full
Like for the DLXΣ design (lemma 5.8), it follows immediately that with this
stall engine, an active reset signal brings up the DLXΠ design, no matter in
which state the hardware has been before:
* Let T be the last machine cycle in which the reset signal is active. In the
next machine cycle, the DLXΠ design then signals a reset interrupt and
performs a jump to the ISR:
IΠ 0 0 0
The instructions are still fetched in program order and wander in lock-
step through the stages 0 and 1:
i if ue0T 0
IΠ 0 T i IΠ 0 T 1
i1 if ue0T 1
IΠ 1 T i IΠ 0 T i 1
Any instruction makes a progress of at most one stage per cycle, and it
cannot be stalled once it is clocked into stage 2. However, on an active
JISR signal, the instructions processed in stages 1 to 3 are evicted from the
pipeline. Thus, IΠ k T i JISRT 0 k 0 implies
IΠ k T 1 if uekT 0
i
IΠ k 1 T 1 if uekT 1 and k1 4
IΠ k T i JISRT 0 IΠ k 1 T 1 i
Note that on JISR 1, the update enable signals of the stages 0 and 4 are
active whereas the ones of the remaining stages are inactive.
#
'
+%
I NTERRUPT The computation of the inverted hazard signal dhaz requires the data haz-
H ANDLING ard signals of the two GPR operands A and B and the data hazard signal
dhazS of the SPR operands.
Since for the two GPR operands, the hazard detection is virtually the same,
the cost and delay of signal dhaz can be modeled as
The inverted flag busy, which combines the two signals Dbusy and
Ibusy, depends on the flags of the memory environments. Its cost and
delay can be modeled as
The two clock signals CE1 and CE2 depend on the busy flag, the data
hazard flag dhaz, and the JISR flags.
We assume that the reset signal has zero delay. The two clocks can then be
generated at the following cost and delay
The core of the stall engine is the circuit of figure 5.19. In addition,
the stall engine generates the clock signals and enables the update of the
registers and memories. Only the data memory, the two register files, the
output registers of environment CApro, and the registers PC’ and DPC
have non-trivial update request signals. All the other data paths registers
R out i are clocked by uei. The cost and the cycle time of the whole
stall engine can therefore be modeled as
In following, we determine the cost and the cycle time of the DLXΠ de-
sign and compare these values to those of pipelined design DLXπ without
interrupt handling.
+ "
Except for the forwarding circuit FORW, the top level schematics of the
data paths of the two DLX design with interrupt support are the same. The
cost of the DLXΠ data paths DP (figure 5.13) can therefore be expressed as
Table 5.12 lists the cost of the data paths and its environments for the
two pipelined DLX designs. Environments which are not effected by the
interrupt mechanism are omitted. The interrupt mechanism increases the
cost of the data paths by 58%. This increase is largely caused by the reg-
ister files, the forwarding hardware, and by the buffering. The other data
paths environments become about 20% more expensive.
Without interrupt hardware, each of the stages ID, EX and M requires
17 buffers for the two opcodes and one destination address. In the DLXΠ
design, each of these stages buffers now two addresses and three 32-bit
PCs. Thus, the amount of buffering is increased by a factor of 4.
The environment RFenv now consists of two register files GPR and SPR.
Although there are only 6 SPR registers, they almost double the cost of en-
vironment RFenv. That is because the GPR is implemented by a RAM,
whereas the SPR is implemented by single registers. Note that an 1-bit
register is four times more expensive than a RAM cell. The register imple-
mentation is necessary in order to support the extended access mode – all
6 SPR registers can be accessed in parallel.
'
'
I NTERRUPT Cost of the control of the two pipelined DLX designs
H ANDLING
environment stall MC automata buffer CON DLX
DLXπ 77 48 609 89 830 13840
DLXΠ 165 61 952 105 1283 21893
increase 114% 27% 56% 18% 44% 58%
According to the schematics of the precomputed control (figure 4.15), the
control unit CON buffers the valid flags and the precomputed control sig-
nals. For the GPR result, 6 valid flags are needed, i.e., v4 : 22 v4 : 33
and v44. Due to the extended ISA, there is also an SPR result. Since
this result always becomes valid in the execute stage, there is no need for
additional valid flags.
Since the control automata already provide one stage of buffering, pre-
computed control signals of type x need no explicit buffering. Type y sig-
nals require one additional stage of buffers, whereas type z signals require
two stages of buffers. According to table 5.9, there are three control signals
of type z and one of type y. Thus, the control requires
6 2 3 1 1 13
flipflops instead of 11. One inverter is used in order to generate the valid
signal of the GPR result. In addition, the control unit CON comprises the
stall engine, the two memory controllers IMC and DMC, and two control
automata (table 5.10). Thus, the cost of unit CON can be modeled as
Table 5.13 lists the cost of the control unit, of all its environments, and
of the whole DLX hardware. The interrupt mechanism increases the cost
of the pipelined control by 44%. The cost of the stall engine is increased
above-average (114%).
%
According to table 5.14, the interrupt support has virtually no impact on the
cycle time of the pipelined DLX design. The cycle times of the data paths
environments remain unchanged, only the control becomes slightly slower.
However, as long as the memory status time stays below 43 gate delays,
the cycle time of the DLXΠ design is dominated by the PC environment.
(
'/
Cycle times of the two pipelined DLX designs; d mem denotes the max- C ORRECTNESS OF
imum of the two access times d Imem and dDmem and dmstat denotes the maximum THE I NTERRUPT
of the two status times dIstat and dDstat . H ARDWARE
ID CON / stall
EX WB DP IF, M
A/B PC max( , )
DLXπ 72 89 66 33 89 16 dmem 57 43 dmstat
DLXΠ 72 89 66 33 89 16 dmem 57 46 dmstat
N THIS section, we will prove that the pipelined hardware DLXΠ to-
gether with an admissible ISR processes nested interrupts in a precise
manner. For a sequential design, the preciseness of the interrupt processing
is well understood. We therefore reduce the preciseness of the pipelined
interrupt mechanism to the one of the sequential mechanism by showing
that the DLXΠ design simulates the DLXΣ design on any non-aborted in-
struction sequence.
In a first step, we consider an uninterrupted instruction sequence I0
Ip , where I0 is preceded by JISR, and where Ip initiates a JISR. In a sec-
ond step, it is shown that the simulation still works when concatenating
several of these sequences. With respect to these simulations, canceled
instructions and external interrupt events are a problem.
Let the external interrupt event ev j be raised during cycle T of the
pipelined execution of P
ev jTΠ 1
0 and ev jTΠ 1
let t be the first cycle after T for which the write back stage is full, and let
T 1 be the cycle in the sequential execution of P corresponding to cycle
t, i.e.,
IΠ 4 t i IΣ 4 T 1
In the sequential execution of P, event ev j is then assigned to cycle T
¼
ev jTΣ 1
IΠ 3 tˆ IΠ 4 tˆ 1 IΠ 4 t
The proofs dealing with the admissibility of the ISR (section 5.4) only
argue about signal JISR and the values of the registers and memories vis-
ible to the assembler programmer, i.e., the general and special purpose
register files, the two PCs and the two memories IM and DM:
For the simulation, signal JISR and the contents of storage C are therefore
of special interest.
JISRΣ 1 1
and JISR0Π 1
R C R0Σ R1Π
Let Tp and Tp denote the cycles in which Ip is processed in the write back
stage
T
IΠ 4 Tp IΣ 4 Tp p ue4Πp 1
The initial PCs then have values PC 0Σ SISR 4 and DPCΣ0 SISR. For
any instruction Ii P , any stage k, and any two cycles T , T with
IΠ k T IΣ k T i uekΠ
T
1
k=3 I0
k=4 I0 Ip
(b) for all registers R out k which are visible or updated at the
end of cycle T :
RTΠ1 RΣT 1
¼ Rui if T Tp
Ri if T Tp
T 1 T 1 ¼
RΠ RΣ R p
With respect to the pipelined execution, there are three types of pairs
k T for which the values ST and RT 1 of the signals S and output regis-
ters R of stage k are of interest (figure 5.20):
For the first cycle, the theorem makes an assumption about the con-
tents of all registers and memories R C independent of the stage
they belong to (box 0).
For the final cycle Tp , claim 2 covers all the registers and memories
R C independent of the stage they belong to (box 2).
The above theorem and the simulation theorem 4.11 of the DLX design
without interrupt handling are very similar. Thus, it should be possible to
largely reuse the proof of theorem 4.11. Signal JISR of the designs DLXΣ
and DLXΠ is the counterpart of signal reset in the designs DLXσ and DLXπ.
This pair of signals is used to initialize the PC environment and they mark
the start of the execution. In the sequential designs, the execution is started
in cycle 1, whereas in the pipelined designs, it is started in cycle 0:
resetσ 1
JISRΣ 1
JISR0Π resetπ0 1
it follows from the hypothesis of the theorem and the update enable flags
that
d pc0Π DPCΠ 1 0
DPCΣ DPCΣ
1
The memory IM is read-only and therefore keeps its initial contents. Thus,
on design DLXΠ in cycle T 0 stage 0 has the same inputs as on design
DLXΣ in cycle T 1.
Note that the stages k of the designs DLXΣ and DLXΠ generate the same
signals S and update their output registers in the same way, given that they
get identical inputs. This also applies to the data memory DM and its write
request signal Dmw 3 which in either design is disabled if the instruction
encounters a page fault on fetch. Thus, with the new dateline lemma 5.9,
the induction proof of claim 1 can be completed as before.
Claim 2 is new and therefore requires a full proof. For the output regis-
ters of stage 4, claim 1 already implies claim 2. Furthermore, in the designs
DLXΣ and DLXΠ , the instruction memory is never updated. Thus, claim 2
only needs to be proven for the two program counters PC’ and DPC, and
for the data memory DM.
The instruction sequence P of the sequential design was constructed
such that instruction Ip causes an interrupt. Since signal JISR is generated
in stage 4, claim 1 implies
T¼ T
JISRΣp JISRΠp 1
In either design, the two PCs are initialized on an active JISR signal, and
therefore
T ¼ 1 Tp 1
DPCΣp SISR DPCΠ
T ¼ 1 T 1
PC Σp SISR 4 PC Πp
The data memory DM belongs to the set out 3. For stage 3, the two
scheduling functions imply
IΠ 3 Tp 1 IΣ 3 Tp 1 p
#
'/
In the sequential design, the data memory is only updated when the in-
struction is in stage 3, i.e., when f ull 3 1. Claim 1 then implies that C ORRECTNESS OF
THE I NTERRUPT
T T¼
DMΠp DMΣp DM p H ARDWARE
JISR is only signaled if f ull 4 1. For cycle Tp , the sequential stall engine
then implies that
T¼ T¼
f ull 3Σp 1 and DmwΣp 0
Thus, the data memory is not updated during JISR, and therefore
T¼ T ¼ 1
DMΣp DMΣp
In the pipelined design, the write enable signal of the data memory is gen-
erated as
Since signal Dmw3 is disabled on an active JISR signal, the data memory
is not updated during cycle Tp , and therefore,
T 1 T
DMΠp DMΠp
Pi Ii 0
Ii pi
Ii pi δi
This means that for any sequence Pi , instruction Ii 0 is preceded by JISR,
Ii pi δi is the last instruction fetched before the jump to the ISR. For the
The external interrupt events are assigned as before. The scheduling func-
tions are extended in an obvious way. For the designs DLXΣ and DLXΠ ,
IΣ k T i j and IΠ k T i j
##
'
Like the two DLX designs without interrupt hardware, the designs DLXΣ
I NTERRUPT and DLXΠ are started by reset and not by JISR. Lemmas 5.8 and 5.10
H ANDLING imply that after reset, both designs come up gracefully; one cycle after
reset JISR 1 and the designs initiate a jump to ISR(0). Thus, we can
now formulate the general simulation theorem for the designs DLXΣ and
DLXΠ :
resetΣ 2
1 resetΠ 1
Let both designs be started with identical contents, i.e., any register and
memory R of the data paths satisfies
RΣ 1 R0Π
(5.4)
JISRΣ 1
1 JISR0Π
As shown in the proof of theorem 5.11 claim 2, both designs initialize the
PCs on JISR in the same way, thus
PC DPC0Σ PC DPC1Π
The instruction memory is ready-only, and the update of the data memory
is disabled on ue3 0. Table 5.17 and equation 5.4 therefore imply
IM DM 0Σ IM DM 1Π
In either design, the output registers of stage 4 are updated during JISR.
Since stage 4 gets identical inputs it also produces identical outputs. Thus,
R C R0Σ R1Π
IΠ 4 T1 1 p1 and IΣ 4 T1 1 p1
#&
')
T
k=0 I(1,0) I(2,0) S ELECTED
box’ 1 R EFERENCES AND
k=1
box’ 0 F URTHER R EADING
k=2
box 0 box 1 box 2
k=3
k=4 I(1,p)
Scheduling of the first two subsequences P1 P2 for the pipelined ex-
ecution of sequence Q
T ¼ 1 T 1
R C RΣ1 RΠ1 (5.5)
IΠ 4 Ti i pi JISRTΠi 1 IΠ 0 Ti i 1 0
For the first two subsequences, figure 5.21 illustrates this scheduling be-
havior.
Thus, cycle T1 1 corresponds to the cycle 0 of the sequential execution
of P2 , and that cycle T1 1 corresponds to the cycle 1 of the pipelined
execution of P2 . Equation 5.5 then implies that the subsequences P2 and P2
are started in the same configuration, and that theorem 5.11 can be applied.
With the same arguments, the theorem follows by induction on the sub-
sequences of Q and Q .
NTERRUPT SERVICE routines which are not nested are, for example,
described in [PH94]. Mechanisms for nested interrupts are treated in
[MP95] for sequential machines and in [Knu96] for pipelined machines.
#'
'
0 %&
I NTERRUPT
H ANDLING Let t1 and t2 be cycles of machine DLXΠ, and let t1 t2 . Sup-
pose external interrupts i and j are both enabled, interrupt i becomes active
in cycle t1 , interrupt j becomes active in cycle t2 , and no other interrupts
are serviced or pending in cycle t2 .
Invalid address exception. Two addresses are stored in spe-
cial purpose registers UP and LOW . A maskable exception of type abort
has to be signalled, if a memory location below LOW or above UP is ac-
cessed.
#/
Chapter
6
Memory System Design
N THE simplest case, the memory system is monolithic, i.e., it just com-
prises a single level. This memory block can be realized on-chip or
off-chip, in static RAM (SRAM) or in dynamic RAM (DRAM). DRAM is
about 4 to 10 times cheaper and slower than SRAM and can have a 2 to
4 times higher storage capacity [Ng92]. We therefore model the cost and
(
delay of DRAM as
M EMORY S YSTEM
D ESIGN CDRAM A d CSRAM A d α
DDRAM A d α DSRAM A d
with α 4 8 16. Thus, on-chip SRAM yields the fastest memory sys-
tem, but that solution has special drawbacks, as will be shown now.
Chapter 3 describes the sequential design of a DLX fixed point core. The
main memory is treated as a black box which has basically the function-
ality of a RAM; its temporal behavior is modeled by two parameters, the
(minimal) memory access time dmem and the memory status time dmstat .
All CPU internal actions of this DLX design require a cycle time of
τCPU 70 gate delays, whereas the memory access takes TM 18 dmem
delays. If a memory access is performed in 1 W cycles, then the whole
DLX fixed point unit can run at a cycle time of
TM
τDLX max τCPU (6.1)
W 1
The parameter W denotes the number of wait states. From a performance
point of view, it is desirable to run the memory without wait states and at
the speed of the CPU, i.e.,
Under these constraints, the memory access time dmem can be at most 52
gate delays. On-chip SRAM is the fastest memory available. According to
our hardware model, such an SRAM with A entries of d bits each has the
following cost and access time
The main memory of the DLX is organized in four banks, each of which
is one byte wide. If each bank is realized as an SRAM, then equation (6.2)
limits the size of the memory to
4A 4 2 52
103
216 bytes
That is much to small for main memory. Nevertheless, these 64 kilo bytes
of memory already require 1.3 million gates. That is roughly 110 times the
&
(
Signals of the bus protocol A M ONOLITHIC
M EMORY D ESIGN
signal type CPU memory
data of the write read
MDat bidirectional
memory access read write
MAd memory address unidirectional write read
burst burst transfer
status
w/r write/read flag unidirectional write read
flag
BE byte enable flags
req request access write read
hand-
reqp request pending unidirectional
shake read write
Brdy bus ready
cost of the whole DLX fixed point core CDLX 11951. Thus, a large,
monolithic memory system must be implemented off-chip, and a memory
access then definitely takes several CPU cycles. The access time of the
main memory depends on many factors, like the memory address and the
preceding requests. In case of DRAMs, the memory also requires some
time for internal administration, the so called refresh cycles. Thus, the
main memory has a non-uniform access time, and in general, the processor
cannot foresee how many cycles a particular access will take. Processor
and main memory therefore communicate via a bus.
There exist plenty of bus protocols; some are synchronous, the others are
asynchronous. In a synchronous protocol, memory and processor have a
common clock. That simplifies matters considerably. Our memory designs
therefore uses a synchronous bus protocol similar to the pipelined protocol
of the INTEL Pentium processor [Int95].
The bus signals comprise the address MAd and the data MDat of the
memory access, the status flags specifying the type of the access, and the
handshake signals coordinating the transfer. The data lines MDat are bidi-
rectional, i.e., they can be read and written by both devices, the processor
and the memory system. The remaining bus lines are unidirectional; they
are written by one device and read by the other (table 6.1). The protocol
uses the three handshake signals request (req), request pending (reqp), and
bus ready (Brdy) with the following meaning:
&
(
Request is generated by the processor. This signal indicates that a
M EMORY S YSTEM new transfer should be started. The type of the access is specified by
D ESIGN some status flags.
The main memory provides its handshake signals reqp and Brdy one cycle
ahead. That leaves the processor more time for the administration of the
bus. During the refresh cycles, the main memory does not need the bus.
Thus, the processor can already start a new request but the main memory
will not respond reqp 1 Brdy 0 until the refresh is finished.
-
The data unit to be transferred on the bus is called bus word. In our memory
design, the bus word corresponds to the amount of data which the processor
can handle in a single cycle. In this monograph, the bus width is either 32
bits or 64 bits. The memory system should be able to update subwords
(e.g., a single byte) and not just a whole bus word. On a write access, each
byte i of the bus word is therefore accompanied by an enable bit BEi.
On a burst transfer, which is indicated by an active burst flag, MAd
specifies the address of the first bus word. The following bus words are
referenced at consecutive addresses. The bus word count bwc specifies the
number of bus words to be transferred. Our protocol supports burst reads
and burst writes. All the bursts have a fixed length, i.e., they all transfer the
same amount of data. Thus, the bwc bits can be omitted; the status flags of
the bus protocol comprise the write/read flag wr, the burst flag, and the
byte enable flags BE.
MAd address
burst
w/r
req
reqp
Brdy
MDat D1 D2 D3 D4
flag. The memory announces the data by an active bus ready signal Brdy
1, one cycle ahead of time. After a request, it can take several cycles till
the data is put on the bus. During this time, the memory signals with
repq 1 that it is performing an access. This signal is raised one cycle
after the request and stays active repq 1 till one cycle before a new
request is allowed. The processor turns the address and the status signals
off one cycle after req 0. A new read access can be started one cycle
after req 0 and reqp 0.
On a burst read any of the bus words can be delayed by some cycles, not
just the first one. In this case, the Brdy line toggles between 0 and 1. The
burst transfer of figure 6.2 has a 4-2-3-1 access pattern; the first bus word
arrives in the fourth cycle, the second word arrives two cycles later, and so
on. The fastest read access supported by this protocol takes 2 bwc bus
cycles. The first word already arrives two cycles after the request.
&#
(
M EMORY S YSTEM MAd 1st address 2nd address 3rd address
D ESIGN
burst
w/r
req
reqp
Brdy
MDat D D1 D2 D3 D4
(figure 3.11). Based on the control signals and the offset, the memory
control MC generates the bank write signals mbw[3:0] which enable the
update of the memory (figure 3.12).
Now the memory system (figure 6.5) consists of the off-chip main mem-
ory M, the memory interface Mi f , the memory interface control Mi fC,
and the original memory control MC. The memory interface connects the
memory M to the data paths.
The memory of the 32-bit DLX architecture is byte addressable, but all
reads and the majority of the writes are four bytes (one word) wide. Thus,
the data bus MDat between the processor and the main memory can be
made one to four bytes wide. On a one-byte bus, half word and word
accesses require a burst access and take at least one to three cycles longer
than on a four-byte bus. In order to make the common case fast, we use
a four-byte data bus. On a write transfer, the 32-bit data are accompanied
by four byte enable flags BE 3 : 0. Since bursts are not needed, we restrict
the bus protocol to single-word transfers.
Memory Interface The memory interface Mif which connects the data
paths to the memory bus and the external memory uses 32-bit address and
data lines. Interface Mif forwards the data from the memory bus MDat to
the data output MDout. On MAdoe 1, the interface puts the address MA
on the address bus MAd, and on MDindoe 1 it puts the data MDin on
the bus MDat.
Except for the memory interface Mif, the data paths of environment
Menv are off-chip and are therefore not captured in the cost model. Thus,
CMenv CMi f 2 Cdriv 32
wr mw
MAdoe mem mr mw
MDindoe mw
The handshake signals are more complicated. Signal req is only active
during the first cycle of the access, and signal mbusy is always active except
during the last cycle of the transfer. Thus, for the control MifC a single
transfer is performed in three steps. In the first step, MifC starts the off-
chip transfer as soon as reqp 0. In the second step, which usually takes
several cycles, MifC waits till the memory signals Brdy 1. In the third
step, the transfer is terminated. In addition, MifC has to ensure that a
new request is only started if in the previous cycle the signals reqp and
Brdy were inactive. Since the accesses are not overlapped, this condition
is satisfied even without special precautions.
The signals req and mbusy are generated by a Mealy automaton which
is modeled by the FSD of figure 6.6 and table 6.2. According to section
2.6, cost and delay of the automaton depend on the parameters listed in
&/
(
else D2
M EMORY S YSTEM D1 D3
D ESIGN start wait finish
FSD underlying the Mealy automaton MifC; the initial state is start.
Disjunctive normal forms DNF of the Mealy automaton of MifC
DNF source state target state monomial m M length l m
D1 start wait mem 1
D2 wait wait /Brdy 1
D3 wait finish Brdy 1
table 6.3 and on the accumulated delay of its inputs Brdy and mem. Let
CMealy Mi fC denote the cost of the automaton, then
The input Brdy only affects the next state of the automaton, but input
mem also affects the computation of the Mealy outputs. Let the main mem-
ory provide the handshake signals with an accumulated delay of AM Brdy,
and let the bus have a delay of dbus . The inputs and outputs of the automa-
ton then have the following delays:
+%
Table 6.4 lists the cost of the DLX design and of the environments affected
by the change of the memory interface. The new memory interface is fairly
cheap and therefore has only a minor impact on the cost of the whole DLX
design.
&)
(
Parameters of the Mealy automaton used in the control MifC A M ONOLITHIC
M EMORY D ESIGN
# states # inputs # and frequency of outputs
k σ γ νsum νmax
3 2 2 3 2
fanin of the states # and length of monomials
fanmax fansum #M lsum lmax
2 3 3 3 1
Cost of the memory interface Mi f , of the data paths DP, of the control
CON and of the whole DLX for the two memory interfaces.
Cycle Time The cycle time τDLX of the DLX design is the maximum of
three times, namely: the cycle time TCON required by the control unit, the
time TM of a memory access, and the time TDP for all CPU internal cycles.
The connection of the DLX to an off-chip memory system only affects the
memory environment and the memory control. Thus, the formula of TCON
and TM need to be adapted, whereas the formula of TDP remains unchanged.
So far, time TCON accounted for the update of the main control automaton
Tauto and for the cycle time Tstall of the stall engine. The handling of
the bus protocol requires a Mealy automaton, which needs to be updated
as well; that takes TMealy Mi fC delays. In addition, the new automaton
provides signal mbusy to the stall engine. Therefore,
&*
(
Timing of Memory Accesses The delay formula of a memory access
M EMORY S YSTEM changes in a major way. For the timing, we assume that the off-chip mem-
D ESIGN ory is controlled by an automaton which precomputes its outputs. We fur-
ther assume that the control inputs which the off-chip memory receives
through the memory bus add dMhsh (memory handshake) delays to the cy-
cle time of its automaton.
The memory interface starts the transfer by sending the address and the
request signal req to the off-chip memory. The handshake signals of the
DLX processor are valid AMi f C delays after the start of the cycle. For-
warding signal req and address MA to the memory bus and off-chip takes
another Ddriv dbus delays, and the processing of the handshake signals
adds dMhsh delays. Thus, the transfer request takes
After the request, the memory performs the actual access. On a read
access, the memory reads the memory word, which on a 64 MB memory
takes DMM 64MB gate delays. The memory then puts the data on the
bus through a tristate driver. The memory interface receives the data and
forwards them to the data paths where they are clocked into registers. The
read cycle therefore takes at least
Table 6.5 lists the cycle times of the data paths and control, as well as
the access and request time of the memory system, assuming a bus delay
of dbus 15 and dMhsh 10. The access time of the memory depends on
the version of the DRAM used.
The control and the memory transfer time are less time critical. They can
tolerate a bus delay and handshake delay of dbus dMhsh 56 before they
slow down the DLX processor. However, the actual memory access takes
much longer than the other cycles, even with the fastest DRAM α 4.
In order to achieve a reasonable processor cycle time, the actual memory
access is performed in W cycles; the whole transfer takes W 1 cycles.
'
(
Cycle time of the DLX design and of its main parts, which are the data A M ONOLITHIC
paths DP, the control unit CON and the memory system MM. M EMORY D ESIGN
TCON TMaccess
TDP TMreq
maxA B α 4 α 8 α 16
13 dbus 14 dbus dMhsh
70 42 355 683 1339
28 39
The DLX design with a direct connection to the off-chip memory can then
be operated at a cycle time of
Increasing the number W of wait states improves the cycle time of the DLX
design, at least till W TMaccess TDP . For larger W, the main memory
is no longer time critical, and a further increase of the wait states has no
impact on the cycle time.
According to section 4.6, the performance is modeled by the reciprocal
of a benchmark’s execution time, and on a sequential DLX design, the run
time of a benchmark Be is the product of the instruction count IC Be of
the benchmark, of the average cycles per instruction CPI, and of the cycle
time τDLX :
Increasing the number of wait states improves the cycle time, but is also
increases the CPI ratio. Thus, there is a trade-off between cycle time and
cycle count which we now quantify based on SPECint92 benchmark work-
loads. Table 6.6 lists the DLX instruction mix of these workloads and the
number of cycles required per instruction. According to formula (4.8) from
section 4.6, the benchmarks and and the average SPECint92
workload, for example, achieve the following CPI ratios:
instruction mix
CPII
compress eqntott espresso gcc li AV
load 52 W 19.9 30.7 21.1 23.0 31.6 25.3
store 52 W 5.6 0.6 5.1 14.4 16.9 8.5
compute 41 W 55.4 42.8 57.2 47.1 28.3 46.2
call 51 W 0.1 0.5 0.4 1.1 3.1 1.0
jump 31 W 1.6 1.4 1.0 2.8 5.3 2.4
taken 41 W 12.7 17.0 9.1 7.0 7.0 10.6
untaken 31 W 4.7 7.0 6.1 4.6 7.8 6.0
Performance of the DLX core on the and benchmarks and
on the average SPECint92 workload. Parameter α denotes the factor by which
off-chip DRAM is slower than standard SRAM.
system with five wait states. The DLX system then spends about 61%
134 5110 of the run time waiting for the off-chip memory. On the
slower DRAM with α 8 (16), the memory is operated with 10 (19) wait
states, and the DLX even waits 76% (86%) of the time.
Thus a large, monolithic memory has got to be slow, and even in a se-
quential processor design, it causes the processor to wait most of the time.
Pipelining can increase the performance of a processor significantly, but
only if the average latency of the memory system is short W 2. Thus,
the monolithic memory is too slow to make pipelining worthwhile, and
the restriction to a single memory port makes things even worse. In the
next section, we therefore analyze whether a hierarchical memory system
is better suited.
. - -
The key for the nice temporal behavior of multi-level memory is a princi-
ple known as locality of reference [Den68]. This principle states that the
memory references, both for instructions and data, tend to cluster. These
clusters change over time, but over a short time period, the processor pri-
marily works on a few clusters of references. Locality in references comes
in two flavors:
1 Additional considerations come into play, when one level is no random access mem-
ory, like disks or tapes.
'&
(
Temporal Locality After referencing a sequence S of memory loca-
tions, it is very likely that the following memory accesses will also T HE M EMORY
reference locations of sequence S. H IERARCHY
All our designs use byte addressable memory. Let the main memory size be
2m bytes, and let the cache size be 2c bytes. The cache is much smaller than
the main memory; 2c 2m . The unit of data (bytes) transferred between
the cache and the main memory is called block or cache line. In order
to make use of spatial locality, the cache line usually comprises several
memory data; the line sizes specifies how many. The cache size therefore
equals
2c # lines line size
''
(
The cache lines are organized in one of three ways, namely: direct mapped,
M EMORY S YSTEM set associative, or fully associative.
D ESIGN
+ 5
For every memory address a am 1 : 0, the placement policy spec-
ifies a set of cache locations. When the data with memory address a is
brought into the cache, it is stored at one of these locations. In the simplest
case, all the sets have cardinality one, and the memory address a is mapped
to cache address
Since the cache is direct mapped, the c least significant bits of the two
addresses ca and a madr ca are the same, and one only needs to store
the leading m c bits of the memory address as tag:
A cache line therefore comprises three fields, the valid flag, the address
tag, and the data (figure 6.7). Valid flag and tag are also called the directory
information of the cache line. Note that each of the 2l cache lines holds
line-size many memory data, but the cache only provides a single tag and
valid bit per line. Let the cache address ca be a line boundary, i.e., ca is
divisible by the line size 2o , then
Thus, all the bytes of a cache line must belong to consecutive memory
addresses.
On a read access with address ca, the cache provides the valid flag v
valid ca, the tag t tag ca and the data
Each field of the cache line, i.e., valid flag, tag and data, can be updated
separately. A write access to the cache data can update as little as a single
byte but no more than the whole line.
cache data
cache directory
RAM. However, all the sectors of a cache line still have the same tag and
valid flag2 . The line-offset in the memory address is split accordingly in
an s-bit sector address and in a b-bit sector offset, where o s b. Figure
6.8 depicts the organization of such a direct mapped cache.
With sectoring, the largest amount of cache data to be accessed in par-
allel is a sector not a whole line. Thus, on read access with address ca the
sectored cache provides the data
d Csectorcac 1 : b 0b
Ccac 1 : b 0b 2b 1 : cac 1 : b 0b
cache position per cache address ca. These k positions form the set of ca.
There are two special cases of k-way set associative caches:
For k 1, the cache comprises exactly one way; the cache is direct
mapped.
If there is only one set, i.e., each way holds a single line, then each
cache entry is held in a separate way. Such a cache is called fully
associative.
The associativity of a set associative, first level cache is typically 2 or 4.
Occasionally a higher associativity is used. For example, the PowerPC
uses an 8-way cache [WS94] and the SuperSPARC uses a 5-way instruc-
tion cache [Sun92]. Of course, the cache line of a set associative cache can
be sectored like a line in a direct mapped cache. For simplicity’s sake, we
describe a non-sectored, set associative cache. We leave the extension of
the specification to sectored caches as an exercise (see exercise 6.1).
:
Let l denote the width of the line address, and let o denote the width of
the line offset. Each way then comprises 2l lines, and the whole cache
comprises 2l sets. Since in a byte addressable cache, the lines are still 2o
bytes wide, each way has a storage capacity of
¼
size way 2c 2l 2o
'*
(
bytes. The size (in byte) of the whole k-way set associative cache equals
M EMORY S YSTEM
D ESIGN k size way k 2l 2o
Since in a k-way set associative cache there are several possible cache
positions for a memory address a, it becomes more complicated to find
the proper entry, and the placement and replacement policies are no longer
trivial. However, the placement is such that at any given time, a memory
address is mapped to at most one cache position.
For this address ca, every way provides data di , a valid flag vi , and a tag t i :
A local hit signal hi indicates whether the requested data is held in way i
or not. This local hit signal can be generated as
hi vi t i am 1 : m t
In a set associative cache, a hit occurs if one of the k ways encounters a hit,
i.e.,
hit h0 h1 hk 1
On a cache hit, exactly one local hit signal hj is active, and the corre-
sponding way j holds the requested data d. On a miss, the cache provides
an arbitrary value, e.g., d 0. Thus,
k1
dj if hit 1 and h j 1
d d i hi
0 if hit 0
i 0
3 8
In case of a miss, the requested data is not in the cache, and a new line
must be brought in. The replacement policy specifies which way gets the
new line. The selection is usually done as follows:
1. As long as there are vacant lines in the set, the replacement circuit
picks one of them, for example, the way with the smallest address.
(
(
2. If the set is full, a line must be evicted; the replacement policy sug-
gests which one. The two most common policies are the following: T HE M EMORY
H IERARCHY
LRU replacement picks the line which was least recently used.
For each set, additional history flags are required which store
the current ordering of the k ways. This cache history must be
updated on every cache access, i.e., on a cache hit and on a line
replacement.
Random replacement picks a random line of the set and there-
fore manages without cache history.
1. Read Allocate: A write hit always updates the data RAM of the
cache. On a write miss, the requested data and the corresponding
line will not be transferred into the cache. Thus, new data is only
brought in on a read miss.
2. Write Allocate: A write always updates the data RAM of the cache.
In case of a write miss, the referenced line is first transferred from
the memory into the cache, and then the cache line is updated. This
policy allocates new lines on every cache miss.
3. Write Invalidate: A write never updates the data RAM of the cache.
On the contrary, in case of a write hit, the write even invalidates the
cache line. This allocation policy is less frequently used.
even avoid some of them. The latter results in a weak memory consistency.
The write policy specifies which of the two consistency models should be
used:
2. Write Back applies the weak consistency model. A write hit only
updates the cache. A dirty flag indicates that a particular line has
been updated in the cache but not in the main memory. The main
memory keeps the old data till the whole line is copied back. This
either occurs when a dirty cache line is evicted or on a special update
request. This write policy can be combined with read allocate and
write allocate but not with write invalidate (exercises in section 6.7).
Table 6.9 lists the possible combinations of the allocation and write poli-
cies.
Cache accesses of the memory transactions read, write and line in-
validate on a sectored, write through cache with write allocation.
and the history is updated as well. In case of a miss, the invalidation access
has no impact on the cache. Line invalidation is necessary, if a particular
level of the memory system comprises more than one cache, as it will be
the case in our pipelined DLX design (section 6.5). In that situation, line
invalidation is used in order to ensure that a particular memory word is
stored in at most one of those parallel caches.
The cache as part of the memory hierarchy has to support four types of
memory transactions which are reading (rw 1) or writing (mw 1) a
memory data, invalidating a cache line (linv 1), and initializing the whole
cache. Except for the initialization, any of the memory transactions is
performed as a sequence of the following basic cache accesses:
reading a cache sector including cache data, tag and valid flag,
In the cycle & /& , the cache fetches the last sector of the line. Due to
forwarding, the requested data are provided at the data output of the cache.
In addition, the directory is updated, i.e., the new tag is stored in the tag
RAM and the valid flag is turned back on:
This is the last cycle of a read transaction which does not hit the cache.
with
bytei Csectorway ca if CDwi 0
Xi
bytei Din if CDwi 1
The transaction ends with an 0 & 1 cycle, in which the memory per-
forms the requested write update.
(&
(#
3 .
This transaction also starts with a . access, in order to check A C ACHE D ESIGN
whether the requested line is in the cache. In case of a miss, the line is not
in the cache, and the transaction ends after the . access. In case
of a hit, the line is invalidated in the next cycle (2%+ &):
valid way : 0
Byte addressable, direct mapped cache with L 2 l lines. The cache
line is organized in S 2s sectors, each of which is B 2b bytes wide.
According to the FSD of figure 6.10, all the memory transactions start
with a cache read access ($rd 1); updates of the directory and of the
cache data only occur in later cycles. The design of the k-way set associa-
tive cache will rely on this feature.
Figure 6.11 depicts the data paths of a sectored, byte addressable, direct
mapped cache with L 2l cache lines. The cache consists of valid, tag and
data RAMs and an equality tester. The valid RAM V and the t bits wide
tag RAM T form the cache directory.
Since all sectors of a line share the same tag and valid flag, they are only
stored once; the valid and tag RAM are of size L 1 and L t. Both RAMs
are referenced with the line address a line. The write signals V w and Tw
control the update of the directory. On Tw 0 the tag RAM provides the
tag
tag T a line
and on Tw 1, the tag a tag is written into the tag RAM
The valid RAM V is a special type of RAM which can be cleared in just a
few cycles3 . That allows for a fast initialization on reset. The RAM V is
3 TheIDT71B74 RAM, which is used in the cache system of the Intel i486 [Han93],
can be cleared in two to three cycles [Int96].
((
(#
cleared by activating signal clear. On V w clear 0, it provides the flag
A C ACHE D ESIGN
v V a line
On every cache access, the equality tester EQ checks whether the line
entry is valid and whether the tag provided by the tag RAM matches the
tag a tag. If that is the case, a hit is signaled:
If CDwB 1 : 0 0B and if the access is a hit, the data RAMs are updated.
For every i with CDwi 1, bank i performs the update
The cost of this direct mapped cache (1-way cache, $1) run at:
The cache itself delays the read/write access to its data RAMs and directory
and the detection of a hit by the following amount:
d0 h 0
v 0
v[0:k-1]
k-1
d h k-1 vk-1
h[0:k-1]
Sel: k-way data select
8B
Do hit
Byte addressable, k-way set associative cache. The sectors of a cache
line are B 2b bytes wide.
The core of a set associative cache (figure 6.12) are k sectored, direct
mapped caches with L lines each. The k cache ways provide the local
hit signals hi , the valid flags vi , and the local data di . Based on these sig-
nals, the select circuit Sel generates the global hit signal and selects the
data output Do. An access only updates a single way. The write signal
adapter Wadapt therefore forwards the write signals Tw V w, and CDw to
this active cache way.
The replacement circuit Repl determines the address way of the active
cache way; the address is coded in unary. Since the active cache way
remains the same during the whole memory transaction, address way is
only computed during the first cycle of the transaction and is then buffered
in a register. This first cycle is always a cache read ($rd). Altogether, the
cost of the k way cache is:
C$k t l s b k C$1 t l s b CSel CWadapt
CRepl C f f k
+ -
Each cache way provides a local hit signal hi , a valid flag vi , and the local
data d i . An access is a cache hit, if one of the k-ways encounters a hit:
hit h0 h1 hk 1
()
(#
On a cache hit, exactly one local hit signal hi is active, and the correspond-
ing way i holds the requested data Do. Thus, A C ACHE D ESIGN
Do d j h j
j 0 k 1
When arranging these OR gates as a binary tree, the output Do and the hit
signal can be selected at the following cost and delay:
;
Circuit Wadapt gets the write signals Tw, V w and CDwB 1 : 0 which
request the update of the tag RAM, the valid RAM and the B data RAMs.
However, in a set associative cache, an access only updates the active cache
way. Therefore, the write signal adapter forwards the write signals to the
active way, and for the remaining k 1 ways, it disables the write signals.
Register way provides the address of the active cache way coded in
unary. Thus, the write signals of way i are obtained by masking the signals
Tw, V w and CDwB 1 : 0 with signal bit wayi, e.g.,
V w if wayi 1
V wi V w wayi
0 if wayi 0
The original B 2 write signals can then be adapted to the needs of the set
associative cache at the following cost and delay
CWadapt k Cand B 2
DWadapt Dand
Hw $rd clear
On clear 1, all the history vectors are initialized with the value Hid.
Since the same value is written to all the RAM words, we assume that
(*
(
a_line
M EMORY S YSTEM hit h[0:k-1]
EQ
D ESIGN $rd active
dec
ev
0 H
Al Aw history LRUup EV
RAM 1 H’ h[0:k-1]
Ar LxK Hid
clear hit
clear 2-port clear 0 1
1 Hid Hw 0 1
Hw w Din
0 H’ way
Circuit Repl of a k-way set associative cache with LRU replacement
this initialization can be done in just a few cycles, as it is the case for the
valid RAM. Circuit LRUup determines the new history vector H and the
eviction address ev; circuit active selects the address way.
Updating the cache history involves two consecutive RAM accesses, a
read of the cache history followed by a write to the history RAM. In order
to reduce the cycle time, the new history vector H and the address are
buffered in registers. The cache history is updated during the next cache
read access. Since the cache history is read and written in parallel, the
history RAM is dual ported, and a multiplexer forwards the new history
vector Hl , if necessary. On clear 1, register H is initialized as well. The
cost of circuit Repl can be expressed as:
%
For each set l, circuit Repl keeps a history vector
most (least) recently used. In case of a miss, the cache history suggests
the candidate for the line replacement. Due to LRU replacement, the least
recently used entry is replaced; the eviction address ev equals Hlk 1.
On power-up, the whole cache is invalidated, i.e., all the valid flags in
the k direct mapped caches are cleared. The cache history holds binary but
arbitrary values, and the history vectors Hl are usually not a permutation
of the addresses 0 k 1. In order to ensure that the cache comes up
properly, all the history vectors must be initialized, e.g., by storing the
identity permutation. Thus,
Hid H0 Hk 1
H i i
%
The cache history must be updated on every cache read access, whether
the access is a hit or a miss. The update of the history also depends on the
type of memory transaction. Read and write accesses are treated alike; line
invalidation is treated differently.
Let a read or write access hit the way Hli. This way is at position i in
vector Hl . In the updated vector R, the way Hli is at the first position, the
elements Hl0 Hli 1 are shifted one position to the right, and all the other
M EMORY S YSTEM all the elements of the history vector Hl are shifted one position to the right
D ESIGN and ev is added at the first position:
In case that an invalidation access hits the way Hli , the cache line corre-
sponding to way Hli is evicted and should be used at the next line fill. In
the updated vector I, the way Hli is therefore placed at the last position, the
elements Hli1 Hlk 1 are shifted one position to the left, and the other
If the invalidation access causes a cache miss, the requested line is not in
the cache, and the history remains unchanged: I Hl . Note that the vector
I can be obtained by shifting cyclically vector R one position to the left
I R1 Rk 1
R 0 (6.3)
by passing the local hit signals hk 1 : 0 through an encoder. The flag
xi 1 Hli J hit 1
where yi 0 indicates that the active cache way is not among the first i
positions of the history vector Hl . Thus, the first element of the updated
history vector R can be expressed as
J if hit 1
R0
Hlk 1
if hit 0
/
(#
hit
enc
h[0:k-1] J A C ACHE D ESIGN
x[0]
H0 EQ
log k parallel
...
prefix
...
OR
Hk-1 x[k-1]
... EQ
K y[k-1:1]
Hsel hit
ev H’
The cost of the whole history update circuit LRUup run at:
1 0 linv
H’
where K k log k. Note that these delays already include the propagation
delay of the register. Thus, clocking just adds the setup time δ.
Hl1 ; on a miss
way1 way0 way1
h1 ; on a hit
1 0 1
H l way1 XNOR linv H l H l
Thus, it suffices to keep one history bit per set, e.g., Hl1 . That simplifies
the LRU replacement circuit significantly (figure 6.16), and the initializa-
/&
(#
Repl $rd
EQ hit A C ACHE D ESIGN
$rd
way[0]
h1 0
a_line way[1]
Al Aw history 1
RAM 0 linv
Ar Lx1 $rd
1 H1
clear 2-port
$rd w Din H’
tion after power-up can be dropped. Since an inverter is not slower than an
XNOR gate, the cost and delay of circuit Repl can then be estimated as
The cache also updates the directory, the data RAMs and the cache history.
The update of the cache history H is delayed by
D$k ; H maxD$1 hit DRepl hi D$k hit DRepl hit DRepl a
Thus, the propagation delay from a particular input to the storage of the
k-way cache can be expressed as:
the signals $rd $w linv and l f ill specifying the type of the cache
access,
Table 6.11 lists the active control signals for each state of the standard
memory transactions.
/(
(#
a_byte CDw CDw hit hit
b AdG
a_sector rs A C ACHE D ESIGN
s ma ca a CACHE
a_line
Di Do
a_tag l+t Din 0 valid clear
s 1 rs
0 2b+3
lfill sector 0
1 Dout
MAd[31:b] 1
MDat 2b+3 $forw rs lfill
The cache interface receives the address a and the data Din and MDat.
Since all cache and memory accesses affect a whole sector, address a is a
sector boundary:
a byte 0
and the cache and memory ignore the offset bits a byte of the address.
The interface $if provides a hit signal, the data Dout, the memory address
MAd, a cache address, and the input data Di of the cache. On a line fill, Di
is taken from the memory data bus MDat, whereas on a write hit access,
the data is taken from Din
MDat if l f ill 1
Di (6.4)
Din if l f ill 0
Figure 6.17 depicts an implementation of such a cache interface. The
core of the interface is a sectored k-way cache, where k may be one. The
width of a sector (B 2b bytes) equals the width of the data bus between
the main memory and the cache. Each line comprises S 2s sectors. A
multiplexer selects the input data Di of the cache according to equation
6.4. The address generator circuit AdG generates the addresses and bank
write signals CDw for the accesses. Circuit $ f orw forwards the memory
data in case of a read miss. The cost of the cache interface runs at
C$i f t l s b C$k t l s b Cmux B 8 CAdG C$ f orw
C$ f orw 2 Cmux B 8 C f f B 8
lfill 1 0 EQ 0 1 lfill
inc
ca rs ma CDw[B-1:0]
Address generation for the line fill of a sectored cache. The outputs
ca and ma are the low order bits of the cache and memory address. Signal rs
indicates that the current sector equals the requested sector.
memory and of the cache address equal the sector address a sector. On
a line fill (l f ill 1), the whole line must be fetched from main memory.
The memory requires the start address of the cache line:
MAd 31 : b a tag a line ma
with
a sector if l f ill 0
ma
0s if l f ill 1
Thus, the address generator clears ma on a line fill.
On a line fill, the cache line is updated sector by sector. The address
generator therefore generates all the sector addresses 0 2s 1 for the
cache, using an s-bit counter scnt. The counter is cleared on scntclr 1.
The sector bits of the cache address equal
a sector if l f ill 0
ca
scnt if l f ill 1
In addition, circuit AdG provides a signal rs (requested sector) which indi-
cates that the current sector with address scnt equals the requested sector
rs 1 a sector scnt
This flag is obtained by an s-bit equality tester.
The address generator also generates the bank write signal CDwB 1 :
0 for the data RAM of the cache. Because of write allocate, the data RAM
is updated on a line fill and on a write hit (table 6.11). On a line fill, signal
Sw requests the update of the whole cache sector CDwB 1 : 0 1,
whereas on a write hit $w 1, the bank write signals of the memory
determine which cache banks have to be updated. Thus, for 0 i B, the
bank write signal CDwi is generated as
CDwi Sw MBW i $w
/)
(#
By cs$i f , we denote all the control inputs of the cache interface. These
signals are provided by the control unit CON. The data paths provide the A C ACHE D ESIGN
address a. Let ACON cs$i f and ADP a denote the accumulated delay of
these inputs. The cost and the cycle time of circuit AdG and the delay of
its outputs can then be expressed as
With respect to the on-chip cycles, the output Dout and the input data Di
of the cache have the following accumulated delays:
The k-way cache comprises RAMs and registers, which have to be up-
dated. The actual updating of a register includes the delay Df f of the reg-
ister and the setup time δ, whereas the updating of a RAM only includes
the setup time. The additional delay Df f for the registers is already incor-
porated in the delay of the k-way cache. In addition to the cache address
ca, the cache also needs the input data Di and the write signals in order to
update its directory and cache data. The minimal cycle time of the cache
interface can therefore be expressed as:
N SECTION 6.1.3, it has turned out that the sequential DLX core which
is directly connected to the slow external memory spends most of its
run time waiting for the memory system. We now analyze whether a fast
cache between the processor core and the external memory can reduce
this waiting time. Adding the cache only affects the memory environment
Menv and the memory control. As before, the global functionality of the
memory system and its interaction with the data paths and main control of
the DLX design remain the same.
)
(&
32 MDin 64
MDRw BE S EQUENTIAL DLX
MA[2] Din di
req WITH C ACHE
[31:0]
MDat M
0 do
w/r M EMORY
MDout $if burst
Dout
1 MAd a reqp
[63:32]
Brdy
MA[31:3] a clear hit
Dif Mif
reset
5 % 1
Figure 6.19 depicts the memory environment Menv. The cache interface
$i f of section 6.3 is placed between the memory interface Mif and the data
paths interface Dif. The cache interface implements the write through,
write allocate policy. Since there is only a single cache in the DLX design,
line invalidation will not be supported. The cache is initialized/cleared on
reset. The off-chip data bus MDat and the cache sectors are B 2b 8
bytes wide.
Memory Interface Mif The memory interface still forwards data and
addresses between the off-chip memory and the memory environment.
However, the memory address MAd is now provided by the cache inter-
face, and the data from the memory data bus are forwarded to the data
input MDat of the cache interface.
Interface Dif The cache interface is connected to the data paths through
a 32-bit address port MA and two data ports MDin and MDout. In the
memory environment, the data busses are 64 bits wide, whereas in the data
paths they are only 32 bit wide. Thus, the data ports must be patched
together. On the input port MDin, circuit Di f duplicates the data MDRw
On the output port Dout, a multiplexer selects the requested 32-bit word
within the double-word based on the address bit MA[2]:
Dout 31 : 0 if MA2 0
MDout
Dout 63 : 32 if MA2 1
mbw j MA2 ; i 1
Mbw4 i j j0 3
mbw j MA2 ; i 0
Stores always take several cycles, and the bank write signals are used in
the second cycle, at the earliest. The memory control therefore buffers the
signals Mbw in a register before feeding them to the cache interface and
to the byte enable lines BE of the memory bus. Register MBW is clocked
during the first cycle of a memory transaction, i.e., on $rd 1:
Mbw7 : 0 if $rd 1
MBW 7 : 0 :
MBW 7 : 0 if $rd 0
Thus, circuit MC provides the signal MBW at zero delay
AMC MBW 0
)
(&
The cost and cycle time of the memory control MC run at
S EQUENTIAL DLX
CMC CMC mbw Cand 8 Cinv C f f 8 WITH C ACHE
The remaining MifC control signals are Moore signals. Since the automa-
ton precomputes its Moore outputs, these control signals are provided at
zero delay
AMi f C AMi f C Moore 0
The MifC automaton receives the inputs mw and mr from the main con-
trol, the hit signal from the cache interface, and the handshake signals Brdy
and reqp from the memory. These inputs have an accumulated delay of
5 %
As in the DLX design without cache, we assume that the off-chip memory
is controlled by an automaton which precomputes its outputs and that the
control inputs which the off-chip memory receives through the memory
bus add dMhsh delays to the cycle time of its automaton. With a cache, the
)#
(
/Brdy /Brdy * reqp
M EMORY S YSTEM fill req wait /Brdy * /reqp lastwait /Brdy
D ESIGN /Brdy * reqp Brdy /Brdy * /reqp Brdy
/hit * mw Brdy Brdy * /reqp
/hit * mr fill last fill
Brdy * reqp mr
$RD mw
hit * mw
else Brdy
last M write M $write
/Brdy
FSD of the MifC control automaton; $RD is the initial state.
Active control signals for the FSD modeling the MifC control. Signals
$rd and mbusy are Mealy signals, the remaining signals are Moore signals.
off-chip memory only performs a burst read access or a single write access.
Both accesses start with a request cycle.
The memory interface starts the memory access by sending the address
and the request signal req to the off-chip memory, but the address is now
provided by the cache interface. That is the only change. Forwarding
signal req and address MAd to the memory bus and off-chip still takes
Ddriv dbus delays, and the processing of the handshake signals adds dMhsh
delays. Thus, the memory request takes
TMreq maxAMi f C A$i f MAd Ddriv dbus dMhsh ∆
After the request, the memory performs the actual access. The timing
of the single write access is modeled as in the design without cache. The
)&
(&
Parameters of the MifC Mealy automaton; index (1) corresponds to S EQUENTIAL DLX
the Moore signals and index (2) to the Mealy signals. WITH C ACHE
M EMORY
# states # inputs # and frequency of the outputs
k σ γ νsum νmax1 νmax2
9 5 15 40 7 4
fanin of the states #, length, frequency of the monomials
fansum fanmax #M lsum lmax lmax2
18 3 14 24 2 2
memory interface sends the data MDin and the byte enable bits. Once the
off-chip memory receives these data, it performs the access:
We assume, that for the remaining sectors, the actual memory access time
can be hidden. Thus, the cache interface receives the next sector with
a delay of Ddriv dbus . Circuit $if writes the sector into the cache and
forwards the sector to the data paths where the data are multiplexed and
clocked into a register:
Due to the memory access time, the write access and the reading of the
first sector take much longer than the CPU internal cycles. Therefore, they
are performed in W CPU cycles.
If a read access hits the cache, the off-chip memory is not accessed
at all. The cache interface provides the requested data with an delay of
A$i f Dout . After selecting the appropriate word, data MDout is clocked
into a register:
Updating the cache interface on a read or write access takes T$ . Thus, the
memory environment of the DLX design requires a CPU cycle time of at
least
%
Presently (2000) large workstations have a first level cache of 32KB to
64KB (table 6.8), but the early RISC processors (e.g. MIPS R2000/3000)
started out with as little as 4KB to 8KB of cache. We consider a cache
size of 16KB for our DLX design. This sectored, direct mapped cache is
organized in 1024 lines. A cache line comprises S 2 sectors, each of
which is B 8 bytes wide. The cache size and other parameters will be
optimized later on.
According to table 6.14, the 16KB cache increases dramatically the cost
of the memory environment Menv (factor 1200) and of the DLX processor
(factor 31), but the cost of the control stays roughly the same. Adding a
first level cache makes the memory controller MC more complicated; its
automaton requires 9 instead of 3 states. However, this automaton is still
fairly small, and thus, the whole DLX control is only 30% more expensive.
Table 6.15 lists the cycle times of the data paths, the control, and the
memory system. The stall engine generates the clock and write signals
based on signal mbusy. Due to the slow hit signal, signal mbusy has a
much longer delay. That more then doubles the cycle time of the control,
which now becomes time critical. The cycle time τDLX of the DLX core is
increased by a factor of 1.27.
A memory request, a cache update, and a cache read hit can be per-
formed in a single processor cycle. The time TMrburst is also not time crit-
ical. Reading the first word from the off-chip memory requires several
processor cycles; the same is true for the write access (TMaccess ). Since
the memory data is written into a register and into the cache, such a read
)(
(&
Cycle time of the DLX design which and without cache memory S EQUENTIAL DLX
WITH C ACHE
cache Ahit Ambusy TMi f C Tstall TCON TDP
M EMORY
no – 7 28 33 42 70
16KB 55 64 79 89 89 70
TMaccess
cache T$i f T$read TMreq TMrburst
α 4 α 8 α 16
no – – 39 – 355 683 1339
16KB 48 57 36 63 391 719 1375
Increasing the number W of wait states improves the cycle time, but it also
increases the CPI ratio. There is a trade-off between cycle time and cycle
count.
CPI Ratio For a given benchmark, the hit ratio ph measures the fraction
of all the memory accesses which are cache hits, and the miss ratio pm
1 ph measures the fraction of the accesses which are cache misses. This
means that the fraction pm of the memory accesses is a cache miss and
requires a line fill.
Let CPIideal denote the CPI ratio of the DLX design with an ideal mem-
ory, i.e., with a memory which performs every access in a single cycle.
In analogy to the CPI ratio of a pipelined design (section 4.6), the cache
misses and memory updates can be treated as hazards. Thus, the CPI ratio
of the DLX design with L1 cache can be expressed as:
The CPI ratio of the DLX design with ideal memory can be derived from
the instruction mix of table 6.6 in the same manner as the CPI ratio of the
DLX without cache. That table also provides the frequency of the loads
and stores. According to cache simulations [Kro97, GHPS93], the 16KB
direct mapped cache of the DLX achieves a miss ratio of 33% on the
SPECint92 workload. On the compress benchmark, the cache performs
slightly better pm 31%. Thus, the DLX with 16KB cache yields on
these two workloads a CPI ratio of
Based on these formulae, the optimal cycle time and optimal number of
wait states can be determined as before. Although the CPI and TPI ra-
tios vary with the workload, the optimal cycle time is the same for all the
))
(&
Optimal cycle time and number W of wait states S EQUENTIAL DLX
WITH C ACHE
L1 α4 α8 α 16
M EMORY
cache W τ W τ W τ
no 5 71 10 70 19 71
16KB 5 89 8 90 16 89
CPI and TPI ratios of the two DLX designs on the compress bench-
mark and on the average SPECint92 workload.
Cost Performance Trade-Off For any two variants A and B of the DLX
design, the parameter eq specifies the quality parameter q for which both
variants are of the same quality:
1 1
q 1q
q 1q
CA T PIA CB T PIB
)*
(
For quality parameters q eq, the faster of the two variants is better, and
M EMORY S YSTEM for q eq, the cheaper one is better. For a realistic quality metric, the
D ESIGN quality parameter q lies in the range of 02 05.
Depending on the speed of the off-chip memory, the break even point lies
between 0.14 and 0.28 (table 6.18). The DLX with cache is the faster of the
two designs. Thus, the 16KB cache improves the quality of the sequential
DLX design, as long as the performance is much more important than the
cost.
Altogether, it is worthwhile to add a 16KB, direct mapped cache to the
DLX fixed point core, especially in combination with a very slow external
memory. The cache increases the cost of the design by a factor of 31, but
it also improves the performance by a factor of 1.8 to 3.7. However, the
DLX still spends 13% to 30% of its run time waiting for the main memory,
due to cache misses and write through accesses.
Every cache design has many parameters, like the cache size, the line size,
the associativity, and the cache policies. This section studies the impact
of these parameters on the performance and cost/performance ratio of the
cache design.
Spatial Locality The cache also makes use of the spatial locality, i.e.,
whenever the processor accesses a data, it is very likely that it soon ac-
*
(&
Miss ratio of a direct mapped cache depending on the cache size [K S EQUENTIAL DLX
byte] and the line size [byte] for the average SPECint92 workload; [Kro97]. WITH C ACHE
M EMORY
cache line size [byte]
size 8 16 32 64 128
1 KB 0.227616 0.164298 0.135689 0.132518 0.150158
2 KB 0.162032 0.112752 0.088494 0.081526 0.088244
4 KB 0.109876 0.077141 0.061725 0.057109 0.059580
8 KB 0.075198 0.052612 0.039738 0.034763 0.034685
16 KB 0.047911 0.032600 0.024378 0.020493 0.020643
32 KB 0.030686 0.020297 0.015234 0.012713 0.012962
64 KB 0.020660 0.012493 0.008174 0.005989 0.005461
cesses a data which is stored close by. Starting a memory transfer requires
W cycles, and then the actual transfer delivers 8 bytes per cycle. Thus
fetching larger cache lines saves time, but only if most of the fetched data
are used later on. However, there is only limited amount of spatial locality
in the programs.
According to table 6.19, the larger line sizes reduces the miss ratio sig-
nificantly up to a line size of 32 bytes. Beyond 64 bytes, there is virtually
no improvement, and in some cases the miss ratio even increases. When
analyzing the CPI ratio (table 6.20), it becomes even more obvious that
32-byte lines are optimal. Thus, it is not a pure coincidence that commer-
cial processors like the Pentium [AA93] or the DEC Alpha [ERP95] use
L1 caches with 32-byte cache lines.
However, 32 bytes is not a random number. In the SPECint92 integer
workload, about 15% of all the instructions change the flow of control
(e.g., branch, jump, and call). On average, the instruction stream switches
to another cluster of references after every sixth instruction. Thus, fetching
more than 8 instructions (32 bytes) rarely pays off, especially since the
instructions account for 75% of the memory references.
Impact on Cost and Cycle Time Doubling the cache size cuts the miss
ratio by about one third and improves the cycle count, but it also impacts
the cost and cycle time of the DLX design (table 6.21). If a cache of 8KB
or more is used, the fixed point core with its 12 kilo gates accounts for less
than 10% of the total cost, and doubling the cache size roughly doubles the
cost of the design.
For a fixed cache size, doubling the line size implies that the number of
cache lines in cut by half. Therefore, the cache directory only requires half
*
(
M EMORY S YSTEM CPI ratio of the DLX with direct mapped cache on the SPECint92
D ESIGN workload. Taken from [Kro97].
as many entries as before, and the directory shrinks by half. Thus, doubling
the line size reduces the cost of the cache and the cost of the whole DLX
design. Increasing the line size from 8 to 16 bytes reduces the cost of the
DLX design by 7-10%. Doubling the line size to 32 bytes saves another
5% of the cost. Beyond 32 bytes, an increase of the line size has virtually
no impact on the cost.
Table 6.21 also lists the cycle time imposed by the data paths, the control
and the cache interface:
The cache influences this cycle time in three ways: T$i f and T$read account
for the actual update of the cache and the time of a cache read hit. The
cache directory also provides the hit signal, which is used by the control in
order to generate the clock and write enable signals (TCON ). This usually
takes longer than the cache update itself and for large caches it becomes
even time critical. Doubling the line size then reduces the cycle time by 3
gate delays due to the smaller directory.
*
(&
Cost and cycle time of the DLX design with a direct mapped cache S EQUENTIAL DLX
WITH C ACHE
cache cost CDLX [kilo gates] cycle time TDLX
M EMORY
size line size [B]
[KB] 8 16 32 64 128 8 16 32 64 128
1 42 39 37 36 36 80 70 70 70 70
2 69 62 59 57 57 83 80 70 70 70
4 121 109 103 100 98 86 83 80 70 70
8 226 202 190 185 182 89 86 83 80 70
16 433 388 365 354 348 92 89 86 83 80
32 842 756 713 692 681 95 92 89 86 83
64 1637 1481 1403 1364 1345 98 95 92 89 86
Impact on the Miss Ratio Table 6.22 lists the miss ratio of an asso-
ciative cache with random or LRU replacement policy on a SPECint92
workload. This table is taken from [Kro97], but similar results are given in
[GHPS93]. LRU replacement is more complicated than random replace-
ment because it requires a cache history, but it also results in a significantly
better miss ratio. Even with twice the degree of associativity, a cache with
random replacement performs worse than a cache with LRU replacement.
Thus, we only consider the LRU replacement.
In combination with LRU replacement, 2-way and 4-way associativity
improve the miss ratio of the cache. For moderate cache sizes, a 2-way
*#
(
M EMORY S YSTEM Miss ratio [%] of the SPECint92 workload on a DLX cache system
D ESIGN with 32-byte lines and write allocation; [Kro97].
Cost and CPU cycle time of the DLX design with a k-way set associa-
tive cache (32-byte lines).
cache achieves roughly the same miss ratio as a direct mapped cache of
twice the size.
Impact on the Cost Like for a direct mapped cache, the cost of the cache
interface with a set associative cache roughly doubles when doubling the
cache size. The cache interface accounts for over 90% of the cost, if the
cache size is 8KB or larger (table 6.23). 2-way and 4-way associativity
increase the total cost by at most 4% and 11%, respectively. The relative
cost overhead of associative caches gets smaller for larger cache sizes.
When switching from 2-way to 4-way associativity, the cost overhead
is about twice the overhead of the 2-way cache. That is for the following
*&
(&
reasons: In addition to the cache directory and the cache data RAMs, a
set associative cache with LRU replacement also requires a cache history S EQUENTIAL DLX
WITH C ACHE
and some selection circuits. In a 2-way cache, the history holds one bit per
sector, and in a 4-way cache, it holds 8 bits per sector; that is less than 0.5% M EMORY
of the total storage capacity of the cache. The significant cost increase
results from the selection circuits which are the same for all cache sizes.
In the 2-way cache, those circuits account for about 900 gate equivalents.
The overhead of the 4-way cache is about three times as large, due to the
more complicated replacement circuit.
Impact on the Cycle Time The cache provides the hit signal which is
used by the control in order to generate the clock signals. Except for small
caches (1KB and 2KB), the control even dominates the cycle time TDLX
which covers all CPU internal cycles (table 6.23). Doubling the cache size
then increases the cycle time by 3 gate delays due to the larger RAM.
In a 32-bit design, the tags of a direct mapped cache of size X KB are
t1 32 log X
bits wide according to figure 6.7. Thus, doubling the cache size reduces the
tag width by one. In a set associative cache, the cache lines are distributed
equally over the k cache ways, and each way only holds a fraction (1k) of
the lines. For a line size of 32 bytes, we have
Lk L1 k X 32 k
tk 32 log X log k t1 log k
The cache tags are therefore log k bits wider than the tags of an equally
sized direct mapped cache.
In each cache way, the local hit signal hi is generated by an equality
tester which checks the tk -bit tag and the valid flag:
The core of the tester is a tk 1-bit OR-tree. For a cache size of of 1KB
to 64KB and an associativity of k 4, we have
32 log 64K log 1 tk 32 log 1K log 4
17 tk 1 25
and the equality tester in the hit check circuit of the k-way cache has a fixed
depths. However, the access of the cache data and the directory is 3 log k
delays faster due to the smaller RAMs
The local hit signals of the k cache ways are combined to a global hit
signal using an AND gate and an k-bit OR-tree. For k 2, we have
D$k hit D$k hi Dand DORtree k
D$1 hi 3 log k 2 2 log k
Thus, for a moderate cache size, the 2-way cache is one gate delay slower
than the other two cache designs.
Impact on the Performance Table 6.24 lists the optimal cycle time of
the DLX design using an off-chip memory with parameter α 4 8, and
table 6.25 lists the CPI and TPI ratio of these designs. In comparison to a
direct mapped cache, associative caches improve the miss ratio, and they
also improve the CPI ratio of the DLX design. For small caches, 2-way
associativity improves the TPI ratio by 4 11%, and 4-way associativity
improves it by 5 17%. However, beyond a cache size of 4KB, the slower
cycle time of the associative caches reduces the advantage of the improved
miss ratio. The 64KB associative caches even perform worse than the
direct mapped cache of the same size.
Doubling the cache size improves the miss ratio and the CPI, but it also
increases the cycle time. Thus, beyond a cache size of 4KB, the 4-way
cache dominates the cycle time TDLX , and the larger cycle time even out-
weights the profit of the better miss ratio. Thus, the 4KB, 4-way cache
yields the best performance, at least within our model. Since larger caches
increase cost and TPI ratio, they cannot compete with the 4KB cache.
In combination with a fast off-chip memory (α 4), this cache speeds
the DLX design up by a factor of 2.09 at 8.8 times the cost. For a memory
*(
(&
S EQUENTIAL DLX
WITH C ACHE
CPI and TPI ratio of the DLX design with cache. The third table M EMORY
lists the CPI and TPI reduction of the set associative cache over the direct mapped
cache (32-byte lines).
CPI ratio
cache α4 α8
size 1 2 4 1 2 4
1 KB 6.67 6.29 5.90 7.74 7.20 6.96
2 KB 6.04 5.79 5.73 6.85 6.51 6.42
4 KB 5.51 5.46 5.40 6.18 6.05 5.96
8 KB 5.25 5.07 5.02 5.80 5.55 5.58
16 KB 5.06 4.94 4.89 5.53 5.35 5.28
32 KB 4.95 4.86 4.84 5.37 5.14 5.12
64 KB 4.87 4.83 4.82 5.16 5.11 5.10
TPI ratio
cache α4 α8
size 1 2 4 1 2 4
1 KB 466.9 440.3 425.0 549.3 504.2 487.0
2 KB 422.7 405.6 401.1 486.5 462.2 449.3
4 KB 441.1 404.2 378.2 494.7 447.4 423.2
8 KB 435.6 426.2 386.2 481.5 466.1 423.9
16 KB 435.5 429.6 420.6 475.9 465.7 454.5
32 KB 440.9 437.2 430.7 478.4 462.8 460.6
64 KB 447.9 449.4 443.7 474.4 475.0 468.8
*/
(
M EMORY S YSTEM Speedup and cost increase of the DLX with 4-way cache over the
D ESIGN design without cache
2.2
1KB 4-way
2 2KB 4-way
5
1KB 4-way
4.5 2KB 4-way
quality ratio (alpha =8)
4KB 4-way
4 no cache
3.5
3
2.5
2
1.5
1
0.5
0
0 0.1 0.2 0.3 0.4 0.5 0.6
quality paramter: q
Quality ratio of the designs with 4-way cache relative to the design
without cache for two types of off-chip memory.
*)
('
system with α 8, the cache even yields a speedup of 2.9. According to
table 6.26, the speedup of the 1KB and 2KB caches are at most 6% to 15% P IPELINED DLX
WITH C ACHE
worse than that of the 4KB cache at a significantly better cost ratio. Thus,
there is a trade-off between cost and performance, and the best cache size M EMORY
is not so obvious. Figure 6.22 depicts the quality of the DLX designs with
a 4-way cache of size 1KB to 4KB relative to the quality of the design
without cache. The quality is the weighted geometric mean of cost and
TPI ratio: Q C q T PI q 1.
N ORDER to avoid structural hazards, the pipelined DLX core of the sec-
tions 4 and 5 requires an instruction memory IM and a data memory DM.
The cache system described in this section implements this split memory
by a separate instruction and data cache. Both caches are backed by the
unified main memory, which holds data and instructions.
The split cache system causes two additional problems:
The arbitration of the memory bus. Our main memory can only
handle one access at a time. However, an instruction cache miss can
occur together with a write through access or a data cache miss. In
such a case, the data cache will be granted access, and the instruction
cache must wait until the main memory allows for a new access.
**
(
The data consistency of the two caches. As long as a memory word is
M EMORY S YSTEM placed in the instruction cache, the instruction cache must be aware
D ESIGN of the changes done to that memory word. Since in our DLX design,
all the memory writes go through the data cache, it is only the in-
struction cache which must be protected against data inconsistency.
As in the sequential DLX design (section 6.4), the caches only impact
the memory environments and the memory control circuits. This section
describes how to fit the instruction and data cache into the memory en-
vironments IMenv and DMenv of the pipelined design DLXΠ supporting
interrupts, and how the memory interface Mif connects these two environ-
ments to the external main memory. The new memory control is described
in section 6.5.2.
1 + 5 %
The core of the data environment DMenv (figure 6.23) is the cache interface
D$i f as it was introduced in section 6.3. The data cache (Dcache) is a
sectored, write through, write allocate cache with a 64-bit word size. In
#
('
dpc MAR DMRw DMRw
P IPELINED DLX
1 0 Dlinv
64 WITH C ACHE
32 MDin
a
M EMORY
Din
Mad D$a
reset clear D$if (Dcache)
MDat MDat
hit Dout[63:32, 31:0]
1 0 MAR[2]
Dhit DMout
On the output port Dout, a multiplexer selects the requested 32-bit word
within the double-word based on the address bit MAR[2]:
On an instruction cache miss the data cache is checked for the requested
line (Dlinv 1). In case of a snoop hit, the corresponding Dcache entry
is invalidated. For the snoop access and the line invalidation, the Dcache
interface uses the address d pc of the instruction memory instead of address
MAR:
MAR if Dlinv 0
a
d pc if Dlinv 1
A multiplexer selects between these two addresses. Since the Dcache is
only flushed on reset, the clear input of the Dcache interface D$if is con-
nected to the reset signal. The hit signal Dhit is provided to the memory
control.
The data memory environment communicates with the memory interface
Mif and the external memory via the address port D$a and the data ports
#
(
MAR dpc
M EMORY S YSTEM 1 0 Ilinv
D ESIGN 32
a Din
Mad I$a
reset clear I$if (Icache)
MDat MDat
hit Dout[63:32, 31:0]
1 0 dpc[2]
Ihit IMout
MDin and MDat. The Dcache interface provides the memory address
D$a Mad
Let the sectored cache comprise 2ld lines, each of which is split in S
2sdsectors. The data memory environment then has cost
Assuming that control signal Dlinv is precomputed, address a and data Din
have the following accumulated delay:
MAR if Ilinv 1
a
d pc if Ilinv 0
#
('
IMenv DMenv
I$a Mdat D$a MDat MDin
P IPELINED DLX
WITH C ACHE
MDindoe
M EMORY
Igrant 1 0 MDat 64
Mif
32
Mad
a do di
external memory
Interface Mif connecting IMenv and DMenv to the external memory
The Icache is, like the Dcache, flushed on reset; its hit signal Ihit is
provided to the memory control. The environment IMenv communicates
with memory interface Mif and the external memory via the address port
I$a and the data port MDat. The Icache interface provides the memory
address I$a Mad.
Let the instruction cache comprise 2li lines with 2si sectors per line and
b
2 8 bytes per sector; the cost of environment IMenv can be expressed
as
The Icache address a has the same accumulated delay as the Dcache ad-
dress.
I$a if Igrant 1
MAd
D$a if Igrant 0
using a 3-bit multiplexer. Thus, the cost of circuit MifC can be expressed
as
CMi f C Cmux 3 CI $i f C CD$i f C
The two automata I$ifC and D$ifC are very much like the Mealy au-
tomaton of the sequential MifC control, except that they provide some new
signals, and that they need two additional states for the snoop access. In
#'
( /Brdy
D$ifC /Brdy /Brdy * reqp
M EMORY S YSTEM DFreq Dwait /Brdy * /reqp DLwait
D ESIGN Dmra * /Dhit * /iaccess /Brdy * reqp Brdy /Brdy * /reqp Brdy
Dmwa * /Dhit * /iaccess Brdy Brdy * /reqp
Dfill DLfill
Brdy * reqp
Ireq Dmra
Dsnoop D$RD Dmwa
/Dhit Dmwa * Dhit * /iaccess
Dhit
Dlinv else
Mlast Mwrite D$w
Brdy /Brdy
I$ifC /Brdy
/Dinit * /isnoop
Dinit * /Brdy /Brdy * reqp
isnoop IFreq Iwait /Brdy * /reqp ILwait
/Ihit * /imal * /isnoop /Brdy * reqp Brdy /Brdy * /reqp Brdy
isnoop Brdy * /reqp
Isnoop Ifill ILfill
Dinit * Brdy Brdy * reqp
/Ihit
Ihit I$RD
else
Ilinv
FSDs modeling the Mealy automata of the controls D$if and I$if
the I$ifC automaton, the states for the memory write access are dropped.
Figure 6.26 depicts the FSDs modeling the Mealy automata of the D$ifC
and I$ifC control. Table 6.27 lists the active control signals for each state;
table 6.28 lists the parameters of the two automata, assuming that the au-
tomata share the monomials.
The inputs of the two automata have the following accumulated delay:
The two busy signals and the signals D$rd and I$rd are the only Mealy
control signals. As in the sequential design, these signals are just used for
clocking. The remaining cache control signals (cs$if) and the bus control
signals are of type Moore and can be precomputed. They have delay
AMi f C cs$i f 0
AMi f C req wr burst Dmux
Since the automata only raise the flags ibusy and dbusy in case of a non-
faulty memory access, the clock circuit of the stall engine can now simply
#(
('
P IPELINED DLX
WITH C ACHE
Active control signals for the FSDs modeling the MifC control; X
M EMORY
denotes the data (D) or the instruction (I) cache.
Parameters of the Mealy automata used in the memory interface con-
trol MifC
#/
(
obtain the busy signal as
M EMORY S YSTEM
D ESIGN busy ibusy dbusy
at an accumulated delay of
Ace busy maxAout 2 I$i fC Aout 2 D$i fC Dor
-
This is the tricky part. Let us call D$i fC the D-automaton, and let us call
I$i fC the I-automaton. We would like to show the following properties:
Before we can prove the lemma, we first have to formally define, in what
cycles a memory access takes place. We refer to the bus protocol and count
an access from the first cycle, when the first address is on the bus until the
last cycle, when the last data are on the bus.
completion. M EMORY
If state 2 is entered with Igrant 1, the access starts immediately,
and the D-automaton returns to its initial state within 0, 1 or 2 cycles. From
then on, things proceed as in the previous case.
In state signal isnoop is active which sends the I-automaton from
its initial state into state 2% . Similarly, in state 2 signal Ireq is
active which sends the D-automaton from its initial state into state % .
5 %
As for the sequential DLX design with cache, the temporal behavior of the
memory system is modeled by the request cycle time TMreq , the burst read
time TMrburst , the read/write access time TMaccess to off-chip memory, the
cache read access time T$read , and the cycle time T$i f of the caches (see
page 283).
In the pipelined DLX design, the Icache and the Dcache have the same
size, and their inputs have the same accumulated delay, thus
The formulae of the other three memory cycle times remain unchanged.
The cycle time TDLX of all internal cycles and the cycle time τDLX of the
whole system are still modeled as
. %
According to table 6.29, the 4KB cache memory increases the cost of the
pipelined design by a factor of 5.4. In the sequential design this increase
factor is significantly larger (8.8) due to the cheaper data paths.
#*
(
M EMORY S YSTEM Cost of the DLXΠ design without cache and with 2KB, 2-way Icache
D ESIGN and Dcache
Cycle time of the design DLXΠ with 2KB, 2-way Icache and Dcache
Maccess
MifC stall DP $read $if Mreq Mrburst
α4 α8
65 79 89 55 47 42 51 379 707
The two caches and the connection to the external memory account for
81% of the total cost of the pipelined design. The memory interface con-
trol now comprises two Mealy automata, one for each cache. It therefore
increases the cost of the control by 69%, which is about twice the increase
encountered in the sequential design.
Table 6.30 lists the cycle time of the DLXΠ design and of its memory
system, assuming a bus and handshake delay of dbus 15 and dMhsh
10. The data paths dominate the cycle time TDLX of the processor core.
The caches themselves and the control are not time critical. The memory
request and the burst read can be performed in a single cycle; they can
tolerate a bus delay of dbus 53.
the cache size, due to the computation of the hit signal. However, if the size
of a single cache way is at most 2KB, the control is not time critical. In
spite of the more complex cache system, this is the same cache size bound
as in the sequential DLX design. That is because the stall engine and main
control of the pipelined design are also more complicated than those used
in the sequential design.
The workload comprises 25.3% loads and 8.5% stores. Due to some empty
delay slots of branches, the pipelined DLX design must fetch 10% addi-
tional instructions, so that νf etch 11.
As in the sequential DLX design with cache interface, the memory ac-
cess time is not uniform (table 6.16, page 288). A read hit can be per-
formed in just a single cycle. A standard read/write access to the external
memory (TMaccess ) requires W processor cycles. Due to the write through
policy, a write hit then takes 2 W cycles. For a cache line with S sectors,
a cache miss adds another S W cycles. Let pIm and pDm denote the miss
ratio of the instruction and data cache. Since on a cache miss, the whole
pipeline is usually stalled, the CPI ratio of the pipelined design with cache
#
(
M EMORY S YSTEM Miss ratios of a split and a unified cache system on the SPECint92
D ESIGN workload depending on the total cache size and the associativity.
Effective Miss Ratio According to table 6.32, the instruction cache has a
much better miss ratio than the data cache of the same size. That is not sur-
prising, because instruction accesses are more regular than data accesses.
For both caches, the miss ratio improves significantly with the cache size.
The pipelined DLX design strongly relies on the split first level cache,
whereas the first level cache of the sequential DLX design and any higher
level cache can either be split or unified. We have already seen that a split
cache system is more expensive, but it maybe achieves a better perfor-
mance.
For an easy comparison of the two cache designs, we introduce the ef-
fective miss ratio of the split cache as:
This effective miss ratio directly corresponds to the miss ratio of a unified
cache. According to table 6.32, a split direct mapped cache has a smaller
miss ratio than a unified direct mapped cache; that is because instructions
and data will not thrash each other. For associative caches, the advantage
#
('
Optimal cycle time τ, number of wait states W , CPI and TPI ratio of P IPELINED DLX
the pipelined DLX design with split 2-way cache. WITH C ACHE
M EMORY
total memory: α 4 memory: α 8
cache size W τ CPI TPI W τ CPI TPI
1 KB 4 90 2.82 253.5 8 89 3.72 331.2
2 KB 4 92 2.46 226.3 8 89 3.19 283.7
4 KB 4 95 2.22 211.1 8 89 2.83 251.9
8 KB 5 89 2.12 188.4 8 89 2.49 221.4
16 KB 4 97 1.85 179.2 8 97 2.27 220.1
32 KB 4 100 1.77 177.3 7 103 2.06 212.3
of a split system is not so clear, because two cache ways already avoid
most of the thrashing. In addition, the unified cache space can be used
more freely, e.g., more than 50% of the space can be used for data. Thus,
for a 2-way cache, the split approach only wins for small caches ( 4KB).
On the other hand, the split cache can also be seen as a special asso-
ciative cache, where half the cache ways are reserved for instructions or
data, respectively. Since the unified cache space can be used more freely,
the unified 2-way (4-way) cache has a better miss ratio than the split direct
mapped (2-way) cache. Commercial computer systems use large, set asso-
ciative second and third level caches, and these caches are usually unified,
as the above results suggest.
Performance Impact Table 6.33 lists the optimal number of wait states
and cycle time of the pipelined DLX design as well as the CPI and TPI
ratios for two versions of main memory. The CPI ratio improves signifi-
cantly with the cache size, due to the better miss ratio. Despite the higher
cycle time, increasing the cache size also improves the performance of the
pipelined design by 30 to 36%. In the sequential DLX design, the cache
size improved the performance by at most 12% (table 6.25). Thus, the
speedup of the pipelined design over the sequential design increases with
the cache size.
Compared to the sequential design with 4-way cache, the pipelined de-
sign with a split 2-way cache yields a 1.5 to 2.5 higher performance (table
6.34). The cache is by far the most expensive part of the design; a small
1KB cache already accounts for 60% of the total cost. Since the pipelined
and sequential cache interfaces have roughly the same cost, the overhead
of pipelining decreases with the cache size. The pipelined DLX design is at
most 27% more expensive, and the cost increase is smaller than the perfor-
##
(
M EMORY S YSTEM Speedup and cost increase of the pipelined design with split 2-way
D ESIGN cache relative to the sequential design with unified 4-way cache.
%&
This and the following exercises deal with the design of a
write back cache and its integration into the sequential DLX design. Such
a cache applies the weak consistency model. A write hit only updates the
cache but not the external memory. A dirty flag for each line indicates
that the particular line has been updated in the cache but not in the main
memory. If such a dirty line is evicted from the cache, the whole line
must be copied back before starting the line fill. Figure 6.27 depicts the
operations of a write back cache for the memory transactions read and
write.
Modify the design of the k-way cache and of the cache interface in order
to support the write back policy and update the cost and delay formulae.
Special attention has to be payed to the following aspects:
#&
(/
A cache line is only considered to be dirty, if the dirty flag is raised
and if the line holds valid data. E XERCISES
Integrate the write back cache interface into the sequential
DLX design and modify the cost and delay formulae of the memory sys-
tem. The memory environment and the memory interface control have to
be changed. Note that the FSD of figure 6.27 must be extended by the bus
operations.
A write back cache basically performs four types of accesses,
namely a cache read access (read hit), a cache update (write hit), a line fill,
and a write back of a dirty line. Let a cache line comprise S sectors. The
read hit then takes one cycle, the write hit two cycles, and the line fill and
the write back take W S cycles each.
Show that the write back cache achieves a better CPI ratio than the write
through cache if the number of dirty misses and the number of writes
(stores) obey:
W 1 # dirty misses
W S # writes
Analyze the impact of the write back policy on the cost, per-
formance, and quality of the sequential DLX design. Table 6.35 lists the
ratio of dirty misses to writes for a SPECint92 workload [Kro97].
#'
(
M EMORY S YSTEM
D ESIGN
#(
Chapter
7
IEEE Floating Point
Standard and Theory of
Rounding
N THIS chapter, we introduce the algebra needed to talk concisely about
floating point circuits and to argue about their correctness. In this for-
malism, we specify parts of the IEEE floating point standard [Ins85], and
we derive basic properties of IEEE-compliant floating point algorithms.
Two issues will be of central interest: the number representation and the
rounding.
n1 p1
an 1 : 0 f 1 : p 1 ∑ ai 2i ∑ fi 2 i
i 0 i 1
0m n
a f bg 0 p q 2 p 1
sm : 0 t 1 : p 1 2 p 1
st
/# . ,
The IEEE floating point standard makes use of a rather particular integer
format called the biased integer format. In this format, a string
en 1 : 0
0 1
n n
Strings interpreted in this way will be called biased integers Biased inte-
gers with n bits lie in a range emin : emax , where
emin 1 2n 1
1 2n 1
2
emax 2 2 2
n n1
1 2 n1
1
and therefore
Thus, the two numbers excluded in the biased format are at the bottom of
the range of representable numbers. Converting a biased integer xn 1 : 0
to a two’s complement number yn 1 : 0 requires solving the following
equation for y
xbias y
x y 2
n1
1 y 1
n1
y 1n 1 mod 2n
#*
/
IEEE F LOATING Components of an IEEE floating point number
P OINT S TANDARD
normal denormal
AND T HEORY OF
exponent ebias emin
ROUNDING
significand 1 f 0 f
hidden bit 1 0
Obviously, single precision numbers fit into one machine word and double
precision numbers into two words.
IEEE floating point numbers can represent certain rational numbers as
well as the symbols ∞, ∞ and NaN. The symbol NaN represents ‘not a
number’, e.g., the result of computing 00. Let s e f be a floating point
number, then the value represented by s e f is defined by
1s 2ebias 1 f if e
0 1
n n
1s 2emin 0 f if e 0n
s e f
1s ∞ if e 1n and f 0p 1
normal if e
0 1 and
n n
denormal (denormalized) if e 0n .
e 0n for denormal numbers. Observe also, that string f alone does not
determine the significand, because the exponent is required to determine
the hidden bit. If we call the hidden bit f 0, then the significand obviously
#
/
2emin - (p-1) 2emin - (p-1)
N UMBER F ORMATS
2z - (p-1)
2z 2z+1
2emax - (p-1)
Xmax
2emax 2emax +1
f 0 2 p 1
1 2 p 1
f 0 1 2 p 1
Thus, we have
1 f 2 2 p 1
0 f 1 2 p 1
#
/
Figure 7.1 depicts the non-negative representable numbers; the picture for
IEEE F LOATING the negative representable numbers is symmetric. The following properties
P OINT S TANDARD characterize the representable numbers:
AND T HEORY OF
ROUNDING 1. For every exponent value z emin emax , there are two inter-
vals containing normal representable numbers, namely 2z 2z1 and
2z1 2z . Each interval contains exactly 2p 1 numbers. The
bers equals 2emin p 1. This is the same gap as in the intervals
2emin 2emin 1 and 2emin 1 2emin . The property, that the gap be-
tween the numbers 2emin and 2emin is filled with the denormal num-
bers is called gradual underflow.
Note that the smallest and largest positive representable numbers are
Xmin 2emin 2 p 1
The number x 0 has two representations, one for each of the two pos-
sible sign bits. All other representable numbers have exactly one represen-
tation. A representable number x s e f is called even if f p 1 0,
and it is called odd if f p 1 1. Note that even and odd numbers alter-
nate through the whole range of representable numbers. This is trivial to
see for numbers with the same exponent. Consecutive numbers with dif-
ferent exponent have significands 0, which is even, and 1 1p 1 , which
is odd.
we are aware that the letters e and f are used with two meanings
depending on context, and
#
/ 8- 5
R ∞ R ∞ ∞
Since R is not closed under the arithmetic operations, one rounds the result
of an arithmetic operation to a representable number or to plus infinity or
minus infinity. Thus, a rounding is a function
r : IR R ∞
mapping real numbers x to rounded values r x. The IEEE standard defines
four rounding modes, which are
ru round up,
rd round down,
ru x miny R ∞ x y
rd x maxy R ∞ x y
rd x if x 0
rz x
ru x if x0
The fourth rounding mode is more complicated to define. For any x with
Xmax x Xmax , one defines rne x as a representable number y closest
to x. If there are two such numbers y, one chooses the number with even
significand. Let
Xmax 2emax 2 2 p
(see figure 7.2). This number is odd, and thus, it is the smallest number,
that would be rounded by the above rules to 2emax 1 if that would be a
representable number. For x
Xmax Xmax , one defines
∞ Xmax x
Xmax
if
if Xmax x Xmax
rne x
Xmax Xmax x Xmax
∞
if
if x Xmax
# &
/
/
ROUNDING
Let
r : IR R ∞
be one of the four rounding functions defined above, and let
Æ : IR 2 IR
ÆI : R 2 R ∞
x ÆI y r x Æ y
r x Æ y r x Æ y (7.1)
In this case, the factoring is neither normal nor denormal. The value of a
factoring is defined as
s e f 1s 2e f
x s e f
η̂ x η x if x 2emin
Let α be an integer. Let q range over all integers, then the open intervals
q 2 α q 1 2 α and the singletons q 2 α form a partition of the
real numbers (see figure 7.3). Note that 0 is always an endpoint of two
intervals.
Two real numbers x and y are called α–equivalent if according to this
partition they are in the same equivalence class, i.e., if they lie either in the
same open interval or if they both coincide with the same endpoint of an
interval. We use for this the notation x α y. Thus, for some integer q we
have
α α
x α y x y q2
q 1 2
α
or x y q2
2e x α e 2e x and 2
e
xα e
2e xα
# /
/ e-p e-p
2 2
IEEE F LOATING e-p
P OINT S TANDARD y y+2 z
e - (p-1)
AND T HEORY OF 2
ROUNDING
Geometry of the values y, y 2 e p, and z
x y α x y
x β x
The salient properties of the above definition are, that under certain cir-
cumstances rounding x and its representative leads to the same result, and
that representatives are very easy to compute. This is made precise in the
following lemmas.
1. r x r x p e
2. η x p e s e f p
3. if x pe x, then r x r x .
d 2e p 1
y q 2e p 1
z q 1 2e p 1
normal because
s e f is IEEE-normal,
x 2emin x p e
2emin , and
f is normal iff f p is normal.
This proves part 2. Part 3 follows immediately from part 1, because
r x r x p e r x p e
r x
The next lemma states how to get p-representatives of the value of a
binary fraction by a so called sticky bit computation. Such a computation
simply replaces all bits f p 1 : v by the OR of these bits.
If s 0 then f gs, and there is nothing to show. In the other case,
we have
v
g f g ∑ f i 2 i
g 2 p
i p1
Thus,
f p g 2 p1
g1 gs
# *
/
f: f[-u : 0] . f[1 : p] f[p+1 : v]
IEEE F LOATING
P OINT S TANDARD OR-tree
AND T HEORY OF
ROUNDING g s
Let η̂ x s ê fˆ. Along the lines of the proof of lemma 7.1, one shows
the following lemma:
Let x 0, let η̂ x s ê fˆ, and let r be an IEEE rounding mode, then
1. r̂ x r̂ x p ê
2. η̂ x p ê s ê fˆ p
3. if x pê x, then r̂ x r̂ x .
Let r be any rounding mode. We would like to break the problem of com-
puting r x into the following four steps:
##
/
1. IEEE normalization shift. This step computes the IEEE-normal fac-
toring of x ROUNDING
η x s e f
f1 sigrd s f
The function sigrd will be defined below separately for each round-
ing mode. It will produce results f1 in the range 0 2.
e1 1 f2 2 if f1 2
e2 f2 post e f1
e f1 otherwise
4. Exponent round. This step takes care of cases where the intermediate
result 1s 2e2 f2 lies outside of R . It computes
e3 f3 exprd s e2 f2
The function exprd will be defined below separately for each round-
ing mode.
We will have to define four functions sgrd and four functions exprd such
that we can prove
s e3 f3 η r x
We define
x1 s e f1 1s 2e f1
The following lemma summarizes the properties of the significand round-
ing:
r x if x Xmax
x1
r̂ x if x Xmax
For f 1 2, x lies in the interval 2e 2e1 if s 0, and it lies in
2e1 2e if s 1. Mirroring this interval at the origin in case of s 1
and scaling it by 2 e translates exactly from rounding with r̂ to signifi-
lates from rounding with r into significand rounding in the interval 0 1.
Mirroring if s 1 and scaling by 2emin translates in the other direction.
Finally observe that r x r̂ x if x Xmax and f is normal.
s e2 f2 η x1
##
/
Post normalization obviously preserves value: ROUNDING
emax 2 2 p 1
if e2 emax
exrdz s e2 f2
e2 f2 if e2 emax
∞ 0 if e2 emax
exrdne s e2 f2
e2 f2 if e2 emax
Let
x3 s e3 f3 1s 2e3 f3
We can proceed to prove the statement
s e3 f3 η r x
of the theorem.
∞ if r̂ x r x and s0
x2Xmax if
if
r x
r̂ x
r̂ x r x
and s1
r x
because x2 r̂ x by lemma 7.5.
The proof for the other three rounding modes is completely analogous.
We summarize the results of this subsection: Let η x s e f , it then
holds
η r x s exprd s post e sigrd s f (7.4)
Exactly along the same lines, one shows for x 0 and η̂ x s ê f that
ˆ
/ / 8-
By the lemmas 7.1 and 7.2, we can substitute in the above algorithms f and
fˆ by their p-representatives. This gives the following rounding algorithms:
%&
/# C
/# C
As r̂ x r̂ x p
ê , we can also conclude that
T INYa x T INYa x p ê
3 -%
The two definitions for loss of accuracy are denormalization loss:
LOSSa x r x r̂ x
and inexact result
LOSSb x r x x
An example for denormalization loss is x 00p 1 because
rne x 0 and r̂ x x
A denormalization loss implies an inexact result, i.e.,
LOSSa x LOSSb x
The lemma is proven by contradiction. Assume r x x, then x R R̂
and it follows that
r̂ x x r x
##/
/
Let η̂ x s ê fˆ and η x s e f . By definition,
IEEE F LOATING
P OINT S TANDARD x pê pê x
AND T HEORY OF
ROUNDING Since ê e, we have
x pê pe x
and hence,
r x p ê r x
This shows, that
LOSSb x LOSSb x p ê
and therefore, the conditions can always be checked with the representative
x p ê instead of with x.
r̂ x x sigrd s f f
sigrd s f p f p
x aÆb
##)
/#
be the exact result. The proper definition of the result of the IEEE operation
is then E XCEPTIONS
a ÆI b r y
where
x 2 α if OV F x OV Fen
x 2α if UNF x UNFen
y
x otherwise.
Thus, whenever non masked overflows or underflows occur, the expo-
nent of the result is adjusted. For some reason, this is called wrapping the
exponent. The rounded adjusted result is then given to the interrupt service
routine. In such cases one would of course hope that r y itself is a normal
representable number. This is asserted in the following lemma:
The adjusted result lies strictly between 2emin and Xmax : (
1. OV F x 2emin x2 α Xmax
We only show the lemma for multiplication in the case of overflow. The
remaining cases are handled in a completely analogous way.
The largest possible product of two representable numbers is
emax 1 2
x
2
Xmax 2 22emax 2
2 emax 2 α 2 2n 1
1 2 3 2n 2
4 2n 2
3 2n 2
2n 2
emax
emax α 2n 1
1 3 2n 2
2n 2
1
2 n1
2 emin
##*
/
The following lemma shows how to obtain a factoring of r y from a
IEEE F LOATING factoring of x.
P OINT S TANDARD
AND T HEORY OF
) Let η̂ r̂ x s u v, then
ROUNDING
1. OV F x η x 2 α s u α v
2. UNF x η x 2α s u α v
We only show part 1; the proof of part 2 is completely analogous. Let
η̂ x s ê fˆ
then
α
η̂ x 2
s ê α fˆ
Define f1 and u v as
f1 sigrd s fˆ p ê
u v post ê f1
u α v post ê α f1
It follows that
η r y η̂ r̂ y
and part 1 of the lemma is proven.
#&
/&
/#& . 8 -
A RITHMETIC ON
Let S PECIAL O PERANDS
x2 α
if OV F x OV Fen
x 2α if UNF x UNFen
y
x otherwise.
be the exact result of an IEEE operation, where the exponent is wrapped in
case an enabled overflow or underflow occurs. The IEEE standard defines
the occurrence of an inexact result by
holds, then exponent rounding does not take place, and significand round-
ing is the only source of inaccuracy. Thus, we have in this case
INX y sigrd s f f OV F x
sigrd s f p f p OV F x p e
N THE IEEE floating point standard [Ins85], the infinity arithmetic and
the arithmetic with zeros and NaNs are treated as special cases. This
special arithmetic is considered to be always exact. Nevertheless, there
#&
/
are situations in which an invalid operation exception INX or a division by
IEEE F LOATING zero exception DBZ can occur.
P OINT S TANDARD
In the following subsections, we specify this special arithmetic and the
AND T HEORY OF
possible exceptions for any IEEE operation. The factorings of the numbers
ROUNDING
a and b are denoted by sa ea fa and sb eb fb respectively.
There are two different kinds of not a number, signaling NaN and quiet
NaN. Let e en 1 : 0 and f f 1 : p 1. The value represented by
the floating point number s e f is a NaN if e 1n and f 0 p 1 . We
chose f 1 1 for the quiet and f 1 0 for the signaling variety of NaN1 .
qNAN indicates that the result must be one of the quiet input NaNs.
For the absolute value and reversed sign operations, this restriction does
not apply. These two operations modify the sign bit independent of the
type of the operand.
1 The IEEE standard only specifies that the exponent en 1 : 0 1n is reserved for in-
finity and NaN; further details of the coding are left to the implementation. For infinity and
the two types of NaNs we therefore chose the coding used in the Intel Pentium Processor
[Int95]
2 x : x
#&
/&
Result of the addition; x and y denote finite numbers. A RITHMETIC ON
S PECIAL O PERANDS
ab b
a y ∞ ∞ qNAN sNAN
x r x y ∞ ∞
∞ ∞ ∞ qNAN
∞ ∞ qNAN ∞
qNAN qNAN
sNAN qNAN
Since zero has two representations, i.e., 0 and 0, special attention must
be paid to the sign of a zero result a b. In case of a subtraction, the sign
of a zero result depends on the rounding mode
0 if ru rne rz
xx x x
0 if rd
When adding two zero numbers with like signs, the sum retains the sign of
the first operand, i.e., for x 0 0,
xx x x x
#&#
/
IEEE F LOATING Result of the multiplication a b; x and y denote finite non-zero numbers.
P OINT S TANDARD
ab b
AND T HEORY OF
a y 0 ∞ qNAN sNAN
ROUNDING
x r x y 0 ∞
0 0 0 qNAN
∞ ∞ qNAN ∞
qNAN qNAN
sNAN qNAN
Table 7.4 lists the result of the multiplication a b for the different types
of operands. If the result of the multiplication is a NaN, the sign does not
matter. In any other case, the sign of the result c a b is the exclusive or
of the operands’ signs:
sc sa sb
There are just a few cases in which floating point exceptions do or might
occur:
The exceptions OVF, UNF and INX depend on the value of the exact
result (section 7.3); they can only occur when both operands are
finite non-zero numbers.
Table 7.5 lists the result of the division ab for the different types of
operands. The sign of the result is determined as for the multiplication.
This means that except for a NaN, the sign of the result c is the exclusive
or of the operands’ signs: sc sa sb .
In the following cases, the division signals a floating point exception:
The exceptions OVF, UNF and INX depend on the value of the exact
result (section 7.3); they can only occur when both operands are
finite non-zero numbers.
/&'
The comparison operation is based on the four basic relations greater than,
less than, equal and unordered. These relations are defined over the set
R ∞ NaN consisting of all representable numbers, the two infinities, and
NaN:
R ∞ NaN R ∞ ∞ NaN
x y R IR x ÆI y xÆy
#&'
/
IEEE F LOATING Floating point predicates. The value 1 (0) denotes that the relation is
P OINT S TANDARD true (false). Predicates marked with are not indigenous to the IEEE standard.
AND T HEORY OF
predicate greater less equal unordered INV if
ROUNDING
true false ? unordered
F T 0 0 0 0
UN OR 0 0 0 1
EQ NEQ 0 0 1 0
UEQ OGL 0 0 1 1
No
OLT UGE 0 1 0 0
ULT OGE 0 1 0 1
OLE UGT 0 1 1 0
ULE OGT 0 1 1 1
SF ST 0 0 0 0
NGLE GLE 0 0 0 1
SEQ SNE 0 0 1 0
NGL GL 0 0 1 1
Yes
LT NLT 0 1 0 0
NGE GE 0 1 0 1
LE NLE 0 1 1 0
NGT GT 0 1 1 1
The two infinities (∞ and ∞) are interpreted in the usual way. For
any finite representable x R , we have
∞ I x I ∞
elements of the relation ‘unordered’, and that are the only elements. Let
this relation be denoted by the symbol ?, then
Table 7.6 lists all the predicates in question and how they can be obtained
from the four basic relations. The predicates OLT and UGE, for example,
#&(
/&
can be expressed as
A RITHMETIC ON
S PECIAL O PERANDS
OLT x y UGE x y x I y x I y x I y x?y
Note that for every predicate the implementation must also provide its
negation.
In addition to the boolean value Æ x y, the comparison also signals an
invalid operation. With respect to the flag INV, the predicates fall into one
of two classes. The first 16 predicates only signal INV when comparing a
signaling NaN, whereas the remaining 16 predicates also signal INV when
the operands are unordered.
Comparisons are always exact and never overflow or underflow. Thus,
INV is the only IEEE floating point exception signaled by a comparison,
and the flags of the remaining exceptions are all inactive:
Conversions have to be possible between the two floating point formats and
the integer format. Integers are represented as 32-bit two’s complement
numbers and lie in the set
R s 1 2 24
2
128
12 24
2
128
R d 1 2 53
2
1024
12 53
2
1024
Table 7.7 lists the floating point exceptions which can be caused by the
different format conversions. The result of the conversion is rounded as
specified in section 7.2, even if the result is an integer. All four rounding
modes must be supported.
#&/
/
IEEE F LOATING Floating point exceptions which can be caused by format conversions
P OINT S TANDARD (d: double precision floating point, s: single precision floating point, i: 32-bit
AND T HEORY OF two’s complement integer)
ROUNDING
INV DBZ OVF UNF INX
d s + + + +
sd +
is +
id
si + +
d i + +
%&
1. T INYa x T INYb x
2. LOSSa x LOSSb x FALSE
1. LOSSa x
2. LOSSb x
#&*
Chapter
8
Floating Point Algorithms
and Data Paths
a sA eA n 1 : 0 fA 1 : p 1
b sB eB n 1 : 0 fB 1 : p 1
where
53 11 if db 1
n p
24 8 if db 0
As shown in figure 8.1, single precision inputs are fed into the unit as
the left subwords of FA63 : 0 and FB63 : 0. Thus,
sA eA n 1 : 0 fA 1 : p 1
FA263 FA262 : 55 FA254 : 32 if db
FA263 FA262 : 52 FA251 : 0 if db
FA2 FB2
Fc fcc
Fr
129 (sr, er, fr, flr)
FXrnd FPrnd
Fx Fp
Top level schematics of the floating point unit. The outputs Fc, Fx and
Fp consist of a 64-bit data and the floating point exception flags.
Let
x aÆb
be the exact result of an arithmetic operation, and let
η̂ x s ê fˆ
In the absence of special cases the converter, the multiply/divide unit and
the add/subtract unit deliver as inputs to the rounder the data sr er 12 :
0 fr 1 : 55 satisfying
x pê 1sr 2er 12:0 fr 1 : 55
and
fr 1 : 0 00 OV F x 0
Note that η̂ x is undefined for x 0. Thus, a result x 0 is always
handled as a special case. Let
x2 α
if OV F x OV Fen
x 2α if UNF x UNFen
y
x otherwise
The rounder then has to output r y coded as a (packed) IEEE floating
point number. The coding of the rounding modes is listed in table 8.1.
+%
The cost of the floating point unit depicted in figure 8.2 can be expressed
as
CFPU CFCon CFPunp CFXunp CCvt CMulDiv
CAddSub CFXrnd CFPrnd C f f 129 4 Cdriv 129
We assume that all inputs of the FPU are taken from registers and therefore
have zero delay. The outputs Fx , Fp , Fc and f cc then have the following
accumulated delay:
AFPU maxAFCon AFXrnd AFPrnd
#'#
) FA2[63:0] FB2[63:0]
F LOATING P OINT
F2[63:0] F2[63:0]
A LGORITHMS AND
DATA PATHS Unpack Unpack
s, e[10:0], lz[5:0], f[0:52], einf, fz, ez, h[1], h[2:52] h[2:52], s, e[10:0], lz[5:0], f[0:52], einf, fz, ez, h[1]
SpecUnp sa ha hb sb SpecUnp
ZERO, INF, SNAN, NAN NaN select ZERO, INF, SNAN, NAN
ZEROa, INFa, SNANa, NANa snan, fnan[1:52] ZEROb, INFb, SNANb, NANb
4 53 4
fla nan flb
Note that AFCon includes the delay of the inputs f la and f lb . In our
implementation, the multiply/divide unit, the add/subtract unit and the two
rounders FP RND and FX RND have an additional register stage. Thus, the
FPU requires a minimal cycle time of
$ 3/#
- U NPACK
The circuit U NPACK (figure 8.4) has the following control inputs
ez
10 7
1 10 1 7
lzero(53)
11 11 CLS(53)
dbs 1 0 normal
1 0
The data inputs are F263 : 0. Single precision numbers are fed into the
unpacking circuit as the left subword of F263 : 0 (figure 8.1). Input data
are always interpreted as IEEE floating point numbers, i.e.,
s ein n 1 : 0 fin 1 : p 1
F263 F262 : 52 F 251 : 0 if dbs 1
F263 F262 : 55 F 254 : 32 if dbs 0
We now explain the computation of the outputs. The flag
ein f 1 ein 1n
signals that the exponent is that of infinity or NaN. The signals ezd and
ezs indicate a denormal double or single precision input. The flag
ez 1 ein 0n
DATA PATHS
is fed into the incrementer. We conclude for denormal inputs
Thus, h0 is the hidden bit of the significand. Padding single precision
significands by 29 trailing zeros extends them to the length of double pre-
cision significands
F251 : 0 if dbs 1
h1 : 52
F254 : 32 029 if dbs 0
and we have
h1 : 52 fin 1 : p 1
Hence, for normal or denormal inputs the binary fraction h0h1 : 53
represents the significand and
Let lz be the number of leading zeros of the string h0 : 53, then
lz lz5 : 0
1s 2e lz f
if normal 1
1s 2e f
s ein fin
if normal 0
fz 1 fin 1 : p 1 0 p 1
#'(
)
Signal h1 is used to distinguish the two varieties of NaN. We chose
h1 0 for the signaling and h1 1 for the quiet variety of NaN (sec- U NPACKING
tion 7.4.1). Inputs which are signaling NaNs produce an invalid operation
exception (INV).
The cost of circuit U NPACK can be expressed as
From the flags ein f , h1, f z and ez one detects whether the input codes
zero, plus or minus infinity, a quiet or a signaling NaN in an obvious way:
ZERO ez f z
INF ein f f z
NAN ein f h1
SNAN ein f h1 f z ein f h1 NOR f z
- NA N SELECT
This circuit determines the representation snan enan fnan of the output
NaN. According to the specifications of section 7.4, the output NaN pro-
vided by an arithmetic operation is of the quiet variety. Thus,
sa 1 ha 2 : 52 if NANa 1
snan fnan 1 : 52
sb 1 hb 2 : 52 if NANa 0
+%
The floating point unpacker FP UNP of figure 8.3 has cost
With f la and f lb we denote the inputs of the registers buffering the flags f la
and f lb . These signals are forwarded to the converter C VT and to circuit
FC ON; they have delay
Assuming that all inputs of the FPU are provided by registers, the outputs
of the unpacker then have an accumulated delay of
S sa ea fa sb eb fb
1sa 2ea fa 1sb 2eb fb
δ
2ea 1sa fa 1sb 2
f b
δ
f 2
fb p1
If δ 3, then
δ
f 2
fb
and there is nothing to prove. If δ 2 then
δ
2
fb
2 2
2 12
ea eb δ emin δ
#'*
)
Thus, for δ 2, the fa and the factoring sa ea fa are normal, and hence,
F LOATING P OINT
δ
A LGORITHMS AND 1sa fa 1sb 2
fb 1 12 12
DATA PATHS
It follows that
ê ea 1 emin and p ê p 1 ea
Since
δ
f p1 2
fb
and fa is a multiple of 2 p1 , one concludes
δ
1sa fa 1sb 2
fb p1 1sa fa 1sb f
S p1ea 2ea 1sa fa 1sb f
) - %
Figure 8.6 depicts an add/subtract unit which is divided into two pipeline
stages. The essential inputs are the following
a sa ea fa b sb eb fb
AlignShift
[0:52]
sa
SigAdd
fb3 fszero
fb[0:52] [0:55]
Sign Select
eb[10:0] sa2 ss
ss1
sb sb2
sub sx
sb’ sb’
sa
ZEROs
sa
sb sb
INV
SpecAS
fla, flb INFs fls
NANs
sa
nan
nan RM[1:0]
the rounding mode RM, which is needed for the sign computation,
and
b 1sub b
sb eb fb sb sub eb fb
S ab
The first stage outputs sign bits sa2 , sb2 , an exponent es , and significands
fa2 , fb3 satisfying
es maxea eb
S pê 2es 1sa2 fa2 1sb2 fb3
The second stage adds the significands and performs the sign computation.
This produces the sign bit ss and the significand fs .
+%
Let the rounding mode RM be provided with delay ARM . Let the circuit
S IG A DD delay the significand fs by DSigAdd f s and the flags f szero and
ss1 by DSigAdd f lag. The cost and cycle time of the add/subtract circuit
and the accumulated delay AAddSub of its outputs can then be expressed as
The circuit A LIGN S HIFT depicted in figure 8.7 is somewhat tricky. Subcir-
cuit E XP S UB depicted in figure 8.8 performs a straightforward subtraction
of n-bit two’s complement numbers. It delivers an n 1-bit two’s com-
plement number asn : 0. We abbreviate
as asn : 0
then
as ea eb
ea eb as 0 asn 1
This justifies the use of result bit asn as the signal ‘eb gt ea’ (eb greater
than ea ), and we have
es maxea eb
#(
)
A DDITION AND
0 S UBTRACTION
es[10:0]
1
eb_gt_ea
ea[10:0] as[12:0] as2[5:0]
ExpSub Limit
LRS(55) fb3[0:54]
eb[10:0]
eb_gt_ea fb3[55]
Sticky
(sticky)
fa[0:52] fb2[0:54]
sa fa2[0:52]
Swap
sa2
fb[0:52] sx
sb’ sb2
10 1
add(12)
as[11]
eb_gt_ea as[10:0]
Circuit E XP S UB
1 as1[10:0] [5:0]
7 as2[5:0]
as[10:0] 0 Ortree 6
[10:6]
eb_gt_ea
Circuit L IMIT which approximates and limits the shift distance
#(#
)
Cost and delay of circuit E XP S UB run at
F LOATING P OINT
A LGORITHMS AND CExpSub Cinv 11 Cadd 12
DATA PATHS
DExpSub Dinv Dadd 12
δ as
The obvious way to compute this distance is to complement and then in-
crement asn : 0 in case as is negative. Because this computation lies on
the critical path of this stage, it makes sense to spend some effort in order
to save the incrementer.
Therefore, circuit L IMIT depicted in figure 8.9 first computes an approx-
imation as1 n 1 : 0 of this distance by
asn 1 : 0 if as 0
as1 n 1 : 0
asn 1 : 0 if as 0
Since
0 δ 1 2n 1
we have
as1 n 1 : 0 asn : 0 δ 1
Thus,
δ if ea eb
as1
δ 1 if ea eb
Circuit L IMIT of figure 8.9 has the following cost and delay
#(&
)
sa, fa[0:52] sb’, fb[0:52] sb’ fb[0:52] 0 sa 0 fa[0:52]
A DDITION AND
S UBTRACTION
0 1 eb_gt_ea 0 1
0
sa2, fa2[0:52] sb2, fb2[0:54]
Circuit S WAP in figure 8.10 swaps the two operands in case ea eb . In this
case, the representation of significand fa will be shifted in the alignment
shifter by a shift distance δ 1 which is smaller by 1 than it should be. In
this situation, the left mux in figure 8.10 preshifts the representation of fa
by 1 position to the right. Hence,
fa fb if ea eb
fa2 fb2
fb fa 2 if ea eb
It follows that
2 δ fb if ea eb
2 as1
fb2 δ
2 fa if ea eb
Note that operand fb2 is padded by a trailing zero and now has 54 bits after
the binary point. The swapping of the operands is done at the following
cost and delay
3 +
The right part of circuit L IMIT limits the shift distance of the alignment
shift. Motivated by theorem 8.1 (page 359), we replace significand 2 as1
fb3 2
as1
fb2 p1
hdec(6)
sticky
as2[5:0]
[54]
fb2[0]
b log p 3 6
and
B 2b 1 1b p2
then
n1
as1 B 1
i b
and
B if as1 B
as2
as1 otherwise
The alignment shift computation is completed by a 55-bit logical left
shifter and the sticky bit computation depicted in figure 8.11.
% -
Consider figure 8.12. If fb2 0 : p 1 is shifted by as2 bits to the right,
then for each position i bit fb2 i is moved to position i as2 . The sticky
bit computation must OR together all bits of the shifted operand starting at
position p 2. The position i such that bit fb2 i is moved to position p 2
is the solution of the equation
This means, that the last as2 bits of fb2 0 : p 1 must be ORed together.
The last p 2 outputs of the half decoder in figure 8.11 produce the mask
0 p2 as2
1 as2
ANDing the mask bitwise with fb2 and ORing the results together produces
the desired sticky bit. Cost and delay of circuit S TICKY run at
CSticky Chdec 6 55 Cand CORtree 55
DSticky Dhdec 6 Dand DORtree 55
The correctness of the first stage now follows from the theorem 8.1 because
¼ δ
2ea 1sa fa 1sb 2
fb p1 if ea eb
S p ê ¼ δ
2eb 1sb fb 1sa 2 fa p1
if ea eb
2es 1sa2 fa2 1sb2 2 δ fb2 p1
2es 1sa2 fa2 1sb2 fb3
(8.1)
4
Figure 8.13 depicts the addition/subtraction of the significands fa2 and fb3 .
Let
0 if sa sb
sx sa sb
1 if sa sb
#(/
)
00 fa2[0:52] 000 00 fb3[0:55] sx
F LOATING P OINT
A LGORITHMS AND 2 53 3
DATA PATHS
add(58)
ovf neg sum[-2:55]
Therefore, both the sum and its absolute value can be represented by a
two’s complement fraction with 3 bits before and p 2 bits behind the
binary point.
Converting binary fractions to two’s complement fractions and extend-
ing signs, the circuit S IG A DD computes
0
2
fa2 0 fa2 1 : p 103
sx
2
fb3 0 sx fb3 1 : p 2 sx sx 2
p2
abs[n-2:0]
Let
neg sum2
be the sign bit of the two’s complement fraction sum2 : 0sum1 : p
1. Table 8.2 lists for the six possible combinations of sa , sb and neg the
resulting sign bit ss1 such that
¼
1ss1 fs 1sa fa2 1sb fb3 (8.2)
holds. In a brute force way, the sign bit ss1 can be expressed as
For the factoring ss1 es fs it then follows from the Equations 8.1 and 8.2
that
S sa ea fa sb eb fb
pê 2 1sa2 fa2
es
1sb2 fb3
2es 1ss1 fs
+%
Circuit S IGN generates the sign bit ss1 in a straightforward manner at the
following cost and delay:
A LGORITHMS AND
result sa sb neg ss1
DATA PATHS
fa2 fb3 0 0 0 0
impossible 0 0 1 *
fa2 fb3 0 1 0 0
fa2 fb3 0 1 1 1
fa2 fb3 1 0 0 1
fa2 fb3 1 0 1 0
impossible 1 1 0 *
fa2 fb3 1 1 1 1
For the delay of the significand add circuit S IG A DD, we distinguish be-
tween the flags and the significand fs . Thus,
CSigAdd Cxor 58 Cadd 58 Czero 58 CAbs 58 CSign
DSigAdd f lag Dxor Dadd 58 maxDzero 58 DSign
DSigAdd f s Dxor Dadd 58 DAbs 58
The circuit S PEC AS checks whether the operation involves special num-
bers, and checks for an invalid operation. Further floating point exceptions
– overflow, underflow and inexact result – will be detected in the rounder.
Circuit S PEC AS generates the following three flags
INFs signals an infinite result,
-
If the result is a finite non-zero number, circuit S IG A DD already provides
the correct sign ss1 . However, in case of a zero or infinite result, special
rules must be applied (section 7.4.2). For NaNs, the sign does not matter.
In case of an infinite result, at least one operand is infinite, and the result
retains the same sign. If both operands are infinite, their signs must be
alike. Thus, an infinite result has the following sign
sa if INFa
ss3
sb if INFb INFa
In case of an effective subtraction sx sa sb 1, a zero result is
always positive, except for the rounding mode rd (round down) which is
coded by RM 1 : 0 11. In case of sx 0, the result retains the same sign
as the a operand. Thus, the sign of a zero result equals
0 if sx RM 1 NOR RM 0
sx RM 1 RM 0
ss2
1
sa
if
if sx
Depending on the type of the result, its sign ss can be expressed as
ss3 if INFs
INFs
ss
ss2
ss1
if
if INFs
fs 0
fs 0
#/
)
RM0] sa
F LOATING P OINT RM[1]
A LGORITHMS AND 0 1 sx sa sb’
ss1 ss2
DATA PATHS
fszero 0 1 INFa 1 0
INFs ss3
NANs INFs 0 1
ZEROs ss
The cost and the maximal delay of circuit S IGN S ELECT can be ex-
pressed as
η̂ a sa ea lza fa
η̂ b sb eb lzb fb
then
ab sq eq q
and the exponent e of the rounded result satisfies
e eq 1
and hence
2eq fd pe¼ 2eq q
Thus, it suffices to determine fd and then feed sq eq fd into the
rounding unit.
x
x0 x1 x2 f(x)
Newton iteration for finding the Zero x̄ of the mapping f x, i.e.,
f x̄ 0. The figure plots the curve of f x and its tangents at f x i for i 0 1 2.
xi1 as the zero of the tangent. From figure 8.16 it immediately follows
that
f xi 0
f xi
xi xi1
Solving this for xi1 gives
xi1 xi f xi f xi
Determining the inverse of a real number fb is obviously equivalent to
finding the zero of the function
f x 1x fb
The iteration step then translates into
xi1 xi 1xi fb x2i
xi 2 fb xi
Let δi 1 fb xi be the approximation error after iteration i, then
δ i 1 1 fb xi1
1 fb 2xi fb x2i
fb 1 fb xi 2
fb δ2i 2 δ2i
Observe that δi 0 for i 1.
#/&
)#
For later use we summarize the classical argument above in a somewhat
peculiar form: M ULTIPLICATION
AND D IVISION
Let (
xi1 xi 2 fb xi
δi 1 fb xi and
δi1 1 fb xi1
δi1 2 δ2i
x0 0x0 1 : γ 1
We first show the upper bound. Consider the mapping f x 1x as de-
picted in figure 8.17. Let u v 1 2 and let u v, then
1.0 u v 2.0
The mapping gx f u f ¼ u x u is the tangent to f x 1x
at x u.
For the lower bound we first show that the product of two representable
numbers u and v cannot be 1 unless both numbers are powers of 2. Let ui
and v j be the least significant nonzero bits of (the representations of) u and
v. The product of u and v then has the form
for some integer A. Thus, the product can only be 1 if A 0, in which case
the representations of u and v have both an 1 in the single position i or j,
respectively.
1 any finite precision approximation of 1 fb
Thus, for representable fb
is inexact, and the lower bound follows for all fb 1.
For fb 1 we have fb 1 2 γ 1
. Consider again figure 8.17. The
mapping f x 1x is convex and lies in the interval (1,2) entirely under
the line through the points (1,1) and (2,1/2). The line has slope 12.
Thus,
1
f 1 t 1 t 2
1t
for all t 0 1. For t 2 γ 1 we get
γ 2
x f fb 1 2
#/(
)#
)## ! 78 . , "
M ULTIPLICATION
We establish some notation for arguments about finite precision calcula- AND D IVISION
tions where rounding is done by chopping all bits after position σ. For real
numbers f and nonnegative integers σ we define
f σ f 2σ 2σ
then
f 0 f
Moreover, if f f i : 0 f 1 : s and s σ, then
f σ f i : 0 f 1 : σ
z z0z1 : s
2z 100s 0z0z1 : s
100s 1z0z1 : s 2 s
mod 4
z0z1 : s 2 s
s
z0z1 : σ ∑ zi 2
i
2
s
i σ1
σ
z0z1 : σ 2
#//
)
The simplified finite precision Newton-Raphson iteration is summarized
F LOATING P OINT as
A LGORITHMS AND
DATA PATHS zi fb xi
Ai zi 0 : σ
xi1 xi Ai σ
δi 1 fb xi
Ai appr 2 fb xi
( Let σ 4, let x0 12 1 and let 0 δ0 18. Then
xi1 0 1 and
σ1
0 δi1 2 δ2i 2
14
for all i 0.
δi1 ∆1 ∆2 ∆3
where
∆1 1 fb xi 2 zi
∆2 xi 2 zi xi Ai
∆3 xi Ai xi Ai σ
0 ∆1 2 δ2i
0 zi fb xi 2
0 ∆2 xi 2 zi Ai
xi 2σ 2 σ
Obviously, we have
σ
0 ∆3 2
#/)
)#
and the first two inequalities of the lemma follow. By induction we get
M ULTIPLICATION
δ i 1 2 δ2i 2σ1 AND D IVISION
18 18 14
14 1 fb 14 xi 1 fb 1
2 p2
2 if p 24
i then δi
3 if p 53
δ0 15 2 9
δ1 2 152 2 18
2
56
46 2 18
δ2 4232 2 36
2
56
4233 2 36
2 30
δ3 35837 2 72
2
55
35 2 62
2
56
2 55
By similar arguments one shows that one iteration less suffices, if one
starts with γ 15, and one iteration more is needed if one starts with γ 5
(exercise 8.2). The number of iterations and the corresponding table size
and cost are summarized in table 8.3 We will later use γ 8.
#/*
)
F LOATING P OINT Size and cost of the 2 γ γ lookup ROM depending on the number of
A LGORITHMS AND iterations i, assuming that the cost of a ROM is one eighth the cost of an equally
DATA PATHS sized RAM.
lookup ROM
i γ
size [K bit] gate count
1 15 480 139277
2 8 2 647
3 5 0.16 61
xi 1 fb xi 2 p2
fa xi fa fb q fa xi 2 p1
Thus,
fa xi p1 fa xi q
fa xi 2 p1
fa xi p1 2
p
In other words,
E fa xi p1
is an approximation of q, and the exact quotient lies in the open interval
E E 2 p . Moreover, we have
E 2 p2
if fa fb E 2 p1
determines which one of the three cases applies, and whether the result is
exact.
#)
lza[5:0] lzb[5:0]
)#
fa[0:52] fb[0:52] ea[10:0] eb[10:0] M ULTIPLICATION
sa sb nan fla flb
AND D IVISION
SigfMD Sign/ExpMD SpecMD
53
(with register stage)
sq, eq[12:0] nan, ZEROq, INFq, NANq, INV, DBZ
fq[-1:55] flq
where for n p 11 53 the exponents are given as n–bit two’s comple-
ment numbers
and for
r log p
the numbers of leading zeros are given as r–bit binary numbers
In the absence of special cases the factorings are normalized, and thus
fa fb 1 2
#)
)
For operations Æ , let
F LOATING P OINT
A LGORITHMS AND x aÆb
DATA PATHS
be the exact result of the operation performed, and let
η̂ x s ê fˆ
+%
Circuit S IGF MD which produces the significand fq has an internal register
stage. Thus, the cost and the cycle time of the multiply/divide circuit and
the accumulated delay AMulDiv of its outputs can be expressed as
1 -
Figure 8.19 depicts the circuit S IGN /E XP MD for the computation of the
sign and the exponent. The computation of the sign
sq sa sb
We can estimate eq by
1
add(13)
sq eq[12:0]
#)#
)
fa[0:52] 05 fb[0:52] 05
F LOATING P OINT
faadoe fbbdoe
A LGORITHMS AND opa[0:57]
DATA PATHS
Aadoe
opb[0:57]
[1:8]
4/2mulTree(58, 58)
xbdoe
256 x 8 116 116
xadoe
Eadoe
lookup cce c sce s
table
adder (116)
8 fm[-1:114]
01 048
[0:57] [0:25] [26:54]
58 60
db29 Ortree
01
tlu 1 0 [-1:54]
A Ai 1
x xi
Dcnt dcnt0 i
The loop is left after i dcnt0 iterations. For this i, we have after state
#)&
2 10 11
)#
zero? decrement
0 1 db M ULTIPLICATION
Dcnt Dcntce dcnt0 AND D IVISION
Dcntzero tlu 0 1
unpack lookup
Newton 1 quotient 1
Newton 2 quotient 2
Newton 3 quotient 3
Newton 4 quotient 4
Dcnt > 0 Dcnt = 0
FSD underlying the iterative division. The states to
represent one Newton-Raphson iteration. Dcnt counts the number of iterations;
it is counted down.
$ &%&
E fa xi p1
Eb E fb
f a Da and fb Db
- S ELECT FD
Figure 8.23 depicts the circuit selecting the p 1-representative fd of the
quotient q according to the RTL instructions of state & , . Since
E E 2 p1
#)'
)
F LOATING P OINT RTL instructions of the iterative division (significand only). A multi-
A LGORITHMS AND plication always takes two cycles.
DATA PATHS
state RTL instruction control signals
unpack normalize FA, FB
lookup x table fb xce, tlu, fbbdoe
Dcnt db?3 : 2 Dcntce,
Newton 1/2 Dcnt Dcnt 1 Dcntce, xadoe, fbbdoe
A appr 2 x b 57 Ace
Newton 3/4 x A x57 Aadoe, xbdoe, sce, cce
xce
quotient 1/2 E a xp1 faadoe, xbdoe, sce, cce
Da f a Db f b faadoe, fbbdoe, Dce, Ece
quotient 3/4 Eb E fb Eadoe, fbbdoe, sce, cce
Ebce
select fd E E 2 p1,
β f
a Eb 2
p1 fb
E 2 p2 ; if β 0
fd
EE 2 p2 ;; ifif ββ
0
0
029 029
11
00
E[0:25] E[26:54]
00 01
11
Eb[0:114] 1 0 db
129 sfb[25:111]
1 0 db 0 1 1 126 13
3/2 adder(116)
inc(55)
56 1
0 E’[-1:54] adder (117)
neg
1 0
0110
beta
r[-1:54]
zero(117)
27 1 28 db
1 0 db
#)(
)#
its computation depends on the precision. For double precision (p 53)
holds M ULTIPLICATION
E 0E 1 : 54 E 0E 1 : 54 2 54 AND D IVISION
54
E 0E 1 : 25 ∑2 i
2
54
i 26
E 0E 1 : 25 129 2 54
β fa Eb 2 p1 fb
E if β 0
r
E if β 0
then
r if β0
r 2 p2
fd
if β 0
Thus, in case β 0 one has to force bit fd p 2 to 1.
+%
Figure 8.23 depicts circuit S ELECT FD which selects the representative of
the quotient. The cost and the delay of this circuit run at
#)/
)
CSelectF d Cinc 55 Cmux 29 Cmux 56 Cmux
F LOATING P OINT
Cmux 87 C3 2add 116 Cadd 117
A LGORITHMS AND
DATA PATHS Czero 117 203 Cinv Cand
DSelectF d 2 Dmux maxDinc 55 Dmux
2 Dinv D32add 116 Dadd 117 Dzero 117
Circuit S ELECT FD is part of the circuit which performs the division and
multiplication of the significands. The data paths of circuit S IGF MD have
the following cost
The counter Dcnt and the control automaton modeled by figure 8.22 have
been ignored. The accumulated delay of output fq and the cycle time of
circuit S IGF MD can be expressed as:
The flag INVd which indicates an invalid division is signaled in the fol-
lowing three cases (section 7.4.4): when an operand is a signaling NaN,
when both operands are zero, or when both operands are infinite. Thus,
The IEEE exception flag INV is selected based on the type of the operation
INVm if f div
INV
INVd if f div
8 -
The flags NANq, INFq and ZEROq which indicate the type of a special
result are generated according to the tables 7.4 and 7.5.
The result is a quiet NaN whenever one of the operands is a NaN, and
in case of an invalid operation; this is the same for multiplications and di-
visions. Since signaling NaNs are already covered by INV, the flag NANq
can be generated as
+%
All the inputs of the floating point rounder have zero delay since they are
taken from registers. Thus, the cost and cycle time of the rounder FP RND
and the accumulated delay AFPrnd of its outputs run at
CFPrnd CNormShi f t CREPp C f f 140 CSigRnd
CPostNorm CAd justExp CExpRnd CSpecFPrnd
TFPrnd ANormShi f t DREPp ∆
AFPrnd ASigRnd DPostNorm DAd justExp DExpRnd DSpecFPrnd
#*
)&
UNF/OVFen fr er s flr
F LOATING P OINT
fn REPp 58 ROUNDER
NormShift
11 RM
SigRnd
f2
PostNorm
SIGovf e2
RND f3
AdjustExp
OVF e3
ExpRnd
IEEEp Fp[63:0]
Let x
2 0 be the exact, finite result of an operation, and let
2 α x ; if OV F OV Fen
2α x ; if UNF UNFen
y
x ; otherwise (8.3)
η̂ x s ê fˆ
η y s e f
The purpose of circuit RND (figure 8.24) is to compute the normalized,
packed output factoring s eout fout such that s eout fout r y, i.e.,
Moreover the circuit produces the flags TINY, OVF and SIGinx. The ex-
ponent in the output factoring is in biased format. The inputs to the circuit
are
the mask bits UNFen and OVFen (underflow / overflow enable)
#*
)
the rounding mode RM 1 : 0
F LOATING P OINT
A LGORITHMS AND the signal dbr (double precision result) which defines
DATA PATHS
11 53 ; if dbr 1
n p
8 24 ; otherwise
The input factoring has only to satisfy the following two conditions:
By far the most tricky part of the rounding unit is the normalization
shifter N ORM S HIFT. It produces an approximated overflow signal
OV F1 2er fr 2emax 1
which can be computed before significand rounding takes place. The re-
sulting error is characterized by
e α ; if OV F2 OV Fen
en
e ; otherwise (8.6)
fn p f
SIGinx 1 f2 f1
After post normalization, the correct overflow signal is known and the error produced by the approximated overflow signal OVF1 can be corrected in circuit ADJUSTEXP. Finally, the exponent is rounded in circuit EXPRND:

(e3, f3) = (emax + 1 − α, 1)   if OVF2 ∧ OVFen
(e3, f3) = (e2, f3)             otherwise          (8.8)

(eout, fout) = exprd(s, e3, f3).

In addition, circuit EXPRND converts the result into the packed IEEE format, i.e., bit fout[0] is hidden, and emin is represented by 0^n in case of a denormal result.
With the above specifications of the subcircuits in place, we can show in a straightforward way: if the subcircuits satisfy the above specifications, then equation (8.4) holds, i.e., the rounder RND works correctly for a finite, non-zero x.
Let lz be the number of leading zeros of fr[1:55]. In general, the normalization shifter has to shift the first 1 in fr to the left of the binary point and to compensate for this in the exponent. If the final result is a denormal number, then x must be represented as

2^{er} · fr = 2^{emin} · (2^{er − emin} · fr).

This requires a left shift by er − emin, which in many cases will be a right shift by emin − er (see exercise 8.3). Finally, for a wrapped exponent one might have to add or subtract α in the exponent. The normalization shifter in figure 8.25 works along these lines.
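As a rough illustration of this case split, here is a software sketch (the variable names lz, er, emin, tiny and the 64-bit significand window are assumptions made for the example, not the circuit's exact interface):

```c
#include <stdint.h>

/* Normalization-shift case split: a normal result removes the leading
 * zeros, a denormal result aligns the significand to emin.  A positive
 * distance is a left shift, a negative one a right shift. */
static uint64_t norm_shift(uint64_t fr, int lz, int er, int emin, int tiny)
{
    int sigma = tiny ? (er - emin)   /* denormal: align to emin      */
                     : lz;           /* normal: remove leading zeros */
    if (sigma >= 0)
        return fr << sigma;
    else
        return fr >> (-sigma);       /* right shift by emin - er     */
}
```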
First, in circuit FLAGS the signals TINY, OVF1 and the binary representation lz[5:0] of the number lz are computed. Then, the exponent en and the (left) shift distance σ are computed in circuits EXPNORM and SHIFTDIST.
We derive formulae for en and σ such that equations (8.6) hold. From equations (8.3) and

UNF ∧ UNFen = TINY ∧ UNFen

we conclude

y = (−1)^s · 2^{ê − α} · f̂   if OVF ∧ OVFen
y = (−1)^s · 2^{ê + α} · f̂   if TINY ∧ UNFen
y = (−1)^s · 2^{ê} · f̂        otherwise.

(Figure 8.25: normalization shifter NORMSHIFT; circuit FLAGS computes lz[5:0], circuits EXPNORM and SHIFTDIST compute en and the shift distance sh[12:0] for SIGNORMSHIFT.)

The two factorings η̂(y) and η(y) are the same except if y is denormal, i.e., if TINY ∧ ¬UNFen. In this case,

η(y) = (s, emin, 2^{ê − emin} · f̂),

and therefore

e = ê − α    if OVF ∧ OVFen
e = ê + α    if TINY ∧ UNFen
e = emin     if TINY ∧ ¬UNFen
e = ê        otherwise                         (8.9)

and

σ = lz, unless TINY ∧ ¬UNFen.
If TINY ∧ ¬UNFen holds, then x = y and ê < emin. From equations (8.9) and (8.5) we know

f = 2^{ê − emin} · f̂    and    f̂ ≡_{p − ê} 2^{er − ê} · fr,

and therefore

f ≡_{p − emin} 2^{er − emin} · fr,

so that the required relation fn ≡_p f holds for fn = 2^{er − emin} · fr. With the above specifications of the subcircuits of the normalization shifter in place (up to issues of number format), we can immediately conclude that

fn ≡_p fr · 2^σ ≡_p f.

Then equations (8.4) hold, i.e., the normalization shifter works correctly.
Figure 8.26 depicts circuit FLAGS which determines the number lz of leading zeros and the flags TINY and OVF1. The computation of lz[5:0] is completely straightforward. Because no overflow occurs if fr[−1:0] = 00, OVF1 can be derived by comparing er with emax. Now recall that bias = emax = 2^{n−1} − 1 = ⟨1^{n−1}⟩. For the TINY flag we have

TINY ⟺ er − 1 < emin            if f ∈ [1, 2)
TINY ⟺ er − 1 − lz < emin       if f ∈ (0, 1)
     ⟺ er − 1 − lz − emin < 0,
(Figure 8.26: circuit FLAGS; a leading zero counter lzero(64) processes fr[−1:55], and a 13-bit adder together with an equality tester equal(13) compare er[12:0] with emax in order to derive lz[6:0], OVF1 and TINY.)
because lz = 0 for an f in the interval [1, 2). Thus, the TINY flag can be computed as the sign bit of the sum of the above four operands. Recall that emin = 1 − bias. Thus

C_FLAGS = C_lz(64) + C_add(13) + C_EQ(13) + 8 · C_inv + 5 · C_or + 3 · C_and

A_FLAGS = max{ D_lz(64) + D_inv + D_add(13), D_EQ(13) + D_and + D_or } + 4 · D_or + 2 · D_and.
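For intuition, a small software sketch of the lz and TINY computation (the 57-bit window fr[−1:55] is assumed to be left-aligned in a 64-bit word; the helper names are illustrative):

```c
#include <stdint.h>

/* Count the leading zeros of the significand window, here assumed to
 * be stored left-aligned in a 64-bit word. */
static int leading_zeros(uint64_t fr)
{
    int lz = 0;
    for (uint64_t mask = 1ULL << 63; mask && !(fr & mask); mask >>= 1)
        lz++;
    return lz;
}

/* TINY <=> er - 1 - lz < emin, i.e. the sign of the sum
 * er - 1 - lz - emin is negative (the rounded result may be denormal). */
static int tiny_flag(int er, int lz, int emin)
{
    return (er - 1 - lz - emin) < 0;
}
```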
then the circuit computes the following sums sum and sum + 1:

sum = er − lz + γ + bias
    = er + ⟨1…1 ¬lz[5:0]⟩ + 1 + γ + bias
    = er + ⟨1…1 ¬lz[5:0]⟩ + δ,

where

δ = bias − α + 1   if OVF ∧ OVFen
δ = bias + α + 1   if TINY ∧ UNFen
δ = bias + 1        otherwise.

Recall that α = 3 · 2^{n−2} = ⟨110^{n−2}⟩ and bias = 2^{n−1} − 1 = ⟨1^{n−1}⟩. Hence, as (n+2)-bit two's complement constants,

bias + 1      = ⟨0010^{n−1}⟩
bias + α + 1  = ⟨110^{n−2}⟩ + ⟨10^{n−1}⟩ = ⟨01010^{n−2}⟩
bias − α + 1  = ⟨11110^{n−2}⟩.
(Figure 8.27: circuit EXPNORM; a 3/2-adder and a compound adder add2(11) combine er[10:0], the complemented lz[5:0] and the constant δ selected by OVF1 ∧ OVFen and TINY ∧ UNFen; multiplexers controlled by TINY and UNFen then select en[10:0] and eni[10:0] among the sums and emin, emin + 1.)
Circuit SHIFTDIST provides the shift distance of the normalization shift. Depending on the precision, the constant 1 − emin equals ⟨0^3 dbr^3 1^7⟩.
The circuit in figure 8.28 implements the shift distance σ of the equations (8.10) in a straightforward way. Recall that emin = 1 − bias = 2 − 2^{n−1}. Thus

1 − emin = 1 − (2 − 2^{n−1}) = 2^{n−1} − 1 = ⟨1^{n−1}⟩.

It follows that

[sh[12:0]] = σ     (as a two's complement number).

The shift is a right shift if sh[12] = 1.
Circuit SHIFTDIST generates the shift distance sh in the obvious way. Since the inputs of the adder have zero delay, the cost and the accumulated delay of the shift distance can be expressed as

C_ShiftDist = C_add(13) + C_mux(13) + C_and + C_inv

A_ShiftDist = max{ D_add(13), A_FLAGS + D_and, A_UNFen + D_inv + D_and } + D_mux.
Let f = fr / 2. The output fs of the cyclic left shifter satisfies

fs = cls(f, σ)     if σ ≥ 0
fs = crs(f, −σ)    otherwise.

For non-negative shift distances the claim follows immediately. For negative shift distance σ it follows that

t = σ          if σ ≥ 0
t = −σ − 1     otherwise.

Next, the distance in the mask circuit is limited to 63: the output of the OR-tree equals 1 iff t ≥ 64, hence

sh' = t     if t ≤ 63          = σ          if 0 ≤ σ ≤ 63
sh' = 63    otherwise            −σ − 1     if −63 ≤ σ ≤ −1
                                  63         otherwise.

We show that the distance of the left shift in the significand normalization shift is bounded by 56, i.e., σ ≤ 56.
(Figure: mask generation of SIGNORMSHIFT; the shift distance sh[11:0] is conditionally complemented depending on sh[11], limited to 63 by an OR-tree, decoded by a half decoder hdec(6), conditionally flipped depending on sh[12], and ANDed to produce the masks u[0:63], v[0:63] and w[0:63].)
Left shifts have distance lz ≤ 56 or er − emin + 1. The second case only occurs if TINY ∧ ¬UNFen; in this case we have e = emin and fr ≠ 0. Assume er − emin ≥ 55.

In case the shift distance is negative, a 1 is appended at the right end and the string is flipped. Thus, for mask u we have

u[0:63] = 0^{64−σ} 1^{σ}        if 0 ≤ σ
u[0:63] = 1^{|σ|} 0^{64−|σ|}     if −63 ≤ σ ≤ −1
u[0:63] = 1^{64}                 if σ ≤ −64.
(Figure: relation between the strings fs and the masks v, w in the three cases a) 0 ≤ σ, b) −63 ≤ σ ≤ −1, and c) σ ≤ −64.)
and the whole normalization shifter NORMSHIFT has cost and delay as listed above. The sticky bit computation provides

st = ∨_{i ≥ p+1} fn[i]    and    f1[−1 : p+1] = fn[−1 : p] st.

(Figure 8.33: circuit SIGRND; the sticky bits st_sg, st_db are selected by dbr, the bits l, r, st feed the ROUNDING DECISION, and an incrementer inc(53) produces f2[−1:52] and SIGinx.)

Circuit SIGRND
In figure 8.33, the least significant bit l, round bit r and sticky bit st are selected depending on the precision and fed into circuit ROUNDING DECISION. The rounded significand is exact iff the bits r and st are both zero:

SIGinx = r ∨ st.

The rounding decision is made according to table 8.5, which was constructed such that

f2 = sigrd(s, f1)

holds for every rounding mode. Note that in mode rne (nearest even), the rounding decision depends on bits l, r and st but not on the sign bit s. In modes ru, rd, the decision depends on bits r, st and the sign bit s but not on l. In mode rz, the significand is always chopped, i.e., inc = 0. From table 8.5, one reads off

inc = r ∧ (l ∨ st)     if rne
inc = ¬s ∧ (r ∨ st)    if ru
inc = s ∧ (r ∨ st)     if rd.

With the coding of the rounding modes from table 8.1, this is implemented in a straightforward way by the circuit of figure 8.34.
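A compact software model of this decision logic (a sketch; the enum encoding of the modes is an assumption for the example, not the 2-bit coding of table 8.1):

```c
#include <stdbool.h>

enum rmode { RNE, RZ, RU, RD };    /* illustrative encoding */

/* Rounding decision: should the truncated significand be incremented?
 * l = least significant kept bit, r = round bit, st = sticky bit,
 * s = sign of the result. */
static bool round_inc(enum rmode rm, bool l, bool r, bool st, bool s)
{
    switch (rm) {
    case RNE: return r && (l || st);    /* round to nearest even */
    case RU:  return !s && (r || st);   /* toward +infinity      */
    case RD:  return  s && (r || st);   /* toward -infinity      */
    case RZ:
    default:  return false;             /* chop                  */
    }
}
```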
Table 8.5: Rounding decision of the significand rounding. The tables list the value of the flag inc which indicates that the significand needs to be incremented. On round to zero (rz), the flag equals 0.

l r st | rne        s r st | ru  rd
0 0 0  |  0         0 0 0  |  0   0
0 0 1  |  0         0 0 1  |  1   0
0 1 0  |  0         0 1 0  |  1   0
0 1 1  |  1         0 1 1  |  1   0
1 0 0  |  0         1 0 0  |  0   0
1 0 1  |  0         1 0 1  |  0   1
1 1 0  |  1         1 1 0  |  0   1
1 1 1  |  1         1 1 1  |  0   1

(Figure 8.34: the flag inc is computed from l, r, st, s and the rounding mode bits RM[1:0] by a few gates and a multiplexer.)
as

SIGovf = f2[−1].

In addition, circuit POSTNORM has to compute

(e2, f3) = (en + 1, 1)    if f2 ≥ 2
(e2, f3) = (en, f2)        otherwise.

Since the normalization shifter NORMSHIFT provides en and eni = en + 1, the exponent e2 can just be selected based on the flag SIGovf. With a single OR gate, one computes

f3[0] f3[1:52] = 1 0^{52}     if f2 ≥ 2
f3[0] f3[1:52] = f2[0:52]     otherwise

as f3[0] = f2[−1] ∨ f2[0] and f3[1:52] = f2[1:52].
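A small sketch of this post-normalization step (the significand f2[−1:52] is assumed to be stored with bit −1 at position 53 of a 64-bit word; the function is illustrative):

```c
#include <stdint.h>

/* Post normalization: if the rounded significand reached 2.0 (bit of
 * weight 2 set), replace it by 1.0...0 and use the precomputed
 * exponent eni = en + 1 instead of en. */
static void post_norm(uint64_t f2, int en, int eni,
                      uint64_t *f3, int *e2, int *sigovf)
{
    *sigovf = (int)((f2 >> 53) & 1);     /* f2 >= 2 ?            */
    if (*sigovf) {
        *f3 = 1ULL << 52;                /* 1.0...0              */
        *e2 = eni;
    } else {
        *f3 = f2;
        *e2 = en;
    }
}
```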
Thus, the cost and the delay of the post normalization circuit POSTNORM are
Circuit ADJUSTEXP

The circuit shown in figure 8.36 corrects the error produced by the OVF1 signal in the most obvious way. The error situation OVF2 is recognized by an active SIGovf signal and ⟨e2[10:0]⟩_bias = emax + 1. Since ⟨x⟩_bias = x + bias, this amounts to an equality test of e2[10:0] against a fixed constant; in the error case the biased representation of emax + 1 − α is selected as e3[10:0].

(Figures 8.36 and 8.37: circuits ADJUSTEXP and EXPRND; multiplexers controlled by OVF2 and OVF ∧ OVFen select e3[10:0] and the packed outputs eout[10:0], fout[1:52].)
(Table: infinity decision of the exponent rounding; it lists the value of the flag inf which indicates that the exponent must be set to infinity.)

Circuit SPECFPRND

This circuit (figure 8.38) covers the special cases and detects the IEEE floating point exceptions overflow, underflow and inexact result. In case a is a finite, non-zero number, the factoring (s, er, fr) is a p-representative of a and
(Figure 8.38: circuit SPECFPRND; subcircuit SPECSELECT selects among (s, eout, fout), the special values and nan based on ZEROr, NANr, INFr and OVF, while RNDEXCEPTIONS derives OVFp, UNFp and INXp; the outputs are Fp[63:0] and the flags IEEEp.)
(s, eout, fout) = rd(s, er, fr).

Depending on the flag dbr, the output factoring is either in single or double precision. The single precision result is embedded in the 64-bit word Fp according to figure 8.1. The circuit PRECISION implements this selection in the obvious way with a single 64-bit multiplexer.
In addition, circuit SPECFPRND detects the floating point exceptions OVF, UNF and INX according to the specifications of section 7.3. These exceptions are masked on a special result (spec):

OVFp = ¬spec ∧ OVF
UNFp = ¬spec ∧ ( (TINY ∧ UNFen) ∨ (TINY ∧ ¬UNFen ∧ LOSSb) )
INXp = ¬spec ∧ INX.

Since an overflow and an underflow never occur together, signal INX can be expressed as

INX = SIGinx ∨ (OVF ∧ ¬OVFen).
Figure 8.40 depicts the schematics of circuit FCON. The left subcircuit compares the two operands FA2 and FB2, whereas the right subcircuit either computes the absolute value of operand FA2 or reverses its sign. Thus, circuit FCON provides the following outputs:
(Figure 8.40: circuit FCON; the left subcircuit performs the condition test using an equality tester EQ(64) and an adder add(64) on FA2 and FB2, whereas the right subcircuit implements the absolute value and negate operations on FA2.)
Its data inputs are the two packed IEEE floating point operands

a = [sa, eA[n−1:0], fA[1:p−1]],    b = [sb, eB[n−1:0], fB[1:p−1]],

and the flags fla and flb which signal that the corresponding operand has a special value. The circuit is controlled by
Except for the flags fla and flb, which are provided by the unpacker FPUNP, all inputs have zero delay. The cost of circuit FCON and the accumulated delay of its outputs are given after table 8.7.

Table 8.7: Coding of the floating point test condition
predicate           coding      less  equal  unordered   INV if
true      false     FCON[3:0]                            unordered
F         T         0000         0     0      0
UN        OR        0001         0     0      1
EQ        NEQ       0010         0     1      0          No
UEQ       OGL       0011         0     1      1
OLT       UGE       0100         1     0      0
ULT       OGE       0101         1     0      1
OLE       UGT       0110         1     1      0
ULE       OGT       0111         1     1      1
SF        ST        1000         0     0      0
NGLE      GLE       1001         0     0      1
SEQ       SNE       1010         0     1      0          Yes
NGL       GL        1011         0     1      1
LT        NLT       1100         1     0      0
NGE       GE        1101         1     0      1
LE        NLE       1110         1     1      0
NGT       GT        1111         1     1      1
C_FCon = C_EQ(64) + C_add(64) + 63 · C_inv + C_FPtest + C_and + C_nor

A_FCon = max{ D_EQ(64), D_inv + D_add(64), A_FPunp(fla, flb) } + D_FPtest + D_and.
Table 8.7 lists the coding of the predicates to be tested. The implementa-
tion proceeds in two steps. First, the basic predicates unordered, equal and
less than are generated according to the specifications of section 7.4.5, and
then the condition flag fcc and the invalid operation flag inv are derived as
The operands a and b compare unordered if and only if at least one of them is a NaN. It does not matter whether the NaNs are signaling or not. Thus, the value of the predicate unordered equals 1 exactly in this case.

Note that for the condition test the sign of zero is ignored (i.e., +0 = −0), and that NaNs never compare equal. Thus, the result of the predicate equal can be expressed as

equal = 1   if a = b = 0
equal = 0   if a ∈ {NaN, sNaN}
equal = 0   if b ∈ {NaN, sNaN}
equal = e   otherwise.
According to section 7.4.5, the relation < is a true subset of (R ∪ {±∞})². Thus, the value of the predicate less can be expressed as

less = l ∧ ¬unordered,

where for any two numbers a, b ∈ R ∪ {±∞} the auxiliary flag l indicates that

l = 1 ⟺ a < b.

Let sign denote the sign bit of the difference of the two magnitudes ⟨eA fA⟩ − ⟨eB fB⟩ computed by the adder in figure 8.40; we then have

l = (¬sa ∧ ¬sb ∧ sign) ∨ (sa ∧ ¬sb ∧ (ZEROa NAND ZEROb)) ∨ (sa ∧ sb ∧ (sign NOR e)).
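A software sketch of the same case split on the packed 64-bit encodings (assuming IEEE double bit layout; the helper and the zero flags are illustrative, the NaN handling is left to the unordered predicate as in the text):

```c
#include <stdbool.h>
#include <stdint.h>

/* a < b for packed IEEE doubles given as raw encodings A, B, following
 * the sign / magnitude case split of circuit FCON. */
static bool less_flag(uint64_t A, uint64_t B, bool zero_a, bool zero_b)
{
    bool sa = (A >> 63) & 1, sb = (B >> 63) & 1;
    uint64_t ma = A & ~(1ULL << 63), mb = B & ~(1ULL << 63);
    bool e    = (ma == mb);                 /* equal magnitudes      */
    bool sign = (ma < mb);                  /* magnitude comparison  */

    return (!sa && !sb && sign)                    /* both >= 0       */
        || ( sa && !sb && !(zero_a && zero_b))     /* a < 0 <= b      */
        || ( sa &&  sb && !sign && !e);            /* both < 0        */
}
```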
(The remaining inequalities verify that the binary values ⟨eA fA⟩ and ⟨eB fB⟩ order the magnitudes correctly, the largest finite operand corresponding to ⟨1^{n−1}0 1^{p−1}⟩ and infinity to ⟨1^{n−1}1 0^{p−1}⟩.)
For the packed floating point operand a ∈ R ∪ {±∞} with a = [sA, eA, fA], the absolute value satisfies |a| = [0, eA, fA]. Thus, the packed representation of the value |a| can simply be obtained by clearing the sign bit of operand FA2. The value −a satisfies

−a = [¬sA, eA, fA].

In the IEEE floating point standard, the two operations absolute value and sign reverse are considered to be special copy operations, and therefore, they never signal a floating point exception.
The floating point condition test is always exact and never overflows
nor underflows. Thus, it only signals an invalid operation; the remaining
exception flags are always inactive.
Depending on the control signal ftest, which requests a floating point condition test, circuit FCON selects the appropriate set of exception flags:

INV = inv ∧ ftest = inv   if ftest = 1
                    0     if ftest = 0

INX = UNF = OVF = DBZ = 0.
In the packed format, bit f[0] is hidden, i.e., it must be extracted from the exponent. Thus,

[sP, eP[n−1:0], fP[1:p−1]] = rds(x),

where

x = a · 2^{−α}   if OVFs(a) ∧ OVFen
x = a · 2^{α}    if UNFs(a) ∧ UNFen
x = a             otherwise.
If a is a zero, infinity or NaN, then

(sP, eP, fP) = (sA, 0^n, 0^{p−1})      if a = (−1)^{sA} · 0
(sP, eP, fP) = (sA, 1^n, 0^{p−1})      if a = (−1)^{sA} · ∞
(sP, eP, fP) = (sA, 1^n, 10^{p−2})     if a ∈ NaN.
In case of a double precision result, the rounding can be omitted due to the p = 53 bit significand. However, a normalization is still required. Thus, converting a two's complement integer x into a double precision floating point number provides the packed factoring

(sP, eP, fP) = η_d(x[31], 32, f[−1:32])   if x ≠ 0
(sP, eP, fP) = (0, 0^n, 0^{p−1})           if x = 0.

The same argument can be made for all four rounding modes: in this case, the conversion only needs to signal an invalid operation, but the rounding itself can be omitted. Such an overflow is signaled by Iovf1.
In case of Iovf1 = 0, the absolute value y is at most 2^{32}. Thus, y can be represented as a 33-bit binary number y[32:0] and −y as a 33-bit two's complement number. Let

x[32:0] = (y[32:0] ⊕ s^{33}) + s = y[32:0]   if s = 0
                                   z[32:0]   if s = 1,

where z[32:0] is the two's complement representation of −y. If the integer x lies in the set T32, it then has the two's complement representation x[31:0].
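The conditional negation used here is the familiar XOR-and-add trick; a minimal sketch (with the 33-bit value kept in a 64-bit container; names are illustrative):

```c
#include <stdint.h>

/* Conditionally negate a 33-bit magnitude y depending on the sign s:
 * XOR with a string of s's and add s, which yields y for s = 0 and the
 * two's complement of y for s = 1. */
static uint64_t cond_negate33(uint64_t y, unsigned s)
{
    uint64_t mask33 = (1ULL << 33) - 1;
    uint64_t smask  = s ? mask33 : 0;        /* s^33 */
    return ((y ^ smask) + s) & mask33;
}
```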
One could provide a separate circuit for each type of conversion. However, the arithmetic operations already require a general floating point unpacker and a floating point rounder which convert from a packed floating format to an internal floating format and vice versa. In order to reuse this hardware, every conversion is performed in two steps:

An unpacker converts the input FA2[63:0] into an internal floating point format. Depending on the type of the conversion, the input FA2 is interpreted as a 32-bit two's complement integer or as a packed floating point operand.

In addition to the unpacker FPUNP and the rounder FPRND, the conversions then require a fixed point unpacker FXUNP, a fixed point rounder FXRND, and a circuit CVT which adapts the output of FPUNP to the input format of FPRND (figure 8.2).

The conversion is controlled by the following signals:

the signals dbs and dbr, which indicate a double precision floating point source operand and result, and

the two enable signals which select between the results of the circuits CVT and FXUNP.
Let

y = a · 2^{−α}   if OVF(a) ∧ OVFen
y = a · 2^{α}    if UNF(a) ∧ UNFen
y = a             otherwise.

Depending on the flags flv, circuit FPRND (section 8.4) then provides the packed factoring

(sP, eP, fP) = (sv, 0^n, 0^{p−1})              if ZEROv
(sP, eP, fP) = (sv, 1^n, 0^{p−1})              if INFv
(sP, eP, fP) = (snan, 1^n, fnan[1:p−1])        if NANv
(sP, eP, fP) = η(rd(y))                         otherwise.
(Figure 8.41: circuit FXUNP converting a 32-bit integer x[31:0] into the internal floating point format; flags denotes the bits INFu, NANu, INV, and DBZ.)

The remaining flags are inactive and nan can be chosen arbitrarily:

su = x[31] = 1   if x < 0
             0   if x ≥ 0

eu[13:0] = 0^8 1 0^5     (i.e., eu = 32)

fu[−1:55] = 0^2 y[31:0] 0^{23}.
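As a sketch, the same unpacking step in software (the 57-bit significand window fu[−1:55] is modelled in the low bits of a 64-bit word; the struct layout is illustrative, the field values follow the formulas above):

```c
#include <stdint.h>

struct fxunp_out {
    unsigned su;      /* sign                                 */
    int      eu;      /* internal exponent                    */
    uint64_t fu;      /* fu[-1:55], position 55 at bit 0      */
};

/* Unpack a 32-bit two's complement integer: the magnitude y occupies
 * positions 1..32 of the significand window and the exponent is 32. */
static struct fxunp_out fxunp(int32_t x)
{
    struct fxunp_out r;
    uint64_t y = (uint64_t)(x < 0 ? -(int64_t)x : (int64_t)x);
    r.su = (x < 0);
    r.eu = 32;
    r.fu = y << 23;                  /* 0^2 y[31:0] 0^23 */
    return r;
}
```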
The circuit of figure 8.41 implements the fixed point unpacker FXUNP in a straightforward manner at the following cost and delay:
Rounding: Since an integer to floating point conversion never overflows, the representation (su, eu, fu, flu) meets the requirements of the rounder FPRND. Thus, the correctness of the floating point rounder FPRND implies the correctness of this integer to floating point converter.
Due to normal = 1, the significand fa is normal, and for any a ≥ 2^{emin} the number lza is zero.

This representation is provided to the fixed point rounder FXRND which generates the data Fx[63:32] = Fx[31:0] and the floating point exception flag INV. For a finite number a, let rdint(a) = (−1)^{sa} · y and

x = (−1)^{sa} · y = (−1)^{sa} · 2^0 · y.

For x ∈ T32, the conversion is valid (INV = 0) and x has the two's complement representation Fx[31:0]. If a is not finite or if x ∉ T32, the conversion is invalid, i.e., INV = 1, and Fx[31:0] is chosen arbitrarily.

The equation suggests to make the exponent 1 and shift the significand ea positions left, then shift p + 1 positions right; this moves the bit with weight 1 into the position p + 1 right of the binary point.
The significand fout provided by rddn has at most two bits to the left of the binary point, whereas y has up to 32 bits to the left of the binary point. However, the rounding rdint is only applied if fa ∈ [0, 2) and ea ≤ 31. For p = 32 it then follows that

fr = 2^{ea − (p+1)} · fa ≤ 2^{31 − (p+1)} · fa = 2^{−2} · fa < 2.

Thus, the significand sigrnd(sa, fr) has at most 2 bits to the left of the binary point. The significand y can then be obtained by shifting the output significand fout by −emin positions to the left:
Circuit FXRND: Figure 8.42 depicts the top level schematics of the fixed point rounder FXRND. Circuits NORMSHIFTX, REPPX, and SIGRNDX from the floating point rounder are adapted as follows:
(Figure 8.42: schematics of the fixed point rounder FXRND with subcircuits NORMSHIFTX, REPPX, SIGRNDX and SPECFX; IEEEx denotes the floating point exception flags.)
Circuit NORMSHIFTX: The circuit depicted in figure 8.43 performs the normalization shift. In analogy to the floating point rounder, its outputs satisfy

Iovf1 ⟺ 2^{e'a} · fa ≥ 2^{32}

en = emin = −31

fn ≡_p f = 2^{e'a − emin} · fa.
Circuit SHIFTDISTFX (figure 8.44) provides the distance σ of the normalization shift as a 13-bit two's complement number. Since f is defined as 2^{e'a − emin} · fa, the shift distance equals σ = e'a − emin. The cost and delay of the normalization shifter NORMSHIFTX and of its modified subcircuits can be expressed as
Circuit REPPX: The circuit of figure 8.45 performs the sticky bit computation in order to provide a p-representative of fn. Since we now have only a single precision, this circuit becomes almost trivial: an OR-tree over fn[33:127] yields the sticky bit st, and f1[0:33] = fn[0:32] st.

(Figure 8.45: circuits REPPX and SIGRNDX; the bits l, r, st feed the ROUNDING DECISION, and an incrementer incf(33) produces f3[−1:31] and SIGinx.)
Circuit SPECFX: This circuit supports the special cases and signals floating point exceptions. If the rounded result x is representable as a 32-bit two's complement number, we have

Fx[31:0] = 0^{32}     if ZEROr = 1
Fx[31:0] = x[31:0]    if ZEROr = 0

Fx[63:32] = Fx[31:0].

According to the specifications from page 422, the overflow of the conversion and the invalid operation exception can be detected as
Cycle times of the FPU and its units:

FPU | Bus FR | ADD/SUB | MUL/DIV | FPRND | FXRND
98  |   98   |   63    |   69    |  98   |  76

In the top level schematics of the FPU (figure 8.2), there are two register stages: the output of the unpacker FPUNP and the intermediate result on bus FR are clocked into registers. Result FR is provided by the unpacker FXUNP, the converter CVT, the add/subtract unit or by the multiply/divide unit. These units require a minimal cycle time of 98 gate delays, like the update of register FR. The floating point rounder FPRND is 30% slower than the other three units.

The input signals UNFen and OVFen have a large slack before they dominate the cycle time TFPU. The accumulated delay of the rounding mode RM is more time critical: already for A_RM ≥ 9, the rounding mode dominates the delay AFPU, i.e., it slows down the computation of the FPU outputs.
Table 8.11 lists the cost of the floating point unit FPU and of its major components. Circuit SIGRND of the floating point rounder FPRND either uses a standard 53-bit incrementer or a fast 53-bit CSI incrementer. Switching to the fast incrementer increases the cost of the rounder FPRND by 3%, but it has virtually no impact on the total cost (0.2%). On the other hand, the CSI incrementer improves the accumulated delay of the FPU considerably. Therefore, we later on only use the FPU design version with the CSI incrementer.

The multiply/divide unit is by far the most expensive part of the floating point unit; it accounts for 70% of the total cost. According to table 8.12, the cost of the multiply/divide unit is almost solely caused by circuit SIGFMD.
Table 8.11: Cost of the FPU and its sub-units. Circuit SIGRND of rounder FPRND either uses a standard incrementer or a fast CSI incrementer.

ADD/SUB        5975
MUL/DIV       73303
FCON           1982
FPUNP          6411
FXUNP           420
FPRND          7224 / 7422
FXRND          3605
CVT               2
rest           4902
total: FPU   103824 / 104022

Table 8.12: Cost of the significand multiply/divide circuit SIGFMD with a 256 x 8 lookup ROM. The last column lists the cost relative to the cost of the multiply/divide unit MUL/DIV.
r = 0^i a     if s = 1
r = a 0^i     otherwise.

Thus, in trivial (n, i)-right shifters, the i bits which are shifted out are the last i bits of the result. One can realize the alignment shift and sticky bit computation of the floating point adder by a stack of trivial shifters. The sticky bit is computed by simply ORing together the bits which are shifted out.

i = 1   if p = 24 and γ = 16
i = 2   if (p = 24 and γ = 8) or (p = 53 and γ = 16)
i = 3   if (p = 24 and γ = 5) or (p = 53 and γ = 8)
i = 4   if p = 53 and γ = 5.

For γ = 8, this bound was already shown in section 8.3.4. Repeat the arguments for the remaining cases. Determine the cost of the FPU for γ = 16 and γ = 5.
The next three exercises deal with the normalization shifter NORMSHIFT used by the floating point rounder FPRND. The functionality of the shifter is specified by Equation 8.6 (page 393); its implementation is described in section 8.4.2.

The shifter NORMSHIFT gets as input a factoring (s, er, fr); the significand fr[−1:55] has two bits to the left of the binary point. The final rounded result may be a normal or a denormal number, and fr may have leading zeros or not. Determine the maximal shift distance σ for each of these four cases. Which of these cases require a right shift?
The normalization shifter NORMSHIFT (figure 8.25, page 395) computes a shift distance σ, and its subcircuit SIGNORMSHIFT then shifts the significand f. However, in case of a right shift, the representation of f · 2^σ can be very long. Circuit SIGNORMSHIFT therefore only provides a p-representative fn:

fn[0:63]  with  fn ≡_p f · 2^σ.

The exponent en is computed as a sum involving a constant δ. The implementation of EXPNORM depicted in figure 8.27 (page 400) uses a 3/2-adder and a compound adder ADD2 to perform this task. Like in the computation of flag TINY, the value lz[5:0] can be included in the constant δ, and then the 3/2-adder in the circuit EXPNORM can be dropped.
Chapter
9
Pipelined DLX Machine with
Floating Point Core
In this chapter, the floating point unit from the previous chapter is in-
tegrated into the pipelined DLX machine with precise interrupts con-
structed in chapter 5. Obviously, the existing design has to be modified in
several places, but most of the changes are quite straightforward.
In section 9.1, the instruction set is extended by floating point instruc-
tions. For the greatest part the extension is straightforward, but two new
concepts are introduced.
1. The floating point register file consists of 32 registers for single pre-
cision numbers, which can also be addressed as 16 registers for dou-
ble precision floating point numbers. This aliasing of addressing will
mildly complicate both the address computation and the forwarding
engine.
2. Except during divisions, the execute stage can be fully pipelined, but it has variable latency (table 9.1). This makes the use of so-called result shift registers in the CA-pipe and in the buffers-pipe necessary.
1. For instructions which can be fully pipelined, i.e., for all instruc-
tions except divisions, two result shift registers in the precomputed
control and in the stall engine take care of the variable latencies of
instructions.
For the prepared machine FDLXΣ constructed in this way we are able to
prove the counter part of the (dateline) lemma 5.9.
In section 9.4, the machine is finally pipelined. As in previous construc-
tions, pipelining is achieved by the introduction of a forwarding engine
and by modification of the stall engine alone. Because single precision
values are embedded in double precision data paths, one has to forward
the 32 low order bits and the 32 high order bits separately. Stalls have to
be introduced in two new situations:
A simple lemma will show for this FDLXΠ design, that the execution of
instructions stays in order, and that no two instructions are ever simultane-
ously in the same substage of the execute stage.
Extended Instruction Set Architecture
The FPU provides 32 floating point general purpose registers FPRs, each
of which is 32 bits wide. In order to store double precision values, the reg-
isters can be addressed as 64-bit floating point registers FDRs. Each of the
16 FDRs is formed by concatenating two adjacent FPRs (table 9.3). Only
even numbers 0, 2, ..., 30 are used to address the floating point registers FPR; the least significant address bit is ignored.
In the design, it is sometimes necessary to store a single precision value xs
in a 64-bit register, i.e., the 32-bit representation must be extended to 64
bits. This embedding will be done according to the convention illustrated
in figure 9.1, i.e., the data is duplicated.
(Figure 9.1: embedding convention; the single precision value x.s is duplicated in the high word [63:32] and the low word [31:0] of the 64-bit register.)
The FPU further provides the special registers FCC, RM and IEEEf. The registers can be read and written by special move instructions.
Register FCC is one bit wide and holds the floating point condition code.
FCC is set on a floating point comparison, and it is tested on a floating
point branch instruction. Register RM specifies which of the four IEEE
rounding modes is used (table 9.4).
Register IEEEf (table 9.5) holds the IEEE interrupt flags, which are over-
flow OVF, underflow UNF, inexact result INX, division by zero DBZ, and
invalid operation INV. These flags are sticky, i.e., they can only be reset at
the user’s request. Such a flag is set whenever the corresponding exception
is triggered. The IEEE floating point standard 754 only requires that such
an interrupt flag is set whenever the corresponding exception is triggered
Coding of the interrupt flags IEEEf:

          symbol   meaning
IEEEf[0]   OVF     overflow
IEEEf[1]   UNF     underflow
IEEEf[2]   INX     inexact result
IEEEf[3]   DBZ     division by zero
IEEEf[4]   INV     invalid operation
The FPU adds six internal interrupts, namely the five interrupts requested
by the IEEE Standard 754 plus the unimplemented floating point operation
interrupt uFOP (table 9.7). In case that the FPU only implements a sub-
set of the DLX floating point operations in hardware, the uFOP interrupt
causes the software emulation of an unimplemented floating point opera-
tion. The uFOP interrupt is non-maskable and of type continue.
The IEEE Standard 754 strongly recommends that users are allowed to
specify an interrupt handler for any of the five standard floating point ex-
ceptions overflow, underflow, inexact result, division by zero, and invalid
operation. Such a handler can generate a substitute for the result of the
exceptional floating point instruction. Thus, the IEEE floating point inter-
rupts are maskable and of type continue. However, in the absence of such
an user specific interrupt handler, the execution is usually aborting.
&&#
*
P IPELINED DLX Interrupts handled by the DLX architecture with FPU
M ACHINE WITH
interrupt symbol priority resume mask external
F LOATING P OINT
C ORE reset reset 0 abort no yes
illegal instruction ill 1 abort no no
misaligned access mal 2
page fault IM Ipf 3 repeat
page fault DM Dpf 4
trap trap 5 continue
FXU overflow ovf 6 abort yes
FPU overflow fOVF 7 abort/
FPU underflow fUNF 8 continue
FPU inexact result fINX 9
FPU division by zero fDBZ 10
FPU invalid operation fINV 11
FPU unimplemented uFOP 12 continue no
external I/O ex j 12 j continue yes yes
FI-type:  Opcode (6) | Rx (5) | FD (5) | Immediate (16)

FR-type:  Opcode (6) | FS1 (5) | FS2 / Rx (5) | FD (5) | 00 | Fmt (3) | Function (6)

Floating point instruction formats of the DLX (figure 9.2). Depending on the precision, FS1, FS2 and FD specify 32-bit or 64-bit floating point registers. Rx specifies a general purpose register of the FXU. Function is an additional 6-bit opcode. Fmt specifies a number format.
The DLX machine uses two formats (figure 9.2) for the floating point in-
structions; one corresponds to the I-type and the other to the R-type of the
fixed point core FXU.
The FI-format is used for moving data between the FPU and the memory.
Register Rx of the FXU together with the 16-bit immediate specifies the
memory address. This format is also used for conditional branches on the
condition code flag FCC of the FPU. The immediate then specifies the
branch distance. The coding of these instructions is given in table 9.8.
(Table 9.8: FI-type instruction layout. All instructions except the branches also increment the PC by four. The effective address of memory accesses equals ea = GPR[Rx] + sxt(imm), where sxt(imm) denotes the sign extended version of the 16-bit immediate imm. The width of the memory access in bytes is indicated by d. Thus, the memory operand equals m = M[ea + d − 1 : ea].)
The FR-format is used for the remaining FPU instructions (table 9.9). It specifies a primary and a secondary opcode (Opcode, Function), a number format Fmt, and up to three floating point registers. For instructions which move data between the FPU and the fixed point unit FXU, the field Rx specifies the address of a general purpose register in the FXU.

Since the FPU of the DLX machine can handle floating point numbers with single or double precision, all floating point operations come in two versions; the field Fmt in the instruction word specifies the precision used (table 9.10). In the mnemonics, we identify the precision by adding the corresponding suffix, e.g., suffix '.s' indicates a single precision floating point number.
(Figure: data paths of the DLX design with FPU but without forwarding; environments IMenv, IRenv, Daddr, PCenv, FPemb, EXenv, DMenv, SH4Lenv and RFenv, with results C', FC' and the flags Ffl'.)

... exception flags. In order to support double precision loads and stores, the data registers MDRw and MDRr associated with the data memory are now 64 bits wide. Thus, the cost of the enhanced DLX data paths can be expressed as
Instruction Register Environment
So far, the environment IRenv of the instruction register selects the im-
mediate operand imm being passed to the PC environment and the 32-bit
immediate operand co. In addition, IRenv provides the addresses of the
register operands and two opcodes.
The extension of the instruction set has no impact on the immediate
operands or on the source addresses of the register file GPR. However, en-
vironment IRenv now also has to provide the addresses of the two floating
point operands FA and FB. These source addresses FS1 and FS2 can di-
rectly be read off the instruction word and equal the source addresses Aad
and Bad of the fixed point register file GPR:
FS1 = Aad = IR[25:21]
FS2 = Bad = IR[20:16].
Thus, the cost and delay of environment IRenv remain unchanged.
Circuit Daddr
Circuit Daddr generates the destination addresses Cad and Fad of the gen-
eral purpose register files GPR and FPR. In addition, it provides the source
address Sas and the destination address Sad of the special purpose register
file SPR.
Address Cad of the fixed point destination is generated by circuit Caddr
as before. The selection of the floating destination Fad is controlled by a
signal FRtype which indicates an FR-type instruction:
Fad[4:0] = IR[15:11]   if FRtype = 1
Fad[4:0] = IR[20:16]   if FRtype = 0.
The SPR source address Sas is generated as in the DLX design. It is usually specified by the bits SA = IR[10:6], but on an RFE instruction it equals the address of register ESR. Except for an RFE instruction or a floating point condition test (fc = 1), the SPR destination address Sad is specified by SA. On RFE, ESR is copied into the status register SPR[0], and on fc = 1, the condition flag fcc is saved into register SPR[8]. Thus,

Sas = SA      if rfe.1 = 0
Sas = 00001   if rfe.1 = 1

Sad = 00000   if rfe.1 = 1
Sad = 01000   if fc.1 = 1
Sad = SA      otherwise.
(Figure 9.4: circuit Daddr; multiplexers controlled by Jlink, Rtype, FRtype, fc.1 and rfe.1 select the addresses Cad, Fad, Sas and Sad from the instruction register fields.)
The circuit of figure 9.4 provides these four addresses at the following cost and delay:

PC Environment

Due to the extended ISA, the PC environment has to support two additional control instructions, namely the two floating point branches on the condition flag FCC. However, except for the value PC^u_i, the environment PCenv still has the functionality described in chapter 5.
Let signal bjtaken, as before, indicate a jump or taken branch. On instruction Ii, the PC environment now computes the value

PC^u_i = EPC_{i−1}          if Ii is an RFE instruction
PC^u_i = PC_{i−1} + imm_i   if bjtaken_i and Ii is a fixed point control instruction
PC^u_i = PC_{i−1} + imm_i   if bjtaken_i and Ii is a floating point branch
PC^u_i = PC_{i−1} + 4        otherwise.

This extension has a direct impact on the glue logic PCglue, which generates signal bjtaken, but the data paths of PCenv including circuit nextPC remain unchanged.
Signal bjtaken must now also be activated in case of a taken floating point branch. Let the additional control signal fbranch denote a floating point branch. According to table 9.11, signal bjtaken is now generated as
This increases the cost of the glue logic by an OR, AND, and XOR gate:
Both operands A’ and FCC are provided by the register file environment,
but A’ is passed through a zero tester in order to obtain signal AEQZ. Thus,
FCC has a much shorter delay than AEQZ, and the delay of signal b jtaken
remains unchanged.
Environment FPemb
Environment FPemb of figure 9.5 selects the two floating point source
operands and implements the embedding convention of figure 9.1. It is
controlled by three signals,
the flag dbs1 requesting double precision source operands,
the least significant address bit FS10 of operand FA, and
the least significant address bit FS20 of operand FB.
Circuit FPemb reads the two double words f A63 : 0 and f B63 : 0 and
provides the two operands FA1 and FB1, each of which is 64 bits wide.
Since the selection and data extension of the two source operands go
along the same lines, we just focus on operand FA1. Let the high order
word and the low order word of input f A be denoted by
f Ah f A63 : 32 and f Al f A31 : 0
On a double precision access (dbs1 1), the high and the low order word
are just concatenated, i.e., FA1 f Ah f Al. On a single precision access,
one of the two words is selected and duplicated; the word fAl is chosen on
an even address and the word fAh on an odd address. Thus,
f Ah f Al if dbs1 1
dbs1 0 FS10 1
FA163 : 0
f Ah f Ah if
f Al f Al if dbs1 0 FS10 0
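A sketch of this selection in software (the 64-bit word and the signal names follow the text; the helper itself is illustrative):

```c
#include <stdint.h>

/* Select operand FA1 from the register-file word fA: concatenate for
 * double precision, otherwise duplicate the addressed 32-bit half
 * according to the embedding convention of figure 9.1. */
static uint64_t fpemb(uint64_t fA, int dbs, int fs1_0)
{
    uint32_t fAh = (uint32_t)(fA >> 32);
    uint32_t fAl = (uint32_t)fA;

    if (dbs)
        return fA;                          /* fAh fAl          */
    uint32_t w = fs1_0 ? fAh : fAl;         /* odd / even word  */
    return ((uint64_t)w << 32) | w;         /* duplicate        */
}
```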
(Figure 9.5: environment FPemb; two copies of circuit Fsel select and duplicate the words fh, fl of fA and fB under control of dbs.1, FS1[0] and FS2[0].)
Memory Stage
In every cycle, the memory stage passes the address MAR, the 64-bit data MDRw and the floating point flags Ffl.3 to the write back stage:

C.4 := MAR
FC.4 := MDRw
Ffl.4 := Ffl.3.

byte_j(w) = w[8j+7 : 8j]
byte_{i:j}(w) = byte_i(w) ... byte_j(w)

On a read access with address a[31:0], the data memory DM provides the requested double word, assuming that the memory is not busy and that the access causes no page fault. In any other case, the memory DM provides a default value. Thus, for the double word boundary e = a[31:3]000, we get

A write access only updates the data memory if the access is perfectly aligned (dmal = 0) and if the access causes no page fault (dpf = 0). On such a d-byte write access with byte address a = a[31:0] and offset o = a[2:0], the data memory performs the update
The read time T$read is increased by the delay Dmux, and possibly the burst read time TMrburst as well. However, these two cycle times were not time critical:

T$read = A_{$if}(Dout) + D_mux + D_ff

TMrburst = D_driv(dbus) + δ + max{ D_{$if}(MDat; $if), D_{$if}(MDat; Dout) } + D_ff

C_DMenv = C_{D$if} + C_mux(32).
The memory control DMC also checks for a misaligned access. A byte
access is always properly aligned. A double word access is only aligned
if it starts at byte 0, i.e., if B0 1. A word access is aligned if it starts
at byte 0 or 4, and a half word access is aligned if it starts at an even byte.
Thus, the misalignment can be detected by
This includes the 8-bit register DMBw which buffers the bank write sig-
nals. Signals DMBw are still provided at zero delay. The accumulated
delay ADMC of the remaining outputs and the cycle time of circuit DMC
run at
Register File Environment
The DLX architecture now comprises three register files, one for the fixed
point registers GPR, one for the special purpose registers SPR, and one for
the floating point registers FPR. These three register files form the envi-
ronment RFenv
The data paths of the write back stage consist of the environment RFenv
and of the shifter environment SH4Lenv. Environment GPRenv is the only
environment which remains unchanged.
(Figure: environment SH4Lenv; the shifter SH4L processes the word MDs selected from MDRr by C.4[2], and multiplexers controlled by load.4 and dbr.4 produce the results C' and FC'.)
Environment SH4Lenv
In addition to the fixed point result C', the environment SH4Lenv now also provides a 64-bit floating point result FC'. The environment is controlled by two signals,

the signal load.4 indicating a load instruction, and

the signal dbr.4 indicating a double precision result.

The fixed point result C' is almost computed as before, but the memory now provides a double word MDRr. The shifter SH4L still requires a 32-bit input data MDs. Depending on the address bit C.4[2], MDs either equals the high or the low order word of MDRr:

MDs = MDRr[63:32]   if C.4[2] = 1 (high order word)
MDs = MDRr[31:0]    if C.4[2] = 0 (low order word).

Let sh4l(a, dist) denote the function computed by the shifter SH4L. The fixed point result C' is then selected as

C' = sh4l(MDs, C.4[1:0]000)   if load.4 = 1
C' = C.4                       if load.4 = 0.
Depending on the type of the instruction, the output FC' is selected among the two 64-bit inputs FC.4 and MDRr and the 32-bit word MDs, which is extended according to the embedding convention. On a load instruction, the environment passes the memory operand, which in case of double precision equals MDRr, and MDs otherwise. On any other instruction, the environment forwards the FPU result FC.4 to the output FC'. Thus,

FC'[63:0] = MDRr       if load.4 = 1 ∧ dbr.4 = 1
FC'[63:0] = MDs MDs    if load.4 = 1 ∧ dbr.4 = 0
FC'[63:0] = FC.4        otherwise.
(Figure: the special purpose register file SPR (9 x 32); circuits fxSPRsel and FCRsel select the inputs Di[0:8] of the distinct write ports from SR, PC.4, DPC.4, C.4, Ffl.4[4:0] and IEEEf[4:0], while the regular write/read port is addressed by Sad.4 and Sas.1.)
As before, the special purpose registers are held in a register file with an extended access mode. Any register SPR[s] can be accessed through the regular read/write port and through a distinct read port and a distinct write port. In case of a conflict, a special write takes precedence over the write access specified by address Sad. Thus, for any s ∈ {0, ..., 8}, register SPR[s] is updated as

SPR[s] := Di[s]   if SPRw[s] = 1
SPR[s] := C.4     if SPRw[s] = 0 ∧ SPRw = 1 ∧ s = Sad

Do[s] = SPR[s]
Sout = SPR[Sas].
Registers fxSPR The registers fxSPR still have the original functional-
ity. The write signals of their distinct write ports and signal sel are gener-
ated as before:
Circuit f xSPRsel which selects the inputs Dis of these write ports can be
taken from the DLX design of section 5 (figure 5.6).
Registers FCR Although the rounding mode RM, the IEEE flags and the
condition flag FCC only require a few bits, they are held in 32-bit registers.
The data are padded with leading zeros.
The condition flag FCC can be updated by a special move instruction or by a floating point condition test. Since in either case the result is provided by register C.4, the distinct write port of register FCC is not used. Thus,
Read Access: The register file environment FPRenv provides the two source operands fA and fB. Since both operands have double precision, they can be specified by the 4-bit addresses FS1[4:1] and FS2[4:1]:

fA[63:0] = FPR[FS1[4:1] 1] FPR[FS1[4:1] 0]
fB[63:0] = FPR[FS2[4:1] 1] FPR[FS2[4:1] 0].

For the high order word the least significant address bit is set to 1, and for the low order word it is set to 0.
Write Access: The 64-bit input FC' or its low order word FC'[31:0] is written into the register file. The write access is governed by the write signal FPRw and the flag dbr.4 which specifies the width of the access. In case of single precision, the single precision result is kept in the high and the low order word of FC', due to the embedding convention. Thus, on FPRw = 1 and dbr.4 = 0, the register with address Fad.4 is updated to

FPR[Fad.4[4:0]] := FC'[63:32] = FC'[31:0].

On FPRw = 1 and dbr.4 = 1, the environment FPRenv performs a double precision write access updating two consecutive registers:

FPR[Fad.4[4:1] 1] := FC'[63:32]
FPR[Fad.4[4:1] 0] := FC'[31:0].
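A sketch of this write access against a 32-entry array of 32-bit registers (array indexing and names are illustrative):

```c
#include <stdint.h>

/* Write the result FC' into the floating point register file.  For
 * double precision, two consecutive 32-bit registers are updated; for
 * single precision, both halves of FC' carry the same value (embedding
 * convention), so writing the low half suffices. */
static void fpr_write(uint32_t FPR[32], unsigned fad, uint64_t FCq,
                      int FPRw, int dbr)
{
    if (!FPRw)
        return;
    if (dbr) {                                    /* double precision */
        FPR[(fad & ~1u) | 1] = (uint32_t)(FCq >> 32);
        FPR[(fad & ~1u) | 0] = (uint32_t)FCq;
    } else {                                      /* single precision */
        FPR[fad] = (uint32_t)FCq;                 /* == FCq >> 32     */
    }
}
```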
(Figure: environment FPRenv; two 3-port RAMs (16 x 32) hold the odd and the even registers, and control FPRcon generates the write signals wod and wev from Fad.4[0], FPRw and dbr.4.)
signals at the following cost and delay:
Execute Stage
The execute environment EXenv is the core of the execute stage (figure
9.10). Parts of the buffer environment and of the cause environment CAenv
also belong to the execute stage. The buffers pass the PCs, the destination
addresses and the instruction opcodes down the pipeline. Environment
CAenv collects the interrupt causes and then processes them in the memory
stage.
Environment EXenv
Environment EXenv comprises the 32-bit fixed point unit FXU, the 64-bit
floating point unit FPU of chapter 8, and the exchange unit FPXtr. It gets
the same fixed point operands as before (A, B, S, co, link) and the two
floating point operands FA2 and FB2.
Fixed Point Unit FXU The FXU equals the execute environment of the
DLX architecture from section 5.5.4. The functionality, the cost and delay
of this environment remain unchanged. The FXU still provides the two
fixed point results D and sh and is controlled by the same signals:
(Figure 9.11: the exchange unit FPXtr; multiplexers controlled by fmov.2, fstore.2 and store.2 select the results tfp and tfx from FA[63:0], FB[63:0], B[31:0] and sh[31:0].)
Exchange Unit FPXtr The FPXtr unit transfers data between the fixed
point and the floating point core or within the floating point core. It is
controlled by
The operands B[31:0], FA[63:0] and FB[63:0] are directly taken from reg-
isters, operand sh31 : 0 is provided by the shifter of the fixed point unit.
Circuit FPXtr selects a 69-bit result t f p and a 32-bit result t f x. The bits
tfp[63:0] either code a floating point or fixed point value, whereas the bits
tfp[68:64] hold the floating point exception flags.
According to the IEEE floating point standard [Ins85], data move instructions never cause a floating point exception. This applies to stores and the special moves. Thus, the exchange unit selects the results as

tfx[31:0] = FA[31:0]

tfp[63:0] = FB[63:0]            if fstore.2
tfp[63:0] = sh[31:0] sh[31:0]   if store.2 ∧ ¬fstore.2
tfp[63:0] = FA[63:0]            if fmov.2
tfp[63:0] = B[31:0] B[31:0]     otherwise

tfp[68:64] = 00000.
The circuit of figure 9.11 implements the exchange unit in the obvious way. Assuming that the control signals of the execute stage are precomputed, cost and accumulated delay of environment FPXtr run at

C_FPXtr = 3 · C_mux(64)

A_FPXtr = A_FXU(sh) + 2 · D_mux(64).
Functionality of EXenv: Environment EXenv generates two results, the fixed point value D' and the 69-bit result R. R[63:0] is either a fixed point or a floating point value; the bits R[68:64] provide the floating point exception flags. Circuit EXenv selects output D' among the result D of the FXU, the result tfx of the exchange unit, and the condition flag fcc of the FPU. This selection is governed by the signals mf2i and fc which denote a special move instruction or a floating point compare instruction, respectively:

D'[31:0] = D[31:0]       if mf2i = 0 ∧ fc = 0
D'[31:0] = tfx[31:0]     if mf2i = 1 ∧ fc = 0
D'[31:0] = 0^{31} fcc     if mf2i = 0 ∧ fc = 1.

The selection of result R is controlled by the four enable signals FcRdoe, FpRdoe, FxRdoe and tfpRdoe. At most one of these signals is active at a time. Thus,

R[68:0] = Fc[68:0]    if FcRdoe = 1
R[68:0] = Fp[68:0]    if FpRdoe = 1
R[68:0] = Fx[68:0]    if FxRdoe = 1
R[68:0] = tfp[68:0]   if tfpRdoe = 1.
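In software terms, this one-hot enable selection behaves like the following sketch (simultaneous activation would be a control error; the struct and names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* A 69-bit result modelled as 64 data bits plus 5 exception flag bits. */
struct res69 { uint64_t data; uint8_t flags; };

/* Select R from whichever source drives the bus; at most one enable
 * may be active at a time. */
static struct res69 select_R(struct res69 Fc, struct res69 Fp,
                             struct res69 Fx, struct res69 tfp,
                             int FcRdoe, int FpRdoe, int FxRdoe, int tfpRdoe)
{
    assert(FcRdoe + FpRdoe + FxRdoe + tfpRdoe <= 1);
    if (FcRdoe) return Fc;
    if (FpRdoe) return Fp;
    if (FxRdoe) return Fx;
    return tfp;
}
```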
Cost and Cycle Time Adding an FPU has no impact on the accumu-
lated delay AFXU of the results of the fixed point core FXU. The FPU itself
comprises five pipeline stages. Its cycle time is modeled by TFPU and the
accumulated delay of its outputs is modeled by AFPU (chapter 8). Thus,
cost and cycle time of the whole execute environment EXenv can be esti-
mated as
Result Shift Registers
An n-bit result shift register RSR is a kind of queue with f entries R1, ..., Rf, each of which is n bits wide. In order to account for the different latencies, the RSR can be entered at any stage, not just at the first stage. The RSR (figure 9.12) is controlled by

a distinct clock signal ce_i for each of the f registers R_i,

a common clear signal clr, and

a distinct write signal w_i for each of the f registers R_i.

The whole RSR is cleared on an active clear signal. Let T and T + 1 denote successive clock cycles. For any 1 ≤ i ≤ f, an active signal clr^T = 1 implies

R_i^{T+1} = 0.

On an inactive clear signal clr^T = 0, the entries of the RSR are shifted one stage ahead, and the input Din is written into the stage i with w_i^T = 1, provided the corresponding register is clocked:

R_i^{T+1} = Din            if ce_i^T = 1 ∧ w_i^T = 1
R_i^{T+1} = R_{i−1}^T       if ce_i^T = 1 ∧ w_i^T = 0 ∧ i > 1
R_i^{T+1} = 0^n             if ce_i^T = 1 ∧ w_i^T = 0 ∧ i = 1,

while an unclocked register keeps its value.
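A behavioural sketch of such an RSR (the depth 5 matches this design's RSRs, but the data type and struct layout are assumptions for the example):

```c
#include <stdint.h>

#define RSR_DEPTH 5

struct rsr {
    uint64_t r[RSR_DEPTH + 1];       /* r[1..f]; r[0] is the constant 0 */
};

/* One clock edge: on clr every stage is cleared; otherwise every
 * clocked stage i either loads Din (w[i] = 1) or the content of the
 * previous stage i-1 (w[i] = 0). */
static void rsr_clock(struct rsr *q, uint64_t din,
                      const int ce[RSR_DEPTH + 1],
                      const int w[RSR_DEPTH + 1], int clr)
{
    struct rsr next = *q;
    for (int i = RSR_DEPTH; i >= 1; i--) {
        if (clr)
            next.r[i] = 0;
        else if (ce[i])
            next.r[i] = w[i] ? din : q->r[i - 1];
    }
    *q = next;
}
```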
(Figure 9.12: an n-bit RSR with f stages; each stage R_i is preceded by a multiplexer selecting Din (w[i] = 1) or the previous stage, and an AND gate implements the clear signal clr.)
The following lemma states that data Din which are clocked into stage i in cycle T are passed down the RSR, provided the clear signal stays inactive, the corresponding registers are clocked at the right time and they are not overwritten.

Let Din enter register R_i at cycle T, i.e., w_i^T = 1, ce_i^T = 1 and clr^T = 0. For all t ∈ {1, ..., f − i} let the clear signal be inactive, register R_{i+t} be clocked in cycle T + t, and w_{i+t}^{T+t} = 0; then

Din^T = R_i^{T+1} = R_{i+t}^{T+1+t} = ... = R_f^{T+1+f−i}.

The outputs R of the RSR have zero delay; the inputs r of its registers are delayed by a multiplexer and an AND gate:
(Figure: cause collection and cause processing; the collected causes CA.4 are processed by circuit CApro in the memory stage, producing MCA, jisr.4 and repeat.)
The floating point unit adds 6 new internal interrupts, which are as-
signed to the interrupt levels 7 to 12 (table 9.7).
Cause Collection The interrupt events of the fetch and decode stage are
collected in the registers CA.1 and CA.2, as before. These data are then
passed through a 5-stage RSR.
An illegal instruction, a trap and a fixed point overflow are still detected
in the execute stage and clocked into register CA.3. Since these events
cannot be triggered by a legal floating point instruction, the corresponding
instruction always passes from stage 2.0 directly to stage 3.
The floating point exceptions are also detected in the execute stage.
These events can only be triggered by a floating point instruction which
is signaled by fop = 1. Circuit CAcol therefore masks the events with flag fop. The 'unimplemented floating point operation' interrupt uFOP
&(*
*
is signaled by the control in stage ID. The remaining floating point events
P IPELINED DLX correspond to the IEEE flags provided by the FPU. Environment CAcol
M ACHINE WITH gets these flags from the result bus R68 : 64.
F LOATING P OINT Let TCAcol denote the cycle time of circuit CAcol used in the design
C ORE without FPU. Cost and cycle time of the extended cause collection circuit
can then be expressed as
C'_CAcol = C_CAcol + 6 · C_and + C_or + 13 · C_ff + C_RSR(5, 3)

T'_CAcol = max{ T_CAcol, A_CON(uFOP) + Δ, A_FPU + D_driv + Δ }.
Like in previous DLX designs (chapters 4 and 5), the control of the
prepared sequential data paths is derived in two steps. We start out
with a sequential control automaton which is then turned into precomputed
control.
Figures 9.16 to 9.18 depict the FSD underlying the sequential control
automaton. To a large extent, specifying the RTL instructions and active
control signals for each state of the FSD is routine. The complete specifi-
cation can be found in appendix B.
The portion of the FSD modeling the execution of the fixed point in-
structions remains the same. Thus, it can be copied from the design of
chapter 5 (figure 5.12). In section 8.3.6, we have specified an automaton
which controls the multiply/divide unit. Depending on the precision, the
underlying FSD is unrolled two to three times and is then integrated in the
FSD of the sequential DLX control automaton.
Beyond the decode stage, the FSD has an outdegree of one. Thus, the
control signals of the execute, memory and write back stage can be pre-
computed. However, the nonuniform latency of the floating point instruc-
tions complicates the precomputed control in two respects:
The execute stage consists of 5 substages. Fast instructions bypass
some of these substages.
(Figure 9.16: FSD underlying the control of the DLX architecture with FPU; after fetch and decode it branches into the fixed point FSD and the floating point arithmetic FSD and rejoins in the write back stage. The portions modeling the execution of the fixed point instructions and of the floating point arithmetic are depicted in figures 5.12, 9.17 and 9.18.)
*
P IPELINED DLX
M ACHINE WITH
F LOATING P OINT fdiv.d fmul.d fadd..d fsub.d cvt.i.d cvt.s.d
C ORE lookup.d
netwon1.d
newton2.d
newton3.d
newton4.d
netwon1.d
newton2.d
newton3.d
newton4.d
netwon1.d
newton2.d
newton3.d
newton4.d
quotient1.d
quotient2.d
quotient3.d
rd1.d
rd2.d
&/
*#
C ONTROL OF THE
P REPARED
fdiv.s fmul.s fadd.s fsub.s cvt.i.s cvt.d.s cvt.s.i cvt.d.i S EQUENTIAL
D ESIGN
lookup.s
netwon1.s
newton2.s
newton3.s
newton4.s
netwon1.s
newton2.s
newton3.s
newton4.s
quotient1.s
quotient2.s
quotient3.s
rd1.s rd1.i
rd2.s rd2.i
&/#
(Figure 9.19: precomputed control; the signal groups x.0 to x.3 feed the registers Con.2.0 to Con.2.3, and the remaining signals are passed through a 5-stage RSR controlled by RSRw down to Con.4.)
Like in previous designs (e.g., chapter 4), the control signals for the ex-
ecute, memory and write back stages are precomputed during ID. The
signals are then passed down the pipeline together with the instruction.
However, fast instructions bypass some of the execute stages. In order to
keep up with the instruction, the precomputed control signals are, like the
interrupt causes, passed through a 5-stage RSR (figure 9.19).
When leaving stage 2.0, an instruction with single cycle latency continues in stage 3. Instructions with a latency of 3 or 5 cycles continue in stage 2.3 or 2.1, respectively. The write signals of the RSRs can therefore be generated as

RSRw[1:5] = 10000   if lat5 = 1
RSRw[1:5] = 00100   if lat3 = 1
RSRw[1:5] = 00001   if lat1 = 1 (i.e., lat5 = 0 ∧ lat3 = 0).      (9.1)
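A small sketch of this write-signal generation, returning the five bits RSRw[1:5] as a bit mask (the one-hot latency flags follow the text; the encoding as an integer is an assumption for the example):

```c
#include <stdint.h>

/* RSR write signals: a 5-cycle instruction enters the RSR at stage 1,
 * a 3-cycle one at stage 3, a 1-cycle one at stage 5. */
static uint8_t rsr_write_signals(int lat5, int lat3, int lat1)
{
    if (lat5) return 0x10;   /* 10000 */
    if (lat3) return 0x04;   /* 00100 */
    if (lat1) return 0x01;   /* 00001 */
    return 0x00;             /* no instruction issued */
}
```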
The execute stage now consists of five substages. Thus, the signals of type
x are split into five groups x0 x4 with the obvious meaning.
Tables B.12 and B.14 (appendix B) list all the precomputed control sig-
nals sorted according to their type. The signals x0 comprise all the x-type
signals of the DLX design without FPU. In addition, this type includes the
signals specifying the latency of the instruction and the signals controlling
the exchange unit FPXtr and the first stage of the FPU.
The stages 2.1 up to 4 are governed by 22 control signals (table 9.16).
These signals could be passed through a standard 5-stage RSR which is 22
bits wide. However, signals for type xi are only needed up to stage 2i.
We therefore reduce the width of the RSR registers accordingly. The cost of the RSR and of the precomputed control can then be estimated as

C_ConRSR = C_inv + 5 · C_or + (22 + 15 + 2 · 12 + 9) · (C_and + C_mux + C_ff)

C_preCon = C_ConRSR + C_ff(53) + C_ff(6).

Thus, the RSR only buffers a total of 70 bits instead of 110 bits. Compared to a standard 22-bit RSR, that cuts the cost by one third.
The stages k of the pipeline are ordered lexicographically. Except for the execute stage, the scheduling functions of the designs DLXΣ and FDLXΣ are alike. One cycle after reset, the execution starts in the write back stage with a jump to the ISR. For k ∈ {0, 1, 3}, instruction Ii passes from stage k to k + 1:

IΣ(k, T) = i  ⟹  IΣ(k + 1, T + 1) = i
IΣ(4, T) = i  ⟹  IΣ(0, T + 1) = i + 1.

In the FDLXΣ design, the execute stage comprises 5 substages. Fast instructions bypass some of these substages, which complicates the scheduling. For any execute stage 2.k with k ≥ 0, the instruction is just passed to the next stage, thus

IΣ(2.k, T) = i  ⟹  i = IΣ(3, T + 1)          if k = 4
                    i = IΣ(2.(k+1), T + 1)    if k ≤ 3.
The stall engine of figure 9.20 implements the new schedule in an obvi-
ous way. As in the sequential design of section 5.5.6, there is one central
clock CE for the whole FDLXΣ design. During reset, all the update enable
flags uek are inactive, and the full vector is initialized. In order to let an
(Figure 9.20: stall engine of the FDLXΣ design without support for divisions; the full flags full.2.1 to full.3 are held in an RSR written according to RSRw, all registers are clocked by the central clock CE, and the update enable signals ue.k are derived from the full flags.)
instruction bypass some execute stages, the full flags of stages 2.1 to 2.4
and of the memory stage 3 are held in an RSR. This RSR is, like any other
RSR of the sequential DLX design, controlled by the write signals RSRw
of equation 9.1. The RSR of the stall engine is operated in a particularly
simple way, because all its clock enable signals are all tied to the common
clock enable CE.
Figure 9.21 illustrates how the precomputed control, the stall engine and
the data paths of the execute environment fit together. As before, the pre-
computed control provides the clock request signals RCe which are com-
bined (AND) with the appropriate update enable flags to obtain the actual clock signal RCe'. However, special attention must be paid to the clock signals of the
registers MDRw and Ffl.3. According to the specification in appendix B,
these two registers are clocked simultaneously. They either get their data
input from stage 2.0 or from stage 2.4, depending on the latency of the
instruction. Thus, the clock signal is obtained as
The remaining registers of the FDLX data paths receive their data inputs just from one stage. Register MAR, for example, is only updated by instructions with a 1-cycle execute latency; therefore

MARce = ue.2.4 ∧ MARce.2.0.
Along the lines of section 3.4 it can be shown that this FDLXΣ design
interprets the extended DLX instruction set of section 9.1 with delayed PC
semantics but without floating point divisions. The crucial part is to show
that the instruction and its data pass through the pipeline stages at the same
speed. More formally:
Let IΣ(2.0, T) = i, and let X be a register whose content is passed through one of the RSRs, i.e., X ∈ {IR, Cad, Fad, Sad, PC, DPC, CA[3:2]}. For any stage k ∈ {2.1, ..., 3} with IΣ(k, T') = i, we have

X.2.0^T = X.k^{T'}    and    full.k^{T'} = 1.
This follows from the definition of the write signals RSRw (equation 9.1)
and from lemma 9.1. Observe that the hypothesis of lemma 9.1 about the
clock enable signals is trivially fulfilled for the RSR in the stall engine. The
construction of the stall engine ensures, that the hypothesis about the clock
enable signals is also fulfilled for the remaining RSRs in the data paths and
in the control.
Outside the stall engine we update the registers of result shift registers
with separate update enable signals. Thus, during the sequential execution
of a single instruction it is still the case, that no stage k 0 1 3 4 or
substage 2 j is clocked twice. Not all instructions enter all substages, but
the dateline lemma 5.9 stays literally the same.
(Figure 9.22: main control for the stages 2.0 to 3 of the full FDLXΣ design; the division automaton iterates through the substates 2.0.1 to 2.0.16, the flag divhaz selects between its enable signals (opaoe, opboe, tlu) and the precomputed control Con.2.1 to Con.3, and the full flags f.2.0.15 to f.3 govern the update enables ue.)
In the final four steps, the division passes through the stages 2.1 to 2.4. This is again controlled by the precomputed control.
Thus, the main control (figure 9.22) of the floating point DLX design con-
sists of the stall engine, the precomputed control with its 5-stage RSR, and
the ‘division automaton’. Except for circuit S IGF MD, the data paths are
governed by the precomputed control, whereas the stall engine controls the
update of the registers and RAMs.
IΣ(2.0.j, T) = i  ⟹  i = IΣ(2.0.(j+1), T + 1)   if 0 ≤ j < 16
                      i = IΣ(2.1, T + 1)          if j = 16.
In the remaining pipeline stages, the division is processed like any instruc-
tion with a 5-cycle execute latency. Thus, the scheduling function requires
no further modification.
Unlike the stall engine, the cause environment, the buffer environment
and the precomputed control still use a 5-stage RSR. Up to step 2.0.16, a
division is frozen in stage 2.0 and then enters the first stage of these RSRs.
Thus, the write signals RSRw1 : 5 of the RSRs in the data paths and in the
precomputed control are generated as
The output registers s and c of the multiplication tree are also used by
multiplications (stage 2.1). A division uses these registers up to stage 2.1.
Thus, the registers s and c can be updated at the end of step 2.1 without
any harm, even in case of a division.
Table 9.17 lists for each register the stages in which its clock signal must
be active. A particular clock request signal is then obtained by ORing the
update enable flags of the listed stages, e.g.:
the enable signals for the operand busses opa and opb
The signals f div and db are fixed for the whole execution of an instruction.
Therefore, they can directly be taken from the RSR of the precomputed
control.
The flag tlu selects the input of register x. Since this register is only used
by divisions, the flag tlu has no impact on a multiplication or addition.
Thus, flag tlu is directly provided by the division automaton.
&)
*#
The operand busses opa and opb are controlled by both, the precom-
puted control and the division automaton. Both control units precompute C ONTROL OF THE
their control signals. The flag divhaz selects between the two sets of con- P REPARED
trol signals before they are clocked into the register Con21. Let opaoe S EQUENTIAL
and opboe denote the set of enable signals generated by the division au- D ESIGN
tomaton; this set is selected on divhaz 1. The operand busses are then
controlled by
An active signal divhaz grants the division automaton access to the operand
busses during stages 201 to 2016. Since the enable signals are precom-
puted, signal divhaz must also be given one cycle ahead:
15
divhaz f ull 20k f ull 20 f div20
k 1
The 5 clock request signals and the 7 enable signals together have an
accumulated frequency of νsum 30 and a maximal frequency of νmax 9.
Thus, the control for circuit S IGF MD requires the following cost and cycle
time:
The division automaton delays the clock signals of circuit S IGF MD by the
following amount
+ 3
With respect to the dateline lemma we are facing two additional problems:
Some registers are updated by more than one stage. Registers c and
s of the circuit /!,1 for instance are updated after stage 2.0.16
during divisions and after stage 2.1 during multiplications. Thus,
classifying the registers by the stage, which updates them, is not
possible any more.
We coarsely classy the stages into two classes. The class of stages PP
which are operated in a pipelined fashion and the class of stages SQ which
are operated in a sequential manner:
) Let k t PP and let IΣk T i. For every register and memory cell R
&)&
*&
The value of the output registers of stage 2.0.16 at the end of the it-
erations for a division operation depend only on the value of the output P IPELINED DLX
registers of stage 2.0 before the iterations: D ESIGN WITH FPU
¼
and let V be an output register of stage 2.0.16. Then VT depends only on
the values QU 1 of the output registers Q of stage 2.0 which were updated
¼
after cycle U .
DMenv
11111
00000 SH4Lenv
Ffl’ C’ FC’
RFenv
Data paths of the pipelined FDLX design with result forwarding
Like in the pipelined designs DLXπ and DLXΠ, the register files GPR, SPR
and FPR are updated in the write back stage. Since they are read by ear-
lier stages, the pipelined floating point design FDLXΠ also requires result
forwarding and interlocking. For the largest part, the extension of the for-
warding and interlock engine is straightforward, but there are two notable
complications:
The execute stage has a variable depth, which depending on the in-
struction varies between one and five stages. Thus, the forwarding
and interlock engine has to inspect up to four additional stages.
Since the floating point operands and results have single or double
precision, a 64-bit register of the FPR register file either serves as
one double precision register or as two single precision registers.
The forwarding hardware has to account for this address aliasing.
&)(
*&
9 "- 8
The move instruction , is the only floating point instruction which up- P IPELINED DLX
dates the fixed point register file GPR. The move , is processed in the D ESIGN WITH FPU
exchange unit FPXtr, which has a single cycle latency like the fixed point
unit.
Thus, any instruction which updates the GPR enters the execute stage
in stage 2.0 and then directly proceeds to the memory stage 3. Since the
additional stages 2.1 to 2.4 never provide a fixed point result, the operands
A and B can still be forwarded by circuit Forw 3 of figure 4.16. However,
the extended instruction set has an impact on the computation of the valid
flags v4 : 2 and of the data hazard flag.
Valid Flags The flag v j indicates that the result to be written into the
GPR register file is already available in the circuitry of stage j, given that
the instruction updates the GPR at all. The result of the new move in-
struction , is already valid after stage 2.0 and can always be forwarded.
Thus, the valid flags of instruction Ii are generated as before:
v4 1; v3 v2 Dmr
Data Hazard Detection The flags dhazA and dhazB signal that the oper-
and specified by the instruction bits RS1 and RS2 cause a data hazard, i.e.,
that the forwarding engine cannot deliver the requested operands on time.
These flags are generated as before.
In the fixed point DLX design, every instruction I is checked for a data
hazard even if I requires no fixed point operands:
dhazFX dhazA dhazB
This can cause unnecessary stalls. However, since in the fixed point de-
sign almost every instruction requires at least one register operand, there is
virtually no performance degradation.
In the FDLX design, this is no longer the case. Except for the move ,,
the floating point instructions have no fixed point operands and should not
signal a fixed point data hazard dhazFX. The flags opA and opB therefore
indicate whether an instruction requires the fixed point operands A and B.
The FDLX design uses these flags to enable the data hazard check
dhazFX dhazA opA dhazB opB
The data hazard signals dhazA and dhazB are generated along the same
lines. Thus, the cost and delay of signal dhazFX can be expressed as
CdhazFX 2 CdhazA 2 Cand Cor
AdhazFX AdhazA Dand Dor
&)/
*
"- 8
P IPELINED DLX Due to the FPU, the special purpose registers SPR are updated in five situ-
M ACHINE WITH ations:
F LOATING P OINT
C ORE 1. All special purpose registers are updated by JISR. As in the DLXΠ
design, there is no need to forward these values. All instructions
which could use forwarded versions of values forced into SPR by
JISR get evicted from the pipe by the very same occurrence of JISR.
In case 5, which only applies to register IEEEf, the result is passed down
the pipeline in the Ffl.k registers. During write back, the flags Ffl.4 are
then ORed to the old value of IEEEf. In the uninterrupted execution of Ii ,
we have
IEEE f i IEEE f i 1 F f li
2. on an , instruction, the two exception PCs are read during decode,
&))
*&
3. the cause environment reads the interrupt masks SR in the memory
stage, P IPELINED DLX
D ESIGN WITH FPU
4. the rounders of the FPU read SR in the execute stage 2.3,
Forwarding of the Exception PCs Since the new floating point instruc-
tions do not access the two exceptions PCs, the forwarding hardware of
EPC and EDPC remains unchanged. EPC is forwarded by the circuit
SFor 3 depicted in figure 5.17. The forwarding of EDPC is still omit-
ted, and the data hazard signal dhaz EDPC is generated as before.
dhaz IEEE f
ms2i 1
Sas1 7
f opk f ull k hit 2 f op3 f ull 3
2 0k2 4
hit 2 NOR hit 3 f op4 f ull 4
&)*
*
Sas.1 0111 fop.2.[0:4] full.2.[0:4] hit.2 fop.4 full.4 hit.2 hit.3 fop.4 full.4
P IPELINED DLX
M ACHINE WITH equal 5-AND
F LOATING P OINT ms2i.1 OR
C ORE
dhaz(IEEEf)
The circuit of figure 9.24 generates the flag in the obvious way. The hit
signals are provided by circuit SFor 3. Thus,
Let instruction Ii read the rounding mode RM in stage 2.2 or 2.4. Fur- )
thermore, let I j be an instruction preceding Ii which updates register RM.
Assuming that the instructions pass the pipeline stages strictly in program
order, I j updates register RM before Ii reads RM.
1) Any instruction which passes the rounder FP RND or FX RND has an ex-
ecute latency of at least 3 cycles. Thus, the rounder of stage 2.4 processes
Ii in cycle T 2, at the earliest:
IΠ 24 T i with T T 2
tT t 2 T 2
, " 8
While an instruction (division) is processed in the stages 2.0.0 to 2.0.15,
the signal divhaz is active. Since the fetch and decode stage are stalled on
divhaz 1, it suffices to forward the floating point results from the stages
k PP with k 20. In the following, stage 2.0 is considered to be full, if
one of its 17 substages 2.0.0 to 2.0.16 is full, i.e.,
Depending on the flag dbs, the floating point operands either have single
or double precision. Nevertheless, the floating point register file always
delivers 64-bit values f a and f b. Circuit FPemb of stage ID then selects
the requested data and aligns them according to the embedding convention.
However, the forwarding engine, which now feeds circuit FPemb, takes the
width of the operands into account. That avoids unnecessary interlocks.
The floating point forwarding hardware FFOR (figure 9.25) consists of
two circuits F f or. One forwards operand FA, the other operand FB. In ad-
dition, circuit F f or signals by f haz 1 that the requested operand cannot
be provided in the current cycle. Circuit F f or gets the following inputs
the 64-bit data Din from a data port of register file FPR, and
for each stage k PP with k 20 the destination address Fad k, the
precision dbr, the write signal FPRwk and an appropriately defined
intermediate result FC k.
Like in the fixed point core, the forwarding is controlled by valid flags
f v which indicate whether a floating point result is already available in one
&*#
*
of the stages 2.0, 2.1 to 4. After defining the valid flags f v, we specify the
P IPELINED DLX forwarding circuit F f or and give a simple realization.
M ACHINE WITH The flags opFA and opFB indicate whether an instruction requires the
F LOATING P OINT floating point operands FA and FB. These flags are used to enable the
C ORE check for a floating point data hazard:
Forwarding engine FFOR provides this flag at the following cost and delay
CFFOR 2 CF f or
CdhazFP 2 Cand Cor
AdhazFP ACON csID DF f or f haz Dand Dor
Valid Flags Like for the results of the GPR and SPR register files, we
introduce valid flags f v for the floating point result FC. Flag f vk in-
dicates that the result FC is already available in the circuitry of stage k.
The control precomputes these valid flags for the five execute substages
20 21 24 and for the stages 3 and 4.
In case of a load instruction (Dmr 1), the result only becomes avail-
able during write back. For any other floating point operation with 1-cycle
execute latency, the result is already available in stage 2.0. For the re-
maining floating point operations, the result becomes available in stage 2.4
independent of their latency. The floating point valid flags therefore equal
Since the flags f vk for stage k 21 23 4 have a fixed value, there
is no need to buffer them. The remaining three valid flags are passed
through the RSR of the precomputed control together with the write signal
FPRw.
In any stage k 20, the write signal FPRwk, the valid flag f vkk and
the floating point destination address Fad k are available. For some of
these stages, the result FC k is available as well:
FC 4 is the result to be written into register file FPR,
FC 3 is the input of the staging register FC.4, and
FC 2 is the result R to be written into register MDRw. Depending
on the latency, R is either provided by stage 2.0 or by stage 2.4.
&*&
*&
Lemma 4.8, which deals with the forwarding of the fixed point result,
can also be applied to the floating point result. However, some modifica- P IPELINED DLX
tions are necessary since the result either has single or double precision. D ESIGN WITH FPU
Note that in case of single precision, the high and low order word of the
results FC k are identical, due to the embedding convention (figure 9.1).
Thus, we have:
For any instruction Ii , address r r4 : 0, stage k PP with k 20, )
and for any cycle T with IΣ k T i we have:
Lemma 9.7 implies that instruction I requests the high (low) order word
if the operand has double precision or an odd (even) address. Due to the
embedding convention (figure 9.1), a single precision result is always du-
plicated, i.e., the high and low order word of a result FC k are the same.
&*'
*
P IPELINED DLX Floating point hit signals for stage k 2 0 3, assuming that the
M ACHINE WITH instruction in stage k produces a floating point result (FPRw f ull k 1) and
F LOATING P OINT that the high order address bits match, Fad k4 : 1 ad 4 : 1.
C ORE
destination source
hitH.k hitL.k
dbr.k Fad.k[0] dbs.1 ad[0]
0 0 0 1
0 0 0 1 0 0
1 * 0 1
0 0 0 0
0 1 0 1 1 0
1 * 1 0
0 0 0 1
1 * 0 1 1 0
1 * 1 1
The two hit signals of stage k therefore have the values listed in table 9.19;
they can be expressed as
Moreover, flag topH k signals for the high order word that there occurs
a hit in stage k but not in the stages above:
topH k hitH k
hitH x
2 0xk xPP
The flags topLk of the low order word have a similar meaning. In case
of topH k 1 and topL j 1, the instructions in stages k and j generate
data to be forwarded to output Do. If these data are not valid, a data hazard
f haz is signaled. Since f v4 1, we have
While an instruction is in the stages 2.1 to 2.3 its result is not valid
yet. Furthermore, the execute stages 2.0 and 2.4 share the result bus R
which provides value FC 2. Thus, circuit F f or only has to consider three
results for forwarding. The high order word of output Do, for example,
&*(
*&
Do[63:32] Do[31:0]
P IPELINED DLX
hitH.2.0 hitL.2.0 D ESIGN WITH FPU
0 1 0 1
hitH.2.4 hitL.2.4
FC’.2[63:32] FC’.2[31:0]
hitH.3 0 1 hitL.3 0 1
FC’.3[63:32] FC’.3[31:0]
hitH.4 0 1 hitL.4 0 1
The delay of Do is largely due to the address check. The actual data Din
and FC j are delayed by no more than
DF f or Data DF f orSel
All the address and control inputs of circuit FFOR are directly taken from
registers. FFOR therefore provides the operands FA1 and FB1 with an
accumulated delay of
Before the operands are clocked into the registers FA and FB, circuit
FPemb aligns them according to the embedding convention. Thus, fetch-
ing the two floating point operands requires a minimal cycle time of
Since the divider is only partially pipelined, the division complicates the
scheduling considerably. Like for the sequential design, we therefore first
ignore divisions. In a second step, we then extend the simplified scheduler
in order to support divisions.
1. several instructions can reach a stage k at the same time like the
instructions I1 and I3 do, and
Notation So far, the registers of the RSR are numbered like the pipeline
stages, e.g., for entry R we have R20 R24 R3. The execute latency
l of an instruction specifies how long the instruction remains in the RSR.
Therefore, it is useful to number the entries also according to their height,
i.e., according to their distance from the write back stage (table 9.21). An
instruction with latency l then enters the RSR at height l.
In the following, we denote by f ull d the full flag of the stage with
height d, e.g.:
&**
*
P IPELINED DLX Height of the pipeline stages
M ACHINE WITH
stage 2.0 2.1 2.2 2.3 2.4 3 4
F LOATING P OINT
height 6 5 4 3 2 1 0
C ORE
Structural Hazards According to lemma 9.1, the entries of the RSR are
passed down the pipeline one stage per cycle, if the RSR is not cleared
and if the data are not overwritten. Thus, for any stage with height d
2 5 we have,
This means that an instruction once it has entered the RSR proceeds at full
speed. On the other hand, let instruction Ii with latency li be processed in
stage 2.0 during cycle T . The scheduler then tries to assign Ii to height
li for cycle T 1. However, this would cause a structural hazard, if the
stage with height li 1 is occupied during cycle T . In such a situation, the
scheduler signals an RSR structural hazard
and it stalls instruction Ii in stage 2.0. Thus, structural hazards within the
RSR are resolved.
T d li T 1
Since j i, the instructions would be not executed in-order (i.e., the con-
dition of equation 9.4 is violated), if Ii leaves stage 2.0 at the end of cycle
T.
For d li , we have T d li T , i.e., instruction I j reaches height li
before instruction Ii . Thus, in order to ensure in-order execution, Ii must
be stalled in stage 2.0 if
5
RSRorderT f ull d T 1
d li 2
'
*&
The flag RSRhaz signals a structural hazard or a potential out-of-order
execution: P IPELINED DLX
D ESIGN WITH FPU
5
RSRhaz RSRstr RSRorder f ull d T
d li 1
The stall engine of the FDLXΠ design stalls the instruction in stage 2.0 if
RSRhaz 1. Of course, the preceding stages 0 and 1 are stalled as well.
Hardware Realization Figure 9.27 depicts the stall engine of the design
FDLXΠ . It is an obvious extension of the stall engine from the DLXΠ
design (figure 5.19). Like in the sequential design with FPU, the full flags
of the stages 2.0 to 3 are kept in a 5-stage RSR.
A more notable modification is the fact that we now use 3 instead of 2
clocks. This is due to the RSR hazards. As before, clock CE1 controls the
stages fetch and decode. The new clock CE2 just controls stage 2.0. Clock
CE3 controls the remaining stages; it is still generated as
Clock CE2 is the same as clock CE3 except that it is also disabled on an
RSR hazard:
CE3 full.4
CE3 ue.4
Stall engine of the FDLXΠ design without support for divisions
Scheduling Function Except for the execute stages, the FPU has no im-
pact on the scheduling function of the pipelined DLX design. The instruc-
tions are still fetched in program order and pass the stages 0 and 1 in lock
step mode:
i if ue0T 0
IΠ 0 T i IΠ 0 T 1
i1 if ue0T 1
IΠ 1 T i IΠ 0 T i 1
Except for stage 20, an instruction makes a progress of at most one stage
per cycle, given that no jump to the ISR occurs. Thus, IΠ k T i with
20 and JISRT 0 implies
k
k T 1 uekT 0
IΠ
IΠ k 1 T 1
if
if uekT 1 k 0 1 3
i
uekT k 2 j 21 22 23
IΠ
IΠ
2 j 1 T 1 if
3 T 1 if uekT
1
1 k 24
With respect to stage 2.0, the pipelined and the sequential scheduling func-
tion are alike, except that the instruction remains in stage 2.0 in case of an
RSR hazard. In case of JISR 0, an active flag RSRhaz disables the up-
date of stage 2.0, i.e., signal ue20 is inactive. Thus, for IΠ 20 T i and
'
*&
JISRT 0, we have
P IPELINED DLX
20 T 1 if ue20T 0
IΠ
IΠ 21 T 1 if ue20T 1 li 5
D ESIGN WITH FPU
i
23 T 1 if ue20T 1 li 3
IΠ
IΠ 3 T 1 if ue20T 1 li 1
Like in the pipelined design without FPU, the control comprises the mem-
ory controllers IMC and DMC, the memory interface control MifC, a cir-
cuit CE which generates the global clock signals, the stall engine, the pre-
computed control, a Mealy automaton for stage ID, and a Moore automa-
ton for the stages EX to WB. The parameters of these two automata are
'#
*
P IPELINED DLX Classification of the precomputed control signals
M ACHINE WITH
type x.0 x.1 x.2 x.3 x.4 y z
F LOATING P OINT
C ORE control signals 31 7 3 0 3 3 6
valid flags 2 2 2
listed in table B.16. Thus, the cost of the whole FDLX control can be
expressed as
the flags f v20, f v24 and f v3 for the floating point result.
The valid flags increase the signals of type x0, y and z by two signals
each (table 9.22). The RSR of the precomputed control now starts with
26 signals in stage 2.1 and ends with 13 signals in stage 3. The control
signals are precomputed by a Moore automaton which already provides
the buffering for stage 2.0. This does not include the valid flags; they
require 6 buffers in stage 2.0. In addition, an inverter and an AND gate are
used to generate the valid flags.
Since divisions iterate in the multiply divide circuit S IGF MD, the pre-
computed control is extended by circuit DivCon, like in the sequential de-
sign (figure 9.22). The cost and delay of control DivCon remain the same.
Without the automaton, the cost of the RSR and of the (extended) pre-
computed control can then be expressed as
'&
*&
-
The pipelined FDLX design uses three clock signals CE1 to CE3. These P IPELINED DLX
clocks depend on flags JISR, on the hazard flags dhaz and RSRhaz, and D ESIGN WITH FPU
on the busy flags busy and Ibusy.
The forwarding circuitry provides three data hazard flags: flag dhazFX
for the GPR operands, flag dhazS for the SPR operands and flag dhazFP
for the FPR operands. A data hazard occurs if at least one of these hazard
flags is active, thus
Flag dhaz can be obtained at the following cost and accumulated delay:
The FPU has no impact on the busy flags. They are generated like in the
pipelined design DLXΠ , at cost Cbusy and with delay Abusy . The JISR flags
are obtained as
The three clock signals are then generated at the following cost and delay
1
The core of the stall engine is the circuit depicted in figure 9.27 but with
an 21-stage RSR. In addition, the stall engine enables the update of the
registers and memories based on the update enable vector ue.
According to equation 9.2, the write signals Stallw of the 21-stage RSR
are directly taken from the precomputed control of stage 2.0. The core of
the stall engine therefore provides the update enable flags at the following
cost and delay
'(
*&
*&' -
P IPELINED DLX
It suffices to show the simulation theorem for cycles, when instructions are D ESIGN WITH FPU
in stages k PP.
for k PP and statements 1 (a) and (b) for signals S and output registers
R of stages k PP.
The arguments from the induction step of theorems 4.5, 4.7 and 4.11 have
to be extended for the execute environment. Two new situations must be
treated:
For the first case, let Ii be an instruction which jumps from stage 20 to
stage x with x 21 24 3, and let
i IΠ x T IΣ x T
IΠ 20 T 1 IΣ 20 T 1
Let Q out 20 be an output register of stage 2.0 which was updated
during cycle T 1. The induction hypothesis and the dateline lemma imply
¼
QTΠ Qi QTΣ
As in the proof of theorem 4.7, one argues that values RMi 1 and SRi 1 are
forwarded to stage x of machine DLXΠ in cycle T . It follows that
T ¼
SΠ SΣT
'/
*
For the second case, let Ii be a division instruction and let
P IPELINED DLX
M ACHINE WITH i IΠ 21 T IΣ 21 T
F LOATING P OINT IΠ 2016 T 1 IΣ 2016 T 1
C ORE IΠ 20 U IΣ 20 U
QUΠ1 QUΣ 1
¼
N THIS section, we analyze the impact of the floating point unit on the
cost and the performance of the pipelined DLX design. We also analyze
how the FPU impacts the optimal cache size (section 9.5.2).
In the following, we compare the cost and the cycle time of the designs
DLXΠ and FDLXΠ . Both designs use a split 4KB cache. The Icache and
the Dcache are of equal size, i.e., 2KB each. They are two way set as-
sociative with LRU replacement and implement the write allocate, write
through policy. With respect to the timing, we assume that the memory
interface has a bus delay of dbus 15 and a handshake delay of dMhsh 10
gate delays.
')
*'
Cost of the pipelined DLX data paths. DPM denotes the data paths E VALUATION
without the memory environments.
+ "
Except for the environments IRenv and IMenv, all parts of the data paths
and of the control had to be adapted to the floating point instruction set.
Significant changes occurred in the execute stage, in the register file envi-
ronment, in the forwarding hardware, and in the control (table 9.23).
The floating point unit itself is very expensive, its cost run at 104 kilo
gates (section 8.7). Compared to the FPU, the FXU is fairly inexpensive.
Thus, in the FDLX design, the execute environment is 28 times more ex-
pensive than in the DLX design. The FPU accounts for about 95% of the
cost of EXenv.
There is also a significant cost increase in the forwarding hardware, in
the buffers and in the register file environment. This increase is due to
the deeper pipeline and due to the additional floating point operands. The
remaining environments contribute at most 1kG (kilo gate) to the cost in-
crease. The memory environments become even slightly cheaper, due to
the simpler data memory interface. The data ports of the Dcache and of
environment DMenv have now the same width (64 bits); the patch of the
data ports therefore becomes obsolete.
In the DLXΠ design, the 4KB split cache is by far the single most expen-
sive unit; it accounts for 82% of cost. The FPU is about 9% more expensive
than the 4KB cache. Thus, in the FDLXΠ design, the 4KB cache only con-
tributes 40% to the cost of the data paths; environment EXenv contributes
another 46%. Adding the FPU roughly doubles the cost of the pipelined
data paths (factor 2.05). Without the caches, the FPU has even a stronger
cost impact, it increases the cost of the data paths roughly by a factor of 6.
'*
*
P IPELINED DLX Cost of the control of the pipelined DLX designs and with FPU.
M ACHINE WITH
MifC stall, CE preCon automata CON DLX
F LOATING P OINT
C ORE DLXΠ 943 165 202 952 2262 118960
FDLXΠ 1106 623 1440 2829 5898 245514
increase 6.7% 278% 613% 197% 161% 106%
Table 9.24 lists the cost of the different control environments and of the
whole DLX designs. Adding the FPU increases the cost of the control
by 160%. The cost of the memory interface control remains virtually the
same. Due to the deeper pipeline, the stall engine becomes about 4 times
as expensive.
The control automata become about three times as expensive. This is
largely due to the Moore automaton which precomputes the control signals
of the stages EX to WB. It now requires 44 instead of 17 states, and it
generates 48 instead of 16 control signals. The Moore control signals have
a 7 times higher accumulated frequency νsum (342 instead of 48).
The larger number of control signals also impacts the cost of the pre-
computed control, which passes these signals down the pipeline. Since
the pipeline is also much deeper, the precomputed control is 7 times as
expensive as before.
%
Table 9.25 lists the cycle time for each stage of the data paths. The cycle
time of the write back stage remains the same, despite of the additional
register file. The FPR register file consists of two RAM banks, each of
which only has half the size of the RAM used in the GPR register file.
Thus, time TW B is still dominated by the delay of the shifter SH4L and the
GPR register file.
Due to the aliasing of single and double precision registers, each word
of a floating point operand must be forwarded separately. Since all the
operands are fetched and forwarded in parallel, the floating point extension
has only a minor impact on the operand fetch time. The cycle time of stage
ID is still dominated by the PC environment.
The FPU is much more complex than the FXU. Thus, the cycle time of
the execute stage is increased by about 50%; the execute stage becomes
time critical. The cycle time of the control is also increased significantly
(16%). This is due to the non-uniform latency of the execute stage, which
requires the use of an RSR.
'
*'
Cycle times of the data paths of the designs DLX Π and FDLXΠ with E VALUATION
2KB, 2-way Icache and Dcache.
ID CON / stall
EX WB DP
operands PC max( , )
DLXΠ 72 89 66 33 89 79 46 dbus
FDLXΠ 74 89 98 33 98 92 48 dbus
Memory cycle times of the DLX designs with 2KB, 2-way Icache and
Dcache, assuming a bus and handshake delay of d bus 15 and d Mhsh 10.
Maccess
$read $if Mreq Mrburst
α4 α8
DLXΠ 55 47 42 51 379 707
FDLXΠ 53 47 42 51 379 707
The memory system remains virtually the same, except for one multi-
plexer which is saved in the Dcache interface and a modification of the
bank write signals. The latter has no impact on the delay of the memory
control. Thus, except for the cache read time T$read , the two DLX designs
with and without FPU have identical memory cycle times (table 9.26).
Like in sections 6.4.2 and 6.5.3, we now optimize the cache size of the
FDLXΠ design for performance and for a good performance cost ratio.
The optimization is based on a floating point workload.
+%
Table 9.27 lists the cost, the cycle time TFDLX of the CPU, and the memory
access times for the pipelined FDLX design. The total cache size varies
between 0KB and 32KB. The 64MB main memory uses DRAMs which
are 4 (8) times slower and denser than SRAM.
As before, doubling the cache size roughly doubles the cost of the mem-
ory environment. However, due to the expensive floating point unit, a
cache system of 1KB to 4KB only causes a moderate (25 - 65%) increase
of the total hardware cost. In combination with small caches, the FPU
'
*
P IPELINED DLX Cost, CPU cycle time and memory access time of the FDLX Π design
M ACHINE WITH
total CM CFDLX
F LOATING P OINT TFDLX TM 4 TM 8
cache [kG] [kG] [%]
C ORE
0KB 0 149 100 98 355 683
1KB 30 179 120 98 359 687
2KB 52 201 135 98 367 695
4KB 96 246 165 98 379 707
8KB 184 334 224 98 382 710
16KB 360 510 342 104 385 713
32KB 711 861 578 107 388 716
dominates the CPU cycle time. Beyond a total cache size of 16KB, the
detection of a cache hit becomes time critical.
The memory access time grows with the cache size; it is significantly
larger than the CPU cycle time. As before, the actual memory access is
therefore performed in W cycles with a cycle time of
τM TM W
τ maxτM TFDLX
"
In addition to the integer benchmarks of table 4.20, the SPEC92 suite also
comprises 14 floating point benchmarks (for details see [Sta, HP96]). On
average, this floating point workload SPECfp92 uses the instruction mix
listed in table 9.28; this table is derived from [Del97].
The non-uniform latency of the execute stage makes it very difficult (or
even impossible) to derive the CPI ratio of the pipelined FDLX design
in an analytic manner. In [Del97], the CPI ratio is therefore determined
by a trace based simulation. Assuming an ideal memory which performs
every access in a single cycle, the FDLX design achieves on the SPECfp92
workload a CPI ratio of
CPIideal f p 1759
'
*'
Instruction mix of the average SPECfp92 floating point workload E VALUATION
instruction FXU load store jump branch
frequency [%] 39.12 20.88 10.22 2.32 10.42
Memory access time of the FDLX design with cache memory (given
in CPU cycles)
The split cache system of the FDLX design has a non-uniform access
time which depends on the type of the access (table 9.29). Thus, a read
miss takes 1 S W cycles. In the FDLX design each cache line has S 4
sectors. The parameter W depends on the speed of the memory system; in
this framework, it varies between 3 and 16 cycles.
The whole pipeline is stalled in case of a slow data memory access.
On an instruction fetch miss, only the fetch and decode stage are stalled,
the remaining stages still proceed. However, these stages get eventually
drained since the decode stage provides no new instructions. Thus, an
instruction fetch miss will also cause a CPI penalty.
In order to keep the performance model simple, we assume that the
whole pipeline is stalled on every slow memory access. That gives us a
lower bound for the performance of the pipelined FDLX design. In anal-
ogy to equation 6.5 (page 312), the CPI ratio of the FDLXΠ design with
cache memory can then be modeled as
CPI f p CPIideal f p νstore 1 W
ν f etch pIm νload store pDm W S
by about 30%. This suggests that the data accesses require a larger work-
ing set than the instruction fetches, and that the instruction fetches have a
better locality. A larger cache improves the CPI ratio but with diminishing
returns. Since a larger cache also increases the cycle time, the 16KB cache
system even yields a worse performance than the 8KB system. Thus, with
respect to performance, a total of 8KB cache is optimal.
Without caches, every memory access takes 1 W cycles, and the pipe-
lined FDLX design then has a CPI ratio of
For a realistic quality measure, the parameter q lies in the range [0.2, 0.5].
Within this range, the design with a total cache size of 4KB is best. The
8KB system only wins, if much more emphasis is put on the performance
than on the cost.
'&
*(
Speedup and cost increase of the FDLX Π with a split 2-way cache E VALUATION
over the design without cache
3
1KB split
2.8 2KB split
quality ratio (alpha =4)
4KB split
2.6 8KB split
2.4 no cache
2.2
2
1.8
1.6
1.4
1.2
1
0.8
0 0.1 0.2 0.3 0.4 0.5 0.6
quality paramter: q
4
1KB split
2KB split
quality ratio (alpha =8)
2.5
1.5
0.5
0 0.1 0.2 0.3 0.4 0.5 0.6
quality paramter: q
Quality ratio of the design with a split 2-way cache relative to the
design without cache for two types of off-chip memory.
''
*
0 %&
P IPELINED DLX
M ACHINE WITH ) An arithmetical FPU instruction Ii updates the SPR register
F LOATING P OINT IEEEf by a read-modify-write access:
C ORE
IEEE f i IEEE f i 1 F f li
Unlike any other instruction updating the SPR register file, the input of
this write access is provided via the special write port Di6 and not via
the standard write port Din. That complicates the forwarding of the SPR
operand S. In order to keep the engine forwarding engine (section 9.4.2)
lean, the forwarding of the IEEEf flags generated by an arithmetical FPU
operation was omitted.
) Sketch the changes of the design required if we want to make
division fully pipelined (Conceptually, this makes the machine much sim-
pler). Estimate the extra cost.
'(
*(
) Evaluate the quality of the machines from exercises 9.4 and
9.5. Assume, that the cycle time is not affected. For the machine from E XERCISES
exercise 9.5 use your estimate for the cost. Compare with the machine
constructed in the text.
'/
Appendix
A
DLX Instruction Set
Architecture
6 5 5 16
I-type opcode RS1 RD immediate
6 5 5 5 5 6
R-type opcode RS1 RS2 RD SA function
6 26
J-type opcode PC offset
( The three instruction formats of the DLX design. The fields RS1 and
RS2 specify the source registers, and the field RD specifies the destination regis-
ter. Field SA specifies a special purpose register or an immediate shift amount.
Function field is an additional 6-bit opcode.
All three instruction formats (figure A.1) have a 6-bit primary opcode and
specify up to three explicit operands. The I-type (Immediate) format spec-
ifies two registers and a 16-bit constant. That is the standard layout for
instructions with an immediate operand. The J-type (Jump) format is used
for control instructions. They require no explicit register operand and profit
from the larger 26-bit immediate operand. The third format, R-type (Regis-
ter) format, provides an additional 6-bit opcode (function). The remaining
20 bits specify three general purpose registers and a field SA which spec-
ifies a 5-bit constant or a special purpose register. A 5-bit constant, for
example, is sufficient as shift amount.
'
( J-type instruction layout; sxt(imm) is the sign-extended version of the F LOATING -P OINT
26-bit immediate called PC Offset. E XTENSION
IR[31:26] mnemonic effect
Control Operation
hx02 j PC = PC + 4 + sxt(imm)
hx03 jal R31 = PC + 4; PC = PC + 4 + sxt(imm)
hx3e trap trap = 1; Edata = sxt(imm)
hx3f rfe SR = ESR; PC = EPC; DPC = EDPC
Since the DLX description in [HP90] does not specify the coding of the
instruction set, we adapt the coding of the MIPS R2000 machine ([PH94,
KH92]) to the DLX instruction set. Tables A.2 through A.4 specify the
instruction set and list the coding; the prefix “hx” indicates that the number
is represented as hexadecimal. The effects of the instructions are specified
in a register transfer language.
ESIDES THE fixed point unit, the DLX architecture also comprises a
floating point unit FPU, which can handle floating point numbers in
single precision (32-bits) or in double precision (64-bits). For both preci-
sions, the FPU fully conforms the requirements of the ANSI/IEEE standard
754 [Ins85].
," 8
The FPU provides 32 floating point general purpose registers FPRs, each
of which is 32 bits wide. In order to store double precision values, the
registers can be addressed as 64-bit floating point registers FDRs. Each of
the 16 FDRs is formed by concatenating two adjacent FPRs (table A.5).
Only even numbers 0 2 30 are used to address the floating point reg-
isters FPR; the least significant address bit is ignored. In addition, the FPU
provides three floating point control registers: a 1-bit register FCC for the
floating point condition code, a 5-bit register IEEEf for the IEEE exception
flags and a 2-bit register RM specifying the IEEE rounding mode.
'
DLX I NSTRUCTION ( R-type instruction layout. All instructions increment the PC by four.
S ET SA is a shorthand for the special purpose register SPRSA; sa denotes the 5-bit
A RCHITECTURE immediate shift amount specified by the bits IR[10:6].
The DLX machine uses two formats (figure A.2) for the floating point
instructions; one corresponds to the I-type and the other to the R-type of
the fixed point core. The FI-format is used for loading data from memory
'
( I-type instruction layout. All instructions except the control instruc- F LOATING -P OINT
tions also increment the PC by four; sxt a is the sign-extended version of a. E XTENSION
The effective address of memory accesses equals ea GPRRS1 sxt imm,
where imm is the 16-bit intermediate. The width of the memory access in bytes is
indicated by d. Thus, the memory operand equals m M ea d 1 M ea.
' #
DLX I NSTRUCTION ( Register map of the general purpose floating point registers
S ET
floating point
A RCHITECTURE floating point registers
general purpose registers
single precision (32-bit)
double precision (64-bit)
FPR3131 : 0 FDR3063 : 32
FDR3063 : 0
FPR3031 : 0 FDR3031 : 0
: :
FPR331 : 0 FDR263 : 32
FDR263 : 0
FPR231 : 0 FDR231 : 0
FPR131 : 0 FDR063 : 32
FDR063 : 0
FPR031 : 0 FDR031 : 0
6 5 5 16
FI-type Opcode Rx FD Immediate
6 5 5 5 3 6
FR-type Opcode FS1 FS2 / Rx FD 00 Fmt Function
( Floating point instruction formats of the DLX. Depending on the pre-
cision, FS1, FS2 and FD specify 32-bit or 64-bit floating point registers. RS
specifies a general purpose register of the FXU. Function is an additional 6-bit
opcode. Fmt specifies a number format.
into the FPU respectively for storing data from the FPU into memory. This
format is also used for conditional branches on the condition code flag
FCC of the FPU. The coding of those instructions is given in table A.6.
The FR-format is used for the remaining FPU instructions (table A.8). It
specifies a primary and a secondary opcode (Opcode, Function), a number
format Fmt, and up to three floating point (general purpose) registers. For
instructions which move data between the floating point unit FPU and the
fixed point unit FXU, field FS2 specifies the address of a general purpose
register RS in the FXU.
Since the FPU of the DLX machine can handle floating point numbers
with single or double precision, all floating point operations come in two
version; the field Fmt in the instruction word specifies the precision used.
In the mnemonics, we identify the precision by adding the suffix ‘.s’ (sin-
gle) or ‘.d’ (double).
' &
( FI-type instruction layout. All instructions except the branches also F LOATING -P OINT
increment the PC, PC += 4; sxt(a) is the sign extended version of a. The effective E XTENSION
address of memory accesses equals ea = RS + sxt(imm), where imm is the 16-bit
offset. The width of the memory access in bytes is indicated by d. Thus, the
memory operand equals m M ea d 1 M ea.
( Floating-Point Relational Operators. The value 1 (0) denotes that the
relation is true (false).
' '
DLX I NSTRUCTION
S ET
A RCHITECTURE ( FR-type instruction layout. All instructions execute PC += 4. The for-
mat bits Fmt = IR[8:6] specify the number format used. Fmt = 000 denotes single
precision and corresponds to the suffix ‘.s’ in the mnemonics; Fmt = 001 denotes
double precision and corresponds to the suffix ‘.d’. FCC denotes the 1-bit register
for the floating point condition code. The functions sqrt(), abs() and rem() denote
the square root, the absolute value and the remainder of a division according to
the IEEE 754 standard. Instructions marked with will not be implemented in
our FPU design. The opcode bits c3 : 0 specify a relation “con” according to
table A.7. Function cvt() converts the value of a register from one format into
another. For that purpose, FMT = 100 (i) denotes fixed point format (integer) and
corresponds to suffix ‘.i’ .
' (
Appendix
B
Specification of the FDLX
Design
IGURES 9.16, 9.17 and 9.18 depict the FSD of the FDLX design. In
section B.1, we specify for each state of the FSD the RTL instructions
and their active control signals. In section B.2 we then specify the control
automata of the FDLX design.
.,
In stage IF, the FDLX design fetches the next instruction I into the instruc-
tion register (table B.1). This is done under the control of flag f etch and
of clock request signal IRce. Both signals are always active.
.+
The actions which the FDLX design performs during instruction decode
depend on the instruction I held in register IR (table B.2). As for stage IF,
the clock request signals are active in every clock cycle. The remaining
control signals of stage ID are generated by a Mealy control automaton.
S PECIFICATION OF
THE FDLX D ESIGN ) RTL instructions of the stage IF
RTL instruction control signals
IR1 IM DPC fetch, IRce
) RTL instructions of stage ID; * denotes any arithmetical floating
point instruction with double precision.
' )
# 1=
RTL I NSTRUCTIONS
The execute stage has a non-uniform latency which varies between 1 and OF THE FDLX
21 cycles. The execute stage consists of the five substages 2.0, 2.1 to 2.4.
For the iterative execution of divisions stage 2.0 itself consists of 17 sub-
stages 2.0.0 to 2.0.16. In the following, we describe the RTL instructions
for each substage of the execute stage.
-A
In stage 2.0, the update of the buffers depends on the latency of the instruc-
tion I. Let
3 if I has latency of l 1
k
2231 ifif II has latency of l 3
has latency of l 5
stage 2.0 then updates the buffers as
IRk Cad k Sad k Fad k : IR2 Cad 2 Sad 2 Fad 2
PCk DPCk DDPCk : PC DPC DDPC
3 if k 24
k
2 j 1 if k 2 j 24
-
Tables B.3 and B.4 list the RTL instructions for the fixed point instructions
and for the floating point instructions with 1-cycle execute latency. From
stage 2.0, these instructions directly proceed to stage 3.
The operand FB is only needed in case of a floating point test operation
,. By f cc and Fc we denote the results of the floating point condition test
circuit FC ON as defined in section 8.5
Tables B.5 and B.6 list the RTL instructions which stage 2.0 performs
for instructions with an execute latency of more than one cycle.
' *
S PECIFICATION OF ) RTL instructions of the execute stages for the fixed point instructions.
THE FDLX D ESIGN
state RTL instruction control signals
alu MAR A op B ALUDdoe, Rtype, bmuxsel
opA, opB, MARce, lat1
aluo MAR A op B, overflow? like alu, ovf?
aluI MAR A op co ALUDdoe, opA, MARce, lat1
aluIo MAR A op co overflow? like aluI, ovf?
testI MAR A rel co ? 1 : 0 ALUDdoe, test, opA, MARce,
lat1
test MAR A rel B ? 1 : 0 like testI, Rtype, bmuxsel, opB
shiftI MAR shift A co4 : 0 SHDdoe, shiftI, Rtype,
opA, MARce, lat1
shift MAR shift A B4 : 0 like shiftI, bmuxsel, opB
savePC MAR link linkDdoe, MARce, lat1
trap MAR co trap 1 coDdoe, trap, MARce, lat1
Ill MAR A ill 1 ADdoe, ill, opA, MARce, lat1
ms2i MAR S SDdoe, MARce, lat1
rfe
mi2s MAR A ADdoe, opA, MARce, lat1
noEX
addrL MAR A co ALUDdoe, add, opA, MARce,
lat1
addrS MAR A co F f l 3 0 ALUDdoe, add, amuxsel, opA,
MDRw opB, store.2, MARce, MDRce,
cls B MAR1 : 0000 Ffl3ce, lat1, tfpRdoe
-
The execute substages 2.1 and 2.2 are only used by the arithmetic instruc-
tions , # ,$# ,$ and , +. The RTL instructions for the divisions are
listed in table B.6 and for the other three types of operations they are listed
in table B.7.
- # &
In these two stages the FPU performs the rounding and packing of the
result (table B.8). In order to keep the description simple, we introduce
the following abbreviations: By FPrdR and FXrdR, we denote the out-
put registers of the first stage of the rounders FP RD and FX RD, respec-
tively. The two stages of the floating point rounder FP RD compute the
'#
) RTL instructions of the execute stages for floating point instructions RTL I NSTRUCTIONS
with a single cycle latency. OF THE FDLX
state RTL instruction control signals
addrL.s MAR A co ALUDdoe, add, opA, MARce, lat1
addrL.d
addrSf MAR A co ALUDdoe, add, opA, MARce,
MDRw FB store.2, fstore.2, tfpRdoe,
Ff l 3 0 MDRwce, Ffl3ce, lat1, (amuxsel)
mf2i MAR FA31 : 0 opFA, tfxDdoe, MARce, lat1
mi2f MDRw B B opB, tfpRdoe, MDRwce,
Ff l 3 0 Ffl3ce, lat1
fmov.s MDRw FA opFA, fmov, tfpRdoe,
fmov.d Ff l 3 0 MDRwce, Ffl3ce, lat1
fneg.s MDRw Fc63 : 0 opFA, FcRdoe, MDRwce,
fneg.d Ff l 3 Fc68 : 64 Ffl3ce, lat1
fabs.s MDRw Fc63 : 0 opFA, FcRdoe, MDRwce, abs
fabs.d Ff l 3 Fc68 : 64 Ffl3ce, lat1
fc.s, MAR 031 f cc opFA, opFB, ftest, fccDdoe, MARce
fc.d Ff l 3 MDRw Fc FcRdoe, MDRwce, Ffl3ce, lat1
) RTL instructions of the execute substage 2.0 for instructions with a
latency of at least 3 cycles.
'#
S PECIFICATION OF ) RTL instructions of the iterative division for stages 2.0.1 to 2.2 (single
THE FDLX D ESIGN precision). In case of double precision (suffix ‘.d’), an additional control signal
dbr is required in each state. A multiplication always takes two cycles. Since the
intermediate result is always held in registers s and c, we only list the effect of the
multiplication as a whole.
β fa Eb 2 p1 fb
E 2 p2 ; if β 0
fd E
E 2 p2 ;; ifif ββ
0
0 fdiv,
Fr f lq sq eq fd FqFrdoe, Frce
functions FPrd1 and FPrd2 as specified in section 8.4. The the fixed
point rounder FX RD (page 427) also consists of two stages. They compute
the functions denoted by FXrd1 and FXrd2 .
& 5
Table B.9 lists the RTL instructions which the FDLX design performs in
stage M. In addition, stage M updates the buffers as follows:
IR4 Cad 4 Sad 4 Fad 4 : IR3 Cad 3 Sad 3 Fad 3
PC4 DPC4 DDPC4 : PC3 DPC3 DDPC3
'#
) RTL instructions of the substages 2.1 and 2.2, except for the divisions. RTL I NSTRUCTIONS
OF THE FDLX
state RTL instruction control signals
Mul1.s sq eq SigExpMD Fa21 Fb21 sqce, eqce,
Mul1.d f lq SpecMD Fa21 Fb21 nan21 flqce, sce, cce,
s c mul1 Fa21 Fb21 faadoe, fbbdoe
Add1.s ASr AS1 Fa21 Fb21 nan21 ASrce
Add1.d
Sub1.s ASr AS1 Fa21 Fb21 nan21 ASrce, sub
Sub1.d
Mul2.s f q mul2 s c
Mul2.d Fr f lq sq eq f q FqFrdoe, Frce
SigAdd.s Fr AS2 ASr FsFrdoe, Frce
SigAdd.d
'##
S PECIFICATION OF ) RTL instructions of the write back stage WB
THE FDLX D ESIGN
state RTL instruction control signals
sh4l GPRCad 4 GPRw, load.4
sh4l MDs MAR1 : 0000
sh4l.s FPRFad 4 MDs FPRw, load.4
sh4l.d FDRFad 4 MDRr FPRw, load.4, dbr.4
wb GPRCad 4 C4 GPRw
mi2sW SPRSad 4 C4 SPRw
fcWB like mi2sW, SPRw,
IEEE f IEEE f F f l 4 fop.4
WBs FPR Fad 4 FC 31 : 0 FPRw
flagWBs like WBs, FPRw,
IEEE f IEEE f F f l 4 fop.4
WBd FDR Fad 4 FC FPRw, dbr.4
flagWBd like WBd, FPRw,
IEEE f IEEE f F f l 4 fop.4
noWB (no update)
' ;
Table B.10 lists the RTL instructions which the FDLX design processes in
stage WB, given that no unmasked interrupt occurred. In case of a JISR,
the FDLX design performs the same actions as the the DLXΠ design (chap-
ter 5).
precomputed control.
- .+
According to table B.2, the clock request signals of stage ID are indepen-
dent of the instruction. Like in stage IF, they are always active. Thus,
the control automaton of stage ID only needs to generate the remaining
13 control signals. Since they depend on the current instruction word, a
Mealy automaton is used.
Table B.11 lists the disjunctive normal form for each of these signals.
The parameters of the ID control automaton are listed in table B.16 on
page 539.
'#'
S PECIFICATION OF ) Type x0 control signals to be precomputed during stage ID (part 1)
THE FDLX D ESIGN
signals states of stage 2.0
lat1 alu, aluo, aluI, aluIo, test, testI, shift, shiftI, savePC,
trap, mi2s, noEX, ill, ms2i, rfe, addrL, addrS, addrL.s,
addrL.d, addrSf, mf2i, mi2f, fmov.s, fmov.d, fneg.s,
fneg.d, fabs.s, fabs.d, fc.s, fc.d
lat3 cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i, cvt.i.s, cvt.i.d
lat5 fmul.s, fmul.d, fadd.s, fadd.d, fsub.s, fsub.d
lat17 fdiv.s
lat21 fdiv.d
opA alu, aluo, aluI, aluIo, test, testI, shift, shiftI, mi2s,
noEX, ill, addrL, addrS, addrL.s, addrL.d, addrSf
opB alu, aluo, test, shift, addrS, mi2f
opFA fmov.s, fmov.d, fneg.s, fneg.d, fabs.s, fabs.d, fc.s,
fc.d, cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i, cvt.i.s, cvt.i.d,
fmul.s, fmul.d, fadd.s, fadd.d, fsub.s, fsub.d, fdiv.s,
fdiv.d
opFB addrSf, fc.s, fc.d, fmul.s, fmul.d, fadd.s, fadd.d,
fsub.s, fsub.d, fdiv.s, fdiv.d
Although multiplications only use two of these tristate drivers, the precom-
puted control provides six enable signals
1 if I ,$# ,$
f aadoe f bbdoe
0 otherwise
Eadoe Aadoe xadoe xbdoe 0
'#/
S PECIFICATION OF ) Control signals of type x1 to z to be precomputed during stage ID
THE FDLX D ESIGN
signals states of stage 2.0
x.1 sub fsub.s, fsub.d
faadoe, fmul.s, fmul.d
fbbdoe
x.2 fdiv fdiv.s, fdiv.d
FqFrdoe fmul.s, fmul.d, fdiv.s, fdiv.d
FsFrdoe fadd.s, fadd.d, fsub.s, fsub.d
x.4 Ffl3ce, addrS, addrSf, mi2f, fmov.s, fmov.d, fneg.s, fneg.d,
MDRwce fabs.s, fabs.d, fc.s, fc.d, cvt.s.d, cvt.s.i, cvt.d.s,
cvt.d.i, cvt.i.s, cvt.i.d, fmul.s, fmul.d, fadd.s, fadd.d,
fsub.s, fsub.d, fdiv.s, fdiv.d
FpRdoe cvt.d.s, cvt.i.s, cvt.s.d, cvt.i.d, fadd.s, fadd.d, fsub.s,
fsub.d, fmul.s, fmul.d, fdiv.s, fdiv.d
FxRdoe cvt.s.i, cvt.d.i
y amuxsel, addrS, addrSf
Dmw,
store
MARce, alu, aluo, aluI, aluIo, test, testI, shift, shiftI, savePC,
C4ce trap, mi2s, noEX, ill, ms2i, rfe, addrL, addrS, ad-
drL.s, addrL.d, addrSf, mf2i
FC4ce, mi2f, fmov.s, fmov.d, fneg.s, fneg.d, fabs.s, fabs.d,
Ffl4ce fc.s, fc.d, cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i, cvt.i.s,
cvt.i.d, fmul.s, fmul.d, fadd.s, fadd.d, fsub.s, fsub.d,
fdiv.s, fdiv.d
z DMRrce, addrL, addrL.s, addrL.d
Dmr, load
fop fc.s, fc.d, cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i, cvt.i.s,
cvt.i.d, fmul.s, fmul.d, fadd.s, fadd.d, fsub.s, fsub.d,
fdiv.s, fdiv.d
dbr fmov.d, fneg.d, fabs.d, cvt.s.d, cvt.i.d, fmul.d,
fadd.d, fsub.d, fdiv.d
SPRw mi2s, rfe, fc.s, fc.d
GPRw alu, aluo, aluI, aluIo, test, testI, shift, shiftI, savePC,
ms2i, addrL, addrL.s, addrL.d, mf2i
FPRw addrL.s, addrL.d, mi2f, fmov.s, fmov.d, fneg.s,
fneg.d, fabs.s, fabs.d, cvt.s.d, cvt.s.i, cvt.d.s, cvt.d.i,
cvt.i.s, cvt.i.d, fmul.s, fmul.d, fadd.s, fadd.d, fsub.s,
fsub.d, fdiv.s, fdiv.d
'#)
) Types of the precomputed control signals C ONTROL
AUTOMATA OF THE
type x.0 x.1 x.2 x.3 x.4 y z
FDLX D ESIGN
number 31 7 3 0 3 3 6
) Parameters of the two control automata which govern the FDLX Π
design. Automaton id generates the Mealy signals for stage ID; automaton ex
precomputes the Moore signals of the stages EX to WB.
Except on divisions, the busses opa and opb are only used in stage 2.1.
Thus, together with signal sub (floating point subtraction), the FDLX de-
sign requires 7 type x1 control signals.
Tables B.17 and B.18 lists the disjunctive normal forms for the automa-
ton which controls the stages EX to WB. The parameters of this Moore
automaton are summarized in table B.16.
'#*
S PECIFICATION OF
THE FDLX D ESIGN
) Disjunctive normal forms of the precomputed control which governs
stages EX to WB (part 1)
'&
C ONTROL
AUTOMATA OF THE
FDLX D ESIGN
) Disjunctive normal forms used by the precomputed control (part 2)
state IR31 : 26 IR5 : 0 Fmt length
addrL.s 110001 ****** *** 6
addrL.d 110101 ****** *** 6
addrSf 111*01 ****** *** 5
fc.s 010001 11**** 000 11
fc.d 010001 11**** 001 11
mf2i 010001 001001 *** 12
mi2f 010001 001010 *** 12
fmov.s 010001 001000 000 15
fmov.d 010001 001000 001 15
fadd.s 010001 000000 000 15
fadd.d 010001 000000 001 15
fsub.s 010001 000001 000 15
fsub.d 010001 000001 001 15
fmul.s 010001 000010 000 15
fmul.d 010001 000010 001 15
fdiv.s 010001 000011 000 15
fdiv.d 010001 000011 001 15
fneg.s 010001 000100 000 15
fneg.d 010001 000100 001 15
fabs.s 010001 000101 000 15
fabs.d 010001 000101 001 15
cvt.s.d 010001 010000 001 15
cvt.s.i 010001 010000 100 15
cvt.d.s 010001 010001 000 15
cvt.d.i 010001 010001 100 15
cvt.i.s 010001 010100 000 15
cvt.i.d 010001 010100 001 15
accumulated length of the monomials 196
'&
Bibliography
'&/
Index