
COMPATIBLE HARDWARE FOR DIVISION AND SQUARE ROOT 1

George S. Taylor

Computer Science Division
University of California
Berkeley, California 94720

ABSTRACT

Hardware for radix four division and radix two square root is shared in a processor designed to implement the proposed IEEE floating-point standard. The division hardware looks ahead to find the next quotient digit in parallel with the next partial remainder. An 8-bit ALU estimates the next remainder's leading bits. The quotient digit look-up table is addressed with a truncation of the estimate rather than a truncation of the full partial remainder. The estimation ALU and the look-up table are asymmetric for positive and negative remainders. This asymmetry reduces the width of the ALU and the number of minterms in the logic equations for the look-up table. The square root algorithm obtains the correctly rounded result in about two division times using small extensions to the division hardware.

Design Overview
Our division and square root board contains 150 ICs. Division is limited to a single board because of constraints on the size of the entire accelerator and the difficulty of passing wide operands between boards. 65 ICs on the addition/subtraction board also support the division operation. All parts are Schottky TTL except three programmable array logic (PAL²) packages which implement the quotient digit look-up table.

The accelerator supports three floating-point formats: single, double and (double) extended, with significand widths of 24, 53 and 64 bits, respectively. We use the term significand rather than fraction as a reminder that the significant digit field of normalized numbers in all formats has one bit to the left of the binary point. Single and double precision significands are left justified with zero fill before reaching the division board, so its data paths are designed for 64-bit operands. The operands are positive numbers because KCS uses sign-magnitude representation.

Internal data paths and functional units are slightly wider than the operands. Quotients in all formats are developed with three more bits than the operands have in order to allow unbiased rounding with an error $\le \frac{1}{2}$ ULP (unit in the last place), as KCS requires. The three bits are called Guard, Round and Sticky. The Guard bit is used if the quotient is normalized by one bit before rounding, the maximum normalization that KCS permits. The Sticky bit is equal to zero only if the result is exact, i.e., all subsequent bits in an infinite precision result would be zero. Square roots have two more bits, Round and Sticky, than the operand because there is no need for normalization. The results are rounded after they leave the division board.

Register to register operation times are given in Table 1. Our division iteration produces twice as many bits per cycle, but has the same cycle time as the original VAX accelerator. The inner loop accounts for about two-thirds of the time in each instruction.
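As an illustration of how the Guard, Round and Sticky bits described in the Design Overview could drive unbiased rounding, the following Python sketch rounds a quotient to nearest with ties to even. It is only a sketch under stated assumptions: the function name `round_nearest_even`, the bit layout (Sticky in bit 0) and the returned exponent adjustment are hypothetical, and the paper performs rounding off the division board without giving this level of detail.

```python
def round_nearest_even(q_bits, width):
    """Round a quotient that carries `width` significand bits plus three
    extra bits (Guard, Round, Sticky), with Sticky in bit 0.

    Returns (significand, exponent_adjust). The layout and the meaning of
    exponent_adjust are assumptions made for this illustration.
    """
    if (q_bits >> (width + 2)) & 1:
        # Leading bit is one: no normalization needed.
        sig = q_bits >> 3
        round_bit = (q_bits >> 2) & 1          # Guard acts as the rounding bit
        sticky = 1 if q_bits & 0b011 else 0    # Round and Sticky merge
        exp_adjust = 0
    else:
        # Leading bit is zero: normalize left by one bit.
        sig = (q_bits >> 2) & ((1 << width) - 1)   # Guard becomes the last bit
        round_bit = (q_bits >> 1) & 1
        sticky = q_bits & 1
        exp_adjust = -1
    # Round to nearest; on an exact tie, round to the even significand.
    if round_bit and (sticky or (sig & 1)):
        sig += 1
        if sig >> width:                       # rounding carried out of the top
            sig >>= 1
            exp_adjust += 1
    return sig, exp_adjust
```

For example, with a 4-bit significand, `round_nearest_even(0b1010100, 4)` is an exact tie on an even significand and leaves it unchanged, while `round_nearest_even(0b1011100, 4)` is a tie on an odd significand and rounds up.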

Introduction
An IEEE Computer Society working group has recommended a standard for binary floating-point arithmetic based on the proposal by Kahan, Coonen and Stone [1][2]. To investigate the feasibility of the KCS architecture, we are building a substitute floating-point accelerator for the DEC VAX 11/780 minicomputer [3]. The proposed standard requires that an implementation provide correctly rounded quotients and square roots. We found that radix four division hardware provides high speed at reasonable cost and, as a by-product, accommodates square root with minor extensions. This paper describes the algorithms and hardware for both operations.

Antecedents
We use nonrestoring division with redundant quotient digits and an irredundant partial remainder. Selection of another digit is overlapped with calculation of a partial remainder using the current digit. The theory and general implementation of higher radix nonrestoring division are explained by Atkins [4][5][6], based on the work of Robertson [7]. The Illiac III was an early machine which selected quotient digits using truncations of the divisor and partial remainder [8]. Tan reports [9] that certain IBM processors use a short precision ALU to estimate the next remainder. Quotient selection which is overlapped with the full width remainder iteration in this way is classified as QS2 by Kalaycioglu [10]. Baron's study [11] of several division schemes includes a radix four method similar to ours, but she recommends more redundancy in the quotient digit representation than we found to be optimal.
Table 1. Accelerator Instruction Times (μsec)

Instructions: Divide (single, double, extended); Square root (single, double, extended)
Columns: VAX, Berkeley
Values in extraction order (row and column assignment not recoverable): 2.6, 4.2, 8.8, 4.2, 7.6, --, --, 4.8, 5.6, 9.2

1 This work was supported by the U. S. Department of Energy under contract DE-AT03-76SF00034, project agreement DE-AS03-79ER10358, and by the National Science Foundation under grant MCS78-07291. The author was supported by an NSF Graduate Fellowship.

2 PAL is a trademark of Monolithic Memories.

CH1630-3/81/0000/0127$00.75 © 1981 IEEE

Radix Four
We had more board space available than the simplest radix two restoring division scheme requires. The original VAX accelerator produced one quotient bit per 100 ns by this method. We wished to make division faster in order to keep it in balance with addition and multiplication. Radix two with a redundant partial remainder and carry save addition could run at less than 100 ns per cycle, but the hardware cost would be substantial and square root could not be accommodated easily. Radix four at 100 ns per cycle provides equal or better performance at lower cost. Higher radix methods were unattractive because they required divisor multiples which could not be generated merely by shifting. Radix four can be implemented with a two-input ALU and a two-input multiplexer in front of the divisor register, instead of functional units with three or more inputs as the higher radix methods would require.

We use a nonrestoring method rather than a restoring one so that the remainder iteration requires only a single data path and no backtracking. Backtracking would waste the gain made through lookahead quotient digit selection. The cost is that negative quotient digits must be combined in an ALU with the ones previously accumulated. Due to the low degree of redundancy in our quotient digit representation, this ALU must be the full width of the quotient rather than the width of one quotient digit. Since the ALU on the accelerator's addition board can be shared for this purpose, the division board contains only one full precision ALU.

The VAX 11/780 microinstruction cycle time is 200 ns, with minor cycles at 50 ns intervals. For this reason, our inner loop times had to be 50, 100 or 200 ns. The 67-bit ALU in the main partial remainder data path takes 64 ns because we use small 74S381 parts. This led us to design simple data paths in order to achieve a 100 ns step time. The microprogrammer can request either one or two division or square root steps per microinstruction cycle.

The hardware necessary for division alone is shown in Figure 1, while that for square root alone is shown in Figure 2. Shared functional units and data paths are readily identified. Some small data paths which ease the tasks of loading the operands and generating the Sticky bit are not included.
Division
The division procedure was chosen through a series of decisions:
(1) the radix -- four
(2) divisor multiples -- one and two, but not three
(3) parallelism -- overlapped quotient and remainder
(4) width of remainder estimate -- eight bits
(5) estimation ALU operation -- asymmetric
The choices made imply that the quotient selection logic must observe seven remainder bits and four divisor bits (one known implicitly). Table 2 is a P-D plot of the logic, where $\hat g_{i+1}$ (to be defined later) plays the role of the partial remainder. The choices are explained next.

Table 2. P-D plot for quotient selection logic (axes: estimated next remainder; divisor, positive).

Figure 1. Division data paths.

Figure 2. Square root data paths.

A square root algorithm shifts its remainder by twice the number of result bits found per iteration. A division algorithm, by contrast, shifts its remainder by the same number of bits as are generated during the cycle. Division and square root hardware can be merged if twice as many bits are produced per division step as per square root step. Thus radix four division hardware has the advantage of convenient reuse for square root.

Redundancy and Simple Divisor Multiples
A redundant quotient digit representation permits lookahead logic to select the next digit before the full precision next remainder is determined. For maximum redundancy in radix four, quotient digits could be selected from a set containing up to seven values: {-3, -2, -1, 0, 1, 2, 3}. Because the multiple of three times the divisor would be costly to generate, quotient digits are selected instead from the set {-2, -1, 0, 1, 2}. The cost is more complicated quotient selection hardware, but programmable logic limits the increase to a few ICs.

Parallelism
In the algorithm's inner loop, a quotient digit is selected and that multiple of the divisor is subtracted from the shifted previous remainder. Using notation suggested by Atkins and Kalaycioglu [12],

$\frac{1}{r} p_{i+1} = p_i - q_{i+1} d$ for $i = 0, \ldots, m-1$    (1a)

where

$p_i$ = partial remainder after $i$th iteration
$p_0$ = dividend
$r$ = radix = 4
$q_i$ = $i$th quotient digit
$d$ = divisor

$m$ is the number of radix $r$ digits in $Q_m$, the last quotient before rounding. $Q_m$ has the form $q_1 . q_2 \ldots q_m$, with the binary point between $q_1$ and $q_2$.

In a logical sense, the quotient is accumulated during the iteration by resolving the negative quotient digits. Using $Q_i$ for the partial quotient after the $i$th iteration,

$\frac{1}{r} Q_{i+1} = Q_i + q_{i+1}$    (1b)

where $Q_0 = 0$. An equivalent procedure saves hardware in our design. The positive and negative $q_i$'s are held in separate shift registers. At the end of the iteration, the negative register is subtracted from the positive one.

It is not possible first to select $q_{i+1}$ and then carry out equation (1a) in 100 ns. If $q_{i+1}$ were known immediately, the worst case delay to form the next full-width remainder would be:

    Read $q_{i+1}$ from flip-flop QFF      9 ns
    Select $q_{i+1} d$ through MUX        18 ns
    Add or subtract in DALU               64 ns
    Setup time for $p_{i+1}$ in RR         5 ns
    Total                                 96 ns

In order to know $q_{i+1}$ at the beginning of a cycle, it is calculated in parallel during the previous one. The GALU in Figure 1 uses truncations of $p_i$ and $q_{i+1} d$ to guess quickly the leading bits of $p_{i+1}$. Then the quotient digit logic equations are evaluated using the guess. For later reference, define $\tilde p_i$ and $\widetilde{q_{i+1} d}$ as the truncated inputs to GALU, and $\frac{1}{r} g_{i+1}$ as its output. The quotient selection table is addressed with the leading bits of $g_{i+1}$ and $d$, which we denote by $\hat g_{i+1}$ and $\hat d$. $\hat g_{i+1}$ is not a truncation of $p_{i+1}$ since it may differ by one unit in its last place from the corresponding bits of $p_{i+1}$.

The worst case delay around the selection loop is:

    Read $q_{i+1}$ from flip-flop QFF            9 ns
    Select $q_{i+1} d$ through MUX              18 ns
    Add or subtract in GALU                     35 ns
    Prediction logic                            30 ns
    Setup time for $q_{i+2}$ flip-flop QFF       3 ns
    Total                                       95 ns

The Algorithm
Before examining the guess ALU in more detail, we need to explain the division algorithm. The significands of the initial division operands lie in the range

$0 \le$ dividend $p_0 < 2$    (2)

$1 \le$ divisor $d < 2$    (3)

because the divisor must be normalized. The partial remainders $p_i$ are two's complement, while the divisor $d$ is always positive.

After step $i$,

$-\frac{2}{3} d \le \frac{1}{r} p_{i+1} \le \frac{2}{3} d$    (4)

Consequently, at the beginning of the next step, after $\frac{1}{r} p_{i+1}$ has been shifted left to multiply by $r$,

$-\frac{8}{3} d \le p_{i+1} \le \frac{8}{3} d$    (5)

$p_{i+1}$ can be driven back into the interval of equation (4) by the appropriate subtraction or addition of zero, $d$ or $2d$. The process is illustrated in Figure 3.

Figure 3. Division algorithm.

The setup step for the algorithm selects $q_1$, which is used in the first iteration to move the dividend from the range of equation (2) into the range of equation (4). Since the dividend is strictly less than $2d$, ultimately $q_1$ contributes one bit to quotient $Q_m$ rather than two. If $q_1 = 2$ then $p_1$ will be negative and the adjustment to $q_1$ during the second iteration of equation (1b) will be subtraction. Consequently, $Q_m$ has an odd number of significant bits.
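The recurrence of equation (1a) and the remainder bounds of The Algorithm section can be exercised with a small Python sketch. It uses exact rational arithmetic and chooses each digit from the full ratio $p_i / d$, so it models only the arithmetic, not the lookahead prediction hardware. The function names `select_digit`, `radix4_divide` and `truncated_quotient` are hypothetical, and the nearest-integer selection rule is an assumption made for illustration; the paper's table-driven selection may pick a different digit wherever the redundant digit set allows more than one.

```python
from fractions import Fraction

def select_digit(p, d):
    """Pick q in {-2, ..., 2} so that |p - q*d| <= (2/3)*d.

    Exact selection from the full ratio: an illustrative stand-in for the
    paper's look-up table, not a model of it.
    """
    q = round(p / d)             # nearest integer to the ratio p/d
    return max(-2, min(2, q))    # clamp to the radix-4 digit set

def radix4_divide(p0, d, m):
    """Run m steps of p_{i+1} = r * (p_i - q_{i+1} * d).

    Returns (Q_m, p_m). Positive and negative digits are accumulated
    separately and subtracted at the end, as with the POS and NEG registers.
    """
    r = 4
    p, pos, neg = Fraction(p0), 0, 0
    for _ in range(m):
        q = select_digit(p, d)
        rem = p - q * d                          # unshifted remainder
        assert abs(rem) <= Fraction(2, 3) * d    # bound before the shift
        p = r * rem                              # shifted remainder
        pos = r * pos + max(q, 0)                # positive digits
        neg = r * neg + max(-q, 0)               # magnitudes of negative digits
    Q = Fraction(pos - neg, r ** (m - 1))        # binary point after q_1
    return Q, p

def truncated_quotient(p0, d, m):
    """Apply the final fixup: subtract one ULP if the last remainder is negative."""
    Q, p = radix4_divide(p0, d, m)
    if p < 0:
        Q -= Fraction(1, 4 ** (m - 1))
    sticky = 0 if p == 0 else 1
    return Q, sticky
```

With dividend 3/2 and divisor 5/4, the identity $p_0 = Q_m d + p_m / r^m$ holds exactly after any number of steps, which is a convenient check that the separately accumulated digits resolve to the right quotient.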

8-bit Next Remainder Prediction ALU
The guess ALU's width is chosen to satisfy the conflicting demands of high speed and simple quotient selection logic. Meeting an 8-bit boundary is desirable for design with 4-bit ALU slices. To determine the minimum reasonable width, we construct a table for the quotient selection logic. Inspection shows that five remainder bits and three divisor bits (plus the first bit which is always one) are enough to determine $q_{i+2}$ except in a few cases. Table 3 shows our quotient selection logic organized by these eight bits. In the exceptional cases, either one or two more bits of $\hat g_{i+1}$ must be observed.

The fifth and sixth columns of Table 3 contain the bounds on $\frac{p_{i+1}}{d}$ which can be set by observing the bits of $\hat g_{i+1}$ and $\hat d$ shown in columns two and four. $q_{i+2}$ can be selected only if the minimum and the maximum ratios in a given row are within $\frac{2}{3}$ units of the same integer. The bounds depend on the relationship between $\hat g_{i+1}$ and $p_{i+1}$.

The GALU's inputs are truncations of the main ALU's inputs.

$\tilde p_i = p_i$ chopped $= p_i + (-1, 0]$    (6)

$|\widetilde{q_{i+1} d}| = |q_{i+1} d$ chopped$| = |q_{i+1} d| + (-1, 0]$    (7)

where the intervals are in units of the least significant bit of GALU. Depending on the sign of $q_{i+1}$, the GALU forms

$\frac{1}{r} g_{i+1} = \tilde p_i \mp |\widetilde{q_{i+1} d}|$    (8)

Figure 4 illustrates the calculation of $g_{i+1}$. In ULPs of $g_{i+1}$,

$g_{i+1} = p_{i+1} + (-1, 1)$ if GALU performs $A-B$, but    (9a)

$g_{i+1} = p_{i+1} + (-2, 0]$ if GALU performs $A+B$    (9b)

$g_{i+1} = p_{i+1} + (-2, 0)$ if GALU performs $A-B-1$.    (9c)

Since a particular $\hat g_{i+1}$ may result from either addition or subtraction,

$p_{i+1} = g_{i+1} + [0, 2)$ ULPs of $g_{i+1}$    (10)

for the asymmetric GALU which performs $A+B$ or $A-B-1$. The quotient selection logic addressed by $\hat g_{i+1}$ and $\hat d$ can bound $\frac{p_{i+1}}{d}$ by:

Case +, for $\hat g_{i+1} \ge 0$:

$\frac{\hat g_{i+1}}{\hat d + 1\ \text{ULP of}\ \hat d} \le \frac{p_{i+1}}{d} < \frac{\hat g_{i+1} + 1\ \text{ULP of}\ \hat g_{i+1} + 1\ \text{ULP of}\ g_{i+1}}{\hat d}$    (11a)

Case -, for $\hat g_{i+1} < 0$:

$\frac{\hat g_{i+1}}{\hat d} \le \frac{p_{i+1}}{d} < \frac{\hat g_{i+1} + 1\ \text{ULP of}\ \hat g_{i+1} + 1\ \text{ULP of}\ g_{i+1}}{\hat d + 1\ \text{ULP of}\ \hat d}$    (11b)

The bounds can be evaluated once the widths of $g_{i+1}$ and $\hat d$ are chosen. In our design, $g_{i+1}$ has eight bits and $\hat d$ has four, so one ULP of $g_{i+1} = \frac{1}{16}$ and one ULP of $\hat d = \frac{1}{8}$.

As an example, assume that $\hat g_{i+1}$ = binary 0001.1 and $\hat d$ = binary 1.000. Then

$1.3333 = \frac{1.5}{1.0 + 0.125} \le \frac{p_{i+1}}{d} < \frac{1.5 + 0.5 + 0.0625}{1.0} = 2.0625$

Asymmetric GALU
The advantage of equation (10) over its counterpart for a symmetric GALU is that $g_{i+1}$ wiggles in only one direction. $\hat g_{i+1}$ is never larger than $p_{i+1}$. Many $\frac{\hat g_{i+1}}{\hat d}$ and $\frac{\hat g_{i+1}}{\hat d + 1\ \text{ULP of}\ \hat d}$ ratios are multiples of $\frac{1}{3}$. If a predicted minimum magnitude for $\frac{p_{i+1}}{d}$ equals $\frac{1}{3}$ or $\frac{4}{3}$, or a predicted maximum magnitude equals $\frac{2}{3}$ or $\frac{5}{3}$, then more than five bits of $\hat g_{i+1}$ must be observed. To avoid looking at more bits when the maximum ratio is $\frac{2}{3}$, for example, the predicted minimum ratio would have to be at least $\frac{1}{3}$. But Table 3 shows that the difference between the predicted bounds is never as small as $\frac{1}{3}$ unit. Since $\hat g_{i+1}$ is uncertain in only one direction rather than two, there is sufficient information without observing another bit in approximately half of the boundary cases.

Our quotient selection logic implements Table 3 using 39 minterms. An earlier design based on a 9-bit symmetric GALU would have required 56 minterms. An 8-bit symmetric ALU would have required even more minterms and at least one more bit in $\hat g_{i+1}$ or $\hat d$.

Although asymmetry decreases the size of both the ALU and the programmable logic, it might not simplify a RAM implementation. The width of $\hat g_{i+1}$ remains seven bits rather than six because of one bad case: see $(\hat g_{i+1}, \hat d)$ = (1110.0xx, 1.000) in Table 3. A single level RAM would require ten address bits. However, a two level RAM implementation, such as the one suggested by Tan [9], could trade one more bit of $\hat d$ for one less bit of $\hat g_{i+1}$.

Remainder
KCS defines a remainder operation whose result has magnitude no greater than half the divisor's magnitude. To produce this result, a fixup step is required after division. It is convenient to change RR from a register to a shift register so that the last partial remainder can be shifted back to the right by two bits in order to align it with the divisor.
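Returning to quotient selection, the ratio bounds behind Table 3 can be checked numerically. The Python sketch below computes the range of the ratio of the shifted remainder to the divisor implied by a truncated estimate and a truncated divisor, then tests whether that range lies within 2/3 of a single digit. It assumes the relationships stated in the text (the full remainder exceeds the 8-bit estimate by less than two of its ULPs, the estimate has ULP 1/16 and the 4-bit divisor has ULP 1/8). The names `ratio_bounds` and `selectable_digit` are hypothetical, and the closed-interval test is a simplification of whichever strict or non-strict inequalities the original table used.

```python
from fractions import Fraction

ULP_G = Fraction(1, 16)   # assumed ULP of the 8-bit estimate, format xxxx.xxxx
ULP_D = Fraction(1, 8)    # assumed ULP of the 4-bit divisor truncation, 1.xxx

def ratio_bounds(g_hat, ulp_g_hat, d_hat):
    """Return (low, high) bounding p/d given the observed leading bits.

    g_hat     : observed leading bits of the estimate, as a value
    ulp_g_hat : weight of the last observed bit of the estimate
    d_hat     : observed leading bits of the divisor, as a value
    """
    p_lo = g_hat                          # remainder is never below the estimate
    p_hi = g_hat + ulp_g_hat + ULP_G      # unseen estimate bits plus two ULPs
    d_lo, d_hi = d_hat, d_hat + ULP_D
    # Dividing by the larger divisor shrinks magnitudes, so pick per sign.
    low = p_lo / (d_hi if p_lo >= 0 else d_lo)
    high = p_hi / (d_lo if p_hi >= 0 else d_hi)
    return low, high

def selectable_digit(low, high):
    """Return a digit q with [low, high] inside [q - 2/3, q + 2/3], else None.

    Where two digits qualify (the overlap regions), the smaller is returned.
    """
    for q in (-2, -1, 0, 1, 2):
        if q - Fraction(2, 3) <= low and high <= q + Fraction(2, 3):
            return q
    return None
```

For the estimate 0001.1 and divisor 1.000 this gives the bounds 1.3333 and 2.0625 quoted in the text. For the estimate 1110.0 with divisor 1.000, five observed bits are not enough to fix a digit, while seven bits (1110.000) are, which is consistent with the bad case the text mentions.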

Verification
The quotient selection table was tested by simulation with all pairs of 12-bit dividends and divisors. No error was found and no part of the unimplemented region in Table 3 was accessed. Random modifications to the table caused errors to be detected.

Division Step by Step
The division operation proceeds in four steps. Refer to Figure 1.

Step 1
Load the divisor into DR. Load the dividend into RR through MUX and DALU. The MUX shifts the dividend right by four bits and there is a wired left shift by two bits at RR. The net effect is to shift the dividend right by two bits. The dividend is loaded by a roundabout path in order to save the space and delay which a multiplexer in front of RR would cost.

Step 2
Put $q_1$ into QFF by adding zero to RR in the ALUs and reloading RR. This leaves the dividend in RR with its binary point in the same relative position as the divisor's binary point occupies in DR. GALU's output $g_0$ equals the truncated dividend $p_0$. Consequently, the quotient selection logic chooses $q_1$ by comparing the dividend and divisor with their binary points correctly aligned.

Step 3
Repeat equations (1a) and (1b) 34 times. The sign bit of QFF controls the ALU operation. The other two bits control the MUX. At the end of each cycle, clock QFF into the POS and NEG registers, $p_{i+1}$ into RR, and $q_{i+2}$ into QFF.

Step 4
Subtract NEG from POS to form $Q_m$. If $p_m$ (in RR) is negative, then subtract one more ULP from $Q_m$. The Sticky bit is zero if $p_m = 0$ and one otherwise.

For the purpose of division, DR is a register of the same length as the operands. RR is a register three bits longer than the operands. DALU is one bit wider than the operands. POS and NEG are shift registers four bits longer than the operands.

Square Root
The restoring square root algorithm produces one result bit per step. The accumulated partial result after any step is the truncation of the infinitely precise answer, so the bits may be collected in a shift register.

The algorithm consists of "completing the square." Two bits of the operand are brought into the calculation during each cycle. Imagine that before each cycle the remainder and partial result are aligned so that

$(ar)^2 \le \text{operand} = (ar + b)^2 + c < (ar + r)^2$    (12)

where

$ar$ = the truncated result already found
$r$ = radix = 2
$a$, $b$ are integers
$c$ is a real number

and we seek $b$ in $0 \le b \le r-1$ to minimize $c \ge 0$. The current remainder is $(ar + b)^2 + c - (ar)^2 = 2arb + b^2 + c$. $b$ is either 0 or 1. To find the next result bit, assume $b = 1$ and subtract $4a + 1$ from the current remainder. The next result bit is one if this difference is $\ge 0$, and zero otherwise.

The position of the binary point within the operand imposes only one restriction. Pairs of operand bits brought into the calculation must lie on the same side of the binary point. Thus if the exponent's value is even and the significand's value is between 1 and 2, only one bit will be used during the first iteration. The significand is shifted left by one bit if the exponent is odd. This may raise the significand's value in the first iteration to between 2 and 4, so that two bits are used.

The hardware previously described for division and remainder is extended in three ways for square root. Shift register SQR holds the operand until it is introduced into the computation. RR becomes a two bit at a time left shift register. (The remainder fixup step already requires it to be a two bit at a time right shift register.) DR is changed from a register to a one bit at a time left shift register in order to hold the developing square root result.

As used in square root, RR and DALU are three bits wider than the operand. DR is one bit wider than the operand because the point at which result bits are inserted into DR is two bits left of the least significant end of RR and DALU. SQR is one bit narrower than the operand because the first two bits of it load directly into RR during the initialization step.
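The restoring square root recurrence described in the Square Root section (bring in two operand bits, trial-subtract $4a + 1$, keep the difference only if it is non-negative) can be sketched on integers in Python. This is a minimal illustration under simplifying assumptions: the function name `restoring_sqrt` is hypothetical, the operand is treated as a plain integer rather than an aligned significand, and the register-level details (RR, DR, SQR widths and shifts) are not modeled.

```python
def restoring_sqrt(n, bit_pairs):
    """Integer square root of n, where n < 4**bit_pairs.

    Produces one result bit per step and returns (root, remainder) with
    root*root + remainder == n.
    """
    a = 0      # truncated result found so far
    rem = 0    # current remainder
    for i in reversed(range(bit_pairs)):
        # Shift the remainder left by two and bring in two operand bits.
        rem = (rem << 2) | ((n >> (2 * i)) & 3)
        # Assume the next result bit is one and try subtracting 4a + 1.
        trial = rem - (4 * a + 1)
        if trial >= 0:
            rem, a = trial, 2 * a + 1    # keep the difference, bit is one
        else:
            a = 2 * a                    # keep the old remainder, bit is zero
    return a, rem
```

For instance, `restoring_sqrt(10, 2)` returns `(3, 1)`. A nonzero final remainder means the result is inexact, which is the condition the Sticky bit records.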

A Note on Software Square Root
W. Kahan has shown that software square root algorithms can find the correctly rounded result using intermediate quantities no wider than the precision of the operand [13]. The calculation is simpler if the machine can chop quotients and round sums. Software methods can be expected to take between six and fifteen divide times, depending on the size of the processor. The larger the processor, the greater the ratio. If hardware square root takes between one and two divide times, it will be about ten times faster than software. The choice of implementation depends on the importance of the square root operation and its incremental cost in the total hardware design.

Square Root Step by Step
The square root operation proceeds in five steps. Refer to Figure 2.

Step 1
Load the operand into DR. The operand should be normalized in order to avoid wasted cycles at the beginning of the iteration.

Step 2
Set QFF to one if the operand's unbiased exponent is even. Set QFF to two if the exponent is odd. In the latter case, the operand will be shifted left by one bit during the next step.

Step 3
Move the operand from DR through the MUX into RR and the square root register SQR. 65 bits (not including the sign which is known to be positive) come out of the MUX. The two high order bits, which are conceptually to the left of the binary point, go into the least significant bits of RR. Clear the remaining bits of RR. The 63 bits which are conceptually to the right of the binary point go into SQR. Clear DR to prepare for shifting in the result bits.

Step 4
Repeat 65 times: Subtract DR plus one from RR. If the difference is non-negative, then shift it left by two bits and store it in RR. Shift DR left by one bit and carry in a logic one.⁵ If the difference is negative, then shift the old contents of RR left by two bits and shift DR left by one bit with a carry-in of zero. In either case, shift SQR left by two bits and fill in the rightmost two bits of RR with the bits shifted out of SQR.

Step 5
Move the 65-bit result from SQR to the normalization and rounding logic by clearing RR and adding in the DALU. The Sticky bit is the 66th bit of the result. It is formed by the logical OR of the DALU carry-out and the bits of RR during the last iteration of Step 4.⁶ The Sticky bit is latched at the end of Step 4 so that information is not lost when RR is cleared.

5 The carry-out from DALU is tied directly to DR's left shift input.

6 If the last iteration produces a one, then the square root has an infinite number of nonzero bits and the Sticky bit should be a one. The ALU's carry-out is a one in this case. If the last iteration produces a zero, then the Sticky bit is a one if the previous remainder was nonzero.

Conclusions
Radix four division offers us the most cost-effective improvement (in the same technology) to radix two restoring division. Radix four uses the same hardware structure in the partial remainder loop except for a multiplexer to produce a second multiple of the divisor. Since the ALU delay dominates the loop, radix four has the same step time as radix two. Quotient digit selection is the limiting task, so we reduce the width of the guess ALU to eight bits in order to speed that path. Our ideas benefit a programmable logic implementation of the look-up table. Different choices could be better for a RAM implementation, especially a two level one. The cost of resolving the redundant quotient representation is low because registers and an ALU elsewhere in the accelerator can be shared for this purpose. Hardware square root is an inexpensive extension to our division scheme. The extra hardware is a shift register to hold the operand and a shift register to hold the result.

Acknowledgement
W. Kahan has offered encouragement and valuable suggestions throughout the course of this project.

References

[1] IEEE Computer Society Microprocessor Standards Committee Task P754, "A Proposed Standard for Binary Floating Point Arithmetic, Draft 8.0," Computer 14, No. 3, March, 1981, pp. 52-63.
[2] J. Coonen, "An Implementation Guide to a Proposed Standard for Floating Point Arithmetic," Computer 13, No. 1, January, 1980.
[3] G. Taylor and D. Patterson, "VAX Hardware for the Proposed IEEE Floating Point Standard," Fifth IEEE Symposium on Computer Arithmetic, May, 1981.
[4] D. Atkins, "The Theory and Implementation of SRT Division," Report 230, Dept. of Computer Science, University of Illinois, Urbana, June, 1967.
[5] D. Atkins, "Higher-Radix Division Using Estimates of the Divisor and Partial Remainders," IEEE Transactions on Computers 17, No. 10, October, 1968, pp. 925-934.
[6] D. Atkins, "A Study of Methods for Selection of Quotient Digits during Digital Division," Ph.D. dissertation, Report 397, Dept. of Computer Science, University of Illinois, Urbana, June, 1970.
[7] J. Robertson, "Methods of Selection of Quotient Digits during Digital Division," File 663, Dept. of Computer Science, University of Illinois, Urbana, June, 1965.
[8] D. Atkins, "Design of the Arithmetic Units of Illiac III: Use of Redundancy and Higher Radix Methods," Report 333, Dept. of Computer Science, University of Illinois, Urbana, May, 1969.
[9] K. Tan, "The Theory and Implementations of High-Radix Division," Fourth IEEE Symposium on Computer Arithmetic, October, 1978, pp. 154-163.
[10] U. Kalaycioglu, "Analysis and Synthesis of Generalized Radix Additive Normalization Division Techniques," Ph.D. dissertation, SEL Report 88, Dept. of Electrical and Computer Engineering, University of Michigan, Ann Arbor, May, 1975.
[11] J. Baron, "Implementation Study of Generalized Radix, Non-Restoring Division Techniques," SEL Report 102, Dept. of Electrical and Computer Engineering, University of Michigan, Ann Arbor, September, 1977.
[12] D. Atkins and U. Kalaycioglu, "Concurrency in Generalized Radix Non-Restoring Division," Twelfth Allerton Conference on Circuit and Switching Theory, University of Illinois, Urbana, October, 1974, pp. 628-640.
[13] W. Kahan, "Software Square Root for the Proposed IEEE Floating Point Standard," Computer Science Division, University of California, Berkeley, August, 1980, submitted to IEEE Transactions on Mathematical Software.

Table 3. Quotient Selection Logic: Asymmetric 8-bit Next Remainder Prediction ALU

Columns: first 5-7 bits of estimated next remainder (2's complement; decimal and binary); first 4 bits of divisor (positive; decimal and binary); ratio of shifted full remainder $p_{i+1}$ to full divisor $d$ (minimum and maximum); next quotient digit $q_{i+2}$.

-15.0

'M1
-1

-4.15

-4.0
-8.5
-3.0
-2.5
-2.0
-1.3
-1.0
-0.5 '
0.0
0.5
1.0

-0
-0

+0
+1

+1

.J 1
o

+2
+2

+2
+2

i.e

-2.4261
-I. 1US 4-

.,.. - '

2.0
2.5
3.0
3.5
4.0
4.5
5.0

134

1111.1
0000.0

0000.1
0001.0

2.0

-2
-2

n n.o

1.0

0.0
O.CS
1.0
1.15

+2
+2

-S.OOOO
-2
,~.;l
-2.0667
-2.SSSS
-I. eo'!?
-2
-1.0000
-2.0000
2
-1.6667
-1
-1.192B
l
-1
-0.8846
_ _-1.3583
_ _-..i.--,_
_

0011.0

-0.5

+1

1.500

5.0

-9.15
-5.00

-1
-1

decimal
1.1500

2.00
2.25
2.15

C.O
0.5

+0
+1

bmerv
1110.1
1111.0
1111.1
0000.0
OOOtll0

0000.11
0001.0
0001.1
0010.00
0010.01
0010.1

-O.l~

-{)

2,7222
5.1667

tlrst 4 bits
of divisor
(positive)

1.500
1.1500
1.500
1.500
1.1500
1.GOO
1.500
1.300
1.500

~.O

+1

-2.SriOO

~ ~a25

-1.0
-0.5
0.0
0.50
O.7f5
1.0
1.5

--::~

1.000

1.3333
1.777'8

decimal
-I.e;

'A

1.001
1.001
1.001
1.001
1.001
1.001
1.001
1.001
1.001
1.001
1.001
1.001
1.001
1.001
1.001
1.001

1.000

2.0625

tinlt t5~7 bits of


lftatJmated next
remainder (2'B camp)

uJt.nmag)

+1
+2
;2
+2

1.000

"Ii+;

t~]g1t

1011.0
1011.1
,1100.0

irco.i

1101.00
1101.01
1101.1
1110J;
1110.1
1111.0
1111.1
0000.0
0000.1
0001.0
0001.1
0010.0
0010.1
0011.0
0011.1
0100.0
0100.1
1010.1
1011.0
1011.1
1100.0
1100.1
1101.0

ura.!

1110.0
1110.1

1111.0

1111.1
0000.0
OOl~:).l

0001.0

0001.1
OOUJ.O
0010.1
0011.0
0011.1
0100.0
OlOO.1
OIOlU

1.62::;
i.625
1.625
1.625

1.625
1.625
1.625
1.625
1.625
1.6215
1.825
1.625

1.625
1.625
1.625
1.625
1.625
1.625
1.625
1.625

1.750
1.7r50

1.'150
l.7tSO
1.7~O

1.7150
1."150
1.750
1.750

1.7150
1.750
1.750
1.7150
1.700
1.'150
1.7150
1.7150
1.750
1."150
1.750
1.7GO
1.7150
1.B715
1.BnS
1,875
1.8715
1.87f5
1.875

1.870
1.875
1.875
1.875

1.875
1.8715
1.8'15
1J..87fi
1.8715
1.875
1.875
1,875
1.875

LanS

l.e't~"'i
~

,e?'ti

i~:L

next quotient

ratio of shifted
full remainder Pif.,
to full dlvieor d

digit
(sign mnr.)

binary
R.lOO
1.100
:Ll00
1.100
LIOO
LIOO
L100
1.100
1.100
1.100
1.100
1.100
1.100
1.100

m1n:imum

maximum

-0.0769

-1.0000
-0.6667
-0.3335
0.57150
O.C5417
0.7083
1.0417
1.5750
1.3417
1.7063
2.0417
2.5750
2.7063
3.0417

-1

1.101

-2.5557
-2.2500
-1.Q645
-1.6766

-,5.0769
-2.7692

-s

1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.101
1.110

1.110
1.110
1.110
1.110
1.110

LllO

1.110
1.110
1.110
1.UD
1.110
1.110
1.110
1.110
1.110
1.110
1.110
1.110

1.110
1.110
1.110
1.111
1.111
1.111

1.111
1.111
1.111
1.111
1.111
1.U1
1.111
1.111
1.111

t.n i

I.Ul
1.111
1.111
1~1l1.

1.111
1,111

1.111
1.111
1.111

-0.2692
0.0417
0.0000
0.3077

0.4615
0.6154-

0.9231
1.2308
1.3846
1.t5SB5
1.8462
2.1fi38

2.4615

-i.ssas

-0

...,0

+0
+0
.;:1

+1
+1
.ff"Z

+1?+~~
'H~

0s-.;?

-,2

-2

-2,4615
-2.1538
-1.8462
-1.53B5

-2
-2
--I
-1

-'0.Q231
-0.6154

-0

0.2857

0.3462
0.6538

+0
+0

0.5714-

C.G6leS

-O.2150n
0.0385

c.oooo
0.8571
1.1429
1.4286
1.7143
2.0000
2.2857
2.5714

-2.8533
-2.3667
-2.1000
-1.8333
-1.5667
-1.4395
-1.5000
-1.038S
-0.7067
-0.5000
-0.2555
0.05t57
' 0.0000
0.2667
0.5359
0.8000
1.0667
1.3555
1.6000
I,fJ667
2.1333
2.4000

-2.4688
-2.21Sa
-1.9668
-1.7168
-1.4688
-1.218a
-0.98aa

-0.7188
-0.4S8a
-0.21Ba
0.0533
0.0000
0.2500
O.CSCOO
0.7500
1.0000

1.2500
1.C5000
1.71500
2.0000
2.21500
2.5000

-1

-o.sorr
1.2692
1.57'en
l.SS4U

2. 192t1
2.5000
2.80'1'7
3.1164

-9.1429
-2.8f571
-IUS? 14

-2.28157
-2.0000
-1.7143
-I.e?l.
-1.4286
-1.1429
-0.8571
-0.0714
-0.28157
0.8214a.eO?1
0.8929
1.1786
1.4849
1.71500
2.0907
2.3214
2.6071
2.8929

-2.9333
-2.6667
-2.4000

-2.isBS

-1.8667
-1.6000
-1.3939
-1.0667
-O.BOOO
-0.5353
-0.2887
'0.5000
0.5887
0.8335
1.1000
1.5887
1.8339
1.'9000
2.1667
2.4993
2.'1000
2.9667

+1

-~1.2306

-1.1071
-0.8214
-0.5557

---1

I
I

I
1

II

,-,0

I
I

+1

+1
+1

+2
+2
+2
+2
+2

-2
c,..2
-2
-2

-2
-2
-'rl
-1
-1
-1
-0

-0

+0
+0

+1

+1
+1
' +2
+2

+2

+2
+2

-2
-2
-2
-2

-2
-1

--1
-1
.... 1
-0
-0

+0
+0
+1
+1
+1
+1
+2

+2
+2
+2

+2
