Divider Implementation - 2
Divider Implementation
ALGORITHM
The division of two unsigned integers 𝐴⁄𝐵 (where 𝐴 is the dividend and 𝐵 the divisor) results in a quotient 𝑄 and
a remainder 𝑅. These quantities are related by 𝐴 = 𝐵 × 𝑄 + 𝑅, with 0 ≤ 𝑅 < 𝐵.
For the implementation, we follow the hand-division method. We take the bits of A one at a time, most significant bit first,
shift each into a partial remainder R, and compare R with the divisor B. If R is greater than or equal to B, we subtract B
from it. Each iteration yields one bit of Q. Fig. 1 shows the algorithm as well as an example: A = 10001100; B = 1001
ALGORITHM
  R = 0
  for i = n-1 downto 0
    left shift R (input = ai)
    if R ≥ B
      qi = 1, R ← R-B
    else
      qi = 0
    end
  end

EXAMPLE
          00001111  Q
B 1001    10001100  A
           1001
           10001
            1001
            10000
             1001
              1110
              1001
               101  R
Figure 1. Division Algorithm
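The loop in Fig. 1 can be sketched in software as follows. This is a minimal illustration, not part of the hardware description: the function name `restoring_divide` and the explicit bit-width argument `n` are assumptions introduced here, and the zero-divisor check is a software convenience that the algorithm itself does not include.

```python
def restoring_divide(a: int, b: int, n: int) -> tuple[int, int]:
    """Divide the n-bit unsigned integer a by b with the shift-and-subtract loop."""
    if b == 0:
        raise ZeroDivisionError("divisor must be nonzero")  # assumption: reject B = 0
    q = 0
    r = 0
    for i in range(n - 1, -1, -1):      # for i = n-1 downto 0
        a_i = (a >> i) & 1
        r = (r << 1) | a_i              # left shift R (input = a_i)
        if r >= b:
            q |= 1 << i                 # q_i = 1
            r -= b                      # R <- R - B
        # else: q_i = 0 and R is kept as is
    return q, r


q, r = restoring_divide(0b10001100, 0b1001, 8)
print(f"{q:08b} {r:b}")  # 00001111 101
```

The call reproduces the example of Fig. 1: 140 divided by 9 gives a quotient of 15 and a remainder of 5.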
For hardware implementation, we consider restoring dividers (i.e., those that keep the actual residue value at every step).
value of 𝐴.
The 2’s complement of $A$ is given by $P = \mathrm{not}(A) + 1$, where $P = p_{n-1} p_{n-2} \ldots p_0$.
If $P$ and $A$ are thought of as $n$-bit unsigned numbers, i.e., $A = \sum_{i=0}^{n-1} a_i 2^i$ and $P = \sum_{i=0}^{n-1} p_i 2^i$, then $P = 2^n - A$.
Fig. 3 shows the operation $R - B$ computed as $R + K$, where $K = \mathrm{not}(B) + 1$. Recall that we let the 1 be supplied by $c_{in}$. Note that if $B = 0$, then $K = 2^{n+1}$ (here $K$ is represented by the second operand together with $c_{in} = 1$).
1 Daniel Llamocca
ELECTRICAL AND COMPUTER ENGINEERING DEPARTMENT, OAKLAND UNIVERSITY
Digital Library: Arithmetic Cores RECRLAB@OU
  R:  0 r_{n-1} r_{n-2} ... r_0   -          R:  0 r_{n-1} r_{n-2} ... r_0   +   (cin = 1)
  B:  0 b_{n-1} b_{n-2} ... b_0              K:  1 k_{n-1} k_{n-2} ... k_0
Figure 3. Operation 𝑅 − 𝐵 ≡ 𝑅 + 𝐾 = 𝑅 + 𝑛𝑜𝑡(𝐵) + 1
Condition           K              k_n k_{n-1} k_{n-2} ... k_0    k_n
B ≠ 0 (or B > 0)    2^n            100...0                        k_n = 1
                    2^n + 1        100...1
                    ...            ...
                    2^{n+1} - 1    111...1
B = 0               2^{n+1}        1000...0                       k_n = 0
$R - B \equiv R + K = \sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n} k_i 2^i = \sum_{i=0}^{n-1} r_i 2^i + k_n 2^n + \sum_{i=0}^{n-1} k_i 2^i$
$R + K = R + 2^{n+1} - B = \sum_{i=0}^{n-1} r_i 2^i + 2^{n+1} - \sum_{i=0}^{n-1} b_i 2^i$
$R - B < 0$:
Since $R \ge 0 \rightarrow B > 0 \rightarrow k_n = 1$
$R + 2^{n+1} - B = \sum_{i=0}^{n-1} r_i 2^i + 2^{n+1} - \sum_{i=0}^{n-1} b_i 2^i < 2^{n+1}$
$R + K = \sum_{i=0}^{n-1} r_i 2^i + k_n 2^n + \sum_{i=0}^{n-1} k_i 2^i < 2^{n+1} \rightarrow \sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n-1} k_i 2^i < 2^n$
o The $(n+1)$-bit sum (considering the operation as unsigned) of $R$ and $K$ is less than $2^{n+1}$. Then, there is no overflow
in the $(n+1)$-bit unsigned sum. Thus $c_{n+1} = 0$.
o The $n$-bit sum (considering the operation as unsigned) of $R$ and $k_{n-1} k_{n-2} \ldots k_0$ is less than $2^n$. Thus, there is no
overflow of the $n$-bit unsigned sum. Thus $c_n = 0$.
$R - B \ge 0$:
$R + 2^{n+1} - B = \sum_{i=0}^{n-1} r_i 2^i + 2^{n+1} - \sum_{i=0}^{n-1} b_i 2^i \ge 2^{n+1}$
$R + K = \sum_{i=0}^{n-1} r_i 2^i + k_n 2^n + \sum_{i=0}^{n-1} k_i 2^i \ge 2^{n+1} \rightarrow \sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n-1} k_i 2^i \ge 2^{n+1} - k_n 2^n$
o The $(n+1)$-bit sum (considering the operation as unsigned) of $R$ and $K$ is greater than or equal to $2^{n+1}$. Then, there is
overflow of the $(n+1)$-bit unsigned sum. Thus $c_{n+1} = 1$.
o For the $n$-bit sum of $R$ and $k_{n-1} k_{n-2} \ldots k_0$, we have two cases:
  $B > 0 \rightarrow k_n = 1$. Then $\sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n-1} k_i 2^i \ge 2^{n+1} - 2^n \rightarrow \sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n-1} k_i 2^i \ge 2^n$
  $B = 0 \rightarrow k_n = 0$. Then $\sum_{i=0}^{n-1} r_i 2^i + \sum_{i=0}^{n-1} k_i 2^i \ge 2^{n+1}$
In both cases, the $n$-bit sum (considering the operands as unsigned) of $R$ and $k_{n-1} k_{n-2} \ldots k_0$ is greater than or equal to $2^n$.
So, there is overflow of the $n$-bit unsigned sum. Thus $c_n = 1$ when $R \ge B$.
2’s complement operation $R - B$ with $n + 1$ bits: there is no overflow of the subtraction, since $c_{n+1} = c_n$ in both cases.
For $R - B \ge 0$: the result $T = R - B$ is non-negative, thus $t_n = 0$. Therefore $t_{n-1} t_{n-2} \ldots t_0$ contains $R - B$ in unsigned
representation.
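In conclusion: the carry out of the sum $R + \mathrm{not}(B) + 1$ is 1 when $R \ge B$, in which case the $n$ least significant bits of the sum hold $R - B$, and it is 0 when $R < B$.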
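The carry behaviour derived above can be checked exhaustively for a small width. The sketch below is an illustration under stated assumptions: the helper name `subtract_via_complement` is hypothetical, and the operands are zero-extended to n + 1 bits as in Fig. 3, with the trailing + 1 standing in for cin = 1.

```python
def subtract_via_complement(r: int, b: int, n: int) -> tuple[int, int]:
    """Compute r - b as r + not(b) + 1 on n+1 bits.

    Returns (carry_out, low_n_bits), where carry_out is c_{n+1}.
    """
    width = n + 1
    mask = (1 << width) - 1
    not_b = ~b & mask               # one's complement of the zero-extended b
    total = r + not_b + 1           # the +1 plays the role of cin = 1
    carry_out = total >> width      # c_{n+1}: overflow of the (n+1)-bit unsigned sum
    low_bits = total & ((1 << n) - 1)
    return carry_out, low_bits


n = 4
for r in range(1 << n):
    for b in range(1 << n):
        cout, t = subtract_via_complement(r, b, n)
        assert cout == (1 if r >= b else 0)
        if r >= b:
            assert t == r - b       # the low n bits hold R - B when R >= B
print("carry-out equals 1 exactly when R >= B for all 4-bit operands")
```

Note that the case B = 0 also yields a carry out of 1, which agrees with the derivation.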
𝐴, 𝐵: positive integers in unsigned representation. 𝐴 = 𝑎𝑁−1 𝑎𝑁−2 … 𝑎0 with 𝑁 bits, and 𝐵 = 𝑏𝑀−1 𝑏𝑀−2 … 𝑏0 with 𝑀 bits, with
the condition that 𝑁 ≥ 𝑀. 𝑄 = 𝑞𝑢𝑜𝑡𝑖𝑒𝑛𝑡, 𝑅 = 𝑟𝑒𝑠𝑖𝑑𝑢𝑒. 𝐴 = 𝐵 × 𝑄 + 𝑅.
In this parallel implementation, the result of every stage is called the remainder $R_i$.
Fig. 4 depicts the parallel algorithm with $N$ stages. For each stage $i$, $i = 0, \ldots, N-1$, we have:
$R_i$: output of stage $i$. Remainder after every stage.
$Y_i$: input of stage $i$. It holds the minuend.
For the next stage, we append the next bit of $A$ to $R_i$. This becomes $Y_{i+1}$ (the minuend):
$Y_{i+1} = R_i \,\&\, a_{N-2-i}, \; i = 0, \ldots, N-2$, with $Y_0 = a_{N-1}$
At each stage $i$, the subtraction $Y_i - B$ is performed. If $Y_i \ge B$, then $R_i = Y_i - B$. If $Y_i < B$, then $R_i = Y_i$.

Stage   Y_i                              Computation of R_i                                                           # of R_i bits
0       Y_0 = a_{N-1}                    R_0 = Y_0 - B if Y_0 ≥ B;  R_0 = Y_0 if Y_0 < B                              1
1       Y_1 = R_0 & a_{N-2}              R_1 = Y_1 - B if Y_1 ≥ B;  R_1 = Y_1 if Y_1 < B                              2
2       Y_2 = R_1 & a_{N-3}              R_2 = Y_2 - B if Y_2 ≥ B;  R_2 = Y_2 if Y_2 < B                              3
...     ...                              ...                                                                          ...
M-1     Y_{M-1} = R_{M-2} & a_{N-M}      R_{M-1} = Y_{M-1} - B if Y_{M-1} ≥ B;  R_{M-1} = Y_{M-1} if Y_{M-1} < B      M

Since $B$ has $M$ bits, the operation $Y_i - B$ requires $M$ bits for both operands. To maintain consistency, we let $Y_i$ be represented with $M$ bits.
$R_i$: output of each stage. For the first $M$ stages, $R_i$ requires $i + 1$ bits. However, for consistency and clarity’s sake, since $R_i$ might be the result of a subtraction, we let $R_i$ use $M$ bits.
Stages $0$ to $M-2$:
$R_i$ is always transferred on to the next stage. Note that we transfer the $M-1$ least significant bits of $R_i$. There is no loss of accuracy here since $R_i$ requires at most $M-1$ bits at stage $M-2$. We need $R_i$ with $M-1$ bits since $Y_{i+1}$ uses $M$ bits.
Stages $M-1$ to $N-1$:
Starting from stage $M-1$, $R_i$ requires $M$ bits. We also know that the remainder requires at most $M$ bits (maximum value is $2^M - 2$). So, starting from stage $M-1$ we need to transfer $M$ bits.
As $Y_{i+1}$ now requires $M+1$ bits, we need $M+1$ units starting from stage $M$.

[Figure 4: cascade of stages 0 to N-1; stage i takes Y_i and produces R_i. Stages 0 to M-1 are M bits wide; stages M to N-1 are M+1 bits wide.]
Figure 4. Parallel implementation algorithm
To implement the operation $Y_i - B$ we use a subtractor. When $Y_i \ge B \rightarrow cout_i = 1$, and when $Y_i < B \rightarrow cout_i = 0$. This $cout_i$
becomes a bit of the quotient: $Q_i = cout_{N-1-i}$. The quotient $Q$ requires at most $N$ bits.
Also, the final remainder is the result of the last stage. The maximum theoretical value of the remainder is $2^M - 2$, thus the
remainder $R$ requires $M$ bits. $R = R_{N-1}$.
Also, note that we should avoid a division by 0. If $B = 0$, then, in our circuit, every stage produces $cout_i = 1$, so $Q = 2^N - 1$ and $R = a_{M-1} a_{M-2} \ldots a_0$.
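The stage-by-stage behaviour described above can be modelled in a few lines. This is a behavioural sketch under assumptions, not the circuit itself: the function name `array_divide` is hypothetical, each stage is modelled as an (M+1)-bit addition of Y_i, not(B) and a carry-in of 1, and only M bits of R_i are passed to the next stage.

```python
def array_divide(a: int, b: int, n: int, m: int) -> tuple[int, int]:
    """Model N cascaded stages; stage i computes Y_i - B and emits cout_i."""
    wide_mask = (1 << (m + 1)) - 1
    r = 0
    q = 0
    for i in range(n):                          # stage i
        a_bit = (a >> (n - 1 - i)) & 1
        y = (r << 1) | a_bit                    # Y_i: previous remainder & next bit of A
        total = y + (~b & wide_mask) + 1        # Y_i + not(B) + 1 on M+1 bits
        cout = total >> (m + 1)                 # 1 when Y_i >= B, 0 otherwise
        diff = total & wide_mask                # Y_i - B when cout = 1
        r = (diff if cout else y) & ((1 << m) - 1)   # only M bits are transferred
        q |= cout << (n - 1 - i)                # Q_{N-1-i} = cout_i
    return q, r


print(array_divide(0b10001100, 0b1001, 8, 4))   # (15, 5)
print(array_divide(0b10001100, 0, 8, 4))        # (255, 12)
```

The second call shows the division-by-zero behaviour stated in the text: with B = 0 the quotient is all ones and the remainder equals the M least significant bits of A (here 1100).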
Fig. 5 shows the hardware of this array divider for N=8, M=4. Note that the first M=4 stages only require 4 units, while the next
stages require 5 units. This is a fully combinatorial implementation.
Each level computes 𝑅𝑖 . It first computes 𝑌𝑖 − 𝐵. When 𝑌𝑖 ≥ 𝐵 → 𝑐𝑜𝑢𝑡𝑖 = 1, and when 𝑌𝑖 < 𝐵 → 𝑐𝑜𝑢𝑡𝑖 = 0. This 𝑐𝑜𝑢𝑡𝑖 is used
to determine whether 𝑅𝑖 is 𝑌𝑖 − 𝐵 or 𝑌𝑖 .
Each Processing Unit (PU) processes one bit of 𝑌𝑖 − 𝐵 and lets the corresponding bit of either 𝑌𝑖 − 𝐵 or 𝑌𝑖 be
transferred on to the next stage.
[Figure 5: array of PUs with one row per quotient bit, q7 down to q0. Inputs b3 b2 b1 b0 and a7 ... a0; the rows for q7 to q4 have 4 PUs and the rows for q3 to q0 have 5 PUs; the carry-in of each row is 1; the last row outputs r3 r2 r1 r0. Each PU contains a full adder (FA) and a 2-to-1 multiplexer. Block symbol: ARRAY DIVIDER with inputs A (N bits) and B (M bits) and outputs Q (N bits) and R (M bits).]
Figure 5. Fully Combinatorial Array Divider architecture for N=8, M=4
Fig. 6 shows the hardware core of the fully pipelined array divider with its inputs, outputs, and parameters.
[Figure 6: ARRAY DIVIDER block with parameters N and M; inputs A (N bits), B (M bits), E, resetn and clock; outputs Q (N bits), R (M bits) and v.]
Figure 6. Fully pipelined IP core for the array divider
Fig. 7 shows the internal architecture of this pipelined array divider for N=8, M=4. Note that the first M=4 stages only require
4 units, while the next stages require 5 units. Note that the enable input ‘E’ is only an input to the shift register on the left,
which is used to generate the valid output 𝑣. This way, valid outputs are readily signaled. If E=’1’, the output result is computed
in N cycles (and v=’1’ after N cycles).
[Figure 7: pipelined version of the array of Fig. 5. Inputs E, b3 b2 b1 b0 and a7 ... a0; outputs v, q7 ... q0 and r3 r2 r1 r0.]
Figure 7. Fully Pipelined Array Divider architecture for N=8, M=4
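The role of the shift register that turns E into v can be illustrated with a small cycle-level model. This sketch rests on assumptions: the function name `valid_signal` is hypothetical, the register is taken to have N stages cleared at reset, and v is read from the last stage before each clock edge, so an enable applied in cycle t appears on v in cycle t + N.

```python
from collections import deque


def valid_signal(enable_stream: list[int], n: int) -> list[int]:
    """Delay each enable bit by n clock cycles using an n-stage shift register."""
    stages = deque([0] * n, maxlen=n)    # all stages cleared, as after reset
    v_stream = []
    for e in enable_stream:
        v_stream.append(stages[-1])      # v is the current value of the last stage
        stages.appendleft(e)             # clock edge: shift E in, drop the oldest bit
    return v_stream


print(valid_signal([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 8))
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
```

A single-cycle pulse on E therefore produces a single-cycle pulse on v eight cycles later for N = 8, which matches the latency of N cycles stated in the text.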
Fig. 8 shows the iterative hardware architecture as well as the state machine. Here, 𝑅𝑖 is always held in register R. The subtractor
computes 𝑌𝑖 − 𝐵. This requires 𝑀 + 1 bits in the worst case.
If 𝑌𝑖 ≥ 𝐵, then 𝑅𝑖 = 𝑌𝑖 − 𝐵. Here 𝑌𝑖 is the minuend, and 𝑌𝑖 − 𝐵 is loaded into register R. Note that only M bits are needed.
If 𝑌𝑖 < 𝐵, then 𝑅𝑖 = 𝑌𝑖 . Here only 𝑌𝑖 is loaded into register R. This is done by simply shifting 𝑎𝑁−1 (the current most significant bit of register A) into register R.
Note that R requires M bits since it holds the remainder at every stage. Also, since we always shift 𝑐𝑜𝑢𝑡𝑖 into register A, the
quotient Q is held in A after the last iteration.
[Figure 8: datapath and FSM. Inputs E, DA (N bits), DB (M bits) and resetn; left-shift register A (N bits), register B (M bits), left-shift register R (M bits) that shifts in aN-1, an (M+1)-bit subtractor with minuend Y = RM-1RM-2...R0aN-1 producing cout, and a counter C tested against N-1. FSM states include S1 and S3, with control signals sclrR, ER, LR, LAB and EA. Outputs: done, Q (N bits), R (M bits).]
Figure 8. Iterative Divider
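The register-level behaviour of the iterative divider can be sketched as a loop over N iterations. This is a behavioural model under assumptions, not a description of the FSM in Fig. 8: the function name `iterative_divide` and the variable names `reg_a` and `reg_r` are hypothetical, one loop pass stands for one iteration regardless of how many clock cycles the FSM spends on it, and B is assumed to be nonzero.

```python
def iterative_divide(da: int, db: int, n: int, m: int) -> tuple[int, int]:
    """Model registers A (N bits) and R (M bits) over N iterations."""
    a_mask = (1 << n) - 1
    r_mask = (1 << m) - 1
    reg_a = da & a_mask                  # register A is loaded with the dividend
    reg_r = 0                            # register R is cleared
    for _ in range(n):
        a_msb = (reg_a >> (n - 1)) & 1   # a_{N-1}: current MSB of register A
        y = (reg_r << 1) | a_msb         # minuend Y_i on M+1 bits
        cout = 1 if y >= db else 0       # carry out of the subtractor Y_i - B
        # Load Y_i - B when cout = 1; otherwise just shift a_{N-1} into R.
        reg_r = (y - db if cout else y) & r_mask
        # Register A shifts left and takes cout as its new LSB.
        reg_a = ((reg_a << 1) | cout) & a_mask
    return reg_a, reg_r                  # A now holds Q, R holds the remainder


print(iterative_divide(0b10001100, 0b1001, 8, 4))  # (15, 5)
```

After the last iteration all dividend bits have been shifted out of register A and replaced by the cout bits, which is why the quotient ends up in A.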