Arithmetic Unit 5
Text Books:
Carl Hamacher, Zvonko Vranesic, Safwat Zaky:
Computer Organization, 5th Edition, Tata
McGraw Hill, 2002.
Reference Books:
Carl Hamacher, Zvonko Vranesic, Safwat Zaky,
Naraig Manjikian : Computer Organization and
Embedded Systems, 6th Edition, Tata McGraw
Hill, 2012.
William Stallings: Computer Organization &
Architecture, 9th Edition, Pearson, 2015.
Objective
Number and character representations
Addition and subtraction of binary numbers
Adder and subtractor circuits
High-speed adders based on carry-lookahead logic
circuits
The Booth algorithm for multiplication of signed
numbers
High-speed multipliers based on carry-save addition
Logic circuits for division
Arithmetic operations on floating-point numbers
conforming to the IEEE standard
Arithmetic Operations and Characters
Signed Integer
3 major representations:
Sign and magnitude
One’s complement
Two’s complement
Assumptions:
4-bit machine word
16 different values can be represented
Roughly half are positive, half are
negative
Sign and Magnitude Representation
One’s Complement Representation
[Figure: number circle showing the 4-bit patterns 0000 to 1111 and the signed values they represent, e.g. 0 100 = +4.]
b3 b2 b1 b0    Sign and magnitude    1's complement    2's complement
0  1  1  1            +7                  +7                +7
0  1  1  0            +6                  +6                +6
0  1  0  1            +5                  +5                +5
0  1  0  0            +4                  +4                +4
0  0  1  1            +3                  +3                +3
0  0  1  0            +2                  +2                +2
0  0  0  1            +1                  +1                +1
0  0  0  0            +0                  +0                +0
1  0  0  0            -0                  -7                -8
1  0  0  1            -1                  -6                -7
1  0  1  0            -2                  -5                -6
1  0  1  1            -3                  -4                -5
1  1  0  0            -4                  -3                -4
1  1  0  1            -5                  -2                -3
1  1  1  0            -6                  -1                -2
1  1  1  1            -7                  -0                -1
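The table above can be reproduced with a short sketch. The function name `decode` and the dictionary keys are illustrative choices, not part of the slides. Python integers have no negative zero, so the patterns shown as -0 in the table decode to 0 here.

```python
def decode(bits: str) -> dict:
    """Interpret a bit string under the three signed representations."""
    n = len(bits)
    raw = int(bits, 2)                         # value as an unsigned integer
    negative = bits[0] == "1"                  # the MSB is the sign bit
    magnitude = raw & ((1 << (n - 1)) - 1)     # low n-1 bits

    # Sign and magnitude: the sign bit only selects the sign.
    sign_mag = -magnitude if negative else magnitude
    # 1's complement: a negative pattern is the bitwise inverse of its magnitude.
    ones = -((~raw) & ((1 << n) - 1)) if negative else raw
    # 2's complement: the MSB carries the weight -2^(n-1).
    twos = raw - (1 << n) if negative else raw
    return {"sign_magnitude": sign_mag,
            "ones_complement": ones,
            "twos_complement": twos}


if __name__ == "__main__":
    for pattern in ("0111", "1000", "1011", "1111"):
        print(pattern, decode(pattern))
```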
2’s-Complement Add and Subtract Operations
(a)   0010  (+2)          (b)   0100  (+4)
    + 0011  (+3)              + 1010  (-6)
      0101  (+5)                1110  (-2)
(c)   1011  (-5)          (d)   0111  (+7)
    + 1110  (-2)              + 1101  (-3)
      1001  (-7)                0100  (+4)
Subtraction is performed by adding the 2’s complement of the subtrahend:
(e)   1101  (-3)                1101
    - 1001  (-7)              + 0111
                                0100  (+4)
(f)   0010  (+2)                0010
    - 0100  (+4)              + 1100
                                1110  (-2)
(g)   0110  (+6)                0110
    - 0011  (+3)              + 1101
                                0011  (+3)
(h)   1001  (-7)                1001
    - 1011  (-5)              + 0101
                                1110  (-2)
(i)   1001  (-7)                1001
    - 0001  (+1)              + 1111
                                1000  (-8)
(j)   0010  (+2)                0010
    - 1101  (-3)              + 0011
                                0101  (+5)
Overflow Conditions
In 4 bits, 5 + 3 gives -8 and -7 - 2 gives +7: the true results do not fit in the word.
Carries (c4 c3 c2 c1)   0111                        1000
                 5      0101                -7      1001
                 3      0011                -2      1110
                -8      1000                 7    1 0111
                Overflow                    Overflow
Carries (c4 c3 c2 c1)   0000                        1111
                 5      0101                -3      1101
                 2      0010                -5      1011
                 7      0111                -8    1 1000
                No overflow                 No overflow
Overflow can occur only when the two operands have the same sign.
Overflow occurs if the sign of the result is different from the sign of the operands.
Equivalently, overflow occurs when the carry-in to the high-order bit does not equal the carry-out.
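The carry-based overflow rule can be checked against the four examples above. This is a minimal sketch; the function name `add_fixed_width` and its return convention are assumptions made for illustration.

```python
def add_fixed_width(x: int, y: int, width: int = 4):
    """Add two 2's-complement bit patterns (given as unsigned ints).

    Returns (result pattern, overflow flag). Overflow is detected by
    comparing the carry into the MSB with the carry out of the MSB.
    """
    mask = (1 << width) - 1
    low_mask = mask >> 1                      # all bits except the MSB
    # Carry into the MSB comes from adding the lower width-1 bits.
    carry_in_msb = ((x & low_mask) + (y & low_mask)) >> (width - 1)
    total = (x & mask) + (y & mask)
    carry_out = total >> width                # carry out of the MSB
    return total & mask, carry_in_msb != carry_out


if __name__ == "__main__":
    for a, b in ((0b0101, 0b0011), (0b1001, 0b1110),
                 (0b0101, 0b0010), (0b1101, 0b1011)):
        result, overflow = add_fixed_width(a, b)
        print(f"{a:04b} + {b:04b} = {result:04b}  overflow={overflow}")
```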
Sign Extension
Task:
Given w-bit signed integer x
Convert it to w+k-bit integer with same value
Rule:
Make k copies of sign bit:
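$X' = x_{w-1}, \ldots, x_{w-1}, x_{w-1}, x_{w-2}, \ldots, x_0$
The first k entries are copies of the MSB $x_{w-1}$; they are followed by the original w bits of X.
[Figure: the w-bit word X and the (k + w)-bit word X', whose upper k bits are copies of the MSB of X.]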
Sign Extension Example
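A small sketch of the rule, using bit strings so the copied sign bits are visible. The helper names `sign_extend` and `twos_value` are illustrative, not taken from the slides.

```python
def sign_extend(bits: str, k: int) -> str:
    """Extend a w-bit 2's-complement string to w + k bits by copying the MSB."""
    return bits[0] * k + bits


def twos_value(bits: str) -> int:
    """Value of a bit string interpreted as a 2's-complement integer."""
    raw = int(bits, 2)
    return raw - (1 << len(bits)) if bits[0] == "1" else raw


if __name__ == "__main__":
    for word in ("1010", "0110"):
        extended = sign_extend(word, 4)
        # The value is unchanged by the extension.
        print(word, twos_value(word), "->", extended, twos_value(extended))
```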
Addition and Subtraction of Signed Numbers
Addition/subtraction of signed Numbers
At the ith stage:
  Input: x_i, y_i, and c_i, the carry-in
  Output: s_i, the sum, and c_{i+1}, the carry-out to the (i+1)st stage

xi   yi   Carry-in ci   |   Sum si   Carry-out ci+1
0    0    0             |   0        0
0    0    1             |   1        0
0    1    0             |   1        0
0    1    1             |   0        1
1    0    0             |   1        0
1    0    1             |   0        1
1    1    0             |   0        1
1    1    1             |   1        1

$s_i = \bar{x}_i \bar{y}_i c_i + \bar{x}_i y_i \bar{c}_i + x_i \bar{y}_i \bar{c}_i + x_i y_i c_i = x_i \oplus y_i \oplus c_i$
$c_{i+1} = y_i c_i + x_i c_i + x_i y_i$
Example:
    X       7        0 1 1 1
  + Y     + 6      + 0 1 1 0
  carries            1 1 0 0     (c3 c2 c1 c0)
    Z      13        1 1 0 1
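The two stage equations translate directly into code. The sketch below also chains the stages so the 7 + 6 example can be replayed; bit lists are LSB first, and the names `full_adder` and `ripple_carry_add` are illustrative.

```python
def full_adder(x: int, y: int, c: int):
    """One adder stage: returns (s_i, c_{i+1}) from x_i, y_i, c_i."""
    s = x ^ y ^ c                          # s_i = x_i XOR y_i XOR c_i
    c_out = (y & c) | (x & c) | (x & y)    # c_{i+1} = y_i c_i + x_i c_i + x_i y_i
    return s, c_out


def ripple_carry_add(x_bits, y_bits, c0: int = 0):
    """Add two equal-length bit lists (LSB first) by chaining full adders."""
    carry = c0
    s_bits = []
    for x, y in zip(x_bits, y_bits):
        s, carry = full_adder(x, y, carry)
        s_bits.append(s)
    return s_bits, carry


if __name__ == "__main__":
    # 7 = 0111 and 6 = 0110, written LSB first.
    print(ripple_carry_add([1, 1, 1, 0], [0, 1, 1, 0]))
```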
Addition logic for a single stage
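[Figure: gate-level circuits for the sum s_i and the carry c_{i+1}, each with inputs x_i, y_i, c_i, and the block symbol of a full adder (FA) with inputs x_i, y_i, c_i and outputs s_i, c_{i+1}.]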
n-bit adder
Cascade n full adder (FA) blocks to form an n-bit adder.
Carries propagate, or ripple, through this cascade; the circuit is called an n-bit ripple-carry adder.
[Figure: n FA blocks in a row. Stage i has inputs x_i, y_i and carry-in c_i and produces s_i and c_{i+1}. c_0 enters at the least significant bit (LSB) position and c_n leaves at the most significant bit (MSB) position.]
K n-bit adder
kn-bit numbers can be added by cascading k n-bit adders.
[Figure: k n-bit adder blocks in a row. The rightmost block takes x_{n-1} … x_0, y_{n-1} … y_0 and c_0 and produces s_{n-1} … s_0 and c_n; the leftmost block takes x_{kn-1} … x_{(k-1)n}, y_{kn-1} … y_{(k-1)n} and produces s_{kn-1} … s_{(k-1)n} and c_{kn}.]
n-bit subtractor
• Recall that X – Y is equivalent to adding the 2’s complement of Y to X.
• The 2’s complement is equivalent to the 1’s complement + 1.
[Figure: n FA blocks as in the n-bit adder, with the y inputs complemented and the carry-in c_0 set to 1.]
n-bit adder/subtractor (contd..)
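[Figure: n-bit adder/subtractor. The inputs y_{n-1} … y_0, together with the Add/Sub control line, and the inputs x_{n-1} … x_0 feed an n-bit adder; the control line also supplies c_0. The outputs are c_n and s_{n-1} … s_0.]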
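A behavioural sketch of one common arrangement: each y bit is XORed with the Add/Sub control, and the control also drives c_0, so Sub = 1 adds the 1's complement of Y plus 1. The XOR gating is an assumption about the lost figure, and the function name `add_sub` is illustrative.

```python
def add_sub(x: int, y: int, sub: int, n: int = 4):
    """n-bit adder/subtractor on bit patterns.

    sub = 0 computes X + Y, sub = 1 computes X - Y.
    Returns (result pattern, c_n, overflow) with overflow = c_n XOR c_{n-1}.
    """
    mask = (1 << n) - 1
    y_in = (y ^ (mask if sub else 0)) & mask   # XOR every y bit with the control
    carry = sub                                # c_0 = Add/Sub control
    carry_into_stage = 0
    result = 0
    for i in range(n):
        xi = (x >> i) & 1
        yi = (y_in >> i) & 1
        carry_into_stage = carry               # after the loop this is c_{n-1}
        result |= (xi ^ yi ^ carry) << i
        carry = (xi & yi) | (xi & carry) | (yi & carry)
    return result, carry, carry ^ carry_into_stage


if __name__ == "__main__":
    print(add_sub(0b1101, 0b1001, 1))   # (-3) - (-7)
    print(add_sub(0b0101, 0b0011, 0))   # 5 + 3 overflows in 4 bits
```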
Detecting overflows
Overflow can occur only when the two operands have the same sign.
Overflow occurs if the sign of the result is different from the sign of the operands.
Recall that the MSB represents the sign.
x_{n-1}, y_{n-1}, s_{n-1} represent the sign of operand x, operand y and result sum s respectively.
A circuit to detect overflow can be implemented by either of the following logic expressions:
$\text{Overflow} = x_{n-1} y_{n-1} \bar{s}_{n-1} + \bar{x}_{n-1} \bar{y}_{n-1} s_{n-1}$
$\text{Overflow} = c_n \oplus c_{n-1}$
Computing the add time
Consider the 0th stage, with inputs x_0, y_0, c_0 and outputs s_0, c_1:
• c1 is available after 2 gate delays.
• s0 is available after 1 gate delay.
[Figure: the stage-0 full adder and the gate-level sum and carry circuits.]
Computing the add time (contd..)
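Cascade of 4 full adders, or a 4-bit adder
[Figure: four FA blocks with inputs x3 y3, x2 y2, x1 y1, x0 y0, carries c4, c3, c2, c1, c0, and sums s3, s2, s1, s0.]
Each stage adds 2 gate delays to the carry path, so c4 is available after 8 gate delays and s3 after 7.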
Fast addition
Recall the equations:
$s_i = x_i \oplus y_i \oplus c_i$
$c_{i+1} = x_i y_i + x_i c_i + y_i c_i$
The second equation can be written as:
$c_{i+1} = x_i y_i + (x_i + y_i) c_i$
We can write:
$c_{i+1} = G_i + P_i c_i$, where $G_i = x_i y_i$ and $P_i = x_i + y_i$
G_i is called the generate function and P_i is called the propagate function.
G_i and P_i are computed only from x_i and y_i and not from c_i, so they are available one gate delay after X and Y are applied to the inputs of an n-bit adder.
When x_i = y_i = 1, G_i = 1 and therefore c_{i+1} = 1 whether P_i is 0 or 1.
So we can modify P_i:
$c_{i+1} = G_i + P_i c_i$, where $G_i = x_i y_i$ and $P_i = x_i \oplus y_i$
[Figure: bit-stage cell (B cell) with inputs x_i, y_i, c_i and outputs G_i, P_i, s_i.]
Carry lookahead
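$c_{i+1} = G_i + P_i c_i$
$c_i = G_{i-1} + P_{i-1} c_{i-1}$
$c_{i+1} = G_i + P_i (G_{i-1} + P_{i-1} c_{i-1})$
continuing
$c_{i+1} = G_i + P_i (G_{i-1} + P_{i-1} (G_{i-2} + P_{i-2} c_{i-2}))$
until
$c_{i+1} = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \cdots + P_i P_{i-1} \cdots P_1 G_0 + P_i P_{i-1} \cdots P_0 c_0$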
• All carries can be obtained 3 gate delays after X, Y and c0 are
applied.
-One gate delay for Pi and Gi
-Two gate delays in the AND-OR circuit for ci+1
• All sums can be obtained 1 gate delay after the carries are
computed.
• Independent of n, n-bit addition requires only 4 gate delays.
• This is called a carry-lookahead adder.
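The expanded carry expression can be evaluated directly, without rippling from stage to stage. The sketch below forms every carry from the G_i, P_i and c_0 terms only, using P_i = x_i XOR y_i; the function name `carry_lookahead_add` is illustrative, and the inner loop stands in for what hardware does with one wide AND-OR level.

```python
def carry_lookahead_add(x: int, y: int, c0: int = 0, n: int = 4):
    """n-bit addition where each carry is a sum of products of G, P and c0."""
    xs = [(x >> i) & 1 for i in range(n)]
    ys = [(y >> i) & 1 for i in range(n)]
    G = [a & b for a, b in zip(xs, ys)]     # generate:  G_i = x_i y_i
    P = [a ^ b for a, b in zip(xs, ys)]     # propagate: P_i = x_i XOR y_i

    carries = [c0]
    for i in range(n):
        # c_{i+1} = G_i + P_i G_{i-1} + ... + P_i...P_1 G_0 + P_i...P_0 c_0
        c = 0
        product = 1                         # running product P_i P_{i-1} ...
        for j in range(i, -1, -1):
            c |= product & G[j]
            product &= P[j]
        c |= product & c0
        carries.append(c)

    # s_i = P_i XOR c_i
    total = sum((P[i] ^ carries[i]) << i for i in range(n))
    return total, carries[n]


if __name__ == "__main__":
    print(carry_lookahead_add(0b0111, 0b0110))   # 7 + 6
```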
Carry-lookahead adder
[Figure: 4-bit carry-lookahead adder. Four B cells take x3 y3 … x0 y0 and the carries c3 … c0 and produce s3 … s0 together with G3 P3 … G0 P0; the carry-lookahead logic uses the G_i, P_i and c0 to produce c1, c2, c3 and c4. Inset: a B cell with inputs x_i, y_i, c_i and outputs G_i, P_i, s_i.]
Carry lookahead adder
(contd..)
Performing n-bit addition in 4 gate delays independent of n
is good only theoretically because of fan-in constraints.
4-bit carry-lookahead adder
Blocked Carry-Lookahead adder
Carry-out from a 4-bit block can be given as:
$c_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 c_0$
Rewrite this as:
$P_0^I = P_3 P_2 P_1 P_0$
$G_0^I = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0$
The superscript I denotes the blocked carry lookahead, and the subscript identifies the block.
$c_{16} = G_3^I + P_3^I G_2^I + P_3^I P_2^I G_1^I + P_3^I P_2^I P_1^I G_0^I + P_3^I P_2^I P_1^I P_0^I c_0$
Blocked Carry-Lookahead adder
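[Figure: 16-bit adder built from four 4-bit carry-lookahead blocks with inputs x15-12 y15-12, x11-8 y11-8, x7-4 y7-4, x3-0 y3-0. Each block supplies G_k^I and P_k^I (k = 3, 2, 1, 0) to a second-level carry-lookahead logic circuit.]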
Multiplication
Multiplication of unsigned numbers
We added the partial products at end.
Alternative would be to add the partial products at each
stage.
Rules to implement multiplication are:
If the ith bit of the multiplier is 1, shift the multiplicand
and add the shifted multiplicand to the current value of
the partial product.
Hand over the partial product to the next stage
Value of the partial product at the start stage is 0.
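The rules above amount to a loop over the multiplier bits. A minimal sketch, with the illustrative name `multiply_unsigned` and an assumed word length n:

```python
def multiply_unsigned(m: int, q: int, n: int = 4) -> int:
    """Multiply by adding the shifted multiplicand at each stage."""
    partial_product = 0                    # value at the start stage is 0
    for i in range(n):
        if (q >> i) & 1:                   # ith bit of the multiplier is 1
            partial_product += m << i      # add the multiplicand shifted i places
        # the partial product is handed over to the next stage
    return partial_product


if __name__ == "__main__":
    print(multiply_unsigned(0b1101, 0b1011))   # 13 x 11
```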
Typical multiplication cell
The main component in each cell is a full adder, FA.
The AND gate in each cell determines whether a
multiplicand bit, mj , is added to the incoming partial-
product bit, based on the value of the multiplier bit, qi.
Each row i, where 0 ≤ i ≤ 3, adds the multiplicand
(appropriately shifted) to the incoming partial product,
PPi, to generate the outgoing partial product, PP(i + 1),
if qi = 1.
If qi = 0, PPi is passed vertically downward unchanged.
PP0 is all 0s, and PP4 is the desired product.
The multiplicand is shifted left one position per row by
the diagonal signal path.
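Combinatorial array multiplier
[Figure: 4 × 4 array of multiplication cells. The multiplicand bits m3 … m0 enter at the top together with PP0 = 0; the multiplier bits q0, q1, q2, q3 each control one row, producing PP1, PP2 and PP3; the product bits p7 … p0 leave at the right and bottom edges of the array.]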
Combinatorial array multiplier
(contd..)
Combinatorial array multipliers are:
Extremely inefficient.
Have a high gate count for multiplying numbers of practical
size such as 32-bit or 64-bit numbers.
Perform only one function, namely, unsigned integer product.
Sequential multiplication
Recall the rule for generating partial
products:
If the ith bit of the multiplier is 1, add the appropriately shifted
multiplicand to the current partial product.
Multiplicand has been shifted left when added to the partial
product.
However, adding a left-shifted multiplicand to
an unshifted partial product is equivalent to
adding an unshifted multiplicand to a right-
shifted partial product.
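Sequential Circuit Multiplier
[Figure: register A (initially 0, bits a_{n-1} … a_0) and the multiplier register Q (bits q_{n-1} … q_0) form one shift-right register, with flip-flop C at its left end. An n-bit adder adds A to the output of a MUX that selects either 0 or the multiplicand M (bits m_{n-1} … m_0). A control sequencer generates the Add/Noadd control from q_0.]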
Sequential Circuit Multiplier
Registers A and Q are shift registers, concatenated as shown. Together, they hold
partial product PPi while multiplier bit qi generates the signal Add/No add.
This signal causes the multiplexer MUX to select 0 when qi = 0, or to select the
multiplicand M when qi = 1, to be added to PPi to generate PP(i + 1).
The product is computed in n cycles. The partial product grows in length by one bit
per cycle from the initial vector, PP0, of n 0s in register A.
The carry-out from the adder is stored in flip-flop C, shown at the left end of register
A.
At the start, the multiplier is loaded into register Q, the multiplicand into register M,
and C and A are cleared to 0. At the end of each cycle, C, A, and Q are shifted right
one bit position to allow for growth of the partial product as the multiplier is shifted
out of register Q.
Because of this shifting, multiplier bit qi appears at the LSB position of Q to generate
the Add/No add signal at the correct time, starting with q0 during the first cycle, q1
during the second cycle, and so on.
After they are used, the multiplier bits are discarded by the right-shift operation. Note
that the carry-out from the adder is the leftmost bit of PP(i + 1), and it must be held in
the C flip-flop to be shifted right with the contents of A and Q.
After n cycles, the high-order half of the product is held in register A and the low-
order half is in register Q.
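The register-level behaviour described above can be simulated in a few lines. This is a sketch under the stated description; the function name `sequential_multiply` and the trace format (A, Q after each shift) are illustrative.

```python
def sequential_multiply(m: int, q: int, n: int = 4):
    """Simulate the C, A, Q registers of the sequential multiplier.

    Returns (2n-bit product, list of (A, Q) contents after each cycle).
    """
    mask = (1 << n) - 1
    c, a = 0, 0                                # C and A are cleared at the start
    trace = []
    for _ in range(n):
        if q & 1:                              # q0 = 1: Add, otherwise No add
            total = a + m
            c, a = total >> n, total & mask    # the carry-out is held in C
        # Shift C, A and Q right one position as a single register.
        combined = ((c << (2 * n)) | (a << n) | q) >> 1
        c = 0
        a = (combined >> n) & mask
        q = combined & mask
        trace.append((a, q))
    return (a << n) | q, trace


if __name__ == "__main__":
    product, steps = sequential_multiply(0b1101, 0b1011)
    for a, q in steps:
        print(f"A={a:04b} Q={q:04b}")
    print("product =", product)
```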
Sequential multiplication (contd..)
M = 1 1 0 1

             C    A          Q
Initial      0    0 0 0 0    1 0 1 1    Initial configuration
Add          0    1 1 0 1    1 0 1 1
Shift        0    0 1 1 0    1 1 0 1    First cycle
Add          1    0 0 1 1    1 1 0 1
Shift        0    1 0 0 1    1 1 1 0    Second cycle
No add       0    1 0 0 1    1 1 1 0
Shift        0    0 1 0 0    1 1 1 1    Third cycle
Add          1    0 0 0 1    1 1 1 1
Shift        0    1 0 0 0    1 1 1 1    Fourth cycle
Product = 1 0 0 0 1 1 1 1 (held in A and Q)
Signed Multiplication
Considering 2’s-complement signed operands, what will happen to (-13)(+11) if we follow the same method as for unsigned multiplication?
Each summand must be sign-extended to the width of the product:
                    1 0 0 1 1   (-13)
                  × 0 1 0 1 1   (+11)
          1 1 1 1 1 1 0 0 1 1
          1 1 1 1 1 0 0 1 1
          0 0 0 0 0 0 0 0
          1 1 1 0 0 1 1
          0 0 0 0 0 0
          1 1 0 1 1 1 0 0 0 1   (-143)
The leading bits of each summand are the sign extension of the multiplicand.
Booth Algorithm
In general, in the Booth scheme, -1 times the shifted
multiplicand is selected when moving from 0 to 1, and
+1 times the shifted multiplicand is selected when
moving from 1 to 0, as the multiplier is scanned from
right to left.
Multiplier:        0  0  1  0  1  1  0  0  1  1  1  0  1  0  1  1  0  0
Booth recoding:    0 +1 -1 +1  0 -1  0 +1  0  0 -1 +1 -1 +1  0 -1  0  0
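The recoding rule can be sketched as follows: each recoded digit is the bit to its right minus the bit itself, with an implied 0 to the right of the LSB. The names `booth_recode` and `booth_multiply` are illustrative; the multiplier is passed as a 2's-complement bit string, MSB first.

```python
def booth_recode(bits: str) -> list:
    """Recode a 2's-complement multiplier (MSB first) into digits -1, 0, +1."""
    padded = bits + "0"                    # implied 0 to the right of the LSB
    # digit = (bit on the right) - (bit itself):
    # pair 0,1 gives +1; pair 1,0 gives -1; pairs 0,0 and 1,1 give 0.
    return [int(padded[i + 1]) - int(padded[i]) for i in range(len(bits))]


def booth_multiply(multiplicand: int, multiplier_bits: str) -> int:
    """Sum the selected (+1, 0, -1) versions of the shifted multiplicand."""
    digits = booth_recode(multiplier_bits)
    n = len(digits)
    # digits[0] is the most significant digit and has weight 2^(n-1).
    return sum(d * multiplicand * (1 << (n - 1 - i))
               for i, d in enumerate(digits))


if __name__ == "__main__":
    print(booth_recode("0011110"))
    print(booth_multiply(13, "11010"))     # (+13) x (-6)
```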
Booth Algorithm
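-4 × 5 = -20
              1 1 0 0   (-4)   multiplicand M
            × 0 1 0 1   (+5)   multiplier
             +1-1+1-1          Booth-recoded multiplier
      0 0 0 0 0 1 0 0          -1 × M
      1 1 1 1 1 0 0            +1 × M, shifted 1
      0 0 0 1 0 0              -1 × M, shifted 2
      1 1 1 0 0                +1 × M, shifted 3
      1 1 1 0 1 1 0 0          product
The 2’s complement of the product is 0 0 0 1 0 1 0 0 = 20, so the product is -20.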
Booth Algorithm
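5 × (-4) = -20
              0 1 0 1   (+5)   multiplicand M
            × 1 1 0 0   (-4)   multiplier
              0-1 0 0          Booth-recoded multiplier
      0 0 0 0 0 0 0 0           0 × M
      0 0 0 0 0 0 0             0 × M, shifted 1
      1 1 1 0 1 1              -1 × M, shifted 2
      0 0 0 0 0                 0 × M, shifted 3
      1 1 1 0 1 1 0 0          product
The 2’s complement of the product is 0 0 0 1 0 1 0 0 = 20, so the product is -20.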
Booth Algorithm
Consider a multiplication in which the multiplier is the positive number 0011110. How many appropriately shifted versions of the multiplicand are added in the standard procedure?
                      0  1  0  1  1  0  1    multiplicand
                      0  0 +1 +1 +1 +1  0    multiplier
                      0  0  0  0  0  0  0
                   0  1  0  1  1  0  1
                0  1  0  1  1  0  1
             0  1  0  1  1  0  1
          0  1  0  1  1  0  1
       0  0  0  0  0  0  0
    0  0  0  0  0  0  0
 0  0  0  1  0  1  0  1  0  0  0  1  1  0    product
Four shifted versions of the multiplicand are added.
Booth Algorithm
Since 0011110 = 0100000 – 0000010, what happens if we use the expression on the right instead?
                      0  1  0  1  1  0  1    multiplicand
                      0 +1  0  0  0 -1  0    recoded multiplier
 0  0  0  0  0  0  0  0  0  0  0  0  0  0
 1  1  1  1  1  1  1  0  1  0  0  1  1       2's complement of the multiplicand
 0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0
 0  0  0  1  0  1  1  0  1
 0  0  0  0  0  0  0  0
 0  0  0  1  0  1  0  1  0  0  0  1  1  0    product
Only two nonzero versions of the multiplicand are added.
Booth Algorithm
          0 1 1 0 1   (+13)              0  1  1  0  1
        × 1 1 0 1 0   (-6)               0 -1 +1 -1  0    Booth-recoded multiplier
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 0 0 1 1
0 0 0 0 1 1 0 1
1 1 1 0 0 1 1
0 0 0 0 0 0
1 1 1 0 1 1 0 0 1 0   (-78)
Booth Algorithm
Multiplier             Version of multiplicand
Bit i    Bit i-1       selected by bit i
  0        0              0 × M
  0        1             +1 × M
  1        0             -1 × M
  1        1              0 × M

Ordinary multiplier:
  1  1  0  0  0  1  0  1  1  0  1  1  1  1  0  0
  0 -1  0  0 +1 -1 +1  0 -1 +1  0  0  0 -1  0  0
Good multiplier:
  0  0  0  0  1  1  1  1  1  0  0  0  0  1  1  1
  0  0  0 +1  0  0  0  0 -1  0  0  0 +1  0  0 -1
Fast Multiplication
Bit-Pair Recoding of Multipliers
Bit-pair recoding halves the maximum number of summands (versions of the multiplicand).
Multiplier 1 1 0 1 0 with sign extension and an implied 0 to the right of the LSB:
  Multiplier bits:      1  1  1  0  1  0  (0)
  Booth recoding:       0  0 -1 +1 -1  0
  Bit-pair recoding:       0    -1    -2
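Bit-Pair Recoding of Multipliers
-4 × 5 = -20
              1 1 0 0   (-4)   multiplicand M
            × 0 1 0 1   (+5)   multiplier
             +1-1+1-1          Booth-recoded multiplier
               +1  +1          bit-pair recoded multiplier
      1 1 1 1 1 1 0 0          +1 × M
      1 1 1 1 0 0              +1 × M, shifted 2
      1 1 1 0 1 1 0 0          product
The 2’s complement of the product is 0 0 0 1 0 1 0 0 = 20, so the product is -20.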
Bit-Pair Recoding of Multipliers
Multiplier bit-pair    Multiplier bit on the right    Multiplicand selected
 i+1      i                     i-1                     at position i
  0       0                      0                         0 × M
  0       0                      1                        +1 × M
  0       1                      0                        +1 × M
  0       1                      1                        +2 × M
  1       0                      0                        -2 × M
  1       0                      1                        -1 × M
  1       1                      0                        -1 × M
  1       1                      1                         0 × M
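Every row of the table follows from one expression: the selected multiple is -2·b(i+1) + b(i) + b(i-1). A minimal sketch, assuming the multiplier is given as a 2's-complement bit string (MSB first) and is sign-extended when its length is odd; the function names are illustrative.

```python
def bit_pair_recode(bits: str) -> list:
    """Recode a 2's-complement multiplier into digits from {-2, -1, 0, +1, +2}."""
    if len(bits) % 2:
        bits = bits[0] + bits              # sign-extend to an even number of bits
    padded = bits + "0"                    # implied 0 to the right of the LSB
    digits = []
    for i in range(0, len(bits), 2):
        b_hi, b_lo, b_right = (int(ch) for ch in padded[i:i + 3])
        digits.append(-2 * b_hi + b_lo + b_right)
    return digits


def bit_pair_multiply(multiplicand: int, multiplier_bits: str) -> int:
    """Add one summand per bit pair; each pair has four times the weight of the next."""
    product = 0
    for digit in bit_pair_recode(multiplier_bits):
        product = product * 4 + digit * multiplicand
    return product


if __name__ == "__main__":
    print(bit_pair_recode("11010"))            # multiplier -6
    print(bit_pair_multiply(13, "11010"))      # (+13) x (-6)
```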
Bit-Pair Recoding of Multipliers
Booth recoding:
          0 1 1 0 1   (+13)
        × 1 1 0 1 0   (-6)     recoded as 0 -1 +1 -1 0
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 0 0 1 1
0 0 0 0 1 1 0 1
1 1 1 0 0 1 1
0 0 0 0 0 0
1 1 1 0 1 1 0 0 1 0   (-78)
Bit-pair recoding:
          0 1 1 0 1   (+13)
            0 -1 -2            bit-pair recoded multiplier
1 1 1 1 1 0 0 1 1 0            -2 × M
1 1 1 1 0 0 1 1                -1 × M, shifted 2
0 0 0 0 0 0                     0 × M, shifted 4
1 1 1 0 1 1 0 0 1 0   (-78)
Carry-Save Addition of Summands
[Figure: array of full adders arranged for carry-save addition of the summands, producing the product bits P7 P6 P5 P4 P3 P2 P1 P0.]
Carry-Save Addition of
Summands (Cont.,)
Consider the addition of many summands, we
can:
Group the summands in threes and perform carry-save
addition on each of these groups in parallel to generate a
set of S and C vectors in one full-adder delay
Group all of the S and C vectors into threes, and perform
carry-save addition on them, generating a further set of S
and C vectors in one more full-adder delay
Continue with this process until there are only two vectors
remaining
They can be added in a RCA or CLA to produce the desired
product
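The grouping procedure can be sketched with integers standing in for bit vectors. Each carry-save step turns three vectors into a sum vector S and a carry vector C (shifted left one position); the last two vectors are added with an ordinary carry-propagate addition. The function names are illustrative.

```python
def carry_save_add(a: int, b: int, c: int):
    """One level of full adders with no carry propagation: returns (S, C)."""
    s = a ^ b ^ c                                  # bitwise sum
    carry = ((a & b) | (a & c) | (b & c)) << 1     # bitwise carry, weight doubled
    return s, carry


def csa_tree_sum(summands) -> int:
    """Reduce the summands in groups of three until two vectors remain."""
    vectors = list(summands)
    while len(vectors) > 2:
        next_level = []
        full = len(vectors) - len(vectors) % 3
        for i in range(0, full, 3):
            next_level.extend(carry_save_add(*vectors[i:i + 3]))
        next_level.extend(vectors[full:])          # leftover vectors pass through
        vectors = next_level
    return sum(vectors)                            # final carry-propagate addition


if __name__ == "__main__":
    # 45 x 63: six summands, each the multiplicand shifted by 0..5 positions.
    summands = [45 << i for i in range(6)]
    print(carry_save_add(*summands[:3]))
    print(csa_tree_sum(summands))
```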
Carry-Save Addition of Summands
              1 0 1 1 0 1   (45)  M
            × 1 1 1 1 1 1   (63)  Q
              1 0 1 1 0 1   A
            1 0 1 1 0 1     B
          1 0 1 1 0 1       C
        1 0 1 1 0 1         D
      1 0 1 1 0 1           E
    1 0 1 1 0 1             F
  1 0 1 1 0 0 0 1 0 0 1 1   (2,835)  Product
Figure 6.17. A multiplication example used to illustrate carry-save addition as shown in Figure 6.18.
              1 0 1 1 0 1   M
            × 1 1 1 1 1 1   Q
              1 0 1 1 0 1   A
            1 0 1 1 0 1     B
          1 0 1 1 0 1       C
          1 1 0 0 0 0 1 1   S1
        0 0 1 1 1 1 0 0     C1

        1 0 1 1 0 1         D
      1 0 1 1 0 1           E
    1 0 1 1 0 1             F
    1 1 0 0 0 0 1 1         S2
  0 0 1 1 1 1 0 0           C2

          1 1 0 0 0 0 1 1   S1
        0 0 1 1 1 1 0 0     C1
    1 1 0 0 0 0 1 1         S2
    1 1 0 1 0 1 0 0 0 1 1   S3
  0 0 0 0 1 0 1 1 0 0 0     C3

    1 1 0 1 0 1 0 0 0 1 1   S3
  0 0 0 0 1 0 1 1 0 0 0     C3
  0 0 1 1 1 1 0 0           C2
  0 1 0 1 1 1 0 1 0 0 1 1   S4
+ 0 1 0 1 0 1 0 0 0 0 0     C4
  1 0 1 1 0 0 0 1 0 0 1 1   Product
Figure 6.18. The multiplication example from Figure 6.17 performed using carry-save addition.
Integer Division
Manual Division
        21                        10101
13 ) 274             1101 ) 100010010
     26                      1101
      14                      10000
      13                       1101
       1                         1110
                                 1101
                                    1
Decimal 274 ÷ 13 and binary 100010010 ÷ 1101 both give quotient 21 (10101) and remainder 1.
Longhand Division Steps
Position the divisor appropriately with respect to the dividend and perform a subtraction.
If the remainder is zero or positive, a quotient bit of 1 is determined, the remainder is extended by another bit of the dividend, the divisor is repositioned, and another subtraction is performed.
If the remainder is negative, a quotient bit of 0 is determined, the dividend is restored by adding back the divisor, and the divisor is repositioned for another subtraction.
Circuit Arrangement
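[Figure: circuit arrangement for binary division. Register A (bits a_n, a_{n-1} … a_0) and register Q (bits q_{n-1} … q_0, holding the dividend and later the quotient) shift left together; register M holds the divisor as 0 m_{n-1} … m_0; quotient-setting logic sets q_0.]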
Examples
Restoring division of 1 0 0 0 by 1 1 (8 ÷ 3): quotient 1 0, remainder 1 0.

                A            Q
Initially    0 0 0 0 0    1 0 0 0
M            0 0 0 1 1
Shift        0 0 0 0 1    0 0 0 _
Subtract     1 1 1 0 1
Set q0       1 1 1 1 0                  First cycle
Restore          1 1
             0 0 0 0 1    0 0 0 0
Shift        0 0 0 1 0    0 0 0 _
Subtract     1 1 1 0 1
Set q0       1 1 1 1 1                  Second cycle
Restore          1 1
             0 0 0 1 0    0 0 0 0
Shift        0 0 1 0 0    0 0 0 _
Subtract     1 1 1 0 1
Set q0       0 0 0 0 1                  Third cycle
             0 0 0 0 1    0 0 0 1
Shift        0 0 0 1 0    0 0 1 _
Subtract     1 1 1 0 1
Set q0       1 1 1 1 1                  Fourth cycle
Restore          1 1
             0 0 0 1 0    0 0 1 0
             Remainder    Quotient
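The cycles of the example can be reproduced with a short simulation. This sketch uses a Python integer for A, so a negative A plays the role of a sign bit of 1 in the (n+1)-bit register; the function name `restoring_divide` is illustrative.

```python
def restoring_divide(dividend: int, divisor: int, n: int = 4):
    """Restoring division of an n-bit dividend; returns (quotient, remainder)."""
    mask = (1 << n) - 1
    a, q = 0, dividend & mask
    for _ in range(n):
        # Shift A and Q left one position; the MSB of Q moves into A.
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & mask
        a -= divisor                       # subtract M from A
        if a < 0:
            a += divisor                   # restore A, q0 stays 0
        else:
            q |= 1                         # set q0 to 1
    return q, a


if __name__ == "__main__":
    print(restoring_divide(0b1000, 0b11))              # 8 / 3
    print(restoring_divide(0b100010010, 0b1101, 9))    # 274 / 13
```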
Nonrestoring division of 1 0 0 0 by 1 1 (8 ÷ 3):

                A            Q
Initially    0 0 0 0 0    1 0 0 0
M            0 0 0 1 1
Shift        0 0 0 0 1    0 0 0 _
Subtract     1 1 1 0 1                  First cycle
Set q0       1 1 1 1 0    0 0 0 0
Shift        1 1 1 0 0    0 0 0 _
Add          0 0 0 1 1                  Second cycle
Set q0       1 1 1 1 1    0 0 0 0
Shift        1 1 1 1 0    0 0 0 _
Add          0 0 0 1 1                  Third cycle
Set q0       0 0 0 0 1    0 0 0 1
Shift        0 0 0 1 0    0 0 1 _
Subtract     1 1 1 0 1                  Fourth cycle
Set q0       1 1 1 1 1    0 0 1 0
                          Quotient
Restore remainder:
             1 1 1 1 1
Add          0 0 0 1 1
             0 0 0 1 0
             Remainder
A nonrestoring-division example.
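The same example in code: when A is negative the next cycle adds the divisor instead of restoring and subtracting, and a single restore is applied at the end if needed. As a sketch, A is kept as a signed Python integer rather than an (n+1)-bit register; the function name `nonrestoring_divide` is illustrative.

```python
def nonrestoring_divide(dividend: int, divisor: int, n: int = 4):
    """Nonrestoring division of an n-bit dividend; returns (quotient, remainder)."""
    mask = (1 << n) - 1
    a, q = 0, dividend & mask
    for _ in range(n):
        was_negative = a < 0
        # Shift A and Q left one position; the MSB of Q moves into A.
        a = 2 * a + ((q >> (n - 1)) & 1)
        q = (q << 1) & mask
        # Add M if A was negative before the shift, otherwise subtract M.
        a = a + divisor if was_negative else a - divisor
        if a >= 0:
            q |= 1                         # set q0 to 1, otherwise it stays 0
    if a < 0:
        a += divisor                       # final step: restore the remainder
    return q, a


if __name__ == "__main__":
    print(nonrestoring_divide(0b1000, 0b11))   # 8 / 3
```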
Floating-Point Numbers and Operations
Fractions
If b is a binary vector, then we have seen that it can be interpreted as an unsigned integer by:
$V(b) = b_{31} \cdot 2^{31} + b_{30} \cdot 2^{30} + b_{29} \cdot 2^{29} + \cdots + b_1 \cdot 2^{1} + b_0 \cdot 2^{0}$
Now suppose the binary vector is interpreted with an implicit binary point just to the left of the sign bit:
implicit binary point → .b31 b30 b29 … b1 b0
Range of fractions
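The value of the unsigned binary fraction is:
$V(b) = b_{31} \cdot 2^{-1} + b_{30} \cdot 2^{-2} + b_{29} \cdot 2^{-3} + \cdots + b_1 \cdot 2^{-31} + b_0 \cdot 2^{-32}$
$0 \le V(b) \le 1 - 2^{-32} \approx 0.9999999998$
In general, for an n-bit binary fraction:
$0 \le V(b) \le 1 - 2^{-n}$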
Scientific notation
• Previous representations have a fixed point. Either the point is to the immediate right or it is to the immediate left. This is called fixed-point representation.
• Fixed-point representation suffers from the drawback that it can represent only a finite, and quite small, range of numbers.
A more convenient representation is the scientific representation, where the numbers are represented in the form:
$x = m_1.m_2 m_3 m_4 \times b^{e}$
The components of these numbers are the mantissa $m_1.m_2 m_3 m_4$, the implied base b, and the exponent e.
Significant digits
A number such as the following is said to have 7 significant digits:
$x = 0.m_1 m_2 m_3 m_4 m_5 m_6 m_7 \times b^{e}$
A sample representation
[Figure: a 32-bit word divided into three fields of 1, 7 and 24 bits.]
Normalization
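Consider the number: $x = 0.0004056781 \times 10^{12}$
If the number is to be represented using only 7 significant mantissa digits, it is shifted so that as many significant digits as possible are brought into the available slots:
$x = 0.4056781 \times 10^{9}$ instead of $0.0004056 \times 10^{12}$
In binary: $0001101000(10110) \times 2^{8} = 1101000101(10) \times 2^{5}$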
Normalization (contd..)
A floating-point number is in normalized form if the most significant 1 in the mantissa is in the most significant bit position of the mantissa.
All normalized floating-point numbers in this system will be of the form:
0.1xxxxx.......xx
Normalization, overflow and underflow
The procedure for normalizing a floating-point number is:
Do (until MSB of mantissa == 1)
    Shift the mantissa left (or right)
    Decrement (increment) the exponent by 1
end do
Applying the normalization procedure to: $.000111001110 \ldots 0010 \times 2^{-62}$
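The left-shift case of the procedure can be sketched on a mantissa given as a bit string. The function name `normalize` is illustrative, and the 12-bit mantissa used in the demonstration is only the visible prefix of the slide's example, not the full word.

```python
def normalize(mantissa_bits: str, exponent: int):
    """Left-normalize a nonzero binary fraction .b1 b2 b3 ... x 2^exponent.

    Returns (shifted mantissa bits, adjusted exponent); the value is unchanged.
    """
    if "1" not in mantissa_bits:
        raise ValueError("a zero mantissa cannot be normalized")
    shift = mantissa_bits.index("1")                   # number of leading zeros
    shifted = mantissa_bits[shift:] + "0" * shift      # shift left, fill with 0s
    return shifted, exponent - shift                   # one decrement per shift


if __name__ == "__main__":
    print(normalize("000111001110", -62))
```

Three left shifts bring the leading 1 to the MSB and lower the exponent from -62 to -65. If the exponent field could only hold values down to -64, as with a 7-bit excess-64 exponent, this result would be out of range, which is an underflow.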
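Changing the implied base
So far we have assumed an implied base of 2, that is, our floating-point numbers are of the form:
$x = m \times 2^{e}$
With an implied base of 16 instead:
$x = m \times 16^{e}$
Then:
$y = (m \cdot 16) \times 16^{e-1} = (m \cdot 2^{4}) \times 16^{e-1} = m \times 16^{e} = x$
That is, changing the exponent by 1 corresponds to shifting the mantissa by 4 bit positions.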
Excess notation
• Rather than representing an exponent in 2’s-complement form, it turns out to be more beneficial to represent the exponent in excess notation.
• If 7 bits are allocated to the exponent, exponents can be represented in the range of -64 to +63, that is:
$-64 \le e \le 63$
The exponent can also be represented using the following coding, called excess-64:
E’ = Etrue + 64
In general, excess-p coding is represented as:
E’ = Etrue + p
IEEE notation
IEEE Floating Point notation is the standard representation in use.
There are two representations:
- Single precision.
- Double precision.
Both have an implied base of 2.
Single precision:
- 32 bits (23-bit mantissa, 8-bit exponent in excess-127 representation)
Double precision:
- 64 bits (52-bit mantissa, 11-bit exponent in excess-1023 representation)
Fractional mantissa, with an implied binary point at immediate left.
IEEE notation
Represent 1259.125 (base 10) in single precision.
Step 1: Convert the decimal number to binary format
Integer part: 1259 (10) = 10011101011 (2)
Fractional part: 0.125 (10) = 0.001 (2)
Binary number = 10011101011 + 0.001 = 10011101011.001
Step 2: Normalize the number
$10011101011.001 = 1.0011101011001 \times 2^{10}$
Step 3: Single precision format
For the given number S = 0, E = 10 and M = 0011101011001
Bias for single precision format = 127
E’ = E + 127 = 10 + 127 = 137 (10) = 10001001 (2)
• Number in single precision format:
0 10001001 0011101011001….0
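The worked example can be cross-checked with Python's standard `struct` module, which packs a float into its IEEE single-precision bit pattern. The helper name `float_to_fields` is illustrative.

```python
import struct


def float_to_fields(value: float):
    """Split the single-precision encoding of value into (S, E', M)."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))
    sign = word >> 31                  # 1 sign bit
    exponent = (word >> 23) & 0xFF     # 8-bit excess-127 exponent E'
    mantissa = word & 0x7FFFFF         # 23 stored mantissa bits (hidden 1 omitted)
    return sign, exponent, mantissa


if __name__ == "__main__":
    s, e, m = float_to_fields(1259.125)
    print(s, f"{e:08b}", f"{m:023b}")
    print("true exponent:", e - 127)
```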
Peculiarities of IEEE notation
• Floating-point numbers have to be represented in a normalized form to maximize the use of available mantissa digits.
• In a base-2 representation, this implies that the MSB of the mantissa is always equal to 1.
• If every number is normalized, then the MSB of the mantissa is always 1. We can do away with storing the MSB.
• IEEE notation assumes that all numbers are normalized so that the MSB of the mantissa is a 1, and does not store this bit.
• So the most significant stored bit of the mantissa field in the IEEE notation can be either a 0 or a 1.
• The values of the numbers represented in the IEEE single precision notation are of the form:
$(+,-)\ 1.M \times 2^{(E' - 127)}$
• The hidden 1 forms the integer part of the mantissa.
• Note that excess-127 and excess-1023 (not excess-128 or excess-1024) are used to represent the exponent.
Exponent field
In the IEEE representation, the exponent is in excess-127 (excess-1023) notation.
The actual exponents represented are:
$-126 \le E \le 127$ (single precision) and $-1022 \le E \le 1023$ (double precision)
This is because the IEEE standard uses the exponents -127 and 128 (and -1023 and 1024), that is, the stored values 0 and 255 in single precision, to represent special conditions:
- Exact zero
- Infinity
Floating point arithmetic
Addition:
$3.1415 \times 10^{8} + 1.19 \times 10^{6} = 3.1415 \times 10^{8} + 0.0119 \times 10^{8} = 3.1534 \times 10^{8}$
Multiplication:
$(3.1415 \times 10^{8}) \times (1.19 \times 10^{6}) = (3.1415 \times 1.19) \times 10^{(8+6)}$
Division:
$(3.1415 \times 10^{8}) / (1.19 \times 10^{6}) = (3.1415 / 1.19) \times 10^{(8-6)}$
Floating point arithmetic: ADD/SUB rule
Choose the number with the smaller exponent.
Shift its mantissa right until the exponents of both the numbers are equal.
Add or subtract the mantissas.
Determine the sign of the result.
Normalize the result if necessary and truncate/round to the number of mantissa bits.
Note: This does not consider the possibility of overflow/underflow.
Floating point arithmetic: MUL
rule
Add the exponents.
Subtract the bias.
Multiply the mantissas and determine the
sign of the result.
Normalize the result (if necessary).
Truncate/round the mantissa of the result.
Floating point arithmetic: DIV rule
Subtract the exponents
Add the bias.
Divide the mantissas and determine the sign
of the result.
Normalize the result if necessary.
Truncate/round the mantissa of the result.
Guard bits
While adding two floating-point numbers with 24-bit mantissas, we shift the mantissa of the number with the smaller exponent to the right until the two exponents are equalized.
This implies that mantissa bits may be lost during the right shift (that is,
bits of precision may be shifted out of the mantissa being shifted).
To prevent this, floating point operations are implemented by keeping
guard bits, that is, extra bits of precision at the least significant end
of the mantissa.
The arithmetic on the mantissas is performed with these extra bits of
precision.
After an arithmetic operation, the guarded mantissas are:
- Normalized (if necessary)
- Converted back by a process called truncation/rounding to a 24-bit
mantissa.
Truncation/rounding
Straight chopping:
The guard bits (excess bits of precision) are dropped.
Rounding
Rounding is evidently the most accurate truncation method.
However,
Rounding requires an addition operation.
Rounding may require a renormalization, if the addition produces a carry out of the most significant mantissa bit.
Implementing Floating-Point Operations-1
Floating-point operations can be implemented in hardware, using logic circuitry, or by software routines.
In either case, the computer must be able to convert input and output from and to the user’s decimal representation of numbers.
In many general-purpose processors, floating-point operations are available at the machine-instruction level, implemented in hardware.
Implementing Floating-Point Operations-2
Step 1: Compare the exponents using an 8-bit subtractor. The sign of the difference is sent to the SWAP unit to decide which number is sent to the SHIFTER unit.
Step 2: The exponent of the result is selected by a two-way multiplexer, depending on the sign bit from Step 1.
Step 3: Control logic determines whether the mantissas are to be added or subtracted. Many combinations are possible here, depending on the sign bits and exponent values of the operands.
Step 4: Normalize the result according to the number of leading zeros. In the special case where two operands of the form 1.xxxxx give a result of the form 1x.xxx, X = -1, so the exponent value is increased.
Implementing Floating-Point
Operations-3
Example
Add single precision floating point numbers A and B, where A=44900000H and B =
42A00000H.
Solution
Step 1 : Represent numbers in single precision format
A = 0 1000 1001 0010000….0
B = 0 1000 0101 0100000….0
Exponent for A = 1000 1001 =137
Therefore actual exponent = 137-127(Bias) =10
Exponent for B = 1000 0101 = 133
Therefore actual exponent = 133-127(Bias) = 6
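The exponents differ by 4. Hence the mantissa of B, the number with the smaller exponent, is shifted right by 4 bits together with its hidden 1, as shown below.
Step 2: Shift mantissa
Mantissa of B with hidden 1 = 1.0100000…0
Shifted mantissa of B = 0.00010100…0
Step 3: Add mantissa
Mantissa of A = 1.00100000…0
Mantissa of B = 0.00010100…0
Mantissa of result = 1.00110100…0
As both numbers are positive, the sign of the result is positive. The result is already normalized, so the exponent remains 137.
Result = 0100 0100 1001 1010 0…0 = 449A0000H
Check: A = 1152 and B = 80, and 449A0000H represents 1232.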
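The steps of this example can be sketched in code for the restricted case of two positive, normalized single-precision operands. The function name `fp_add_positive` is illustrative; the sketch chops bits shifted out of the smaller operand and ignores zero, infinity, denormal numbers and exponent overflow.

```python
def fp_add_positive(a_word: int, b_word: int) -> int:
    """Add two positive normalized single-precision numbers given as 32-bit words."""
    def unpack(word):
        exponent = (word >> 23) & 0xFF                 # excess-127 exponent E'
        significand = (1 << 23) | (word & 0x7FFFFF)    # restore the hidden 1
        return exponent, significand

    ea, ma = unpack(a_word)
    eb, mb = unpack(b_word)
    if ea < eb:                                        # make A the larger exponent
        ea, ma, eb, mb = eb, mb, ea, ma
    mb >>= ea - eb                                     # align: shift smaller operand right
    total = ma + mb                                    # add the significands
    if total >> 24:                                    # result of the form 1x.xxx
        total >>= 1                                    # renormalize
        ea += 1
    return (ea << 23) | (total & 0x7FFFFF)             # drop the hidden 1 again


if __name__ == "__main__":
    print(f"{fp_add_positive(0x44900000, 0x42A00000):08X}H")
```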