HW 4 Sol
1. How would you test for overflow in the result of an addition of two 8-bit operands if the
operands were (i) unsigned, (ii) signed with 2's complement representation?
Add the following 8-bit strings assuming they are (i) unsigned, (ii) signed and represented
using 2's complement. Indicate which of these additions overflow.
(i) unsigned addition
A. 0110 1110 + 1001 1111 = 1 0000 1101 [overflow, since there is a carry out of the MSB]
B. 1111 1111 + 0000 0001 = 1 0000 0000 [overflow, since there is a carry out of the MSB]
(ii) signed addition (2's complement)
B. 1111 1111 + 0000 0001 = 0000 0000; [-1] + [+1] = 0, so the 8-bit result of the
addition is all zeros [the carry out of the MSB is discarded, since this is signed
addition]. No overflow.
D. 0111 0001 + 0000 1111 = 1000 0000; [+113] + [+15] = +128. Here, the
sign bit of the result is different from the (identical) sign bits of the operands, so
overflow has occurred: +128 is outside the range of 2's complement
representation using 8 bits [-128 to +127].
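The two overflow tests above can be sketched in Python (the function name add8_flags is illustrative, not part of the problem):

```python
def add8_flags(a, b):
    """Add two 8-bit operands; report unsigned and signed (2's complement) overflow."""
    total = a + b
    result = total & 0xFF              # the 8-bit result; any ninth bit is the carry out
    unsigned_overflow = total > 0xFF   # unsigned: overflow iff there is a carry out of the MSB
    # Signed: overflow iff both operands have the same sign bit
    # but the result's sign bit differs from it.
    signed_overflow = (a & 0x80) == (b & 0x80) and (a & 0x80) != (result & 0x80)
    return result, unsigned_overflow, signed_overflow

# Case A above: unsigned overflow, but no signed overflow (operands differ in sign).
print(add8_flags(0b01101110, 0b10011111))   # (13, True, False)
```

Running cases B and D through the same function reproduces the hand results: B overflows only as an unsigned addition, D only as a signed one.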
2. One possible performance enhancement is to do a shift and add instead of an actual
multiplication. Since 9×6, for example, can be written (2×2×2+1)×6, we can calculate
9×6 by shifting 6 to the left three times and then adding 6 to that result. Show the best
way to calculate 0xAB × 0xEF using shifts and adds/subtracts. Assume both inputs
are 8-bit unsigned integers.
0xAB = 1010 1011 in binary = 2^7 + 2^5 + 2^3 + 2^1 + 2^0
We can shift 0xEF left 7 places: 111 0111 1000 0000 = 0x7780
then add 0xEF shifted left 5 places: 1 1101 1110 0000 = 0x1DE0
then add 0xEF shifted left 3 places: 0111 0111 1000 = 0x778
then add 0xEF shifted left 1 place: 1 1101 1110 = 0x1DE
and then add 0xEF = 1110 1111
0x7780 + 0x1DE0 + 0x778 + 0x1DE + 0xEF = 0x9FA5.
4 shifts, 4 adds.
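As a quick check, the shift/add decomposition can be evaluated directly in Python:

```python
# 0xAB = 1010 1011 in binary, so its set bits are at positions 7, 5, 3, 1 and 0.
x = 0xEF
product = (x << 7) + (x << 5) + (x << 3) + (x << 1) + x
print(hex(product))   # 0x9fa5, the same as 0xAB * 0xEF
```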
3. 0xDEADBEEF interpreted (i) as an unsigned integer and (ii) as an IEEE 754 single
precision floating point number:
(i) 0xDEADBEEF = 3735928559 in decimal
in binary as a 32-bit string:
1101 1110 1010 1101 1011 1110 1110 1111
(ii) in IEEE 754 single precision:
1 10111101 01011011011111011101111
Sign bit = 1, so the value is negative
Exponent field = 1011 1101 = 0xBD = 189
Value of exponent = E - 127 = 62
-1.01011011011111011101111 x 2^62
≈ -6.259853398708 x 10^18
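The reinterpretation can be checked with Python's struct module, which packs the bit pattern and unpacks it as a single precision float:

```python
import struct

# Reinterpret the 32-bit pattern 0xDEADBEEF as an IEEE 754 single precision float.
value, = struct.unpack(">f", struct.pack(">I", 0xDEADBEEF))
print(value)   # about -6.26e18
```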
4. Write down the binary representation of the decimal number 78.75 assuming the
IEEE 754 single precision format. Write down the binary representation of the decimal
number 78.75 assuming the IEEE 754 double precision format.
78.75 × 10^0 = 1001110.11 × 2^0 in binary
To normalize, move the binary point six places to the left:
1.00111011 × 2^6
sign = positive; single precision exponent field = 127 + 6 = 133 = 1000 0101
Final single precision bit pattern: 0 1000 0101 0011 1011 0000 0000 0000 000
= 0100 0010 1001 1101 1000 0000 0000 0000 = 0x429D8000
For double precision the fraction bits are the same and the exponent field is
1023 + 6 = 1029 = 100 0000 0101, giving:
0 100 0000 0101 0011 1011 0000 … 0000 (52 fraction bits in all)
= 0x4053B00000000000
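Both bit patterns can be verified by packing 78.75 with the struct module:

```python
import struct

# Pack 78.75 in big-endian IEEE 754 single and double precision.
single = struct.pack(">f", 78.75).hex()
double = struct.pack(">d", 78.75).hex()
print(single)   # 429d8000
print(double)   # 4053b00000000000
```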
5. Write down the binary representation of the decimal number 78.75 assuming it was
stored using the single precision IBM format (base 16, instead of base 2, with 7 bits of
exponent).
78.75 in decimal is equal to 0100 1110.1100 0000 in binary, i.e. 4E.C0 in hex.
We normalize the hex by shifting right one hex digit (4 bits) at a time until the digit
to the left of the radix point is 0: 0.4EC0 × 16^2
Sign bit = 0; excess-64 exponent field = 64 + 2 = 66 = 100 0010; 24-bit fraction
field = 0100 1110 1100 0000 0000 0000 (4E C0 00)
Bit pattern: 0 1000010 0100 1110 1100 0000 0000 0000 = 0x424EC000
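A quick sanity check that the normalized base-16 value really equals 78.75:

```python
# The normalized IBM-style value: hex fraction 0.4EC0 times 16**2.
frac = int("4EC0", 16) / 16 ** 4   # four hex digits after the radix point
print(frac * 16 ** 2)              # 78.75
```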
(a) Write down the bit pattern to represent −1.3625 × 10^−1. Comment on how the
range and accuracy of this 16-bit floating point format compare to the single precision
IEEE 754 standard.
−1.3625 × 10^−1 = −0.13625 × 10^0
= −0.0010001011100001… × 2^0 in binary
To normalize, move the binary point three places to the right: −1.0001011100001… × 2^−3
Rounding to the 10 fraction bits of half precision, the discarded bits 001… are less
than half a ULP, so we round down: N ≈ −1.0001011100 × 2^−3
Sign bit = 1
Exponent field = bias + exponent = 15 + (−3) = 12 = 01100
Fraction field = 0001011100
Bit pattern: 1 01100 0001011100
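Assuming the 16-bit format in (a) is IEEE half precision (1 sign, 5 exponent, 10 fraction bits), Python's struct format code "e" can check the rounding by machine:

```python
import struct

# "e" packs to IEEE 754 half precision with round-to-nearest-even.
packed = struct.pack(">e", -0.13625)
print(packed.hex())       # b05c  ->  1 01100 0001011100
value, = struct.unpack(">e", packed)
print(value)              # the nearest representable half-precision value
```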
(b) Calculate the sum of 1.6125 × 10^1 (A) and 3.150390625 × 10^−1 (B) by hand,
assuming operands A and B are stored in the 16-bit half precision format described in
problem (a) above. Assume 1 guard bit, 1 round bit, and 1 sticky bit, and round to the
nearest even. Show all the steps.
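A machine check of (b) using NumPy's IEEE half-precision type (this verifies the final rounded sum, not the intermediate guard/round/sticky steps of the by-hand method):

```python
import numpy as np

a = np.float16(1.6125e1)          # 16.125, exactly representable in half precision
b = np.float16(3.150390625e-1)    # rounds to the nearest half-precision value
s = a + b                         # the sum is rounded back to half precision
print(float(a), float(b), float(s))
```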
IEEE 754 SP has the same number of bits in the exponent field (8) as bfloat16 and
thus a comparable range (marginally larger, since SP's wider fraction field lets its
largest finite value get slightly closer to the top of the final binade).
Both IEEE 754 SP and bfloat16 have a larger range than fp16 (IEEE half precision),
which has only 5 bits in its exponent field.
IEEE 754 SP has higher precision than both fp16 and bfloat16, since its fraction
field is larger: 23 bits, compared to 10 bits for fp16 and 7 bits for bfloat16.
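NumPy has no bfloat16 type, but comparing fp16 with fp32 illustrates the point: exponent bits set the range (max), fraction bits set the precision (machine epsilon):

```python
import numpy as np

# eps = 2**-(fraction bits); max grows with the exponent-field width.
for t in (np.float16, np.float32):
    info = np.finfo(t)
    print(t.__name__, "eps =", info.eps, "max =", info.max)
```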
8. Suppose we have a 7-bit computer that uses IEEE floating-point arithmetic where a
floating point number has 1 sign bit, 3 exponent bits, and 3 fraction bits. All of the bits in
the hardware work properly.
Recall that denormalized numbers will have an exponent field of 000, and the bias for a
3-bit exponent is 2^(3−1) − 1 = 3.
(a) For each of the following, write the binary value and the corresponding decimal value
of the 7-bit floating point number that is the closest available representation of the
requested number. If rounding is necessary use round-to-nearest. Give the decimal values
either as whole numbers or fractions. The first few lines are filled in for you.
Number                                     Binary      Decimal
0                                          0 000 000   0.0
-0.125                                     1 000 100   -0.125
Smallest positive normalized number        0 001 000   0.25 (= 2^-2)
Largest positive normalized number         0 110 111   15.0 (= 1.875 x 2^3)
Smallest positive denormalized number > 0  0 000 001   0.03125 (= 1/32)
Largest positive denormalized number > 0   0 000 111   0.21875 (= 7/32)
(Exponent field 111 is reserved for infinities and NaNs, so the largest normalized
number uses exponent field 110.)
First row:
The number 0 is represented by all 0s in the Exponent field and
all 0s in the Fraction field.
Second row: -0.125: N = -0.125 in decimal corresponds to -0.001 in binary
[since 2^-3 = 0.125]. Normalized this would be -1.000 x 2^-3, but with a bias of 3
the smallest normalized exponent field, 001, already corresponds to 2^(1-3) = 2^-2,
so -0.125 cannot be stored as a normalized number. It must be stored denormalized
(exponent field 000), where the value is (-1)^S x 0.F x 2^(1-3) = (-1)^S x 0.F x 2^-2.
We need 0.F x 2^-2 = 0.125, i.e. 0.F = 0.100 (= 0.5).
Sign bit = 1, exponent field = 000, fraction field = 100.
So, the binary representation with S E F fields is: 1 000 100
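The 7-bit format can be decoded with a short Python helper (a sketch assuming IEEE-style conventions: exponent field 000 is denormalized, as stated above, and 111 is reserved for infinities and NaNs):

```python
def decode7(bits):
    """Decode a 7-bit float: 1 sign bit, 3 exponent bits (bias 3), 3 fraction bits."""
    s = (bits >> 6) & 1
    e = (bits >> 3) & 0b111
    f = bits & 0b111
    if e == 0:                       # denormalized: 0.fff x 2**(1 - bias)
        mag = (f / 8) * 2 ** (1 - 3)
    elif e == 0b111:                 # reserved for infinity / NaN
        return float("inf") if f == 0 else float("nan")
    else:                            # normalized: 1.fff x 2**(e - bias)
        mag = (1 + f / 8) * 2 ** (e - 3)
    return -mag if s else mag

print(decode7(0b1000100))   # -0.125, stored as a denormal
```

The same helper reproduces the other table entries (0 001 000 -> 0.25, 0 110 111 -> 15.0, and so on).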
(b) The associative law for addition says that a + (b + c) = (a + b) + c. This holds for
regular arithmetic, but does not always hold for floating-point numbers. Using the 7-bit
floating-point system described above, give an example of three floating-point numbers
a, b, and c for which the associative law does not hold, and show why the law does not
hold for those three numbers.
One example: let a = 0 110 000 (= 8.0) and b = c = 0 010 000 (= 0.5).
(a + b) + c: a + b = 8.5, which lies exactly halfway between 1.000 x 2^3 (= 8.0)
and 1.001 x 2^3 (= 9.0); round-to-nearest-even keeps 8.0. Adding c repeats the
same rounding, so (a + b) + c = 8.0.
a + (b + c): b + c = 1.0 exactly, and 8.0 + 1.0 = 9.0 = 1.001 x 2^3, which is
representable, so a + (b + c) = 9.0.
Since 8.0 ≠ 9.0, the associative law does not hold for these values.