Overview: This document discusses the arithmetic unit and techniques for addition, subtraction, multiplication, and division of binary numbers, along with floating-point representation. It covers: 1) ripple-carry adders/subtractors and their long delay times; 2) carry-lookahead adders, which speed up carry generation to reduce delay; 3) array and sequential multipliers for unsigned and signed operands, including Booth recoding to handle negative numbers; 4) fast multiplication techniques such as bit-pair recoding and carry-save addition; 5) restoring and non-restoring integer division; and 6) the IEEE 754 floating-point formats and arithmetic rules.


Computer Organization and Architecture

UNIT-2
Arithmetic Unit

Addition and Subtraction of Signed and Unsigned numbers:

Figure 1: Logic specification for a stage of binary addition.

Figure 1 shows the truth table for the sum and carry-out functions for adding equally weighted bits xi and yi of two numbers X and Y. The figure also shows logic expressions for these functions, along with an example of addition of the 4-bit unsigned numbers 7 and 6. Note that each stage of the addition
process must accommodate a carry-in bit. We use ci to represent the carry-in to stage i, which is
the same as the carry-out from stage (i − 1). The logic expression for si in Figure 1 can be
implemented with a 3-input XOR gate, used in Figure 2a as part of the logic required for a single
stage of binary addition. The carry-out function, ci+1, is implemented with an AND-OR circuit, as
shown. A convenient symbol for the complete circuit for a single stage of addition, called a full
adder (FA), is also shown in the figure.

A cascaded connection of n full-adder blocks can be used to add two n-bit numbers, as shown in
Figure 2b. Since the carries must propagate, or ripple, through this cascade, the configuration is
called a ripple-carry adder.
Figure 2: Logic for addition of binary numbers.
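To make the ripple-carry behaviour concrete, here is a small Python sketch (illustrative only; it models the logic expressions above, not any particular figure):

```python
def full_adder(x, y, c_in):
    """One stage of binary addition: s = x XOR y XOR c_in, and
    c_out = xy + x*c_in + y*c_in (the AND-OR carry circuit)."""
    s = x ^ y ^ c_in
    c_out = (x & y) | (x & c_in) | (y & c_in)
    return s, c_out

def ripple_carry_add(x_bits, y_bits, c0=0):
    """Cascade n full adders; bit lists are least-significant-bit first,
    so each stage's carry-out ripples into the next stage as its carry-in."""
    carry = c0
    sum_bits = []
    for x, y in zip(x_bits, y_bits):
        s, carry = full_adder(x, y, carry)
        sum_bits.append(s)
    return sum_bits, carry   # carry is the final carry-out, c_n

# 7 + 6 (0111 + 0110, LSB first) gives 1101 = 13 with carry-out 0
bits, c_n = ripple_carry_add([1, 1, 1, 0], [0, 1, 1, 0])
```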

Ripple Carry Adder/Subtractor:


The n-bit adder can be used to add 2’s-complement numbers X and Y, where the xn−1 and yn−1
bits are the sign bits. The carry-out bit cn is not part of the answer. Overflow occurs when the signs of the
two operands are the same, but the sign of the result is different. Therefore, a circuit to detect
overflow can be added to the n-bit adder.
A simpler circuit for detecting overflow can be obtained by implementing the expression
cn ⊕ cn−1 with an XOR gate.
In order to perform the subtraction operation X − Y on 2’s-complement numbers X and Y, we form
the 2’s-complement of Y and add it to X. The logic circuit shown in Figure 3 can be used to
perform either addition or subtraction based on the value applied to the Add/Sub input control line.
This line is set to 0 for addition, applying Y unchanged to one of the adder inputs along with a
carry-in signal, c0, of 0. When the Add/Sub control line is set to 1, the Y number is 1’s-
complemented (that is, bit-complemented) by the XOR gates and c0 is set to 1 to complete the 2’s-
complementation of Y. Recall that 2’s-complementing a negative number is done in exactly the
same manner as for a positive number. An XOR gate can be added to Figure 3 to detect the
overflow condition cn ⊕ cn−1.
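The adder/subtractor behaviour just described (XOR gates on the Y inputs, c0 equal to the Add/Sub control, overflow detected as cn ⊕ cn−1) can be modelled behaviourally; this sketch is illustrative, not the gate-level circuit of the figure:

```python
def add_sub(x, y, sub, n=4):
    """n-bit 2's-complement adder/subtractor model.
    sub = 0 computes X + Y; sub = 1 computes X - Y by bit-complementing Y
    (the XOR gates) and setting c0 = 1. Returns (result, overflow)."""
    mask = (1 << n) - 1
    y_in = (y ^ mask) if sub else (y & mask)              # XOR gates on Y
    c0 = sub
    low = (x & (mask >> 1)) + (y_in & (mask >> 1)) + c0   # stages 0..n-2
    c_n_minus_1 = (low >> (n - 1)) & 1                    # carry into sign stage
    total = (x & mask) + y_in + c0
    c_n = (total >> n) & 1                                # carry out of sign stage
    return total & mask, c_n ^ c_n_minus_1                # overflow = cn XOR cn-1

# 5 + 4 overflows in 4 bits (the true result exceeds +7); 3 - 5 = -2 does not
```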

Disadvantage in Ripple carry adder/subtractor:


All sum bits are available after 2n − 1 gate delays.
The final carry-out, cn, is available after 2n gate delays.
To overcome this we use Fast Adder.
Carry-Lookahead Adder (Fast Adder):
A fast adder circuit must speed up the generation of the carry signals. The logic expressions for
si (sum) and ci+1 (carry-out) of stage i are
si = xi ⊕ yi ⊕ ci
and
ci+1 = xiyi + xici + yici
Factoring the second equation into
ci+1 = xiyi + (xi ⊕ yi )ci
we can write
ci+1 = Gi + Pici
where
Gi = xiyi and Pi = xi ⊕ yi
The expressions Gi and Pi are called the generate and propagate functions for stage i.
Each bit stage contains an AND gate to form Gi, an XOR gate to form Pi, and a three-input XOR
gate to form si. (In practice, Pi can also be realized with an OR gate, Pi = xi + yi, since the case xi = yi = 1 is already covered by Gi.)
Pi = xi ⊕ yi: a carry propagates through stage i only when exactly one of xi and yi is equal to 1.
Gi = xiyi: a carry is generated only when both xi and yi are equal to 1.

Thus, all carries can be obtained three gate delays after the input operands X , Y , and c0 are applied
because only one gate delay is needed to develop all Pi and Gi signals, followed by two gate delays
in the AND-OR circuit for ci+1. After a further XOR gate delay, all sum bits are available. In total,
the n-bit addition process requires only four gate delays, independent of n.

Let us consider the design of a 4-bit adder. The sum and carries can be implemented as
When i=0
S0 = x0 ⊕ y0 ⊕ c0
c1 = G0 + P0c0
When i=1
S1 = x1 ⊕ y1 ⊕ c1
c2 = G1 + P1G0 + P1P0c0
When i=2
S2 = x2 ⊕ y2 ⊕ c2
c3 = G2 + P2G1 + P2P1G0 + P2P1P0c0
When i=3
S3 = x3 ⊕ y3 ⊕ c3
c4 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0c0
The complete 4-bit adder is shown in Figure 9.4b. The carries are produced in the block labeled
carry-lookahead logic. An adder implemented in this form is called a carry-lookahead adder.
Delay through the adder is 3 gate delays for all carry bits and 4 gate delays for all sum bits. In
comparison, a 4-bit ripple-carry adder requires 7 gate delays for s3 and 8 gate delays for c4.
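A behavioural sketch of the 4-bit carry-lookahead adder, computing the carries directly from the expanded equations above (illustrative only):

```python
def cla_4bit(x, y, c0=0):
    """4-bit carry-lookahead adder model: all carries are formed from the
    generate (G_i = x_i y_i) and propagate (P_i = x_i XOR y_i) signals."""
    xb = [(x >> i) & 1 for i in range(4)]
    yb = [(y >> i) & 1 for i in range(4)]
    G = [a & b for a, b in zip(xb, yb)]
    P = [a ^ b for a, b in zip(xb, yb)]
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))
    s = [P[i] ^ c for i, c in enumerate([c0, c1, c2, c3])]  # s_i = P_i XOR c_i
    return sum(bit << i for i, bit in enumerate(s)), c4     # (sum, carry-out)

# 7 + 6 = 13 with no carry-out; 9 + 9 = 18 wraps to 2 with carry-out 1
```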

Multiplication of Unsigned Numbers:


The product of two unsigned n-digit numbers can be accommodated in 2n digits, so the
product of the two 4-bit numbers in this example is accommodated in 8 bits, as shown.

The product is computed one bit at a time by adding the bit columns from right to left and
propagating carry values between columns.

Array Multiplication:

 Binary multiplication of unsigned operands can be implemented in a combinational, two-dimensional logic array.
 The main component in each cell is a full adder, FA. The AND gate in each cell
determines whether a multiplicand bit, mj, is added to the incoming partial-product bit,
based on the value of the multiplier bit, qi.
 The multiplicand is shifted left one position per row by the diagonal signal path. We note
that the row-by-row addition done in the array circuit differs from the usual hand addition
described previously, which is done column-by-column.
 Assuming that there are two gate delays from the inputs to the outputs of a full-adder
block, FA, the critical path has a total of 6(n − 1) − 1 gate delays, including the initial
AND gate delay in all cells, for an n × n array.
Sequential Circuit Multiplier:

• The combinational array multiplier just described uses a large number of logic gates for
multiplying numbers of practical size, such as 32- or 64-bit numbers.
• Multiplication of two n-bit numbers can also be performed in a sequential circuit that uses
a single n-bit adder.
• Registers A and Q are shift registers. Together, they hold the partial product PPi.
• Multiplier bit qi generates the signal Add/Noadd.
• This signal causes the multiplexer MUX to select 0 when qi = 0, or to select the
multiplicand M when qi = 1, to be added to PPi to generate PP(i + 1).
• The product is computed in n cycles.
• The carry-out from the adder is stored in flip-flop C.
• At the start, the multiplier is loaded into register Q, the multiplicand into register M, and C
and A are cleared to 0.
• At the end of each cycle, C, A, and Q are shifted right one bit position to allow for growth
of the partial product as the multiplier is shifted out of register Q.
• After n cycles, the high-order half of the product is held in register A and the low-order
half is in register Q.
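The cycle-by-cycle behaviour described above can be sketched as follows (a behavioural model of the C, A, and Q registers for unsigned operands; illustrative only):

```python
def sequential_multiply(M, Q, n):
    """Shift-and-add multiplier for unsigned n-bit operands using a single
    n-bit adder. C models the carry flip-flop; A and Q are shift registers."""
    mask = (1 << n) - 1
    C, A, Q = 0, 0, Q & mask
    for _ in range(n):
        if Q & 1:                          # Add/Noadd: MUX selects M when q0 = 1
            total = A + (M & mask)         # the single n-bit adder
            A, C = total & mask, total >> n
        # shift C, A, Q right one bit position; q0 is shifted out
        cav = (C << 2 * n) | (A << n) | Q
        cav >>= 1
        C, A, Q = 0, (cav >> n) & mask, cav & mask
    return (A << n) | Q                    # high half in A, low half in Q

# 13 x 11 = 143 with n = 4
```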
Multiplication of Signed Numbers:

• The general strategy is still to accumulate partial products by adding versions of the
multiplicand as selected by the multiplier bits.
• First, consider the case of a positive multiplier and a negative multiplicand. When we add
a negative multiplicand to a partial product, we must extend the sign-bit value of the
multiplicand to the left as far as the product will extend.
• For a negative multiplier, a straightforward solution is to form the 2’s-complement of
both the multiplier and the multiplicand and proceed as in the case of a positive multiplier.
• A technique that works equally well for both negative and positive multipliers, is Booth
algorithm.

Booth Algorithm:

• The Booth algorithm generates a 2n-bit product and treats both positive and negative
2’s-complement n-bit operands uniformly.
• The multiplier is converted into Booth recoded multiplier.
• The Multiplicand is multiplied with the recoded multiplier.
• The MSB of the result indicates the sign of the result.
• If sign of the result is 1(Negative), the magnitude contained in the remaining bits is in 2’s
complement form.

Booth recoded multiplier

• The case when the least significant bit of the multiplier is 1 is handled by assuming that an
implied 0 lies to its right.
• The Booth multiplier recoding table is shown below.
Example 1:

Multiply 45 and 30
• Given multiplier is 0 1 1 1 1 0 (30)
• Booth recoded multiplier is formed by appending a 0 to the right of the multiplier and
using Booth recoded table as +1 0 0 0 -1 0
• The multiplicand is multiplied with Booth recoded multiplier.
Example 2:

Multiply +13 and -6


• Given multiplier is 1 1 0 1 0 (-6)
• Booth recoded multiplier is formed by appending a 0 to the right of the multiplier and
using Booth recoded table as 0 -1 +1 -1 0
• The multiplicand is multiplied with Booth recoded multiplier.
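The recoding rule used in both examples (each Booth digit equals b(i−1) − b(i), with an implied 0 to the right of the least significant bit) can be sketched as:

```python
def booth_recode(bits):
    """Booth-recode a 2's-complement multiplier given as a bit list, MSB
    first. Digit i is b_{i-1} - b_i, using an implied 0 past the LSB."""
    ext = list(bits) + [0]                 # append the implied 0
    return [ext[i + 1] - ext[i] for i in range(len(bits))]

# 0 1 1 1 1 0 (+30) recodes to +1 0 0 0 -1 0   (Example 1)
# 1 1 0 1 0  (-6)  recodes to  0 -1 +1 -1 0    (Example 2)
```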

Advantages of Booth algorithm:

 The Booth algorithm has two attractive features. First, it handles both positive and
negative multipliers uniformly.
• Second, it achieves some efficiency in the number of additions required when the
multiplier has a few large blocks of 1s.

Drawbacks:

• The number of summands is equal to size of the number (n-bits).


• Takes more time to multiply.
Drawbacks are overcome using Fast Multipliers

Fast Multipliers:

Two techniques for speeding up the multiplication operation are


1. Bit-Pair Recoding of Multipliers - a technique that guarantees that the maximum
number of summands (versions of the multiplicand) that must be added is n/2 for n-bit
operands.
2. Carry-Save Addition of Summands - A technique that leads to adding the summands
in parallel.

Bit-Pair Recoding of Multipliers:

• Bit-pair recoding of the multiplier results in using at most one summand for each pair of
bits in the multiplier. It is derived directly from the Booth algorithm.
• Group the Booth-recoded multiplier bits in pairs
• If the Booth-recoded multiplier is examined two bits at a time, starting from the right, it
can be rewritten in a form that requires at most one version of the multiplicand to be
added to the partial product for each pair of multiplier bits.
• An example of bit-pair recoding of the multiplier
Problem1: Compute the product of -14 and +12 using bit-pair recoding

-14 × +12 = 10010 × 01100; the bit-pair recoded multiplier is +1 -1 0

1 0 0 1 0 × (+1 -1 0)
----------------------------------
0 0 0 0 0 0 0 0 0 0      ( 0 × 10010)
0 0 0 0 1 1 1 0 0 0      (-1 × 10010, shifted left 2)
1 1 0 0 1 0 0 0 0 0      (+1 × 10010, shifted left 4)
----------------------------------
1 1 0 1 0 1 1 0 0 0      (-168)
Problem2: Compute the product of +15 and -13 using bit-pair recoding

+15 × -13 = 01111 × 10011; the bit-pair recoded multiplier is -1 +1 -1

0 1 1 1 1 × (-1 +1 -1)
-----------------------------------
1 1 1 1 1 1 0 0 0 1      (-1 × 01111)
0 0 0 0 1 1 1 1 0 0      (+1 × 01111, shifted left 2)
1 1 0 0 0 1 0 0 0 0      (-1 × 01111, shifted left 4)
-----------------------------------
1 1 0 0 1 1 1 1 0 1      (-195)
Problem3: Compute the product of -17 and -20 using bit-pair recoding

-17 × -20 = 101111 × 101100; the bit-pair recoded multiplier is -1 -1 0

1 0 1 1 1 1 × (-1 -1 0)
-----------------------------------
0 0 0 0 0 0 0 0 0 0 0 0      ( 0 × 101111)
0 0 0 0 0 1 0 0 0 1 0 0      (-1 × 101111, shifted left 2)
0 0 0 1 0 0 0 1 0 0 0 0      (-1 × 101111, shifted left 4)
-----------------------------------
0 0 0 1 0 1 0 1 0 1 0 0      (+340)
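The pairing rule behind these problems can be checked with a short sketch; the digit formula b(2i−1) + b(2i) − 2·b(2i+1) is the standard radix-4 (bit-pair) recoding, shown here only to verify the hand computations:

```python
def bit_pair_recode(q, n):
    """Bit-pair recode an n-bit 2's-complement multiplier q (n even).
    Each digit is b_{2i-1} + b_{2i} - 2*b_{2i+1}, in {-2,-1,0,+1,+2};
    digits are returned least significant first."""
    b = [0] + [(q >> i) & 1 for i in range(n)]   # b[0] is the implied 0
    return [b[i] + b[i + 1] - 2 * b[i + 2] for i in range(0, n, 2)]

def recoded_value(digits):
    """Value of a signed-digit base-4 number, for checking the recoding."""
    return sum(d * 4 ** i for i, d in enumerate(digits))

# -20 = 101100 recodes to digits 0, -1, -1 (LSB first), i.e. -1 -1 0 as
# written in Problem 3, and the recoded digits still evaluate to -20
```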

Carry-Save Addition of Summands

 Multiplication requires the addition of several summands. A technique called carry-save


addition (CSA) can be used to speed up the process.
 Consider the 4 × 4 multiplication array shown in Figure 9.16a. This structure is in the form of
the array shown in Figure 9.6, in which the first row consists of just the AND gates that
produce the four inputs m3q0, m2q0, m1q0, and m0q0.
 Instead of letting the carries ripple along the rows, they can be “saved” and introduced into
the next row, at the correct weighted positions, as shown in Figure 9.16b. This frees up an
input to each of three full adders in the first row.
 These inputs can be used to introduce the third summand bits m2q2, m1q2, and m0q2.
 Now, two inputs of each of three full adders in the second row are fed by the sum and carry
outputs from the first row.
 The third input is used to introduce the bits m2q3, m1q3, and m0q3 of the fourth summand.
 The high-order bits m3q2 and m3q3 of the third and fourth summands are introduced into the
remaining free full-adder inputs at the left end in the second and third rows.
 The saved carry bits and the sum bits from the second row are now added in the third row,
which is a ripple-carry adder, to produce the final product bits.
 The delay through the carry-save array is somewhat less than the delay through the ripple-
carry array. This is because the S and C vector outputs from each row are produced in parallel
in one full-adder delay.
A more significant reduction in delay can be achieved when dealing with longer operands
than those considered.
We can group the summands in threes and perform carry-save addition on each of these
groups in parallel to generate a set of S and C vectors in one full-adder delay. Here, we will
refer to a full-adder circuit as simply an adder.
Next, we group all the S and C vectors into threes, and perform carry-save addition on them,
generating a further set of S and C vectors in one more adder delay.
We continue with this process until there are only two vectors remaining.
The adder at each bit position of the three summands is called a 3-2 reducer, and the logic
circuit structure that reduces a number of summands to two is called a CSA tree.
The final two S and C vectors can be added in a carry-lookahead adder to produce the desired
product.
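The 3-2 reduction and the CSA tree can be modelled at word level (using Python integers as the S and C vectors; illustrative only):

```python
def carry_save_add(x, y, z):
    """3-2 reducer applied across a whole word: three summands in, a sum
    vector S and a carry vector C out, with x + y + z == S + C."""
    s = x ^ y ^ z                               # per-bit sums, no rippling
    c = ((x & y) | (x & z) | (y & z)) << 1      # saved carries, re-weighted
    return s, c

def csa_tree_sum(summands):
    """Repeatedly reduce the summands with carry-save addition until only
    two vectors remain, then add those with an ordinary (e.g. carry-
    lookahead) adder, modelled here by Python's + operator."""
    vals = list(summands)
    while len(vals) > 2:
        s, c = carry_save_add(vals[0], vals[1], vals[2])
        vals = vals[3:] + [s, c]
    return sum(vals)                            # the final 2-input addition
```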

A multiplication example used to illustrate carry-save addition

The multiplication example from above Figure performed using carry-save addition.
Tree Schematic representation of the carry-save addition operations
Integer Division:
Manual Division:
Longhand division examples:

A circuit that implements division by this longhand method operates as follows:


• It positions the divisor appropriately with respect to the dividend and performs a
subtraction.
• If the remainder is zero or positive, a quotient bit of 1 is determined, the remainder is
extended by another bit of the dividend, the divisor is repositioned, and another
subtraction is performed.
• If the remainder is negative, a quotient bit of 0 is determined, the dividend is restored by
adding back the divisor, and the divisor is repositioned for another subtraction.
Restoring Division and Non-Restoring Division Circuit Arrangement:

Circuit arrangement for binary division.


• An n-bit positive divisor is loaded into register M and an n-bit positive dividend is loaded
into register Q at the start of the operation.
• Register A is set to 0.
• After the division is complete, the n-bit quotient is in register Q and the remainder is in
register A.
• The required subtractions are facilitated by using 2’s-complement arithmetic.
• The extra bit position at the left end of both A and M accommodates the sign bit during
subtractions.
Algorithm to perform restoring division:
• Do the following three steps n times:
1. Shift A and Q left one bit position.
2. Subtract M from A, and place the answer back in A.
3. If the sign of A is 1, set q0 to 0 and add M back to A (i.e., restore A); otherwise, set q0 to 1.
Flow chart:
A restoring division example.
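The three steps can be modelled directly (a behavioural sketch for positive n-bit operands; illustrative only):

```python
def restoring_divide(dividend, divisor, n):
    """Restoring division of positive n-bit integers. A accumulates the
    remainder, Q holds the dividend and then the quotient, M the divisor."""
    A, Q, M = 0, dividend, divisor
    for _ in range(n):
        # 1. Shift A and Q left one bit position
        A = (A << 1) | ((Q >> (n - 1)) & 1)
        Q = (Q << 1) & ((1 << n) - 1)
        # 2. Subtract M from A, placing the answer back in A
        A -= M
        # 3. If A went negative, set q0 = 0 and restore A; else set q0 = 1
        if A < 0:
            A += M
        else:
            Q |= 1
    return Q, A        # (quotient, remainder)

# 8 / 3 with n = 4 gives quotient 2, remainder 2
```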

Non-Restoring Division:
The restoring division algorithm can be improved by avoiding the need for restoring A after an
unsuccessful subtraction. Subtraction is said to be unsuccessful if the result is negative.
Consider the sequence of operations that takes place after the subtraction operation in the
preceding algorithm.
• If A is positive, we shift left and subtract M, that is, we perform 2A− M.
• If A is negative, we restore it by performing A+ M, and then we shift it left and subtract
M. This is equivalent to performing 2A+ M.
• The q0 bit is appropriately set to 0 or 1 after the correct operation has been performed.
• The following algorithm for non-restoring division summarizes the above discussion.
Algorithm to perform Non-Restoring division:
Stage 1:
Do the following two steps n times:
1. If the sign of A is 0, shift A and Q left one bit position and subtract M from A; otherwise,
shift A and Q left and add M to A.
2. Now, if the sign of A is 0, set q0 to 1; otherwise, set q0 to 0.
Stage 2:
If the sign of A is 1, add M to A.
Flow chart:

A non-restoring division example.
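A matching behavioural sketch of the non-restoring algorithm (Stage 1 repeated n times, then the single corrective add of Stage 2; illustrative only):

```python
def non_restoring_divide(dividend, divisor, n):
    """Non-restoring division of positive n-bit integers: one add or one
    subtract per cycle, with at most one corrective add at the end."""
    A, Q, M = 0, dividend, divisor
    for _ in range(n):
        msb = (Q >> (n - 1)) & 1
        Q = (Q << 1) & ((1 << n) - 1)
        if A >= 0:
            A = (A << 1) + msb - M      # sign of A is 0: shift left, subtract
        else:
            A = (A << 1) + msb + M      # sign of A is 1: shift left, add
        if A >= 0:
            Q |= 1                      # q0 = 1; otherwise q0 stays 0
    if A < 0:                           # Stage 2: final restore of A
        A += M
    return Q, A        # (quotient, remainder)

# 8 / 3 with n = 4 again gives quotient 2, remainder 2
```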


Floating Point number representation:

A floating-point number (or real number) can represent a very large value (1.23×10^88) or a
very small value (1.23×10^-88). It can also represent a very large negative number
(-1.23×10^88) and a very small negative number (-1.23×10^-88), as well as zero, as
illustrated:
A floating-point number is typically expressed in the scientific notation, with a fraction (M),
and an exponent (E) of a certain radix (r), in the form of M×r^E. Decimal numbers use
radix of 10 (M×10^E); while binary numbers use radix of 2 (M×2^E).
Representation of floating point number is not unique. For example, the
number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so
on. The fractional part can be normalized. In the normalized form, there is only a single
non-zero digit before the radix point. For example, decimal number 123.4567 can be
normalized as 1.234567×10^2; binary number 1010.1011B can be normalized
as 1.0101011B×2^3.
It is important to note that floating-point numbers suffer from loss of precision when
represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are
infinitely many real numbers (even within a small range, say 0.0 to 0.1). On the
other hand, an n-bit binary pattern can represent only a finite set of 2^n distinct numbers. Hence, not
all real numbers can be represented; the nearest approximation is used instead,
resulting in a loss of accuracy.
It is also important to note that floating-point arithmetic is much less efficient than
integer arithmetic. It can be sped up with a dedicated floating-point co-processor. Hence, use integers if your application does not require floating-point numbers.
In computers, floating-point numbers are represented in scientific notation of fraction (M)
and exponent (E) with a radix of 2, in the form of M×2^E. Both E and M can be positive as
well as negative.
Modern computers adopt IEEE 754 standard for representing floating-point
numbers. There are two representation schemes: 32-bit single-precision and 64-bit
double-precision.

A binary floating-point number can be represented by


• A sign for the number
• Some significant bits
• A signed scale factor exponent for an implied base of 2

Normalised and Un-normalised number:


Normalised:
 The sign bit represents the sign of the number, with S=0 for positive and S=1 for negative numbers.
 In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is because
we need to represent both positive and negative exponents. With an 8-bit E ranging from 0 to 255,
the excess-127 scheme could provide actual exponents of -127 to 128 (the two extremes are reserved for special cases).
 For 1 ≤ E ≤ 254, N = (-1)^S × 1.M × 2^(E-127). These numbers are in the so-
called normalized form. The sign bit represents the sign of the number. The fractional part (1.M) is
normalized with an implicit leading 1. The exponent is biased (in excess) by 127, so as to represent
both positive and negative exponents. The range of the actual exponent is -126 to +127.

Un-normalised:

 De-normalized form was devised to represent zero and other numbers very close to zero.


 For E=0, the numbers are in the de-normalized form. An implicit leading 0 (instead of 1) is used for
the fraction; and the actual exponent is always -126. Hence, the number zero can be represented
with E=0 and M=0 (because 0.0×2^-126=0).
 We can also represent very small positive and negative numbers in un-normalized form with E=0.
For example, if S=1, E=0, and M=011 0000 0000 0000 0000 0000 . The actual fraction
is 0.011=1×2^-2+1×2^-3=0.375D. Since S=1, it is a negative number. With E=0, the actual exponent
is -126. Hence the number is -0.375×2^-126 = -4.4×10^-39, which is an extremely small negative
number (close to zero).
 For E = 0, N = (-1)^S × 0.M × 2^(-126). These numbers are in the so-called un-
normalized form. The factor 2^-126 evaluates to a very small number. Un-normalized form
is needed to represent zero (with M=0 and E=0). It can also represent very small positive and
negative numbers close to zero.

32-bit single-precision floating-point number:

A binary floating-point number can be represented by


 The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
 The following 8 bits represent the exponent (E).
 The remaining 23 bits represent the fraction (M).
In the basic IEEE 32-bit format, the leftmost bit represents the sign, S, of
the number. The next 8 bits, E, represent the signed exponent of the scale factor (with an
implied base of 2), and the remaining 23 bits, M, are the fractional part of the significant
bits. The full 24-bit string, B, of significant bits, called the mantissa, always has a leading
1, with the binary point immediately to its right.
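The field layout described above can be sketched in Python; the struct-based check at the end uses the host machine's own IEEE 754 decoding (the E = 255 special cases are handled only in simplified form here):

```python
import struct

def decode_float32(pattern):
    """Split a 32-bit IEEE 754 pattern into S, E, M and evaluate it.
    Normalized (1 <= E <= 254):  N = (-1)^S * 1.M * 2^(E-127).
    De-normalized (E = 0):       N = (-1)^S * 0.M * 2^(-126)."""
    S = (pattern >> 31) & 0x1
    E = (pattern >> 23) & 0xFF
    M = pattern & 0x7FFFFF
    if 1 <= E <= 254:
        value = (-1) ** S * (1 + M / 2 ** 23) * 2.0 ** (E - 127)
    elif E == 0:
        value = (-1) ** S * (M / 2 ** 23) * 2.0 ** (-126)
    else:                                   # E = 255: infinity or NaN
        value = float("-inf" if S else "inf") if M == 0 else float("nan")
    return S, E, M, value

# The pattern 0 10000000 110...0 from Example 1 below decodes to +3.5, and
# the host's own float32 decoding agrees:
pattern = 0b0_10000000_11000000000000000000000
assert decode_float32(pattern)[3] == struct.unpack(">f", struct.pack(">I", pattern))[0] == 3.5
```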

Examples:

1. Convert (3.5)10 to IEEE 754 single-precision floating-point format.

(3.5)10 = (11.1)2 (integer part: 3 = (11)2; fraction: 0.5 × 2 = 1.0, giving .1)
Normalized form: 1.11 × 2^1
N = ±1.M × 2^(E'-127)
E = E'-127 ⇒ E' = E+127 = 1+127 = (128)10 = (10000000)2
S = 0 because it is a positive number
M = 1100000…….0
0 10000000 11000000000000000000000
Sign Exponent Mantissa
2. Suppose that the IEEE-754 32-bit floating-point representation pattern is
0 10000000 110 0000 0000 0000 0000 0000.

Equation: N = (-1)^S × (1+M) × 2^(E'-127)

Sign bit S = 0 ⇒ positive number
E' = 1000 0000B = 128D (normalized form); actual exponent E = E'-127 = 1
Fraction is 1.11B (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75
The number is +1.75 × 2^(128-127) = +3.5
3. Convert (-0.75)10 to IEEE 754 single-precision floating-point format.
-(0.75)10 = -(0.11)2 (0.75 × 2 = 1.50; 0.50 × 2 = 1.00)
Normalized form: 1.1 × 2^-1
N = ±1.M × 2^(E'-127)
E = E'-127 ⇒ E' = E+127 = -1+127 = (126)10 = (01111110)2
S = 1 because it is a negative number
M = 1000000…….0
1 01111110 10000000000000000000000
Sign Exponent Mantissa
4. Suppose that the IEEE-754 32-bit floating-point representation pattern is
1 01111110 100 0000 0000 0000 0000 0000.
Equation: N = (-1)^S × (1+M) × 2^(E'-127)

Sign bit S = 1 ⇒ negative number

E' = 0111 1110B = 126D (normalized form); actual exponent E = E'-127 = -1
Fraction is 1.1B (with an implicit leading 1) = 1 + 2^-1 = 1.5D
The number is -1.5 × 2^(126-127) = -0.75D

64-bit Double Precision IEEE 754 standard format for Floating point representation:

The representation scheme for 64-bit double-precision is similar to the 32-bit single-
precision:
 The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for
negative numbers.
 The following 11 bits represent exponent (E).
 The remaining 52 bits represent the fraction (M).
The value (N) is calculated as follows:
 Normalized form: For 1 ≤ E ≤ 2046, N = (-1)^S × 1.M × 2^(E-1023).
 Un-normalized form: For E = 0, N = (-1)^S × 0.M × 2^(-1022). These are in the Un-normalized
form.
 For E = 2047, N represents special values, such as ±INF (infinity), NaN (not a number).

We note two basic aspects of operating with floating-point numbers.


• First, if a number is not normalized, it can be put in normalized form by shifting the binary
point and adjusting the exponent.
• Second, as computations proceed, a number might be generated that does not fall in the
representable range of normal numbers. In single precision, this means that its
normalized representation requires an exponent less than −126 or greater than +127. In
the first case, we say that underflow has occurred; in the second case, we say that
overflow has occurred.
Examples:

1. Convert (3.5)10 to IEEE 754 double-precision floating-point format.

(3.5)10 = (11.1)2 (integer part: 3 = (11)2; fraction: 0.5 × 2 = 1.0, giving .1)
Normalized form: 1.11 × 2^1
N = ±1.M × 2^(E'-1023)
E = E'-1023 ⇒ E' = E+1023 = 1+1023 = (1024)10 = (10000000000)2
S = 0 because it is a positive number
M = 1100000…….0
0 10000000000 11000000000000000000000…….0
Sign Exponent Mantissa
2. Suppose that the IEEE-754 64-bit floating-point representation pattern is
0 10000000000 1100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000.

Equation: N = (-1)^S × (1+M) × 2^(E'-1023)

Sign bit S = 0 ⇒ positive number

E' = 100 0000 0000B = 1024D (normalized form); actual exponent E = E'-1023 = 1
Fraction is 1.11B (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75
The number is +1.75 × 2^(1024-1023) = +3.5

3. Convert (-0.75)10 to IEEE 754 double-precision floating-point format.

-(0.75)10 = -(0.11)2 (0.75 × 2 = 1.50; 0.50 × 2 = 1.00)
Normalized form: 1.1 × 2^-1
N = ±1.M × 2^(E'-1023)
E = E'-1023 ⇒ E' = E+1023 = -1+1023 = (1022)10 = (011 1111 1110)2
S = 1 because it is a negative number
M = 1000000…….0
1 01111111110 10000000000000000000000…..0
Sign Exponent Mantissa
4. Suppose that the IEEE-754 64-bit floating-point representation pattern is
1 01111111110 1000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000.

Equation: N = (-1)^S × (1+M) × 2^(E'-1023)

Sign bit S = 1 ⇒ negative number

E' = 011 1111 1110B = 1022D (normalized form); actual exponent E = E'-1023 = -1
Fraction is 1.1B (with an implicit leading 1) = 1 + 2^-1 = 1.5D
The number is -1.5 × 2^(1022-1023) = -0.75D
Arithmetic Operations on Floating-Point Numbers

Add/Subtract Rule:
1. Choose the number with the smaller exponent and shift its mantissa right a number of steps equal
to the difference in exponents.
2. Set the exponent of the result equal to the larger exponent.
3. Perform addition/subtraction on the mantissas and determine the sign of the result.
4. Normalize the resulting value, if necessary.
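For positive operands, the four steps can be sketched by representing each number as a biased exponent and a 24-bit integer significand with the binary point implied after bit 23 (an illustrative model, not hardware):

```python
def fp_add(e1, m1, e2, m2):
    """Add-rule sketch for positive operands. Each operand is a biased
    exponent e and a 24-bit significand m (so m is in [2**23, 2**24))."""
    if e1 < e2:                 # make operand 1 the one with the larger exponent
        e1, m1, e2, m2 = e2, m2, e1, m1
    m2 >>= (e1 - e2)            # step 1: shift the smaller mantissa right
    e = e1                      # step 2: take the larger exponent
    m = m1 + m2                 # step 3: add the mantissas
    if m >= (1 << 24):          # step 4: normalize if necessary
        m >>= 1
        e += 1
    return e, m

# 3.0 (1.5 x 2^1, e = 128) + 1.0 (1.0 x 2^0, e = 127) = 4.0 (1.0 x 2^2),
# i.e. the call returns (129, 0x800000)
```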

Multiply Rule:
1. Add the exponents and subtract 127 to maintain the excess-127 representation.
2. Multiply the mantissas and determine the sign of the result.
3. Normalize the resulting value, if necessary.
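The multiply rule can likewise be sketched with signs, biased (excess-127) exponents, and 24-bit integer significands (binary point implied after bit 23; illustrative only):

```python
def fp_multiply(s1, e1, m1, s2, e2, m2):
    """Multiply-rule sketch: XOR the signs, add the exponents and subtract
    127 to stay in excess-127 form, multiply the significands, normalize."""
    s = s1 ^ s2
    e = e1 + e2 - 127
    m = (m1 * m2) >> 23         # realign the product's binary point
    if m >= (1 << 24):          # normalize: shift right, bump the exponent
        m >>= 1
        e += 1
    return s, e, m

# 3.0 (1.5 x 2^1) times 1.25 (1.25 x 2^0) = 3.75 (1.875 x 2^1):
# returns (0, 128, 0xF00000)
```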

Divide Rule:
1. Subtract the exponents and add 127 to maintain the excess-127 representation.
2. Divide the mantissas and determine the sign of the result.
3. Normalize the resulting value, if necessary.
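The divide rule can be sketched in the same representation (sign, excess-127 exponent, 24-bit significand; truncating integer division stands in for the mantissa divider):

```python
def fp_divide(s1, e1, m1, s2, e2, m2):
    """Divide-rule sketch: subtract the exponents and add 127 to stay in
    excess-127 form, divide the significands, and normalize."""
    s = s1 ^ s2
    e = e1 - e2 + 127
    m = (m1 << 23) // m2        # quotient of 1.x significands, point after bit 23
    if m < (1 << 23):           # quotient lies in [0.5, 2): at most one left shift
        m <<= 1
        e -= 1
    return s, e, m

# 3.75 (1.875 x 2^1) divided by 1.25 (1.25 x 2^0) = 3.0 (1.5 x 2^1):
# returns (0, 128, 0xC00000)
```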
