4.16. Floating Point


4.16.1. The Basics

Floating point gets around the limitations of fixed point by using a format similar to scientific
notation.

3.52 x 10^3 = 3520

A scientific notation number, as you probably know, consists of a mantissa (3.52 in the example
above), a radix (always 10), and an exponent (3 in the example above). Hence, the general format of
a scientific notation value is:

mantissa x radix^exponent

The normalized form always has a mantissa greater than or equal to 1.0, and less than 10.0. We can
denormalize the value and express it in many other ways, such as 35.2 x 10^2, or 0.00352 x 10^6. For
each position we shift the digits of the mantissa relative to the decimal point, we increase or
decrease the value of the mantissa by a factor of 10. To compensate for this, we simply increase or
decrease the exponent by 1. Denormalizing is necessary when adding scientific notation values:

  3.52 x 10^3
+ 1.97 x 10^5
-----------------------------------------

  0.0352 x 10^5
+ 1.97   x 10^5
-----------------------------------------
  2.0052 x 10^5

Adjusting the mantissa and exponent is also sometimes necessary to normalize results. For
example, 9.9 x 10^2 + 9.9 x 10^2 is 19.8 x 10^2, which must be normalized to 1.98 x 10^3.
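
To make the denormalize-add-normalize sequence concrete, here is a minimal Python sketch, assuming values are held as (mantissa, exponent) pairs; the function name is illustrative, not part of the text:

    def add_sci(a, b, radix=10):
        """Add two (mantissa, exponent) pairs, e.g. (3.52, 3) for 3.52 x 10^3."""
        (ma, ea), (mb, eb) = a, b
        # Denormalize the value with the smaller exponent so the exponents match.
        if ea < eb:
            ma, ea = ma / radix ** (eb - ea), eb
        elif eb < ea:
            mb, eb = mb / radix ** (ea - eb), ea
        m, e = ma + mb, ea
        # Normalize the result so that 1.0 <= |mantissa| < radix.
        while abs(m) >= radix:
            m, e = m / radix, e + 1
        while 0 < abs(m) < 1.0:
            m, e = m * radix, e - 1
        return m, e

    print(add_sci((3.52, 3), (1.97, 5)))   # approximately (2.0052, 5)
    print(add_sci((9.9, 2), (9.9, 2)))     # approximately (1.98, 3)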

A binary floating system stores a signed binary mantissa and a signed binary exponent, and usually
uses a radix of 2. Using a radix of 2 (or any power of 2) allows us to normalize and denormalize by
shifting the binary digits in the mantissa and adjusting the integer exponent on the radix of 2.
(Shifting binary digits in the mantissa n bits to the left or right multiplies or divides the
mantissa by 2^n.)

00010₂ x 2^3 = 01000₂ x 2^1
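
A quick sanity check of that identity using Python's integer shifts:

    mantissa, exponent = 0b00010, 3          # 00010 (binary) x 2^3 = 16
    shifted = mantissa << 2                  # shift 2 bits left: 01000 (binary)
    assert mantissa * 2 ** exponent == shifted * 2 ** (exponent - 2) == 16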


The standard floating point formats are defined by the IEEE. The IEEE formats are slightly
more complex than necessary for understanding floating point in general, so we will start with a simpler
example here.

4.16.2. A Simple Floating Point Format

Suppose a 32-bit floating point format has a 24-bit two's complement mantissa, an 8-bit two's
complement exponent, and a radix of 2. The general structure is:

mantissa x 2^exponent

Where mantissa is a 24-bit two's complement integer, and exponent is an 8-bit two's complement
integer.

The binary format is as follows:

Table 4.5. Floating Point Format

Mantissa                      Exponent
24 bits, two's complement     8 bits, two's complement

1. What is the value of the following number?


000000000000000000010010 11111100

The mantissa is 000000000000000000010010, or +(2 + 16) = +18.

The exponent is 11111100 = -(00000011 + 1) = -(00000100) = -4.

The value is therefore +18 x 2^-4 = 1.125. (A short Python sketch after these examples checks this
conversion and the one in example 6.)

2. What is the largest positive value we can represent in this system?

The largest positive value will consist of the largest positive mantissa and the largest positive
exponent.

The largest mantissa is 011111111111111111111111, which in two's complement is +2^23 - 1
(+8388607). The largest exponent is 01111111, which in two's complement is +2^7 - 1 (+127).

Hence, the largest positive value is +8388607 x 2^+127, or about 1.42 x 10^45.

3. What is the second largest positive value? What is the difference between the largest and
second largest?

4. What is the smallest positive value?


To find the smallest positive value in the form mantissa x radix^exponent, we choose the
smallest positive mantissa, and the smallest negative exponent (the negative exponent with
the greatest magnitude).

Since the mantissa is an integer, the smallest positive value possible is 1.

Since the exponent is an 8-bit two's complement value, the smallest negative exponent is
10000000₂, or -2^7 = -128.

Hence the smallest positive value is 1 x 2^-128, or 2.93873587706 x 10^-39.

5. What is the second smallest positive value? What is the difference between the smallest and
second smallest?
6. Represent -2.75 in this floating point system.

a. Convert the number to fixed point binary using the methods described in previous
sections:

-2.75₁₀ = -(10.11₂)

b. Multiply by radix^exponent equal to 1:

-2.75₁₀ = -(10.11₂) x 2^0

c. Shift the binary point to make the mantissa a whole number: -(1011₂)

By moving the binary point two places to the right, we multiply the mantissa by 2^2.
We therefore must divide radix^exponent by the same factor:

-2.75₁₀ = -(1011₂) x 2^-2

d. Convert the mantissa and exponent into the specified formats (two's complement in
this case):

Mantissa: -(000000000000000000001011) = 111111111111111111110101

Exponent: -2₁₀ = 11111110

Binary representation = 11111111111111111111010111111110

7. How many different values can this system represent?
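
The conversions in examples 1 and 6, and the largest and smallest positive values above, can be checked with a short Python sketch of this format; the helper names are illustrative, not part of the text:

    MANT_BITS, EXP_BITS = 24, 8

    def to_twos_complement(value, bits):
        """Encode a signed integer into a bits-wide two's complement field."""
        return value & ((1 << bits) - 1)

    def from_twos_complement(field, bits):
        """Decode a bits-wide two's complement field into a signed integer."""
        if field & (1 << (bits - 1)):        # sign bit set, so the value is negative
            field -= 1 << bits
        return field

    def decode(pattern):
        """Value of a 32-bit pattern: 24-bit mantissa followed by 8-bit exponent."""
        mantissa = from_twos_complement(pattern >> EXP_BITS, MANT_BITS)
        exponent = from_twos_complement(pattern & ((1 << EXP_BITS) - 1), EXP_BITS)
        return mantissa * 2.0 ** exponent

    def encode(mantissa, exponent):
        """Pack an integer mantissa and exponent into a 32-bit pattern."""
        return (to_twos_complement(mantissa, MANT_BITS) << EXP_BITS) | to_twos_complement(exponent, EXP_BITS)

    # Example 1: +18 x 2^-4
    print(decode(0b000000000000000000010010_11111100))    # 1.125

    # Example 6: -2.75 = -(1011 binary) x 2^-2
    pattern = encode(-0b1011, -2)
    print(f"{pattern:032b}")                               # 11111111111111111111010111111110
    print(decode(pattern))                                 # -2.75

    # Largest and smallest positive values
    print(decode(encode(2**23 - 1, 127)))                  # about 1.42e+45
    print(decode(encode(1, -128)))                         # about 2.94e-39

Because both fields are ordinary two's complement integers, encoding and decoding amount to nothing more than bit slicing and shifting.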

4.16.3. Overflow and Underflow

Overflow occurs when the result of a floating point operation is larger than the largest positive
value, or smaller than the smallest negative value. In other words, the magnitude is too large to
represent.

Underflow occurs when the result of a floating point operation is smaller than the smallest positive
value, or larger than the largest negative value. In other words, the magnitude is too small to
represent.

The example 32-bit format above cannot represent values larger than about 10^45 or smaller than
about 10^-39.

One technique to avoid overflow and underflow is to alternate operations that increase and
decrease intermediate results. Rather than do all the multiplications first, which could cause
overflow, or all the divisions first, which could cause underflow, we could alternate multiplications
and divisions to moderate the results along the way. Techniques like these must often be used in
complex scientific calculations.
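
A minimal illustration of this interleaving, using Python's 64-bit floats and made-up numbers chosen to stress their range:

    nums   = [1e200, 1e200]    # numerator factors
    denoms = [1e150, 1e100]    # denominator factors

    # All multiplications first: the intermediate product overflows to infinity.
    print((nums[0] * nums[1]) / (denoms[0] * denoms[1]))   # inf

    # Alternate multiplications and divisions: intermediates stay in range.
    result = 1.0
    for n, d in zip(nums, denoms):
        result = result * n / d
    print(result)                                          # about 1e150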

4.16.4. Cost of Floating Point

Everything has a cost. The increased range and ability to represent non-whole numbers is no
exception.

Precision

There are only 2^32 patterns of 32 0's and 1's. Hence, there are only 2^32 unique numbers that we can
represent in 32 bits, regardless of the format.

So how is it we can represent numbers up to 10^45?

Obviously, we must be sacrificing something in between. What floating point does for us is spread
out the limited number of binary patterns we have available to cover a larger range of numbers.
The larger the exponent, the larger the gap between consecutive numbers that we can accurately
represent.

Close to 0, we can represent many numbers in a small range. Far from zero, entire ranges of
whole numbers cannot be represented.

The precision of a 32-bit floating point value is less than the precision of a 32-bit integer. By using
8 bits for the exponent, we sacrifice those 8 bits of precision. Hence, our example format has the
same precision as a 24-bit signed integer system.
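
As a quick illustration with the example format's integer mantissa, consecutive representable values at a given exponent differ by 2^exponent, so the gaps grow as the magnitude grows:

    # Gap between the two largest mantissas (2^23 - 2 and 2^23 - 1) at various exponents.
    for exponent in (0, 10, 127):
        a = 8388606 * 2.0 ** exponent
        b = 8388607 * 2.0 ** exponent
        print(exponent, b - a)        # gaps of 1, 1024, then about 1.7e38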

Performance

Arithmetic on floating point is several times slower than on integers. This is an inherent property of
the format.

Consider the process of adding two scientific notation values.


1. Equalize the exponents
2. Add the mantissas
3. Normalize the result

Each of these operations takes roughly the same amount of time in a computer as a single integer
addition. Since floating point is stored like scientific notation, we can expect floating point addition
to take about three times as long as integer addition. In reality, a typical PC takes about 2.5 times as
long to execute a floating point arithmetic instruction as it does to do the same integer instruction.
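
As a rough sketch of why this costs several integer-scale steps, here is an integer-only addition routine for the example 24/8-bit format; the name is illustrative and rounding is ignored:

    def float_add(ma, ea, mb, eb):
        """Add two values given as (24-bit mantissa, 8-bit exponent) integer pairs."""
        # 1. Equalize the exponents: shift the mantissa with the smaller exponent right.
        if ea < eb:
            ma >>= eb - ea
            ea = eb
        elif eb < ea:
            mb >>= ea - eb
            eb = ea
        # 2. Add the mantissas.
        m, e = ma + mb, ea
        # 3. Normalize: keep the mantissa within the signed 24-bit range.
        while not -(1 << 23) <= m < (1 << 23):
            m >>= 1
            e += 1
        return m, e

    print(float_add(18, -4, 18, -4))   # (36, -4), i.e. 2.25

Each step is itself one or more integer operations, which is where the extra time goes.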

Note that this applies only to operations that can be carried out using either a single integer
instruction or a single floating point instruction. For example, suppose a program is running on a
32-bit computer, and there is no way to represent the data within the range of a 32-bit integer. In
this case, multiple integer instructions will be necessary to process integer values of more than 32
bits, and the speed advantage of integers does not apply.

It is also possible in some systems that floating point and integer operations could occur at the
same time, and hence utilizing the floating point hardware could result in better performance than
performing additional integer operations while the floating point unit sits idle. This is the case with
graphics rendering that occurs using floating point on the graphics processing unit (GPU) rather
than the CPU. It would not make sense to move the rendering calculations to the CPU in order to
use integers, as this would only increase the workload for the CPU and allow the power of the GPU
to go to waste.

If hardware has floating point support built-in, then common operations like floating point
addition, subtraction, etc. can each be handled by a single instruction. If hardware doesn't have a
floating point unit (common in embedded processors), floating point operations must be handled
by software routines. Hence, adding two floating point values will require dozens of instructions to
complete instead of just one. These will be hundreds of times slower than integers, and will
consume a big chunk of available program memory.

Most algorithms can be implemented using integers with a little thought. Use of floating point is
often the result of sheer laziness. Don't use floating point just because it's intuitive.

Power Consumption

CPUs achieve their maximum power consumption when doing intensive
floating point calculations. This is not usually noticeable on a desktop PC, but can become a
problem on large grids consisting of hundreds of PCs, since the power grid they are attached to
may not be designed to provide for their maximum draw. It can also be a problem when running a
laptop on battery while doing intensive computations. Battery life while doing intensive floating
point computations could be a small fraction of what it is while reading email, browsing the web,
or editing a document in OpenOffice.
