3 Fixed and Floating Point DSP
Problems in storing and recalling numbers
A classic computational error results from the addition of
two numbers with very different values, for example, 1
and 0.00000001. We would like the answer to be
1.00000001, but the computer replies with 1.
To avoid these errors, we need to understand how
computers store and manipulate numbers.
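As a quick illustration, a minimal C sketch of this round-off behavior using the 32-bit float type (variable names are only for illustration):

```c
#include <stdio.h>

int main(void) {
    float a = 1.0f;
    float b = 0.00000001f;   /* 1e-8, far below single precision resolution near 1.0 */
    float sum = a + b;       /* the small addend is lost in the addition */
    printf("%.10f\n", sum);  /* prints 1.0000000000 rather than 1.0000000100 */
    return 0;
}
```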
These problems arise because a fixed number of bits is
allocated to store each number, usually 8, 16, 32, or 64.
For example, consider the case where eight bits are used
to store the value of a variable.
Since there are 2^8 = 256 possible bit patterns, the variable
can only take on 256 different values.
This is a fundamental limitation of the situation, and there is
nothing we can do about it.
Remedies and constraints
What we can do is declare what value each bit pattern represents.
In the simplest cases, the 256 bit patterns might represent
the integers from 0 to 255, 1 to 256, -127 to 128, etc.
In a more unusual scheme, the 256 bit patterns might
represent 256 exponentially related numbers: 1, 10, 100,
1000, …, 10^254, 10^255. Everyone accessing the data must
understand what value each bit pattern represents.
This is usually provided by an algorithm or formula for
converting between the represented value and the
corresponding bit pattern, and back again.
Many encoding schemes are possible. Two general formats
have become common: fixed point (also called integer
numbers) and floating point (also called real numbers).
Range: the largest and smallest numbers they can represent.
Precision: the size of the gaps between numbers.
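To make "precision" concrete, here is a small C sketch (assuming a C99 compiler) that measures the gap between adjacent single precision numbers near one million:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    float x = 1000000.0f;
    float next = nextafterf(x, INFINITY);    /* the next representable float above x */
    printf("gap near 1e6: %f\n", next - x);  /* about 0.0625; values in between cannot be stored */
    return 0;
}
```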
Fixed Point Representation: Unsigned integer
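This is the simplest fixed point scheme: an n-bit unsigned integer maps the 2^n bit patterns directly onto the integers 0 through 2^n - 1. A minimal C illustration for 16 bits:

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t x = 0xFFFF;   /* all 16 bits set: the largest pattern */
    printf("%u\n", x);     /* prints 65535, i.e. 2^16 - 1 */
    x = x + 1;             /* one step past the top wraps around */
    printf("%u\n", x);     /* prints 0 */
    return 0;
}
```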
Fixed Point Representation: Offset Binary
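In offset binary, the stored pattern is the represented value plus a fixed offset, so negative values map onto nonnegative patterns. A minimal C sketch, assuming the common 16-bit convention with an offset of 32,768 (so 0x0000 represents -32,768 and 0xFFFF represents 32,767); the helper names are only for illustration:

```c
#include <stdio.h>
#include <stdint.h>

/* 16-bit offset binary with offset 32768. */
uint16_t offset_encode(int value)     { return (uint16_t)(value + 32768); }
int      offset_decode(uint16_t bits) { return (int)bits - 32768; }

int main(void) {
    printf("0x%04X\n", offset_encode(0));      /* 0x8000: zero sits at the offset */
    printf("%d\n",     offset_decode(0x0000)); /* -32768: the smallest value */
    printf("%d\n",     offset_decode(0xFFFF)); /*  32767: the largest value */
    return 0;
}
```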
Fixed Point Representation: Sign and Magnitude
Sign and magnitude is another simple way of
representing negative integers.
The far-left bit is called the sign bit; it is made a zero
for positive numbers and a one for negative numbers.
The other bits are a standard binary
representation of the absolute value of the
number.
This results in one wasted bit pattern, since there
are two representations for zero, 0000 (positive
zero) and 1000 (negative zero). This encoding
scheme results in 16-bit numbers having a range
of -32,767 to 32,767.
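A minimal C sketch of the encoding just described, assuming a 16-bit word (the function name is only for illustration):

```c
#include <stdio.h>
#include <stdint.h>

/* Sign and magnitude, 16 bits: the top bit is the sign, the low 15 bits
   hold the absolute value, giving the range -32767 to 32767. */
uint16_t sign_magnitude_encode(int value) {
    uint16_t sign = (value < 0) ? 0x8000 : 0x0000;
    uint16_t magnitude = (uint16_t)(value < 0 ? -value : value);
    return sign | magnitude;
}

int main(void) {
    printf("0x%04X\n", sign_magnitude_encode(5));   /* 0x0005 */
    printf("0x%04X\n", sign_magnitude_encode(-5));  /* 0x8005: same magnitude, sign bit set */
    printf("0x%04X\n", sign_magnitude_encode(0));   /* 0x0000, the "positive zero" */
    return 0;
}
```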
Fixed Point Representation: Two’s Complement
Two's complement is the format most commonly used by
hardware engineers, and is how integers are usually
represented in computers.
The decimal number zero corresponds to the binary
zero, 0000. As we count upward, the decimal number is
simply the binary equivalent (0 = 0000, 1 = 0001, 2 = 0010, 3
= 0011, etc.). Now, remember that these four bits are stored
in a register consisting of 4 flip-flops. If we again start at
0000 and begin subtracting, the digital hardware
automatically counts in two's complement: 0 = 0000, -1 =
1111, -2 = 1110, -3 = 1101, etc.
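This wrap-around behavior can be seen directly in C, where unsigned arithmetic counts exactly the way the 4-bit register above does (shown here with an 8-bit register):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t reg = 0;   /* an 8-bit register holding 0000 0000 */
    for (int i = 1; i <= 3; i++) {
        reg--;                             /* subtracting 1 wraps: 0xFF, 0xFE, 0xFD */
        printf("-%d = 0x%02X\n", i, reg);  /* the two's complement patterns for -1, -2, -3 */
    }
    return 0;
}
```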
Fixed Point Representation: Two’s Complement
This is analogous to a decimal odometer: counting forward it
reads 00000, 00001, 00002, 00003, and so on, while counting
backward from 00000 it wraps to 99999, 99998, 99997, etc.
Using 16 bits, two's complement can represent numbers
from -32,768 to 32,767.
The leftmost bit is a 0 if the number is positive or zero, and
a 1 if the number is negative.
Consequently, the leftmost bit is called the sign bit, just as
in sign & magnitude representation.
Fixed Point Representation: Two’s Complement
Converting between decimal and two's complement is
straightforward for positive numbers: a simple decimal-to-binary
conversion.
For negative numbers, the following algorithm is often
used:
(1) take the absolute value of the decimal number,
(2) convert it to binary,
(3) complement all of the bits (ones become zeros and
zeros become ones),
(4) add 1 to the binary number.
For example: -5 → 5 → 0101 → 1010 → 1011.
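A minimal C sketch of this four-step algorithm, assuming a 4-bit word as in the example (the function name is only for illustration):

```c
#include <stdio.h>
#include <stdint.h>

/* Encode a negative decimal number as a 4-bit two's complement pattern. */
uint8_t twos_complement_4bit(int x) {
    uint8_t magnitude = (uint8_t)(-x);     /* steps 1-2: absolute value in binary */
    uint8_t flipped = (uint8_t)~magnitude; /* step 3: complement all of the bits */
    return (uint8_t)(flipped + 1) & 0x0F;  /* step 4: add 1, keep only 4 bits */
}

int main(void) {
    printf("-5 -> %X\n", twos_complement_4bit(-5)); /* prints B, i.e. binary 1011 */
    return 0;
}
```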
Floating Point: Real Numbers
Floating Point: Single Precision
Bit 31: sign bit (1 bit)
Bits 30-23: exponent (8 bits)
Bits 22-0: mantissa (23 bits)
Floating Point: Single Precision
The equation for converting a bit pattern into a floating point
number: the number is represented by V, S is the value of the sign
bit, M is the value of the mantissa, and E is the value of the exponent.
V = (-1)^S x M x 2^(E-127)
The term (-1)^S means that the sign bit, S, is 0 for a positive
number and 1 for a negative number.
The variable E is the number between 0 and 255 represented
by the eight exponent bits. Subtracting 127 from this number
allows the exponent term to run from 2^-127 to 2^128. In other
words, the exponent is stored in offset binary with an offset of 127.
The mantissa, M, is formed from the 23 bits as a binary fraction.
Floating Point: Single Precision
The decimal fraction 2.783 is interpreted as 2 + 7/10 + 8/100 +
3/1000. The binary fraction 1.0101 means 1 + 0/2 + 1/4 + 0/8 +
1/16.
Floating point numbers have only one nonzero digit to the left
of the decimal point (called a binary point in base 2).
Since the only nonzero digit in base two is 1, the
leading digit in the mantissa will always be a 1, and therefore
does not need to be stored.
Removing this redundancy allows the number to have one
additional bit of precision. The 23 stored bits, referred to by
the notation m22, m21, m20, …, m0, form the mantissa according to:
M = 1 + m22 x 2^-1 + m21 x 2^-2 + … + m0 x 2^-23
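The equation and mantissa formula can be checked in C by pulling the three fields out of a float's stored bits (a sketch assuming a normalized number, ignoring zero, infinities, and other special cases):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void) {
    float f = -2.75f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    /* view the 32 stored bits */

    uint32_t S = bits >> 31;           /* 1 sign bit */
    uint32_t E = (bits >> 23) & 0xFF;  /* 8 exponent bits, offset binary */
    uint32_t m = bits & 0x7FFFFF;      /* 23 mantissa bits */

    double M = 1.0 + m / 8388608.0;    /* implied leading 1 plus fraction (2^23 = 8388608) */
    double V = (S ? -1.0 : 1.0) * M * pow(2.0, (double)E - 127.0);
    printf("V = %f\n", V);             /* prints V = -2.750000 */
    return 0;
}
```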
Floating Point: Single Precision
Besides the special classes (such as zero and ±infinity), there are
bit patterns that are not assigned a meaning, commonly referred
to as NANs (Not A Number).
Floating Point: Double Precision
The IEEE standard for double precision simply adds more bits to
the single precision format. Of the 64 bits used to store a double
precision number, bits 0 through 51 are the mantissa, bits 52
through 62 are the exponent, and bit 63 is the sign bit.
As before, the mantissa is between one and just under two, i.e.,
M = 1 + m51 x 2^-1 + m50 x 2^-2 + m49 x 2^-3 + ….
The 11 exponent bits form a number between 0 and 2047, with an
offset of 1023, allowing exponents from 2^-1023 to 2^1024.
The largest and smallest numbers allowed are ±1.8 x 10^308 and
±2.2 x 10^-308, respectively. These are extremely large and small
numbers.
Single precision is adequate for most applications.
Double precision is adequate for virtually all applications.
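Closing the loop on the round-off example from the start of this section: in double precision the small addend is no longer lost. A minimal C check:

```c
#include <stdio.h>

int main(void) {
    double a = 1.0;
    double b = 0.00000001;     /* 1e-8 is easily within double precision */
    printf("%.10f\n", a + b);  /* prints 1.0000000100, the answer we wanted */
    return 0;
}
```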