Computer Number Format - Wikipedia
A computer number format is the internal representation of numeric values in digital device
hardware and software, such as in programmable computers and calculators.[1] Numerical values
are stored as groupings of bits, such as bytes and words. The encoding between numerical values
and bit patterns is chosen for convenience of the operation of the computer; the encoding used by
the computer's instruction set generally requires conversion for external use, such as for printing
and display. Different types of processors may have different internal representations of numerical values, and different conventions are used for integer and real numbers. Most calculations are
carried out with number formats that fit into a processor register, but some software systems allow
representation of arbitrarily large numbers using multiple words of memory.
Computers represent data in sets of binary digits. The representation is composed of bits, which in
turn are grouped into larger sets such as bytes.
A bit is a binary digit that represents one of two states. The concept of a bit can be understood as a value of either 1 or 0, on or off, yes or no, true or false, or encoded by a switch or toggle of some kind.

While a single bit, on its own, is able to represent only two values, a string of bits may be used to represent larger values. For example, a string of three bits can represent up to eight distinct values as illustrated in Table 1.

Table 1: Binary to octal
Binary string   Octal value
000             0
001             1
010             2
011             3
100             4
101             5
110             6
111             7

As the number of bits composing a string increases, the number of possible 0 and 1 combinations increases exponentially. A single bit allows only two value-combinations, two bits combined can make four separate values, three bits for eight, and so on, increasing with the formula 2^n. The number of possible combinations doubles with each binary digit added, as illustrated in Table 2.
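To make the 2^n relationship concrete, here is a minimal illustrative Python sketch that enumerates every value a three-bit string can hold (the three-bit case mirrors Table 1):

    # Enumerate every value of an n-bit string and confirm the 2**n count.
    n = 3
    values = [format(i, f"0{n}b") for i in range(2 ** n)]
    print(values)       # eight bit strings, '000' through '111'
    print(len(values))  # 8, i.e. 2 ** 3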
Groupings with a specific number of bits are used to represent varying things and have specific
names.
A byte is a bit string containing the number of bits needed to represent a character. On most modern computers, this is an eight-bit string. Because the definition of a byte is related to the number of bits composing a character, some older computers have used a different bit length for their byte.[2] In many computer architectures, the byte is the smallest addressable unit, the atom of addressability, so to speak. For example, even though 64-bit processors may address memory sixty-four bits at a time, they may still split that memory into eight-bit pieces.

Table 2: Number of values for a bit string.
Length of bit string (b)   Number of possible values (N)
1                          2
2                          4
3                          8
4                          16
5                          32
6                          64
7                          128
8                          256
Octal and hexadecimal encoding are convenient ways to represent binary numbers, as used by computers. Computer engineers often need to write out binary quantities, but in practice writing out a binary number such as 1001001101010001 is tedious and prone to errors. Therefore, binary quantities are written in a base-8 ("octal") or, much more commonly, a base-16 ("hexadecimal", or hex) number format. In the decimal system, there are 10 digits, 0 through 9, which combine to form numbers. In an octal system, there are only 8 digits, 0 through 7. That is, the value of an octal "10" is the same as a decimal "8", an octal "20" is a decimal "16", and so on. In a hexadecimal system, there are 16 digits, 0 through 9 followed, by convention, by A through F. That is, a hexadecimal "10" is the same as a decimal "16" and a hexadecimal "20" is the same as a decimal "32". An example and comparison of numbers in different bases is given in Table 3 below.
When typing numbers, formatting characters are used to describe the number system, for example
000_0000B or 0b000_00000 for binary and 0F8H or 0xf8 for hexadecimal numbers.
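For comparison, the illustrative Python sketch below writes the same quantity in three bases; note that Python uses the prefixes 0b, 0o, and 0x (with optional underscore separators) rather than the suffix style shown above:

    # The same quantity written with binary, octal, and hexadecimal literals.
    b = 0b1111_1000   # binary literal
    o = 0o370         # octal literal
    h = 0xF8          # hexadecimal literal
    print(b, o, h)    # 248 248 248 -- the same value in all three notations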
Converting between bases
Each of these number systems is a positional system, but while decimal weights are powers of 10, the octal weights are powers of 8 and the hexadecimal weights are powers of 16. To convert from hexadecimal or octal to decimal, for each digit one multiplies the value of the digit by the value of its position and then adds the results; a worked example is sketched after Table 3 below.

Table 3: Comparison of values in different bases
Decimal   Binary   Octal   Hexadecimal
0         000000   00      00
1         000001   01      01
2         000010   02      02
3         000011   03      03
4         000100   04      04
5         000101   05      05
6         000110   06      06
7         000111   07      07
8         001000   10      08
9         001001   11      09
10        001010   12      0A
11        001011   13      0B
12        001100   14      0C
13        001101   15      0D
14        001110   16      0E
15        001111   17      0F
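The following Python sketch works through the digit-times-position rule; the sample numerals 2F (hex), 172 (octal), and 101010 (binary) are chosen purely for illustration:

    def to_decimal(digits: str, base: int) -> int:
        """Convert a numeral in the given base to decimal by weighting
        each digit by the value of its position and summing."""
        value = 0
        for ch in digits:
            value = value * base + int(ch, 16)  # int(ch, 16) maps 0-9 and A-F to digit values
        return value

    print(to_decimal("2F", 16))     # 2*16 + 15       = 47
    print(to_decimal("172", 8))     # 1*64 + 7*8 + 2  = 122
    print(to_decimal("101010", 2))  # 42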
Fixed-point numbers
The eight's bit is followed by the four's bit, then the two's bit, then the one's bit. The fractional bits continue the pattern set by the integer bits: the next bit is the half's bit, then the quarter's bit, then the ⅛'s bit, and so on. For example, the binary fixed-point value 10.11 represents 2 + 1/2 + 1/4, or 2.75 in decimal.

This form of encoding cannot represent some values exactly in binary. For example, for the fraction 1/5 (0.2 in decimal), any fixed number of fractional bits yields only a close approximation, as illustrated in the sketch below. Even if more digits are used, an exact representation is impossible. The number 1/3, written in decimal as 0.333333333..., continues indefinitely. If prematurely terminated, the value would not represent 1/3 precisely.
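The following illustrative Python sketch, using exact rational arithmetic, shows how the closest fixed-point approximations of 1/5 behave as more fractional bits are allowed; the error shrinks but never reaches zero:

    from fractions import Fraction

    exact = Fraction(1, 5)                            # the value 0.2, held exactly
    for n in (4, 8, 16, 24):
        # Best approximation of 1/5 using n fractional binary digits.
        approx = Fraction(round(exact * 2 ** n), 2 ** n)
        print(f"{n:2d} fractional bits: {float(approx):.10f}  error {float(abs(approx - exact)):.3e}")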
Floating-point numbers
While both unsigned and signed integers are used in digital systems, even a 32-bit integer is not enough to handle the full range of numbers a calculator can handle, and that is not even including fractions. To approximate the greater range and precision of real numbers, we have to abandon signed integers and fixed-point numbers and go to a "floating-point" format.
In the decimal system, we are familiar with floating-point numbers of the form (scientific notation):
1.1030402E5
which means "1.1030402 times 1 followed by 5 zeroes". We have a certain numeric value (1.1030402) known as a "significand", multiplied by a power of 10 (E5, meaning 10^5 or 100,000), known as an "exponent". If we have a negative exponent, that means the number is multiplied by a 1 that many places to the right of the decimal point; for example, 2.3E-2 means 2.3 multiplied by 0.01 (a 1 two places to the right of the decimal point), that is, 0.023.
The advantage of this scheme is that by using the exponent we can get a much wider range of
numbers, even if the number of digits in the significand, or the "numeric precision", is much smaller
than the range. Similar binary floating-point formats can be defined for computers. There are a number of such schemes; the most popular has been defined by the Institute of Electrical and Electronics Engineers (IEEE). The IEEE 754-2008 standard specification defines a 64-bit floating-point format with:
an 11-bit binary exponent, using "excess-1023" format. Excess-1023 means the exponent appears
as an unsigned binary integer from 0 to 2047; subtracting 1023 gives the actual signed value
a 52-bit significand, also an unsigned binary number, defining a fractional value with a leading
implied "1"
byte 0   S     x10   x9    x8    x7    x6    x5    x4
byte 1   x3    x2    x1    x0    m51   m50   m49   m48
byte 2   m47   m46   m45   m44   m43   m42   m41   m40
byte 3   m39   m38   m37   m36   m35   m34   m33   m32
byte 4   m31   m30   m29   m28   m27   m26   m25   m24
byte 5   m23   m22   m21   m20   m19   m18   m17   m16
byte 6   m15   m14   m13   m12   m11   m10   m9    m8
byte 7   m7    m6    m5    m4    m3    m2    m1    m0
where "S" denotes the sign bit, "x" denotes an exponent bit, and "m" denotes a significand bit. Once the bits here have been extracted, they are converted with the computation:

value = (−1)^S × 1.m × 2^(x − 1023)

where 1.m is the binary fraction formed by the implied leading 1 followed by the 52 significand bits, and x is the unsigned value of the 11 exponent bits.
This scheme provides numbers valid out to about 15 decimal digits, with the following range of
numbers:
            maximum                      minimum
positive    1.797693134862231E+308       4.940656458412465E-324
negative    -4.940656458412465E-324      -1.797693134862231E+308
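As an illustration (Python, standard library only; the value 6.5 is an arbitrary example), the sketch below extracts the sign, exponent, and significand fields of a 64-bit real and rebuilds its value with the computation given above. It ignores the special cases of zero, subnormals, infinities, and NaNs:

    import struct

    x = 6.5
    bits = int.from_bytes(struct.pack(">d", x), "big")   # the 64 raw bits, byte 0 first
    sign = bits >> 63                                     # S
    exponent = (bits >> 52) & 0x7FF                       # 11-bit excess-1023 exponent field
    significand = bits & ((1 << 52) - 1)                  # 52 fraction bits, leading 1 implied
    value = (-1) ** sign * (1 + significand / 2 ** 52) * 2 ** (exponent - 1023)
    print(sign, exponent - 1023, value)                   # 0 2 6.5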
The specification also defines several special values that are not defined numbers, and are known
as NaNs, for "Not A Number". These are used by programs to designate invalid operations and the
like.
Some programs also use 32-bit floating-point numbers. The most common scheme uses a 23-bit
significand with a sign bit, plus an 8-bit exponent in "excess-127" format, giving seven valid decimal
digits.
byte 0   S     x7    x6    x5    x4    x3    x2    x1
byte 1   x0    m22   m21   m20   m19   m18   m17   m16
byte 2   m15   m14   m13   m12   m11   m10   m9    m8
byte 3   m7    m6    m5    m4    m3    m2    m1    m0

            maximum              minimum
positive    3.402823E+38         2.802597E-45
negative    -2.802597E-45        -3.402823E+38
Such floating-point numbers are known as "reals" or "floats" in general, but with a number of variations:

A 32-bit float value is sometimes called a "real32" or a "single", meaning "single-precision floating-point value".

A 64-bit float value is sometimes called a "real64" or a "double", meaning "double-precision floating-point value".
The relation between numbers and bit patterns is chosen for convenience in computer
manipulation; eight bytes stored in computer memory may represent a 64-bit real, two 32-bit reals,
or four signed or unsigned integers, or some other kind of data that fits into eight bytes. The only
difference is how the computer interprets them. If the computer stored four unsigned integers and
then read them back from memory as a 64-bit real, it almost always would be a perfectly valid real
number, though it would be junk data.
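A short illustrative Python sketch makes the point: the same eight bytes are written as four unsigned 16-bit integers and then read back as a 64-bit real (the values 1 through 4 are arbitrary):

    import struct

    raw = struct.pack("<4H", 1, 2, 3, 4)     # store four unsigned 16-bit integers
    as_double, = struct.unpack("<d", raw)    # reinterpret the same 8 bytes as a 64-bit real
    print(raw.hex())                         # 0100020003000400
    print(as_double)                         # a "valid" but meaningless (junk) real number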
Only a finite range of real numbers can be represented with a given number of bits. Arithmetic
operations can overflow or underflow, producing a value too large or too small to be represented.
The representation has a limited precision. For example, only 15 decimal digits can be represented
with a 64-bit real. If a very small floating-point number is added to a large one, the result is just the
large one. The small number was too small to even show up in 15 or 16 digits of resolution, and the
computer effectively discards it. Analyzing the effect of limited precision is a well-studied problem.
Estimates of the magnitude of round-off errors and methods to limit their effect on large
calculations are part of any large computation project. The precision limit is different from the range
limit, as it affects the significand, not the exponent.
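A brief Python illustration of a small value vanishing next to a large one at 64-bit precision:

    big = 1.0e16
    small = 1.0
    print(big + small == big)   # True: the 1.0 is lost below the 15-16 digit resolution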
The significand is a binary fraction that doesn't necessarily perfectly match a decimal fraction. In
many cases a sum of reciprocal powers of 2 does not match a specific decimal fraction, and the
results of computations will be slightly off. For example, the decimal fraction "0.1" is equivalent to
an infinitely repeating binary fraction: 0.000110011 ...[6]
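The effect is easy to observe in Python (an illustration): the value actually stored for 0.1 is the nearest binary fraction, not 0.1 itself:

    from decimal import Decimal
    from fractions import Fraction

    print(0.1 + 0.2 == 0.3)   # False: each operand is a nearby binary fraction, not the exact decimal
    print(Decimal(0.1))       # the exact value actually stored for 0.1
    print(Fraction(0.1))      # the same value as a ratio of integers with a power-of-2 denominator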
Programming in assembly language requires the programmer to keep track of the representation of
numbers. Where the processor does not support a required mathematical operation, the
programmer must work out a suitable algorithm and instruction sequence to carry out the operation;
on some microprocessors, even integer multiplication must be done in software.
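As a hedged sketch of the kind of routine such software must provide (illustrative Python, not any particular processor's algorithm), integer multiplication can be built from shifts and additions:

    def multiply(a: int, b: int) -> int:
        """Shift-and-add multiplication of non-negative integers,
        using only addition, shifting, and bit tests."""
        result = 0
        while b:
            if b & 1:          # if the low bit of b is set, add the shifted a
                result += a
            a <<= 1            # a advances one bit position
            b >>= 1            # consume the low bit of b
        return result

    print(multiply(37, 12))    # 444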
High-level programming languages such as Ruby and Python offer an abstract number that may be
an expanded type such as rational, bignum, or complex. Mathematical operations are carried out by
library routines provided by the implementation of the language. A given mathematical symbol in
the source code, by operator overloading, will invoke different object code appropriate to the
representation of the numerical type; mathematical operations on any number—whether signed,
unsigned, rational, floating-point, fixed-point, integral, or complex—are written exactly the same way.
Some languages, such as REXX and Java, provide decimal floating-point operations, which exhibit rounding errors of a different form.
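Python's standard library can illustrate both points, assuming its decimal module is an acceptable stand-in for the decimal floating point of REXX or Java: the same + symbol invokes code appropriate to each representation, and decimal arithmetic rounds differently from binary arithmetic:

    from decimal import Decimal
    from fractions import Fraction

    # The same "+" operator dispatches to type-appropriate code.
    print(2 + 3)                               # integer
    print(Fraction(1, 3) + Fraction(1, 6))     # rational: 1/2
    print((1 + 2j) + (3 - 1j))                 # complex: (4+1j)

    # Decimal floating point: 0.1 + 0.2 is exact here, unlike binary floating point.
    print(Decimal("0.1") + Decimal("0.2"))     # 0.3
    print(0.1 + 0.2)                           # 0.30000000000000004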
See also
Arbitrary-precision arithmetic