Data Representation
This chapter describes the various ways in which computers can store and manipulate
numbers and characters.
Bit: The most basic unit of information in a digital computer is called a bit, which is a
contraction of binary digit.
Byte: In 1964, the designers of the IBM System/360 mainframe computer established
a convention of using groups of 8 bits as the basic unit of addressable computer
storage. They called this collection of 8 bits a byte.
Word: Computer words consist of two or more adjacent bytes that are sometimes
addressed and almost always are manipulated collectively. The word size represents
the data size that is handled most efficiently by a particular architecture. Words can
be 16, 32, or 64 bits wide.
Nibbles: Eight-bit bytes can be divided into two 4-bit halves called nibbles.
Radix (or Base): The general idea behind positional numbering systems is that a
numeric value is represented through increasing powers of a radix (or base).
104₁₀ = 10212₃

3 | 104   remainder 2
3 |  34   remainder 1
3 |  11   remainder 2
3 |   3   remainder 0
3 |   1   remainder 1
      0

Reading the remainders from bottom to top gives 10212₃.
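The repeated-division procedure above can be sketched in Python (a minimal illustration; the function name `to_base` is ours, not from the text):

```python
def to_base(n, base):
    """Convert a non-negative integer to a digit string in the given base
    by repeated division, collecting remainders from bottom to top."""
    if n == 0:
        return "0"
    digits = "0123456789ABCDEF"
    out = []
    while n > 0:
        out.append(digits[n % base])   # remainder becomes the next digit
        n //= base
    return "".join(reversed(out))      # remainders read bottom to top

print(to_base(104, 3))   # 10212
print(to_base(147, 2))   # 10010011
```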
147₁₀ = 10010011₂

2 | 147   remainder 1
2 |  73   remainder 1
2 |  36   remainder 0
2 |  18   remainder 0
2 |   9   remainder 1
2 |   4   remainder 0
2 |   2   remainder 0
2 |   1   remainder 1
      0

Reading the remainders from bottom to top gives 10010011₂.
0.4304₁₀ = 0.2034₅

0.4304 × 5 = 2.1520 → digit 2
0.1520 × 5 = 0.7600 → digit 0
0.7600 × 5 = 3.8000 → digit 3
0.8000 × 5 = 4.0000 → digit 4

Reading the integer parts from top to bottom gives 0.2034₅.
EXAMPLE 2.7 Convert 0.34375₁₀ to binary with 4 bits to the right of the binary
point.

0.34375₁₀ = 0.0101₂

0.34375 × 2 = 0.68750 → digit 0
0.68750 × 2 = 1.37500 → digit 1
0.37500 × 2 = 0.75000 → digit 0
0.75000 × 2 = 1.50000 → digit 1

Reading the integer parts from top to bottom, 0.34375₁₀ = 0.0101₂ to four binary
places. We simply discard (or truncate) our answer when the desired accuracy has
been achieved.
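The repeated-multiplication method used in both fraction examples can be sketched as follows (a hypothetical helper, using exact rational arithmetic so the digits match hand computation):

```python
from fractions import Fraction

def frac_to_base(frac, base, places):
    """Convert a fraction in [0, 1) to `places` digits in the given base by
    repeated multiplication, peeling off the integer part at each step.
    Fraction (built from a decimal string) avoids float rounding surprises."""
    frac = Fraction(frac)
    out = []
    for _ in range(places):
        frac *= base
        digit = int(frac)          # integer part is the next digit
        out.append(str(digit))
        frac -= digit              # keep only the fractional part
    return "0." + "".join(out)

print(frac_to_base("0.4304", 5, 4))    # 0.2034
print(frac_to_base("0.34375", 2, 4))   # 0.0101
```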
217₁₀ = 22001₃

3 | 217   remainder 1
3 |  72   remainder 0
3 |  24   remainder 0
3 |   8   remainder 2
3 |   2   remainder 2
      0
110 010 011 101₂ = 6235₈ (separate into groups of 3 bits for octal conversion)
1100 1001 1101₂ = C9D₁₆ (separate into groups of 4 bits for hexadecimal conversion)
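The grouping trick can be sketched as a small Python helper (an illustration, not from the text; note the left-padding so the leftmost group is complete):

```python
def bin_to_radix(bits, group):
    """Convert a binary string to octal (group=3) or hexadecimal (group=4)
    by left-padding with zeros and translating each group independently."""
    pad = (-len(bits)) % group
    bits = "0" * pad + bits
    return "".join("0123456789ABCDEF"[int(bits[i:i + group], 2)]
                   for i in range(0, len(bits), group))

print(bin_to_radix("110010011101", 3))   # 6235  (octal)
print(bin_to_radix("110010011101", 4))   # C9D   (hexadecimal)
```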
A signed-magnitude number has a sign as its left-most bit (also referred to as the
high-order bit or the most significant bit) while the remaining bits represent the
magnitude (or absolute value) of the numeric value.
An n-bit signed-magnitude number can represent values from −(2ⁿ⁻¹ − 1) to 2ⁿ⁻¹ − 1.
EXAMPLE 2.10 Add 01001111₂ to 00100011₂ using signed-magnitude arithmetic.
Signed magnitude has two representations for zero, 10000000 and 00000000
(and mathematically speaking, this simply shouldn’t happen!).
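A minimal sketch of signed-magnitude encoding and decoding (the function names are ours); decoding both zero patterns shows the double representation of zero:

```python
def sm_encode(value, n=8):
    """Encode an integer in n-bit signed magnitude: a sign bit followed by
    the magnitude in the remaining n-1 bits."""
    assert abs(value) <= 2 ** (n - 1) - 1, "magnitude out of range"
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{n - 1}b")

def sm_decode(bits):
    """Decode an n-bit signed-magnitude string."""
    mag = int(bits[1:], 2)
    return -mag if bits[0] == "1" else mag

print(sm_encode(35))                                 # 00100011
print(sm_decode("10100011"))                         # -35
print(sm_decode("00000000"), sm_decode("10000000"))  # 0 0  (two zeros)
```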
One’s Complement
o This sort of bit-flipping is very simple to implement in computer hardware.
o EXAMPLE 2.16 Express 23₁₀ and −9₁₀ in 8-bit binary one’s complement form.
Two’s Complement
o Find the one’s complement and add 1.
o EXAMPLE 2.20 Express 23₁₀, −23₁₀, and −9₁₀ in 8-bit binary two’s complement
form.
o EXAMPLE 2.23 Find the sum of 23₁₀ and −9₁₀ in binary using two’s complement
arithmetic.
o EXAMPLE 2.22 Find the sum of 126₁₀ and 8₁₀ in binary using two’s complement
arithmetic.
A one is carried into the leftmost bit, but a zero is carried out. Because these
carries are not equal, an overflow has occurred.
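The carry-comparison rule can be sketched in Python (a hypothetical helper, not from the text); the second call reproduces the 126 + 8 overflow described above:

```python
def add2c(a, b, n=8):
    """Add two n-bit two's complement values; detect overflow by comparing
    the carry into the sign bit with the carry out of it."""
    mask = (1 << n) - 1
    ua, ub = a & mask, b & mask              # unsigned views of the bit patterns
    total = ua + ub
    carry_out = (total >> n) & 1
    # carry into the sign bit: add only the bits below the sign position
    carry_in = (((ua & (mask >> 1)) + (ub & (mask >> 1))) >> (n - 1)) & 1
    overflow = carry_in != carry_out
    result = total & mask
    if result >= 1 << (n - 1):               # reinterpret as signed
        result -= 1 << n
    return result, overflow

print(add2c(23, -9))   # (14, False)
print(add2c(126, 8))   # (-122, True): 134 does not fit in 8 signed bits
```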
o When the divisor is much smaller than the dividend, we get a condition known as
divide underflow, which the computer sees as the equivalent of division by zero.
o Computers make a distinction between integer division and floating-point division.
With integer division, the answer comes in two parts: a quotient and a
remainder.
Floating-point division results in a number that is expressed as a binary
fraction.
Floating-point calculations are carried out in dedicated circuits called floating-
point units, or FPUs.
Excess-M representation (also called offset binary representation) is another way
to use unsigned binary values to represent signed integers.
Excess-M representation is intuitive because the binary string with all 0s represents
the smallest number, whereas the binary string with all 1s represents the largest
value.
The unsigned binary integer M (called the bias) represents the value 0, whereas a
bit pattern of all zeroes represents the integer −M.
The integer is interpreted as positive or negative depending on where it falls in the
range.
If n bits are used for the binary representation, we select the bias in such a manner
that we split the range equally. Typically, we choose a bias of 2ⁿ⁻¹ − 1.
Just as with signed magnitude, one’s complement, and two’s complement, there is a
specific range of values that can be expressed in n bits.
The unsigned binary value for a signed integer using Excess-M representation is
determined simply by adding M to that integer.
o For example, assuming that we are using Excess-7 representation, the integer 0₁₀
is represented as 0 + 7 = 7₁₀ = 0111₂.
o The integer 3₁₀ is represented as 3 + 7 = 10₁₀ = 1010₂.
o The integer −7 is represented as −7 + 7 = 0₁₀ = 0000₂.
o To find the decimal value of the Excess-7 binary number 1111₂, subtract 7: 1111₂
= 15₁₀ and 15 − 7 = 8; thus 1111₂ in Excess-7 is +8₁₀.
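The Excess-M rules above reduce to one addition and one subtraction; a minimal sketch (function names ours) that reproduces the Excess-7 examples:

```python
def excess_encode(value, n):
    """Encode a signed integer in n-bit Excess-M with the usual bias
    M = 2**(n-1) - 1: add the bias, then write the result as unsigned binary."""
    m = 2 ** (n - 1) - 1
    return format(value + m, f"0{n}b")

def excess_decode(bits):
    """Decode: interpret the bits as unsigned, then subtract the bias."""
    n = len(bits)
    return int(bits, 2) - (2 ** (n - 1) - 1)

print(excess_encode(0, 4))    # 0111  (Excess-7)
print(excess_encode(3, 4))    # 1010
print(excess_encode(-7, 4))   # 0000
print(excess_decode("1111"))  # 8
```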
If the 4-bit binary value 1101 is unsigned, then it represents the decimal value 13, but
as a signed two’s complement number, it represents -3.
The C programming language has int and unsigned int as possible types for integer
variables.
If we are using 4-bit unsigned binary numbers and we add 1 to 1111, we get 0000
(“return to zero”).
If we add 1 to the largest positive 4-bit two’s complement number 0111 (+7), we get
1000 (-8).
Consider the following standard pencil-and-paper method for multiplying the two’s
complement numbers −5 × −4:

       1011       (−5)
     x 1100       (−4)
     + 0000       (0 in multiplier means simple shift)
     + 0000       (0 in multiplier means simple shift)
     + 1011       (1 in multiplier means add multiplicand and shift)
     + 1011____   (1 in multiplier means add multiplicand and shift)
      10000100    (as an 8-bit two’s complement value this is −124, not the
                   correct product +20: the naive method fails on signed operands)
Research into finding better arithmetic algorithms has continued apace for over 50
years. One of the many interesting products of this work is Booth’s algorithm.
In most cases, Booth’s algorithm carries out multiplication faster and more accurately
than naïve pencil-and-paper methods.
The general idea of Booth’s algorithm is to increase the speed of a multiplication
when there are consecutive zeros or ones in the multiplier.
Consider the following standard multiplication example (3 X 6):
0011 (3)
x 0110 (6)
+ 0000 (0 in multiplier means simple shift)
+ 0011 (1 in multiplier means add multiplicand and shift)
+ 0011 (1 in multiplier means add multiplicand and shift)
+ 0000____ (0 in multiplier means simple shift)
0010010 (3 X 6 = 18)
Note: ignore any extended sign bits that go beyond 2n places.
Booth’s algorithm not only allows multiplication to be performed faster in most cases,
but it also has the added bonus that it works correctly on signed numbers.
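The run-skipping idea can be sketched as follows (a minimal illustration, not the textbook's register-level formulation; the function name is ours): scanning the multiplier from the LSB, a 0→1 boundary subtracts the shifted multiplicand, a 1→0 boundary adds it, and runs of identical bits are skipped:

```python
def booth_multiply(multiplicand, multiplier, n=4):
    """Booth's algorithm for n-bit two's complement operands, expressed as
    Booth recoding: each bit pair (current, previous) contributes
    (previous - current) * 2**i times the multiplicand."""
    bits = multiplier & ((1 << n) - 1)   # two's complement bit pattern
    acc, prev = 0, 0
    for i in range(n):
        bit = (bits >> i) & 1
        if bit == 1 and prev == 0:       # start of a run of 1s: subtract
            acc -= multiplicand << i
        elif bit == 0 and prev == 1:     # end of a run of 1s: add
            acc += multiplicand << i
        prev = bit
    return acc                           # the sign bit needs no special case

print(booth_multiply(3, 6))      # 18
print(booth_multiply(-5, -4))    # 20, correct for signed operands
```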
For unsigned numbers, a carry (out of the leftmost bit) indicates the total number of
bits was not large enough to hold the resulting value, and overflow has occurred.
For signed numbers, if the carry into the sign bit and the carry out of the sign bit
differ, then overflow has occurred.
EXAMPLE 2.31 Divide the value 12 (expressed using 8-bit signed two’s
complement representation) by 2.
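Division by 2 on two's complement values amounts to an arithmetic right shift, in which the sign bit is replicated into the vacated position. A minimal sketch (function name ours):

```python
def arith_shift_right(value, n=8):
    """Divide an n-bit two's complement value by 2 via an arithmetic right
    shift: the sign bit is copied into the vacated leftmost position."""
    bits = value & ((1 << n) - 1)
    sign = bits >> (n - 1)
    shifted = (bits >> 1) | (sign << (n - 1))   # replicate the sign bit
    return shifted - (1 << n) if shifted >= 1 << (n - 1) else shifted

print(arith_shift_right(12))    # 6
print(arith_shift_right(-12))   # -6
```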
In scientific notation, numbers are expressed in two parts: a fractional part called a
mantissa, and an exponential part that indicates the power of ten by which the
mantissa should be multiplied to obtain the value we need.
Unbiased Exponent
0 00101 10001000   (17₁₀ = 0.10001₂ × 2⁵)
Renormalizing, we retain the larger exponent and truncate the low-order bit:
0 10001 11101110
0 10010 11110000
The IEEE-754 single precision floating point standard uses an 8-bit exponent (with a
bias of 127) and a 23-bit significand. An exponent of 255 indicates a special value.
The IEEE-754 double precision standard uses an 11-bit exponent (with a bias of
1023) and a 52-bit significand. The “special” exponent value for a double precision
number is 2047, instead of the 255 used by the single precision standard.
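The single precision layout can be inspected directly with Python's standard `struct` module (an illustration; the function name is ours):

```python
import struct

def float_fields(x):
    """Unpack an IEEE-754 single precision value into its sign bit,
    biased exponent (8 bits, bias 127), and 23-bit fraction fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

print(float_fields(1.0))            # (0, 127, 0): exponent 0 stored as 0 + 127
print(float_fields(-2.5))           # (1, 128, 2097152)
print(float_fields(float("inf")))   # exponent 255 marks a special value
```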
The range of a numeric integer format is the difference between the largest and
smallest values that it can express.
The precision of a number indicates how much information we have about a value.
Accuracy refers to how closely a numeric representation approximates a true value.
Because of truncated bits, you cannot always assume that a particular floating point
operation is commutative or distributive.
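A classic demonstration that truncation breaks associativity of floating-point addition (the particular constants are our choice):

```python
# 1.0 is smaller than one ulp at 1e16, so adding it to 1e16 first loses it:
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0
```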
Both EBCDIC and ASCII were built around the Latin alphabet.
In 1991, a new international information exchange code called Unicode was introduced.
Unicode is a 16-bit alphabet that is downward compatible with ASCII and the Latin-1
character set.
Because the base coding of Unicode is 16 bits, it has the capacity to encode the
majority of characters used in every language of the world.
Unicode is currently the default character set of the Java programming language.
The Unicode codespace is divided into six parts. The first part is for Western
alphabet codes, including English, Greek, and Russian.
The lowest-numbered Unicode characters comprise the ASCII code.
The highest-numbered characters provide for user-defined codes.
Modulo-2 (carry-free) addition rules:
0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0
EXAMPLE 2.39 Find the quotient and remainder when 1001011₂ is divided by 1011₂.
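Modulo-2 long division replaces subtraction with XOR, as in CRC computation. A minimal sketch (function name ours) that reproduces Example 2.39:

```python
def mod2_divide(dividend, divisor):
    """Modulo-2 (carry-free, XOR-based) long division on bit strings;
    returns (quotient, remainder)."""
    dividend = [int(b) for b in dividend]
    divisor = [int(b) for b in divisor]
    steps = len(dividend) - len(divisor) + 1
    quotient = []
    for i in range(steps):
        if dividend[i] == 1:                 # leading 1: "subtract" via XOR
            quotient.append("1")
            for j, d in enumerate(divisor):
                dividend[i + j] ^= d
        else:
            quotient.append("0")
    remainder = "".join(map(str, dividend[steps:]))
    return "".join(quotient), remainder

print(mod2_divide("1001011", "1011"))   # ('1010', '101')
```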
n = m + r (a code word has n bits: m data bits plus r check bits)
The Hamming distance between two code words is the number of bits in which two
code words differ.
10001001
10110001
  ***
The Hamming distance between these two code words is 3.
The minimum Hamming distance, D(min), for a code is the smallest Hamming
distance between all pairs of words in the code.
Hamming codes can detect up to D(min) − 1 errors and correct ⌊(D(min) − 1)/2⌋ errors.
EXAMPLE 2.41
00000
01011
10110
11101
D(min) = 3. Thus, this code can detect up to two errors and correct one single-bit
error.
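Both the pairwise distance and D(min) for Example 2.41 can be checked with a few lines of Python (an illustration; names ours):

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length code words differ."""
    return sum(x != y for x, y in zip(a, b))

code = ["00000", "01011", "10110", "11101"]
d_min = min(hamming_distance(a, b) for a, b in combinations(code, 2))
print(hamming_distance("10001001", "10110001"))  # 3
print(d_min)                                     # 3: detect 2 errors, correct 1
```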
We are focused on single-bit errors. An error could occur in any of the n bits, so each
code word can be associated with n erroneous words at a Hamming distance of 1.
Therefore, we have n + 1 bit patterns for each code word: one valid code word and n
erroneous words. With n-bit code words, we have 2ⁿ possible bit patterns, of which
2ᵐ are valid code words built from m data bits (where n = m + r).
This gives us the inequality:
(n + 1) × 2ᵐ ≤ 2ⁿ
(m + r + 1) × 2ᵐ ≤ 2ᵐ⁺ʳ, or
(m + r + 1) ≤ 2ʳ
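The final inequality tells us the smallest number of check bits r for m data bits; a minimal sketch (function name ours):

```python
def parity_bits_needed(m):
    """Smallest r satisfying (m + r + 1) <= 2**r, the Hamming bound for
    single-bit error correction over m data bits."""
    r = 1
    while (m + r + 1) > 2 ** r:
        r += 1
    return r

print(parity_bits_needed(8))   # 4: giving 12-bit code words, as in the example below
print(parity_bits_needed(4))   # 3: the classic (7, 4) Hamming code
```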
Let’s introduce an error in bit position b9, resulting in the code word:
Bit:       0  1  0  1  1  1  0  1  0  1  1  0
Position: 12 11 10  9  8  7  6  5  4  3  2  1
We found that parity bits 1 and 8 produced an error, and 1 + 8 = 9, which is exactly
where the error occurred.
If we expect errors to occur in blocks, it stands to reason that we should use an error-
correcting code that operates at a block level, as opposed to a Hamming code, which
operates at the bit level.
A Reed-Solomon (RS) code can be thought of as a CRC that operates over entire
characters instead of only a few bits.
RS codes, like CRCs, are systematic: the parity bytes are appended to a block of
information bytes.
RS(n, k) codes are defined using the following parameters:
o s = The number of bits in a character (or “symbol”).
o k = The number of s-bit characters comprising the data block.
o n = The number of s-bit symbols in the code word.
RS (n, k) can correct (n-k)/2 errors in the k information bytes.
Reed-Solomon error-correction algorithms lend themselves well to implementation in
computer hardware.
They are implemented in high-performance disk drives for mainframe computers as
well as compact disks used for music and data storage. These implementations will be
described in Chapter 7.
Computers store data in the form of bits, bytes, and words using the binary
numbering system.
Hexadecimal numbers are formed using four-bit groups called nibbles (or nybbles).
Signed integers can be stored in one’s complement, two’s complement, or signed
magnitude representation.
Floating-point numbers are usually coded using the IEEE 754 floating-point
standard.
Character data is stored using ASCII, EBCDIC, or Unicode.
Error detecting and correcting codes are necessary because we can expect no
transmission or storage medium to be perfect.
CRC, Reed-Solomon, and Hamming codes are three important error control codes.