Data Representation
CHAPTER GOALS
Any data and information processor, whether organic, mechanical, electrical, or optical, must be
capable of the following:
Computer systems represent data electrically and process it with electrical switches. Two-state (on
and off) electrical switches are well suited for representing data that can be expressed in binary (1 or
0).
Electrical switches are combined to form processing circuits, which are then combined to form
processing subsystems and entire CPUs.
Automated data processing, therefore, combines physics (electronics) and mathematics.
In a decimal (base 10) number, each digit can have 1 of 10 possible values: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9.
In a binary number, each digit can have only one of two possible values: 0 or 1. Computers represent
data with binary numbers for two reasons. First, binary numbers represented as binary electrical signals
can be transported reliably between computer systems and their components (discussed in detail in
Chapter 8). Second, binary numbers represented as electrical signals can be processed by two-state
electrical devices that are easy to design and fabricate (discussed in detail in Chapter 4).
Binary numbers are also well suited to computer processing because they correspond
directly with values in Boolean logic. This form of logic is named for 19th-century mathematician
George Boole, who developed methods of reasoning and logical proof that use
sequences of statements that can be evaluated only as true or false.
In the decimal numbering system, the period or comma that separates the whole and fractional
parts of a number is called a decimal point. In other numbering systems, the term radix point
is used for the period or comma. Here's an example of a decimal value with a radix point:
The fractional portion of this real number is .368, and its value is interpreted as
follows:
(3 x 10 ^ -1) + (6 x 10 ^ -2) + (8 x 10 ^ -3) = 0.3 + 0.06 + 0.008 = 0.368
To convert a binary value to its decimal equivalent, use the following procedure:
1. Determine each position weight by raising 2 to the number of positions left (+) or right (-) of the radix point.
2. Multiply each digit by its position weight.
3. Sum all the values calculated in Step 2.
Figure 3.2 shows how the binary number 101101.101 is converted to its
decimal equivalent, 45.625.
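The three-step procedure can be sketched in Python. This is a minimal illustration, not a library routine; the function name binary_to_decimal is hypothetical, and the input is assumed to be a string of 0s and 1s with at most one radix point.

```python
def binary_to_decimal(bits: str) -> float:
    """Convert a binary string such as '101101.101' to its decimal value."""
    whole, _, fraction = bits.partition(".")
    if any(digit not in "01" for digit in whole + fraction):
        raise ValueError("only the digits 0 and 1 are allowed")

    total = 0.0
    # Left of the radix point the weights are 2^0, 2^1, 2^2, ... moving leftward.
    for position, digit in enumerate(reversed(whole)):
        total += int(digit) * 2 ** position
    # Right of the radix point the weights are 2^-1, 2^-2, 2^-3, ...
    for position, digit in enumerate(fraction, start=1):
        total += int(digit) * 2 ** -position
    return total


print(binary_to_decimal("101101.101"))  # 45.625
```

Each loop performs Steps 1 and 2 together (weight, then multiply), and the running total is the sum from Step 3.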
Hexadecimal Notation
Hexadecimal numbering uses 16 as its base or radix (hex = 6 and decimal = 10). There aren't enough
numeric symbols (Arabic numerals) to represent 16 different values, so English letters represent the
larger values: A through F stand for the decimal values 10 through 15.
The primary advantage of hexadecimal notation, compared with binary
notation, is its compactness. Large numeric values expressed in
binary notation require four times as many digits as those expressed
in hexadecimal notation. For example, the data content of a byte
requires eight binary digits (such as 11110000) but only two
hexadecimal digits (such as F0). This compact representation helps
reduce programmer error. Hexadecimal numbers often designate memory
addresses.
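The byte example can be reproduced with Python's built-in format function. This is only a small sketch; the variable names are illustrative.

```python
byte_value = 0b11110000  # the eight-bit example from the text

binary_text = format(byte_value, "08b")  # eight binary digits
hex_text = format(byte_value, "02X")     # two hexadecimal digits
print(binary_text, hex_text)             # 11110000 F0

# Each hexadecimal digit stands for exactly one group of four bits.
nibbles = [binary_text[i:i + 4] for i in range(0, 8, 4)]
hex_digits = [format(int(nibble, 2), "X") for nibble in nibbles]
print(nibbles, hex_digits)               # ['1111', '0000'] ['F', '0']
```

Because 16 is 2 raised to the fourth power, the conversion never needs arithmetic across groups: each four-bit group maps to one hexadecimal digit on its own.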
Octal Notation
Some OSs and machine programming languages use octal notation. Octal notation uses
the base-8 numbering system and has a range of digits from 0 to 7. Large numeric values
expressed in octal notation are about one-third the length of corresponding binary notation and
about one-third longer than corresponding hexadecimal notation.
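One way to check how the three notations compare in length is to count digits directly. The sketch below uses a hypothetical 24-bit value, chosen because 24 is divisible by both 3 (bits per octal digit) and 4 (bits per hexadecimal digit).

```python
value = 2 ** 24 - 1  # hypothetical example: a 24-bit value with every bit set

lengths = {
    "binary": len(format(value, "b")),
    "octal": len(format(value, "o")),
    "hexadecimal": len(format(value, "x")),
}
print(lengths)  # {'binary': 24, 'octal': 8, 'hexadecimal': 6}
```

An octal digit encodes three bits and a hexadecimal digit encodes four, which is where the 24, 8, and 6 digit counts come from.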
Compactness
Range
Accuracy
Ease of manipulation
Standardization
Accuracy
Although compact data formats can minimize hardware's complexity and
cost, they do so at the expense of accurate data representation. The
accuracy, or precision, of representation increases with the number
of data bits used.
Ease of Manipulation
When discussing computer processing, manipulation refers to executing
processor instructions, such as addition, subtraction, and equality
comparisons, and ease refers to machine efficiency. A processor's
efficiency depends on its complexity (the number of its primitive
components and the complexity of the wiring that binds them
together). Efficient processor circuits perform their functions
quickly because of the small number of components and the short
distance electricity must travel. More complex devices need more time
to perform their functions. Data representation formats vary in their
capability to support efficient processing.
Standardization
Data must be communicated between devices in a single computer and to
other computers via networks. To ensure correct and efficient data
transmission, data formats must be suitable for a wide variety of
devices and computers. For this reason, several organizations have
created standard data-encoding methods (discussed later in the
Character Data section). Adhering to these standards gives computer
users the flexibility to combine hardware from different vendors with
minimal data communication problems.
Integers
An integer is a whole number, a value that doesn't have a fractional
part. For example, the values 2, 3, 9, and 129 are integers, but the
value 12.34 is not. Integer data formats can be signed or unsigned.
Most CPUs provide an unsigned integer data type, which stores
positive integer values as ordinary binary numbers. An unsigned
integer's value is always assumed to be positive.
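As a rough illustration, an unsigned integer stored in n bits can hold values from 0 through 2^n - 1. The helper names below are hypothetical, and the 8-bit width is only an example.

```python
def unsigned_range(bits: int) -> tuple[int, int]:
    """Smallest and largest values an unsigned integer of this width can hold."""
    return 0, 2 ** bits - 1


def to_unsigned_bits(value: int, bits: int) -> str:
    """Return the ordinary binary pattern of value, padded to the given width."""
    low, high = unsigned_range(bits)
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {bits} unsigned bits")
    return format(value, f"0{bits}b")


print(unsigned_range(8))         # (0, 255)
print(to_unsigned_bits(129, 8))  # 10000001
```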
Excess Notation
One format that can be used to represent signed integers is excess
notation, which always uses a fixed number of bits, with the leftmost
bit representing the sign.
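A minimal sketch of excess notation follows, assuming the common convention in which an n-bit format adds a bias of 2^(n-1) to the value before storing it as an ordinary binary number. The function names and the 8-bit default are illustrative choices.

```python
def to_excess(value: int, bits: int = 8) -> str:
    """Encode a signed integer in excess notation with a bias of 2^(bits-1)."""
    bias = 2 ** (bits - 1)
    if not -bias <= value < bias:
        raise ValueError(f"{value} does not fit in {bits}-bit excess notation")
    return format(value + bias, f"0{bits}b")


def from_excess(pattern: str) -> int:
    """Decode an excess-notation bit pattern back to a signed integer."""
    bias = 2 ** (len(pattern) - 1)
    return int(pattern, 2) - bias


print(to_excess(0))     # 10000000
print(to_excess(-1))    # 01111111
print(to_excess(127))   # 11111111
print(to_excess(-128))  # 00000000
```

Under this convention the leftmost bit is 1 for zero and positive values and 0 for negative values, so it acts as the sign indicator.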
Twos Complement Notation
Real Numbers
A real number can contain both whole and fractional components. The
fractional portion is represented by digits to the right of the radix
point. For example, the following computation uses real number data
inputs and generates a real number output:
This is the equivalent computation in binary notation:
Floating-Point Notation
One way of dealing with the tradeoff between range and precision is
to abandon the concept of a fixed radix point. To represent extremely
small (precise) values, move the radix point far to the left. For
example, the following value has only a single digit to the left of
the radix point: 0.0000000013526473
Similarly, very large values can be represented by moving the radix
point far to the right, as in this example: 1352647300000000.0
The mantissa holds the bits that are interpreted to derive the real
number's digits. By convention, the mantissa is assumed to be
preceded by a radix point. The exponent value indicates the radix
point's position.
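Python's standard math.frexp function gives a convenient way to see a mantissa and exponent pair. It returns a mantissa between 0.5 and 1 (a value that begins immediately after the radix point) and a power-of-2 exponent. This is a sketch of the idea only; hardware floating-point formats store these fields in their own bit layouts.

```python
import math

value = 45.625  # binary 101101.101

mantissa, exponent = math.frexp(value)
print(mantissa, exponent)  # 0.712890625 6

# The mantissa is binary 0.101101101; the exponent says to move the
# radix point six positions to the right, giving 101101.101.
rebuilt = math.ldexp(mantissa, exponent)  # mantissa * 2 ** exponent
print(rebuilt)  # 45.625
```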
1.11111111111111111111111 x 2 ^ 111111
1/3 = 0.33333333
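The truncated value of 1/3 above hints at a general limit: a fixed number of digits can't hold every real value exactly. The sketch below assumes Python floats are 64-bit binary floating-point values, which is the case on nearly all current platforms, and uses the standard fractions module to expose the stored value.

```python
from fractions import Fraction

exact = Fraction(1, 3)    # the true value of one-third
stored = Fraction(1 / 3)  # the exact value of the float closest to one-third

error = abs(stored - exact)
print(stored == exact)             # False
print(error < Fraction(1, 2**53))  # True: the error is tiny but not zero
```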
Processing Complexity
The difficulty of learning to use scientific and floating-point
notation is understandable. These formats are far more complex than
integer data formats, and the complexity affects
both people and computers. Although floating-point formats are
optimized for processing efficiency, they still require complex
processing circuitry. The simpler twos complement format used for
integers requires much less complex circuitry.
Character Data
In their written form, English and many other languages use
alphabetic letters, numerals, punctuation marks, and a variety of
other special-purpose symbols, such as $ and &. Each symbol is a
character. A sequence of characters that forms a meaningful word,
phrase, or other useful group is a string. In most programming
languages, single characters are surrounded by single quotation marks
('c'), and strings are surrounded by double quotation marks
("computer").
ASCII
The American Standard Code for Information Interchange (ASCII),
adopted in the United States in the 1960s, is a widely used coding
method in data communication. The international equivalent of this
coding method is International Alphabet 5 (IA5), an International
Organization for Standardization (ISO) standard. Almost all computers
and OSs support ASCII, although a gradual migration is in progress to
its newer relative, Unicode. ASCII is a 7-bit format because most
computers and peripheral devices transmit data in bytes and because
parity checking was used widely in the 1960s to 1980s for detecting
transmission errors: seven data bits leave the eighth bit of each
byte free to carry a parity bit.
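The pairing of a 7-bit code with parity checking can be sketched as follows. This assumes even parity carried in the high-order bit of each byte, which is one common convention rather than the only one; the function names are hypothetical.

```python
def add_even_parity(char: str) -> int:
    """Return a byte holding a 7-bit ASCII code plus an even-parity bit."""
    code = ord(char)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    parity = bin(code).count("1") % 2  # 1 if the code has an odd number of 1 bits
    return (parity << 7) | code        # parity bit goes in the eighth position


def parity_ok(byte: int) -> bool:
    """A received byte passes the check if its count of 1 bits is even."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity("C")   # 'C' is 1000011, three 1 bits, so parity is 1
print(format(sent, "08b"))    # 11000011
print(parity_ok(sent))        # True
print(parity_ok(sent ^ 0x04)) # False: a single flipped bit is detected
```

A single parity bit detects any one-bit error but can't tell which bit changed, and it misses errors that flip an even number of bits.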
Device Control
When text is printed or displayed on an output device, it's often
formatted in a particular way. For example, text output to a printer
is normally formatted in lines and paragraphs, and a customer record
can be displayed onscreen so that it looks like a printed form.
Certain text can be highlighted when printed or displayed by using
methods such as underlining, bold font, or reversed background and
foreground colors.
Software and Hardware Support
Because characters are usually represented in the CPU as unsigned
integers, there's little or no need for special character-processing
instructions. Instructions that move and copy unsigned integers
behave the same whether the content being manipulated is an actual
numeric value or an ASCII-encoded character. Similarly, an equality
or inequality comparison instruction that works for unsigned integers
also works for values representing characters.
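This can be seen from a high-level language as well. In the sketch below, Python's built-in ord returns the integer code behind a character, and comparing two characters gives the same answer as comparing their codes.

```python
code_a = ord("a")  # 97
code_b = ord("b")  # 98
print(code_a, code_b)

# Comparing characters gives the same result as comparing their integer codes.
print(("a" < "b") == (code_a < code_b))  # True

# An ASCII-encoded string is just a sequence of small unsigned integers.
codes = list("CPU".encode("ascii"))
print(codes)  # [67, 80, 85]
```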
ASCII Limitations
ASCII's designers couldn't foresee the code's long lifetime (almost
50 years) or the revolutions in I/O device technologies that would
take place. They never envisioned modern I/O device characteristics,
such as color, bitmapped graphics, and selectable fonts.
Unfortunately, ASCII doesn't have the range to define enough control
codes to account for all the formatting and display capabilities in
modern I/O devices.
Unicode
The Unicode Consortium (www.unicode.org) was founded in 1991
to develop a multilingual character-encoding standard encompassing
all written languages. The original members were Apple Computer
Corporation and Xerox Corporation, but many computer companies soon
joined. This effort has enabled software and data to cross
international boundaries. Major Unicode standard releases are
coordinated with ISO standard 10646. As of this writing, the latest
standard is Unicode 5.2, published in October 2009. Like ASCII,
Unicode is a coding table that assigns nonnegative integers to
represent printable characters.
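Python strings are sequences of Unicode characters, so the integer assigned to a character can be inspected with ord. The helper name code_point is hypothetical, and the sample characters are arbitrary; escapes are used so the example doesn't depend on the file's encoding.

```python
def code_point(char: str) -> str:
    """Format a character's Unicode integer in the usual U+XXXX style."""
    return f"U+{ord(char):04X}"


samples = ["A", "\u00e9", "\u20ac", "\u4e2d"]  # A, e with acute accent, euro sign, a CJK character
for char in samples:
    print(char, ord(char), code_point(char))

# The first 128 Unicode values match ASCII, so 'A' is 65 in both tables.
```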
Boolean Data
The Boolean data type has only two data values: true and false. Most
people don't think of Boolean values as data items, but the primitive
nature of CPU processing requires the capability to store and
manipulate Boolean values. Recall that processing circuitry
physically transforms input signals into output signals. If the input
signals represent numbers and the processor is performing a
computational function, such as addition, the output signal
represents the numerical result. When the processing function is a
comparison operation, such as greater than or equal to, the output
signal represents a Boolean result of true or false. This result is
stored in a register (as is any other processing result) and can be
used by other instructions as input (for example, a conditional
branch in a program). The Boolean data type is
potentially the most concise coding format because only a single bit
is required. For example, binary 1 can represent true, and 0 can
represent false. To simplify processor design and implementation,
most CPU designers seek to minimize the number of different coding
formats used. CPUs generally use an integer coding format
to represent Boolean values. When coded in this manner, the
integer value zero corresponds to false, and all nonzero values
correspond to true. To conserve memory and storage, programmers
sometimes pack many Boolean values into a single integer by using
each bit to represent a separate Boolean value. Although this method
conserves memory, it generally slows program execution because of the
extra instructions required to extract and interpret separate
bits from an integer data item.
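Packing Boolean values into one integer can be sketched with shift and mask operations. The function names are hypothetical, and bit 0 is assumed to hold the first flag.

```python
def pack_flags(flags: list[bool]) -> int:
    """Pack Boolean values into one integer, one bit per value (bit 0 first)."""
    packed = 0
    for position, flag in enumerate(flags):
        if flag:
            packed |= 1 << position
    return packed


def get_flag(packed: int, position: int) -> bool:
    """Extract one Boolean value: shift the wanted bit down, then mask it."""
    return (packed >> position) & 1 == 1


packed = pack_flags([True, False, True, True])
print(format(packed, "04b"))  # 1101
print(get_flag(packed, 1))    # False
print(get_flag(packed, 2))    # True

# Unpacked, an integer is treated as false when zero and true otherwise.
print(bool(0), bool(5))       # False True
```

The shift and mask in get_flag are the extra work the text refers to: reading one packed value takes more steps than testing a whole integer against zero.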
Memory Addresses
DATA STRUCTURES