
CHAPTER 3

DATA REPRESENTATION

CHAPTER GOALS

Describe numbering systems and their use in data representation
Compare different data representation methods
Summarize the CPU data types and explain how nonnumeric data is represented
Describe common data structures and their uses

People can understand and manipulate data represented in a variety of forms.

Any data and information processor, whether organic, mechanical, electrical, or optical, must be
capable of the following:

Recognizing external data and converting it to an internal format
Storing and retrieving data internally
Transporting data between internal storage and processing components
Manipulating data to produce results or decisions

Note that these capabilities correspond roughly to the computer system components described in Chapter 2: I/O units, primary and secondary storage, the system bus, and the CPU.

Automated Data Processing

Computer systems represent data electrically and process it with electrical switches. Two-state (on and off) electrical switches are well suited for representing data that can be expressed in binary (1 or 0). Electrical switches are combined to form processing circuits, which are then combined to form processing subsystems and entire CPUs.
Automated data processing, therefore, combines physics (electronics) and mathematics.

This relationship between mathematics and physics underlies all automated computation devices, from mechanical clocks (using the mathematical ratios of gears) to electronic microprocessors (using the mathematics of electrical voltage and resistance).

Basing computer processing on mathematics and physics has limits, however. Processing operations must be based on mathematical functions, such as addition and equality comparison; use numerical data inputs; and generate numerical outputs. These processing functions are sufficient when a computer performs numeric tasks, such as accounting or statistical analysis.

However, when you want to use a computer to manipulate data with no obvious numeric equivalent (for example, literary or philosophical analysis of concepts such as mother, friend, love, and hate), numeric-processing functions have major shortcomings. As the data you want to process moves further away from numbers, applying computer technology to processing the data becomes increasingly difficult and less successful.

Binary Data Representation

In a decimal (base 10) number, each digit can have 1 of 10 possible values: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. In a binary number, each digit can have only one of two possible values: 0 or 1. Computers represent data with binary numbers for two reasons. First, binary numbers represented as electrical signals can be transported reliably between computer systems and their components (discussed in detail in Chapter 8). Second, binary numbers represented as electrical signals can be processed by two-state electrical devices that are easy to design and fabricate (discussed in detail in Chapter 4).

Binary numbers are also well suited to computer processing because they correspond directly with values in Boolean logic. This form of logic is named for 19th-century mathematician George Boole, who developed methods of reasoning and logical proof that use sequences of statements that can be evaluated only as true or false.
The symbol used to represent a digit and the digit's position in a string determine its value. The value of the entire string is the sum of the values of all digits in the string.

For example, in the decimal numbering system, the number 5689 is interpreted as follows:

(5 x 1000) + (6 x 100) + (8 x 10) + (9 x 1) = 5000 + 600 + 80 + 9 = 5689

The same series of operations can be represented in columnar form, with positions of the same value aligned in columns:

  5000
   600
    80
+    9
  5689

In the decimal numbering system, the period or comma is called a decimal point. In other numbering systems, the term radix point is used for the period or comma. Consider a decimal value whose fractional portion is .368. The fractional portion is represented by digits to the right of the radix point, and its value is interpreted as follows:

(3 x 0.1) + (6 x 0.01) + (8 x 0.001) = 0.3 + 0.06 + 0.008 = 0.368
To convert a binary value to its decimal equivalent, use the following procedure:
1. Determine each position weight by raising 2 to the number of positions left (+) or right (-) of the radix point.
2. Multiply each digit by its position weight.
3. Sum all the values calculated in Step 2.

Figure 3.2 shows how the binary number 101101.101 is converted to its
decimal equivalent, 45.625.
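The three-step procedure can be sketched in Python; the function name and the use of a string argument are illustrative choices, not from the text:

```python
def binary_to_decimal(bits: str) -> float:
    """Convert a binary string with an optional radix point to decimal."""
    whole, _, frac = bits.partition(".")
    value = 0.0
    # Steps 1 and 2: weight each digit by 2 raised to its position,
    # then multiply the digit by that weight.
    for i, digit in enumerate(reversed(whole)):
        value += int(digit) * 2 ** i
    for i, digit in enumerate(frac, start=1):
        value += int(digit) * 2 ** -i
    # Step 3: the running sum is the decimal equivalent.
    return value

print(binary_to_decimal("101101.101"))  # 45.625
```

Running the function on 101101.101 reproduces the result shown in Figure 3.2.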
Hexadecimal Notation

Hexadecimal numbering uses 16 as its base or radix (hexa = 6 and decimal = 10). There aren't enough numeric symbols (Arabic numerals) to represent 16 different values, so the English letters A through F represent the values 10 through 15.
The primary advantage of hexadecimal notation, compared with binary
notation, is its compactness. Large numeric values expressed in
binary notation require four times as many digits as those expressed
in hexadecimal notation. For example, the data content of a byte
requires eight binary digits (such as 11110000) but only two
hexadecimal digits (such as F0). This compact representation helps
reduce programmer error. Hexadecimal numbers often designate memory
addresses.
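Python's format specifiers make the compactness comparison easy to see (the variable name is arbitrary):

```python
byte = 0b11110000
print(f"{byte:08b}")  # eight binary digits: 11110000
print(f"{byte:02X}")  # only two hexadecimal digits: F0
# Memory addresses are conventionally written in hex, e.g. 0xF0.
```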

Octal Notation
Some OSs and machine programming languages use octal notation. Octal notation uses the base-8 numbering system and has a range of digits from 0 to 7. Large numeric values expressed in octal notation are one-third the length of corresponding binary notation (each octal digit encodes 3 bits) and about one-third longer than corresponding hexadecimal notation (each hexadecimal digit encodes 4 bits).
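A quick Python check of the relative lengths (the 12-bit value is an arbitrary example):

```python
value = 0b111111111111      # twelve binary digits
print(f"{value:o}")  # 7777 -> four octal digits (3 bits each)
print(f"{value:X}")  # FFF  -> three hexadecimal digits (4 bits each)
```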

GOALS OF COMPUTER DATA REPRESENTATION

Any representation format for numeric data represents a balance among several factors, including the following:

Compactness
Range
Accuracy
Ease of manipulation
Standardization

Compactness and Range
The term compactness (or size) describes the number of bits used to represent a numeric value. Compact representation formats use fewer bits to represent a value, but they're limited in the range of values they can represent.

Accuracy
Although compact data formats can minimize hardware's complexity and
cost, they do so at the expense of accurate data representation. The
accuracy, or precision, of representation increases with the number
of data bits used.

Ease of Manipulation
When discussing computer processing, manipulation refers to executing
processor instructions, such as addition, subtraction, and equality
comparisons, and ease refers to machine efficiency. A processor's
efficiency depends on its complexity (the number of its primitive
components and the complexity of the wiring that binds them
together). Efficient processor circuits perform their functions
quickly because of the small number of components and the short
distance electricity must travel. More complex devices need more time
to perform their functions. Data representation formats vary in their
capability to support efficient processing.

Standardization
Data must be communicated between devices in a single computer and to
other computers via networks. To ensure correct and efficient data
transmission, data formats must be suitable for a wide variety of
devices and computers. For this reason, several organizations have
created standard data-encoding methods (discussed later in the
Character Data section). Adhering to these standards gives computer
users the flexibility to combine hardware from different vendors with
minimal data communication problems.

CPU DATA TYPES


The CPUs of most modern computers can represent and process at least
the following primitive data types:
Integer
Real number
Character
Boolean
Memory address

Integers
An integer is a whole number: a value that doesn't have a fractional part. For example, the values 2, 3, 9, and 129 are integers, but the value 12.34 is not. Integer data formats can be signed or unsigned.
Most CPUs provide an unsigned integer data type, which stores
positive integer values as ordinary binary numbers. An unsigned
integer s value is always assumed to be positive.

A signed integer uses one bit to represent whether the value is


positive or negative. The choice of bit value (0 or 1) to represent
the sign (positive or negative) is arbitrary. The sign bit is
normally the high-order bit in a numeric data format. In most data
formats, it's 1 for a negative number and 0 for a nonnegative number.
(Note that 0 is a nonnegative number.)

Excess Notation
One format that can be used to represent signed integers is excess notation, which always uses a fixed number of bits, with the leftmost bit representing the sign.
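As a sketch, assuming the common convention of a bias equal to 2^(bits-1) (excess-8 for 4 bits), excess notation can be modeled in Python as follows; the function name is illustrative:

```python
def to_excess(value: int, bits: int) -> str:
    """Encode a signed integer in excess notation with bias 2**(bits - 1)."""
    bias = 2 ** (bits - 1)
    stored = value + bias  # shift the value so the stored form is nonnegative
    if not 0 <= stored < 2 ** bits:
        raise OverflowError("value out of range for this bit width")
    return f"{stored:0{bits}b}"

print(to_excess(3, 4))   # 1011: leftmost bit 1 marks a nonnegative value
print(to_excess(-3, 4))  # 0101: leftmost bit 0 marks a negative value
```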
Twos Complement Notation

In the binary numbering system, the complement of 0 is 1, and the complement of 1 is 0. The complement of a bit string is formed by substituting 0 for all values of 1 and 1 for all values of 0. For example, the complement of 1010 is 0101. This transformation is the basis of twos complement notation. In this notation, nonnegative integer values are represented as ordinary binary values. For example, a twos complement representation of decimal 7 using 4 bits is 0111.
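A minimal Python sketch of fixed-width twos complement encoding (function name illustrative):

```python
def twos_complement(value: int, bits: int) -> str:
    """Encode a signed integer in twos complement with a fixed bit width."""
    if value < 0:
        # Equivalent to complementing each bit of |value| and adding 1.
        value += 1 << bits
    return f"{value:0{bits}b}"

print(twos_complement(7, 4))   # 0111
print(twos_complement(-7, 4))  # 1001
```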

Real Numbers
A real number can contain both whole and fractional components. The
fractional portion is represented by digits to the right of the radix
point. For example, the following computation uses real number data
inputs and generates a real number output:
This is the equivalent computation in binary notation:

Floating-Point Notation
One way of dealing with the tradeoff between range and precision is
to abandon the concept of a fixed radix point. To represent extremely
small (precise) values, move the radix point far to the left. For
example, the following value has only a single digit to the left of
the radix point: 0.0000000013526473
Similarly, very large values can be represented by moving the radix
point far to the right, as in this example: 1352647300000000.0

Real numbers are represented in computers by using floating-point notation, which is similar to scientific notation except that 2 (rather than 10) is the base. A numeric value is derived from a floating-point bit string according to the following formula:

Value = mantissa x 2^exponent

The mantissa holds the bits that are interpreted to derive the real number's digits. By convention, the mantissa is assumed to be preceded by a radix point. The exponent value indicates the radix point's position.
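Python's standard library can decompose a float this way; math.frexp returns a mantissa in [0.5, 1), one common normalization convention:

```python
import math

# math.frexp splits a float into (mantissa, exponent) such that
# value = mantissa * 2**exponent, with 0.5 <= |mantissa| < 1.
mantissa, exponent = math.frexp(45.625)
print(mantissa, exponent)        # 0.712890625 6
print(mantissa * 2 ** exponent)  # 45.625
```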

The Institute of Electrical and Electronics Engineers (IEEE) addressed the need for standard floating-point formats in standard 754, which defines the following formats for floating-point data:
binary32: 32-bit format for base 2 values
binary64: 64-bit format for base 2 values
binary128: 128-bit format for base 2 values
decimal64: 64-bit format for base 10 values
decimal128: 128-bit format for base 10 values

Range, Overflow, and Underflow


The number of bits in a floating-point string and the formats of the
mantissa and exponent impose limits on the range of values that can
be represented. The number of digits in the mantissa determines the
number of significant (nonzero) digits in the largest and smallest
values that can be represented. The number of digits in the exponent
determines the number of possible bit positions to the right or left
of the radix point. Using the number of bits assigned to mantissa and
exponent, the largest absolute value of a floating-point value
appears to be the following:

1.11111111111111111111111 x 2 ^ 111111

Floating-point numbers with large absolute values have large positive exponents. When overflow occurs, it always occurs in the exponent. Floating-point representation is also subject to a related error condition called underflow. Very small numbers are represented by negative exponents. Underflow occurs when the absolute value of a negative exponent is too large to fit in the bits allocated to store it.
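Both conditions can be demonstrated by rounding Python's native binary64 floats to binary32 with the standard struct module (the helper name is illustrative):

```python
import struct

def to_binary32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Underflow: the exponent is too negative for binary32, so the value
# collapses silently to zero.
print(to_binary32(1e-50))  # 0.0

# Overflow: the exponent is too large for binary32, so packing fails.
try:
    to_binary32(1e45)
except OverflowError:
    print("overflow in the exponent")
```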

Precision and Truncation


Recall that scientific notation, including floating-point notation,
trades numeric range for accuracy. Accuracy is reduced as the number
of digits available to store the mantissa is reduced. The 23-bit
mantissa used in the binary32 format represents approximately seven
decimal digits of precision. However, many useful numbers contain
more than seven nonzero decimal digits, such as the decimal
equivalent of the fraction 1/3:

1/3 = 0.33333333...

The number of digits to the right of the decimal point is infinite, but only a limited number of mantissa digits are available. Numbers such as 1/3 are stored in floating-point format by truncation. The numeric value is stored in the mantissa, starting with its most significant bit, until all available bits are used.
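The truncation of 1/3 is visible when the value is rounded to binary32 with Python's struct module (variable name illustrative):

```python
import struct

# Round 1/3 to binary32, which keeps a 23-bit mantissa
# (about seven decimal digits of precision).
third32 = struct.unpack("<f", struct.pack("<f", 1 / 3))[0]
print(f"{third32:.10f}")  # diverges from 1/3 after roughly the 7th digit
print(third32 == 1 / 3)   # False: low-order mantissa bits were lost
```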

Processing Complexity
The difficulty of learning to use scientific and floating-point
notation is understandable. These formats are far more complex than
integer data formats, and the complexity affects
both people and computers. Although floating-point formats are
optimized for processing efficiency, they still require complex
processing circuitry. The simpler twos complement format used for
integers requires much less complex circuitry.

Character Data
In their written form, English and many other languages use
alphabetic letters, numerals, punctuation marks, and a variety of
other special-purpose symbols, such as $ and &. Each symbol is a
character. A sequence of characters that forms a meaningful word,
phrase, or other useful group is a string. In most programming
languages, single characters are surrounded by single quotation marks
('c'), and strings are surrounded by double quotation marks
("computer").

The following sections describe some common coding methods for character data.
EBCDIC
Extended Binary Coded Decimal Interchange Code (EBCDIC) is a
character-coding method developed by IBM in the 1960s and used in all
IBM mainframes well into the 2000s. Recent IBM mainframes and
mainframe OSs support more recent character-coding methods, but
support for EBCDIC is still maintained for backward compatibility.
EBCDIC characters are encoded as strings of 8 bits.

ASCII
The American Standard Code for Information Interchange (ASCII),
adopted in the United States in the 1970s, is a widely used coding
method in data communication. The international equivalent of this
coding method is International Alphabet 5 (IA5), an International
Organization for Standardization (ISO) standard. Almost all computers
and OSs support ASCII, although a gradual migration is in progress to
its newer relative, Unicode. ASCII is a 7-bit format: most computers and peripheral devices transmit data in 8-bit bytes, and from the 1960s through the 1980s the eighth bit was widely used for parity checking to detect transmission errors.
Device Control
When text is printed or displayed on an output device, often it's
formatted in a particular way. For example, text output to a printer
is normally formatted in lines and paragraphs, and a customer record
can be displayed onscreen so that it looks like a printed form.
Certain text can be highlighted when printed or displayed by using
methods such as underlining, bold font, or reversed background and
foreground colors.
Software and Hardware Support
Because characters are usually represented in the CPU as unsigned
integers, there's little or no need for special character-processing
instructions. Instructions that move and copy unsigned integers
behave the same whether the content being manipulated is an actual
numeric value or an ASCII-encoded character. Similarly, an equality
or inequality comparison instruction that works for unsigned integers
also works for values representing characters.
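A short Python illustration, using Python's built-in ord and chr rather than CPU instructions:

```python
# Characters are stored as small unsigned integers, so ordinary integer
# comparison also works as character comparison.
print(ord("A"))            # 65: the ASCII code for 'A'
print("A" < "B")           # True, because 65 < 66
print(chr(ord("a") - 32))  # A: upper- and lowercase differ by a fixed offset
```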

ASCII Limitations
ASCII's designers couldn't foresee the code's long lifetime (almost 50 years) or the revolutions in I/O device technologies that would
take place. They never envisioned modern I/O device characteristics,
such as color, bitmapped graphics, and selectable fonts.
Unfortunately, ASCII doesn t have the range to define enough control
codes to account for all the formatting and display capabilities in
modern I/O devices.
Unicode
The Unicode Consortium (www.unicode.org) was founded in 1991 to develop a multilingual character-encoding standard encompassing
all written languages. The original members were Apple Computer
Corporation and Xerox Corporation, but many computer companies soon
joined. This effort has enabled software and data to cross
international boundaries. Major Unicode standard releases are
coordinated with ISO standard 10646. As of this writing, the latest
standard is Unicode 5.2, published in October 2009. Like ASCII,
Unicode is a coding table that assigns nonnegative integers to
represent printable characters.

ASCII is a subset of Unicode. An important difference between ASCII and Unicode is the size of the coding table. Early versions of
Unicode used 16-bit code, which provided 65,536 table entries
numbered 0 through 65,535. As development efforts proceeded, the
number of characters exceeded the capacity of a 16-bit code. Later
Unicode versions use 16-bit or 32-bit codes, and the standard
currently encompasses more than 100,000 characters. The additional
table entries are used primarily for characters, strokes, and
ideographs of languages other than English and its Western European
siblings. Unicode includes many other alphabets, such as Arabic,
Cyrillic, and Hebrew, and thousands of Chinese, Japanese, and Korean
ideographs and characters. Some extensions to ASCII device control
codes are also provided. As currently defined, Unicode can represent
written text from all modern languages. Approximately 6000 characters
are reserved for private use.
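The relationship between ASCII and Unicode code points can be checked directly in Python, whose strings are Unicode:

```python
# ASCII occupies the first 128 Unicode code points, so the two
# encodings agree for basic Latin characters.
print(ord("A"))     # 65 in both ASCII and Unicode
print(ord("Я"))     # 1071: Cyrillic capital Ya, beyond 7-bit ASCII
print(chr(0x4E2D))  # 中: an ideograph from the CJK block
```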

Boolean Data

The Boolean data type has only two data values: true and false. Most people don't think of Boolean values as data items, but the primitive nature of CPU processing requires the capability to store and manipulate Boolean values.

Recall that processing circuitry physically transforms input signals into output signals. If the input signals represent numbers and the processor is performing a computational function, such as addition, the output signal represents the numerical result. When the processing function is a comparison operation, such as greater than or equal to, the output signal represents a Boolean result of true or false. This result is stored in a register (as is any other processing result) and can be used by other instructions as input (for example, a conditional or an unconditional branch in a program).

The Boolean data type is potentially the most concise coding format because only a single bit is required. For example, binary 1 can represent true, and 0 can represent false. To simplify processor design and implementation, most CPU designers seek to minimize the number of different coding formats used. CPUs generally use an integer coding format to represent Boolean values. When coded in this manner, the integer value zero corresponds to false, and all nonzero values correspond to true.

To conserve memory and storage, programmers sometimes pack many Boolean values into a single integer by using each bit to represent a separate Boolean value. Although this method conserves memory, it generally slows program execution because of the complicated instructions required to extract and interpret separate bits from an integer data item.
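A Python sketch of this packing technique (function names are illustrative):

```python
def pack_flags(flags):
    """Pack a sequence of Boolean values into one integer, one bit each."""
    packed = 0
    for i, flag in enumerate(flags):
        if flag:
            packed |= 1 << i  # set bit i
    return packed

def get_flag(packed, i):
    """Extract bit i: shift it to the low-order position and mask it."""
    return bool((packed >> i) & 1)

packed = pack_flags([True, False, True, True])
print(bin(packed))          # 0b1101
print(get_flag(packed, 1))  # False
```

The shift-and-mask in get_flag is exactly the kind of extra instruction sequence that makes packed Booleans slower to process than one value per integer.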

Memory Addresses

As described in Chapter 2, primary storage is a series of contiguous bytes of storage. Conceptually, each memory byte has a unique identifying or serial number. The identifying numbers are called addresses, and their values start with zero and continue in sequence (with no gaps) until the last byte of memory is addressed. In some CPUs, this conceptual view is also the physical method of identifying and accessing memory locations. That is, memory bytes are identified by a series of nonnegative integers. This approach to assigning memory addresses is called a flat memory model. In CPUs using a flat memory model, using twos complement or unsigned binary as the coding format for memory addresses is logical and typical. The advantage of this approach is that it minimizes the number of different data types and the complexity of processor circuitry.
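A toy Python model of a flat memory model, using a bytearray to stand in for primary storage (purely illustrative):

```python
# In a flat memory model, every byte is identified by a nonnegative
# integer address, starting at zero with no gaps.
memory = bytearray(16)  # a toy "primary storage" of 16 contiguous bytes
memory[0] = 0x41        # write to the first address
memory[15] = 0xFF       # the last valid address is 15
print(memory[0], memory[15])  # 65 255
```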

DATA STRUCTURES

A data structure is a related group of primitive data elements organized for some type of common processing and is defined and manipulated in software. Computer hardware can't manipulate data structures directly; it must deal with them in terms of their primitive components, such as integers, floating-point numbers, single characters, and so on. Software must translate operations on data structures into equivalent machine instructions that operate on each primitive data element.

The complexity of data structures is limited only by programmers' imagination and skill. As a practical matter, certain data structures are useful in a wide variety of situations and are commonly used, such as character strings or arrays, records, and files. System software often provides application services to manipulate these commonly used data structures.
