Week 1 Tutorial - Data Representation
1 Introduction
In this tutorial, we consider the issue of data representation: how different sorts of information are represented inside a computer.
We will look at integers, characters and floating point numbers.
More complex data items, e.g. strings, colours, images, music etc. can be represented by groups of the above.
Before we start, there are several important things to realise:
1. Everything has to be represented by a pattern of bits; in other words, everything is a code. A specific value from the real world is encoded as a specific
pattern of bits in the computer. Some of the encodings are algorithmic (e.g. integers, floats), whereas for others the encoding is completely arbitrary (e.g.
characters).
2. For a fixed number of bits N, there are only 2^N bit patterns. If we choose fixed-size storage areas for data, this instantly constrains the number of real-world data values we can represent.
3. Above all, everything is just a bit pattern. While we tend to think that computers perform maths, in reality many of the instructions a computer performs
are comparisons of bit patterns (equality, inequality, ordering within a set etc.)
2 Integers
Since the early 1970s, computers have settled on groups of 8-bit bytes to represent data.
Groups of bytes can be used to create storage cells of different sizes. The most common data sizes today are:
8 bits, or 1 byte, which has 2^8 = 256 different bit patterns.
In an unsigned binary number, each column represents a power of two. For example:
128 64 32 16 8 4 2 1
0 1 1 0 1 0 0 1
decodes as 64 + 32 + 8 + 1 = 105.
In hexadecimal (hex) notation, each group of 4 bits is represented by a single character. As 4 bits has 2^4 = 16 patterns, we need 16 different characters.
The normal hexadecimal character set is 0 ... 9, followed by A ... F. We use this table to match each bit pattern to a character:
0000 = 0 1000 = 8
0001 = 1 1001 = 9
0010 = 2 1010 = A
0011 = 3 1011 = B
0100 = 4 1100 = C
0101 = 5 1101 = D
0110 = 6 1110 = E
0111 = 7 1111 = F
8-bit patterns are 2 hex digits, 16-bit patterns are 4 hex digits, 32-bit patterns are 8 hex digits.
We will use hex a lot in this subject, and we will follow the C and Java convention of writing hex numbers preceded by "0x", e.g. 0xFF, 0x304D, 0x00010E4A.
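As a quick check, here is a minimal C sketch (purely illustrative; the variable name is ours) showing the same 8-bit pattern from the earlier example printed in decimal and in hex using the 0x convention:

#include <stdio.h>

int main(void) {
    unsigned char x = 0x69;   /* the 8-bit pattern 0110 1001 */

    printf("decimal: %d\n", x);                /* prints 105 */
    printf("hex:     0x%02X\n", (unsigned)x);  /* prints 0x69 */
    return 0;
}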
Binary addition rules:
0+0=0
0+1=1
1+0=1
1+1=0 carry 1
Binary subtraction rules:
0-0=0
1-0=1
1-1=0
0-1=1 borrow 1
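These per-column rules are exactly what a CPU's adder implements: the sum bit of each column is the XOR of the inputs, and the carry out is generated when at least two of the inputs are 1. A small C sketch (our own illustration, not part of the tutorial exercises) that adds two bytes one column at a time using only these rules:

#include <stdio.h>

/* Add two 8-bit values column by column, using only the binary addition rules. */
static unsigned char add8(unsigned char a, unsigned char b) {
    unsigned char result = 0, carry = 0;
    for (int col = 0; col < 8; col++) {
        unsigned char abit = (a >> col) & 1;
        unsigned char bbit = (b >> col) & 1;
        /* sum bit for this column: XOR implements 0+0=0, 0+1=1, 1+0=1, 1+1=0 */
        unsigned char sum = abit ^ bbit ^ carry;
        /* carry out if at least two of the three inputs are 1 */
        carry = (abit & bbit) | (abit & carry) | (bbit & carry);
        result |= (unsigned char)(sum << col);
    }
    return result;  /* a final carry out of column 7 falls off the left-hand end */
}

int main(void) {
    printf("%d\n", add8(5, 3));   /* prints 8 */
    return 0;
}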
The next question is: how do we represent signed numbers? The only real difference is knowing if the number is a +ve or a -ve one.
Why not choose one of the bits in the pattern (usually the left-most bit) and treat it as the sign bit: 0 represents +ve, and 1 represents -ve.
This is known as signed-magnitude notation. While it works, it has two big problems. We will use 8-bit examples below.
Problem 1: there are now two values for zero: 00000000 and 10000000.
Why is this a problem? Suppose we have two 8-bit variables x and y: x holds the 00000000 pattern and y holds the 10000000 pattern. Both are zero.
If we do if (x == y), the CPU will compare the bit patterns, see they are different, and conclude that x != y, whereas they are both zero.
Problem 2: signed-magnitude notation does not follow the binary rules of addition nor subtraction.
As an example, let's do +5 + (-3). The answer should be +2.
00000101 +5
+ 10000011 -3, note the sign bit on the left
----------
10001000 i.e. -8
What happened: the magnitudes were added together (5 + 3 = 8), and the sign bits were added together as well, producing a negative sign!
So, signed-magnitude notation doesn't really work. We need another representation for signed integers.
In twos-complement notation, the left-most bit still represents the sign: 0 represents +ve, and 1 represents -ve.
For positive bit patterns, the columns still represent powers of two like they did before.
But not for negative numbers!!
Instead, a negative number is represented by the pattern which, when added to the corresponding positive value, gives us zero.
For example, -3 is represented in 8-bit twos-complement notation as 11111101. Let's add this to +3:
11111101
+ 00000011
----------
00000000
The advantages of twos-complement representation of signed integers are that there is only one bit pattern for zero, and negative bit patterns still work for both
addition and subtraction.
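A quick way to experiment with this (a standard trick, offered here only as an illustrative C sketch): the twos-complement of a pattern can be formed by inverting every bit and adding 1, and adding the result back to the original positive value wraps around to zero, just like the worked example above.

#include <stdio.h>

int main(void) {
    unsigned char plus3  = 0x03;                         /* 00000011 */
    unsigned char minus3 = (unsigned char)(~plus3 + 1);  /* invert all bits, add 1 */

    printf("-3 pattern: 0x%02X\n", (unsigned)minus3);    /* prints 0xFD, i.e. 11111101 */

    unsigned char sum = (unsigned char)(plus3 + minus3); /* the carry falls off the left */
    printf("3 + (-3):   0x%02X\n", (unsigned)sum);       /* prints 0x00 */
    return 0;
}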
I recommend that you decode positive binary numbers by adding the columns together,
but DON'T try to read or decode a negative number, as it doesn't work.
Instead, visualise negative numbers like a car trip meter (odometer): negative numbers are a certain distance "away" from zero. For example, using 4-bit
groups:
-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7
1000 1001 1010 1011 1100 1101 1110 1111 0000 0001 0010 0011 0100 0101 0110 0111
Why is this analogous to the odometer? The number below 0000 on the odometer is 9999, because 9999 + 1 kilometre will bring you back to 0000.
Other things to note from the above table:
Half the bit patterns are negative, and half are non-negative (zero takes one of the patterns on the positive side), so we get these ranges:
8 bits, or 1 byte: -128 ... +127
16 bits, or 2 bytes: -32,768 ... +32,767
32 bits, or 4 bytes: -2,147,483,648 ... +2,147,483,647
64 bits, or 8 bytes: -9,223,372,036,854,775,808 ... +9,223,372,036,854,775,807
Odd numbers always have their least-significant bit set, and even ones don't.
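Both observations are easy to confirm with a short C sketch (illustrative only): the range constants come from <stdint.h>, and the odd/even test simply inspects the least-significant bit.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* The twos-complement ranges for the common sizes. */
    printf("8-bit:  %d ... %d\n", INT8_MIN, INT8_MAX);
    printf("16-bit: %d ... %d\n", INT16_MIN, INT16_MAX);
    printf("32-bit: %d ... %d\n", INT32_MIN, INT32_MAX);
    printf("64-bit: %lld ... %lld\n", (long long)INT64_MIN, (long long)INT64_MAX);

    /* Odd numbers have their least-significant bit set. */
    for (int x = -3; x <= 3; x++)
        printf("%3d is %s\n", x, (x & 1) ? "odd" : "even");
    return 0;
}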
While not strictly a data representation issue, we should also think about:
Carry: if we add two 32-bit numbers, say, and the result is so large that it won't fit back into 32 bits, the carry bit falls off the left-hand end. Similarly, if we add or subtract negative numbers and the result is below the most negative representable number, it again won't fit back into 32 bits.
Overflow: if we add two positive, signed, 32-bit numbers, say, and the result ends up with the most-significant bit set, this indicates that the number is negative, which it isn't. This problem, where the sign bit of the result differs from the sign bits of the inputs, is known as overflow.
Similarly, for subtraction, we get underflow.
We will revisit the topic of carry, overflow and underflow later on.
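As a preview, here is one way to watch overflow happen. This C sketch (an illustration under our own assumptions, not the official detection method) adds two large positive 32-bit values using unsigned arithmetic, since the raw bit patterns are what the hardware adds, and then checks the sign bits:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t a = 2000000000u;   /* two large positive 32-bit values */
    uint32_t b = 2000000000u;

    /* Add the raw 32-bit patterns (unsigned, so C's signed-overflow rules don't bite). */
    uint32_t sum = a + b;

    /* Overflow: both inputs have sign bit 0, but the result's sign bit is 1. */
    int overflow = (((a ^ sum) & (b ^ sum)) >> 31) != 0;

    /* 0xEE6B2800: the MSB is set, so read as a signed value this would be negative. */
    printf("result pattern: 0x%08X\n", (unsigned)sum);
    printf("overflow:       %s\n", overflow ? "yes" : "no");
    return 0;
}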
2.5 Endianness
When we choose to use 16-, 32- or 64-bit patterns, we need to store the patterns in consecutive bytes in memory.
For example, the 32-bit hex pattern 0x1203F5C4 would need to be stored as 0x12, 0x03, 0xF5 and 0xC4 byte values.
We have a choice: in what order should the byte values be stored?
From a human's perspective, it makes sense to put the pattern into memory in the same way that we write it down: big-end first, i.e. big-endian order:
Byte Address 0 1 2 3
Value 12 03 F5 C4
The alternative is to store the bytes little-end first, i.e. little-endian order:
Byte Address 0 1 2 3
Value C4 F5 03 12
This might seem weird-looking, but there is a reason for it. Consider the small number 0x00000005, which is 5 decimal. Let's store it as a 32-bit value at
address 0 in memory, little-endian order:
Byte Address 0 1 2 3
Value 05 00 00 00
Now, what 8-bit integer is stored at address 0: well, 0x05 of course, or 5 decimal.
And, what 16-bit little-endian integer is stored at address 0: well, 0x0005, which is again 5 decimal.
But if, instead, we had stored the 32-bit value in big-endian order:
Byte Address 0 1 2 3
Value 00 00 00 05
the 8-bit and 16-bit equivalents are no longer at address 0: they are at addresses 3 and 2, respectively.
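You can see which byte order your own machine uses with a small C sketch (illustrative only): store the 32-bit value from the example above, then inspect its individual bytes through an unsigned char pointer.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t value = 0x1203F5C4;
    unsigned char *bytes = (unsigned char *)&value;   /* view the same 4 bytes */

    for (int i = 0; i < 4; i++)
        printf("byte address %d: 0x%02X\n", i, bytes[i]);

    /* On a little-endian machine (e.g. Intel) this prints C4 F5 03 12;
     * on a big-endian machine it prints 12 03 F5 C4. */
    printf("this machine is %s-endian\n", bytes[0] == 0xC4 ? "little" : "big");
    return 0;
}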
The decision to use big- or little-endian has been a running source of arguments since the 1970s.
Most CPUs are either little-endian or big-endian, but not both:
Intel IA-32 chips are little-endian.
Motorola 680x0 chips are big-endian.
However, the MIPS CPU can be told to boot into either mode. We will be using little-endian MIPS.
It is worth noting that the default encoding of multibyte binary integers on the Internet is big-endian format.
3.1 Unicode
ASCII is only useful for English speakers: it does not have the range of bit patterns to encode all the characters used in the many human languages around the world.
Now, most systems are migrating to Unicode to encode characters.
Unicode defines a codespace of 1,114,112 code points (i.e. bit patterns) in the range 0x00 to 0x10FFFF. Each code point represents a specific character.
The range of code points is broken up into a number of planes, or sub-ranges:
0x0000 to 0xFFFF is the Basic Multilingual Plane, which holds most of the useful human characters.
0x10000 to 0x1FFFF is the Supplementary Multilingual Plane, which is mostly used for historic languages, but is also used for musical and
mathematical symbols.
0x20000 to 0x2FFFF is the Supplementary Ideographic Plane, which holds Unified Han (CJK) Ideographs that were mostly not included in earlier
character encoding standards.
The code points 0x30000 upwards are either reserved for private use, or unassigned.
Although technically Unicode characters require 21 bits, many programming languages (e.g. Java) and libraries support the Basic Multilingual Plane and use
the 16-bit values 0x0000 to 0xFFFF to hold Unicode characters.
Again, the CPU is oblivious to the Unicode-ness of a pattern, but we can still do 16-bit comparisons etc.
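For instance, a BMP code point fits comfortably in a 16-bit integer, and comparing two of them is just an ordinary 16-bit comparison; a tiny C sketch (illustrative only, using uint16_t rather than any particular language's character type):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t capital_a = 0x0041;   /* Unicode code point U+0041, 'A' */
    uint16_t e_acute   = 0x00E9;   /* Unicode code point U+00E9, 'é' */

    /* The CPU just compares two 16-bit patterns; it knows nothing of Unicode. */
    if (capital_a < e_acute)
        printf("U+0041 sorts before U+00E9\n");
    return 0;
}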
For efficient transmission and storage of Unicode data, it is usually sent/stored using UTF-8 or UTF-16 format.
We will not be covering Unicode in any greater depth in this subject, as it's mainly an issue for libraries and programming languages.
4 Representation of Floating-Point Numbers
We now come to the most complex of all the data representation topics, that of floating point numbers.
To make it a bit easier, I'm going to do most of the discussion in decimal, and at the end switch over to binary.
Before we start, it's important to remember that:
There is an infinite set of floating-point numbers, even between 0.0 and 1.0.
Given any fixed number of bits, e.g. 32 bits, there is only a finite set of patterns.
Implication: not all floating-point numbers can be represented accurately with a fixed-size bit pattern.
We saw with signed-magnitude that the most obvious solution comes with drawbacks.
So, this time we are going to start with some goals that we want to achieve with the representation of floating-point numbers.
1. We are able to represent both positive and negative numbers.
2. We are able to represent very large numbers (large, positive exponents) and very small numbers (large, negative exponents).
3. There is a unique bit pattern for every floating-point number that we can represent: no number can be represented by two or more bit patterns.
4. The pattern for 0.0 corresponds to the integer version of 0, i.e. all zero bits.
5. It would be nice to be able to represent certain mathematical results for certain operations, e.g. 1/0 is infinity, and 0/0 is an undefined result.
Let's invent a simple 10-character decimal format: a sign, six mantissa digits (with an implied decimal point after the first digit), then a sign and two exponent digits. For example, the number +7.83747 * 10^6 would be encoded as
+783747+06
This gives us 6 digits of accuracy in the mantissa, two sign characters, and an exponent range of 10^-99 to 10^+99.
Note that we don't have to encode the base, 10, which we raise using the exponent. That's implied.
Similarly, the rule says that there is a decimal point after the first digit, so there is no need to store that either.
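To make the encoding concrete, here is a small C sketch (our own illustration of the ten-character scheme described above) that decodes such a pattern back into a number:

#include <stdio.h>
#include <math.h>

/* Decode a ten-character pattern: sign, six mantissa digits
 * (implied decimal point after the first digit), sign, two exponent digits. */
static double decode(const char *p) {
    int msign = (p[0] == '-') ? -1 : 1;
    double mantissa = 0.0;
    for (int i = 1; i <= 6; i++)
        mantissa = mantissa * 10.0 + (p[i] - '0');
    mantissa /= 100000.0;                       /* place the point after the first digit */

    int esign = (p[7] == '-') ? -1 : 1;
    int exponent = (p[8] - '0') * 10 + (p[9] - '0');

    return msign * mantissa * pow(10.0, esign * exponent);
}

int main(void) {
    printf("%g\n", decode("+783747+06"));   /* prints 7.83747e+06 */
    return 0;
}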
Question: what number is
+400250-02
written in normal scientific notation and in normal decimal notation?
Moving on to goal 4. Let's assume in binary that + signs are represented by zeroes, and - signs by ones. Then the number 0.0 is
0000000000
which is all zeros.
Floating-point has been beset with data representation issues from the beginning of computing, along with many other issues such as accuracy, rounding
errors, overflows and underflows, and dealing with infinity and results which are not numbers.
In the 1980s, after several years of wrangling, IEEE defined the 754 standard for representation of floating-point numbers as binary patterns. We are only
going to touch on the most important aspects of the data representation. We are also going to look at single-precision floats stored in 32 bits.
The IEEE 754 single-precision format packs a number into 32 bits as follows: 1 sign bit (0 for +ve, 1 for -ve), then 8 exponent bits (the true exponent plus a bias of 127), then 23 mantissa bits holding the fraction digits that follow an implied leading "1.".
For example, consider the pattern 0 01111111 1100000 .... The sign bit is 0 (positive); the exponent bits 01111111 are 127, and subtracting the bias of 127 gives 2^0; the mantissa bits 1100000... with the implied leading 1 are 1.11 in binary, i.e. 1.75. Putting this all together, we have +1.75 * 2^0. Now 2^0 = 1, so the result is 1.75 * 1 = 1.75.
Let's try another one:
1 00001111 0011100000 ...
Taking it in stages:
sign bit is 1: negative
exponent bits are 00001111, i.e. 15; subtracting the bias of 127 gives 2^-112
mantissa bits are 0011100000..., which with the implied leading 1 is 1.00111 in binary, i.e. 1.21875
Putting this all together, we have -1.21875 * 2^-112, a very small negative number.
Now consider the all-zeros pattern 0 00000000 0000 .... Technically, this is +1.0 * 2^-127, i.e. around 6 * 10^-39, but in fact IEEE 754 defines this as exactly +0.0.
Interestingly, the standard allows for 1 00000000 0000 ..., or -0.0! This violates our goal 3: no number should be represented by more than one bit pattern.
The highest exponent pattern, 11111111, is reserved to represent special numbers.
With this exponent and the mantissa set to all zeroes, this pattern represents infinity, i.e. the result of N/0.
With this exponent and any one bit set in the mantissa, this pattern represents "not a number", written as NaN, which is the result of doing 0/0 and some other
interesting floating-point operations.
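If you'd like to poke at these bit patterns yourself, here is a small C sketch (illustrative only, assuming float is the usual 32-bit IEEE 754 single) that copies a float's bits into an integer and pulls out the sign, exponent and fraction fields:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

static void dissect(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* reinterpret the 32-bit pattern */

    uint32_t sign     = bits >> 31;            /* 1 bit */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127 */
    uint32_t fraction = bits & 0x7FFFFF;       /* 23 bits, implied leading 1 */

    printf("%-12g sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           f, (unsigned)sign, (unsigned)exponent,
           (int)exponent - 127, (unsigned)fraction);
}

int main(void) {
    dissect(1.75f);      /* exponent 127 -> 2^0, fraction 0x600000 -> 1.75 */
    dissect(-0.0f);      /* the -0.0 pattern: sign=1, everything else zero */
    dissect(INFINITY);   /* infinity: exponent all ones, fraction zero */
    return 0;
}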
For more details about floating-point representation, the relevant chapter in Patterson & Hennessy is excellent.
In the MARS simulator, there is a tool which you can use to convert between decimal and IEEE 754 floating-point representation; it works even without an assembly program loaded.