AS 1.1 Data Representation SV
AS 1 Information
Representation
1.1 Data Representation
DRAFT VERSION
Overview
• Number Systems Introduction
• Binary and Decimal Prefixes
• Binary Coded Decimal
• Binary Addition
• Binary Subtraction
• Hexadecimal
• Characters and Text
General Principles of Number Bases
• The decimal system is so familiar to us that we usually do not even
think about it as a number system
• However, in Computer Science we often need to work with other
number systems, mainly binary and hexadecimal
• So, it is worth briefly going back to basics and looking at what we
mean by a number system
General Principles of Number Bases
• A number base tells us two essential and closely related facts:
• The number of symbols in the base, and
• The place value
• Base 10 (decimal or denary)
• Ten symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
• Place value: 10
10³    10²   10¹  10⁰
1000s  100s  10s  1s
9      4     7    3
General Principles of Number Bases
• A number base tells us two essential and closely related facts:
• The number of symbols in the base, and
• The place value
• Base 16 (hexadecimal)
• Sixteen symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F
• Place value: 16
16³    16²   16¹  16⁰
4096s  256s  16s  1s
5      D     A    5        5DA5₁₆ = 23,973₁₀
General Principles of Number Bases
• A number base tells us two essential and closely related facts:
• The number of symbols in the base, and
• The place value
• Base 2 (binary)
• Two symbols: 0, 1
• Place value: 2
2³  2²  2¹  2⁰
8s  4s  2s  1s
Converting Decimal to Binary
(method 1)
Take the decimal number 147 and work left to right, subtracting each place value that fits (setting that bit to 1) and skipping those that do not (setting those bits to 0).
128 64 32 16 8 4 2 1
 1   0  0  1 0 0 1 1
147 − 128 = 19;  19 − 16 = 3;  3 − 2 = 1;  1 − 1 = 0
Check: 128 + 16 + 2 + 1 = 147
Converting Decimal to Binary
(method 2)
Take the decimal number 147, repeatedly divide by two, and set the binary digit to the remainder.
Work right to left.
147 ÷ 2 = 73 r 1;  73 ÷ 2 = 36 r 1;  36 ÷ 2 = 18 r 0;  18 ÷ 2 = 9 r 0;
9 ÷ 2 = 4 r 1;  4 ÷ 2 = 2 r 0;  2 ÷ 2 = 1 r 0;  1 ÷ 2 = 0 r 1
128 64 32 16 8 4 2 1
 1   0  0  1 0 0 1 1
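The repeated-division method can be sketched in a few lines of Python (the function name and the 8-bit width are our own choices for illustration):

```python
# A minimal sketch of method 2: repeated division by two.
def to_binary(n, bits=8):
    digits = []
    for _ in range(bits):
        digits.append(str(n % 2))   # the remainder becomes the next bit
        n //= 2                     # integer-divide by the base
    return "".join(reversed(digits))  # remainders are read right to left

print(to_binary(147))  # 10010011
```

The `reversed` call is the "work right to left" step from the slide: the first remainder is the least significant bit.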
• Convert 19 to binary
00010011
Check your answer before revealing!!
• Convert 75 to binary
01001011
Check your answer before revealing!!
Practice
• Convert 25 to binary
00011001
Check your answer before revealing!!
• Convert 39 to binary
00100111
Check your answer before revealing!!
Converting Binary to Decimal
128 64 32 16 8 4 2 1
 1   1  0  1 0 0 1 0
128 + 64 + 16 + 2 = 210
Practice
• Convert 01010101 to denary
85
Check your answer before revealing!!
Hexadecimal
CA59₁₆ = 1100 1010 0101 1001₂
Hex digits to binary
Hex Binary
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
9 1001
A 1010
B 1011
C 1100
D 1101
E 1110
F 1111
Hex conversions (from binary)
• To convert from binary to hex, simply group the binary number into
four bit groups, starting at the LSB and padding with zeros, then …
• … convert each four bit binary group into its hex digit:
0110100010₂
0000 0001 1010 0010₂
  0    1    A    2        = 01A2₁₆
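The grouping step can be sketched in Python (the helper name is our own; the slide's example pads to 16 bits for display, while this sketch pads only to the next multiple of four):

```python
# Sketch: binary -> hex by grouping bits in fours from the LSB.
def bin_to_hex(bits):
    bits = bits.zfill((len(bits) + 3) // 4 * 4)          # pad with leading zeros
    groups = [bits[i:i+4] for i in range(0, len(bits), 4)]
    return "".join(format(int(g, 2), "X") for g in groups)

print(bin_to_hex("0110100010"))  # 1A2
```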
Hex conversions (to and from
decimal)
• To convert from hex to decimal, you can take each hex digit, convert it
to decimal, and then multiply it by the place value (in decimal), and
add all the results
• Or, convert to binary, then binary to decimal – this is actually easier
• To convert from decimal to hex, you can repeatedly divide by 16,
noting each remainder as you go, building up the hex number digit by
digit
• Or convert to binary, then binary to hex – this is also easier
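Python's built-in conversions are a quick way to check manual work in any of these directions:

```python
# Checking hex/denary/binary conversions with built-ins.
print(int("5DA5", 16))     # 23973  (hex -> denary)
print(format(23973, "X"))  # 5DA5   (denary -> hex)
print(int("10010011", 2))  # 147    (binary -> denary)
```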
Practice
• Convert 23₁₆ to binary using 8 bits
00100011
Check your answer before revealing!!
Binary Prefixes
Name      Notation  Power of 2  Value                     In 1024s
kibibyte  KiB       2¹⁰         1,024 bytes               1024¹
mebibyte  MiB       2²⁰         1,048,576 bytes           1024²
gibibyte  GiB       2³⁰         1,073,741,824 bytes       1024³
tebibyte  TiB       2⁴⁰         1,099,511,627,776 bytes   1024⁴
Mnemonic
You only need to go up to Tera/Tebi
Exam hint – write out this mnemonic at the start of your exam. You really only need to commit the denary names to
memory, and some of these at least should be familiar from your science GCSEs
You can then fill out the Binary prefixes by replacing the last two letters of each denary prefix with 'bi'. So Giga
becomes Gibi; Tera becomes Tebi etc.
Does it Matter?
• For your exam? Yes
• In the real world? Well, maybe…
• There is widespread confusion – many people and organisations use
the denary prefixes to mean either the denary or binary units.
• To take one example of confusion, disks and files will report different
sizes on different operating systems (some in Terabytes, some in
Tebibytes, both using the abbreviation TB)
• International standards bodies have been trying to get companies to
use denary prefixes purely for denary units, and binary prefixes for
binary units for almost 30 years
Binary-coded decimal
• Binary-coded decimal (BCD) takes a different approach to encoding
decimal values
• It is not widely used in modern computing, although there are some
cases where it is beneficial
• Standard BCD uses one byte per decimal digit (with wastage of 4+
bits)
• Packed BCD uses one nibble per decimal digit (with wastage of 6 bit
patterns)
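Packed BCD can be sketched in Python (the function name is our own; each decimal digit goes into one nibble):

```python
# Sketch: packed BCD stores one decimal digit per nibble (4 bits).
def to_packed_bcd(n):
    out = 0
    for digit in str(n):
        out = (out << 4) | int(digit)   # shift left one nibble, insert the digit
    return out

print(format(to_packed_bcd(42), "08b"))  # 01000010 - nibbles 0100 (4) and 0010 (2)
```

Note how the BCD bit pattern for 42 differs from pure binary (42₁₀ = 00101010₂) – this is the point of the comparison table below.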
Binary and BCD Compared
Denary Binary BCD Denary Binary BCD
0 0000 0000 0000 0000 8 0000 1000 0000 1000
1 0000 0001 0000 0001 9 0000 1001 0000 1001
2 0000 0010 0000 0010 10 0000 1010 0001 0000
3 0000 0011 0000 0011 11 0000 1011 0001 0001
4 0000 0100 0000 0100 12 0000 1100 0001 0010
5 0000 0101 0000 0101 13 0000 1101 0001 0011
6 0000 0110 0000 0110 14 0000 1110 0001 0100
7 0000 0111 0000 0111 15 0000 1111 0001 0101
Binary Arithmetic – Addition Rules
  0       0       1       1         1
+ 0     + 1     + 0     + 1     + 1 + 1
  0       1       1      10        11

zero plus zero = zero;  zero plus one = one;  one plus zero = one
one plus one = zero, carry one (= two in binary: 10)
one plus one plus one = one, carry one (= three in binary: 11)
Binary Arithmetic - Addition
(carries)       1 1 1
    0 1 1 0 1 0 1 0    106₁₀
  + 0 1 1 0 1 1 0 0    108₁₀
    1 1 0 1 0 1 1 0    214₁₀
Binary Arithmetic - Subtraction
    1 1 1 0 1 1 1 1    239₁₀
  - 0 0 0 1 1 0 0 1     25₁₀
    1 1 0 1 0 1 1 0    214₁₀
Advice
• Take care setting out additions – make sure that your binary digits line up
and are written clearly
• Allow space for your carried bits – these must be shown to score full
marks
• Check your working twice – it is very easy to make errors, and errors cost
marks, cost grades
• Once - Work through the binary addition a second time, check each bit
• Twice – Convert to denary, add, convert the answer back to binary and check
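The "check twice" step (convert to denary, add, convert back) is easy to sketch in Python:

```python
# Sketch: checking a manual binary addition via denary.
a, b = "01101010", "01101100"   # 106 and 108 from the worked example
total = int(a, 2) + int(b, 2)
print(total)                    # 214
print(format(total, "08b"))     # 11010110
```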
Practice
• Add 101101012 to 000100012 using 8 bits
11000010
Check your answer before revealing!!
    1 1 0 0 1 1 1 1    207₁₀
  - 0 0 1 1 1 0 0 1     57₁₀
    1 0 0 1 0 1 1 0    150₁₀
The subtractions for bits 0 to 3 are straightforward. When we get to bit 4, we need to borrow in order to
perform the subtraction, but bit 5 is zero, so we need to borrow from bit 6 – this makes the value in bit 5 10₂
(that is, 2 in decimal/denary); we now borrow one from this, reducing it to 1₂, and making bit 4 10₂, so when we
perform that subtraction, we are subtracting 1₂ from 10₂, resulting in 1₂.
The need for unpredictably long chains of 'borrows' makes subtraction difficult to implement in any electronic
(or mechanical) system.
Binary Arithmetic
• Addition of two numbers – by our rules
• Adding more than two numbers? Divide and conquer…
• Subtraction? Well, clearly we have a process, just as we did for
addition, but it would be much more complicated to perform
electronically, so there is a neat trick we use instead…
• Multiplication – repeated addition
• Division – repeated subtraction
Representing negative numbers
• With N bits, we can represent numbers from 0 to 2^N − 1
• For example, with 8 bits, our range is 0 to 255
• What if we want to represent negative numbers too?
• We could, instead, reserve the most significant bit (MSB) for the sign:
Sign bit
0 0 0 0 1 1 1 0    14₁₀
1 0 0 0 1 1 1 0    −14₁₀
Sign and Magnitude
• This approach is called ‘sign and magnitude’ – the most significant bit is the sign (1 being
negative, 0 positive) and the remaining bits being the magnitude in the normal way.
• With N bits, we can now represent numbers from −(2^(N−1) − 1) to +(2^(N−1) − 1)
• For example, with 8 bits, our range is -127 to +127
• …but, we have +0 (00000000) and -0 (10000000)!
Sign bit
0 0 0 0 1 1 1 0    14₁₀
1 0 0 0 1 1 1 0    −14₁₀
Representing negative numbers
• … and what happens if we add 14 and -14?
Sign bit
(carries)       1 1 1
    0 0 0 0 1 1 1 0    14₁₀
  + 1 0 0 0 1 1 1 0    −14₁₀
    1 0 0 1 1 1 0 0    −28₁₀ – the wrong answer!
Two's Complement
• Two's Complement takes a different approach to representing
negative binary numbers
• There are two methods to find a Two's Complement representation:
• Two's Complement of X = 2^N − X
• Or, "invert the bits and add one"
• Using this representation, we still have a sign bit
• We can represent numbers from −2^(N−1) to 2^(N−1) − 1

0 0 0 0 0 0 1 1     3₁₀
0 0 0 0 0 0 1 0     2₁₀
0 0 0 0 0 0 0 1     1₁₀
0 0 0 0 0 0 0 0     0₁₀
1 1 1 1 1 1 1 1    −1₁₀
1 1 1 1 1 1 1 0    −2₁₀
1 1 1 1 1 1 0 1    −3₁₀
Two's Complement
How to express −14₁₀ in Two's Complement:
        128 64 32 16 8 4 2 1
         0  0  0  0  1 1 1 0    14₁₀
Invert   1  1  1  1  0 0 0 1
Add 1    1  1  1  1  0 0 1 0    −14₁₀

How to convert from Two's Complement (negative):
         1  1  1  1  0 0 1 0
Invert   0  0  0  0  1 1 0 1
Add 1    0  0  0  0  1 1 1 0    14₁₀, so this is −14₁₀

How to express +14₁₀ in Two's Complement:
         0  0  0  0  1 1 1 0    14₁₀ – we do not make any change!

How to express −5.25₁₀ in Two's Complement (fixed point, binary point between the nibbles):
         0 1 0 1 0 1 0 0    5.25₁₀
Invert   1 0 1 0 1 0 1 1
Add 1    1 0 1 0 1 1 0 0    −5.25₁₀

Check – adding 14₁₀ and −14₁₀ gives zero (the ninth carry bit is discarded):
(carries)  1 1 1 1 1 1 1
     1 1 1 1 0 0 1 0    −14₁₀
   + 0 0 0 0 1 1 1 0    14₁₀
   1 0 0 0 0 0 0 0 0    0₁₀
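The "invert the bits and add one" rule can be sketched for 8-bit values in Python (the function name is our own; the `& 0xFF` mask keeps the result to 8 bits, since Python integers are not fixed-width):

```python
# Sketch: two's complement of an 8-bit value by "invert and add one".
def twos_complement8(n):
    return ((~n) + 1) & 0xFF   # invert, add one, keep 8 bits

print(format(twos_complement8(14), "08b"))  # 11110010
print((twos_complement8(14) + 14) & 0xFF)   # 0 - adding 14 and -14 gives zero
```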
Boolean Operators
NOT        AND            OR             XOR
X  Q       X  Y  Q        X  Y  Q        X  Y  Q
0  1       0  0  0        0  0  0        0  0  0
1  0       0  1  0        0  1  1        0  1  1
           1  0  0        1  0  1        1  0  1
           1  1  1        1  1  1        1  1  0
Bitmasks
• A "bitmask" is data used to set, clear, or select certain bits from other
values
• A bitmask can be applied using a Boolean operator – AND, OR, XOR
value (X) 0 0 1 1 1 0 1 0
AND
bitmask (Y) 1 1 1 1 0 0 0 0
result (Q) 0 0 1 1 0 0 0 0
Bitmasks
• A "bitmask" is data used to set, clear, or select certain bits from other
values
• A bitmask can be applied using a Boolean operator – AND, OR, XOR
value 0 0 1 1 1 0 1 0
OR
bitmask 1 0 0 0 0 0 0 0
result 1 0 1 1 1 0 1 0
Bitmasks
• A "bitmask" is data used to set, clear, or select certain bits from other
values
• A bitmask can be applied using a Boolean operator – AND, OR, XOR
value 0 0 1 1 1 0 1 0
XOR
bitmask 0 0 0 0 1 1 1 1
result 0 0 1 1 0 1 0 1
Examples
• An AND operation with the mask 10101010 is applied to the binary
number 01010101. Show the result.
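The three bitmask operations from the slides map directly onto Python's `&`, `|`, and `^` operators:

```python
# Sketch of the three bitmask operations from the slides.
value = 0b00111010

print(format(value & 0b11110000, "08b"))  # 00110000 - AND selects/clears bits
print(format(value | 0b10000000, "08b"))  # 10111010 - OR sets bits
print(format(value ^ 0b00001111, "08b"))  # 00110101 - XOR toggles bits

# The practice example: 01010101 AND 10101010
print(format(0b01010101 & 0b10101010, "08b"))  # 00000000 - no bits in common
```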
Shifts
• Logical shift left and right
• For unsigned binary values = multiply and divide by 2
• Circular shift
• Arithmetic shift left and right
• For signed (2sC) and unsigned binary values ≈ multiply and divide by 2
• Note the weird overflow convention and rounding
Shifts, multiplication and division
• In any base, shifting left is broadly equivalent to multiplying by the
base, and …
• Shifting right is broadly equivalent to dividing by the base
• For fixed size representations, we have to be aware of overflow and
underflow, and…
• For two's complement (or any signed system) representations, we
need to take account of the sign information
10³  10²  10¹  10⁰
 9    4    7    3
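Python's shift operators illustrate the multiply/divide relationship (note that Python's `>>` is an arithmetic shift, so negative values round towards negative infinity, matching the convention on the following slides):

```python
# Sketch: shifting left/right multiplies/divides by 2.
print(25 << 1)    # 50
print(89 >> 1)    # 44
print(-39 >> 1)   # -20 - arithmetic shift: rounds towards negative infinity
```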
Logical Shifts
0 1 0 1 1 0 0 1    89₁₀

Logical shift right
Always shift a zero into the MSB

0 0 1 0 1 1 0 0    44₁₀  (a 1 was shifted out of the LSB)
0 1 0 1 1 0 0 1    89₁₀  →  0 0 1 0 1 1 0 0    44₁₀  (positive: same result as a logical shift)

1 1 0 1 1 0 0 1    −39₁₀

Arithmetic shift right – negative number
(rounds towards negative infinity)

1 1 1 0 1 1 0 0    −20₁₀  (the sign bit is copied into the MSB; a 1 was shifted out)
Arithmetic Shift Left (sign
preserving)
0 0 0 1 1 0 0 1    25₁₀
0 0 1 1 0 0 1 0    50₁₀  (carry out 0)

If we interpret this as +25 in 2sC, then everything
works, and the matching MSB and carry out
indicate no problem.

0 1 0 1 1 0 0 1    89₁₀
1 0 1 1 0 0 1 0    (carry out 0)

The carry out and MSB difference indicates an
overflow situation, as we cannot represent +178 in
2sC using 8 bits.
Arithmetic Shift Left
1 1 0 1 1 0 0 1    −39₁₀
1 0 1 1 0 0 1 0    −78₁₀  (carry out 1)

As the carry out and MSB are the same, this ASL
has worked with the 2sC number −39, producing
the correct result.

1 0 0 1 1 0 0 1    −103₁₀
0 0 1 1 0 0 1 0    (carry out 1)

Here, the attempt to ASL the 2sC number −103 has
failed, as the result of −206 is beyond the most
negative number we can represent in 8 bits with
2sC. The difference between the MSB and carry
out indicates this.
Circular Shift Right
1 1 0 1 1 0 0 1

Circular shift right (rotate right) – the bit leaving the LSB re-enters at the MSB:

1 1 1 0 1 1 0 0

(Rotating the original pattern left instead gives 1 0 1 1 0 0 1 1.)
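Circular shifts can be sketched for 8-bit values in Python (the function names are our own; Python integers are not fixed-width, so we mask to 8 bits):

```python
# Sketch: 8-bit circular shifts (rotations).
def ror8(x):
    return ((x >> 1) | (x << 7)) & 0xFF   # LSB re-enters at the MSB

def rol8(x):
    return ((x << 1) | (x >> 7)) & 0xFF   # MSB re-enters at the LSB

print(format(ror8(0b11011001), "08b"))  # 11101100
print(format(rol8(0b11011001), "08b"))  # 10110011
```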
Characters and Text
char = 'B'
num = 65
print(ord(char)) # 66
print(chr(num)) # 'A'
ASCII Secrets…
• The ASCII codes for the denary numbers contain their binary
equivalents:
Denary character ASCII Code
'0' 0011 0000
'1' 0011 0001
'2' 0011 0010
'3' 0011 0011
'4' 0011 0100
… …
'9' 0011 1001
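The "secret" in the table can be exploited directly: the digit's value is recovered either by masking off the 0011 prefix or by subtracting the code for '0':

```python
# Sketch: recovering a digit's value from its ASCII code.
print(ord('7') & 0b00001111)   # 7 - mask off the 0011 prefix
print(ord('7') - ord('0'))     # 7 - or subtract the code for '0'
```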
ASCII Secrets…
• Conversion between upper case and lower case letters requires one
bit flip (and we can test for upper/lower case by looking at one bit):
Upper case ASCII Code Lower case ASCII Code
character character
'A' 0100 0001 'a' 0110 0001
'B' 0100 0010 'b' 0110 0010
'C' 0100 0011 'c' 0110 0011
'D' 0100 0100 'd' 0110 0100
… … … …
'Y' 0101 1001 'y' 0111 1001
'Z' 0101 1010 'z' 0111 1010
ASCII Secrets…
• The least significant five bits of a character tell us the position of the
letter in the alphabet:
Upper case ASCII Code Lower case ASCII Code
character character
'A' 0100 0001 'a' 0110 0001
'B' 0100 0010 'b' 0110 0010
'C' 0100 0011 'c' 0110 0011
'D' 0100 0100 'd' 0110 0100
… … … …
'Y' 0101 1001 'y' 0111 1001
'Z' 0101 1010 'z' 0111 1010
ASCII – points to note
• All codes are seven bits, leaving one bit for a parity check
• Number characters convert to their numeric equivalents by masking off
the two set bits of the 0011 prefix (bits 5 and 6), leaving the low four bits
• Upper case Latin characters convert to lower case Latin characters by
setting second most significant bit (bit 6)
• Alphabetic sorting is very simple (and note the careful positioning of
both 'A' and 'a')
• ASCII was created in the early 1960s and based on telegraph code
• Its heritage gives rise to severe limitations, particularly a limitation to
the Latin alphabet, but also US-centricity
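The single-bit case conversion from the points above can be demonstrated in Python (the flipped bit has value 0x20, i.e. 0010 0000):

```python
# Sketch: ASCII case conversion and case testing via one bit.
print(chr(ord('A') | 0x20))    # a - setting the bit gives lower case
print(chr(ord('a') & ~0x20))   # A - clearing the bit gives upper case
print(bool(ord('a') & 0x20))   # True - the bit tests for lower case
```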
Unicode
• Unicode is one of the most misunderstood initiatives in Computer
Science, so, strap in… First, we need to understand how ASCII was
used and abused (or ‘Extended’)
• ASCII is a simple standard, with a direct mapping from a code to a
character (for example binary code 1000001 maps to 'A')
• Remember that ASCII only used 7 bits? Well, when there was no
longer a need for a parity bit, all the codes from 128 to 255 became
available. These new code sets were called ‘Extended ASCII’
• And they were used by lots of people for lots of different things,
culminating in much confusion between standards bodies and industry
Microsoft (not ANSI!) Code Pages
Codes 0–127: unchanged ASCII
Codes 128–255: redefined per code page (based on ISO-8859), for example:
1252 Western European, 1251 Cyrillic, 1253 Greek, 1255 Hebrew, 1256 Arabic
Enter Unicode
• Unicode takes a different approach, separating the coding of
characters from their representation
• The Latin character A has the Unicode identification (or code point)
U+0041 (those are hex digits by the way)
• The upper limit for a Unicode code point is U+10FFFF, which gives us a
theoretical maximum of 1,114,112 characters (the actual maximum is
smaller, and only around 10% are in use, some values are reserved)
• So one part of the Unicode standard is a truly huge list of characters
(Latin letters, pictograms, emojis, symbols, hieroglyphs, …)
Enter Unicode
• The other part of the Unicode standard deals with how these
characters can be represented (in memory, on disk, emails, websites)
• These rules are called Unicode Transformation Formats (UTF), and
there are a number in use, the main ones being:
• UTF-8
• UTF-16
• UTF-32
• These formats do *not* map to 8, 16, and 32 bits directly – do not fall
into that trap!
• They do, however, map to a minimum size
UTF-8
• UTF-8 is the most popular encoding by far, partly because it is fully
backward-compatible with ASCII
• The UTF-8 encoding can use 1, 2, 3, or 4 bytes to encode a character
0xxxxxxx                              1 byte
110xxxxx 10xxxxxx                     2 bytes
1110xxxx 10xxxxxx 10xxxxxx            3 bytes
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   4 bytes

0 1 0 0 0 0 0 1    41₁₆ = U+0041 = 'A'  (1 byte)

1 1 1 1 0 0 0 0   1 0 0 1 1 1 1 1   1 0 0 1 0 0 1 0   1 0 1 0 1 0 0 1    F0 9F 92 A9₁₆ = U+1F4A9 = 💩  (4 bytes)
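Python's encoder can confirm both examples:

```python
# Sketch: checking the UTF-8 examples with Python's encoder.
print('A'.encode('utf-8').hex())           # 41 - one byte, same as ASCII
print('\U0001F4A9'.encode('utf-8').hex())  # f09f92a9 - four bytes
print(len('\U0001F4A9'.encode('utf-8')))   # 4
```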
UTF-16
• The UTF-16 encoding can use 2 or 4 bytes to encode a character
• Raises the problem of byte ordering (BE v LE)
• Requires the Byte Order Mark (BOM)
• Trade-off between compact format and efficiency
• Less-used than UTF-8
0 0 0 0 0 0 0 0   0 1 0 0 0 0 0 1    0041₁₆ = U+0041 = 'A'  (2 bytes)

1 1 0 1 1 0 0 0   0 0 1 1 1 1 0 1   1 1 0 1 1 1 0 0   1 0 1 0 1 0 0 1    D83D DCA9₁₆ = U+1F4A9 = 💩  (a surrogate pair, 4 bytes)
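Again, Python can confirm the encodings (big-endian is chosen here so the output does not start with a BOM):

```python
# Sketch: checking the UTF-16 examples (big-endian, no BOM).
print('A'.encode('utf-16-be').hex())           # 0041 - two bytes
print('\U0001F4A9'.encode('utf-16-be').hex())  # d83ddca9 - a surrogate pair
```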
UTF-32
• UTF-32 uses a fixed size of 4 bytes to capture the 21 bit Unicode
character information
• Very straight-forward, but inefficient (4 bytes for all code points, MS
byte always 00000000)
• Same byte ordering problems and solutions as UTF-16
• Very rarely used
Growth of Unicode
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />

UTF-8 usage across websites rising to 98% in 2023 [1]
[1] Usage Statistics and Market Share of Character Encodings for Websites, October 2023 (w3techs.com)
Do try this at home…
• Use Notepad to create text files in different encodings
• Use a local or an online hex editor (e.g., hexed.it) to examine the file
Images
• Images, like all other data, have to be represented in binary when
stored on computer systems
• We have two main types of image file:
• Bitmap
• Vector
Photo Images
• When an image is captured on digital camera, or when a printed
photo is scanned, we end up with a bitmap, which is a grid of pixels
(short for picture elements)
• When viewed at the intended resolution, the human eye 'fills in' the
image, and we are convinced that we are viewing a realistic image
rather than a grid of coloured dots
Bitmaps – image resolution and
colour depth
• Image resolution is the number of pixels that make up a bitmap
• We usually define this in terms of width x height
Note – there are several different 4K resolutions: 3840 x 2160 (also known as Ultra HD or UHD) is the most-used in
consumer electronics; 4096 x 2160 (DCI 4K) is the most used in the movie industry
Bitmaps – image resolution and
colour depth
19 pixels wide
10 pixels high
Bitmaps – image resolution and
colour depth
• Having looked at image resolution, we can now talk about each pixel
in an image
• The "colour depth" of an image is the number of bits used to capture
the colour information in each pixel
• With a colour depth of one bit, we only have two colours – these would
usually map to black and white and our image would be monochrome
(although with a high enough resolution, our eyes would interpret
this as greyscale)
• Early computer graphics used eight bit colour (so, one byte of colour
information for each pixel)
Colour depth – 8 bit choices
• With most colour information for electronic displays, we specify red,
green and blue or RGB components
• With just 8 bits, many early systems used three bits for red, three bits
for green, and two bits for blue (because the human eye is less
sensitive to the blue end of the spectrum)
• This gave us a very limited range of colours
• An alternative approach with 8 bits used a colour 'palette' – a specific
set of 256 colours chosen for the image that we are displaying
• This led to much higher quality images, but greater complexity
Colour depth – beyond 8 bits
• Most computer systems use 24 bits (three bytes) for colour
information, with one byte each for red, green, and blue
• This leads to over 16 million colours, comfortably beyond the ability
of the human eye to distinguish between colours (depending on
several factors, our eyes can distinguish between 1 million and 10
million colours)
• 24 bit colour depth is often referred to as "True Colour", although
fidelity depends very much on the quality of the display
• Some systems use 48 bit colour depth for technical reasons
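24-bit colour packing can be sketched in Python (the channel values here are made up for illustration):

```python
# Sketch: packing 24-bit RGB colour, one byte per channel.
r, g, b = 255, 128, 0                  # an orange shade (illustrative values)
pixel = (r << 16) | (g << 8) | b       # red in the high byte, blue in the low byte
print(format(pixel, "06X"))            # FF8000
print(2 ** 24)                         # 16777216 - the "over 16 million" colours
```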
Colour depth examples
1 bit – 2 colours (monochrome)
Drawn Bitmaps
• Early video games used low-resolution bitmaps
• Most modern games use high-resolution bitmaps to achieve photo-
realistic scenes
Vector Graphics
• The main disadvantage with the bitmap format is that resizing the
image inevitably causes problems with image quality
• Each bitmap is designed to be viewed at a particular resolution
• Vector graphics are more like very compact programs; they contain
instructions for drawing lines and shapes
• Vector graphics look 'perfect' at any resolution so they can be resized
without any loss of image quality
Vector Example (Simple)
Vector Example (Complex)
Vector vs Bitmap
• Vector graphics maintain image quality when resized
• Vector graphics files are much smaller
• It is much easier to create and edit vector graphics
• Vector graphics files have to be processed before they can be
displayed; very high-quality images require significant processing
Vector: simpler images and non-real-time display | Bitmap: complex images, real-time display
Sound - Introduction
• Converting analogue information to digital
• Sampling frequency and 'depth' (number of bits per sample)
Analogue to Digital Conversion
3.5 minutes of audio = 44.1 kHz x 16 bits x 2 channels x 210 s ≈ 296 million bits (≈ 37 Mbytes)
How to Tackle Calculations
• Notes
• sound file size = sample rate x duration (s) x bit depth x number of channels
• image file size = colour depth x image height (px) x image width (px)
• text file size = bits per character x number of characters
• Show each of these separately, and break out what each one may look like
(e.g., sample rate may be in Hz, kHz, etc.)
• Key point – remember to divide by 8 if we are given bits and want the
answer in bytes or KB/MB/GB…
• Broadly, you are going to be multiplying everything else!
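The audio calculation from the slides, worked step by step in Python (variable names are our own):

```python
# Sketch: 3.5 minutes of CD-quality stereo audio, step by step.
sample_rate = 44_100      # Hz (samples per second)
bit_depth = 16            # bits per sample
channels = 2              # stereo
duration = 3.5 * 60       # 3.5 minutes = 210 seconds

bits = sample_rate * bit_depth * channels * duration
print(round(bits / 1e6))       # 296 - about 296 million bits
print(round(bits / 8 / 1e6))   # 37  - divide by 8 for bytes: about 37 Mbytes
```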
Representative file sizes
Media Average size
eBook 2.5 Mbytes
MP3 song 3.5 Mbytes
DVD Movie 4 Gbytes
HD Movie 12 Gbytes
Blu-ray Movie 22 Gbytes
4k Movie >100 Gbytes
Use of compression
• Storage
• Transmission
• Measurement of compression
Types of Compression
• Lossless
• Lossless compression is always reversible – we can regenerate the original
digital file in its entirety
• Lossless compression depends on statistical redundancy in data – for example
large areas of the same colour in an image, or repeated patterns of bytes
• Lossy
• Lossy compression is irreversible – it depends on throwing away some (non-
essential) information
• Lossy compression always leads to some degradation of the image or audio
being compressed – there is a trade-off between loss of quality and reduction
in size
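One concrete example of exploiting statistical redundancy is run-length encoding; this sketch is illustrative only, not the scheme used by any particular real-world format:

```python
# Sketch: run-length encoding - a simple lossless scheme for repeated values.
from itertools import groupby

def rle(data):
    # Each run of identical values becomes a (value, count) pair.
    return [(value, len(list(run))) for value, run in groupby(data)]

print(rle("AAAABBBCCD"))  # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
```

Because every run is recorded exactly, the original data can be regenerated in its entirety, which is what makes the scheme lossless.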
Applications
• Lossless compression is used where a reduction in quality is
undesirable (for example, maintaining the original quality of an audio,
image or video file), or unacceptable (for example, documents)
• Lossy compression is used for audio, image and video files in cases
where some loss of fidelity is acceptable – for example, the human
eye is more sensitive to luminance (brightness) than it is to colour
variations; the human ear is most sensitive in the 2kHz to 5kHz band
within the 'absolute' limits of human hearing of 20Hz to 20kHz
• Lossy compression can be much, much more effective than lossless,
and we can dynamically choose how much information we are
prepared to lose to get greater compression