7. File Compression
What is file compression?
Symbols and Encoding
• We assume a piece of source data that is subject to
compression (such as text, image, etc.) is
represented by a sequence of symbols. Each symbol
is encoded in a computer by a code (or codeword,
value), which is a bit string.
• Example:
– English text: abc (symbols), ASCII (coding)
– Chinese text: 多媒體 (symbols), BIG5 (coding)
– Image: color (symbols), RGB (coding)
Character distribution
• Some symbols are used more frequently than others.
• In English text, ‘e’ and space occur most often.
• Fixed-length encoding: use the same number of bits
to represent each symbol.
• With fixed-length encoding, to represent n symbols,
we need ⌈log₂ n⌉ bits for each code.
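A minimal sketch of the bit count above. The function name fixed_length_bits is a hypothetical helper introduced for illustration; it uses integer arithmetic rather than a floating-point logarithm.

```python
def fixed_length_bits(n: int) -> int:
    """Bits per codeword needed to give n symbols distinct fixed-length codes.

    Equals ceil(log2(n)); computed with bit_length to avoid float rounding.
    """
    if n < 1:
        raise ValueError("need at least one symbol")
    return (n - 1).bit_length()


# 26 letters need 5 bits each; 128 ASCII characters need 7.
print(fixed_length_bits(26), fixed_length_bits(128))
```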
Finding redundancy
• Most files are fairly redundant
• Instead of listing the same information over
and over, compression programs list the
information once and refer back to it when the
information appears again
Example
• “Ask not what your country can do for you — ask what you
can do for your country.”
• Below is a sample dictionary and the compressed sentence.
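The idea of listing each word once and referring back to it can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the dictionary it builds is not necessarily the one shown on the original slide, and it drops case and punctuation for simplicity, so it is not a complete lossless scheme.

```python
import re


def dictionary_compress(text):
    """Replace each word by the 1-based index of its first occurrence."""
    words = re.findall(r"[a-z]+", text.lower())  # assumption: words only
    dictionary = []   # each distinct word, listed once
    index = {}        # word -> position in the dictionary
    encoded = []      # the sentence as a list of dictionary references
    for word in words:
        if word not in index:
            dictionary.append(word)
            index[word] = len(dictionary)
        encoded.append(index[word])
    return dictionary, encoded


def dictionary_decompress(dictionary, encoded):
    """Rebuild the word sequence from the dictionary references."""
    return [dictionary[i - 1] for i in encoded]


sentence = ("Ask not what your country can do for you - "
            "ask what you can do for your country.")
dictionary, encoded = dictionary_compress(sentence)
print(dictionary)
print(encoded)
```

The 17 words of the sentence use only 9 distinct dictionary entries, which is where the saving comes from.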
File Compression
• Lossless compression loses no data and is
used for data backup.
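One way to see "loses no data" in practice is a round trip through a lossless compressor. The sketch below uses Python's standard zlib module as an example; the sample data is an assumption chosen to be highly redundant.

```python
import zlib

# Assumption: a deliberately repetitive byte string as sample input.
original = b"Ask not what your country can do for you. " * 50

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the restored bytes are identical to the original.
print(len(original), len(compressed), restored == original)
```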
RUN-LENGTH ENCODING
Data files frequently contain the same character
repeated many times in a row.
Run Length Encoding
CTAAAAAGGGTCGTTTTTTGCCCGGGGGCCTCCCCCCC
Run Length Encoding (Contd.)
WWWBWWWWWBWWWBWWWWBWWWWWBWWWBWW
WWWBWWBWWWWWWBBBWWWWWWWBWBWWWWW
WWBWWBBWWWWWBWWWWBWWWWBWWWWB
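WWWBWWWWWBWWWBWWWWB…
is encoded as
3WB5WB3WB4WB…
(each run is replaced by its length followed by the character; a length of 1 is omitted)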
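A minimal run-length encoder and decoder in the notation of the example above can be sketched as follows. The function names are hypothetical, and the sketch assumes the input contains no digit characters, since digits are used for the run lengths.

```python
from itertools import groupby


def rle_encode(text):
    """Replace each run by <length><char>; a run of length 1 is just <char>."""
    parts = []
    for ch, run in groupby(text):
        n = len(list(run))
        parts.append(ch if n == 1 else f"{n}{ch}")
    return "".join(parts)


def rle_decode(encoded):
    """Invert rle_encode (assumes the original text had no digits)."""
    out = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch            # accumulate a multi-digit run length
        else:
            out.append(ch * int(count or "1"))
            count = ""
    return "".join(out)


print(rle_encode("WWWBWWWWWBWWWBWWWWB"))
```

Note that run-length encoding only helps when long runs are common; a string with no repeated neighbours is left unchanged by this variant.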
Variable-Length Encoding
ABRACADABRA
Fixed-length (5-bit) code:
00001 00010 10010 00001 00011 00001 00100 00001 00010 10010 00001 : 55 bits
Variable-length code: A: 0, B: 1, R: 01, C: 10, D: 11
0 1 01 0 10 0 11 0 1 01 0 : 15 bits + 10 delimiters
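The two bit counts can be checked with a short sketch. The 5-bit code here is assumed to be each letter's position in the alphabet written in binary, which reproduces the codewords listed above; the variable names are illustrative.

```python
message = "ABRACADABRA"

# Assumption: fixed-length code = alphabet position (A=1, ...) in 5 bits.
fixed = [format(ord(ch) - ord("A") + 1, "05b") for ch in message]

# Variable-length code taken from the example.
short_code = {"A": "0", "B": "1", "R": "01", "C": "10", "D": "11"}
variable = [short_code[ch] for ch in message]

fixed_bits = sum(len(word) for word in fixed)
variable_bits = sum(len(word) for word in variable)
delimiters = len(message) - 1      # one separator between adjacent codewords

print(" ".join(fixed), fixed_bits)
print(" ".join(variable), variable_bits, delimiters)

# Without delimiters the short code is ambiguous: "AB" and "R" both give 01.
ambiguous = short_code["A"] + short_code["B"] == short_code["R"]
```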
Variable-Length Encoding
Delimiters are not needed if no code is a prefix of another.
ABRACADABRA
1100011110101110110001111
A trie with M external nodes (leaves) can be used to encode a message with M different characters.
Variable-Length Encoding
[Figure: two example coding tries, (a) and (b), whose leaves are the characters A, B, C, D, and R]
The code for each character is determined by the path from the root to that
character, with 0 for “left” and 1 for “right”.
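Decoding with a trie can be sketched as follows. The code table below is an assumption: it is one prefix-free code that reproduces the 25-bit string 1100011110101110110001111 for ABRACADABRA, and it may not match the trie drawn in the original figure. The nested-dict trie and the function names are illustrative choices.

```python
# Assumed prefix-free code table (not necessarily the one in the figure).
CODE = {"A": "11", "B": "00", "C": "010", "D": "10", "R": "011"}


def is_prefix_free(code):
    """True if no codeword is a prefix of another codeword."""
    words = sorted(code.values())
    # After sorting, any prefix is immediately followed by a word it prefixes.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def build_trie(code):
    """Nested dicts keyed by '0' (left) and '1' (right); leaves are characters."""
    root = {}
    for symbol, bits in code.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = symbol
    return root


def encode(message, code):
    return "".join(code[ch] for ch in message)


def decode(bits, trie):
    """Walk from the root; on reaching a leaf, emit it and restart at the root."""
    out = []
    node = trie
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):
            out.append(node)
            node = trie
    return "".join(out)


bits = encode("ABRACADABRA", CODE)
print(bits, len(bits))
print(decode(bits, build_trie(CODE)))
```

Because the code is prefix-free, the decoder never needs delimiters: every leaf it reaches marks the end of exactly one codeword.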
Huffman Coding (cont’d)
• Optimal code: minimizes the average number of
code symbols per source symbol.
• Forward Pass
1. Sort the symbols by probability.
2. Combine the two lowest probabilities into one.
3. Repeat step 2 until only two
probabilities remain.
Huffman Coding (cont’d)
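• Backward Pass
Assign code symbols going backwards: give 0 and 1 to the
final two probabilities, then, each time a combined
probability is split back into its two parts, extend its
code with 0 for one part and 1 for the other.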
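The two passes can be sketched in code. This is not the tabular procedure from the slides but the common heap-and-tree formulation of the same idea: the merging loop plays the role of the forward pass (it continues until a single root remains, which is equivalent to stopping at two probabilities), and the recursive walk plays the role of the backward pass. The function name and the use of character counts in place of probabilities are assumptions.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code(message):
    """Return a dict mapping each character of message to its Huffman codeword."""
    freq = Counter(message)
    tiebreak = count()  # unique sequence number so heap never compares nodes
    heap = [(weight, next(tiebreak), sym) for sym, weight in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate one-symbol message
        return {heap[0][2]: "0"}

    # Forward pass: repeatedly combine the two lowest weights.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    # Backward pass: walk back down, extending codes with 0 and 1.
    code = {}

    def assign(node, prefix):
        if isinstance(node, tuple):         # internal node: (left, right)
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:                               # leaf: a single character
            code[node] = prefix

    assign(heap[0][2], "")
    return code


message = "ABRACADABRA"
code = huffman_code(message)
total_bits = sum(len(code[ch]) for ch in message)
print(code, total_bits)
```

The exact codewords depend on how ties are broken, but the total encoded length for a given message does not.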