
7.file Compression

Uploaded by

rgn12
File Compression

What is file compression?

• Compression creates smaller files by eliminating or reducing redundancy.
• For example, let’s say “AAAADDDDDDD” is a piece of data stored on your computer.
• Compression software could store this as “4A7D” (4 characters instead of 11), saving about 64% of the space.
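The “4A7D” example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `rle_encode` is made up for illustration, and the scheme assumes the data itself contains no digits.

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Replace each run of a repeated character with <count><character>."""
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))

original = "AAAADDDDDDD"
encoded = rle_encode(original)

# Fraction of space saved: 1 - (compressed size / original size).
saving = 1 - len(encoded) / len(original)
print(encoded, f"{saving:.0%}")  # 4A7D 64%
```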

Symbols and Encoding
• We assume a piece of source data that is subject to
compression (such as text, image, etc.) is
represented by a sequence of symbols. Each symbol
is encoded in a computer by a code (or codeword,
value), which is a bit string.
• Example:
– English text: abc (symbols), ASCII (coding)
– Chinese text: 多媒體 (symbols), BIG5 (coding)
– Image: color (symbols), RGB (coding)

Character distribution
• Some symbols are used more frequently than others.
• In English text, ‘e’ and space occur most often.
• Fixed-length encoding: use the same number of bits
to represent each symbol.
• With fixed-length encoding, to represent n distinct symbols, we need ⌈log₂ n⌉ bits for each code (log₂ n rounded up to a whole number).

Finding redundancy
• Most files are fairly redundant
• Instead of listing the same information over
and over, compression programs list the
information once and refer back to it when the
information appears again

Example

• “Ask not what your country can do for you — ask what you can do for your country.”
• This quote has 17 words, with 61 letters, 16
spaces, one dash, and one period.
• Let’s say each character uses one unit of
memory; that’s a total of 79 units.
• Notice that many words appear several times.

• “Ask not what your country can do for you — ask what you
can do for your country.”
• Below is a sample dictionary and the compressed sentence.

[Figure: sample dictionary and compressed sentence]

• The file would now be 74 units long (37 for the dictionary; 37 for the sentence). Applied to the whole speech, the space savings would be much larger.
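One way to build such a dictionary is sketched below. It is an illustration under stated assumptions, not the scheme from the slide: every word that occurs more than once gets a number, the text is lowercased so that “Ask” and “ask” count as the same word (a real scheme would have to record capitalization), the dash is written as a plain hyphen, and spaces are dropped. All function names are hypothetical, and the sketch does not try to reproduce the 74-unit count.

```python
import re
from collections import Counter

def tokenize(text):
    # Words and punctuation marks become separate tokens; spaces are dropped.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_dictionary(tokens):
    """Number every token that occurs more than once, in order of first appearance."""
    counts = Counter(tokens)
    dictionary = {}
    for tok in tokens:
        if counts[tok] > 1 and tok not in dictionary:
            dictionary[tok] = len(dictionary) + 1
    return dictionary

def compress(tokens, dictionary):
    # Repeated words become integers; everything else stays as text.
    return [dictionary.get(tok, tok) for tok in tokens]

def decompress(codes, dictionary):
    lookup = {index: word for word, index in dictionary.items()}
    return [lookup.get(code, code) for code in codes]

quote = "Ask not what your country can do for you - ask what you can do for your country."
tokens = tokenize(quote)
dictionary = build_dictionary(tokens)
compressed = compress(tokens, dictionary)

print(dictionary)
print(compressed)
```

Because each word is listed once in the dictionary, the saving grows with the number of repetitions, which is why a whole speech compresses better than a single sentence.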
Compression Techniques
• Lossless
– Data can be completely recovered after decompression
– Recovered data is identical to the original
– Exploits redundancy in data
• Lossy
– Data cannot be completely recovered after decompression
– Some information is lost forever
– Gives more compression than lossless
– Discards “insignificant” data components
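The difference can be seen in a small experiment. The lossless half uses Python's standard `zlib` module; the lossy half is only a toy stand-in (rounding samples to the nearest quarter) and is not how real audio or video codecs work. The sample data is made up.

```python
import zlib

# Lossless: decompressing returns exactly the bytes that went in.
data = b"AAAADDDDDDD" * 100
packed = zlib.compress(data)
restored = zlib.decompress(packed)
print(len(data), len(packed), restored == data)

# Lossy (toy example): round each sample to the nearest quarter.
# Nearby values collapse to the same number, so the originals cannot be recovered.
samples = [0.12, 0.47, 0.51, 0.98]
quantized = [round(s * 4) / 4 for s in samples]
print(quantized)
```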

File Compression
• Lossless compression loses no data and is
used for data backup.

File Compression (continued)

• Lossy compression is used for applications such as sound and video, where a minor loss of data is acceptable.

File Compression (continued)

• The compression ratio is the ratio of the number of bits in the original data to the number of bits in the compressed data. For instance, if a data file contains 500,000 bytes and the compressed data contains 100,000 bytes, the compression ratio is 5:1.
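The arithmetic in this example can be written as a small helper. The function name is an assumption for illustration; sizes may be given in bits or bytes as long as both use the same unit.

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Original size divided by compressed size (same unit for both)."""
    if compressed_size <= 0:
        raise ValueError("compressed size must be positive")
    return original_size / compressed_size

ratio = compression_ratio(500_000, 100_000)
print(f"{ratio:g}:1")                      # 5:1
print(f"space saved: {1 - 1 / ratio:.0%}")  # space saved: 80%
```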

RUN-LENGTH ENCODING
Data files frequently contain the same character
repeated many times in a row.

Example of run-length encoding. Each run of zeros is replaced by two characters in the compressed file: a zero to indicate that compression is occurring, followed by the number of zeros in the run.
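The zero-run scheme described above might look like this in Python. It is a sketch with assumptions: the data is a list of byte values (0 to 255), runs are capped at 255 so the count also fits in a byte, and the sample values are made up for illustration.

```python
def encode_zero_runs(values):
    """Replace each run of zeros with the pair (0, run length)."""
    out = []
    i = 0
    while i < len(values):
        if values[i] == 0:
            run = 0
            # Cap the run at 255 so the count fits in one byte (an assumption).
            while i < len(values) and values[i] == 0 and run < 255:
                run += 1
                i += 1
            out += [0, run]
        else:
            out.append(values[i])
            i += 1
    return out

def decode_zero_runs(encoded):
    """Expand every (0, n) pair back into n zeros."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == 0:
            out.extend([0] * encoded[i + 1])
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return out

data = [17, 8, 54, 0, 0, 0, 97, 5, 16, 0, 45, 23, 0, 0, 0, 0, 0, 3, 67, 0, 0, 8]
packed = encode_zero_runs(data)
print(packed)
print(len(data), "->", len(packed))
```

Note that an isolated zero grows from one value to two, so this scheme only pays off when zeros tend to come in runs.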

Run Length Encoding
CTAAAAAGGGTCGTTTTTTGCCCGGGGGCCTCCCCCCC

CT5A3GTCG6TG3C5GCCT7C

Run-length encoded: 21 symbols (the original has 38). Runs of three or more are replaced by a count and the symbol; shorter runs are left as they are.
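The encoded string in this example only replaces runs of three or more symbols and leaves shorter runs alone. A minimal sketch of that rule follows; the threshold parameter `min_run` and the function names are assumptions, and the decoder relies on the symbols never being digits.

```python
import re
from itertools import groupby

def rle_encode(sequence: str, min_run: int = 3) -> str:
    """Write runs of at least min_run symbols as <count><symbol>; copy shorter runs."""
    parts = []
    for symbol, run in groupby(sequence):
        length = len(list(run))
        parts.append(f"{length}{symbol}" if length >= min_run else symbol * length)
    return "".join(parts)

def rle_decode(encoded: str) -> str:
    """Expand each optional count followed by one non-digit symbol."""
    return "".join(
        symbol * int(count or 1)
        for count, symbol in re.findall(r"(\d*)(\D)", encoded)
    )

dna = "CTAAAAAGGGTCGTTTTTTGCCCGGGGGCCTCCCCCCC"
encoded = rle_encode(dna)
print(encoded, len(dna), "->", len(encoded))
```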

Run Length Encoding (Contd.)
WWWBWWWWWBWWWBWWWWBWWWWWBWWWBWW
WWWBWWBWWWWWWBBBWWWWWWWBWBWWWWW
WWBWWBBWWWWWBWWWWBWWWWBWWWWB

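WWWBWWWWWBWWWBWWWWB….

3WB5WB3WB4WB…. (each run is replaced by its length and symbol; a length of 1 is omitted)

3151314…. (possible optimization: with only two symbols the runs alternate, so the lengths alone are enough, but the decoder must know which symbol comes first)

#W3151314….. (so the optimization requires an escape character, here #, followed by the starting symbol)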
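For data with only two symbols, the counts-only idea can be sketched as follows. The starting symbol is returned separately instead of being marked with an escape character, and the counts are kept as a list of integers so that runs of ten or more stay unambiguous; both choices, like the function names, are assumptions of this sketch.

```python
from itertools import groupby

def encode_alternating(pixels: str):
    """Return the first symbol and the lengths of the alternating runs."""
    if not pixels:
        return "", []
    return pixels[0], [len(list(run)) for _, run in groupby(pixels)]

def decode_alternating(first: str, counts, alphabet: str = "WB") -> str:
    """Rebuild the string by switching symbols after every run."""
    other = alphabet[1] if first == alphabet[0] else alphabet[0]
    return "".join(
        (first if i % 2 == 0 else other) * n for i, n in enumerate(counts)
    )

row = "WWWBWWWWWBWWWBWWWWB"
first, counts = encode_alternating(row)
print(first, counts)
```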

Variable - Length Encoding

• Variable-length encoding is a scheme in which the codes have different lengths. More frequently occurring symbols are given shorter codes and less frequently occurring symbols are given longer codes.
• Example:
– Morse code (the letters e and t are the most frequent in English, so they are assigned a dot (.) and a dash (-), respectively)
– Huffman encoding: a variable-length encoding in which the lengths of the codes are based on the probability of their occurrence (built as a binary tree)

Variable - Length Encoding

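Fixed-length encoding with a 5-bit binary representation (each letter is encoded as its position in the alphabet):

ABRACADABRA

00001 00010 10010 00001 00011 00001 00100 00001 00010 10010 00001 : 55 bits

D appears once but requires the same number of bits as A (which appears 5 times).

Instead, encode the more frequently used characters with as few bits as possible:

A: 0, B: 1, R: 01, C: 10, D: 11

0 1 01 0 10 0 11 0 1 01 0 : 15 bits + 10 delimiters

The delimiters are needed because the bit string is otherwise ambiguous: 01 could be read as R or as AB.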
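To see why this short code cannot be used without delimiters, the sketch below encodes the message and then counts how many different ways the undelimited bit string could be split back into codewords. The helper names are hypothetical.

```python
from functools import lru_cache

code = {"A": "0", "B": "1", "R": "01", "C": "10", "D": "11"}

def encode(message: str, code: dict, delimiter: str = "") -> str:
    return delimiter.join(code[ch] for ch in message)

def count_decodings(bits: str, code: dict) -> int:
    """Count the ways bits can be split into a sequence of codewords."""
    words = set(code.values())

    @lru_cache(maxsize=None)
    def ways(start: int) -> int:
        if start == len(bits):
            return 1
        return sum(ways(start + len(w)) for w in words if bits.startswith(w, start))

    return ways(0)

bits = encode("ABRACADABRA", code)
print(len(bits), "bits")
print(count_decodings("01", code), "readings of 01")
print(count_decodings(bits, code), "readings of the whole message")
```

Any count above 1 means the decoder cannot know which message was sent.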

Variable - Length Encoding
Delimiters are not needed if no code is a prefix of another.

A: 11, B: 00, C: 010, D: 10, R: 011

ABRACADABRA

1100011110101110110001111 : 25 bits

Represent this code in a trie.

A trie with M leaf nodes can be used to encode a message with M different characters.

[Figure: tries with the characters A, B, C, D, R at the leaves]
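A prefix-free code like A: 11, B: 00, C: 010, D: 10, R: 011 can be decoded by walking a trie one bit at a time. The sketch below stores the trie as nested dictionaries, which is just one convenient representation chosen for this example; it assumes the code really is prefix-free.

```python
def build_trie(code: dict) -> dict:
    """Nested dicts keyed by '0'/'1'; a leaf holds the character itself."""
    root = {}
    for char, bits in code.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = char
    return root

def decode(bits: str, trie: dict) -> str:
    out = []
    node = trie
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf: emit it and restart at the root
            out.append(node)
            node = trie
    return "".join(out)

code = {"A": "11", "B": "00", "C": "010", "D": "10", "R": "011"}
encoded = "".join(code[ch] for ch in "ABRACADABRA")
trie = build_trie(code)
print(encoded, len(encoded))
print(decode(encoded, trie))
```

Because no codeword is a prefix of another, the walk reaches exactly one leaf per character and never needs a delimiter.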
Variable - Length Encoding
[Figure: two tries, (a) and (b), with the characters A, B, C, D, R at the leaves]

The code for each character is determined by the path from the root to the character, with 0 for “left” and 1 for “right”.

(a) produces 000100111011

(b) produces 01011011101111

Huffman Coding (cont’d)
• Optimal code: minimizes the average number of code symbols per source symbol.
• Forward Pass
1. Sort the symbols by probability
2. Combine the two lowest probabilities into one
3. Repeat step 2 until only two probabilities remain.

Huffman Coding (cont’d)
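• Backward Pass
Assign code symbols going backwards: give 0 and 1 to the final two probabilities, then undo each combination in reverse order, extending the combined code with 0 for one member of the pair and 1 for the other.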
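The two passes can be folded into one loop, as in the sketch below. It is one common way to implement Huffman coding with Python's `heapq` module, not necessarily the procedure the slides have in mind: it uses symbol counts in place of probabilities, keeps merging until a single group remains, and prepends a bit to every code in a group each time two groups are combined. Ties are broken arbitrarily, so the exact codewords may differ from other implementations while the total length stays the same.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(message: str) -> dict:
    """Return a prefix-free code (symbol -> bit string) for the message."""
    freq = Counter(message)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}  # a lone symbol still needs one bit

    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(weight, next(tiebreak), {symbol: ""}) for symbol, weight in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Combine the two lowest weights.
        w0, _, codes0 = heapq.heappop(heap)
        w1, _, codes1 = heapq.heappop(heap)
        # Prepend the bit that distinguishes the two merged groups.
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))

    return heap[0][2]

message = "ABRACADABRA"
code = huffman_code(message)
encoded = "".join(code[ch] for ch in message)
print(code)
print(len(encoded), "bits")
```

For ABRACADABRA this gives 23 bits, compared with 55 bits for the 5-bit fixed-length encoding.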
