Data Compression Techniques
Jörgen Ahlberg, Div. of Sensor Technology, Swedish Defence Research Agency (FOI)
Outline
Huffman coding
Arithmetic coding
Application: JBIG
Repetition
Coding: Assigning binary codewords to (blocks of) source symbols.
Variable-length codes (VLC) and fixed-length codes.
Instantaneous codes ⊂ Uniquely decodable codes ⊂ Non-singular codes ⊂ All codes.
Tree codes are instantaneous.
Tree code ⇔ Kraft's inequality.
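As a small illustration of the link between tree codes and Kraft's inequality, the sketch below computes the Kraft sum of a set of codeword lengths and checks whether a code is prefix-free. The helper names `kraft_sum` and `is_prefix_free` are hypothetical, chosen only for this example.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Return the Kraft sum: the sum of 2^(-l) over all codeword lengths l."""
    return sum(Fraction(1, 2 ** l) for l in lengths)


def is_prefix_free(codewords):
    """True if no codeword is a prefix of another one (a tree code)."""
    words = sorted(codewords)
    # After sorting, a prefix is directly followed by a word that it prefixes.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


code = ["0", "10", "110", "111"]
print(kraft_sum(len(w) for w in code))  # Kraft sum of the lengths 1, 2, 3, 3
print(is_prefix_free(code))
```

A binary tree code with the given lengths exists exactly when the Kraft sum is at most 1, so lengths such as 1, 1, 2 cannot be realised.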
In practice
Use some nice algorithm to find the code tree
Huffman coding
Tunstall coding
Huffman Coding
Two-step algorithm:
1. Iterate:
Merge the two least probable symbols. Sort.
2. Assign bits.
Example: the symbols a, b, c, d get the codewords 0, 10, 110, 111.
[Figure: code tree built by repeated Merge and Sort steps, followed by Assign and Get code.]
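The merge, sort, and assign steps can be sketched in Python as follows. This is a minimal sketch, not the slide's own implementation: `huffman_code` is a hypothetical helper, the probabilities 1/2, 1/4, 1/8, 1/8 are assumed for illustration, a heap replaces the explicit sort, and bits are prepended at each merge rather than assigned in a separate second pass. It assumes at least two symbols.

```python
import heapq
from fractions import Fraction
from itertools import count


def huffman_code(probs):
    """Build a binary Huffman code for a dict {symbol: probability}."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable groups.
        p0, _, group0 = heapq.heappop(heap)
        p1, _, group1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in group0.items()}
        merged.update({s: "1" + c for s, c in group1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]


# Assumed example distribution (not stated on the slide).
probs = {"a": Fraction(1, 2), "b": Fraction(1, 4),
         "c": Fraction(1, 8), "d": Fraction(1, 8)}
code = huffman_code(probs)
average_length = sum(probs[s] * len(code[s]) for s in probs)
print(code, average_length)
```

Which of two equally probable groups gets the 0 branch is arbitrary; other tie-breaking rules give different codewords with the same lengths.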
Arithmetic Coding
Shannon-Fano-Elias.
Basic idea: Split the interval [0,1] according to the symbol probabilities.
Example: A = {a,b,c,d}, P = {1/2, 1/4, 1/8, 1/8}.
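The interval splitting can be sketched with exact fractions. The code below is a sketch under assumptions: the helper names are hypothetical, the probabilities 1/2, 1/4, 1/8, 1/8 for a, b, c, d are assumed, and the symbols are laid out on [0, 1) in alphabetical order.

```python
from fractions import Fraction


def cumulative_intervals(probs):
    """Split [0, 1) into one sub-interval per symbol, in the given order."""
    intervals, low = {}, Fraction(0)
    for symbol, p in probs.items():
        intervals[symbol] = (low, low + p)
        low += p
    return intervals


def encode_interval(sequence, probs):
    """Return the interval [low, high) that represents the whole sequence."""
    table = cumulative_intervals(probs)
    low, width = Fraction(0), Fraction(1)
    for symbol in sequence:
        s_low, s_high = table[symbol]
        # Zoom into the part of the current interval that belongs to `symbol`.
        low, width = low + width * s_low, width * (s_high - s_low)
    return low, low + width


# Assumed example distribution.
probs = {"a": Fraction(1, 2), "b": Fraction(1, 4),
         "c": Fraction(1, 8), "d": Fraction(1, 8)}
print(encode_interval("bad", probs))
```

The width of the final interval equals the product of the symbol probabilities, so a sequence of probability p needs roughly -log2(p) bits to identify.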
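Start in b. Code the sequence c c a. ⇒ Code the interval [0.9, 0.96].

[Figure: successive interval splits.
[0, 1]: a = [0, 0.2], b = [0.2, 0.5], c = [0.5, 1]
[0.5, 1]: a = [0.5, 0.8], b = [0.8, 0.9], c = [0.9, 1]
[0.9, 1]: a = [0.9, 0.96], b = [0.96, 0.98], c = [0.98, 1]]

Bit   Interval            Decoder
1     0.5 - 1             c
1     0.75 - 1
1     0.875 - 1
0     0.875 - 0.9375
1     0.90625 - 0.9375    c a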
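To see how bits relate to the interval [0.9, 0.96] in this example, the sketch below searches for the shortest bit string whose binary (dyadic) interval fits inside a target interval, then lists the intervals a decoder sees after each bit. Both helper names are hypothetical, and a practical arithmetic coder emits bits incrementally instead of searching like this.

```python
from fractions import Fraction
from math import ceil


def shortest_bits_inside(low, high):
    """Shortest bit string whose dyadic interval lies inside [low, high]."""
    k = 1
    while True:
        m = ceil(low * 2 ** k)  # first multiple of 2^-k at or above low
        if Fraction(m + 1, 2 ** k) <= high:
            return format(m, "0{}b".format(k))
        k += 1


def bit_intervals(bits):
    """Intervals known to the decoder after each received bit."""
    low, width, seen = Fraction(0), Fraction(1), []
    for bit in bits:
        width /= 2
        if bit == "1":
            low += width  # a 1 selects the upper half
        seen.append((low, low + width))
    return seen


bits = shortest_bits_inside(Fraction("0.9"), Fraction("0.96"))
for bit, (lo, hi) in zip(bits, bit_intervals(bits)):
    print(bit, float(lo), float(hi))
```

Each received bit halves the decoder's interval; a symbol can be output as soon as that interval lies entirely inside one symbol's sub-interval.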
Such an environment is called a context. A probability distribution for X can be estimated for each state. Then arithmetic coding is used. This is the basic idea behind the JBIG algorithm for binary images and data.
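A rough sketch of per-context estimation follows. It is not the JBIG algorithm: as a simplifying assumption, the context here is just the previous `order` bits of a one-dimensional bit sequence, the probabilities come from simple counts, and the function name is hypothetical. It returns the ideal code length that an arithmetic coder driven by these estimates would approach.

```python
from collections import defaultdict
from math import log2


def context_code_length(bits, order=2):
    """Ideal code length in bits when each bit is coded with a probability
    estimated from the counts seen so far in its context."""
    # Per-context counts of zeros and ones, starting at one each.
    counts = defaultdict(lambda: [1, 1])
    total = 0.0
    for i, bit in enumerate(bits):
        context = tuple(bits[max(0, i - order):i])
        seen = counts[context]
        total += -log2(seen[bit] / sum(seen))
        seen[bit] += 1  # the decoder can make the same update
    return total


pattern = [0, 1] * 50
print(context_code_length(pattern, order=0))  # no context
print(context_code_length(pattern, order=1))  # previous bit as context
```

On the alternating pattern, the previous bit predicts the next one perfectly, so the code length with a one-bit context is far below the roughly one bit per symbol needed without context.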
Universal Coding
A universal coder doesn't need to know the statistics in advance. Instead, estimate from data.
Forward estimation: Estimate statistics in a first pass and transmit to the decoder.
Backward estimation: Estimate from already transmitted (received) symbols.
Arithmetic coder
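The difference between forward and backward estimation can be illustrated by comparing ideal code lengths. This is a sketch under assumptions: both function names are hypothetical, the forward variant ignores the cost of transmitting the statistics, and the backward variant starts from one pseudo-count per symbol.

```python
from collections import Counter
from math import log2


def forward_code_length(symbols):
    """Two passes: measure the frequencies first, then code with that fixed
    model. The cost of sending the model to the decoder is not included."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum(-log2(counts[s] / n) for s in symbols)


def backward_code_length(symbols, alphabet):
    """One pass: code each symbol using counts of the symbols already sent."""
    counts = {s: 1 for s in alphabet}  # one pseudo-count per symbol
    total = 0.0
    for s in symbols:
        total += -log2(counts[s] / sum(counts.values()))
        counts[s] += 1  # the decoder performs the same update
    return total


text = "aaaaaaab" * 10
forward = forward_code_length(text)
backward = backward_code_length(text, "ab")
print(round(forward, 2), round(backward, 2))
```

The backward estimate costs a few extra bits while the counts are still unreliable, but it needs no side information and no first pass over the data.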
LZ77
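[Figure: a sliding window over the data, split into a search buffer (positions 8 down to 1, already coded symbols) and a look-ahead buffer (symbols still to be coded). Each step transmits a triplet <offset, length, next>.]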
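If the size of the search buffer is N and the size of the alphabet is M, we need $\lceil \log_2 N \rceil$ bits for the offset and $\lceil \log_2 M \rceil$ bits for the next symbol, plus the bits for the match length, to code a triplet.
Variation: Use a VLC to code the triplets! Used in PKZip, Zip, LHarc, PNG, gzip, ARJ.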
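A minimal LZ77 sketch with <offset, length, next> triplets is given below. The function names, the buffer sizes, and the brute-force longest-match search are assumptions made for illustration; real implementations use faster match search and, as noted above, usually code the triplets with a VLC.

```python
def lz77_encode(data, search_size=8, lookahead_size=4):
    """Encode a string as a list of (offset, length, next_symbol) triplets."""
    triplets, pos = [], 0
    while pos < len(data):
        best_offset, best_length = 0, 0
        # Always leave one symbol to send as `next_symbol`.
        max_length = min(lookahead_size - 1, len(data) - pos - 1)
        for offset in range(1, min(search_size, pos) + 1):
            length = 0
            # A match may run past `pos` into the look-ahead buffer.
            while (length < max_length
                   and data[pos - offset + length] == data[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = offset, length
        triplets.append((best_offset, best_length, data[pos + best_length]))
        pos += best_length + 1
    return triplets


def lz77_decode(triplets):
    """Rebuild the string from (offset, length, next_symbol) triplets."""
    out = []
    for offset, length, symbol in triplets:
        for _ in range(length):
            out.append(out[-offset])  # copy one symbol at a time (overlap-safe)
        out.append(symbol)
    return "".join(out)


text = "abcabdacabcef"
triplets = lz77_encode(text)
print(triplets)
print(lz77_decode(triplets) == text)
```

A triplet with offset 0 and length 0 simply carries a literal symbol, which is how symbols with no match in the search buffer are sent.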
LZ78
Store patterns in a dictionary.
Transmit a tuple <dictionary index, next>.
Example:
Input sequence: a b c a b a b c
Parsed as: a | b | c | ab | abc
Transmitted tuples: <0,a> <0,b> <0,c> <1,b> <4,c>
Dictionary: 1 a, 2 b, 3 c, 4 ab, 5 abc
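The dictionary construction can be sketched as follows. The function names are hypothetical and index 0 stands for the empty pattern; the handling of an input that ends in the middle of a known pattern is one possible convention, not something the slides specify.

```python
def lz78_encode(data):
    """Encode a string as (dictionary index, next symbol) tuples."""
    dictionary = {}  # pattern -> index, filled while coding
    output, pattern = [], ""
    for symbol in data:
        if pattern + symbol in dictionary:
            pattern += symbol  # keep extending a known pattern
        else:
            output.append((dictionary.get(pattern, 0), symbol))
            dictionary[pattern + symbol] = len(dictionary) + 1
            pattern = ""
    if pattern:
        # Input ended inside a known pattern: send its prefix and last symbol.
        output.append((dictionary.get(pattern[:-1], 0), pattern[-1]))
    return output


def lz78_decode(tuples):
    """Rebuild the string; the decoder grows the same dictionary."""
    patterns, out = [""], []
    for index, symbol in tuples:
        entry = patterns[index] + symbol
        patterns.append(entry)
        out.append(entry)
    return "".join(out)


tuples = lz78_encode("abcababc")
print(tuples)
print(lz78_decode(tuples))
```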
LZW
Modification to LZ78 by Terry Welch, 1984.
Applications: GIF, V.42bis.
Patented by Unisys Corp.
Transmit only the dictionary index.
The alphabet is stored in the dictionary in advance.
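Example (alphabet a, b, c, d stored in advance):
[Figure: the input sequence is parsed into patterns; the encoder transmits dictionary indices and the decoder rebuilds the dictionary from them.]
Encoder dictionary: 1 a, 2 b, 3 c, 4 d, 5 ab, 6 bc, 7 ca, 8 aba, 9 abc
Decoder dictionary: 1 a, 2 b, 3 c, 4 d, 5 ab, 6 bc, 7 ca, 8 aba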
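A compact LZW sketch is shown below, with the alphabet preloaded as entries 1 to M and only indices transmitted. The function names are hypothetical, the input string "abcababc" is an assumed example, and the dictionary is allowed to grow without limit, which real formats do not do.

```python
def lzw_encode(data, alphabet):
    """Encode a string as a list of dictionary indices (1-based)."""
    dictionary = {s: i for i, s in enumerate(alphabet, start=1)}
    output, pattern = [], ""
    for symbol in data:
        if pattern + symbol in dictionary:
            pattern += symbol
        else:
            output.append(dictionary[pattern])
            dictionary[pattern + symbol] = len(dictionary) + 1
            pattern = symbol  # the unmatched symbol starts the next pattern
    if pattern:
        output.append(dictionary[pattern])
    return output


def lzw_decode(indices, alphabet):
    """Rebuild the string; the dictionary lags one entry behind the encoder."""
    patterns = {i: s for i, s in enumerate(alphabet, start=1)}
    if not indices:
        return ""
    previous = patterns[indices[0]]
    out = [previous]
    for index in indices[1:]:
        if index in patterns:
            entry = patterns[index]
        else:
            # Index not known yet: it is the pattern currently being built.
            entry = previous + previous[0]
        patterns[len(patterns) + 1] = previous + entry[0]
        out.append(entry)
        previous = entry
    return "".join(out)


indices = lzw_encode("abcababc", "abcd")
print(indices)
print(lzw_decode(indices, "abcd"))
```

The decoder can add an entry only after it has seen the first symbol of the following pattern, which is why its dictionary is one entry shorter than the encoder's at any moment.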
GIF
CompuServe Graphics Interchange Format (1987, 89). Features:
Designed for up/downloading images to/from BBSes via PSTN.
1-, 4-, or 8-bit colour palettes.
Interlace for progressive decoding (four passes, starts with every 8th row).
Transparent colour for non-rectangular images.
Supports multiple images in one file (animated GIFs).
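The four-pass interlace can be illustrated by listing the order in which image rows are stored. The function name is hypothetical; the pass layout (every 8th row from row 0, every 8th from row 4, every 4th from row 2, every 2nd from row 1) follows the GIF89a interlacing scheme.

```python
def gif_interlace_order(height):
    """Row indices in the order an interlaced GIF stores them."""
    passes = [(0, 8), (4, 8), (2, 4), (1, 2)]  # (first row, step) per pass
    return [row
            for start, step in passes
            for row in range(start, height, step)]


print(gif_interlace_order(10))
```

After the first pass a decoder already holds one row out of eight, enough to show a coarse preview that later passes refine.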
GIF: Method
Compression by LZW.
Dictionary size $2^{b+1}$ 8-bit symbols, where b is the number of bits in the palette.
Dictionary size doubled if filled (max 4096).
Works well on computer generated images.
GIF: Problems
Unsuitable for natural images (photos):
Maximum 256 colours (⇒ bad quality).
Repetitive patterns uncommon (⇒ bad compression).
PNG: Method
Compression by LZ77 using a 32KB search buffer. The LZ77 triplets are Huffman coded.
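Python's standard `zlib` module implements DEFLATE, an LZ77 plus Huffman coding scheme of this kind, so it can serve for a quick experiment. The test data below is made up for illustration, and the exact compressed sizes depend on the zlib version.

```python
import os
import zlib

repetitive = b"abcabcabd" * 1000            # many repeated patterns
random_bytes = os.urandom(len(repetitive))  # essentially no repeated patterns

sizes = {}
for name, data in [("repetitive", repetitive), ("random", random_bytes)]:
    packed = zlib.compress(data, 9)         # 9 = highest compression level
    assert zlib.decompress(packed) == data  # lossless round trip
    sizes[name] = len(packed)
    print(name, len(data), "->", len(packed))
```

The repetitive input shrinks to a small fraction of its size, while the random input does not shrink at all, since the search buffer contains almost no useful matches.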
Summary
Huffman coding
Simple, easy, fast.
Complexity grows exponentially with the block length.
Statistics built into the code.
Arithmetic coding
Complexity grows linearly with the block size.
Easily adapted to variable statistics ⇒ used for coding of Markov sources.
Universal coding
Adaptive Huffman or arithmetic coder.
LZ77: Buffer with previously sent sequences, <offset,length,next>.
LZ78: Dictionary instead of buffer, <index,next>.
LZW: Modification to LZ78, <index>.
Summary, cont
Where are the algorithms used?
Huffman coding: JPEG, MPEG, PNG, ...
Arithmetic coding: JPEG, JBIG, MPEG-4, ...
LZ77: PNG, PKZip, Zip, gzip, ...
LZW: compress, GIF, V.42bis, ...
Finally
These methods work best if the source alphabet is small and the distribution skewed.
Text
Graphics