Chapter 2
Entropy Coding

3) Entropy coding
Entropy coding is a type of lossless source coding, also called variable-length statistical
coding. The name comes from the fact that the coding process exploits the statistical
properties of the source: it assigns the shortest code words to the most frequent symbols.
c) Redundancy of a code : is the difference between the average code length (code rate) n̄
and the entropy of the source H(X): ρ = n̄ − H(X). It can also be defined in relative form as
ρ = (n̄ − H(X)) / H(X), where it represents the percentage of additional bits compared to an
optimal code.
d) Variance of a code : measures the dispersion of the code-word lengths around the average
length and is calculated as σ² = Σᵢ Pᵢ (nᵢ − n̄)², where nᵢ is the length of the code word of
symbol i and n̄ = Σᵢ Pᵢ nᵢ is the average code length.
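For example (an illustrative source, not from the original notes), consider a source with
probabilities {0.5, 0.25, 0.25} encoded with code words of lengths {1, 2, 2}. Then
H(X) = −(0.5 log₂ 0.5 + 0.25 log₂ 0.25 + 0.25 log₂ 0.25) = 1.5 bits and
n̄ = 0.5×1 + 0.25×2 + 0.25×2 = 1.5 bits, so ρ = n̄ − H(X) = 0 (the code is optimal), while
σ² = 0.5(1 − 1.5)² + 0.25(2 − 1.5)² + 0.25(2 − 1.5)² = 0.25.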
1) Shannon-Fano encoder : follows an entropy coding process that satisfies the prefix
condition. Compression/decompression is achieved according to a tree, built by recursively
splitting the probability-sorted symbols into two groups of nearly equal total probability,
as in the sketch below:
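The following is a minimal illustrative Python sketch (not the notes' own code; the name
shannon_fano and the example probabilities are hypothetical). It sorts the symbols by
decreasing probability and recursively splits the list where the two groups' total
probabilities are closest:

    def shannon_fano(probabilities):
        # Recursively split the probability-sorted symbol list into two
        # groups of (nearly) equal total probability: the first group
        # gets bit 0, the second bit 1. Leaves of this tree are the codes.
        def build(items, prefix):
            if len(items) == 1:
                codes[items[0][0]] = prefix or "0"
                return
            total = sum(p for _, p in items)
            acc, best_cut, best_diff = 0.0, 1, float("inf")
            for cut in range(1, len(items)):
                acc += items[cut - 1][1]
                diff = abs(2 * acc - total)   # |left total - right total|
                if diff < best_diff:
                    best_cut, best_diff = cut, diff
            build(items[:best_cut], prefix + "0")
            build(items[best_cut:], prefix + "1")

        codes = {}
        build(sorted(probabilities.items(), key=lambda kv: -kv[1]), "")
        return codes

    print(shannon_fano({"a": 0.4, "b": 0.3, "c": 0.15, "d": 0.15}))
    # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

No code word is a prefix of another, since each code corresponds to a distinct leaf of the
splitting tree.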
The problem with the SHANNON-FANO and HUFFMAN encoders lies in the allocation of an
integer number of bits to each code word, which is not always optimal; this is why other
coding processes, such as the arithmetic code below, are proposed. For example, if the
probability of a symbol is 0.9, the optimal number of bits to encode this character is
nᵢ = −log₂ Pᵢ ≈ 0.15 bits, yet a Huffman encoder will assign either 1 or 2 bits to this
symbol, which is much longer than the theoretical value!
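A quick numerical check of this value, using only Python's standard math module:

    import math
    # information content of a symbol of probability 0.9
    print(-math.log2(0.9))   # ≈ 0.152 bits, yet Huffman must spend at least 1 bit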
3) Arithmetic code
The arithmetic coding process does not replace each symbol with a specific code, as the
Huffman encoder does; instead it replaces a whole stream of symbols with a single
floating-point number. The output of this coding process is a number in [0, 1[.
With this method, we use partitions of an interval [a, b] (initially [0, 1[): the size of
each sub-interval is proportional to the frequency of the corresponding symbol. Each symbol
to be compressed reduces the current interval [a, b] to its sub-interval [a', b']; the
latter is itself partitioned and then undergoes the same processing as [a, b]. We thus
finally obtain a very small interval [A, B] containing the code value.
The calculation of the bounds of each interval is as follows:

Coding
    Low = 0.0; High = 1.0;
    While (C = next character)
    Begin
        Range = High - Low;
        High = Low + Range * High_Range(C);
        Low = Low + Range * Low_Range(C);
    End

Decoding
    Number = input code;
    Repeat
        Symbol = Find_symbol(the symbol whose interval contains Number);
        Range = High_Range(Symbol) - Low_Range(Symbol);
        Number = Number - Low_Range(Symbol);
        Number = Number / Range;
    Until the whole message is decoded
    Character   Probability   Interval
    Space       1/10          0.0 ≤ x < 0.1
    A           1/10          0.1 ≤ x < 0.2
    B           1/10          0.2 ≤ x < 0.3
    E           1/10          0.3 ≤ x < 0.4
    G           1/10          0.4 ≤ x < 0.5
    I           1/10          0.5 ≤ x < 0.6
    L           2/10          0.6 ≤ x < 0.8
    S           1/10          0.8 ≤ x < 0.9
    T           1/10          0.9 ≤ x < 1.0
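As an illustration of coding and decoding with the two loops above, here is a minimal
floating-point sketch in Python (not the notes' own code). It assumes the message
"BILL GATES", which matches the symbol table above; the INTERVALS dictionary plays the
role of Low_Range/High_Range. Real arithmetic coders use integer arithmetic with
incremental renormalization, since plain floats run out of precision on long messages.

    INTERVALS = {            # symbol -> (Low_Range, High_Range)
        ' ': (0.0, 0.1), 'A': (0.1, 0.2), 'B': (0.2, 0.3),
        'E': (0.3, 0.4), 'G': (0.4, 0.5), 'I': (0.5, 0.6),
        'L': (0.6, 0.8), 'S': (0.8, 0.9), 'T': (0.9, 1.0),
    }

    def encode(message):
        low, high = 0.0, 1.0
        for c in message:
            rng = high - low                 # size of the current interval
            lo_r, hi_r = INTERVALS[c]
            high = low + rng * hi_r          # shrink [low, high[ to the
            low = low + rng * lo_r           # sub-interval of symbol c
        return (low + high) / 2              # any value in [low, high[ works;
                                             # the midpoint is safest to decode

    def decode(number, length):
        out = []
        for _ in range(length):
            # find the symbol whose interval contains the current number
            for sym, (lo_r, hi_r) in INTERVALS.items():
                if lo_r <= number < hi_r:
                    break
            out.append(sym)
            number = (number - lo_r) / (hi_r - lo_r)   # undo one coding step
        return ''.join(out)

    code = encode("BILL GATES")
    print(code)                 # ≈ 0.2572167754, inside the final interval
    print(decode(code, 10))     # BILL GATES

Note that this decoder must know the message length (10 here); in practice a dedicated
end-of-message symbol is used instead.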
Adaptive Codes
The compression methods seen so far use a fixed statistical model to encode individual
symbols. They perform the compression by encoding the symbols into bit strings that use
fewer bits than the original symbols. The quality of the compression therefore depends on
the program's ability to develop a good model; moreover, the model must accurately predict
the probabilities of the symbols, which is not always feasible.
Adaptive codes are more suitable coding algorithms for data streaming, as they adapt to
localized changes in the symbol statistics.
They start with a minimal dictionary of symbol codes (or with none at all) and update it
with each new character.
1) Adaptive Huffman Coding: see the additional document
2) LZW
An improvement of the LZ77 and LZ78 codes, created in 1984 by Terry Welch. This algorithm
assumes the existence of an initial dictionary comprising all the unitary symbols of the
message.
So, it starts with a dictionary of all single characters, indexed from 0, and upgrades the
dictionary as it processes the text: each time a string not yet in the dictionary is read,
a new code is generated. The compression algorithm follows the steps below (a runnable
sketch comes after the list):
1) Read the longest string "x" of consecutive symbols that is present in the dictionary.
2) Write the index of "x" to the output file.
3) Read the symbol "i" that follows "x".
4) Add "xi" to the dictionary.
5) Repeat this algorithm until i is empty (end of the message).
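A minimal Python sketch of these five steps (illustrative, not the notes' own code;
lzw_encode is a hypothetical name). It reproduces the final code of the example that
follows:

    def lzw_encode(message, alphabet):
        # initial dictionary: one entry per unitary symbol
        dictionary = {ch: i for i, ch in enumerate(alphabet)}
        output = []
        x = ""                                    # longest match so far
        for i in message:
            if x + i in dictionary:               # step 1: extend the match
                x += i
            else:
                output.append(dictionary[x])      # step 2: write index of x
                dictionary[x + i] = len(dictionary)   # step 4: add "xi"
                x = i                             # restart the match from i
        if x:
            output.append(dictionary[x])          # flush the last match
        return output

    alphabet = "_aenrstvu"                        # indexes 0..8, as below
    print(lzw_encode("un_ver_vert_va_vers_un_verre_vert", alphabet))
    # [8, 3, 0, 7, 2, 4, 11, 13, 6, 11, 1, 15, 4, 5, 0, 9, 20, 4, 2, 20, 6]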
Example : un_ver_vert_va_vers_un_verre_vert (French: "a green worm goes towards a green glass")
Initial Dictionary
    Symbol   _   a   e   n   r   s   t   v   u
    Index    0   1   2   3   4   5   6   7   8
Coding
    Index    9    10   11   12   13   14   15    16    17   18
    Symbol   un   n_   _v   ve   er   r_   _ve   ert   t_   _va
    Coded    8    3    0    7    2    4    11    13    6    11

    Index    19   20     21   22   23   24    25      26   27   28
    Symbol   a_   _ver   rs   s_   _u   un_   _verr   re   e_   _vert
    Coded    1    15     4    5    0    9     20      4    2    20
Final code: 8 3 0 7 2 4 11 13 6 11 1 15 4 5 0 9 20 4 2 20 6
In the decoding process, we decode using the dictionary, which is rebuilt on the fly: each
new entry is the string corresponding to the preceding code plus the first symbol of the
string corresponding to the current code, as in the sketch below.
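A matching decoder sketch (again illustrative; lzw_decode is a hypothetical name). It
includes the classic special case of a code that is not yet in the dictionary, which does
not occur in this example but can occur in general:

    def lzw_decode(codes, alphabet):
        dictionary = {i: ch for i, ch in enumerate(alphabet)}
        previous = dictionary[codes[0]]
        out = [previous]
        for code in codes[1:]:
            if code in dictionary:
                current = dictionary[code]
            else:
                # code just created by the encoder and not yet known here
                current = previous + previous[0]
            out.append(current)
            # new entry = preceding string + first symbol of current string
            dictionary[len(dictionary)] = previous + current[0]
            previous = current
        return "".join(out)

    codes = [8, 3, 0, 7, 2, 4, 11, 13, 6, 11, 1, 15, 4, 5, 0, 9, 20, 4, 2, 20, 6]
    print(lzw_decode(codes, "_aenrstvu"))
    # un_ver_vert_va_vers_un_verre_vert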