Data Compression Intro
Lecture By
Unit 1 Chapter 1
Preface
Introduction
Data
Need for Compression
Compression Techniques
Lossless and Lossy Compression
Performance Measure
Introduction
The word data comes from the Latin for "something given". In geometry, mathematics, engineering, and so on, the terms given and data are used interchangeably. Data is also a representation of facts, figures, and ideas. In computer science, data are numbers, words, images, etc., accepted as they stand.
Data (Analog)
To
The Principal, College

Respected Sir,
Subject: Need a Heater in Class
(Students: please fill in the remaining details.)
Yours Sincerely,
Faculty
Data (Digital World)
Raw Data => Compression
Need for Compression
Application areas that generate and transmit huge volumes of data:
Weather forecasting
Internet data
Broadband services
Planning of cities
COMPRESSION TECHNIQUES
Loss-less compression
Compressed data can be reconstructed back to the exact original data itself.
Lossy Compression
The reconstruction differs from the original data; some information is lost.
Loss-less Compression
Main Disadvantage? The amount of compression that can be achieved is relatively modest.
Lossy Compression
Disadvantage: Data that have been compressed using lossy techniques generally cannot be recovered or reconstructed exactly (some loss of information is involved).
Advantage: Much higher compression ratios than lossless techniques.
Areas: compression of data that originate as analog signals, such as speech, audio, images, and video.
MEASURE OF PERFORMANCE
Based on:
1. Relative complexity of the algorithm.
2. Memory required to implement the algorithm.
3. How fast the algorithm performs on a given machine.
(These are secondary measures.)
Compression Ratio
The most widely used measure of how much a data set has been compressed is the compression ratio: the ratio of the number of bits required to represent the data before compression to the number of bits required to represent the data after compression.
Example: Suppose storing an image made up of a square array of 256 X 256 pixels requires 65,536 bytes. The image is compressed and the compressed version requires 16,384 bytes. Compression Ratio = 65,536 : 16,384 = 4:1.
Another Measure
Rate: the average number of bits required to represent a single sample.
Consider the last example: the 256 X 256 original image occupies 65,536 bytes, so each pixel takes 1 byte, i.e. 8 bits per pixel (sample). The compressed image occupies 16,384 bytes. How many bits does each pixel now take? 16,384 x 8 / 65,536 = 2, so the rate is 2 bits/pixel.
Are the above two measures sufficient for lossy compression?
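These two measures are simple arithmetic. A minimal Python sketch, using the 256 X 256 image figures from the example above:

    # Compression ratio and rate for the 256 x 256 image example above.
    original_bytes = 65536      # 256 * 256 pixels, 1 byte per pixel
    compressed_bytes = 16384

    compression_ratio = original_bytes / compressed_bytes        # 4.0, i.e. 4:1
    rate_bits_per_pixel = compressed_bytes * 8 / (256 * 256)     # 2.0 bits/pixel

    print(f"Compression ratio = {compression_ratio:.0f}:1")
    print(f"Rate = {rate_bits_per_pixel} bits/pixel")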
Distortion
In lossy compression, the reconstruction differs from the original data.
In order to determine the efficiency of a compression algorithm, we have to quantify/measure the difference. The difference between the original and the reconstruction is called the distortion.
Lossy techniques are generally used for the compression of data that originate as analog signals, such as speech and video.
For speech and video, the final arbiter/judge of quality is the human observer (perceptual evaluation).
Because human responses are difficult to model mathematically, many approximate measures of distortion are used to determine the fidelity/quality of the reconstructed waveforms.
For example, a technique that works well for the compression of text may not work well for compressing images.
The best approach for a given application, largely depends on the redundancies inherent in the data.
An approach may work for one kind of data but not for another kind (for example, a landscape or a group photo).
The development of data compression algorithms for a variety of data can be divided into two phases.
Modeling: Extract information about any redundancy present in the data and model it.
Coding: A description of the model and a "description" of how the data differ from the model are encoded, generally using a binary alphabet. The difference between the data and the model is often referred to as the residual.
Example 1
Q. Consider the following sequence of numbers X = {x1, x2, x3, ...}: 9 11 11 11 14 13 15 17 16 17 20 21. How many bits are required to store or transmit every sample?
Ans. Since the largest value is 21, coding the samples directly needs 5 bits per sample. The data are well described by a straight-line model in which the nth sample is approximately n + 8; the residual en = xn - (n + 8) is
0 1 0 -1 1 -1 0 1 -1 -1 1 1
Example 1
The residual sequence consists of only three numbers {-1, 0, 1}. Assigning the code 00 to -1, 01 to 0 and 10 to 1, we need only 2 bits to represent each element of the residual sequence. Therefore, we can obtain compression by transmitting the parameters of the model and the residual sequence. The scheme is lossy if only the model is transmitted, and lossless if both the model parameters and the residual/difference are transmitted.
Q. Model the given data for compression: { 6 8 10 10 12 11 12 15 16 }
{ 5 6 9 10 11 13 17 19 20}
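A minimal Python sketch of the modeling step in Example 1, assuming the straight-line model n + 8 described above (the two practice sequences can be explored by changing the data and the model):

    # Example 1: model the data with a straight line and encode the residual.
    data = [9, 11, 11, 11, 14, 13, 15, 17, 16, 17, 20, 21]

    def model(n):                                # model value for sample n (n starts at 1)
        return n + 8

    residual = [x - model(n) for n, x in enumerate(data, start=1)]
    print("residual:", residual)                 # values are only -1, 0, 1

    # Raw data needs 5 bits/sample (values up to 21); the residual needs
    # only 2 bits/sample with the fixed code 00 -> -1, 01 -> 0, 10 -> 1.
    code = {-1: "00", 0: "01", 1: "10"}
    encoded = "".join(code[r] for r in residual)
    print(len(encoded), "bits for the residual vs", 5 * len(data), "bits raw")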
Example 2
Q. Find the structure present in this data sequence: 27 28 29 28 26 27 29 28 30 32 34 36 38
I. At first glance, no obvious structure is found.
II. So check for the closeness of neighbouring values.
III. Send the first value, then the residue (differences between neighbours): 27, followed by 1 1 -1 -2 1 2 -1 2 2 2 2 2.
IV. Are the bits/sample reduced?
V. The decoder adds each received difference to the previously decoded value to reconstruct the original sequence.
Note
Techniques that use the past values of a sequence to predict the current value and then encode the error in prediction, or residual, are called predictive coding schemes. Note: Assuming both encoder and decoder know the model being used, we would still have to send the value of the first element of the sequence.
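A minimal sketch of such a predictive (differencing) scheme, applied to the sequence of Example 2; the function names are illustrative only:

    # Predictive coding: send the first value, then the differences (residuals).
    data = [27, 28, 29, 28, 26, 27, 29, 28, 30, 32, 34, 36, 38]

    def encode(seq):
        first = seq[0]
        residuals = [b - a for a, b in zip(seq, seq[1:])]
        return first, residuals

    def decode(first, residuals):
        out = [first]
        for r in residuals:                     # add each residual to the
            out.append(out[-1] + r)             # previously decoded value
        return out

    first, residuals = encode(data)
    print(residuals)                            # 1 1 -1 -2 1 2 -1 2 2 2 2 2
    assert decode(first, residuals) == data     # lossless reconstruction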
Example 3
Suppose we have the following sequence:
Note
There will be situations in which it is easier to take advantage of the structure if we decompose the data into a number of components. We can then study each component separately and use a model appropriate to that component. There are a number of different ways to characterize data. Different characterizations will lead to different compression schemes. We can compress something with products from one vendor and reconstruct it using the products of a different vendor. International standards organizations have standards for various compression applications.
Overview
This chapter deals with the mathematical framework behind lossless compression schemes, starting with basic concepts from information theory and probability. Based on these mathematical concepts, we then look at how to model the data.
The self-information of an event A with probability P(A) is i(A) = -log P(A) (with logarithm base 2, the information is measured in bits). Suppose A and B are two independent events. The self-information associated with the occurrence of both events A and B is
i(AB) = -log P(AB) = -log [P(A)P(B)] = i(A) + i(B)
Entropy
Suppose we have independent events Ai, which are the outcomes of some experiment S.
Then the average self-information associated with the random experiment is
H = Σ P(Ai) i(Ai) = -Σ P(Ai) log P(Ai)
This quantity is called the entropy of the experiment.
Note
Most of the sources considered in this subject are independent and identically distributed (iid). The entropy equation above equals the entropy of the source only if the sequence is iid.
Theorem: Shannon showed that the best (lowest) average number of bits per symbol that a lossless compression scheme can achieve is equal to the entropy of the source.
The estimate of the entropy depends on our assumptions about the structure of the source sequence.
Example 4
Q. Consider the sequence 1 2 3 2 3 4 5 4 5 6 7 8 9 8 9 10 The probability of occurrence of each element is
P(1) = P(6) = P(7) = P(10) = 1/16
P(2) = P(3) = P(4) = P(5) = P(8) = P(9) = 2/16
Assuming the sequence is iid, the first-order entropy for this sequence is
H = -Σ P(i) log2 P(i) = 4 x (1/16) x 4 + 6 x (2/16) x 3 = 3.25 bits/sample
Hence, according to Shannon, the best a lossless scheme can do for this sequence (under the iid assumption) is 3.25 bits/sample.
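The first-order entropy above can be checked with a short Python sketch, assuming the iid model:

    # First-order entropy of the Example 4 sequence under the iid assumption.
    from collections import Counter
    from math import log2

    seq = [1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 8, 9, 10]
    counts = Counter(seq)
    probs = [c / len(seq) for c in counts.values()]

    entropy = -sum(p * log2(p) for p in probs)
    print(f"H = {entropy:.2f} bits/sample")     # 3.25 bits/sample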
Example 4 (continued)
If we exploit the structure by modelling each sample as the previous sample plus a residual, xn = xn-1 + rn, the residual sequence rn is 1 1 1 -1 1 1 1 -1 1 1 1 1 1 -1 1 1, with P(1) = 13/16 and P(-1) = 3/16, giving an entropy of about 0.70 bits/sample.
Note
If the parameter rn does not change with n, the model is called a static model. A model whose parameters change or adapt with n to the changing characteristics of the data is called an adaptive model. Basically, we see that knowing something about the structure of the data can help to "reduce the entropy."
Structure
Consider the following sequence:
1 2 1 2 3 3 3 3 1 2 3 3 3 3 1 2 3 3 1 2
Obviously, there is some structure to this data. However, if we look at it one symbol at a time, the structure is difficult to extract. Consider the probabilities: P(1) = P(2) = 1/4 and P(3) = 1/2. The entropy is 1.5 bits/symbol. This particular sequence consists of 20 symbols; therefore, the total number of bits required to represent this sequence is 30.
Now let's take the same sequence and look at it in blocks of two. Obviously, there are only two block symbols, 1 2 and 3 3. The probabilities are P(1 2) = 1/2 and P(3 3) = 1/2, and the entropy is 1 bit/block. As there are 10 such blocks in the sequence, we need a total of 10 bits to represent the entire sequence, a reduction by a factor of three.
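A short sketch that verifies the effect of looking at the data in blocks of two for the sequence above:

    # Entropy of the structured sequence: one symbol at a time vs blocks of two.
    from collections import Counter
    from math import log2

    def entropy(symbols):
        counts = Counter(symbols)
        n = len(symbols)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    seq = [1, 2, 1, 2, 3, 3, 3, 3, 1, 2, 3, 3, 3, 3, 1, 2, 3, 3, 1, 2]
    blocks = list(zip(seq[0::2], seq[1::2]))        # (1,2), (1,2), (3,3), ...

    print(entropy(seq) * len(seq), "bits")          # 1.5 * 20 = 30 bits
    print(entropy(blocks) * len(blocks), "bits")    # 1.0 * 10 = 10 bits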
Models
Good models for sources lead to more efficient compression algorithms. In general, in order to develop techniques that manipulate data using mathematical operations, we need to have a mathematical model for the data. There are several approaches to building such a model.
Physical Model
If we know something about the physics of the data generation process, we can use that information to construct a model. For example, In speech-related applications, knowledge about the physics of speech production can be used to construct a mathematical model for the sampled speech process. Sampled speech can then be encoded using this model. If residential electrical meter readings at hourly intervals were to be coded, knowledge about the living habits of the populace could be used to determine when electricity usage would be high and when the usage would be low. Then instead of the actual readings, the difference (residual) between the actual readings and those predicted by the model could be coded.
Physical Model
Disadvantages
In general, however, the physics of data generation is simply too complicated to understand, let alone use to develop a model. When the physics of the problem is too complicated, we build a model based on empirical observation of the statistics of the data.
Probability Model
The simplest mathematical model for the source is to assume that all the events are independent and identically distributed (iid). Because this assumes no knowledge of the data, it is called the ignorance model.
Probability Model
Next, if we also discard the assumption of independence, we can arrive at better compression schemes, but we then have to describe how the elements of the data sequence depend on each other.
One of the most popular ways of representing dependence in the data is through the use of Markov models, named after the Russian mathematician Andrei Andreyevich Markov (1856-1922).
Markov Models
For models used in loss-less compression, we use a specific type of Markov process called a discrete time Markov chain. A sequence {Xn} is said to follow a kth-order Markov model if
P(Xn | Xn-1, ..., Xn-k) = P(Xn | Xn-1, ..., Xn-k, ...)
That is, knowledge of the past k symbols is equivalent to knowledge of the entire past history of the process. The values taken on by the set {Xn-1, ..., Xn-k} are called the states of the process.
Markov Models
The most commonly used Markov model is the first-order Markov model, for which
P(Xn | Xn-1) = P(Xn | Xn-1, Xn-2, Xn-3, ...)
Markov chain property: the probability of each subsequent state depends only on the previous state. The above equations indicate the existence of dependence between samples, but they do not describe the form of the dependence. We can develop different first-order Markov models depending on our assumption about the form of the dependence between samples.
To define Markov model, the following probabilities have to be specified: transition probabilities P{X2|X1} and initial probabilities P{X1}.
Markov Models
If we assume that the dependence is introduced in a linear manner, we can view the data sequence as the output of a linear filter driven by white noise. The output of such a filter can be given by the difference equation
Xn = Σ(i=1..N) ai Xn-i + Σ(j=1..M) bj En-j + En
where {En} is the white noise. This model is often used when developing coding algorithms for speech and images.
Markov Model
The entropy of a finite state process with states Si is simply the average value of the entropy at each state:
H = Σi P(Si) H(Si)
(State transition diagram: two states, Rain and Dry, with the transition probabilities listed below.)
Two states : Rain and Dry. Transition probabilities: P(Rain|Rain)=0.3 , P(Dry|Rain)=0.7 , P(Rain|Dry)=0.2, P(Dry|Dry)=0.8 Initial probabilities: say P(Rain)=0.4 , P(Dry)=0.6 .
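A small sketch of the finite-state entropy formula given earlier, applied to this two-state model; it assumes the stated initial probabilities (0.4/0.6) are used as the state probabilities:

    # Entropy of the two-state (Rain/Dry) Markov model:
    # H = sum over states of P(state) * H(transition probabilities from state).
    from math import log2

    P_state = {"Rain": 0.4, "Dry": 0.6}                  # state probabilities
    P_trans = {"Rain": [0.3, 0.7],                       # P(Rain|Rain), P(Dry|Rain)
               "Dry":  [0.2, 0.8]}                       # P(Rain|Dry),  P(Dry|Dry)

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    H_model = sum(P_state[s] * H(P_trans[s]) for s in P_state)
    print(f"H = {H_model:.3f} bits/symbol")              # ~0.79 bits/symbol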
In English text, the probability of the next letter is heavily influenced by the preceding letters. In the current text compression literature, kth-order Markov models are more widely known as finite context models, with the word context being used for what we earlier called the state. Example:
Consider the word preceding. Suppose we have already processed precedin and are going to encode the next letter. If we take no account of the context and treat each letter as a surprise, the probability of the letter g occurring is relatively low.
Example
If we use a first-order Markov model (i.e. we look at the probability of the next letter given the single-letter context n), we can see that the probability of g would increase substantially. As we increase the context size (go from n to in to din and so on), the probability distribution becomes more and more skewed towards g and the entropy decreases.
Shannon used a second-order model for English text consisting of the 26 letters and one space to obtain an entropy of 3.1 bits/letter . Using a model where the output symbols were words rather than letters brought down the entropy to 2.4 bits/letter. Note: The longer the context, the better its predictive value.
Coding
Coding: Assignment of binary sequences (0s or 1s) to elements or symbols.
The set of binary sequences is called a code, and the individual members of the set are called codewords.
Code: e.g. 100101100110010101. Codewords: e.g. a -> 001, b -> 010.
An alphabet is a collection of symbols called letters. For example, the alphabet used in writing most books consists of the 26 lowercase letters, 26 uppercase letters, and a variety of punctuation marks. In the terminology used here, a comma is a letter.
The ASCII code for the letter a is 1000011, the letter A is coded as 1000001, and the letter "," is coded as 0011010. Notice that the ASCII code uses the same number of bits to represent each symbol. Such a code is called a fixed-length code.
Coding
To reduce the number of bits required to represent different messages, we can use a different number of bits for different symbols: if we use fewer bits for the symbols that occur more often, then on average we use fewer bits per symbol. The average number of bits per symbol is often called the rate of the code. Examples: Morse code, Huffman codes. In Morse code, the codewords for letters that occur more frequently are shorter than those for letters that occur less frequently; for example, the codeword for E is a single dot, while the codeword for Z is dash-dash-dot-dot.
The average length of the code is not the only criterion for a good code.
Example Suppose our source alphabet consists of four letters a1, a2, a3 & a4 with probabilities P(a1) = 1/2 , P(a2) = 1/4, P(a3) = P(a4) = 1/8. The entropy for this source is 1.75 bits/symbol.
The average length of a code is l = Σ P(ai) n(ai), where n(ai) is the number of bits in the codeword for letter ai; the average length is given in bits/symbol.
From the table of candidate codes, Code 1 appears to be the best with respect to average length. However, a code must also be able to transfer information in an unambiguous way.
Code 1
Both a1 and a2 have been assigned the codeword 0. When a 0 is received, there is no way to know whether an a1 was transmitted or an a2. Hence we would like each symbol to be assigned a unique codeword.
Code 2 seems to have no problem with ambiguity: each symbol has a distinct codeword. However, suppose we encode {a2 a1 a1}; the binary string would be 100. But 100 can also be decoded as {a2 a3}, so the original sequence cannot be recovered with certainty. The code is not uniquely decodable (not desirable).
Code 3
Is Code 3 uniquely decodable?
Notice that the first three codewords all end in a 0. In fact, a 0 always denotes the termination of a codeword.
The final codeword, 111, contains no 0s and is 3 bits long. Because all other codewords have fewer than three 1s and terminate in a 0, the only way we can get three 1s in a row is as the code for a4. The decoding rule is simple: accumulate bits until you get a 0 or until you have three 1s. There is no ambiguity in this rule, and it is reasonably easy to see that this code is uniquely decodable.
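A short sketch using Code 3, assuming the usual textbook codeword assignment 0, 10, 110, 111 for a1..a4 (consistent with the description above): its average length equals the 1.75 bits/symbol entropy, and the accumulate-until-a-0-or-three-1s rule decodes without ambiguity.

    # Code 3 (assumed codewords): a1 -> 0, a2 -> 10, a3 -> 110, a4 -> 111.
    probs = {"a1": 0.5, "a2": 0.25, "a3": 0.125, "a4": 0.125}
    code3 = {"a1": "0", "a2": "10", "a3": "110", "a4": "111"}

    avg_len = sum(probs[a] * len(cw) for a, cw in code3.items())
    print(avg_len, "bits/symbol")                 # 1.75, equal to the entropy

    def decode(bits):
        inverse = {cw: a for a, cw in code3.items()}
        out, buf = [], ""
        for b in bits:                            # accumulate bits until we see a 0
            buf += b                              # or until we have three 1s
            if b == "0" or buf == "111":
                out.append(inverse[buf])
                buf = ""
        return out

    print(decode("0101101110"))                   # ['a1', 'a2', 'a3', 'a4', 'a1']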
Code 4
The difference between Code 3 and Code 4 is that with Code 3 the decoder knows the moment a codeword is complete, whereas with Code 4 we have to wait for the beginning of the next codeword before we know that the current codeword is complete. Because of this property, Code 3 is called an instantaneous code, while Code 4 is only "almost" instantaneous.
Q). Is a code like Code 5 (a1 -> 0, a2 -> 01, a3 -> 11) uniquely decodable? Try decoding the string 011111111111111111.
If we assume the first codeword is a1 (0), then after decoding eight more codewords as a3's (11), we are left with a single (dangling) 1. If we instead assume the first codeword is a2 (01), we can decode the remaining 16 bits as eight a3's.
Only the second assumption decodes the string completely, so the string can be uniquely decoded. In fact, Code 5, while it is certainly not instantaneous, is uniquely decodable.
Even by looking at these small codes, it is not immediately evident whether a code is uniquely decodable or not. For larger codes we need a systematic test based on dangling suffixes: whenever one codeword is a prefix of another, the remaining part of the longer codeword is called a dangling suffix, and we repeatedly compare the codewords with the dangling suffixes generated so far.
Example 1
Consider Code 5 with codewords {0, 01, 11}. The codeword 0 is a prefix of the codeword 01, giving the dangling suffix 1; there are no other pairs for which one element of the pair is the prefix of the other. Let us augment the codeword list with the dangling suffix: {0, 01, 11, 1}. Comparing the elements of this list, we again find that 0 is a prefix of 01 with a dangling suffix of 1, but we have already included 1 in our list.
Also, 1 is a prefix of 11. This gives us a dangling suffix of 1, which is already in the list.
Example 1
There are no other pairs that would generate a dangling suffix, so we cannot augment the list any further. Therefore, Code 5 is uniquely decodable.
Example 2
Consider Code 6 with codewords {0, 01, 10}.
The codeword 0 is a prefix for the codeword 01. The dangling suffix is 1. There are no other pairs for which one element of the pair is the prefix of the other. Augmenting the codeword list with 1, we obtain the list
{0,01,10,1}
Example 2
In this list, 1 is a prefix for 10. The dangling suffix for this pair is 0, which is the codeword for a1. Therefore, Code 6 is not uniquely decodable.
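A minimal sketch of this dangling-suffix test (the Sardinas-Patterson procedure), applied to Code 5 and Code 6 from the examples above; the function name is illustrative:

    # Unique decodability test: keep generating dangling suffixes; the code is
    # not uniquely decodable if some dangling suffix equals a codeword.
    def uniquely_decodable(codewords):
        codewords = set(codewords)

        def suffixes(a, b):              # dangling suffix when a is a proper prefix of b
            return {b[len(a):]} if b.startswith(a) and a != b else set()

        dangling = set()
        for a in codewords:
            for b in codewords:
                dangling |= suffixes(a, b)
        while True:
            if dangling & codewords:     # a dangling suffix is a codeword
                return False
            new = set()
            for d in dangling:
                for c in codewords:
                    new |= suffixes(d, c) | suffixes(c, d)
            if new <= dangling:          # no new suffixes: uniquely decodable
                return True
            dangling |= new

    print(uniquely_decodable(["0", "01", "11"]))   # Code 5 -> True
    print(uniquely_decodable(["0", "01", "10"]))   # Code 6 -> False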
Prefix Codes
The test for unique decodability requires examining the dangling suffixes: if a dangling suffix is itself a codeword, then the code is not uniquely decodable.
One type of code in which we never face the possibility of a dangling suffix being a codeword is a code in which no codeword is a prefix of another.
A code in which no codeword is a prefix to another codeword is called a prefix code. A simple way to check if a code is a prefix code is to draw the rooted binary tree corresponding to the code.
Prefix Codes
Draw a tree that starts from a single node (the root node) and has up to two branches at each node; one branch corresponds to a 0 and the other to a 1. The convention followed here is that the root node is at the top, the left branch corresponds to 0 and the right branch corresponds to 1. Using this convention, draw the binary trees for Codes 2, 3 and 4.
Prefix Codes
Note that apart from the root node, the trees have two kinds of nodes:
1. Internal nodes (nodes that give rise to other nodes), and
2. External nodes (nodes that do not give rise to other nodes), also called leaves.
In a prefix code, the codewords are associated only with the external nodes. A code that is not a prefix code, such as Code 4, will have codewords associated with internal nodes (each shorter codeword lies on the path from the root to a longer one).
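A sketch that builds the binary code tree (as a nested dictionary rather than a drawing) and checks whether every codeword ends on an external node, i.e. whether the code is a prefix code. The codeword assignments used below for Code 3 (0, 10, 110, 111) and Code 4 (0, 01, 011, 0111) are the usual textbook assignments assumed to match the descriptions above.

    # Prefix-code check: no codeword may be a prefix of another codeword,
    # i.e. in the binary code tree every codeword must end on a leaf.
    def is_prefix_code(codewords):
        tree = {}
        for cw in codewords:
            node = tree
            for bit in cw:
                if node.get("end"):              # passed through a shorter codeword
                    return False
                node = node.setdefault(bit, {})
            if node:                             # this codeword is an internal node
                return False
            node["end"] = True
        return True

    print(is_prefix_code(["0", "10", "110", "111"]))   # Code 3 -> True
    print(is_prefix_code(["0", "01", "011", "0111"]))  # Code 4 -> False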
Kolmogorov Complexity
The Kolmogorov complexity K(x) of a sequence x is the size of the smallest program needed to generate x.
The size includes all the inputs needed by the program.
If x was a sequence of all ones, a highly compressible sequence, the program would simply be a print statement in a loop.
At the other extreme, if x were a random sequence with no structure, then the only program that could generate it would contain the sequence itself, and the size of the program would be slightly larger than the sequence. Thus there is a clear correspondence between the size of the smallest program that can generate a sequence and the amount of compression that can be obtained. However, this lower bound cannot be computed in general, so it is not used in practice.
Huffman Coding
The Huffman coding algorithm was developed by David Huffman as part of a class assignment in an information theory course taught by Robert Fano at MIT. These codes are prefix codes and are optimum for a given model (set of probabilities).
The Huffman procedure is based on two observations regarding optimum prefix codes.
1. In an optimum code, symbols that occur more frequently (have a higher probability of occurrence) will have shorter codewords than symbols that occur less frequently. 2. In an optimum code, the two symbols that occur least frequently will have codewords of the same length.
Let us design a Huffman code for a source that puts out letters from an alphabet A = {a1, a2, a3, a4, a5} with P(a1) = P(a3) = 0.2, P(a2) = 0.4, and P(a4) = P(a5) = 0.1.
First find the first-order entropy: H = -Σ P(ai) log2 P(ai) = 2.122 bits/symbol.
Step 1: Sort the letters in descending order of probability.
Average length l = 0.4 x 1 + 0.2 x 2 + 0.2 x 3 + 0.1 x 4 + 0.1 x 4 = 2.2 bits/symbol.
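A minimal Huffman construction in Python for the five-letter source above. Ties between equal probabilities may be broken differently from the hand-worked design, so the individual codewords can differ, but the average length still comes out to 2.2 bits/symbol:

    # Huffman code construction: repeatedly merge the two least probable symbols.
    import heapq

    def huffman(probs):
        # heap items: (probability, tie-breaking counter, {symbol: partial codeword})
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, c1 = heapq.heappop(heap)
            p2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + cw for s, cw in c1.items()}
            merged.update({s: "1" + cw for s, cw in c2.items()})
            heapq.heappush(heap, (p1 + p2, counter, merged))
            counter += 1
        return heap[0][2]

    probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
    code = huffman(probs)
    avg = sum(probs[s] * len(cw) for s, cw in code.items())
    print(code)
    print(f"average length = {avg:.1f} bits/symbol")    # 2.2 bits/symbol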
Example 2
Suppose the source generates 10,000 symbols per second and the channel can carry 22,000 bits per second (matching the average rate of 2.2 bits/symbol). Two Huffman codes can have the same average length but different variances of codeword length. With the first code (codeword lengths 1, 2, 3, 4, 4), a long string of a4's and a5's would generate 40,000 bits per second, so the buffer would have to absorb 18,000 bits for every such second. However, the buffer has to be of finite size, and the greater the variance in the codeword lengths, the more difficult the buffer design becomes.
On the other hand, if we use the second code (codeword lengths 2, 2, 2, 3, 3), we would be generating 30,000 bits per second, and the buffer would have to store only 8,000 bits for every second. If instead we have a string of a2's rather than a4's and a5's, the first code would generate only 10,000 bits per second, a deficit of 12,000 bits per second relative to the channel.
The second code would lead to a deficit of only 2,000 bits per second. So which code do we select? The code with the smaller variance of codeword lengths (the second code) puts less strain on the buffer and is generally preferred.
Audio Compression
Monochrome Image
Each pixel takes a value in the range 0-255 (8 bits/pixel).
Monochrome Images
The original (uncompressed) test images are represented using 8 bits/pixel. Each image consists of 256 rows of 256 pixels, so the uncompressed representation uses 65,536 bytes.
Image Compression
From a visual inspection of the test images, we can clearly see that the pixels in an image are heavily correlated with their neighbours. We could capture this structure with the crude model Xn = Xn-1; the residual would then be the difference between neighbouring pixels (see the sketch below).
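A small sketch of this idea on a synthetic scan line (a stand-in for a real test image, which would be loaded instead): compare the first-order entropy of the pixel values with that of the neighbouring-pixel differences.

    # Crude image model Xn = Xn-1: encode pixel differences instead of pixels.
    from collections import Counter
    from math import log2

    def entropy(samples):
        counts = Counter(samples)
        n = len(samples)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    # Synthetic, smoothly varying "scan line" standing in for real image data.
    row = [100 + i // 4 + i % 3 for i in range(256)]
    residual = [b - a for a, b in zip(row, row[1:])]

    print(f"pixels:    {entropy(row):.2f} bits/pixel")
    print(f"residuals: {entropy(residual):.2f} bits/pixel")   # noticeably smaller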
We encoded the earlier version of this chapter using Huffman codes that were created using the probabilities of occurrence obtained from the chapter. The file size dropped from about 70,000 bytes to about 43,000 bytes with Huffman coding.
Audio Compression