Supplementary Notes On Compression and Formats
Storing, managing, and transferring data efficiently is essential in data communication and
other data-driven solutions. No matter how far computer hardware (RAM, ROM, GPU) and
communication channels (such as the internet) advance, these resources remain limited.
To use these resources efficiently, data is often compressed, i.e., reduced to a smaller size
while losing no information or only a minimal amount of it.
Many kinds of data can be compressed, including numbers, text, video, images, audio, and even
programs and software. The amount of reduction is usually expressed as a ratio. A ratio of 2:1,
for example, means that a data file of 100 MB takes up only 50 MB of disk space after
compression. This compression, also known as compaction, is performed through various
compression techniques.
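The 2:1 arithmetic above can be checked with a short sketch. The function names `compression_ratio` and `space_saving_percent` are chosen here for illustration and do not come from any library.

```python
def compression_ratio(original_size: float, compressed_size: float) -> float:
    """Return original : compressed as a single number (2.0 means 2:1)."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


def space_saving_percent(original_size: float, compressed_size: float) -> float:
    """Return the percentage of the original size that compression removed."""
    return (1 - compressed_size / original_size) * 100


# The example from the text: a 100 MB file that occupies 50 MB after compression.
ratio = compression_ratio(100, 50)
saving = space_saving_percent(100, 50)
print(f"{ratio:.0f}:1 ratio, {saving:.0f}% saved")  # prints "2:1 ratio, 50% saved"
```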
Data compression techniques in digital communication are the formulas and carefully designed
algorithms that a compression program uses to reduce the size of various kinds of data. We will
get into particular techniques later, but to gain an overall understanding we can first focus
on the principles.
Data compression can be performed by using shorter strings of bits (0s and 1s) in place of the
original strings and keeping a ‘dictionary’ that maps each short string back to the original,
so that the data can be decompressed when required. Other techniques include replacing a string
of bits that the compression program has already seen with a pointer (reference) to its earlier
occurrence, or removing redundant characters.
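As a rough illustration of these principles, the sketch below uses Python's standard `zlib` module, which implements the DEFLATE algorithm (references to earlier byte sequences combined with Huffman coding). The sample text is invented for this example, and the exact compressed size may vary between zlib versions.

```python
import zlib

# Highly repetitive sample data (invented for this sketch).
original = ("because " * 500).encode("ascii")

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes ->", len(compressed), "bytes")
print(restored == original)  # True: nothing was lost
```

Because the input repeats the same eight bytes over and over, almost all of it can be expressed as references to earlier data, which is why the compressed form is so much smaller.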
For a video, a crude form of compression is to skip every 3rd frame, as this (as one can
imagine) reduces the size of the file by about one third. Such techniques can reduce data size
dramatically, in some cases by 70% or more, without losing any significant information.
Compression formats like ZIP, GZIP, etc., are used when transferring data via the internet.
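The frame-skipping idea can be sketched as follows. This is only a toy model in which frames are stand-in integers and every frame is assumed to occupy the same amount of space; real video codecs use far more elaborate methods. The name `drop_every_third` is hypothetical.

```python
def drop_every_third(frames: list) -> list:
    """Keep frames 1, 2, 4, 5, 7, 8, ... and discard frames 3, 6, 9, ..."""
    return [frame for index, frame in enumerate(frames, start=1) if index % 3 != 0]


frames = list(range(1, 10))        # nine stand-in frames numbered 1..9
kept = drop_every_third(frames)

print(kept)                        # [1, 2, 4, 5, 7, 8]
print(len(kept) / len(frames))     # two thirds of the frames remain
```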
The use of data compression techniques in digital communication greatly helps in reducing the
time for a file transfer, the cost of storage, and traffic in the network.
Compression techniques fall into two broad categories:
1. Lossy
2. Lossless
Lossy compression
To understand the lossy compression technique, we must first understand the difference
between data and information. Data is a raw, often unorganized collection of facts or values
and can mean numbers, text, symbols, etc. On the other hand, Information brings context by
carefully organizing the facts.
To put this in context, a black and white (grayscale) image of 4×6 inches at 100 dpi (dots per
inch) has 400 × 600 = 240,000 pixels. Each of these pixels contains data in the form of a
number between 0 and 255, representing pixel intensity (0 being black and 255 being white).
Example: The image as a whole carries some information, for instance that it is a picture of
Abraham Lincoln, the 16th president of the USA. If we display the image at 50 dpi, i.e., with
200 × 300 = 60,000 pixels, the data required to save the image is reduced, and perhaps the
quality too, but the information remains intact. Only after a considerable loss of data do we
lose the information. Below is an explanation of how it works.
With the above understanding of the difference between data and information, we now can
comprehend Lossy compression. As the name suggests, Lossy compression loses data, i.e., gets
rid of it to reduce the size of the data.
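The pixel counts from the image example, and one very simple way of discarding data, can be sketched as below. Averaging 2×2 blocks is an assumption made for this illustration and is not the method of any particular image format. The helper names are hypothetical.

```python
def pixel_count(width_in: int, height_in: int, dpi: int) -> int:
    """Number of pixels in an image of the given size (inches) and resolution."""
    return (width_in * dpi) * (height_in * dpi)


def downsample_2x(image: list[list[int]]) -> list[list[int]]:
    """Halve the resolution by averaging each 2x2 block of grayscale values (0-255)."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) // 4
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


print(pixel_count(4, 6, 100))  # 240000
print(pixel_count(4, 6, 50))   # 60000

tiny = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [10, 20, 200, 210],
    [30, 40, 220, 230],
]
print(downsample_2x(tiny))     # [[0, 255], [25, 215]]
```

The downsampled grid holds a quarter of the values, and the original sixteen values cannot be recovered from it, which is what makes the step lossy.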
Advantages of lossy compression
It is relatively quick, can reduce the file size dramatically, and the user can select the
compression level.
It is beneficial for compressing data like images, video, and even audio by taking advantage
of the limitations of the human senses. Our eyes and ears cannot perceive a difference in the
quality of an image or audio before a certain point.
Disadvantages of lossy compression
Data compressed through lossy compression will not return the same data (in terms of quality,
size, etc.). Still, it will hold similar information (this, in fact, is useful in some
instances, such as streaming or downloading content on the internet).
On the flip side, constant downloading and uploading of a file can compress it repeatedly and
consequently distort it beyond the point of recognition, causing permanent information loss.
Similarly, if the user chooses a severe level of compression, the output file might not be
anywhere close to the original input file.
Lossless Compression
Lossless compression, unlike lossy compression, doesn’t remove any data; instead, it
transforms it to reduce its size. To understand the concept, we can take a simple example.
Suppose a piece of text repeats the word ‘because’ quite often. The word consists of seven
letters, and by using a shorthand or abbreviated version of it like ‘bcz’, we can transform
the text into a shorter one. The rule that ‘bcz’ stands for ‘because’ is stored in a
dictionary for later use (during decompression).
While lossy compression removes redundant or unnoticeable pieces of data to reduce the size,
lossless compression re-encodes the data using some formula or logic that can be reversed
exactly.
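A minimal sketch of the ‘because’ to ‘bcz’ substitution follows. It assumes that words are separated by single spaces and that the shorthand does not already occur as a word in the text, since otherwise decompression could not tell the two apart. The function names and sample sentence are invented for this example.

```python
def compress(text: str, word: str, short: str) -> str:
    """Replace every occurrence of `word` (as a whole token) with `short`."""
    tokens = text.split(" ")
    if short in tokens:
        raise ValueError("shorthand already occurs as a word; pick another one")
    return " ".join(short if token == word else token for token in tokens)


def decompress(text: str, word: str, short: str) -> str:
    """Reverse compress() by expanding `short` back to `word`."""
    return " ".join(word if token == short else token for token in text.split(" "))


original = "I stayed in because it rained and because I was tired"
packed = compress(original, "because", "bcz")

print(packed)                                          # I stayed in bcz it rained and bcz I was tired
print(len(original) - len(packed))                     # 8 characters saved
print(decompress(packed, "because", "bcz") == original)  # True
```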
Advantages of lossless compression
There are types of data where lossy compression is not feasible. For example, in a
spreadsheet, software, program, or any data comprised of factual text or numbers, lossy
compression cannot work, as every number might be essential and cannot be considered
redundant; any reduction will immediately cause loss of information.
Here lossless compression becomes crucial as, upon decompression, the file can be restored to
its original state without losing any data.
Disadvantages of lossless compression
There is a limit to data compression. If data is already compressed, then compressing it again
will result in little to no reduction in its size.
Less effective against larger file sizes.
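The limit on repeated compression can be observed with a small experiment. This sketch uses Python's standard `zlib` module on pseudo-random text built from an invented word list. The exact byte counts depend on the zlib version, so none are quoted here.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
words = ["data", "bits", "code", "tree", "file", "size", "loss", "text"]
original = " ".join(rng.choices(words, k=5000)).encode("ascii")

once = zlib.compress(original, 9)   # first pass removes most of the redundancy
twice = zlib.compress(once, 9)      # second pass has almost nothing left to remove

print(len(original), len(once), len(twice))
print(zlib.decompress(zlib.decompress(twice)) == original)  # True
```

The second pass typically leaves the size about the same or makes it slightly larger, because the output of the first pass already looks close to random.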
There are several advantages to using the data compression techniques discussed above. Even
so, there is a trade-off, as a cost is always associated with compressing a file, and this
cost results in certain disadvantages. The advantages and disadvantages of compression are
listed below.
Advantages
Reduces the disk space occupied by the file.
Reading and writing of files can be done quickly.
Increases the speed of transferring files through the internet and other networks.
Disadvantages
The processing time taken by complex data compression algorithms can be very high, especially
if the data in question is large.
Certain compression algorithms are resource-intensive and may cause the machine to run out of
memory.
There is a dependency on software that decompresses compressed files.
Incompatibility issues can occur during decompression processes.
Any error that occurs during the transmission of compressed data can cause significant
information loss.
There are a number of types of data compression models that use different compression
algorithms pertaining to the two compression techniques discussed above.
Reference
https://www.analytixlabs.co.in/blog/data-compression-technique/
Run-Length Encoding
Run-length encoding (RLE) is the simplest method of compression. It can be used to compress
data made of any combination of symbols. It does not need to know the frequency of occurrence
of symbols. The general idea behind this method is to replace consecutive repeating
occurrences of a symbol by one occurrence of the symbol followed by the number of
occurrences.
Example 1
A lengthy message is reduced to fewer bits (the diagram is not reproduced here).
Example 2
Here "4B" means four B's, "2H" means two H's, and so on. In this case, the number of
repetitions is placed before the symbol instead of after it, which is simply an alternative
convention.
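A minimal sketch of run-length encoding with the count placed before the symbol, as in "4B" and "2H", is given below. It assumes the input contains no digit characters, and the sample string is made up for this illustration because the original diagrams are not available.

```python
from itertools import groupby


def rle_encode(text: str) -> str:
    """Replace each run of a repeated symbol with '<count><symbol>'."""
    return "".join(f"{len(list(group))}{symbol}" for symbol, group in groupby(text))


def rle_decode(encoded: str) -> str:
    """Expand '<count><symbol>' pairs back into runs (symbols must not be digits)."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch              # counts may have more than one digit
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)


print(rle_encode("BBBBHHAAA"))       # 4B2H3A
print(rle_decode("4B2H3A"))          # BBBBHHAAA
print(rle_encode("ABC"))             # 1A1B1C
```

The last line shows the main weakness of the method: when the data has no runs, the encoded form is longer than the original.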
Huffman Encoding
Prefix Rule
Huffman Coding follows a rule known as the prefix rule.
The rule prevents ambiguity while decoding.
It ensures that the code assigned to any character is not a prefix of the code assigned to any
other character.
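The prefix rule can be checked mechanically. The sketch below tests the codes listed in Table 2 of these notes and, for contrast, a small invented code that breaks the rule. The function name `is_prefix_free` is hypothetical.

```python
def is_prefix_free(codes: dict[str, str]) -> bool:
    """Return True if no code is a prefix of (or equal to) another code."""
    words = list(codes.values())
    for i, first in enumerate(words):
        for j, second in enumerate(words):
            if i != j and second.startswith(first):
                return False
    return True


# Codes from Table 2 of these notes.
table_2 = {"a": "111", "e": "10", "i": "00", "o": "11001",
           "u": "1101", "s": "01", "t": "11000"}
print(is_prefix_free(table_2))                 # True

# Invented counter-example: "0" is a prefix of "01", so "01" could be read two ways.
print(is_prefix_free({"x": "0", "y": "01"}))   # False
```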
Huffman Tree
The steps involved in the construction of Huffman Tree are as follows-
Step-01:
Create a leaf node for each character of the text.
Leaf node of a character contains the occurring frequency of that character.
Step-02:
Arrange all the nodes in increasing order of their frequency value.
Step-03:
Take the first two nodes, i.e., the two nodes with the minimum frequency, and remove them
from the list.
Create a new internal node.
The frequency of this new node is the sum of the frequencies of those two nodes.
Make the first node the left child and the other node the right child of the newly created
node, and add the new node to the list.
Step-04:
Keep repeating Step-02 and Step-03 until all the nodes form a single tree.
The tree finally obtained is the desired Huffman Tree.
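The steps above can be sketched with a priority queue in place of repeated sorting. This is one possible implementation and not the only one. Ties between equal frequencies are broken by insertion order here, so the exact bit patterns may differ from a tree drawn by hand, although the code lengths and the total number of bits come out the same. The frequencies are those of the worked example in these notes.

```python
import heapq
from itertools import count


def huffman_codes(freq: dict[str, int]) -> dict[str, str]:
    """Build a Huffman tree and return a code (string of 0s and 1s) per symbol."""
    tiebreak = count()  # unique counter so the heap never compares tree nodes
    # Each heap entry is (frequency, tiebreak, node); a node is a symbol or a (left, right) pair.
    heap = [(f, next(tiebreak), symbol) for symbol, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # lowest frequency becomes the left child
        f2, _, right = heapq.heappop(heap)    # next lowest becomes the right child
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes: dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")       # left edge labelled 0
            walk(node[1], prefix + "1")       # right edge labelled 1
        else:
            codes[node] = prefix or "0"       # a lone symbol still needs one bit

    walk(heap[0][2], "")
    return codes


freq = {"A": 10, "E": 15, "I": 12, "O": 3, "U": 4, "S": 13, "T": 1}
codes = huffman_codes(freq)
for symbol in freq:
    print(symbol, codes[symbol])
print(sum(freq[s] * len(codes[s]) for s in freq))  # 146
```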
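Important Formulas
The following two formulas are important for solving problems based on Huffman Coding:
Formula-01:
\[ \text{Average code length per character} = \frac{\sum_i (\text{frequency}_i \times \text{code length}_i)}{\sum_i \text{frequency}_i} \]
Formula-02:
\[ \text{Total number of bits in Huffman encoded message} = \text{Total number of characters in the message} \times \text{Average code length per character} = \sum_i (\text{frequency}_i \times \text{code length}_i) \]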
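As a quick check of these formulas, the sketch below applies them to the frequencies and code lengths that appear in Table 2 later in these notes. The function names are chosen for this illustration only.

```python
def total_encoded_bits(freq: dict[str, int], code_len: dict[str, int]) -> int:
    """Sum of frequency x code length over all characters."""
    return sum(freq[ch] * code_len[ch] for ch in freq)


def average_code_length(freq: dict[str, int], code_len: dict[str, int]) -> float:
    """Total encoded bits divided by the total number of characters."""
    return total_encoded_bits(freq, code_len) / sum(freq.values())


# Frequencies and code lengths taken from Table 2 of these notes.
freq = {"a": 10, "e": 15, "i": 12, "o": 3, "u": 4, "s": 13, "t": 1}
code_len = {"a": 3, "e": 2, "i": 2, "o": 5, "u": 4, "s": 2, "t": 5}

print(sum(freq.values()))                            # 58 characters
print(total_encoded_bits(freq, code_len))            # 146 bits
print(round(average_code_length(freq, code_len), 2)) # 2.52 bits per character
```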
WORKED EXAMPLE ON HUFFMAN CODING
Problem
A file contains the following characters with the frequencies as shown. If Huffman Coding is
used for data compression, determine-
1. Huffman Code for each character
2. Average code length
3. Length of Huffman encoded message (in bits)
Characters Frequencies
A 10
E 15
I 12
O 3
U 4
S 13
T 1
Solution
First, let us construct the Huffman Tree.
The Huffman Tree is constructed in seven steps, Step-01 to Step-07 (the step-by-step diagrams
are not reproduced here).
Now,
We assign weight to all the edges of the constructed Huffman Tree.
Let us assign weight ‘0’ to the left edges and weight ‘1’ to the right edges.
Rule
If you assign weight ‘0’ to the left edges, then assign weight ‘1’ to the right edges.
If you assign weight ‘1’ to the left edges, then assign weight ‘0’ to the right edges.
Any of the above two conventions may be followed.
But follow the same convention at the time of decoding that is adopted at the time of
encoding.
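Once a convention is fixed, encoding and decoding are straightforward table lookups. The sketch below uses the codes listed in Table 2 later in these notes, which follow the ‘0’ for left and ‘1’ for right convention. The sample word "seat" and the function names are invented for this illustration.

```python
# Codes from Table 2 of these notes.
CODES = {"a": "111", "e": "10", "i": "00", "o": "11001",
         "u": "1101", "s": "01", "t": "11000"}


def encode(message: str, codes: dict[str, str]) -> str:
    """Concatenate the code of every character in the message."""
    return "".join(codes[ch] for ch in message)


def decode(bits: str, codes: dict[str, str]) -> str:
    """Read bits left to right; a prefix-free code makes every match unambiguous."""
    reverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


bits = encode("seat", CODES)
print(bits)                  # 011011111000
print(decode(bits, CODES))   # seat
```

Decoding the same bits with a table built under the opposite convention would not reproduce the message, which is why the convention must be shared between encoder and decoder.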
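After assigning weights to all the edges, we obtain the modified Huffman Tree (the diagram is
not reproduced here). The code for each character is read along the path from the root to that
character's leaf.
Now, let us answer each part of the given problem one by one:
1. The Huffman Code for each character is listed in Table 2.
2. Average code length = 146 / 58 ≈ 2.52 bits per character.
3. Length of Huffman encoded message = 146 bits.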
Normal character encoding uses the American Standard Code for Information Interchange (ASCII),
which is commonly stored using 8 bits for each character. Therefore, for the 58 characters of
the message we expect 58 × 8 = 464 bits without compression.
When Huffman Coding is used, the number of bits used includes the encoded message and the
encoding table that allows the decoding to take place. Table 2 gives the details.
Table 2: Huffman Encoding showing the total bits for decoding
SN          Character      Code       Frequency   Total Bits
1           a              111        10          3*10=30
2           e              10         15          2*15=30
3           i              00         12          2*12=24
4           o              11001      3           5*3=15
5           u              1101       4           4*4=16
6           s              01         13          2*13=26
7           t              11000      1           5*1=5
Total                                 58          146
Total Bits  7*8=56 bits    23 bits                146 bits
Total bits used will be 56 + 23 + 146 = 225.
This implies that, in this particular example, encoding has reduced the number of bits by
slightly more than 50% (from 464 bits to 225 bits).
\[ \text{Compression Ratio} = \frac{\text{Number of bits BEFORE compression}}{\text{Number of bits AFTER compression}} = \frac{464}{225} \approx 2.1 \]
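The totals above can be reproduced with a short sketch. It follows the accounting used in Table 2, namely 8 bits per distinct character for the table, plus the code bits, plus the encoded message. The function name `huffman_total_bits` is hypothetical.

```python
def huffman_total_bits(codes: dict[str, str], freq: dict[str, int],
                       bits_per_symbol: int = 8) -> int:
    """Bits for the symbol list + bits for the codes + bits for the encoded message."""
    symbol_bits = len(codes) * bits_per_symbol                    # 7 * 8 = 56
    code_bits = sum(len(code) for code in codes.values())         # 23
    message_bits = sum(freq[ch] * len(codes[ch]) for ch in codes) # 146
    return symbol_bits + code_bits + message_bits


# Codes and frequencies from Table 2 of these notes.
codes = {"a": "111", "e": "10", "i": "00", "o": "11001",
         "u": "1101", "s": "01", "t": "11000"}
freq = {"a": 10, "e": 15, "i": 12, "o": 3, "u": 4, "s": 13, "t": 1}

before = sum(freq.values()) * 8            # 58 characters at 8 bits each
after = huffman_total_bits(codes, freq)

print(before, after)                       # 464 225
print(f"{before / after:.2f}")             # 2.06
```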
Exercise
How many bits are required to encode the message WAGAGAGIGIKOKO using Huffman encoding?
Video File Formats
A video file format is a format used to store digital video data on a computer. The extension
at the end of the file name shows which format the video is stored in. This might also dictate
which software will be able to open and play the video (and whether you will have to transcode
the video in order to play it). There are dozens of different formats out there, and not all of
them are suitable for all purposes. The most commonly used video formats include:
AVI (1992)
Description: It is one of the most common formats for TV, which might account for its
popularity slightly dropping in recent years.
Advantages: Compatible with most players, browsers, and platforms; offers high-quality video
and audio; suitable for short videos, promos, advertisements, teasers, etc.
Disadvantages: Requires more storage than most other video file formats; not a good option for
live streaming videos; compression with quality retention is not its strongest suit.
MKV (2002)
Description: A video file format that can contain an unlimited number of video or audio tracks
within it.
Advantages: Free and open-source, meaning that it is constantly being updated and improved;
supports almost all codecs out there; a universal container that supports unlimited tracks,
menus, chapters, and more.
Disadvantages: Not compatible with many players and devices; uses a more complicated
compression process than most other formats; MKV file sizes are relatively large.
WMV (2003)
Description: WMV was developed by Microsoft. As such, it is compatible with Windows Media
Player and other Windows-based programs.
Advantages: Can store a lot of data without taking up too much space; has a high compression
ratio (twice as high as for MPEG-4) with the ability to retain relatively good video quality;
fully compatible with Windows, including older software such as Microsoft PowerPoint.
Disadvantages: Compatibility with other operating systems and programs is quite limited; the
compression ratio is not manually adjustable; due to limited compatibility, it is hardly a
standard video file format.
AVCHD (2006)
Description: It includes highly efficient encoding using the H.264 codec without significant
quality loss. AVCHD is the format used by professional recording equipment to store data.
Advantages: Highest-quality video files; 3D video support; compatibility with Blu-ray and
memory cards; compatibility with Sony, Panasonic, and Canon cameras.
Disadvantages: Files saved in this format are quite large; compatibility with various devices
and programs is relatively limited; editing this format can be quite complicated and
time-consuming.
Reference
https://target-video.com/best-video-file-formats/