Supplementary Notes On Compression and Formats

Uploaded by

abasijuma707
Copyright
© © All Rights Reserved

Data Compression as a way of data Representation

Storing, managing, and transferring data efficiently is essential in data communication and
other data-driven solutions. However far computer hardware (RAM, ROM, GPU) and forms of
communication (such as the internet) advance, these resources remain limited.

To use these resources efficiently, data often has to be compressed, i.e., reduced to a smaller
size with no loss, or only minimal loss, of information.

Many kinds of data can be compressed, including numbers, text, video, images, audio, and
even programs and other software. Data can be reduced by different ratios. A ratio of 2:1, for
example, means that a 100 MB file takes up only 50 MB of disk space after compression.
Compression, also known as compaction, is performed through various compression
techniques.

Data compression technique

Data compression techniques in digital communication are the specific formulas and carefully
designed algorithms that compression software uses to reduce the size of various kinds of data.
We will look at particular techniques later; to gain an overall understanding, we first focus on
the principles.

Data can be compressed by using shorter strings of bits (0s and 1s) in place of the original
strings and keeping a ‘dictionary’ that maps them back when the data is decompressed. Other
techniques introduce pointers (references) to a string of bits that the compression program has
already seen, or remove redundant characters.

For a video, a crude form of compression is to drop every 3rd frame, which (as one can
imagine) reduces the size of the file by about 1/3. Techniques like these can reduce data size
dramatically (in some cases by 70% or more without losing any significant data). Compression
formats such as ZIP and GZIP are commonly used when transferring data via the internet.
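
As a rough illustration of the size reduction described above, the sketch below compresses a byte string with Python's standard zlib module, which implements DEFLATE, the method used by GZIP and commonly by ZIP. The sample text is made up for the illustration, and it is deliberately repetitive, so the ratio it produces is not representative of real files.

```python
import zlib

# Hypothetical sample data: highly repetitive text compresses very well.
original = b"because the data repeats, because the data repeats, " * 50

compressed = zlib.compress(original)      # lossless DEFLATE compression
restored = zlib.decompress(compressed)    # exact reconstruction of the input

ratio = len(original) / len(compressed)
print(len(original), "bytes before,", len(compressed), "bytes after")
print("compression ratio is about", round(ratio, 1), ": 1")
```

Because the method is lossless, `restored` is byte-for-byte identical to `original`.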

The use of data compression techniques in digital communication greatly helps in reducing the
time for a file transfer, the cost of storage, and traffic in the network.

Types of data compression techniques


Many techniques are available (see the source listed under Reference), but the two common
types that always stand out are:

1. Lossy
2. Lossless

Lossy compression

To understand the lossy compression technique, we must first understand the difference
between data and information. Data is a raw, often unorganized collection of facts or values
and can mean numbers, text, symbols, etc. Information, on the other hand, brings context by
carefully organizing the facts.

To put this in context, a black and white image of 4×6 inches at 100 dpi (dots per inch) has
400 × 600 = 240,000 pixels. Each of these pixels contains data in the form of a number from 0
to 255, representing pixel intensity (0 being black and 255 being white).

Example: This image as a whole can carry some information, for instance that it is a picture of
the 16th president of the USA, Abraham Lincoln. If we display the image at 50 dpi, i.e., in
200 × 300 = 60,000 pixels, the data required to save the image is reduced, and perhaps the
quality too, but the information remains intact. Only after a considerable loss of data do we
lose the information.

With the above understanding of the difference between data and information, we now can
comprehend Lossy compression. As the name suggests, Lossy compression loses data, i.e., gets
rid of it to reduce the size of the data.
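
The pixel arithmetic in the example above, and the idea of discarding data while keeping the overall picture, can be sketched as follows. The tiny 4×4 grayscale grid and the 2×2 averaging scheme are assumptions chosen for illustration; real image codecs use far more elaborate methods.

```python
def pixel_count(width_in, height_in, dpi):
    """Number of pixels in an image of the given size (inches) and resolution."""
    return (width_in * dpi) * (height_in * dpi)

def downsample_2x(image):
    """Average each 2x2 block into one pixel. Detail is discarded, so this is lossy."""
    out = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[r]) - 1, 2):
            block = (image[r][c] + image[r][c + 1]
                     + image[r + 1][c] + image[r + 1][c + 1])
            row.append(block // 4)
        out.append(row)
    return out

print(pixel_count(4, 6, 100))  # 240000
print(pixel_count(4, 6, 50))   # 60000

# Hypothetical 4x4 grayscale image (0 = black, 255 = white).
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [10, 30, 200, 220],
    [20, 40, 210, 230],
]
small = downsample_2x(image)
print(small)  # [[0, 255], [25, 215]]
```

The smaller grid holds a quarter of the data and still shows a dark left side and a bright right side, but the original 16 values cannot be recovered from it.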

Advantages and disadvantages of Lossy Compression

Advantages
• It is relatively quick, can reduce the file size dramatically, and the user can select the
compression level.
• It is beneficial for compressing data like images, video, and even audio by taking advantage
of the limitations of the human senses: our eyes and ears cannot perceive a difference in the
quality of an image or of audio up to a certain point.

Disadvantages
• Data compressed through lossy compression will not return the same data (in terms of
quality, size, etc.). Still, it will hold similar information (this, in fact, is useful in some
instances, such as streaming or downloading content on the internet). On the flip side,
repeatedly downloading and uploading a file can recompress it each time and consequently
distort it beyond the point of recognition, causing permanent information loss. Similarly, if the
user chooses a severe level of compression, the output file might not be anywhere close to the
original input file.

Lossless Compression

Lossless compression, unlike lossy compression, doesn’t remove any data; instead, it
transforms it to reduce its size. To understand the concept, we can take a simple example.
There is a piece of text where the word ‘because’ is repeated quite often. The term is comprised
of seven letters, and by using a shorthand or abbreviated version of it like ‘bcz’, we can
transform the text. This information of replacing ‘because’ with ‘bcz’ can be stored in a
dictionary for later use (during decompression).

While lossy compression removes redundant or unnoticeable pieces of data to reduce the size,
lossless compression transforms it through encoding it by using some formula or logic.
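
The ‘because’ and ‘bcz’ substitution described above can be sketched in a few lines. This is a toy illustration, not a real compression format: it assumes the abbreviation never occurs in the original text, and the sample sentence is made up.

```python
def compress(text, dictionary):
    """Replace each dictionary word with its shorter code."""
    for word, code in dictionary.items():
        text = text.replace(word, code)
    return text

def decompress(text, dictionary):
    """Reverse the substitution using the same dictionary."""
    for word, code in dictionary.items():
        text = text.replace(code, word)
    return text

# Assumption: "bcz" does not appear anywhere in the original text.
dictionary = {"because": "bcz"}
text = "I stayed because it rained, and because it was late."

packed = compress(text, dictionary)
print(packed)                       # I stayed bcz it rained, and bcz it was late.
print(len(text), "->", len(packed))
print(decompress(packed, dictionary) == text)
```

Each replacement saves four characters, and decompression restores the text exactly, which is what makes the scheme lossless. In practice the dictionary itself must also be stored or transmitted.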

Advantages and disadvantages of Lossless Compression

Advantages
• There are types of data where lossy compression is not feasible. For example, in a
spreadsheet, software, program, or any data comprised of factual text or numbers, lossy
compression cannot work: every number might be essential and cannot be considered
redundant, so any reduction immediately causes loss of information. Here lossless
compression becomes crucial as, upon decompression, the file is restored to its original state
without losing any data.

Disadvantages
• There is a limit to data compression. If data is already compressed, then compressing it again
will result in little to no reduction in its size.
• It is less effective against larger file sizes, since it usually achieves smaller reductions than
lossy compression.

Advantages and Disadvantages of Data Compression Techniques

There are several advantages to using the data compression techniques discussed above.
There is also a trade-off, as a cost is always associated with compressing a file, and this cost
results in certain disadvantages. The advantages and disadvantages of compression are listed
below.

Advantages
• Reduces the disk space occupied by the file.
• Reading and writing of files can be done quickly.
• Increases the speed of transferring files through the internet and other networks.

Disadvantages
• The processing time taken by complex data compression algorithms can be very high,
especially if the data in question is large.
• Certain compression algorithms are resource-intensive and may cause the machine to run out
of memory.
• There is a dependency on software that decompresses compressed files.
• Incompatibility issues can occur during decompression.
• Any error that occurs during the transmission of compressed data can cause significant
information loss.

Data Compression Technique Model

There are a number of types of data compression models that use different compression
algorithms pertaining to the two compression techniques discussed above.

Following are the most common data compression models-


Some common lossless compression technique models
• RLE (Run Length Encoding)
• Dictionary Coder (LZ77, LZ78, LZR, LZW, LZSS, LZMA, LZMA2)
• Huffman Encoding
• Adaptive Huffman Coding
• Shannon Fano Encoding
• Arithmetic Encoding
• Lempel Ziv Welch Encoding

Some common lossy compression technique models
• Transform coding
• Discrete Cosine Transform
• Discrete Wavelet Transform
• Fractal Compression

Reference

https://www.analytixlabs.co.in/blog/data-compression-technique/

Some examples of Compression Algorithms Explained

Run-length encoding (RLE)

Run-length encoding (RLE) is the simplest method of compression. It can be used to compress
data made of any combination of symbols. It does not need to know the frequency of occurrence
of symbols. The general idea behind this method is to replace consecutive repeating
occurrences of a symbol by one occurrence of the symbol followed by the number of
occurrences.

Example 1
[Figure: a lengthy message reduced to fewer bits by run-length encoding]

Example 2

The message BBBBHHDDXXXXKKKKWWZZZZ can be encoded to


4B2H2D4X4K2W4Z.

Here "4B" means four B's, "2H" means two H's, and so on. In this example the number of
repetitions is written before the symbol rather than after it. Either convention works, as long
as the encoder and the decoder use the same one.
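
Example 2 can be reproduced with a short sketch. It uses the count-before-symbol convention of that example and assumes the message contains no digit characters, since digits would be confused with the counts during decoding.

```python
import re
from itertools import groupby

def rle_encode(message):
    """Replace each run of a repeated symbol with '<count><symbol>'."""
    return "".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(message))

def rle_decode(encoded):
    """Expand each '<count><symbol>' pair back into a run of symbols."""
    pairs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(symbol * int(count) for count, symbol in pairs)

message = "BBBBHHDDXXXXKKKKWWZZZZ"
encoded = rle_encode(message)
print(encoded)                         # 4B2H2D4X4K2W4Z
print(rle_decode(encoded) == message)  # True
```

Note that RLE only helps when runs are long: a message with no repeated neighbours would grow, because every symbol gains a count of 1.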

Huffman Encoding

• Huffman Coding is a famous Greedy Algorithm.
• It is used for the lossless compression of data.
• It uses variable length encoding, i.e., it assigns a variable length code to each character.
• The code length of a character depends on how frequently it occurs in the given text.
• The character which occurs most frequently gets the shortest code.
• The character which occurs least frequently gets the longest code.
• It is also known as Huffman Encoding.

Prefix Rule
Huffman Coding implements a rule known as a prefix rule.
This is to prevent the ambiguities while decoding.
It ensures that the code assigned to any character is not a prefix of the code assigned to any
other character.
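
The prefix rule can be checked mechanically. The sketch below tests a code table by comparing every pair of codes; the function name is hypothetical, and the first table is the one produced in the worked example later in this handout.

```python
def is_prefix_free(codes):
    """Return True if no code is a prefix of another character's code."""
    words = list(codes.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )

# Codes from the worked example in this handout.
handout_codes = {"a": "111", "e": "10", "i": "00", "o": "11001",
                 "u": "1101", "s": "01", "t": "11000"}

# A deliberately ambiguous table: "1" is a prefix of "10".
bad_codes = {"x": "1", "y": "10"}

print(is_prefix_free(handout_codes))  # True
print(is_prefix_free(bad_codes))      # False
```

With the ambiguous table, a decoder that reads a 1 cannot tell whether it has seen a complete x or the start of a y, which is exactly the ambiguity the prefix rule prevents.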

Major Steps in Huffman Coding


There are two major steps in Huffman Coding-
1. Building a Huffman Tree from the input characters.
2. Assigning code to the characters by traversing the Huffman Tree.

Huffman Tree
The steps involved in the construction of a Huffman Tree are as follows-

Step-01:
• Create a leaf node for each character of the text.
• The leaf node of a character contains the frequency of occurrence of that character.

Step-02:
• Arrange all the nodes in increasing order of their frequency value.

Step-03:
Considering the first two nodes having minimum frequency,
• Create a new internal node.
• The frequency of this new node is the sum of the frequencies of those two nodes.
• Make the first node the left child and the other node the right child of the newly created node.

Step-04:
• Keep repeating Step-02 and Step-03 until all the nodes form a single tree.
• The tree finally obtained is the desired Huffman Tree.
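
The four steps above can be sketched with a priority queue, which keeps the nodes ordered by frequency so that Step-02 does not need a full re-sort each time. This is a minimal sketch, not the only way to build the tree: the function name, the tuple representation of internal nodes, and the tie-breaking counter are assumptions of this example. When two nodes have equal frequency, a different tie-break can give different (but equally short) codes from those worked out by hand.

```python
import heapq
from itertools import count

def huffman_codes(frequencies):
    """Build a Huffman tree and return a {symbol: code} table."""
    tiebreak = count()  # unique counter so the heap never compares tree nodes
    heap = [(freq, next(tiebreak), symbol) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)                       # Step-01 and Step-02

    while len(heap) > 1:                      # Step-04: repeat until one tree remains
        f1, _, left = heapq.heappop(heap)     # Step-03: two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    _, _, root = heap[0]
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):           # internal node: (left, right)
            walk(node[0], prefix + "0")       # left edge gets 0
            walk(node[1], prefix + "1")       # right edge gets 1
        else:
            codes[node] = prefix or "0"       # single-symbol input still gets a code

    walk(root, "")
    return codes

# Frequencies from the worked example in this handout.
freqs = {"A": 10, "E": 15, "I": 12, "O": 3, "U": 4, "S": 13, "T": 1}
codes = huffman_codes(freqs)
lengths = {symbol: len(code) for symbol, code in codes.items()}
print(lengths)
print(sum(freqs[s] * lengths[s] for s in freqs), "bits in total")
```

For these frequencies the code lengths are 2 bits for E, I and S, 3 bits for A, 4 bits for U, and 5 bits for O and T, giving 146 bits in total.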

Important Formulas
The following 2 formulas are important for solving problems based on Huffman Coding-

Formula-01:
Average code length per character
= ∑ ( frequency_i x code length_i ) / ∑ ( frequency_i )

Formula-02:
Total number of bits in Huffman encoded message
= Total number of characters in the message x Average code length per character
= ∑ ( frequency_i x code length_i )

WORKED EXAMPLE ON HUFFMAN CODING

Problem
A file contains the following characters with the frequencies as shown. If Huffman Coding is
used for data compression, determine-
1. Huffman Code for each character
2. Average code length
3. Length of Huffman encoded message (in bits)

Characters Frequencies
A 10
E 15
I 12
O 3
U 4
S 13
T 1
Solution
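First, let us construct the Huffman Tree.
The Huffman Tree is constructed in the following steps (each node is shown with its frequency):

Step-01: Create the leaf nodes and arrange them in increasing order of frequency:
T(1), O(3), U(4), A(10), I(12), S(13), E(15)

Step-02: Combine T(1) and O(3) into a new node of frequency 4:
[T,O](4), U(4), A(10), I(12), S(13), E(15)

Step-03: Combine [T,O](4) and U(4) into a new node of frequency 8:
[T,O,U](8), A(10), I(12), S(13), E(15)

Step-04: Combine [T,O,U](8) and A(10) into a new node of frequency 18:
I(12), S(13), E(15), [T,O,U,A](18)

Step-05: Combine I(12) and S(13) into a new node of frequency 25:
E(15), [T,O,U,A](18), [I,S](25)

Step-06: Combine E(15) and [T,O,U,A](18) into a new node of frequency 33:
[I,S](25), [E,T,O,U,A](33)

Step-07: Combine [I,S](25) and [E,T,O,U,A](33) into the root node of frequency 58.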

Now,
• We assign a weight to every edge of the constructed Huffman Tree.
• Let us assign weight ‘0’ to the left edges and weight ‘1’ to the right edges.

Rule
• If you assign weight ‘0’ to the left edges, then assign weight ‘1’ to the right edges.
• If you assign weight ‘1’ to the left edges, then assign weight ‘0’ to the right edges.
• Either of the above two conventions may be followed.
• But follow the same convention at the time of decoding that was adopted at the time of
encoding.

After assigning weights to all the edges, we obtain the modified Huffman Tree.
[Figure: Huffman Tree with weight ‘0’ on the left edges and weight ‘1’ on the right edges]

Now, let us answer each part of the given problem one by one-

1. Huffman Code for Characters-


To write Huffman Code for any character, traverse the Huffman Tree from root node to the
leaf node of that character.
Following this rule, the Huffman Code for each character is-
• a = 111
• e = 10
• i = 00
• o = 11001
• u = 1101
• s = 01
• t = 11000

From here, we can observe-

• Characters occurring less frequently in the text are assigned longer codes.
• Characters occurring more frequently in the text are assigned shorter codes.
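
To see these codes in use, the sketch below encodes a short word and decodes it again by reading bits until they match a complete code. This only works because the codes satisfy the prefix rule. The word "seat" is an arbitrary choice for illustration.

```python
# Huffman codes from this worked example.
CODES = {"a": "111", "e": "10", "i": "00", "o": "11001",
         "u": "1101", "s": "01", "t": "11000"}

def encode(text):
    """Concatenate the code of every character."""
    return "".join(CODES[ch] for ch in text)

def decode(bits):
    """Read bits one at a time and emit a character whenever a full code is seen."""
    lookup = {code: ch for ch, code in CODES.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("leftover bits do not form a complete code")
    return "".join(out)

bits = encode("seat")
print(bits)          # 011011111000
print(decode(bits))  # seat
```

The four characters take 2 + 2 + 3 + 5 = 12 bits here, compared with 32 bits at 8 bits per character.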

2. Average Code Length

Using Formula-01, we have-

Average code length
= ∑ ( frequency_i x code length_i ) / ∑ ( frequency_i )
= { (10 x 3) + (15 x 2) + (12 x 2) + (3 x 5) + (4 x 4) + (13 x 2) + (1 x 5) } / (10 + 15 + 12 + 3
+ 4 + 13 + 1)
= 146 / 58
≈ 2.52 bits per character

3. Length of Huffman Encoded Message-

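Using Formula-02, we have-

Total number of bits in Huffman encoded message
= Total number of characters in the message x Average code length per character
= 58 x (146 / 58)
= 146 bits

Note that multiplying by the rounded average (58 x 2.52 = 146.16) slightly overstates the
result; the exact total, ∑ ( frequency_i x code length_i ), is 146 bits.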

The above calculations can also be carried out as shown in Table 1.

Table 1: Huffman Encoding

Character   Code    Frequency   Total Bits
a           111     10          3*10 = 30
e           10      15          2*15 = 30
i           00      12          2*12 = 24
o           11001   3           5*3 = 15
u           1101    4           4*4 = 16
s           01      13          2*13 = 26
t           11000   1           5*1 = 5
Total               58          146 bits
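
The same table can be computed programmatically. The short sketch below applies both formulas to the codes and frequencies of this example; the variable names are chosen for illustration only.

```python
freqs = {"a": 10, "e": 15, "i": 12, "o": 3, "u": 4, "s": 13, "t": 1}
codes = {"a": "111", "e": "10", "i": "00", "o": "11001",
         "u": "1101", "s": "01", "t": "11000"}

# "Total Bits" column: code length x frequency for each character.
bits_per_char = {ch: len(codes[ch]) * freqs[ch] for ch in freqs}

total_bits = sum(bits_per_char.values())     # Formula-02
total_chars = sum(freqs.values())
average_length = total_bits / total_chars    # Formula-01

for ch in freqs:
    print(ch, codes[ch], freqs[ch], bits_per_char[ch])
print("total:", total_chars, "characters,", total_bits, "bits")
print("average code length:", round(average_length, 2))
```

This gives 58 characters, 146 bits, and an average code length of about 2.52 bits per character, in agreement with Table 1.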

How compression is achieved:

Normal character encoding uses the American Standard Code for Information Interchange (ASCII),
with each character typically stored in 8 bits. Therefore, for 58 characters we expect 58*8 = 464 bits
without compression.
When Huffman Coding is used, the number of bits used includes both the encoded message and the
encoding table that allows decoding to take place. Table 2 gives the details.

Table 2: Huffman Encoding showing the total bits for decoding

SN   Character       Code      Frequency   Total Bits
1    a               111       10          3*10 = 30
2    e               10        15          2*15 = 30
3    i               00        12          2*12 = 24
4    o               11001     3           5*3 = 15
5    u               1101      4           4*4 = 16
6    s               01        13          2*13 = 26
7    t               11000     1           5*1 = 5
     Total                     58          146
     Total Bits      Characters: 7*8 = 56 bits   Codes: 23 bits   Message: 146 bits

Total bits used will be 56 + 23 + 146 = 225.
This implies that, in this particular example, encoding has reduced the number of bits by
roughly half (225 bits instead of 464, a saving of about 52%).

Compression Ratio = (Number of bits BEFORE compression) / (Number of bits AFTER compression)

Compression Ratio = 464 / 225 ≈ 2.1
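
The overall figures can be checked with a few lines of arithmetic. This sketch follows the handout's accounting, which assumes 8 bits per character for both the uncompressed message and the characters stored in the encoding table; the variable names are illustrative.

```python
uncompressed_bits = 58 * 8    # 58 characters at 8 bits each

table_symbol_bits = 7 * 8     # the 7 distinct characters stored in the table
table_code_bits = 23          # total length of the 7 codes (3+2+2+5+4+2+5)
message_bits = 146            # Huffman-encoded message

compressed_bits = table_symbol_bits + table_code_bits + message_bits
ratio = uncompressed_bits / compressed_bits
saving = 1 - compressed_bits / uncompressed_bits

print(uncompressed_bits, compressed_bits)  # 464 225
print(round(ratio, 2))                     # 2.06
print(f"{saving:.1%}")                     # 51.5%
```

For a message this short the table is a large share of the compressed size (79 of 225 bits); for longer messages that overhead becomes less significant.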

Exercise
How many bits are required to encode the following message using Huffman encoding?
WAGAGAGIGIKOKO

Follow the procedure in this handout to perform the following:


i) Calculate the frequency of characters
ii) Generate Huffman Tree
iii) Calculate number of bits using frequency of characters and number of bits required to
represent those characters.
iv) Use the table to calculate number of bits used
v) Calculate the compression ratio.

Video File Formats

A video file format is a format used to store digital video data on a computer. Extensions found
after the name of the file show us which format the video is stored in. This might also dictate
which software will be able to open and play the video (and whether you will have to transcode
the video in order to play it). There are dozens of different formats out there, and not all of
them are suitable for all purposes. The most commonly used video formats include:

Each format below is listed with its release date, its description and characteristics, its
advantages, and its disadvantages.

MP4 (2001)
Description and characteristics: MP4 is a universal format that is supported by all major
operating systems, browsers, and players. It is the safest option for maintaining decent video
quality without sacrificing too much storage space.
Advantages
• Compatibility with a long list of media players (universal format);
• Compatibility with a wide array of video-sharing websites, including YouTube;
• High compression without a lot of quality deterioration.
Disadvantages
• Encoding, playing, and editing an MP4 video require quite a bit of computing power;
• MP4 makes it quite easy to alter the metadata of a file and illegally distribute content;
• Repeat encoding can lead to significant quality deterioration, as MP4 is a lossy format.

WebM (2010)
Description and characteristics: WebM is an open-source format developed by Google and is a
great option for online libraries and live streaming, especially on Windows devices.
Advantages
• Small file sizes call for low computational power;
• It is an open-source format available to everyone and supported by all major browsers;
• Offers great quality real-time video delivery, making it great for live streaming;
• Compatible with major online video platforms, such as YouTube;
• It allows for almost immediate playback, making it great for websites with a lot of video
content.
Disadvantages
• Not the best compatibility with mobile devices;
• Might not be compatible with some players and browsers, especially older ones.

MOV (1991)
Description and characteristics: MOV is most compatible with iOS devices and is a great option
for full-length movies.
Advantages
• Offers great video quality;
• Can contain different multimedia elements, such as video, audio, or text, stored as separate
tracks;
• Compatible with a wide range of codecs and platforms, including YouTube, Facebook, and
Instagram.
Disadvantages
• Large file sizes;
• Poor compatibility with players other than QuickTime or VLC;
• Compatibility with Facebook and Instagram is limited to files of up to 4GB.

AVI (1992)
Description and characteristics: It is one of the most common formats for TV, which might
account for its popularity slightly dropping in recent years.
Advantages
• Compatible with most players, browsers, and platforms;
• Offers high-quality video and audio;
• Suitable for short videos, promos, advertisements, teasers, etc.
Disadvantages
• It requires more storage than most other video file formats;
• Not a good option for live streaming videos;
• Compression with quality retention is not its strongest suit.

MKV (2002)
Description and characteristics: A video file format that can contain an unlimited number of
video or audio tracks within it.
Advantages
• Free and open-source, meaning that it is constantly being updated and improved;
• It supports almost all codecs out there;
• It is a universal container that supports unlimited tracks, menus, chapters, and more.
Disadvantages
• Not compatible with many players and devices;
• Uses a more complicated compression process than most other formats;
• MKV file sizes are relatively large.

WMV (2003)
Description and characteristics: WMV was developed by Microsoft. As such, it is compatible
with Windows Media Player and other Windows-based programs.
Advantages
• It can store a lot of data without taking up too much space;
• It has a high compression ratio (twice as high as for MPEG-4) with the ability to retain
relatively good video quality;
• It is fully compatible with Windows, including older software such as Microsoft PowerPoint.
Disadvantages
• Compatibility with other operating systems and programs is quite limited;
• The compression ratio is not manually adjustable;
• Due to limited compatibility, it is hardly a standard video file format.

AVCHD (2006)
Description and characteristics: It includes highly efficient encoding using the H.264 codec
without significant quality loss. AVCHD is the format used by professional recording
equipment to store data.
Advantages
• Highest-quality video files;
• 3D video support;
• Compatibility with Blu-ray and memory cards;
• Compatibility with Sony, Panasonic, and Canon cameras.
Disadvantages
• Files saved in this format are quite large;
• Compatibility with various devices and programs is relatively limited;
• Editing this format can be quite complicated and time-consuming.

REFERENCE

https://target-video.com/best-video-file-formats/
