
International Journal of Science and Research (IJSR)

ISSN (Online): 2319-7064


Impact Factor (2012): 3.358

A Survey on Different Text Data Compression Techniques

Apoorv Vikram Singh¹, Garima Singh²

¹Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology, Allahabad, Uttar Pradesh, India
²Department of Computer Science and Engineering, Gautam Buddha University, Greater Noida, Uttar Pradesh, India

Abstract: Data compression refers to the process of reducing data size by removing excessive information. The main objective of data compression is to reduce the amount of redundant information in stored or communicated data. Data compression is quite useful as it helps to reduce resource usage, such as data storage space or transmission capacity. It finds application in file storage and distributed systems, because in a distributed system we need to send data to and from all the systems. Data compression techniques are mainly used for speed and performance efficiency while keeping down the cost of transmission. There are a number of different data compression methodologies, used to compress different data formats such as text, video, audio and image files. Data compression techniques can be broadly classified into two major categories, “lossy” and “lossless”. In this paper, different basic lossless data compression methods are reviewed and a conclusion is drawn on the basis of these methods.

Keywords: Data Compression, Lossless data compression, Lossy data compression, encoding, coding

1. Introduction

Data compression is a way to reduce data size, remove excessive information and minimize storage cost by eliminating the redundancies that occur in most files. Data compression is a common requirement for most computerized applications. We find the use of data compression in the areas of file storage and distributed systems. It also finds application in network processing techniques, where it saves energy and decreases transfer time by reducing the amount of data to be transmitted. Data compression is used in the multimedia field, text documents and database tables as well. The most important classification criterion is whether the compression algorithm removes some part of the data which cannot be recovered during decompression. On the basis of this criterion, data compression techniques are divided into two major categories: “lossy” and “lossless”.

2. Data Compression Techniques

Lossless: Lossless data compression algorithms reduce the amount of source information to be transmitted in such a way that when the compressed information is decompressed, there is no loss of information. Lossless compression is possible because most real-world data have statistical redundancy, and these algorithms exploit that redundancy to represent data more concisely without losing information.

Lossy: Lossy data compression is contrasted with lossless data compression. Lossy data compression algorithms do not produce an exact copy of the information after decompression as was present before compression; in these schemes, some loss of information is acceptable. Lossy compression reduces the file size by eliminating some redundant data that will not be noticed by humans after decoding. Applications of lossy data compression techniques:
• Lossy image compression can be used in digital cameras to increase storage capacities with minimal degradation of picture quality.
• Similarly, DVDs use the lossy MPEG-2 video codec for video compression.
• In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal [3].

(Figure: tree representation of compression methods)


Volume 3 Issue 7, July 2014
www.ijsr.net
Paper ID: 020141298
Licensed Under Creative Commons Attribution CC BY
3. Lossless Data Compression Techniques

There are several types of lossless data compression techniques apart from the three mentioned in the figure. Some of them are:

3.1 Run Length Encoding or Repetitive Sequence Suppression

Run-length encoding is a very simple data compression technique whose basic principle is to count the number of consecutive data items and then use that count for compression. The main idea behind this approach is: if any data item ‘d’ occurs ‘k’ times in an input stream, then instead of writing this data item k times we can replace it by ‘kd’.

Example:
Input Stream: AAAAAAABBBBCCCCCCCAAAAAADDDDD
Compressed Stream: 7A4B7C6A5D

RLE is mainly used to compress runs of the same data byte. This method is used when there is a lot of repetition of data items. Thus, RLE is often used to compress bitmap images, especially low-bit-depth ones.
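The scheme above can be sketched in a few lines of Python (a minimal illustration of the ‘kd’ convention used in the example; the function names are ours, not from the paper):

```python
def rle_encode(stream: str) -> str:
    """Encode consecutive runs of a symbol as '<count><symbol>'."""
    if not stream:
        return ""
    out = []
    prev, count = stream[0], 1
    for ch in stream[1:]:
        if ch == prev:
            count += 1          # extend the current run
        else:
            out.append(f"{count}{prev}")
            prev, count = ch, 1
    out.append(f"{count}{prev}")
    return "".join(out)

def rle_decode(encoded: str) -> str:
    """Invert rle_encode: expand each '<count><symbol>' pair."""
    out, digits = [], ""
    for ch in encoded:
        if ch.isdigit():
            digits += ch        # accumulate a (possibly multi-digit) count
        else:
            out.append(ch * int(digits))
            digits = ""
    return "".join(out)

print(rle_encode("AAAAAAABBBBCCCCCCCAAAAAADDDDD"))  # 7A4B7C6A5D
```

Note that this toy encoding is ambiguous if the input itself contains digits; real RLE formats use escape markers or fixed-width counts to avoid that.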
3.2 Burrows Wheeler Transform or Block Sorting Compression

The Burrows-Wheeler transform works in block mode while most other methods work in streaming mode [2]. It is classified as a transformation algorithm because it rearranges a character string into runs of similar characters. These runs can then be used as an input stream for other algorithms, such as run-length encoding or the move-to-front transform, to achieve better compression ratios.

The transform is done by sorting all rotations of the text in lexicographic order and then taking the last column [6]. Since the BWT operates on data in memory, a file may be too big to process in one fell swoop; in that case the file must be split up and processed a block at a time [5]. One important feature of the BWT is that the transformation is reversible: when a character string is transformed by the BWT, the values of the characters do not change; only their order is permuted.

Example:
Input: ^BANANA|

Rotations      Sorted in Alphabetical Order
^BANANA|       ANANA|^B
|^BANANA       ANA|^BAN
A|^BANAN       A|^BANAN
NA|^BANA       BANANA|^
ANA|^BAN       NANA|^BA
NANA|^BA       NA|^BANA
ANANA|^B       ^BANANA|
BANANA|^       |^BANANA

Output (last column of the sorted rows): BNN^AA|A

3.3 Move-to-front Transform

The move-to-front transform is another basic technique for data compression, but the irony is that it does not compress data by itself; rather, it sometimes helps to reduce redundancy. The main idea is that each symbol in the data is replaced by its index in a “stack of recently used symbols”, so that frequently used symbols get smaller output numbers.

Example:
Input Stream: bananaaa
Output Stream: 1,1,13,1,1,1,0,0

Iteration   Sequence           List used to encode the latest symbol
bananaaa    1                  (abcdefghijklmnopqrstuvwxyz)
bananaaa    1,1                (bacdefghijklmnopqrstuvwxyz)
bananaaa    1,1,13             (abcdefghijklmnopqrstuvwxyz)
bananaaa    1,1,13,1           (nabcdefghijklmopqrstuvwxyz)
bananaaa    1,1,13,1,1         (anbcdefghijklmopqrstuvwxyz)
bananaaa    1,1,13,1,1,1       (nabcdefghijklmopqrstuvwxyz)
bananaaa    1,1,13,1,1,1,0     (anbcdefghijklmopqrstuvwxyz)
bananaaa    1,1,13,1,1,1,0,0   (anbcdefghijklmopqrstuvwxyz)

This is how an input stream is transformed into an output stream using the move-to-front transform. The technique is intended to be used as an optimization for other algorithms such as the Burrows-Wheeler transform.
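The two transforms can be sketched together, reproducing both worked examples above (a minimal illustration; this `bwt` builds and sorts all rotations naively rather than processing blocks the way a production coder would):

```python
import string

def bwt(s: str) -> str:
    """Burrows-Wheeler transform: sort all rotations of s
    lexicographically and take the last column."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

def mtf(s: str, alphabet: str) -> list:
    """Move-to-front: emit each symbol's index in the current list,
    then move that symbol to the front of the list."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

print(bwt("^BANANA|"))                           # BNN^AA|A
print(mtf("bananaaa", string.ascii_lowercase))   # [1, 1, 13, 1, 1, 1, 0, 0]
```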
3.4 LZW (Lempel-Ziv-Welch) Compression

LZW is one of the most popular methods of data compression. The main steps of this technique are given below:
• First, it reads the file and assigns a code to each distinct character.
• If the same characters are found again in the file, it does not assign a new code but reuses the existing code from the dictionary.
• The process continues until all the characters in the file have been read.

3.5 Shannon Fano Coding

The Shannon-Fano coding technique encodes data or messages depending upon their probability of occurrence. It involves the following steps:
• For a given list of symbols, develop a probability table.
• Sort the table according to probability, placing the most probable symbol at the top of the list.
• Divide the table into two parts, such that the sums of the probabilities of the two parts are as close as possible.
• Assign ‘0’ to the left half of the list and ‘1’ to the right half.
• Repeat the previous two steps for each of the two halves, further dividing the groups and adding bits to the codes; stop when each symbol has a corresponding leaf on the tree.

Example:
Symbol        A        B        C        D        E
Frequency     15       7        6        6        5
Probability   0.3846   0.1795   0.1538   0.1538   0.1282

After going through all the steps mentioned above, we get:
Symbol   A    B    C    D     E
Code     00   01   10   110   111

On calculating the average number of bits, we get around 2.28 bits per symbol.
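The splitting procedure can be sketched recursively (a minimal illustration; the input is assumed to be pre-sorted by descending frequency, and ties are broken by taking the first best split):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, frequency) pairs sorted by descending
    frequency. Returns {symbol: code} by recursively splitting the list
    where the two halves' frequency sums are as balanced as possible."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(freq for _, freq in symbols)
    running, best_diff, split = 0, float("inf"), 1
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(total - 2 * running)   # |right sum - left sum|
        if diff < best_diff:
            best_diff, split = diff, i
    codes = {}
    for sym, code in shannon_fano(symbols[:split]).items():
        codes[sym] = "0" + code           # left half gets prefix 0
    for sym, code in shannon_fano(symbols[split:]).items():
        codes[sym] = "1" + code           # right half gets prefix 1
    return codes

codes = shannon_fano([("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)])
```

Running this on the table above reproduces the codes 00, 01, 10, 110, 111, giving (15·2 + 7·2 + 6·2 + 6·3 + 5·3) / 39 ≈ 2.28 bits per symbol.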
3.6 Huffman Coding

Huffman coding is a more sophisticated and efficient lossless data compression technique. In Huffman coding, the characters in a data file are converted to binary codes; the most common characters in the file get the shortest binary codes and the least common ones get the longest [7]. The main steps are:
1. Initialization: put the elements in a list sorted according to their frequency counts.
2. Repeat the following steps until the sorted list has only one node left:
   a) From the list, pick the two elements with the lowest frequency counts. Form a Huffman subtree that has these two nodes as children and create a parent node for them.
   b) Assign the sum of the children's frequencies to the parent node and insert it into the list. Considering the parent node as one of the nodes of the list, again pick the two lowest frequency counts and form a Huffman subtree.
3. Label the edge from each parent to its left child with the digit 0 and the edge to its right child with 1. The codeword for each source letter is the sequence of labels along the path from the root to the leaf node representing that letter.

Example: using the same frequencies as in the Shannon-Fano example above:
Symbol        A        B        C        D        E
Frequency     15       7        6        6        5
Probability   0.3846   0.1795   0.1538   0.1538   0.1282

After going through all the steps mentioned above, we get:
Symbol   A   B     C     D     E
Code     0   100   101   110   111

On calculating the average number of bits, we get around 2.23 bits per symbol.
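Steps 1 to 3 can be sketched with a min-heap (a minimal illustration; tie-breaking differs between implementations, so the exact bit patterns may vary, but the code lengths, and hence the average of about 2.23 bits, are the same):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """freqs: {symbol: frequency}. Build the Huffman tree bottom-up
    with a min-heap and return {symbol: bitstring}."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two lowest-frequency nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse
            walk(node[0], prefix + "0")      # left edge labelled 0
            walk(node[1], prefix + "1")      # right edge labelled 1
        else:
            codes[node] = prefix or "0"      # single-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5})
```

For the frequencies above this yields a one-bit code for A and three-bit codes for the rest, matching the table: (15·1 + 24·3) / 39 ≈ 2.23 bits per symbol.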

3.7 Arithmetic Coding Technique

Arithmetic coding is the most powerful coding technique. It differs from the other compression techniques in that it does not replace each symbol with a codeword; instead, it replaces a whole stream of input data with a single floating-point number as output.
a) In the first step, we calculate the frequency count of the different symbols.
b) In the second step, we encode the string by dividing up the interval [0, 1) and allocating each letter an interval whose size depends on how often it occurs in the string.
c) In the third step, we consider the next letter and subdivide the interval of that letter in the same way. We carry on through the message and, continuing in this way, obtain the required interval.

A message is represented by a half-open interval [a, b), where a and b are real numbers between 0 and 1. Initially, the interval is [0, 1). As the message becomes longer, the length of the interval shortens and the number of bits needed to represent the interval increases [4].
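The interval narrowing in steps a) to c) can be illustrated as follows (a sketch over an assumed toy alphabet; exact fractions are used to avoid floating-point rounding, and a complete coder would additionally emit bits identifying a number inside the final interval):

```python
from fractions import Fraction

def arithmetic_encode_interval(message, probs):
    """Return the final half-open interval [low, high) for `message`,
    given per-symbol probabilities in a fixed symbol order."""
    # Cumulative interval start of each symbol within [0, 1).
    starts, acc = {}, Fraction(0)
    for sym, p in probs.items():
        starts[sym] = acc
        acc += p
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        low += width * starts[sym]   # narrow to this symbol's sub-interval
        width *= probs[sym]          # interval shrinks at every step
    return low, low + width

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
low, high = arithmetic_encode_interval("ab", probs)
# "a" narrows [0, 1) to [0, 1/2); "b" narrows that to [1/4, 3/8).
```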
4. Measuring Compression Performances

A performance measure is used to find which technique is good according to some criteria. The performance of a compression algorithm can be measured on the basis of different criteria, depending upon the nature of the application. The most important thing to keep in mind while measuring performance is space efficiency; time efficiency is also an important factor. Since compression behavior depends on the redundancy of the symbols in the source file, it is difficult to measure the performance of a compression algorithm in general. The performance of data compression depends on the type of data, the structure of the input source, and the category of the compression algorithm: lossy or lossless [1]. The following measurements are used to evaluate the performance of lossless algorithms:
• Compression Ratio: the ratio between the size of the file after compression and the size of the file before compression.
  Compression Ratio = size after compression / size before compression
• Compression Factor: the inverse of the compression ratio, i.e. the ratio between the size of the file before compression and the size of the file after compression.
  Compression Factor = size before compression / size after compression
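Both measures are straightforward to compute; a small sketch, using the RLE example from Section 3.1 (29 input symbols compressed to 10):

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Compression Ratio = size after compression / size before compression."""
    return compressed_size / original_size

def compression_factor(original_size: int, compressed_size: int) -> float:
    """Compression Factor = size before / size after (inverse of the ratio)."""
    return original_size / compressed_size

# "AAAAAAABBBBCCCCCCCAAAAAADDDDD" (29 symbols) -> "7A4B7C6A5D" (10 symbols)
ratio = compression_ratio(29, 10)    # 10/29, about 0.345; smaller is better
factor = compression_factor(29, 10)  # 29/10 = 2.9
```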
5. Conclusion

In this paper, we have talked about the need for data compression and the situations in which lossy and lossless data compression are useful. Several algorithms used for lossless compression are described in brief and various conclusions are drawn. Techniques like the BWT (Burrows-Wheeler Transform) and the MTF (Move-to-front Transform) do not compress data themselves; they just transform the input stream, and the transformed stream then acts as input for better compression techniques. Run-length encoding is a good compression technique, but it is effective only when there is consecutive repetition of symbols or data; when such repetitions are not present, this compression does not work effectively. Huffman coding is a better compression technique than Shannon-Fano coding, but arithmetic coding is the most effective compression technique among all the techniques mentioned above. The compression speed of Huffman and Shannon-Fano coding is faster than that of arithmetic coding, but the compression ratio of arithmetic coding is far better than that of the other two. Furthermore, arithmetic coding reduces channel bandwidth and transmission time.

The compression ratio of any technique can be further improved by applying two techniques to the same data or message. For instance, we can first apply the BWT to the data and then apply any of the compression techniques like RLE or Huffman coding; this type of combination improves the compression ratio. Future work can be done on implementing the compression schemes so that searching and compression are faster.

References
[1] “Data Compression Methodologies for Lossless Data and Comparison between Algorithms”, International Journal of Engineering Science and Innovative Technology, Vol. 2, March 2013.
[2] I Made Agus Dwi Suarjaya, “A New Algorithm for Data Compression Optimization”, International Journal of Advanced Computer Science and Applications, Vol. 3, 2012.
[3] “A Survey on Different Compression Techniques and Bit Reduction Algorithm for Compression of Text/Lossless Data”, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, March 2013.
[4] “A Survey on the Different Text Data Compression Techniques”, International Journal of Advanced Research in Computer Science and Technology, Vol. 2, Feb 2013.
[5] M. Nelson, “Data Compression with the Burrows-Wheeler Transform”, Dr. Dobb's Journal, 1996.
[6] Ken Huffman, “Profile: David A. Huffman”, Scientific American, September 1991, pp. 54–58.
[7] Mark Daniel Ward, “Exploring Data Compression via Binary Trees”, International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 3, No. 8, 2012.
[8] Blelloch, E., “Introduction to Data Compression”, Computer Science Department, Carnegie Mellon University, 2002.

Author Profile
Apoorv Vikram Singh is currently enrolled in the 4th year of his B.Tech programme (2011-2015) at Motilal Nehru National Institute of Technology (MNNIT). He is an ace programmer and has developed many Android applications. In 2013, he received the title of “Mr. Avishkar” in the technical festival organised by his college. Besides programming, he has interests in playing football and listening to music.

Garima Singh is presently enrolled in the 4th year of her Integrated M.Tech programme (2011-2016) in Computer Science Engineering at Gautam Buddha University. She has already written papers in her 3rd year, but this one is her first to be published. She has interests in web development and has developed a website for her college fest. Besides this, she loves reading novels, painting and listening to music.
