A Survey On Different Text Data Compression Techniques
Abstract: Data compression refers to the process of reducing data size by removing excessive information. The main objective of data compression is to reduce the amount of redundant information in stored or communicated data. Data compression is quite useful as it helps to reduce resource usage, such as data storage space or transmission capacity. It finds application in the areas of file storage and distributed systems, because in a distributed system we need to send data from and to all the systems. Data compression techniques are mainly used for speed and performance efficiency, along with keeping down the cost of transmission. There are a number of different data compression methodologies, which are used to compress different data formats such as text, video, audio, and image files. Data compression techniques can be broadly classified into two major categories, “lossy” and “lossless” data compression. In this paper, different basic lossless data compression methods are reviewed and a conclusion is drawn on the basis of these methods.
Keywords: Data Compression, Lossless data compression, Lossy data compression, encoding, coding, video compression
Run-length encoding (RLE) is a very simple data compression technique whose basic principle is to count the number of consecutive identical data items and then use that count for compression: if any data item ‘d’ occurs ‘k’ times in an input stream, the k occurrences are replaced by the single pair ‘kd’. RLE is mainly used to compress runs of the same data byte, and it works well when there is a lot of repetition of data items. Thus, RLE is often used to compress bitmap images, especially low-bit-depth ones.
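The counting scheme described above can be sketched in a few lines of Python (the function names are illustrative, not from any particular library; each run of k copies of an item becomes a (k, item) pair):

```python
def rle_encode(data):
    """Collapse each run of a repeated item into a (count, item) pair."""
    encoded = []
    i = 0
    while i < len(data):
        run_start = i
        # advance i to the end of the current run of identical items
        while i < len(data) and data[i] == data[run_start]:
            i += 1
        encoded.append((i - run_start, data[run_start]))
    return encoded

def rle_decode(pairs):
    """Expand (count, item) pairs back into the original sequence."""
    return "".join(item * count for count, item in pairs)

print(rle_encode("WWWWBBBW"))  # [(4, 'W'), (3, 'B'), (1, 'W')]
```

As the example shows, RLE only pays off when runs are long; on data with little repetition the encoded form can be larger than the input.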
Input: ^BANANA|

All rotations:    Rotations sorted in alphabetical order:
^BANANA|          ANANA|^B
BANANA|^          ANA|^BAN
ANANA|^B          A|^BANAN
NANA|^BA          BANANA|^
ANA|^BAN          NANA|^BA
NA|^BANA          NA|^BANA
A|^BANAN          ^BANANA|
|^BANANA          |^BANANA

Output (last column of the sorted rows): BNN^AA|A
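The transformation in the table above can be sketched as a short Python function (a toy illustration: real Burrows-Wheeler implementations also record an index so the transform can be inverted, and use suffix sorting instead of materialising every rotation):

```python
def bwt(text):
    """Burrows-Wheeler Transform: sort all rotations, keep the last column."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort()  # plain character-order sort of the rotations
    return "".join(r[-1] for r in rotations)

print(bwt("^BANANA|"))  # BNN^AA|A
```

The output groups identical characters together (three As become adjacent), which is what makes the transform useful as a preprocessing step for run-length or move-to-front coding.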
Example: Input Stream: banana
Output Stream: 1, 1, 13, 1, 1, 1, 0, 0

The symbols are first sorted by probability, with the most probable element at the top of the list. The table is then divided into two parts, such that the sums of the probabilities of the two parts are as close as possible. The left half of the list is assigned ‘0’ and the right half is assigned ‘1’. The division and bit-assignment steps are then repeated for each of the two halves, further subdividing the groups and adding bits to the codes; the process stops when each symbol has a corresponding leaf on the tree.

Example:

Symbol         A       B       C       D       E
Frequency      15      7       6       6       5
Probability    0.3846  0.1795  0.1538  0.1538  0.1282

After going through all the steps mentioned above we get:

Symbol  A   B   C   D    E
Code    00  01  10  110  111

On calculating the average number of bits, we get it to be around 2.28 bits.

International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358. Volume 3, Issue 7, July 2014, www.ijsr.net. Paper ID: 020141298. Licensed Under Creative Commons Attribution CC BY.

Arithmetic Coding

Arithmetic coding encodes the whole message at once and gives a single floating number as output.
a) In the first step, we calculate the frequency count of the different symbols.
b) In the second step we encode the string by dividing up the interval [0, 1] and allocating each letter an interval whose size depends on how often it occurs in the string.
c) In the third step we consider the next letter and subdivide the interval of that letter in the same way. We carry on through the message, and, continuing in this way, we obtain the required interval.

A message is represented by a half-open interval [a, b) where a and b are real numbers between 0 and 1. Initially, the interval is [0, 1). As the message becomes longer, the length of the interval shortens and the number of bits needed to represent the interval increases. [4]
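The interval-narrowing steps a)-c) can be sketched as follows. The probability model below is an assumed toy example, and the function is illustrative only; a practical arithmetic coder works with scaled integer arithmetic to avoid floating-point precision limits:

```python
def arithmetic_interval(message, probs):
    """Narrow [low, high) once per symbol, in proportion to symbol probabilities."""
    # cumulative start of each symbol's sub-interval within [0, 1)
    starts, total = {}, 0.0
    for sym, p in probs.items():
        starts[sym] = total
        total += p
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        # shrink the current interval to this symbol's slice of it
        high = low + width * (starts[sym] + probs[sym])
        low = low + width * starts[sym]
    return low, high  # any number in [low, high) encodes the message

probs = {"a": 0.5, "b": 0.25, "c": 0.25}  # assumed model
low, high = arithmetic_interval("abc", probs)
print(low, high)  # 0.34375 0.375
```

Note how each symbol multiplies the interval width by its probability, so frequent symbols shrink the interval less and therefore cost fewer bits, exactly as the half-open-interval description above states.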
3.6 Huffman Coding

Huffman coding is a more sophisticated and efficient lossless data compression technique. In Huffman coding the characters in a data file are converted to binary codes: the most common characters in the file get the shortest binary codes, and the least common get the longest binary codes [7].
1. Initialization: put the elements in a list sorted according to their frequency counts.
2. Repeat the following steps until the sorted list has only one node left:
   a) From the list, pick the two elements with the lowest frequency counts. Form a Huffman subtree that has these two nodes as child nodes, and create a parent node for them.
   b) Assign the sum of the children's frequency counts to the parent node and insert it into the list. Now, considering the parent node as one of the nodes of the list, again pick the two lowest frequency counts and form a Huffman subtree.
3. Label the edge from each parent to its left child with the digit 0 and the edge to its right child with 1. The code word for each source letter is the sequence of labels along the path from the root to the leaf node representing that letter.

4. Measuring Compression Performance

A performance measure is used to find which technique is better according to some criterion. The performance of a compression algorithm can be measured on the basis of different criteria, depending upon the nature of the application. The most important thing to keep in mind while measuring performance is space efficiency; time efficiency is also an important factor. Since compression behavior depends on the redundancy of symbols in the source file, it is difficult to measure the performance of a compression algorithm in general. The performance of data compression depends on the type of data and the structure of the input source, and on the category of the compression algorithm: lossy or lossless [1]. The following are some measurements used to calculate the performance of lossless algorithms.

Compression Ratio: the ratio between the size of the file after compression and the size of the file before compression.

Compression Ratio = size after compression / size before compression

Compression Factor: the inverse of the compression ratio, i.e. the ratio between the size of the file before compression and the size of the file after compression.

Compression Factor = size before compression / size after compression
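The Huffman construction in steps 1-3 can be sketched with a binary heap. The code below is an illustrative Python sketch, not from the paper: it builds codes for the A-E frequency table given earlier, then computes the average code length and a compression ratio, assuming (as a baseline of our own) that uncompressed symbols take 8 bits each:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build Huffman codes by repeatedly merging the two lowest-frequency nodes."""
    tie = count()  # tie-breaker so equal frequencies never compare the dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # label the edge to the left subtree 0 and to the right subtree 1
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}  # table from the text
codes = huffman_codes(freqs)
total = sum(freqs.values())
avg_bits = sum(freqs[s] * len(codes[s]) for s in freqs) / total
ratio = avg_bits / 8  # compression ratio against the assumed 8-bit baseline
print(codes, round(avg_bits, 2))
```

For this table the Huffman tree gives A a 1-bit code and the other symbols 3-bit codes, an average of about 2.23 bits per symbol, slightly better than the 2.28 bits of the earlier code table, which illustrates why Huffman coding is optimal among prefix codes.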
References
[1] “Data Compression Methodologies for Lossless Data and Comparison between Algorithms”, International Journal of Engineering Science and Innovative Technology, Vol. 2, March 2013.
[2] I Made Agus Dwi Suarjaya, “A New Algorithm for Data Compression Optimization”, International Journal of Advanced Computer Science and Applications, Vol. 3, 2012.
[3] “A Survey on Different Compression Techniques and Bit Reduction Algorithm for Compression of Text/Lossless Data”, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, March 2013.
[4] “A Survey on the Different Text Data Compression Techniques”, International Journal of Advanced Research in Computer Science and Technology, Vol. 2, Feb 2013.
[5] Nelson, M., 1996. Data Compression with the Burrows-Wheeler Transform. Dr. Dobb's Journal.
[6] Ken Huffman. Profile: David A. Huffman, Scientific American, September 1991, pp. 54-58.
[7] Mark Daniel Ward, “Exploring Data Compression via Binary Trees,” International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 3, No. 8, 2012.
[8] Blelloch, G. E., 2002. Introduction to Data Compression, Computer Science Department, Carnegie Mellon University.
Author Profile
Apoorv Vikram Singh is currently enrolled in the 4th year of his B.Tech programme (2011-2015) at Motilal Nehru National Institute of Technology (MNNIT). He is an ace programmer and has developed many Android applications. In 2013, he received the title of “Mr. Avishkar” in the technical festival organised by his college.
Besides programming, he has interests in playing football and
listening to music.