A Survey On Different Text Data Compression Techniques
Abstract: Data compression refers to the process of reducing data size by removing excessive information. The main objective of data compression is to reduce the amount of redundant information in stored or communicated data. Data compression is quite useful as it helps to reduce resource usage, such as data storage space or transmission capacity. It finds application in the areas of file storage and distributed systems, because in a distributed system we need to send data from and to all the systems. Data compression techniques are mainly used for speed and performance efficiency, along with keeping down the cost of transmission. There are a number of different data compression methodologies, which are used to compress different data formats such as text, video, audio, and image files. Data compression techniques can be broadly classified into two major categories, “lossy” and “lossless” data compression. In this paper, different basic lossless data compression methods are reviewed and a conclusion is drawn on the basis of these methods.
Keywords: Data Compression, Lossless data compression, Lossy data compression, encoding, coding, video compression
Run-length encoding (RLE) is a very simple data compression technique whose basic principle is to count the number of consecutive identical data items and then use that count for compression: if any data item ‘d’ occurs ‘k’ times in an input stream, the k occurrences are replaced by the single pair ‘kd’. RLE is mainly used to compress runs of the same data byte, and it works well when there is a lot of repetition of data items. Thus, RLE is often used to compress bitmap images, especially low-bit-depth ones.
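The counting scheme described above can be sketched in a few lines of Python (the function names are illustrative, not from any particular library; each run of k copies of an item becomes a (k, item) pair):

```python
def rle_encode(data):
    """Collapse each run of a repeated item into a (count, item) pair."""
    encoded = []
    i = 0
    while i < len(data):
        run_start = i
        # advance i to the end of the current run of identical items
        while i < len(data) and data[i] == data[run_start]:
            i += 1
        encoded.append((i - run_start, data[run_start]))
    return encoded

def rle_decode(pairs):
    """Expand (count, item) pairs back into the original sequence."""
    return "".join(item * count for count, item in pairs)

print(rle_encode("WWWWBBBW"))  # [(4, 'W'), (3, 'B'), (1, 'W')]
```

As the example shows, RLE only pays off when runs are long; on data with little repetition the encoded form can be larger than the input.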
Input: ^BANANA|

All rotations:    Rotations sorted in alphabetical order:
^BANANA|          ANANA|^B
BANANA|^          ANA|^BAN
ANANA|^B          A|^BANAN
NANA|^BA          BANANA|^
ANA|^BAN          NANA|^BA
NA|^BANA          NA|^BANA
A|^BANAN          ^BANANA|
|^BANANA          |^BANANA

Output (last column of the sorted rows): BNN^AA|A
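The transformation in the table above can be sketched as a short Python function (a toy illustration: real Burrows-Wheeler implementations also record an index so the transform can be inverted, and use suffix sorting instead of materialising every rotation):

```python
def bwt(text):
    """Burrows-Wheeler Transform: sort all rotations, keep the last column."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort()  # plain character-order sort of the rotations
    return "".join(r[-1] for r in rotations)

print(bwt("^BANANA|"))  # BNN^AA|A
```

The output groups identical characters together (three As become adjacent), which is what makes the transform useful as a preprocessing step for run-length or move-to-front coding.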
Example: Input Stream: banana
Output Stream: 1, 1, 13, 1, 1, 1, 0, 0

The symbols are first sorted by probability, with the most probable element at the top of the list. The table is then divided into two parts, such that the sums of the probabilities of the two parts are as close as possible. The left half of the list is assigned ‘0’ and the right half is assigned ‘1’. The division and bit-assignment steps are then repeated for each of the two halves, further subdividing the groups and adding bits to the codes; the process stops when each symbol has a corresponding leaf on the tree.

Example:

Symbol         A       B       C       D       E
Frequency      15      7       6       6       5
Probability    0.3846  0.1795  0.1538  0.1538  0.1282

After going through all the steps mentioned above we get:

Symbol  A   B   C   D    E
Code    00  01  10  110  111

On calculating the average number of bits, we get it to be around 2.28 bits.

International Journal of Science and Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358. Volume 3, Issue 7, July 2014, www.ijsr.net. Paper ID: 020141298. Licensed Under Creative Commons Attribution CC BY.

Arithmetic Coding

Arithmetic coding encodes the whole message at once and gives a single floating number as output.
a) In the first step, we calculate the frequency count of the different symbols.
b) In the second step we encode the string by dividing up the interval [0, 1] and allocating each letter an interval whose size depends on how often it occurs in the string.
c) In the third step we consider the next letter and subdivide the interval of that letter in the same way. We carry on through the message, and, continuing in this way, we obtain the required interval.

A message is represented by a half-open interval [a, b) where a and b are real numbers between 0 and 1. Initially, the interval is [0, 1). As the message becomes longer, the length of the interval shortens and the number of bits needed to represent the interval increases. [4]
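The interval-narrowing steps a)-c) can be sketched as follows. The probability model below is an assumed toy example, and the function is illustrative only; a practical arithmetic coder works with scaled integer arithmetic to avoid floating-point precision limits:

```python
def arithmetic_interval(message, probs):
    """Narrow [low, high) once per symbol, in proportion to symbol probabilities."""
    # cumulative start of each symbol's sub-interval within [0, 1)
    starts, total = {}, 0.0
    for sym, p in probs.items():
        starts[sym] = total
        total += p
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low
        # shrink the current interval to this symbol's slice of it
        high = low + width * (starts[sym] + probs[sym])
        low = low + width * starts[sym]
    return low, high  # any number in [low, high) encodes the message

probs = {"a": 0.5, "b": 0.25, "c": 0.25}  # assumed model
low, high = arithmetic_interval("abc", probs)
print(low, high)  # 0.34375 0.375
```

Note how each symbol multiplies the interval width by its probability, so frequent symbols shrink the interval less and therefore cost fewer bits, exactly as the half-open-interval description above states.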
3.6 Huffman Coding

Huffman coding is a more sophisticated and efficient lossless data compression technique. In Huffman coding the characters in a data file are converted to binary codes: the most common characters in the file get the shortest binary codes, and the least common get the longest binary codes [7].
1. Initialization: put the elements in a list sorted according to their frequency counts.
2. Repeat the following steps until the sorted list has only one node left:
   a) From the list, pick the two elements with the lowest frequency counts. Form a Huffman subtree that has these two nodes as child nodes, and create a parent node for them.
   b) Assign the sum of the children's frequency counts to the parent node and insert it into the list. Now, considering the parent node as one of the nodes of the list, again pick the two lowest frequency counts and form a Huffman subtree.
3. Label the edge from each parent to its left child with the digit 0 and the edge to its right child with 1. The code word for each source letter is the sequence of labels along the path from the root to the leaf node representing that letter.

4. Measuring Compression Performance

A performance measure is used to find which technique is better according to some criterion. The performance of a compression algorithm can be measured on the basis of different criteria, depending upon the nature of the application. The most important thing to keep in mind while measuring performance is space efficiency; time efficiency is also an important factor. Since compression behavior depends on the redundancy of symbols in the source file, it is difficult to measure the performance of a compression algorithm in general. The performance of data compression depends on the type of data and the structure of the input source, and on the category of the compression algorithm: lossy or lossless [1]. The following are some measurements used to calculate the performance of lossless algorithms.

Compression Ratio: the ratio between the size of the file after compression and the size of the file before compression.

Compression Ratio = size after compression / size before compression

Compression Factor: the inverse of the compression ratio, i.e. the ratio between the size of the file before compression and the size of the file after compression.

Compression Factor = size before compression / size after compression
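The Huffman construction in steps 1-3 can be sketched with a binary heap. The code below is an illustrative Python sketch, not from the paper: it builds codes for the A-E frequency table given earlier, then computes the average code length and a compression ratio, assuming (as a baseline of our own) that uncompressed symbols take 8 bits each:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build Huffman codes by repeatedly merging the two lowest-frequency nodes."""
    tie = count()  # tie-breaker so equal frequencies never compare the dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # label the edge to the left subtree 0 and to the right subtree 1
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}  # table from the text
codes = huffman_codes(freqs)
total = sum(freqs.values())
avg_bits = sum(freqs[s] * len(codes[s]) for s in freqs) / total
ratio = avg_bits / 8  # compression ratio against the assumed 8-bit baseline
print(codes, round(avg_bits, 2))
```

For this table the Huffman tree gives A a 1-bit code and the other symbols 3-bit codes, an average of about 2.23 bits per symbol, slightly better than the 2.28 bits of the earlier code table, which illustrates why Huffman coding is optimal among prefix codes.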
References
[1] “Data Compression Methodologies for Lossless Data and Comparison between Algorithms”, International Journal of Engineering Science and Innovative Technology, Vol. 2, March 2013.
[2] I Made Agus Dwi Suarjaya, “A New Algorithm for Data Compression Optimization”, International Journal of Advanced Computer Science and Applications, Vol. 3, 2012.
[3] “A Survey on Different Compression Techniques and Bit Reduction Algorithm for Compression of Text/Lossless Data”, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 3, March 2013.
[4] “A Survey on the Different Text Data Compression Techniques”, International Journal of Advanced Research in Computer Science and Technology, Vol. 2, Feb 2013.
[5] Nelson, M., 1996. Data Compression with the Burrows-Wheeler Transform. Dr. Dobb's Journal.
[6] Ken Huffman. Profile: David A. Huffman, Scientific American, September 1991, pp. 54-58.
[7] Mark Daniel Ward, “Exploring Data Compression via Binary Trees,” International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 3, No. 8, 2012.
[8] Blelloch, G. E., 2002. Introduction to Data Compression, Computer Science Department, Carnegie Mellon University.
Author Profile
Apoorv Vikram Singh is currently enrolled in the 4th year of his B.Tech programme (2011-2015) at Motilal Nehru National Institute of Technology (MNNIT). He is an ace programmer and has developed many Android applications. In 2013, he received the title of “Mr. Avishkar” in the technical festival organised by his college.
Besides programming, he has interests in playing football and
listening to music.