0% found this document useful (0 votes)
4 views

Data encryption & comp 2

The document discusses an innovative approach to data compression and encryption for secure internet transmission, focusing on Intelligent Dictionary Based Encoding (IDBE) to enhance efficiency and security. It highlights the importance of preprocessing text to improve compression ratios and outlines the steps involved in creating a dynamic dictionary for encoding. Performance analysis indicates that IDBE outperforms traditional methods in both compression ratio and speed.

Uploaded by

tiktokpvr
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views

Data encryption & comp 2

The document discusses an innovative approach to data compression and encryption for secure internet transmission, focusing on Intelligent Dictionary Based Encoding (IDBE) to enhance efficiency and security. It highlights the importance of preprocessing text to improve compression ratios and outlines the steps involved in creating a dynamic dictionary for encoding. Performance analysis indicates that IDBE outperforms traditional methods in both compression ratio and speed.

Uploaded by

tiktokpvr
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

AN INTELLIGENT TEXT DATA ENCRYPTION AND COMPRESSION FOR

HIGH SPEED AND SECURE DATA TRANSMISSION OVER INTERNET


Dr. V.K. Govindan21
B.S. Shajee mohan
1. Prof. & Head CSED, NIT Calicut, Kerala
2. Assistant Prof.,CSED, L.B.S.C.E., Kasaragod, Kerala.

form which can be compressed with better efficiency and


ABSTRACT which exploits the natural redundancy of the language in
making the transformation. A strategy called Intelligent
Compression algorithms reduce the redundancy in data Dictionary Based Encoding (IDBE) is discussed to achieve this.
representation to decrease the storage required for that data. It has been observed that a preprocessing of the text prior to
Data compression offers an attractive approach to reducing conventional compression will improve the compression
communication costs by using available bandwidth efficiency much better. The intelligent dictionary based
effectively. Over the last decade there has been an encryption provides the required security.
unprecedented explosion in the amount of digital data
transmitted via the Internet, representing text, images, video, Key words: Data compression, BWT, IDBE, Star Encoding,
sound, computer programs, etc. With this trend expected to Dictionary Based Encoding, Lossless
continue, it makes sense to pursue research on developing
algorithms that can most effectively use available network 1. RELATED WORK AND BACKGROUND
bandwidth by maximally compressing data. It is also
important to consider the security aspects of the data being In the last decade, we have seen an unprecedented explosion of
transmitted while compressing it, as most of the text data textual information through the use of the Internet, digital
transmitted over the Internet is very much vulnerable to a library and information retrieval system. It is estimated that by
multitude of attacks. This presentation is focused on the year 2004 the National Service Provider backbone will
addressing this problem of lossless compression of text files have an estimated traffic around 30000Gbps and that the
wit an added security. Lossless compression researchers have growth will continue to be 100% every year. The text data
developed highly sophisticated approaches, such as Huffman competes for 45% of the total Internet traffic. A number of
encoding, arithmetic encoding, the Lempel-Ziv (LZ) family, sophisticated algorithms have been proposed for lossless text
Dynamic Markov Compression (DMC), Prediction by Partial compression of which BWT and PPM out perform the classical
Matching (PPM), and Burrows-Wheeler Transform (BWT) algorithms like Huffman, arithmetic and LZ families of Gzip
and Unix compress. The BWT is an algorithm that takes a
based algorithms. However, none of these methods has been
block of data and rearranges it using a sorting algorithm. The
able to reach the theoretical best-case compression ratio
resulting output block contains exactly the same data elements
consistently, which suggests that better algorithms may be
that it started with, differing only in their ordering. The
possible. One approach for trying to attain better compression
transformation is reversible; meaning the original ordering of
ratios is to develop new compression algorithms. An
the data elements can be restored with no loss of fidelity.
alternative approach, however, is to develop intelligent,
reversible transformations that can be applied to a source text The BWT is performed on an entire block of data at once.
that improve an existing, or backend, algorithm’s ability to Most of today's familiar lossless compression algorithms
compress and also offer a sufficient level of security of the operate in streaming mode, reading a single byte or a few
transmitted information. The latter strategy has been explored bytes at a time. But with this new transform, we want to
here operate on the largest chunks of data possible. Since the BWT
operates on data in memory, you may encounter files too big
Michael Burrows and David Wheeler recently released the
to process in one fell swoop. In these cases, the file must be
details of a transformation function that opens the door to
split up and processed a block at a time. The output of the
some revolutionary new data compression techniques. The
BWT transform is usually piped through a move-to-front
Burrows-Wheeler Transform, or BWT, transforms a block of
stage, then a run length encoder stage, and finally an entropy
data into a format that is extremely well suited for
encoder, normally arithmetic or Huffman coding. The actual
compression. The block sorting algorithm they developed
command line to perform this sequence will look like this:
works by applying a reversible transformation to a block of
input text. The transformation does not itself compress the BWT < input-file | MTF | RLE | ARI >
data, but reorders it to make it easy to compress with simple output-file
algorithms such as move to front encoding.
The basic philosophy of our secure compression is to The decompression is just the reverse process and look like
preprocess the text and transform it into some intermediate this
UNARI input-file | UNRLE | UNMTF | looks as though it is more compressible and at the same time
UNBWT > output-file does not offer any serious challenge to the hacker!
An alternate approach to this is to perform a lossless,
reversible transformation to a source file prior to applying an
2. AN INTELLIGENT DICTIONARY
existing compression algorithm. The transformation is BASED ENCODING
designed to make it easier to compress the source file. The star In these circumstances we propose a better encoding strategy,
encoding is generally used for this type of pre processing which will offer higher compression ratios and better security
transformation of the source text. Star-encoding works by towards all possible ways of attacks while transmission. The
creating a large dictionary of commonly used words expected objective of this paper is to develop a better transformation
in the input files. The dictionary must be prepared in advance, yielding greater compression and added security. The basic
and must be known to the compressor and decompressor. philosophy of compression is to transform text in to some
intermediate form, which can be compressed with better
Each word in the dictionary has a star-encoded equivalent, in efficiency and more secure encoding, which exploits the
which as many letters a possible are replaced by the '*' natural redundancy of the language in making this
character. For example, a commonly used word such the might transformation. We have explained the basic approach of our
be replaced by the string t**. The star-encoding transform compression method in the previous sentence but let us use the
simply replaces every occurrence of the word the in the input same sentence as an example to explain the point further. Let
file with t**. us rewrite it with a lot of spelling mistakes: Our philosophy of
compression is to trasfom the txt into som intermedate form
Ideally, the most common words will have the highest
which can be compresed with bettr efficency and which
percentage of '*' characters in their encoding. If done properly,
xploits the natural redundancy of the language in making
this means that transformed file will have a huge number of '*'
this tranformation. Most people will have no problem to read
characters. This ought to make the transformed file more
it. This is because our visual perception system recognizes each
compressible than the original plain text. The existing star
word with an approximate signature pattern for the word
encoding does not provide any compression as such but opposed to an actual and exact sequence of letters and we have
provide the input text a better compressible format for a later a dictionary in our brain, which associates each misspelled
stage compressor. The star encoding is very much weak and word with a corresponding, correct word. The signatures for
vulnerable to attacks. As an example, a section of text from the word for computing machinery could be arbitrary as long as
Project Guttenburg’s version of Romeo and Juliet looks like they are unique. The algorithm we developed is a two step
this in the original text: process consisting

But soft, what light through yonder window Step1: Make an intelligent dictionary
breaks?
Step2: Encode the input text data
It is the East, and Iuliet is the Sunne,
Arise faire Sun and kill the enuious The entire process can be summerised as follows.
Moone,
2.1 Encoding Algorithm
Who is already sicke and pale with griefe,
Start encode with argument input file inp
That thou her Maid art far more faire then
A. Read the dictionary and store all words and their codes in a
she
table
Running this text through the star-encoder yields the following
text: B . While inp is not empty
1.Read the characters from inp and form tokens.
B** *of*, **a* **g** *****g* ***d*r ***do*
b*e***? 2. If the token is longer than 1 character, then

It *s *** E**t, **d ***i** *s *** *u**e, 1.Search for the token in the table

A***e **i** *un **d k*** *** e****** 2. If it is not found,


M****,
1.Write the token as such in to the output file.
*ho *s a****** **c*e **d **le ***h ****fe,
Else
***t ***u *e* *ai* *r* f*r **r* **i** ***n 1.Find the length of the code for the
s** word.
You can clearly see that the encoded data has exactly the same
number of characters, but is dominated by stars. It certainly 2.The actual code consists of the length concatenated with the
code in the table, the length serves as a marker while decoding
and is represented by the ASCII characters 251 to254 with And the earth was without form, and void; and darkness
251 representing a code of length 1, 252 was upon the face of the deep. And the Spirit of God
moved upon the face of the waters.
3. Write the actual code into the output
And God said, Let there be light: and there was light.
file.
4. read the next character and neglect the And God saw the light, that it was good: and God
it if it is a space. If it is any other divided the light from the darkness.
character, make it the first character of
the next token and go back to B, after And God called the light Day, and the darkness he called
inserting a marker character (ASCII Night. And the evening and the morning were the first
255) to indicate the absence of a space day.

Else And God said, Let there be a firmament in the midst of


the waters, and let it divide the waters from the waters.
1. Write the 1 character token
And God called the firmament Heaven. And the evening
2. If the character is one of the ASCII characters and the morning were the second day.
251 –255, write the character once more so as to
show that it is part of the text and not a marker Running the text through the Intelligent Dictionary Based
Encoder (IDBE) yields the following text:
Endif

End (While)
û©û!ü%;ûNü'Œû!ü"ƒû"û!û˜ÿ. û*û!û˜û5ü"8ü"}ÿ, û"ü2Óÿ;
C. Stop. û"ü%Lû5ûYû!ü"nû#û!ü&“ÿ.
û*û!ü%Ìû#ûNü&ÇûYû!ü"nû#û!ü#Éÿ.
2.2. Dictionary Making Algorithm
û*ûNûAÿ, ü"¿û]û.ü"’ÿ: û"û]û5ü"’ÿ.
Start MakeDict with multiple source files as input
û*ûNü"Qû!ü"’ÿ, û'û1û5û²ÿ: û"ûNü(Rû!ü"’û;û!ü%Lÿ.
1. Extract all words from input files.
û*ûNûóû!ü"’ü%…ÿ, û"û!ü%Lû-ûóü9[ÿ.
2. If a word is already in the table increment the û*û!ü'·û"û!ü#¹ûSû!ûºûvÿ.
number of occurrence by 1, otherwise add it to the
table and set the number occurrence to 1. û*ûNû‚û!ü6 ÿ,
û"ü(Rû!ü#Éû:ûSü"2û!ü6 û;û!ü#Éû:ûSü"‚û!ü6 ÿ: û"û1û5ûeÿ.
3. Sort the table by frequency of occurrences in
descending order. û*ûNûóû!ü6•ü#Wÿ. û*û!ü'·û"û!ü#¹ûSû!ü"ßûvÿ

4. Start giving codes using the following method:


It is clear from the above sample data that the encoded text
i). Give the first 218 words the ASCII characters 33
to 250 as the code. provide a better compression and a stiff challenge to the
hacker! It may look as if the encoded text can be attacked
ii). Now give the remaining words each one using a conventional frequency analysis of the words in the
permutation of two of the ASCII characters (in the encoded text, but a detailed inspection of the dictionary
range 33 – 250), taken in order. If there are any making algorithm reveal that it is not so. An attacker can
remaining words give them each one permutation of decode the encoded text only if he knows the dictionary. The
three of the ASCII characters and finally if required dictionary on the other hand is a dynamically created one. It
permutation of four characters. depends on the nature of the text being encoded. The nature of
the text differs for different sessions of communication
5. Create a new table having only words and their between a server and client. In addition to this fact we suggest
codes. Store this table as the Dictionary in a file. a stronger encryption strategy for the dictionary transfer. A
proper dictionary management and transfer protocol can be
6. Stop. adopted for a more secure data transfer.

As an example, a section of the text from Canterbury


corpus version of bible.txt looks like this in the original
text:
2.3. Dictionary Management and Transfer
Protocol
In the beginning God created the heaven and the earth.
In order to make the system least vulnerable to possible attacks
by hackers, a suitable dictionary management and transfer
protocol can be devised. This topic is currently in our
consideration, but so far we haven’t implemented any models
for this as such. One suggested method for dictionary transfer
between server and client can be as per SSL (Secure Socket
Layer) Record Protocol, which provides basic security services
to various higher-level protocols such as HyperText Transport
Protocol (HTTP). A typical strategy can be accepted as follows:

The fist step is to fragment the dictionary in to chunks of


suitable size, say 16KB. Then an optional compression can be
applied. The next step is to compute a message authentication
code (MAC) over the compressed data. A secret key can be
used for this purpose. Cryptographic hash algorithm such as
SHA-1 or MD5 can be used for the calculation. The
compressed dictionary fragment and the MAC are encrypted
using symmetric encryption such as IDEA, DES or Fortezza.
The final process is to prepend the encrypted dictionary Table 1.0: BPC comparison of simple BWT, BWT
fragment with the header. with *Encode and BWT with IDBE in Calgary corpuses

3. PERFORMANCE ANALYSIS
Calgary corpuses
The performance issues such as Bits Per Character (BPC) and
conversion time are compared for the three cases i.e., simple File File BWT BWT with BWT with
BWT, BWT with Star encoding and BWT with Intelligent Names size *Encode IDBE
Dictionary Based Encoding (IDBE). The results are shown
graphically and prove that BWT with IDBE out performs all Kb BPC Time BPC Time BPC Time
other techniques in compression ratio, speed of compression
(conversion time) and have higher level of security.
bib 108.7 2.11 1 1.93 6 1.69 4
Fig.1.0: BPC & Conversion time comparison of transform
with BWT, BWT with *Encoding and BWT with IDBE for book1 750.8 2.85 11 2.74 18 2.36 11

Calgary corpus files. book2 596.5 2.43 9 2.33 14 2.02 10

geo 100.0 4.84 2 4.84 6 5.18 5

news 368.3 2.83 6 2.65 10 2.37 7

paper1 51.9 2.65 1 1.59 5 2.26 3

paper2 80.3 2.61 2 2.45 5 2.14 4

paper3 45.4 2.91 2 2.60 6 2.27 3

Paper4 13.0 3.32 2 2.79 5 2.52 3

Paper5 11.7 3.41 1 3.00 4 2.8 2

Paper6 37.2 2.73 1 2.54 5 2.38 3

progc 38.7 2.67 2 2.54 5 2.44 3

prog1 70.0 1.88 1 1.78 5 1.70 3

trans 91.5 1.63 2 1.53 5 1.46 4

Fig.2.0 :BPC & Conversion time comparison of transform


with BWT, BWT with *Encoding and BWT with IDBE for
Canterbury corpus files.
4. CONCLUSION
In an ideal channel, the reduction of transmission time is
directly proportional to the amount of compression. But in a
typical Internet scenario with fluctuating bandwidth,
congestion and protocols of packet switching, this does not
hold true. Our results have shown excellent improvement in
text data compression and added levels of security over the
existing methods. These improvements come with additional
processing required on the server/nodes

3. REFERENCES
1. M. Burrows and D. J. Wheeler. “A Block-sorting
Lossless Data Compression Algorithm”, SRC
Research Report 124, Digital Systems Research
Cente

2. H. Kruse and A. Mukherjee. “Data Compression


Using Text Encryption”, Proc. Data Compression
Conference, 1997, IEEE Computer Society Press,
1997, p. 447.
3 H. Kruse and A. Mukherjee. “Preprocessing Text to
Table 2.0: BPC comparison of simple BWT, BWT with Improve Compression Ratios”, Proc. Data
*Encode and BWT with IDBE in Canterbury corpuses Compression Conference, 1998, IEEE Computer
Society Press, 1997, p. 556.
Cantebury corpuses

File Names File BWT BWT with BWT with


4. N.J. Larsson. “The Context Trees of Block Sorting
size *Encode IDBE
Compression”, Proceedings of the IEEE Data
Compression Conference, March 1998, pp. 189-198.
Kb BPC Ti BPC Ti BPC Ti
me me me
5 A. Moffat. “Implementing the PPM Data
Compression Scheme”, IEEE Transactions on
Communications, COM-38, 1990, pp. 1917-1921.

alice29.txt 148.5 2.45 3 2.39 6 2.11 4 6 T. Welch, “A Technique for High-Performance Data
Compression”, IEEE Computer, Vol. 17, No. 6,
Asyoulik.txt 122.2 2.72 2 2.61 7 2.32 4 1984.

cp.html 24.0 2.6 1 2.27 4 2.13 3 7 R. Franceschini, H. Kurse, N. Zhang, R. Iqbal and A.
Mukherjee, “Lossless, Reversible Transformations
fields.c 10.9 2.35 0 2.20 4 2.06 3
that Improve Text Compression Ratios”, submitted
grammar.lsp 3.60 2.88 0 2.67 4 2.44 3 to IEEE Transactions on Multimedia Systems (June
2000).
kennedy.xls 1005. 0.81 10 0.82 17 0.98 17
6 8 F. Awan, and A. Mukherjee, “LIPT: A losskess Text
Transform to Improve Compression”, Proceedings of
Icet10.txt 416.8 2.38 7 2.25 12 1.87 7 International Conference on Information and
plrabn12.txt 470.6 2.80 10 2.69 13 2.30 8
Theory: Coding and computing, IEEE Computer
Society, Las Vegas Nevada, April 2001.
ptt5 501.2 0.85 27 0.85 33 0.86 31
9. N. Motgi and A. Mukherjee, “Network Conscious
sum 37.3 2.80 2 2.75 4 2.89 4 Text Compression Systems (NCTCSys)”,
Proceedings of International Conference on
xrgs.1 4.1 3.51 1 3.32 4 2.93 2
Information and Theory: Coding aand Computing,
IEEE Computer Society, Las Vegas Nevada, April
2001.

10. F. Awan, Nan Zhang N. Motgi, R.Iqbal and


A. Mukherjee, “LIPT: A reversible Lossless Text
Transformation to Improve Compression
Performance”, Proceedings of data Compression
Conference, Snowbird, Utah, March, 2001.

11. Dr. V. K. Govindan, B.S. Shajee Mohan “IDBE - An


Intelligent Dictionary Based Encoding Algorithm for
Text Data Compression for High Speed Data
Transmission Over Internet” Proceeding of the
International Conference on Intelligent Signal
Processing and Robotics
IIIT Allahabad February 2004.(Selected for
presentation.).

You might also like