A Simple Compression Scheme For Unicode Myanmar Documents

1) The document proposes a simple compression scheme for Unicode Myanmar documents to improve storage efficiency and transmission speeds. 2) Currently, Unicode Myanmar characters require two bytes of storage compared to one byte for ASCII characters. The proposed scheme substitutes ASCII characters for Unicode characters to reduce file sizes. 3) Decompression reverses the process by replacing ASCII characters with the original Unicode characters without any data loss, making it a lossless compression technique. The goal is to apply this efficiently to government, business, and individual websites and applications that use Myanmar Unicode text.


Proceedings of 10th International Conference on Science and Engineering 2019,

7-8 December 2019, Yangon, Myanmar

ICSE2019-ICT-16
A Simple Compression Scheme for Unicode Myanmar Documents
Pyae Phyo Aung#1, Zar Nay Lin#2, Myo Min Hein#3
Department of Computer Science, Defence Services Academy
Pyin Oo Lwin, Myanmar

Abstract - Efficient character encoding and data compression are necessary
and significant aspects of information and communication theory. Unicode
Myanmar text, such as text written with Myanmar3 or Pyidaungsu, requires
more space in memory during storage and takes more time to transmit. The use
of the Myanmar language for communication and storage has increased with the
digitization of government documents and orders. The lossless text
compression process for Myanmar Unicode documents substitutes an ASCII
character for each Unicode Myanmar character, since an ASCII character
occupies one byte whereas a Unicode character occupies between 1 and 4
bytes, depending on the encoding of the stored file. Decompression is the
reverse of the compression process: it replaces the ASCII characters with
the original Unicode characters.

Keywords - Compression, Decompression, Unicode, ASCII

I. INTRODUCTION
Many compression techniques are available for the different languages of the
world, and many have been developed for languages such as English, German
and French. There is a high demand for compression of the Myanmar language.
The Myanmar language, spoken by more than 30 million people as their first
language, is the official language used in the administrative, judicial and
commercial systems throughout the nation.
Nowadays, Unicode Myanmar characters are widely used in the digitalization
of the documents and orders of government, companies, institutions and the
military. No compression technique dedicated to the Myanmar language has
been available. Hence, we need to compress this data in an efficient manner.
The coding technique that reduces the size of data is called data
compression. The main characteristic of data compression is the
transformation of a string of characters into another set of symbols that
carries the same information but is as short as possible.
Lossy compression results in some loss of data from the original when the
decompression process is performed. Lossless compression, on the other hand,
recovers the original file exactly, without any loss of data. Compression
methods are available for text, image, audio, video and other data types.
This paper deals with lossless text compression for Unicode Myanmar
documents. Text compression is a decrease in the number of bits required to
represent the data. Compressed data occupies less storage capacity,
increases the speed of communication and reduces the cost of storage
hardware and network bandwidth. A Myanmar character needs two bytes in the
16-bit Unicode encoding, whereas an ASCII character occupies one byte. Hence
we have proposed a Unicode-based representation and compression technique
with a view to achieving better performance. The Unicode range reserved for
Myanmar characters is U+1000 to U+109F.

II. AIM
The main purposes of this research paper are to analyse and apply the
compression method for Myanmar Unicode character sets, to increase data
capacity and payload by means of the compression method, and to be able to
apply our proposed system efficiently in government, institutional,
commercial and individual websites and applications, with easy access to
Myanmar Unicode characters.

III. LITERATURE REVIEW
At present, because of the increasing amount of information, more memory is
needed for storage and the processing time for such data is very large. Data
compression can be used in network processing to save energy, because it
reduces the amount of data transmitted, and to decrease transfer time,
because the size of the data is reduced.

A. Related Works
Ajantha Devi and S. Santhosh Baboo [1] described embedded optical character
recognition on Tamil text images using a Raspberry Pi. B. Vijayalakshmi and
N. Sasirekha [2] proposed lossless text compression for Unicode Tamil
documents. Their results show that the method paves the way to storing Tamil
documents in minimal storage, with the compressed document reduced to almost
50% of its original size. Guy E. Blelloch [3] introduced data compression
techniques and described the other compression techniques in his
dissertation. Md. Abu Marjan and Md. Palash Uddin [5] proposed an efficient
algorithm for the representation and compression of large Bengali text. They
carried out an enhanced text representation scheme and an innovative
lossless approach for the compression of Bengali text. Maxime Crochemore and
Thierry Lecroq [6] published text data compression algorithms and described
the various compression methods together with their advantages and
disadvantages.
Very few applications need to compress Unicode data using a special
compression scheme. In certain situations, especially when the text contains
characters from multiple character sets, the compressed text can end up
larger than the uncompressed one.
According to the literature review, we have proposed
a Unicode-based representation and compression technique
for Unicode Myanmar documents to achieve better performance.

IV. BACKGROUND THEORY
In this section, the background theory of data compression methods is
described.

A. Data Compression Methods
Data compression is a coding strategy that significantly reduces the total
number of bits needed to store or transmit a document. Fig. 1 shows the data
compression and decompression process. Data compression techniques are
divided into two major categories, "lossy" and "lossless" data compression
techniques, as shown in Fig. 2. The compression ratio of a method depends on
the input data.
The compression ratio is the ratio between the size of the compressed file
and the size of the source file:
Compression Ratio = (After Compression / Before Compression) * 100    (4.1)
The saving percentage can be calculated with the following equation:
Saving Percentage = {(Before Compression - After Compression) / Before Compression} * 100    (4.2)

Fig. 1 Data compression process

Fig. 2 Types of data compression methods: lossy method (image, audio, video) and lossless method (text or program)

B. Myanmar Unicode Character Sets
Unicode is the most widely accepted industry standard for storing,
transmitting and documenting text. It was developed in conjunction with the
Universal Coded Character Set (UCS) standard and published as the Unicode
Standard. The latest version of Unicode contains a repertoire of more than
128,000 characters covering 135 modern and historic scripts, as well as
multiple symbol sets.
Unicode is designed to represent almost all characters in every language in
the world. Myanmar Unicode fonts have been produced according to the
instructions of the International Organization for Standardization (ISO) and
can be used on Microsoft Windows, Apple and Linux systems. Myanmar is a
Unicode block containing characters for the Burmese, Mon, Karen, Kayah, Shan
and Palaung languages of Myanmar. It is also used to write Pali and Sanskrit
in Myanmar. All the characters of the Burmese language are now encoded as
per the Universal Principle of Unicode. Unicode text occupies more space in
memory during storage. Fig. 3 shows the 160 Myanmar Unicode characters in
the range U+1000 to U+109F in Unicode version 5.2.

Fig. 3 Character points of Myanmar Unicode characters

V. PROPOSED ALGORITHM
To perform lossless data compression for Unicode Myanmar texts, each word in
the text to be compressed is indexed, and each character of the indexed
words is then separated. Each separated character is converted into its
equivalent Unicode value, which has four decimal digits. The decimal values
of the Unicode range for the Myanmar language run from 4096 to 4255. The
first decimal value of extended ASCII is 128 and the first decimal value of
Myanmar Unicode is 4096, so a Unicode decimal value is converted into an
extended ASCII value by means of the difference between these two values,
4096 - 128 = 3968, which is treated as a constant c. If the input Unicode
decimal value n exceeds 127, it is replaced by subtracting the value 3967,
as in the examples of Table I, which gives a new decimal value. Finally, all
the usable Myanmar characters are converted from 2-byte Unicode characters
to 1-byte character values in the range 128-255. Table I shows the
representation of some commonly used Myanmar characters.

A. Compression Algorithm
The following algorithm shows the compression process of our proposed
method:
1. Input the Myanmar Unicode text.
2. For each word,
   a. Take the Unicode value of each character.
   b. Convert each Unicode value into its decimal representation.
3. If the input decimal value exceeds 127, manipulate it by the proposed
   method to obtain a new decimal value.
4. Represent each new decimal value, referred to as the information value,
   together with its index, by its binary equivalent.
5. Store or transmit the binary numbers.
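The substitution at the heart of the compression steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' C#.NET implementation: the function name `compress`, the `offset` parameter and the error handling are hypothetical, and the word indexing of step 2 is omitted. The default offset is 3968 (4096 - 128), as derived in the text; passing 3967 reproduces the decimal values of Table I. Code points whose shifted value falls outside 128-255 are rejected because they cannot be stored in one byte.

```python
ASCII_MAX = 127        # values up to 127 are left unchanged
OFFSET = 4096 - 128    # 3968, the constant derived in the text (assumed default)


def compress(text: str, offset: int = OFFSET) -> bytes:
    """Replace each non-ASCII code point n by the one-byte value n - offset."""
    out = bytearray()
    for ch in text:
        n = ord(ch)                    # Unicode decimal value of the character
        if n <= ASCII_MAX:
            out.append(n)              # ASCII passes through as-is
            continue
        new_value = n - offset         # the (n - c) step of the method
        if not 128 <= new_value <= 255:
            # Assumption of this sketch: refuse input that would not round-trip.
            raise ValueError(f"U+{n:04X} does not fit in one byte")
        out.append(new_value)
    return bytes(out)


if __name__ == "__main__":
    # The six code points listed in Table I.
    word = "\u1019\u103c\u1014\u103a\u1019\u102c"
    packed = compress(word, offset=3967)
    print(list(packed))                          # decimal values
    print([format(b, "08b") for b in packed])    # 8-bit binary strings
```

Under this sketch, an offset of 3968 maps only U+1000 through U+107F into 128-255; the remaining code points of the block, up to U+109F, would need separate handling.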
B. Decompression Algorithm
Decompression is the reverse of the compression technique. The entire
encoded bit stream is converted into the indices and the information values.
All binary values are transformed into their decimal values, which form a
table in which one column contains the indices and the other column contains
the information values.
The decompression algorithm can be summarized as follows:
1. Input the substituted binary values of the compressed text.
2. Retrieve the indices and convert them into their decimal values.
3. Apply the reverse of the compression step: if a decimal value exceeds
   127, add back the constant subtracted during compression (stated as 3968).
4. Convert each decimal value into its Unicode character and then form the
   Myanmar words.
5. Form the original text by placing the words according to their indices.

TABLE I
EXAMPLE OF COMPRESSION PROCESS
Myanmar     Unicode     Binary string       Decimal     Manipulated by     New replaced
Unicode     (16 bits)   (16 bits)           value (n)   proposed method    value (8 bits)
character                                               (n - c)
မ           U+1019      0001000000011001    4121        154                10011010
ြ           U+103C      0001000000111100    4156        189                10111101
န           U+1014      0001000000010100    4116        149                10010101
်           U+103A      0001000000111010    4154        187                10111011
မ           U+1019      0001000000011001    4121        154                10011010
ာ           U+102C      0001000000101100    4140        173                10101101

VI. EXPERIMENTAL RESULT
This section presents the experimental results obtained on a platform with
an Intel Core i5 2.5 GHz CPU, 4 GB of RAM and the 64-bit Windows 10
operating system. The object-oriented programming language C#.NET, with
ASP.NET, was used to simulate the whole scheme.

Fig. 4 Performance of text compression process

Fig. 5 Performance of text compression process

Fig. 4 and 5 show the application testing results for our proposed system.
The compression and decompression process shown there reduces the size of
the Myanmar Unicode characters from 96 bits to 48 bits after compression. A
text file that contains Myanmar text in any one of the encodings Unicode,
Unicode big endian or UTF-8 is given as input to the proposed method. The
Unicode Myanmar characters (16 bits) are replaced with ASCII characters
(8 bits) using the proposed substitution method. The compressed file
contains ASCII characters and can be stored as an ANSI-encoded file.
The compressed file is reduced to 50% of its original size. This ANSI file
contains the compressed data as a collection of unreadable ASCII characters.
In future it will be easy for users to perform compression and decompression
online. Fig. 6 shows the variation in size for the test files listed in
Table II.

TABLE II
TESTING RESULT OF COMPRESSED DATA AND COMPRESSION RATIO
Test    Total        Original Size    Compressed Size (bits)    Compression
Data    Characters   (bits)           (proposed method)         Ratio
1       32           496              248                       50%
2       259          4144             2072                      50%
3       452          7232             3616                      50%
4       120          1920             960                       50%

In this analysis, we have used Myanmar Unicode characters. Traditionally,
the default Unicode size of a character is 16 bits. If the input message
contains both English and Myanmar Unicode characters, the compression ratio
changes depending on the input message. The more Unicode characters the
input message contains, the better the compression ratio.
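The figures reported in this section can be checked with a short sketch. It assumes that the original size is measured in a 16-bit encoding (UTF-16 big endian, without a byte-order mark) unless another encoding is passed in, and it uses a hypothetical `compress`/`decompress` pair with the offset of 3968 from Section V. All function names are illustrative and are not the authors' code. Eq. (4.1) and Eq. (4.2) are implemented directly.

```python
OFFSET = 4096 - 128  # 3968, assumed constant from Section V


def compress(text: str, offset: int = OFFSET) -> bytes:
    """Keep ASCII; shift every other code point into the range 128-255."""
    out = bytearray()
    for ch in text:
        n = ord(ch)
        value = n if n <= 127 else n - offset
        if n > 127 and not 128 <= value <= 255:
            raise ValueError(f"U+{n:04X} does not fit in one byte")
        out.append(value)
    return bytes(out)


def decompress(data: bytes, offset: int = OFFSET) -> str:
    """Reverse step: add the offset back to every value above 127."""
    return "".join(chr(b + offset) if b > 127 else chr(b) for b in data)


def compression_ratio(before: int, after: int) -> float:
    return after / before * 100                  # Eq. (4.1)


def saving_percentage(before: int, after: int) -> float:
    return (before - after) / before * 100       # Eq. (4.2)


def report(text: str, encoding: str = "utf-16-be"):
    """Return (original bits, compressed bits, ratio) after a round-trip check."""
    before = len(text.encode(encoding)) * 8
    packed = compress(text)
    assert decompress(packed) == text            # the scheme must be lossless
    after = len(packed) * 8
    return before, after, compression_ratio(before, after)


if __name__ == "__main__":
    word = "\u1019\u103c\u1014\u103a\u1019\u102c"   # the six characters of Table I
    print(report(word))                    # pure Myanmar text, 16-bit input
    print(report("Hi " + word))            # mixed text, 16-bit input
    print(report("Hi " + word, "utf-8"))   # mixed text, UTF-8 input
    print(compression_ratio(4144, 2072))   # test data 2 of Table II
```

When the input is measured in UTF-16, the ratio stays at 50% even for mixed text, because ASCII characters also take 16 bits there. Measured against UTF-8 input, where ASCII takes one byte and a Myanmar character takes three, the ratio depends on the mix of characters.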
Fig. 6 Performance result of various text data

After compressing the Myanmar Unicode characters, a new encoding format for
Myanmar Unicode characters can be designed in the manner of the Windows code
pages. The compressed data can serve as an 8-bit character encoding designed
to cover Myanmar characters. It can also be used as a second choice after
the UTF-8 encoding and the ANSI format. UTF-8 encodes each Unicode Myanmar
character in 3 bytes, while a compressed character stored as UTF-8 uses only
2 bytes. Fig. 7 shows the experimental result of a website using our
proposed compression method. A website administrator can use the compression
method, and the data can be stored in other encoding formats such as Latin
or UTF-8, with each compressed character carrying an 8-bit value. We have
simplified the data compression process by avoiding tables and dictionaries.
Our proposed system is used to improve data capacity and reduce the
processing time for data communication. English and Myanmar Unicode
characters can be combined in the compression and decompression process at
the same time.

Fig. 7 Experimental result of website using proposed method

VII. CONCLUSIONS
In this paper, we have described an enhanced text representation and an
innovative lossless approach for the compression of Myanmar Unicode text. As
the proposed scheme can be extended to any Unicode standard, it may be
employed very easily for the compression of any large Unicode text. Since it
needs little memory, it can be adapted to compress text on small-memory
devices. Unicode text messaging can also be greatly facilitated by the
proposed Unicode text compression method. The compression technique also
works well when the source document contains both ASCII characters and
Unicode Myanmar characters. The proposed method has been developed for, and
is applicable to, Unicode text; its main idea can be used for other natural
languages by defining, manipulating and substituting their text values.
Thus, it will be useful for real-time applications such as web services,
image steganography and data communication processes.

ACKNOWLEDGEMENT
I am deeply grateful to my supervisor, Dr. Zar Nay Linn, Lecturer of the
Department of Computer Science at the Defence Services Academy, who gave
encouragement, suggestions, support and help throughout the study, and to my
co-supervisor, Dr. Myo Min Hein, for his helpful and constructive
suggestions.

REFERENCES
[1] Ajantha Devi and S. Santhosh Baboo, "Embedded Optical Character
    Recognition on Tamil Text Image using Raspberry Pi", International
    Journal of Computer Science Trends and Technology, Vol. 2, No. 4,
    pp. 11-15, 2014.
[2] B. Vijayalakshmi and N. Sasirekha, "Lossless text compression for
    Unicode Tamil documents", ICTACT Journal on Soft Computing,
    January 2018, Volume: 08, Issue: 02.
[3] Guy E. Blelloch, "Introduction to Data Compression", PhD
    Dissertation, Computer Science Department, Carnegie Mellon
    University, 2001.
[4] K. Ashok Babu and V. Satish Kumar, "Implementation of Data
    Compression Using Huffman Coding", International Conference on
    Methods and Models in Computer Science, India, 2010.
[5] Md. Abu Marjan, Md. Palash Uddin, Masud Ibn Afjal and Md. Dulal
    Haque, "Developing an Efficient Algorithm for Representation and
    Compression of Large Bengali Text", Papers of The 9th International
    Forum on Strategic Technology (IFOST), October 21-23, 2014.
[6] Maxime Crochemore and Thierry Lecroq, "Text Data Compression
    Algorithms", International Journal of Computer Science, King's
    College London, October 2015.
[7] Williams, Aaron (2013). "The greedy Gray code algorithm".
    Proceedings of the 13th International Symposium on
    Algorithms and Data Structures (WADS). London (Ontario,
    Canada). pp. 525–536. doi:10.1007/978-3-642-40104-6_46.
