Data Compression Engine Enhancement Using Huffman Coding Algorithm
KAN FU GIANG
MAY 2011
I declare that this project report entitled Data Compression Enhancement Using
Huffman Coding Algorithm is the result of my own research except as cited in
references. The project report has not been accepted for any degree and is not
concurrently submitted in candidature of any other degree.
Signature :
Name : Kan Fu Giang
Date : 20 May 2011
Specially dedicated to
my family and friends
ACKNOWLEDGEMENTS
Last but not least, I would like to express my appreciation to my family who
are always there supporting me in all aspects.
ABSTRACT
As the use of computers and networking increases day by day, data storage and
transmission have become a concern because both cost money. Data compression is
needed to increase the effective communication bandwidth and the effective storage
capacity. This thesis presents a widely used data compression technique for text
files, LZSS, and its enhancement using Huffman Coding. The LZSS compression
algorithm and its enhancement are modeled in the C language. The design is verified
by running the compression and decompression processes on a sample text file.
Performance is analyzed by comparing the compression ratio before and after the
enhancement. The analysis shows that the enhancement using Huffman Coding is
successful.
ABSTRAK
TABLE OF CONTENTS
DECLARATION
DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
LIST OF APPENDICES
1 INTRODUCTION
1.1 Problem Statement
1.2 Objectives
1.3 Scope of Work
1.5 Organization of Project
3.1 Project Procedures
3.3 Summary
4.1 LZSS Algorithm
4.3 Summary
5 ANALYSIS
5.1 Design Verification
5.2 Performance Analysis
5.3 Summary
6 CONCLUSIONS
6.1 Concluding Remarks
REFERENCES
APPENDIX
LIST OF TABLES
TABLE NO    TITLE
5.1         Compression ratio comparison
LIST OF FIGURES
FIGURE NO   TITLE
3.1         Project Workflow
4.1         Dictionary and Incoming Source Data in Array
LIST OF SYMBOLS
IC      Integrated Circuit
LZ77    Lempel-Ziv-77
LZSS    Lempel-Ziv-Storer-Szymanski
xii
LIST OF APPENDICES
APPENDIX
TITLE
PAGE
25-31
32-37
38-50
CHAPTER 1
INTRODUCTION
1.1 Problem Statement
As the use of computers and similar digital technology products increases, most
documents and information are stored in digital form. Many people tend to digitize
everything to make life simpler and easier, and the internet and networking
contribute further to this trend. As a result, data storage and data transmission
have become the main concerns of the users of these technologies.
Data transmission and storage cost money. The more information is handled, the
more it costs. In spite of this, most digital data are not stored in the most
compact form (Steven, 2007). To obtain the most compact form, the digital data must
be compressed. This process is called data compression. Since all information
sources have inherent redundancy (Shannon, 1948), data compression techniques
remove that redundancy by representing the data in fewer bits. This increases the
effective storage capacity and reduces the consumption of expensive resources such
as hard disk space and transmission bandwidth.
Many data compression techniques have been developed to date. Their performance
differs depending on the characteristics of the source data. However, research is
still being conducted to seek further improvements in data compression.
1.2 Objectives
From the issues that discussed in the previous section, the objectives of this
1.3 Scope of Work
In this project, the data compression engine is designed as a C behavioral model.
LZSS data compression is used as a case study to explore the data compression
process.
Huffman coding is then used to enhance the data compression engine. The
performance is examined based on the compression ratio before and after the
enhancement.
1.8
2. Huffman coding has been proposed as an enhancement for the LZSS data
compression.
1.5 Organization of Project
This project is organized into 6 chapters. The first chapter introduces the
Chapter four describes the design of LZSS and Huffman coding, including the
algorithms, flow charts and block diagram of the overall system design.
Chapter five reports the simulation results of the design. The results are
analyzed to validate the performance of the data compression engine.
Chapter six summarizes the project work and states all deliverables of the
project. Recommendations for potential future work are also provided.
CHAPTER 2
2.5
source data in such a way as to obtain a simple representation with at most a
tolerable loss of fidelity [Nelson, 1996]. The source data can take many forms, such
as text, image, sound, or any combination of these, such as video. Data compression
is popular for two reasons:
(1) People like to accumulate data and hate to throw anything away. No matter
how big a storage device one has, sooner or later it is going to overflow. Data
compression seems useful because it delays this inevitability [David Salomon,
2007].
(2) People hate to wait a long time for data transfers. When sitting at the
computer, waiting for a Web page to come in or for a file to download, we
naturally feel that anything longer than a few seconds is a long time to wait
[David Salomon, 2007].
There are two common types of data compression: lossless compression and lossy
compression [Ida Mengyi Pu, 2006]. Lossless compression can reconstruct the
original data exactly from the compressed data. No data is lost during the
compression process, so it is also known as reversible compression. In contrast,
lossy compression allows some information loss during the compression process in
exchange for better compression effectiveness. Lossless compression is suitable for
text files and executable files, which tolerate no information loss, while lossy
compression is best suited to multimedia files such as image, video and audio,
where noise can be discarded from the source data.
compression, prior knowledge of the characteristics of the source data is needed
to generate a predefined encoding table that is reused in the decompression phase.
The encoding table has to be sent separately to the decompression controller. This
kind of data compression algorithm (e.g., Huffman coding) is more efficient in
compression and decompression as long as the symbol probability distribution is
constant. In contrast to statistical compression, dictionary-based compression is
a universal data compression approach. The algorithm does not require prior
knowledge of the source data [Rissanen, 1983]. It learns the characteristics of the
source data and generates a dictionary as a reference for the codewords during the
compression phase. Moreover, the dictionary does not need to be sent to the
decompression controller, because the decompression algorithm regenerates it
during the decompression phase.
2.6
algorithms for lossless storage. Most of the widely used data compression algorithms
are based on LZ compression [Lempel and Ziv 1977, 1978]. In LZ compression, a
dictionary is built from the previously seen symbols in the source data and used to
detect repeated symbols in the incoming data. The detected repeated symbols are
then replaced with a smaller codeword. LZ compression is a dictionary-based
compression algorithm of the kind described above, so LZ compression and its
variants are universal data compression algorithms. The Lempel-Ziv-77 (LZ77)
algorithm, proposed by Abraham Lempel and Jacob Ziv in 1977, is the first LZ
compression algorithm [Lempel and Ziv 1977].
coding is a variable-length entropy encoding algorithm that attempts to reduce the
number of bits required to represent a string of symbols. Frequently used symbols
are replaced by shorter codewords, while seldom used symbols receive longer
codewords. In many practical applications, Huffman coding is applied after an
encoding process that produces non-uniformly distributed codewords, to obtain
better compression performance. Therefore, Huffman coding is proposed as the
enhancement of the data compression model in this project.
High speed data compression and decompression cores were designed and developed
at Universiti Teknologi Malaysia [Yeem Kah Meng, 2002]. The hardware was based on
the LZSS algorithm and Huffman coding. It was designed as a parameterized module
for easy configurability, providing a suitable compromise between the constraints
of hardware resources, processing speed and compression saving. An improved
version of the data compression engine was introduced in 2007 at Universiti
Teknologi Malaysia [Roslin bin Mohd Sabri, 2007]. It addressed drawbacks such as
limited portability across programmable logic devices and resolved the abnormal
behavior of the decompression processor core.
2.7
source data. Source data in this compact form improves data storage and data
transmission. As a result, a higher data transfer rate can be achieved [Hifn, 1998].
Hence, data compression is applicable in many fields. Lossy image compression is
used in digital cameras to increase storage capacity with minimal degradation of
picture quality. Lossless data compression is used in many applications. For
example, it is used in the popular ZIP file format and in the Unix tool gzip. It is
also often used as a component within lossy data compression technologies. Besides
that, cryptosystems often apply lossless data compression before encryption for
added security. One proposed example uses Huffman coding for data reduction before
encryption [Nilkesh Patra and Sila Siba Sankar, 2007].
Data compression also plays a vital role in the field of IC testing. As IC designs
become more complex, a larger test data volume is required. The larger test data
volume results in more power consumption, which may damage the circuit-under-test.
To solve this problem, test data compression is applied [Chandra, 2002].
2.4 Summary
Data compression is widely used in various applications, especially in the fields
of data storage and high speed communication. Lossless and lossy compression are
the two main classes of data compression techniques, each with its own field of
application. LZSS is used in this project because it is a widely used universal
data compression algorithm with a good compression ratio. Huffman coding is usually
applied after an encoding process that produces non-uniformly distributed
codewords, to obtain better compression performance. Hence, it is suitable for
enhancing the data compression model in this project.
CHAPTER 3
3.1 Project Procedures
This project is generally divided into two modeling parts: LZSS compression, and
LZSS compression enhanced with Huffman Coding. The work is carried out in two
phases. The first phase is the study of the algorithms to be implemented and a
literature review of previous work. The second phase is the design implementation,
in which the algorithms are implemented in the C language.
First, the project starts by defining the objectives from the problem statement.
Then the scope of work is identified. Next comes the study of the tools and
techniques required for the project. The tool used is Microsoft Visual Studio 2010
and the technique used is C language programming. The data compression engine is
implemented as two models in C: one is LZSS compression and the other is LZSS
compression enhanced with Huffman Coding. Lastly, the compression ratios of both
models are analyzed to verify the significance of the enhancement. Literature
review and documentation are carried out throughout the project.
input source data in the experiment to get the compression ratio because both LZSS
and Huffman Coding are lossless compression algorithms.
3.5
This section describes briefly the tool and technique used in the project.
3.5.1
3.3 Summary
Methodology and project workflow of the project are explained and
CHAPTER 4
The details of the LZSS and Huffman Coding data compression algorithms
are presented in this chapter.
4.1
LZSS Algorithm
LZSS is a compression algorithm which based on the dictionary model, is one
4.3.1
array. This array is a combination of the dictionary and the incoming source data,
separated by a position pointer. The position pointer moves from the beginning to
the end of the array during compression. Data in front of the position pointer is
the incoming source data, whereas the data behind the position pointer is treated
as a reference (dictionary) for the incoming source data.
Figure 4.1: Dictionary and Incoming Source Data in Array
As the position pointer moves, data from the incoming source is compared with the
strings in the dictionary to find the longest matching string. The longest match is
bounded by a predefined maximum match length. If no matching string exists, or its
length is less than the minimum match length, the current symbol is output as a
literal. Otherwise, the matching string is replaced by a fixed-length LZSS
codeword. The position pointer then moves forward to the incoming source data that
has not yet been compared. The matching process is repeated until the position
pointer reaches the end of the array.
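The search for the longest match described above can be sketched in C as follows.
This is only an illustrative sketch, not the FindMatches() routine of Appendix B:
the type lz_match, the function find_longest_match and its parameters are
hypothetical names, and it assumes that a match may not run past the position
pointer and that ties are resolved in favour of the nearest match.

```c
#include <assert.h>
#include <stddef.h>

/* Result of one dictionary search. offset == 0 means "no match". */
typedef struct {
    size_t offset; /* distance back from pos to the start of the match */
    size_t length; /* number of matching symbols */
} lz_match;

/*
 * Search the dictionary data[0..pos) for the longest string that matches the
 * incoming data starting at data[pos]. Only the last `window` symbols are
 * searched and the match is limited to `max_len` symbols.
 */
static lz_match find_longest_match(const unsigned char *data, size_t size,
                                   size_t pos, size_t window, size_t max_len)
{
    lz_match best = {0, 0};
    size_t start = (pos > window) ? pos - window : 0;

    assert(pos <= size);
    for (size_t cand = start; cand < pos; cand++) {
        size_t len = 0;
        /* Compare symbol by symbol, staying inside dictionary and input. */
        while (len < max_len && cand + len < pos && pos + len < size &&
               data[cand + len] == data[pos + len]) {
            len++;
        }
        /* Assumption: among equally long matches the nearest one wins. */
        if (len > 0 && len >= best.length) {
            best.length = len;
            best.offset = pos - cand;
        }
    }
    return best;
}
```

The caller would compare the returned length with the minimum match length to
decide between emitting a literal and emitting an (offset, length) codeword.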
The LZSS compression algorithm can be simplified into the following steps:
1.
5. Shift a copy of the symbols written to the encoded output from the unencoded
string to the dictionary.
6. Read a number of symbols from the uncoded input equal to the number of symbols
written in Step 4.
7. Repeat from Step 3 until the entire input has been encoded.
4.3.2
4.2
1. Sort source outputs in decreasing order of their probabilities.
2. Merge the two least-probable outputs into a single output whose probability is
the sum of the corresponding probabilities.
3. If the number of remaining outputs is more than 2, then go to step 1.
4. Arbitrarily assign 0 and 1 as codewords for the two remaining outputs.
5. If an output is the result of the merger of two outputs in a preceding step,
append the current codeword with a 0 and a 1 to obtain the codeword of the
preceding outputs and repeat step 5. If no output is preceded by another
output in a preceding step, then stop.
During the compression process, each symbol in the source data is replaced with
its codeword by referring to the code table. Decompression is the reverse: the
encoded data is compared bit by bit with the codewords in the code table to recover
the symbol that each codeword represents.
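The merging procedure listed above determines how many bits each symbol receives.
The following C sketch illustrates it for a small alphabet. It is an assumption
for illustration only and is not the Huffman implementation of Appendix C: the
names huffman_code_lengths and MAX_SYMBOLS are hypothetical, occurrence counts
stand in for probabilities, and only the codeword lengths are computed, not the
bit patterns.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYMBOLS 16 /* illustrative limit, not taken from the project */

/*
 * Compute Huffman codeword lengths for n symbols from their occurrence
 * counts (all counts are assumed to be positive). lengths[i] receives the
 * number of bits assigned to symbol i.
 */
static void huffman_code_lengths(const unsigned counts[], size_t n,
                                 unsigned lengths[])
{
    unsigned weight[2 * MAX_SYMBOLS];
    int parent[2 * MAX_SYMBOLS];
    int active[2 * MAX_SYMBOLS];
    size_t nodes = n;

    assert(n >= 1 && n <= MAX_SYMBOLS);
    for (size_t i = 0; i < n; i++) {
        weight[i] = counts[i];
        parent[i] = -1;
        active[i] = 1;
    }
    /* Repeatedly merge the two least-probable outputs into a single output. */
    for (size_t remaining = n; remaining > 1; remaining--) {
        int lo1 = -1, lo2 = -1; /* smallest and second smallest active node */
        for (size_t i = 0; i < nodes; i++) {
            if (!active[i]) continue;
            if (lo1 < 0 || weight[i] < weight[lo1]) {
                lo2 = lo1;
                lo1 = (int)i;
            } else if (lo2 < 0 || weight[i] < weight[lo2]) {
                lo2 = (int)i;
            }
        }
        weight[nodes] = weight[lo1] + weight[lo2];
        parent[nodes] = -1;
        active[nodes] = 1;
        active[lo1] = active[lo2] = 0;
        parent[lo1] = parent[lo2] = (int)nodes;
        nodes++;
    }
    /* A symbol's codeword length is its depth below the final merged node. */
    for (size_t i = 0; i < n; i++) {
        unsigned depth = 0;
        for (int p = parent[i]; p >= 0; p = parent[p]) depth++;
        lengths[i] = depth;
    }
}
```

For counts of 5, 2, 1 and 1 the sketch assigns lengths of 1, 2, 3 and 3 bits, so
the most frequent symbol gets the shortest codeword. With a single symbol the
computed length is 0, which a practical encoder would have to handle separately.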
4.3 Summary
The detail of LZSS and Huffman Coding compression algorithms are
CHAPTER 5
This chapter reports the results obtained from testing the C behavioral model of
data compression before and after the enhancement. It starts with the design
verification, followed by the performance analysis.
5.1 Design Verification
The design is verified by running experiments on the data compression engine using
the LZSS and Huffman Coding executable files in the Command Prompt. The input data
of the experiment is a text file with redundancy. The sizes and contents of the
input and output files are compared to verify the design.
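The comparison of file contents can also be automated. The following C sketch is
one possible way to do it and is not part of the project code; the function name
files_identical is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if both files can be opened and hold identical bytes, else 0. */
static int files_identical(const char *path_a, const char *path_b)
{
    FILE *fa, *fb;
    int same;

    assert(path_a != NULL && path_b != NULL);
    fa = fopen(path_a, "rb");
    fb = fopen(path_b, "rb");
    same = (fa != NULL && fb != NULL);

    while (same) {
        int ca = fgetc(fa);
        int cb = fgetc(fb);
        if (ca != cb) {
            same = 0;   /* a differing byte, or one file ended early */
        } else if (ca == EOF) {
            break;      /* both files ended at the same position */
        }
    }
    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return same;
}
```

A lossless engine passes this check when the decompressed output file is compared
with the original input file.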
5.1.1 LZSS Compression Engine
The input file of this experiment is infile.txt, with a size of 8265 bytes. The
LZSS executable file (LZSS.exe) is run in compression mode from the Command Prompt
to produce the encoded file (encodefile.txt), with a size of 2120 bytes. LZSS.exe
is then run in decompression mode on the encoded file. The output file of the
decompression process is outfile.txt, with a size of 8265 bytes. The file size
decreases after compression, and the input and output files are identical. Hence,
the design of the LZSS compression engine is verified. Figures 5.1-5.3 show the
input file, encoded file and output file of the LZSS compression engine.
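The size reduction reported for this experiment can be expressed as a percentage.
The C sketch below assumes the same expression that main.cpp in Appendix B prints
as its compression ratio, namely 100 - (compressed size / original size) x 100.
The function name compression_saving_percent is hypothetical.

```c
#include <assert.h>

/* Percentage of the original size that is removed by compression. */
static double compression_saving_percent(unsigned long original_size,
                                         unsigned long compressed_size)
{
    assert(original_size > 0);
    return 100.0 - ((double)compressed_size / (double)original_size) * 100.0;
}
```

With the sizes reported for the LZSS engine (8265 bytes in, 2120 bytes out), this
expression gives a saving of roughly 74.3%.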
mode to decompress the encoded file. The output file of the decompression process
is outfile.txt, with a size of 8265 bytes. The file size decreases after
compression, and the input and output files are identical. Hence, the design of the
Huffman Coding compression engine is verified. Figures 5.1, 5.4 and 5.5 show the
input file, encoded file and output file of the Huffman Coding engine.
5.2
Performance Analysis
The performance analysis of this project is compression ratio comparison
Compression Ratio
From Table 5.1, a better compression ratio is obtained when enhanced LZSS is used
for compression. These compression ratios are based on the same input file
(infile.txt).
5.3
Summary
The results of the design verification and performance analysis are presented in
this chapter. The LZSS, Huffman Coding and enhanced LZSS compression engines are
verified to function correctly and as expected. In the performance analysis, the
compression ratio is better when the enhanced LZSS compression engine is used. The
next chapter presents the conclusion of this project.
CHAPTER 6
CONCLUSIONS
6.1
Concluding Remarks
Data storage and data transmission cost money. Moreover, the data we use often
contain redundancy. To address these problems, data compression is applied. LZSS
is a widely used universal data compression algorithm. Huffman Coding is usually
applied after LZSS encoding to obtain a better compression saving. Therefore, LZSS
and Huffman Coding are used in this project to implement a better data compression
engine.
6.2
Verification should not be limited to text files. Other types of source data, such
as PDF files, executable files, Word files and program source code, can be included
in the verification to obtain more accurate and trustworthy results.
REFERENCES
Chandra, A. (2002). Low-Power Scan Testing and Test Data Compression for
System-on-a-Chip. IEEE Trans. Computer-Aided Design of Integrated Circuits and
Systems. Vol 21.
David Salomon. (2007). Data Compression: The Complete Reference, Fourth Edition.
Springer-Verlag London Limited. 2
Hifn. (1999). 9600 Data Compression Processor. Hi/fn Inc.
Huffman, D.A. (1952). A Method for the Construction of Minimum Redundancy
Codes. Proceedings IRE. Vol 40. 1098-1102.
Ida Mengyi Pu. (2006). Fundamental Data Compression. Great Britain.
Butterworth-Heinemann. 5-7
Lempel, A. and Ziv, J. (1977). A Universal Algorithm for Sequential Data
Compression. IEEE Trans. Information Theory. Vol IT-23. 337-343.
Lempel, A. and Ziv, J. (1978). Compression of Individual Sequences via
Variable-Rate Coding. IEEE Trans. Information Theory. Vol IT-24. 530-536.
Nelson, M. (1996). The Data Compression Book. Hungry Minds. 11-23
Nilkesh Patra and Sila Siba Sankar. (2007). Data Reduction By Huffman Coding And
Encryption By Insertion Of Shuffled Cyclic Redundancy Code. National Institute of
Technology Rourkela. Degree Thesis.
Roslee Bin Mohd Sabri (2007). Register Transfer Level Design of Compression
Processor Core Using Verilog Hardware Description Language. Universiti
Teknologi Malaysia. Master Thesis.
Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System
Technical Journal. 27, 379-656.
Steven W. Smith. (2007, June 14). Data compression tutorial: Part 1. Retrieved
from http://www.eetimes.com/design/signal-processing-dsp/4017497/Datacompression-tutorial-Part-1/
Yeem Kah Meng (2002). LZSS Compression Core Design Using VHDL. Universiti
Teknologi Malaysia. Master Thesis.
Rissanen, J. (1983). A Universal Data Compression System. IEEE Trans. Information
Theory. IT-29. 656-664
APPENDIX A
A.1
- Next byte is "D". Has this symbol been seen before? No. Encode it as a
literal.
Output data = 0 01000001(0"A") 0 01000010(0"B") 0 01000011(0"C")
0 01000100(0"D")
5. Input Data = "ABCDABCA"
*
- Next byte is "A". Has this symbol been seen before? Yes. When did this
symbol last appear? It was 4 bytes ago (offset = 4). How similar is the data
4 bytes ago to the current data? Three bytes are the same: "ABC". So encode
this as "1" (to indicate a match) followed by (4, 3).
Output data = 0 01000001(0"A") 0 01000010(0"B") 0 01000011(0"C")
0 01000100(0"D") 1 0100 0011 (1 43)
6. Input Data = "ABCDABCA"
*
- Next byte is "A". Has this symbol been seen before? Yes, 3 bytes ago.
But this time only 1 byte can be matched. Encode it as "1" followed by
(3, 1).
Output data = 0 01000001(0"A") 0 01000010(0"B") 0 01000011(0"C")
0 01000100(0"D") 1 0100 0011 (1 43) 1 0011 0001 (1 31)
A.2
show the LZSS algorithm with a match length limitation. Again, let * be the data
position. In this case, the minimum match length is 2 bytes, hence matches shorter
than 2 bytes are not represented as a (position, length) pair.
1. Input Data = "AABBCBBAABC"
*
- Start at the first byte "A". Has this symbol been seen before? No. Encode
it as a literal byte: "0" (to indicate a literal) followed by the data "A".
Output data = 0 01000001 (0"A")
2. Input Data = "AABBCBBAABC"
*
- Next byte is "A". Has this symbol been seen before? Yes, but only 1 byte
can be copied. The match length is less than the minimum match length (2),
so encode it as a literal.
Output data = 0 01000001(0"A") 0 01000001(0"A")
3. Input Data = "AABBCBBAABC"
*
- Next byte is "B". Has this symbol been seen before? No. Encode it as a
literal.
Output data = 0 01000001(0"A") 0 01000001(0"A") 0 01000010(0"B")
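4. Output data = 0 01000001(0"A") 0 01000001(0"A") 0 01000010(0"B")
0 01000010(0"B")
5. Output data = 0 01000001(0"A") 0 01000001(0"A") 0 01000010(0"B")
0 01000010(0"B") 0 01000011(0"C")
6. Output data = 0 01000001(0"A") 0 01000001(0"A") 0 01000010(0"B")
0 01000010(0"B") 0 01000011(0"C") 1 00110010(3, 2)
The data position is thus increased by 2 bytes instead of 1 byte.
7. Output data = 0 01000001(0"A") 0 01000001(0"A") 0 01000010(0"B")
0 01000010(0"B") 0 01000011(0"C") 1 00110010(3, 2) 1 01110011(7, 3)
8. Output data = 0 01000001(0"A") 0 01000001(0"A") 0 01000010(0"B")
0 01000010(0"B") 0 01000011(0"C") 1 00110010(3, 2) 1 01110011(7, 3)
0 01000011(0"C")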
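The walk-through with a minimum match length can be reproduced with a small greedy
tokenizer. The C sketch below is illustrative only and is not the Compress()
routine of Appendix B: the type lz_token and the function lzss_tokenize are
hypothetical, tokens are kept as structures rather than packed into bits, the
whole preceding input is used as the dictionary, and ties are assumed to go to the
nearest match.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int is_match;          /* 0 = literal, 1 = (offset, length) pair */
    unsigned char literal; /* valid when is_match == 0 */
    size_t offset, length; /* valid when is_match == 1 */
} lz_token;

/*
 * Greedy LZSS-style tokenizer. Returns the number of tokens written to out,
 * which must have room for `size` tokens (the worst case is all literals).
 */
static size_t lzss_tokenize(const unsigned char *data, size_t size,
                            size_t min_len, size_t max_len, lz_token *out)
{
    size_t pos = 0, count = 0;

    assert(min_len >= 1 && max_len >= min_len);
    while (pos < size) {
        size_t best_len = 0, best_off = 0;
        for (size_t cand = 0; cand < pos; cand++) {
            size_t len = 0;
            while (len < max_len && cand + len < pos && pos + len < size &&
                   data[cand + len] == data[pos + len]) {
                len++;
            }
            if (len > 0 && len >= best_len) { /* ties go to the nearest match */
                best_len = len;
                best_off = pos - cand;
            }
        }
        if (best_len >= min_len) {
            out[count].is_match = 1;
            out[count].literal = 0;
            out[count].offset = best_off;
            out[count].length = best_len;
            pos += best_len; /* skip every symbol covered by the match */
        } else {
            out[count].is_match = 0;
            out[count].literal = data[pos];
            out[count].offset = 0;
            out[count].length = 0;
            pos += 1;
        }
        count++;
    }
    return count;
}
```

For the input "AABBCBBAABC" with a minimum match length of 2, the sketch yields
five literals (A, A, B, B, C), the pairs (3, 2) and (7, 3), and a final literal C.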
A.3
Let's check whether the next item is an LZSS literal or an LZSS (Offset, Length)
pair.
     A            B            C            D
0 01000001 | 0 01000010 | 0 01000011 | 0 01000100 | 1 0100 011 | 1 0011 001
*
Since the flag is 0, it is an LZSS literal. Take one byte and put it into the
output file as shown below.
01000001 01000010
Let's check whether the next item is an LZSS literal or an LZSS (Offset, Length)
pair.
     A            B            C            D
0 01000001 | 0 01000010 | 0 01000011 | 0 01000100 | 1 0100 011 | 1 0011 001
*
Since the flag is 0, it is an LZSS literal. Take one byte and put it into the
output file as shown below.
01000001 01000010 01000011
Let's check whether the next item is an LZSS literal or an LZSS (Offset, Length)
pair.
     A            B            C            D
0 01000001 | 0 01000010 | 0 01000011 | 0 01000100 | 1 0100 011 | 1 0011 001
*
Since the flag is 0, it is an LZSS literal. Take one byte and put it into the
output file as shown below.
01000001 01000010 01000011 01000100
Let's check whether the next item is an LZSS literal or an LZSS (Offset, Length)
pair.
     A            B            C            D
0 01000001 | 0 01000010 | 0 01000011 | 0 01000100 | 1 0100 011 | 1 0011 001
*
Since the flag is 1, it is an LZSS (Offset, Length) pair. Go back 4 bytes
(Offset), take 3 bytes (Length) and put them into the output file.
01000001 01000010 01000011 01000100 01000001 01000010 01000011
Let's check whether the next item is an LZSS literal or an LZSS (Offset, Length)
pair.
     A            B            C            D
0 01000001 | 0 01000010 | 0 01000011 | 0 01000100 | 1 0100 011 | 1 0011 001
*
Again the flag is 1, so it is an LZSS (Offset, Length) pair. Go back 3 bytes
(Offset), take 1 byte (Length) and put it into the output file.
01000001 01000010 01000011 01000100 01000001 01000010 01000011 01000001
That is, ABCDABCA. The original data is recovered.
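The decoding walk-through can be sketched in C as follows. This is a simplified
illustration rather than the Uncompress() routine of Appendix B: the type lz_token
and the function lzss_expand are hypothetical, and the tokens are given as
structures instead of being read bit by bit from the compressed stream.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int is_match;          /* 0 = literal, 1 = (offset, length) pair */
    unsigned char literal; /* valid when is_match == 0 */
    size_t offset, length; /* valid when is_match == 1 */
} lz_token;

/* Expands tokens into out and returns the number of bytes written. */
static size_t lzss_expand(const lz_token *tokens, size_t count,
                          unsigned char *out, size_t capacity)
{
    size_t pos = 0;

    for (size_t i = 0; i < count; i++) {
        if (tokens[i].is_match) {
            /* The offset must point back into data that is already decoded. */
            assert(tokens[i].offset >= 1 && tokens[i].offset <= pos);
            size_t src = pos - tokens[i].offset;
            /* Copy byte by byte from the already decoded output. */
            for (size_t k = 0; k < tokens[i].length && pos < capacity; k++) {
                out[pos++] = out[src++];
            }
        } else if (pos < capacity) {
            out[pos++] = tokens[i].literal;
        }
    }
    return pos;
}
```

Feeding it the literals A, B, C, D followed by the pairs (4, 3) and (3, 1)
reproduces the eight bytes ABCDABCA of the example.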
APPENDIX B
lzss_compress.cpp
} // End If
} // End while
CompressedStreamWriteBits(0, 16);
CompressedStreamWriteBits(0, 0);
*nCompressedSize = m_nCompressedSize;
return HS_LZSS_E_OK;
} // Compress()
if (nTempLen >= HS_LZSS_MATCHLEN)
{
nTempLen = HS_LZSS_MATCHLEN;
nBestLen = nTempLen;
nBestOffset = nDataStreamPos - nWPos;
break; // Force the while loop to end
}
nBestLen = nTempLen;
nBestOffset = nDataStreamPos - nWPos;
}
} // End if
nWPos++;
} // End while
if ( (nBestOffset == 0) )
return false; // No match
else
{
m_nDataStreamPos = m_nDataStreamPos + nBestLen;
*nOffset = nBestOffset;
*nLen = nBestLen;
return true; // Match! :)
}
} // FindMatches()
lzss_uncompress.cpp
// Uncompressed size (4 bytes)
CompressedStreamReadBits( &nTemp, 16);
*nUncompressedSize = ((unsigned long)nTemp) << 16;
CompressedStreamReadBits( &nTemp, 16);
*nUncompressedSize = *nUncompressedSize | (unsigned long)nTemp;
return HS_LZSS_E_OK;
} // GetUncompressedSize()
int HS_LZSS::Uncompress(unsigned char *bCompressedData, unsigned char *bData)
{
unsigned int nTemp;
unsigned int nOffset, nLen;
unsigned long nTempPos;
// Set up initial values
m_nDataStreamPos = 0; // We are at the start of the input data
m_nCompressedStreamPos = 0; // We are at the start of the compressed data
m_nCompressedLong = 0; // Compressed stream temporary 32bit value
m_nCompressedBitsUsed = 0; // Number of bits used in temporary value
m_bCompressedData = bCompressedData; // Pointer to our input buffer
m_bData = bData; // Pointer to our output buffer
m_nDataSize = 0; // We will get this from the input
// Skip the LZSS alg ID (4 bytes)
CompressedStreamReadBits(&nTemp, 16);
CompressedStreamReadBits(&nTemp, 16);
// Get the uncompressed size (4 bytes)
CompressedStreamReadBits( &nTemp, 16);
m_nDataSize = ((unsigned long)nTemp) << 16;
CompressedStreamReadBits( &nTemp, 16);
m_nDataSize = m_nDataSize | (unsigned long)nTemp;
// Skip the window and bit lengths (not yet implemented) 2 bytes
CompressedStreamReadBits(&nTemp, 16);
// Perform decompression until we fill our predicted buffer
while(m_nDataStreamPos < m_nDataSize)
{
// Read in the 1 bit flag
CompressedStreamReadBits(&nTemp, 1);
// Was it a literal byte, or a (offset,len) match pair?
if (nTemp == HS_LZSS_MATCH)
{
// Read the offset and length
CompressedStreamReadBits(&nOffset, HS_LZSS_WINDOWBITS);
CompressedStreamReadBits(&nLen, HS_LZSS_MATCHBITS);
// Write out our match
nTempPos = m_nDataStreamPos - nOffset;
while (nLen > 0)
{
nLen--;
m_bData[m_nDataStreamPos++] = m_bData[nTempPos++];
}
}
else
{
// Output a literal byte
CompressedStreamReadBits(&nTemp, 8);
m_bData[m_nDataStreamPos++] = (unsigned char)nTemp;
}
}
return HS_LZSS_E_OK;
} // Uncompress()
main.cpp
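#include <stdio.h>
#include <stdlib.h> // malloc, free
#include <string.h> // stricmp
#include <windows.h>
#include "lzss.h"
int main(int argc, char* argv[])
{
HS_LZSS oLZSS;
FILE *fptr;
unsigned char *bmyData;
unsigned char *bmyCompressedData;
unsigned long nUncompressedSize;
unsigned long nCompressedSize;
int nRes;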
if (argc != 4)
{
}
// Compress function
if (!stricmp("-c", argv[1]))
{
if ( (fptr = fopen(argv[2], "rb")) == NULL)
{
printf("Error opening input file.\n");
return 0; // Error
}
fseek(fptr, 0, SEEK_END);
nUncompressedSize = ftell(fptr);
printf("Input file size : %lu\n", nUncompressedSize);
bmyData = (unsigned char *)malloc(nUncompressedSize);
bmyCompressedData = (unsigned char *)malloc(nUncompressedSize);
fseek(fptr, 0, SEEK_SET);
fread(bmyData, sizeof(unsigned char), nUncompressedSize, fptr);
fclose(fptr);
nRes = oLZSS.Compress(bmyData, bmyCompressedData, nUncompressedSize, &nCompressedSize);
if (nRes != HS_LZSS_E_OK)
{
printf("File not worth compressing.\n");
return 0;
}
if ( (fptr = fopen(argv[3], "w+b")) == NULL)
{
printf("Error opening output file.\n");
free(bmyData);
free(bmyCompressedData);
return 0; // Error
}
fwrite(bmyCompressedData, sizeof(unsigned char), nCompressedSize, fptr);
fclose(fptr);
printf("Output file size : %lu\n", nCompressedSize);
printf("Compression ratio: %f%%\n",
100 - (((float)nCompressedSize / (float)nUncompressedSize)*100.0) );
free(bmyData);
free(bmyCompressedData);
return 0;
}
// Uncompress function
if (!stricmp("-u", argv[1]))
{
if ( (fptr = fopen(argv[2], "rb")) == NULL)
{
printf("Error opening input file.\n");
return 0; // Error
}
fseek(fptr, 0, SEEK_END);
nCompressedSize = ftell(fptr);
printf("Input file size : %lu\n", nCompressedSize);
bmyCompressedData = (unsigned char *)malloc(nCompressedSize);
fseek(fptr, 0, SEEK_SET);
fread(bmyCompressedData, sizeof(unsigned char), nCompressedSize, fptr);
fclose(fptr);
nRes = oLZSS.GetUncompressedSize(bmyCompressedData, &nUncompressedSize);
if (nRes != HS_LZSS_E_OK)
{
printf("Error not a valid LZSS file.\n");
return 0;
}
bmyData = (unsigned char *)malloc(nUncompressedSize);
oLZSS.Uncompress(bmyCompressedData, bmyData);
if ( (fptr = fopen(argv[3], "w+b")) == NULL)
{
printf("Error opening output file.\n");
free(bmyData);
free(bmyCompressedData);
return 0; // Error
}
fwrite(bmyData, sizeof(unsigned char), nUncompressedSize, fptr);
fclose(fptr);
printf("Output file size: %lu\n", nUncompressedSize);
free(bmyData);
free(bmyCompressedData);
return 0;
}
return 0;
}
LZSS.h
#ifndef __HS_LZSS_H
#define __HS_LZSS_H
// Error codes
#define HS_LZSS_E_OK          0    // OK
#define HS_LZSS_E_BADCOMPRESS 1    // Compressed file would be bigger than source!
#define HS_LZSS_E_NOTLZSS     2    // Not a valid LZSS data stream
// Stream flags
#define HS_LZSS_LITERAL       0    // Just output the literal byte
#define HS_LZSS_MATCH         1    // Output a (offset, len) match pair
// Compression options, these will be user selectable in a later version
#define HS_LZSS_WINDOWLEN     1023 // Sliding window size (10 bits, 0-1023)
#define HS_LZSS_WINDOWBITS    10   // Num of bits that this is
#define HS_LZSS_MATCHLEN      7    // Maximum size of match (3 bits, 0-7)
#define HS_LZSS_MATCHBITS     3    // Num of bits that this is
#define HS_LZSS_MINMATCHLEN   3    // Minimum match size (3 bytes or not efficient)
#define HS_LZSS_MINDATASIZE   32   // Arbitrary minimum data size that we should attempt to compress
class HS_LZSS
{
public:
    // Functions
    int Compress(unsigned char *bData, unsigned char *bCompressedData,
                 unsigned long nDataSize, unsigned long *nCompressedSize);
    int Uncompress(unsigned char *bCompressedData, unsigned char *bData);
    int GetUncompressedSize(unsigned char *bCompressedData, unsigned long *nUncompressedSize);
private:
    // Variables
    unsigned long m_nDataStreamPos;       // Current position in the data stream
    unsigned long m_nCompressedStreamPos; // Current position in the compressed stream
    unsigned char *m_bData;
    unsigned char *m_bCompressedData;
    unsigned long m_nDataSize;            // The size of our uncompressed data
    unsigned long m_nCompressedSize;      // The size of our compressed data
    // Temporary variables used for the bit operations
    unsigned long m_nCompressedLong;      // Compressed stream temporary 32bit value
    int m_nCompressedBitsUsed;            // Number of bits used in temporary value
    // Functions
    bool FindMatches(unsigned int *nOffset, unsigned int *nLen); // Searches for pattern matches
    // Bit operation functions
    void CompressedStreamWriteBits(unsigned int nValue, unsigned int nNumBits);
    void CompressedStreamReadBits(unsigned int *nValue, unsigned int nNumBits);
};
#endif
APPENDIX C
<optlist.h>
#ifndef OPTLIST_H
#define OPTLIST_H
#define OL_NOINDEX -1
typedef struct option_t
{
    char option;
    char *argument;
    int argIndex;
    struct option_t *next;
} option_t;
<optlist.cpp>
#include "optlist.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
option_t *MakeOpt(const char option, char *const argument, const int index);
option_t *GetOptList(const int argc, char *const argv[], char *const options)
{
int nextArg;
option_t *head, *tail;
int optIndex;
nextArg = 1;
head = NULL;
tail = NULL;
while (nextArg < argc)
{
if ((strlen(argv[nextArg]) > 1) && ('-' == argv[nextArg][0]))
{
optIndex = 0;
while ((options[optIndex] != '\0') &&
(options[optIndex] != argv[nextArg][1]))
{
do
{
optIndex++;
}
while ((options[optIndex] != '\0') &&
(':' == options[optIndex]));
}
if (options[optIndex] == argv[nextArg][1])
{
if (NULL == head)
{
head = MakeOpt(options[optIndex], NULL, OL_NOINDEX);
tail = head;
}
else
{
tail->next = MakeOpt(options[optIndex], NULL, OL_NOINDEX);
tail = tail->next;
}
if (':' == options[optIndex + 1])
{
if (strlen(argv[nextArg]) > 2)
{
tail->argument = &(argv[nextArg][2]);
tail->argIndex = nextArg;
}
else if ((nextArg + 1) < argc)
{
nextArg++;
tail->argument = argv[nextArg];
tail->argIndex = nextArg;
}
}
}
}
nextArg++;
}
return head;
}
option_t *MakeOpt(const char option, char *const argument, const int index)
{
option_t *opt;
opt = (option_t *)malloc(sizeof(option_t));
if (opt != NULL)
{
opt->option = option;
opt->argument = argument;
opt->argIndex = index;
opt->next = NULL;
}
else
{
perror("Failed to Allocate option_t");
}
return opt;
}
void FreeOptList(option_t *list)
{
option_t *head, *next;
head = list;
list = NULL;
while (head != NULL)
{
next = head->next;
free(head);
head = next;
}
return;
}
<huflocal.h>
#ifndef _HUFFMAN_LOCAL_H
#define _HUFFMAN_LOCAL_H
#include <limits.h>
#if (UCHAR_MAX != 0xFF)
#error This program expects unsigned char to be 1 byte
#endif
#if (UINT_MAX != 0xFFFFFFFF)
#error This program expects unsigned int to be 4 bytes
#endif
/* system dependent types */
typedef unsigned char byte_t;
/* unsigned 8 bit */
typedef unsigned int count_t;
/* unsigned 32 bit for character counts */
typedef struct huffman_node_t
{
int value; /* character(s) represented by this entry */
count_t count; /* number of occurrences of value (probability) */
char ignore; /* TRUE -> already handled or no need to handle */
int level; /* depth in tree (root is 0) */
struct huffman_node_t *left, *right, *parent;
} huffman_node_t;
#define FALSE 0
#define TRUE 1
#define NONE -1
#define COUNT_T_MAX UINT_MAX /* based on count_t being unsigned int */
#define COMPOSITE_NODE -1 /* node represents multiple characters */
#define NUM_CHARS 257 /* 256 bytes + EOF */
#define EOF_CHAR (NUM_CHARS - 1) /* index used for EOF */
#define max(a, b) ((a)>(b)?(a):(b))
#ifdef _HUFFMAN_LOCAL_C
#define _HL_EXTERN
#else
#define _HL_EXTERN extern
#endif
_HL_EXTERN huffman_node_t *huffmanArray[NUM_CHARS]; /* array of all leaves */
/* create/destroy tree */
huffman_node_t *GenerateTreeFromFile(FILE *inFile);
huffman_node_t *BuildHuffmanTree(huffman_node_t **ht, int elements);
huffman_node_t *AllocHuffmanNode(int value);
void FreeHuffmanTree(huffman_node_t *ht);
#endif /* define _HUFFMAN_LOCAL_H */
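The header above declares BuildHuffmanTree, whose body is only partly reproduced in this appendix. As an illustration of the general technique, the sketch below derives Huffman code lengths from symbol counts by repeatedly merging the two lowest-weight nodes. It is a simplified stand-in under stated assumptions, not the routine used in the project: the names SketchCodeLengths, SKETCH_SYMBOLS and SKETCH_NODES are hypothetical, the alphabet is fixed at four symbols, and the ignore flag of huffman_node_t is not modelled.

```c
#define SKETCH_SYMBOLS 4                        /* hypothetical alphabet size */
#define SKETCH_NODES   (2 * SKETCH_SYMBOLS - 1) /* leaves plus internal nodes */

/*
 * Fills lengths[i] with the Huffman code length (tree depth) of symbol i,
 * given counts[i] occurrences of each of the SKETCH_SYMBOLS symbols.
 */
void SketchCodeLengths(const unsigned int *counts, int *lengths)
{
    unsigned int weight[SKETCH_NODES];
    int parent[SKETCH_NODES];
    int used[SKETCH_NODES];
    int nodes = SKETCH_SYMBOLS;
    int i;

    for (i = 0; i < SKETCH_NODES; i++)
    {
        weight[i] = (i < SKETCH_SYMBOLS) ? counts[i] : 0;
        parent[i] = -1;
        used[i] = 0;
    }

    /* each pass merges the two smallest unmerged nodes into a new parent */
    while (nodes < SKETCH_NODES)
    {
        int min1 = -1, min2 = -1;

        for (i = 0; i < nodes; i++)
        {
            if (used[i])
                continue;

            if (min1 < 0 || weight[i] < weight[min1])
            {
                min2 = min1;
                min1 = i;
            }
            else if (min2 < 0 || weight[i] < weight[min2])
            {
                min2 = i;
            }
        }

        weight[nodes] = weight[min1] + weight[min2];
        parent[min1] = nodes;
        parent[min2] = nodes;
        used[min1] = 1;
        used[min2] = 1;
        nodes++;
    }

    /* the code length of a leaf is its distance from the root */
    for (i = 0; i < SKETCH_SYMBOLS; i++)
    {
        int depth = 0;
        int n = i;

        while (parent[n] >= 0)
        {
            n = parent[n];
            depth++;
        }
        lengths[i] = depth;
    }
}
```

With counts of 5, 1, 1 and 2, the sketch assigns code lengths of 1, 3, 3 and 2 bits, so the most frequent symbol receives the shortest code.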
<huflocal.cpp>
#define _HUFFMAN_LOCAL_C
#include <stdio.h>
#include <stdlib.h>
#include "huflocal.h"
huffman_node_t *GenerateTreeFromFile(FILE *inFile)
{
huffman_node_t *huffmanTree;
/* root of huffman tree */
int c;
for (c = 0; c < NUM_CHARS; c++)
{
if ((huffmanArray[c] = AllocHuffmanNode(c)) == NULL)
{
for (c--; c >= 0; c--)
free(huffmanArray[c]);
return NULL;
}
}
huffmanArray[EOF_CHAR]->count = 1;
huffmanArray[EOF_CHAR]->ignore = FALSE;
while ((c = fgetc(inFile)) != EOF)
{
if (huffmanArray[c]->count < COUNT_T_MAX)
{
huffmanArray[c]->count++;
huffmanArray[c]->ignore = FALSE;
}
else
{
fprintf(stderr,
"Input file contains too many 0x%02X to count.\n", c);
return NULL;
}
}
huffmanTree = BuildHuffmanTree(huffmanArray, NUM_CHARS);
return huffmanTree;
}
huffman_node_t *AllocHuffmanNode(int value)
{
huffman_node_t *ht;
ht = (huffman_node_t *)(malloc(sizeof(huffman_node_t)));
if (ht != NULL)
{
ht->value = value;
ht->ignore = TRUE; /* will be FALSE if one is found */
/* at this point, the node is not part of a tree */
ht->count = 0;
ht->level = 0;
ht->left = NULL;
ht->right = NULL;
ht->parent = NULL;
}
else
{
perror("Allocate Node");
}
return ht;
}
............,
............,
............,
{
}
return ht[min1];
}
<huffman.h>
#ifndef _HUFFMAN_H_
#define _HUFFMAN_H_
/* traditional codes */
int HuffmanShowTree(char *inFile, char *outFile);
/* dump codes */
int HuffmanEncodeFile(char *inFile, char *outFile); /* encode file */
int HuffmanDecodeFile(char *inFile, char *outFile); /* decode file */
/* canonical code */
int CHuffmanShowTree(char *inFile, char *outFile);
/* dump codes */
int CHuffmanEncodeFile(char *inFile, char *outFile); /* encode file */
int CHuffmanDecodeFile(char *inFile, char *outFile); /* decode file */
#endif /* _HUFFMAN_H_ */
<huffman.cpp>
#include <stdio.h>
#include <stdlib.h>
#include "huflocal.h"
#include "bitarray.h"
#include "bitfile.h"
typedef struct code_list_t
{
byte_t codeLen; /* number of bits used in code (1 - 255) */
bit_array_t *code; /* code used for symbol (left justified) */
} code_list_t;
static int MakeCodeList(huffman_node_t *ht, code_list_t *codeList);
static void WriteHeader(huffman_node_t *ht, bit_file_t *bfp);
static int ReadHeader(huffman_node_t **ht, bit_file_t *bfp);
int HuffmanEncodeFile(char *inFile, char *outFile)
{
huffman_node_t *huffmanTree;
/* root of huffman tree */
code_list_t codeList[NUM_CHARS]; /* table for quick encode */
FILE *fpIn;
bit_file_t *bfpOut;
int c;
if ((fpIn = fopen(inFile, "rb")) == NULL)
{
perror(inFile);
return FALSE;
}
if (outFile == NULL)
bfpOut = MakeBitFile(stdout, BF_WRITE);
else
{
if ((bfpOut = BitFileOpen(outFile, BF_WRITE)) == NULL)
{
perror(outFile);
fclose(fpIn);
return FALSE;
}
}
if ((huffmanTree = GenerateTreeFromFile(fpIn)) == NULL)
{
/* close both files before reporting failure */
fclose(fpIn);
BitFileClose(bfpOut);
return FALSE;
}
for (c = 0; c < NUM_CHARS; c++)
{
codeList[c].code = NULL;
codeList[c].codeLen = 0;
}
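if (!MakeCodeList(huffmanTree, codeList))
{
/* release the files and the tree before reporting failure */
fclose(fpIn);
BitFileClose(bfpOut);
FreeHuffmanTree(huffmanTree);
return FALSE;
}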
WriteHeader(huffmanTree, bfpOut);
rewind(fpIn);
/* start another pass on the input file */
while((c = fgetc(fpIn)) != EOF)
{
BitFilePutBits(bfpOut,
BitArrayGetBits(codeList[c].code),
codeList[c].codeLen);
}
BitFilePutBits(bfpOut,
BitArrayGetBits(codeList[EOF_CHAR].code),
codeList[EOF_CHAR].codeLen);
for (c = 0; c < NUM_CHARS; c++)
{
if (codeList[c].code != NULL)
BitArrayDestroy(codeList[c].code);
}
fclose(fpIn);
BitFileClose(bfpOut);
FreeHuffmanTree(huffmanTree); /* free allocated memory */
return TRUE;
}
............,
............,
............,
{
if (c != 0)
currentNode = currentNode->right;
else
currentNode = currentNode->left;
if (currentNode->value != COMPOSITE_NODE)
{
if (currentNode->value == EOF_CHAR)
break;
fputc(currentNode->value, fpOut); /* write out character */
currentNode = huffmanTree;
/* back to top of tree */
}
}
BitFileClose(bfpIn);
fclose(fpOut);
FreeHuffmanTree(huffmanTree);
return TRUE;
}
............,
............,
............,
while (htp->parent != NULL)
{
if (htp != htp->parent->right)
{
code[depth - 1] = '1';
htp = htp->parent->right;
break;
}
else
{
depth--;
htp = htp->parent;
code[depth] = '\0';
}
}
if (htp->parent == NULL)
break;
}
fclose(fpIn);
fclose(fpOut);
FreeHuffmanTree(huffmanTree);
return TRUE;
}
............,
............,
............,
}
BitFilePutChar(0, bfp);
for(i = 0; i < sizeof(count_t); i++)
BitFilePutChar(0, bfp);
}
<bitfile.h>
#ifndef _BITFILE_H_
#define _BITFILE_H_
#include <stdio.h>
typedef enum
{
BF_READ = 0,
BF_WRITE = 1,
BF_APPEND= 2,
BF_NO_MODE
} BF_MODES;
struct bit_file_t;
typedef struct bit_file_t bit_file_t;
bit_file_t *BitFileOpen(const char *fileName, const BF_MODES mode);
bit_file_t *MakeBitFile(FILE *stream, const BF_MODES mode);
int BitFileClose(bit_file_t *stream);
FILE *BitFileToFILE(bit_file_t *stream);
/* toss spare bits and byte align file */
int BitFileByteAlign(bit_file_t *stream);
/* fill byte with ones or zeros and write out results */
int BitFileFlushOutput(bit_file_t *stream, const unsigned char onesFill);
/* get/put character */
int BitFileGetChar(bit_file_t *stream);
int BitFilePutChar(const int c, bit_file_t *stream);
/* get/put single bit */
int BitFileGetBit(bit_file_t *stream);
int BitFilePutBit(const int c, bit_file_t *stream);
/* get/put number of bits (most significant bit to least significant bit) */
int BitFileGetBits(bit_file_t *stream, void *bits, const unsigned int count);
int BitFilePutBits(bit_file_t *stream, void *bits, const unsigned int count);
int BitFileGetBitsInt(bit_file_t *stream, void *bits, const unsigned int count,
const size_t size);
int BitFilePutBitsInt(bit_file_t *stream, void *bits, const unsigned int count,
const size_t size);
#endif /* _BITFILE_H_ */
<bitfile.cpp>
#include <stdlib.h>
#include <errno.h>
#include "bitfile.h"
typedef enum
{
BF_UNKNOWN_ENDIAN,
BF_LITTLE_ENDIAN,
BF_BIG_ENDIAN
} endian_t;
struct bit_file_t
{
FILE *fp; /* file pointer used by stdio functions */
endian_t endian; /* endianness of architecture */
unsigned char bitBuffer; /* bits waiting to be read/written */
unsigned char bitCount; /* number of bits in bitBuffer */
BF_MODES mode; /* open for read, write, or append */
};
/* union used to test for endianness */
typedef union
{
unsigned long word;
unsigned char bytes[sizeof(unsigned long)];
} endian_test_t;
endian_t DetermineEndianess(void);
int BitFilePutBitsLE(bit_file_t *stream, void *bits, const unsigned int count);
int BitFilePutBitsBE(bit_file_t *stream, void *bits, const unsigned int count,
const size_t size);
int BitFileGetBitsLE(bit_file_t *stream, void *bits, const unsigned int count);
int BitFileGetBitsBE(bit_file_t *stream, void *bits, const unsigned int count,
const size_t size);
bit_file_t *BitFileOpen(const char *fileName, const BF_MODES mode)
{
char modes[3][3] = {"rb", "wb", "ab"}; /* binary modes for fopen */
bit_file_t *bf;
bf = (bit_file_t *)malloc(sizeof(bit_file_t));
if (bf == NULL)
errno = ENOMEM;
else
{
bf->fp = fopen(fileName, modes[mode]);
if (bf->fp == NULL)
{
free(bf);
bf = NULL;
}
else
{
bf->bitBuffer = 0;
bf->bitCount = 0;
bf->mode = mode;
bf->endian = DetermineEndianess();
}
}
return (bf);
}
............,
............,
............,
int BitFilePutBitsBE(bit_file_t *stream, void *bits, const unsigned int count,
const size_t size)
{
unsigned char *bytes, tmp;
int offset, remaining, returnValue;
if (count > (size * 8))
return EOF;
bytes = (unsigned char *)bits;
offset = size - 1;
remaining = count;
while (remaining >= 8)
{
returnValue = BitFilePutChar(bytes[offset], stream);
if (returnValue == EOF)
return EOF;
remaining -= 8;
offset--;
}
if (remaining != 0)
{
tmp = bytes[offset];
tmp <<= (8 - remaining);
while (remaining > 0)
{
returnValue = BitFilePutBit((tmp & 0x80), stream);
if (returnValue == EOF)
return EOF;
tmp <<= 1;
remaining--;
}
}
return count;
}
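BitFilePutBitsBE above emits whole bytes first and then the remaining bits one at a time, most significant bit first. The sketch below shows the same most-significant-bit-first packing against an in-memory buffer, which may make the bit order easier to follow. The type sketch_bits_t and the functions SketchPutBit, SketchPutBits and SketchFlush are hypothetical names introduced for illustration; they are not part of bitfile.cpp, and the sketch assumes count is at most 32 and the output buffer is large enough.

```c
#include <stddef.h>

/* Hypothetical in-memory bit writer; not part of bitfile.cpp. */
typedef struct
{
    unsigned char *out;      /* destination buffer (caller-provided) */
    size_t pos;              /* index of the next byte to write */
    unsigned char bitBuffer; /* bits collected so far */
    unsigned char bitCount;  /* number of bits in bitBuffer */
} sketch_bits_t;

/* Append one bit; a byte is stored as soon as 8 bits are collected. */
void SketchPutBit(sketch_bits_t *s, int bit)
{
    s->bitBuffer = (unsigned char)((s->bitBuffer << 1) | (bit ? 1 : 0));
    s->bitCount++;

    if (s->bitCount == 8)
    {
        s->out[s->pos++] = s->bitBuffer;
        s->bitBuffer = 0;
        s->bitCount = 0;
    }
}

/* Write the low `count` bits of value, most significant bit first.
 * Assumes count <= 32. */
void SketchPutBits(sketch_bits_t *s, unsigned int value, unsigned int count)
{
    while (count > 0)
    {
        count--;
        SketchPutBit(s, (int)((value >> count) & 1u));
    }
}

/* Pad the last partial byte with zero bits so that it is written out. */
void SketchFlush(sketch_bits_t *s)
{
    while (s->bitCount != 0)
        SketchPutBit(s, 0);
}
```

Writing the 3-bit value 101 followed by the 2-bit value 01 and then flushing produces the single byte 10101000 (0xA8). Working from the value rather than from raw bytes sidesteps the host byte-order question that the LE and BE variants in bitfile.cpp have to handle.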
<bitarray.h>
#ifndef BIT_ARRAY_H
#define BIT_ARRAY_H
struct bit_array_t;
typedef struct bit_array_t bit_array_t;
bit_array_t *BitArrayCreate(unsigned int bits);
void BitArrayDestroy(bit_array_t *ba);
/* debug functions */
void BitArrayDump(bit_array_t *ba, FILE *outFile);
/* set/clear functions */
void BitArraySetAll(bit_array_t *ba);
void BitArrayClearAll(bit_array_t *ba);
void BitArraySetBit(bit_array_t *ba, unsigned int bit);
void BitArrayClearBit(bit_array_t *ba, unsigned int bit);
/* raw bit access */
void *BitArrayGetBits(bit_array_t *ba);
/* bit test function */
int BitArrayTestBit(bit_array_t *ba, unsigned int bit);
/* copy functions */
void BitArrayCopy(bit_array_t *dest, const bit_array_t *src);
bit_array_t *BitArrayDuplicate(const bit_array_t *src);
/* logical operations */
void BitArrayAnd(bit_array_t *dest,
const bit_array_t *src1,
const bit_array_t *src2);
void BitArrayOr(bit_array_t *dest,
const bit_array_t *src1,
const bit_array_t *src2);
void BitArrayXor(bit_array_t *dest,
const bit_array_t *src1,
const bit_array_t *src2);
void BitArrayNot(bit_array_t *dest,
const bit_array_t *src);
/* bit shift functions */
void BitArrayShiftLeft(bit_array_t *ba, unsigned int shifts);
void BitArrayShiftRight(bit_array_t *ba, unsigned int shifts);
/* increment/decrement */
void BitArrayIncrement(bit_array_t *ba);
void BitArrayDecrement(bit_array_t *ba);
/* comparison */
int BitArrayCompare(const bit_array_t *ba1, const bit_array_t *ba2);
#endif /* ndef BIT_ARRAY_H */
<bitarray.cpp>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <limits.h>
#include <string.h>
#include "bitarray.h"
/* make CHAR_BIT 8 if it's not defined in limits.h */
#ifndef CHAR_BIT
#warning CHAR_BIT not defined. Assuming 8 bits.
#define CHAR_BIT 8
#endif
/* array index for character containing bit */
#define BIT_CHAR(bit) ((bit) / CHAR_BIT)
/* mask for position of bit within character */
#define BIT_IN_CHAR(bit) (1 << (CHAR_BIT - 1 - ((bit) % CHAR_BIT)))
/* number of characters required to contain number of bits */
#define BITS_TO_CHARS(bits) ((((bits) - 1) / CHAR_BIT) + 1)
/* most significant bit in a character */
#define MS_BIT (1 << (CHAR_BIT - 1))
struct bit_array_t
{
unsigned char *array;
/* pointer to array containing bits */
unsigned int numBits;
/* number of bits in array */
};
bit_array_t *BitArrayCreate(unsigned int bits)
{
bit_array_t *ba;
/* allocate structure */
ba = (bit_array_t *)malloc(sizeof(bit_array_t));
if (ba == NULL)
errno = ENOMEM;
else
{
ba->numBits = bits;
ba->array = (unsigned char *)malloc(sizeof(unsigned char) *
BITS_TO_CHARS(bits));
if (ba->array == NULL)
{
errno = ENOMEM;
free(ba);
ba = NULL;
}
}
return(ba);
}
.............................,
.............................,
.............................,
int BitArrayCompare(const bit_array_t *ba1, const bit_array_t *ba2)
{
int i;
if (ba1 == NULL)
{
if (ba2 == NULL)
return 0;
/* both are NULL */
else
return -(ba2->numBits); /* ba2 is the only Non-NULL*/
}
if (ba2 == NULL)
return (ba1->numBits);
/* ba1 is the only Non-NULL*/
if (ba1->numBits != ba2->numBits)
return(ba1->numBits - ba2->numBits);
for(i = 0; i <= BIT_CHAR(ba1->numBits - 1); i++)
{
if (ba1->array[i] != ba2->array[i])
return(ba1->array[i] - ba2->array[i]);
}
return 0;