Unit 5 Data Compression
Entropy encoding is a class of lossless data compression algorithms based on the statistical
properties of the data being compressed. These methods aim to represent frequent symbols
with shorter codes and rare symbols with longer codes, optimizing the representation of data
by exploiting its inherent redundancy.
Entropy encoding techniques are grounded in the concept of Shannon entropy, which
quantifies the amount of unpredictability or randomness in a dataset. By assigning shorter
codes to frequently occurring characters and longer codes to rare ones, entropy encoding
optimizes the representation of the data.
1. Huffman Coding:
o How it works: In Huffman coding, each character or symbol is assigned a
binary code. A symbol's code length falls as its frequency rises (roughly as the
negative logarithm of its probability). Frequent symbols get shorter codes, while rare symbols get
longer codes. The algorithm constructs a Huffman tree, which is a binary tree
used to generate these codes.
o Use in Cryptography: Huffman coding is widely used in cryptography to
compress data before encryption. By eliminating redundant characters, it
reduces the data size and increases entropy, making the encrypted data harder
to analyze.
o Example: If the symbol "A" appears frequently in a dataset, it may be
assigned the binary code 0, while less frequent symbols like "Z" may be
assigned a longer binary code like 1011.
2. Arithmetic Coding:
o How it works: Unlike Huffman coding, which encodes each symbol separately
using a whole number of bits, arithmetic coding represents the entire message
as a single number within a range of values. The size of this range depends on the
probability distribution of the symbols in the message. Symbols are encoded in
a sequence, and the range is progressively narrowed to represent the entire
message in one compact number.
o Use in Cryptography: Arithmetic coding can achieve better compression
ratios than Huffman coding, especially for symbols with skewed frequency
distributions. This makes it useful in cryptographic systems where data size
reduction is a priority before encryption.
o Example: In a string like "AAAAABBBB", arithmetic coding assigns the frequent
symbol "A" a wider sub-interval than "B", so each "A" narrows the interval less
and the whole message can be identified with fewer bits, reducing the overall
length of the compressed data.
3. Shannon-Fano Coding:
o How it works: This is another entropy-based encoding technique that is
conceptually similar to Huffman coding. It builds a binary tree by dividing the
set of symbols into two parts based on their probabilities. The process
continues recursively until each symbol is assigned a unique code.
o Use in Cryptography: Though less efficient than Huffman coding in practice,
Shannon-Fano coding is still used in certain cryptographic compression
applications.
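The recursive division can be sketched as follows. The A/B/C/D frequencies are an assumed toy alphabet, and ties in the balancing step may be broken differently in other presentations:

```python
def shannon_fano(freqs):
    """Assign Shannon-Fano codes: sort symbols by frequency, then recursively
    split the list into two halves of (nearly) equal total frequency."""
    codes = {}

    def split(items, prefix):
        if len(items) == 1:
            codes[items[0][0]] = prefix or "0"
            return
        total = sum(f for _, f in items)
        # Choose the cut point that best balances the two halves.
        cut, best = 1, total
        for i in range(1, len(items)):
            left = sum(f for _, f in items[:i])
            diff = abs(total - 2 * left)
            if diff < best:
                cut, best = i, diff
        split(items[:cut], prefix + "0")   # upper half gets a leading 0
        split(items[cut:], prefix + "1")   # lower half gets a leading 1

    split(sorted(freqs, key=lambda sf: -sf[1]), "")
    return codes

codes = shannon_fano([("A", 3), ("B", 1), ("C", 1), ("D", 1)])
```

For these frequencies the split yields the prefix-free codes A = 0, B = 10, C = 110, D = 111.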
Run-Length Encoding (RLE)
How it works: RLE encodes consecutive occurrences of the same symbol by storing
the symbol once, followed by the number of times it repeats. For example, the string
"AAAAABBBCC" would be encoded as A5B3C2.
Use in Cryptography: RLE is used in cases where data contains long runs of
repetitive characters (e.g., in binary or image data). By reducing the size of the data
before encryption, RLE helps in optimizing the cryptographic process. However, it is
not suitable for all types of data and works best with data that contains long sequences
of repeated symbols.
Example: For a string like "AAAAAAA", RLE would encode it as "A7", meaning seven
consecutive "A"s, which is a much more efficient representation.
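A minimal RLE encoder and decoder along these lines (assuming symbols are single non-digit characters and run lengths are written in decimal):

```python
import re

def rle_encode(s):
    """Encode runs as symbol + decimal run length: 'AAAAAAA' -> 'A7'."""
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                      # extend the current run
        out.append(f"{s[i]}{j - i}")
        i = j
    return "".join(out)

def rle_decode(encoded):
    """Invert rle_encode, assuming single non-digit symbols."""
    return "".join(sym * int(n)
                   for sym, n in re.findall(r"(\D)(\d+)", encoded))

# rle_encode("AAAAABBBCC") returns "A5B3C2"
```

Note that for data without runs (e.g. "ABCD" becoming "A1B1C1D1") this scheme expands the input, which is the weakness discussed below.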
1. Increased Security:
o High Entropy: Applying entropy encoding before encryption removes
redundancy, raising the entropy per bit of the data. Ciphertext produced
from such high-entropy input appears random and lacks discernible patterns,
which is essential for strong cryptographic security.
o Obfuscation of Patterns: Compression methods like Huffman coding and
arithmetic coding help in eliminating predictable patterns in data, making it
more challenging for cryptanalysis techniques (such as frequency analysis) to
succeed.
2. Efficiency in Cryptography:
o Reducing Data Size: Compression before encryption reduces the amount of
data that needs to be processed by the cryptographic algorithm. This can lead
to performance improvements, especially when encrypting large datasets.
o Lower Computational Overhead: Smaller data sizes mean that encryption
and decryption algorithms run faster and use less computational resources,
making them more efficient in resource-constrained environments.
3. Cryptographic Use Cases:
o File Encryption: When encrypting files, applying entropy encoding before
encryption helps reduce the size of the file, speeding up the encryption
process.
o Secure Communication: In secure communication protocols, applying
entropy encoding to messages before encryption ensures that the messages are
transmitted in a compressed format, reducing bandwidth usage and improving
transmission speed.
While RLE is simple and effective for certain types of data, its use in cryptography comes
with both advantages and potential challenges:
Advantages:
1. Compression Efficiency: RLE works best on data with long sequences of repeated
characters (e.g., text files with many spaces, or images with large areas of a single
color). In such cases, RLE can significantly reduce the size of the data, making
encryption faster and reducing resource consumption.
2. Simplicity: RLE is easy to implement, making it a simple and efficient compression
technique before encryption.
Disadvantages:
1. Limited Effectiveness on Random Data: RLE is not effective on data that is already
random or lacks repetition (such as highly encrypted data or naturally random files).
For example, applying RLE to random data might not reduce the size, and it might
even increase the size of the data in some cases.
2. Vulnerability to Pattern Exploitation: If the plaintext contains repetitive structures
or runs that remain after encryption, it might make the ciphertext more predictable or
susceptible to attacks such as frequency analysis. For example, RLE applied to a
message with little or no redundancy could expose repeating patterns that an attacker
might exploit.
o Example: If encrypted data retains patterns like "A5B4C3", an attacker might
deduce some structure, making it easier to break the encryption.
3. Not Suitable for All Data Types: RLE works well for data with long runs of the
same symbol, but it does not work well for data that is highly varied or complex (e.g.,
binary data that has no significant repetitions). This means that RLE may not always
provide the desired reduction in size, especially for certain types of cryptographic
operations.
1. Zero/Blank Encoding
In cryptographic systems, zero encoding can be used as a preprocessing step to reduce the
size of the data before encryption. Since many types of data, particularly binary data, may
have long runs of zeros or spaces (such as padding in data), zero encoding can help optimize
storage and encryption processes.
Padding in Encryption: Many block ciphers (e.g., AES, DES) require padding when
the plaintext message is not a multiple of the block size. These padding values often
consist of zeros or blank spaces. Zero/blank encoding can be applied to eliminate the
redundancy introduced by padding, especially in scenarios where the data structure
has excessive padding.
Efficiency: Encoding zeros or spaces can help in reducing the amount of data that
needs to be encrypted. If the plaintext has a large number of zeros or spaces, zero
encoding can compress it significantly, reducing the computational load during
encryption and making the encryption process more efficient.
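One simple way to realize zero encoding is to replace each run of zero bytes with a marker byte followed by the run length. The marker value below (0xFF) and the cap on run length are illustrative assumptions; a real scheme would also need to escape marker bytes that occur in the data itself:

```python
def zero_encode(data: bytes, marker: int = 0xFF) -> bytes:
    """Collapse each run of zero bytes into (marker, run_length).
    Assumptions for illustration: the marker byte never occurs in the
    input, and runs longer than 255 bytes are split."""
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == 0:
            j = i
            while j < len(data) and data[j] == 0 and j - i < 255:
                j += 1
            out += bytes([marker, j - i])   # marker + run length
            i = j
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

# zero_encode(b"\x01\x00\x00\x00\x02") -> b"\x01\xff\x03\x02"
```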
Huffman Coding is one of the most well-known forms of statistical encoding, and it is
widely used in data compression techniques. It is a variable-length prefix encoding algorithm
that efficiently compresses data based on the frequency of the symbols (or characters) in the
input data. In cryptography, Huffman coding can be used as a preprocessing step to reduce
the size of the data before encryption, improving efficiency and sometimes adding a layer of
obfuscation.
Huffman Coding
1. Pre-Encryption Compression
In many cryptographic systems, large amounts of data need to be encrypted, but encrypting
large datasets directly can be inefficient. Huffman coding can be applied as a pre-encryption
compression step to reduce the size of the data to be encrypted. This not only reduces the
computational overhead but also speeds up the encryption and decryption processes,
especially for large volumes of data.
For example, before applying a block cipher like AES or DES, Huffman coding can
compress the data so that fewer bits are encrypted, improving performance. After encryption,
the ciphertext remains in the compressed format.
2. Obfuscation of Plaintext Structure
Huffman coding can introduce a layer of obfuscation to the plaintext data. Since Huffman
encoding assigns variable-length codes to different symbols, the structure of the plaintext can
be altered before encryption, making it harder for attackers to recognize patterns in the
ciphertext. This can add a level of complexity to cryptanalysis efforts, especially if combined
with strong encryption algorithms.
For example, a sequence of frequently occurring symbols might get replaced by shorter
codes, making it less predictable and more difficult to infer the original data without proper
decryption.
3. Efficient Encryption in Low-Bandwidth Environments
In scenarios where bandwidth is limited, Huffman coding can help reduce the amount of
data that needs to be encrypted and transmitted. This is particularly useful in network
encryption or in secure communication systems where reducing data size can lead to more
efficient resource usage.
By reducing the size of the plaintext before encryption, Huffman coding makes encryption
more efficient in environments with bandwidth limitations.
Huffman coding proceeds in four steps:
1. Frequency Analysis: Calculate the frequency of each symbol in the input data.
2. Building the Huffman Tree: Construct a binary tree where the leaf nodes represent
the symbols, and the tree is built based on the frequency of the symbols (i.e., the less
frequent symbols will be deeper in the tree).
3. Assign Codes: Once the tree is built, assign binary codes to the symbols. The path
from the root of the tree to a symbol defines its binary code (with left branches
typically being 0 and right branches being 1).
4. Encode Data: Replace each symbol in the original data with its corresponding
Huffman code.
Example: encode the string AABACD.
1. Frequency Analysis:
o A appears 3 times.
o B appears 1 time.
o C appears 1 time.
o D appears 1 time.
2. Build the Huffman Tree:
o Start by creating nodes for each symbol with its frequency.
o Combine the two least frequent nodes and assign them a parent node with the
sum of their frequencies.
o Repeat this process until all nodes are combined into a single tree.
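The four steps above can be sketched in Python using a binary heap for the tree construction. The exact codes depend on how frequency ties are broken; the assignment below is one valid outcome for AABACD:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    freq = Counter(text)                          # step 1: frequency analysis
    # Step 2: build the tree bottom-up. Heap entries are (frequency,
    # tie-break counter, node); a node is a symbol or a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)         # two least frequent nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    # Step 3: assign codes -- left edge is 0, right edge is 1.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("AABACD")
encoded = "".join(codes[c] for c in "AABACD")     # step 4: encode
```

With this tie-breaking, AABACD yields A = 0, D = 10, B = 110, C = 111 (other tie-breaks give equally valid code sets), so the six input symbols compress to 11 bits.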
Arithmetic and Lempel-Ziv Coding
Arithmetic coding maps an entire message to a single value in the interval [0, 1):
each symbol owns a sub-interval proportional to its probability, and encoding
narrows the current interval symbol by symbol. If the input string is "ABCA",
arithmetic coding maps this sequence to a single value between 0 and 1 determined
by the intervals associated with each symbol.
Lempel-Ziv (LZ) coding is a family of algorithms used for lossless data compression. The
two most common variants are LZ77 and LZ78. These algorithms work by replacing
repeated sequences of data with shorter references (pointers) to earlier occurrences in the
data. Lempel-Ziv coding is often used in modern compression formats like ZIP and GIF, and
it has applications in cryptography as well.
LZ77 works by encoding repeated strings of characters as (offset, length) pairs. These pairs
represent references to earlier occurrences of the same substring in the data.
Encoding Process:
o As the algorithm scans through the data, it identifies repeating sequences of
characters.
o Instead of storing the repeated sequence again, LZ77 stores a reference to the
previous occurrence along with its length.
o For example, when the second "AB" in "ABAB" is reached, LZ77 can encode
it as the pair (2, 2): go back 2 positions and copy 2 characters.
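A simplified LZ77 encoder along these lines, emitting (offset, length, next-symbol) triples with offset 0 marking a literal. The window size and the triple format are illustrative choices, not the only ones used in practice:

```python
def lz77_encode(data, window=255):
    """Emit (offset, length, next_symbol) triples; offset 0 means
    'no match', so the triple just carries a literal next_symbol."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the sliding window for the longest earlier match.
        for off in range(1, min(i, window) + 1):
            length = 0
            while (i + length < len(data)
                   and data[i + length] == data[i - off + (length % off)]):
                length += 1               # (length % off) allows overlap
            if length > best_len:
                best_off, best_len = off, length
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

# lz77_encode("ABAB") -> [(0, 0, 'A'), (0, 0, 'B'), (2, 2, '')]
```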
LZ78 Encoding Process:
o LZ78 builds a dictionary of substrings as it scans the data.
o When a new substring is encountered, it assigns a unique code to that
substring.
o The encoded output consists of pairs of (dictionary index, next symbol).
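The dictionary-building loop can be sketched as follows (a minimal illustration; production LZ78 variants also bound the dictionary size):

```python
def lz78_encode(data):
    """Emit (dictionary_index, next_symbol) pairs; index 0 is the
    empty phrase."""
    dictionary = {"": 0}
    out, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch                  # keep extending a known phrase
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                            # flush a trailing known phrase
        out.append((dictionary[phrase], ""))
    return out

# lz78_encode("ABABABA") -> [(0, 'A'), (0, 'B'), (1, 'B'), (3, 'A')]
```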
1. Adaptive Compression: Both LZ77 and LZ78 are adaptive and do not require
knowledge of the entire dataset in advance, making them suitable for data streams.
2. Efficient for Repetitive Data: LZ77 and LZ78 excel at compressing data with
repetitive structures, which can be common in certain types of data, such as text or
network traffic.
3. No Need for Probability Models: Unlike Huffman or Arithmetic coding, Lempel-
Ziv algorithms do not require explicit frequency models, making them easier to
implement and adapt to a wide range of data.
Application in Cryptography:
Vector Quantization (VQ) is a type of source encoding used in data compression and signal
processing. In the context of cryptography, vector quantization can be used as a
preprocessing step to reduce the size of the data before encryption, potentially improving the
efficiency of cryptographic systems. It works by mapping similar data points
(vectors) to a small set of representative vectors, called codewords; the
collection of codewords forms the codebook, which is then used to represent the
original data in a more compact form.
Vector quantization has applications in image and speech compression and can be
particularly useful for high-dimensional data. In cryptography, it helps by reducing the
amount of data to be encrypted while possibly introducing an extra layer of obfuscation
before the encryption process.
Vector quantization can be implemented in various ways, with simple vector quantization
being the basic form and vector quantization with error terms (often referred to as
predicted vector quantization or with reconstruction error) being a more advanced
version.
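A minimal sketch of simple vector quantization with a hand-picked two-entry codebook (in practice the codebook is trained from the data, e.g. with k-means):

```python
def quantize(vectors, codebook):
    """Simple VQ: map each vector to the index of its nearest codeword."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: dist2(v, codebook[k]))
            for v in vectors]

codebook = [(0.0, 0.0), (1.0, 1.0)]       # two assumed 2-D codewords
indices = quantize([(0.1, 0.2), (0.9, 1.1)], codebook)
# Each input vector is now represented by one small codebook index.
```

Replacing each full vector with a single index is where the compression comes from; the price is the approximation error, which the error-term variant below addresses.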
Application in Cryptography:
In Vector Quantization with Error Term, also known as Predicted Vector Quantization
(PVQ) or VQ with reconstruction error, the quantization process introduces a mechanism
for minimizing the error between the original data and its quantized representation. Instead of
simply replacing the data vector with a codeword from the codebook, the error (or residual)
between the data vector and the codeword is also considered.
This method tries to minimize the quantization error, resulting in a better approximation of
the original data. In cryptography, this can provide more accurate representations of the data
while still offering compression.
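A sketch of the error-term variant: alongside the codeword index, the residual is kept so the original vector can be reconstructed exactly (the codebook is again an illustrative assumption):

```python
def quantize_with_error(vector, codebook):
    """VQ with an error term: store the nearest codeword's index plus the
    residual, so the decoder rebuilds the vector as codeword + residual."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    k = min(range(len(codebook)), key=lambda j: dist2(vector, codebook[j]))
    residual = tuple(a - b for a, b in zip(vector, codebook[k]))
    return k, residual

codebook = [(0.0, 0.0), (1.0, 1.0)]       # assumed codebook, as above
k, residual = quantize_with_error((0.9, 1.2), codebook)
reconstructed = tuple(c + r for c, r in zip(codebook[k], residual))
```

In a real system the residual would itself be coarsely quantized or entropy coded; storing it in full, as here, trades some compression for an exact reconstruction.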
Application in Cryptography: