HTCS 501: Data Encryption and Compression
Unit 5
Ansh Kasaudhan
2024-2025
Entropy Encoding
Entropy encoding is a lossless data compression technique that encodes data based on probabilities of
occurrences of symbols. It replaces frequently occurring symbols with shorter codes and less frequent
symbols with longer codes, optimizing the storage and transmission of data.
Run-Length Encoding (RLE)
One of the simplest compression schemes based on repeated symbols is Run-Length Encoding. Consider the string:
AAAABBBCCDAA
Each run of identical symbols is replaced by a count followed by the symbol:
• "AAAA" → "4A"
• "BBB" → "3B"
• "CC" → "2C"
• "D" → "1D"
• "AA" → "2A"
Compressed Output:
4A3B2C1D2A
Similarly, the bit string 0000000000111111111100000000 (10 zeros, 10 ones, 8 zeros) can be encoded as the run lengths 10, 10 and 8 together with the value of each run.
Advantages of RLE
✔ Simple & Fast: Requires minimal computation.
✔ Efficient for repetitive data: Works well for images with solid colors and long sequences.
✔ Lossless: 100% data recovery is possible.
Disadvantages of RLE
✘ Ineffective for diverse data: If characters change frequently, RLE increases file size instead of reducing it.
✘ Not suitable for complex images or unstructured text.
Run-Length Encoding (RLE) is a lossless data compression algorithm that is efficient for data with
repeated sequences of characters or pixels. The primary goal of RLE is to reduce the size of data by
encoding consecutive occurrences of the same character as a single character and a count.
1. Identify Runs: The first step is to scan the data and look for sequences of consecutive, identical characters or pixels.
2. Replace Each Run with a Count: For each sequence of repeated characters (or pixels), RLE replaces the sequence with a single character and a count of how many times that character appears in the run.
3. Store the Encoded Data: The encoded data is stored as pairs of a character and its run length.
4. Decompression
To reconstruct the original data, the encoded data is expanded by repeating the character according to the
specified count.
Example 1:
Original Data: AAAABBBCCDAA
Step-by-Step Process: the runs 4×A, 3×B, 2×C, 1×D and 2×A are encoded as 4A3B2C1D2A.
Decompression: To decompress the encoded string, you repeat each character by the specified count:
4A3B2C1D2A → AAAABBBCCDAA
Example 2:
Original Data: 0000000000111111111100000000
Step-by-Step Process: the runs of 10 zeros, 10 ones and 8 zeros can be stored, for example, as the pairs (0, 10), (1, 10), (0, 8); decompression expands each pair back into the original bit string.
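As a concrete illustration (not part of the original notes), the following is a minimal Python sketch of RLE in the count-before-character format used above. The names rle_encode and rle_decode are illustrative, and the scheme assumes the input symbols are not themselves digits.

import itertools

def rle_encode(data):
    """Encode a string as count+character pairs, e.g. 'AAAABBBCCDAA' -> '4A3B2C1D2A'."""
    return "".join(f"{sum(1 for _ in run)}{ch}" for ch, run in itertools.groupby(data))

def rle_decode(encoded):
    """Expand count+character pairs back into the original string."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch                 # counts may have more than one digit
        else:
            out.append(ch * int(count)) # repeat the character 'count' times
            count = ""
    return "".join(out)

print(rle_encode("AAAABBBCCDAA"))   # 4A3B2C1D2A
print(rle_decode("4A3B2C1D2A"))     # AAAABBBCCDAA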
Applications of RLE
• Image Compression: RLE is commonly used in image formats such as BMP, TIFF, and FAX,
where areas of solid colors or repeated pixel values are common.
• Simple Text Compression: Text with repeated characters, like "AAAAAAAAAAAAA", can be efficiently
compressed using RLE.
• Data Transmission: FAX machines and some early graphics file formats use RLE for transmitting
data over limited bandwidth.
Note:
The terms Repetitive Character Encoding and Run-Length Encoding (RLE) are often used
interchangeably, but they generally refer to the same concept of compressing repeated characters. However,
there may be subtle differences in how they are applied in various contexts. Here's a breakdown of the
distinctions and similarities between them:
Repetitive Character Encoding is a more general term used to describe any encoding technique where
repeated characters or sequences are replaced by a single character and a count of how many times it
repeats. It’s often used as a high-level description for techniques like Run-Length Encoding.
• Focus: Identifying repeated characters or sequences in data and replacing them with a compressed
form.
• Context: It is generally used when describing the principle of compressing repeated characters and
can include variations of Run-Length Encoding or other techniques that work similarly.
Run-Length Encoding (RLE), by contrast, is the specific technique:
• Focus: It strictly uses a character followed by a number indicating the number of repetitions.
• Context: RLE is a specific method commonly used in compression algorithms, especially for images (e.g., bitmaps) and fax transmission.
Key Differences
In short, Repetitive Character Encoding is the general idea of replacing repeated characters with a compressed form, while Run-Length Encoding is the specific character-plus-count implementation of that idea.
Zero/Blank Encoding
Zero/Blank Encoding is a simple and specialized form of lossless data compression primarily used to
handle and compress sequences of zeroes (0) or blank spaces in a dataset, especially in text and binary
data. The idea is to replace long runs of zeroes or blanks with a compressed representation, thus reducing
the overall data size.
This technique is particularly useful when data contains many consecutive zeros or blank characters, such as
in certain image formats, matrices, or sparse data.
1. Identify Runs of Zeroes or Blanks: The first step is to scan the dataset and find consecutive zeroes (0s) or blank spaces. These sequences are the main targets for encoding.
2. Replace Each Run with a Count: For each sequence of consecutive zeroes or blank spaces, replace it with the count of occurrences. For example, a sequence of 5 zeroes would be replaced with the value 5.
3. Store the Encoded Data: The compressed data consists of runs of zeroes or blanks represented by their respective counts.
4. Decompression
To restore the original data, each count is expanded by repeating the zero or blank character for the specified
number of times.
Example 1:
Original Data:
0000000011100000000000000
Step-by-Step Process: the 8 leading zeroes and the 14 trailing zeroes are replaced by their counts, while the run of ones is kept as-is (for example, as 8 111 14, where each count stands for a run of zeroes).
Decompression: Expanding the counts will give us the original sequence:
0000000011100000000000000
Example 2:
Original Text:
Hello     World (with five consecutive blank spaces between the words)
Encoded Text:
Hello5_World
Decompression: Expanding the 5_ will restore the original text:
Hello     World
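As an illustration, here is a minimal Python sketch of blank encoding in the count-plus-underscore style of the example above. The helper names blank_encode and blank_decode are ours, only runs of two or more spaces are compressed, and the scheme assumes the original text never contains a digit sequence followed by an underscore.

import re

def blank_encode(text):
    """Replace each run of 2 or more spaces with '<count>_', e.g. 'Hello     World' -> 'Hello5_World'."""
    return re.sub(r" {2,}", lambda m: f"{len(m.group(0))}_", text)

def blank_decode(encoded):
    """Expand each '<count>_' marker back into that many spaces."""
    return re.sub(r"(\d+)_", lambda m: " " * int(m.group(1)), encoded)

print(blank_encode("Hello     World"))  # Hello5_World
print(blank_decode("Hello5_World"))     # Hello     World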
Statistical Encoding
Statistical Encoding is a class of lossless data compression techniques that rely on the probability
distribution of the input symbols. The core idea of statistical encoding is to assign shorter codes to more
frequent symbols and longer codes to less frequent ones, effectively reducing the overall size of the data.
• Frequency-based encoding: More frequent symbols are encoded with shorter codes, while less
frequent symbols get longer codes.
• Entropy: The efficiency of a statistical encoding algorithm is closely related to the entropy of the
data, which represents the amount of unpredictability or information content. The closer the average
code length gets to the entropy, the more efficient the encoding is (see the short example after this list).
• Lossless compression: The original data can be perfectly reconstructed from the encoded data
without any loss of information.
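For reference (not spelled out in the notes), the entropy referred to above is the Shannon entropy, H = −Σ pᵢ log₂ pᵢ, which gives the limit in bits per symbol that a statistical encoder can approach. A small illustrative Python check, using the example string from the RLE section:

import math
from collections import Counter

def shannon_entropy(data):
    """Average information content in bits per symbol: H = -sum(p_i * log2(p_i))."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(round(shannon_entropy("AAAABBBCCDAA"), 2))  # about 1.73 bits per symbol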
The most important statistical encoding techniques are:
1. Huffman Coding
2. Arithmetic Coding
3. Lempel-Ziv Coding
1. Huffman Coding
Huffman Coding is one of the most widely used lossless data compression algorithms that utilizes
variable-length encoding based on the frequency of symbols in the input data. The basic idea is to assign
shorter codes to more frequent symbols and longer codes to less frequent symbols, thereby reducing the
overall size of the data.
1. Frequency Analysis:
The first step in Huffman coding is to calculate the frequency (or probability) of each symbol
(character or byte) in the input data.
2. Building a Huffman Tree:
o Create leaf nodes: Each symbol is represented by a leaf node with its frequency.
o Build the tree: The two nodes with the lowest frequencies are merged into a new node. This
process repeats until a single tree is built, with the root node representing the entire dataset.
o Internal nodes are assigned a frequency equal to the sum of the frequencies of their child
nodes.
3. Assigning Binary Codes:
o Traverse the tree from the root to each leaf, assigning a binary code for each symbol.
o Left branches are typically assigned 0 and right branches are assigned 1.
4. Encoding:
o Replace each symbol in the input data with its corresponding binary code from the tree.
5. Decoding:
o The decoder uses the same Huffman tree to decode the binary data back into the original
symbols.
Example: For a string such as ABRACADABRA, the symbol frequencies are A: 5, B: 2, R: 2, C: 1, D: 1.
One possible Huffman code is A = 0, B = 100, R = 101, C = 110, D = 111 (no codeword is a prefix of another), so the string is encoded as:
0 100 101 0 110 0 111 0 100 101 0 (23 bits instead of 88 bits with 8-bit ASCII).
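The following is a compact Python sketch of Huffman coding using a min-heap; it is an illustrative implementation rather than the exact construction from the notes, and tie-breaking may produce different (but equally optimal) bit patterns.

import heapq
from collections import Counter

def huffman_codes(data):
    """Build a Huffman code table {symbol: bitstring} from symbol frequencies."""
    freq = Counter(data)
    # Each heap entry: (frequency, tie_breaker, {symbol: code_so_far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate case: only one distinct symbol
        return {sym: "0" for sym in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # two lowest-frequency nodes
        f2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}      # left branch gets 0
        merged.update({s: "1" + c for s, c in t2.items()})  # right branch gets 1
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

text = "ABRACADABRA"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(codes)        # A gets a 1-bit code, the other symbols 3-bit codes (exact bits depend on tie-breaking)
print(len(encoded), "bits instead of", 8 * len(text))   # 23 bits instead of 88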
Disadvantages of Huffman Coding
• Ineffective for datasets with similar frequencies (e.g., if all symbols have nearly equal frequencies).
• Fixed codebook: The encoding scheme is static and does not adapt to changes in symbol frequencies during the compression process.
2. Arithmetic Coding
Arithmetic Coding is a statistical encoding technique that encodes an entire message as a single number
between 0 and 1. Unlike Huffman coding, which assigns a binary code to each symbol, Arithmetic Coding
assigns a range to the entire message, and progressively narrows down this range based on the probability of
each symbol.
1. Frequency Calculation:
Determine the probability of each symbol in the input data.
2. Assign Probability Ranges:
Each symbol is assigned a subrange of the interval [0, 1] based on its probability. For example:
o A: 0.5 (range [0, 0.5))
o B: 0.25 (range [0.5, 0.75))
o C: 0.25 (range [0.75, 1))
3. Narrow the Interval:
For each symbol of the message, the current interval is subdivided in proportion to the symbol probabilities, and the subinterval corresponding to the symbol that actually occurs becomes the new current interval.
4. Output the Result:
At the end, the final interval is obtained (in this example, [0.75, 1)), and any number in this range (e.g., 0.875) can represent the encoded message.
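A simplified Python sketch of this interval-narrowing idea is shown below, using ordinary floating-point arithmetic (real arithmetic coders use integer arithmetic with renormalisation to avoid precision loss). The probability table matches the example above, and the function name is illustrative.

def arithmetic_encode(message, prob):
    """Narrow the interval [0, 1) symbol by symbol; any number in the final interval encodes the message."""
    # Build cumulative ranges, e.g. A:[0, 0.5), B:[0.5, 0.75), C:[0.75, 1)
    ranges, start = {}, 0.0
    for sym, p in prob.items():
        ranges[sym] = (start, start + p)
        start += p
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        sym_low, sym_high = ranges[sym]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high

prob = {"A": 0.5, "B": 0.25, "C": 0.25}
print(arithmetic_encode("C", prob))    # (0.75, 1.0)  -> e.g. 0.875 represents the message
print(arithmetic_encode("AB", prob))   # (0.25, 0.375)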
Advantages of Arithmetic Coding
• More efficient than Huffman coding for datasets with a small alphabet and symbol probabilities that do not map neatly onto whole-bit code lengths.
• Achieves better compression ratios than Huffman coding.
• Adaptable to changes in symbol frequencies during encoding.
Disadvantages of Arithmetic Coding
• Complex: The algorithm is more computationally intensive and requires higher precision for managing ranges.
• Slower than Huffman coding for small datasets.
3. Lempel-Ziv Coding (LZ77)
Lempel-Ziv coding is a dictionary-based compression family; LZ77, one of its best-known variants, works as follows:
1. Sliding Window:
LZ77 maintains a sliding window over the input data. The window is split into two parts: a search
buffer (containing previously encoded data) and a look-ahead buffer (containing new data being
processed).
2. Find Matches:
The algorithm scans the look-ahead buffer and searches for the longest match in the search buffer.
3. Encoding:
o A match is encoded as a triplet (offset, length, next symbol):
▪ Offset: The position of the match in the search buffer.
▪ Length: The length of the matched string.
▪ Next symbol: The symbol following the matched string.
4. Update the Window:
The sliding window is updated, and the process continues for the next set of symbols.
Example: Original string:
ABABABA
1. Step 1: Search for the longest match of the look-ahead buffer in the search buffer.
o First, A has no match, so it is encoded as (0, 0, A).
o Next, B has no match, so it is encoded as (0, 0, B).
o The next part of the string is "AB", which matches the start of the string two positions back, so it is encoded as (2, 2, A) (offset 2, length 2, followed by the literal A).
o The remaining "BA" is encoded as (2, 1, A): the B found two positions back, followed by the literal A.
2. Encoded Output:
The encoded string is the sequence of triplets: (0, 0, A), (0, 0, B), (2, 2, A), (2, 1, A).
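To check the triplets above, here is a minimal Python decoder for (offset, length, next symbol) triplets. It is illustrative only; a full LZ77 implementation also manages the sliding-window size.

def lz77_decode(triplets):
    """Rebuild the text from (offset, length, next_symbol) triplets."""
    out = []
    for offset, length, symbol in triplets:
        for _ in range(length):
            out.append(out[-offset])   # copy one character from 'offset' positions back
        out.append(symbol)             # then append the literal next symbol
    return "".join(out)

print(lz77_decode([(0, 0, "A"), (0, 0, "B"), (2, 2, "A"), (2, 1, "A")]))  # ABABABA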
Advantages of LZ77
• Efficient for repetitive data and commonly used in real-time applications like file compression.
• No need for prior frequency analysis, unlike statistical methods such as Huffman or Arithmetic
coding.
• Adaptive: It adapts to the input data and doesn't require an initial analysis phase.
Simple Vector Quantization (SVQ)
Vector quantization is a lossy compression technique that maps each input vector to the closest codeword in a predefined codebook, so that only the codeword index needs to be stored or transmitted. Simple vector quantization works as follows:
1. Create a Codebook:
o A codebook is a collection of representative vectors or codewords. These vectors are
obtained by partitioning the input space into regions, and each region is represented by a
codeword.
o Codewords can be determined through techniques like Lloyd's algorithm (a form of k-means clustering).
2. Encoding Process:
o For a given input vector, find the codeword in the codebook that is nearest to the input
vector, typically using the Euclidean distance.
o Replace the input vector with the index of the corresponding codeword from the codebook.
3. Decoding Process:
o To decode, simply use the index of the selected codeword to look up the corresponding
vector in the codebook and reconstruct the data.
Example: Consider a 2-dimensional input space, and assume we have a codebook with two codewords, C1 = [1, 2] and C2 = [5, 6].
Encoding:
• The input vector [3, 4] is encoded as C1, or simply as the index 1 (representing that codeword's position in the codebook).
Decoding:
• To decode, simply replace the index with the vector in the codebook corresponding to C1 = [1, 2].
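A small Python sketch of the nearest-codeword search described above, using squared Euclidean distance; the codebook is the two-codeword example from the text, and the function names are ours.

def vq_encode(vector, codebook):
    """Return the index of the codeword closest to the input vector (squared Euclidean distance)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

def vq_decode(index, codebook):
    """Reconstruction is simply a codebook lookup."""
    return codebook[index]

codebook = [[1, 2], [5, 6]]            # C1 and C2
idx = vq_encode([3, 4], codebook)      # [3, 4] is equidistant from C1 and C2; min() keeps the first, i.e. C1
print(idx, vq_decode(idx, codebook))   # 0 [1, 2]   (index 0 here corresponds to C1; the text counts from 1)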
Advantages of Simple Vector Quantization
• Efficient: The method reduces the size of the data, since each input vector is represented only by the index of a codeword.
• Simple and easy to implement: The process involves only finding the closest codeword and
replacing the input vector with the codeword index.
Disadvantages of Simple Vector Quantization
• Lossy compression: Because the data is quantized, some loss of information occurs during encoding.
• Codebook design: The performance heavily depends on how well the codebook is designed. A
poorly designed codebook may lead to higher distortion.
• High computational cost: If the codebook has a large number of codewords, the encoding process
can become computationally expensive.
Vector Quantization with Error Term
This variant extends simple vector quantization by also recording the quantization error (residual), which improves the quality of the reconstruction:
1. Create a Codebook:
o As with simple VQ, a codebook of representative vectors is created. However, in this case,
the encoding process accounts for the quantization error.
2. Encoding Process with Error Term:
o Find the nearest codeword in the codebook as usual.
o Calculate the quantization error (or residual) as the difference between the input vector and
the chosen codeword.
E = X − C
where X is the input vector and C is the chosen codeword.
Example: Assume a codebook with codewords C1 = [1, 2] and C2 = [5, 6] again, and let the input vector be X = [3, 4]. Choosing C1 as the codeword gives the error
E = [3, 4] − [1, 2] = [2, 2]
Decoding adds the stored error back to the codeword: [1, 2] + [2, 2] = [3, 4], reconstructing the input.
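A minimal sketch of encoding with an error term, following the idea above: the encoder stores the codeword index plus the residual, and the decoder adds the residual back. The function names are illustrative, and the residual is kept exact here for clarity (in a real system it would typically be quantized as well).

def nearest_codeword(vector, codebook):
    """Index of the closest codeword (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, codebook[i])))

def vq_encode_with_error(vector, codebook):
    """Return (codeword index, residual E = X - C)."""
    idx = nearest_codeword(vector, codebook)
    residual = [x - c for x, c in zip(vector, codebook[idx])]
    return idx, residual

def vq_decode_with_error(index, residual, codebook):
    """Reconstruct X = C + E."""
    return [c + e for c, e in zip(codebook[index], residual)]

codebook = [[1, 2], [5, 6]]
idx, err = vq_encode_with_error([3, 4], codebook)
print(idx, err)                                   # 0 [2, 2]   (index 0 is C1)
print(vq_decode_with_error(idx, err, codebook))   # [3, 4]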
Advantages of Vector Quantization with Error Term
• Reduced distortion: The error term helps reduce the quantization error, improving the approximation of the original data.
• Efficient compression: By combining codebook lookup and error correction, the compression ratio
can be higher without significant quality loss.
Disadvantages of Vector Quantization with Error Term
• Complexity: The encoding and decoding processes are more complex compared to simple vector quantization.
• Requires error storage: The error term must be stored or transmitted along with the codeword
index, which can add overhead.
• Lossy compression: Like simple vector quantization, some loss of information occurs, but with less
noticeable distortion.
Conclusion
• Simple Vector Quantization (SVQ) is a straightforward and efficient technique for lossy data
compression. It maps input vectors to the closest codeword in a predefined codebook.
• Vector Quantization with Error Term takes the quantization error into account, improving the
quality of the approximation by adding a residual error vector to the codeword. This method helps
reduce distortion and achieves a better reconstruction of the original data.
Both techniques are commonly used in image compression, speech encoding, and audio compression,
among other applications. While SVQ is simpler and faster, vector quantization with an error term
provides better accuracy at the cost of increased complexity.
Recent Trends in Encryption and Data Compression Techniques
1. Quantum Encryption
Quantum computing has the potential to break traditional encryption algorithms like RSA and ECC (Elliptic
Curve Cryptography). As a result, quantum-safe encryption or post-quantum encryption is gaining
attention.
• Quantum Key Distribution (QKD): This encryption technique leverages the principles of quantum
mechanics to create secure communication channels. The most notable protocol is BB84, which
allows for the secure exchange of encryption keys.
• Lattice-Based Cryptography: A class of cryptographic algorithms that is believed to be resistant to
quantum attacks. Examples include NTRU and FrodoKEM.
• Code-Based Cryptography: Uses error-correcting codes to create cryptographic systems that are
resistant to quantum computers. McEliece is a prominent example.
• Multivariate Polynomial Cryptography: Based on the hardness of solving systems of multivariate
quadratic equations, this approach is another potential quantum-safe technique.
2. Homomorphic Encryption
• Fully Homomorphic Encryption (FHE): Allows both addition and multiplication on encrypted data.
• Partially Homomorphic Encryption (PHE): Supports either addition or multiplication but not
both.
3. Blockchain Cryptography
• Zero-Knowledge Proofs (ZKPs): Used in blockchain to prove that a statement is true without revealing any additional information. This is important for privacy and anonymity in blockchain transactions.
• Advanced Cryptographic Hash Functions: New and more secure hash functions are emerging to
strengthen blockchain systems, such as SHA-3 and BLAKE2.
4. Post-Quantum Cryptographic Algorithms
• Algorithms like Kyber (for key exchange), NTRU (for encryption), and FrodoKEM (for key encapsulation) are being explored as viable replacements for today's public-key algorithms.
5. Advanced Secure Multi-Party Computation (SMPC)
This technique enables multiple parties to compute a function while keeping their inputs private. The
applications are widespread, particularly in data analytics and finance.
• Secure Multiparty Computation allows for encrypted computations involving multiple parties
without revealing their private data, ensuring privacy.
• Deep Learning for Compression: Neural networks, particularly autoencoders, are being used to
learn optimal compression techniques for image, video, and text data. Algorithms like Deep Image
Prior and Variational Autoencoders (VAEs) are improving compression efficiency in tasks like
image compression.
• Reinforcement Learning for Adaptive Compression: RL algorithms are used to adapt the
compression techniques dynamically based on the characteristics of the data.
• Edge Compression: Compressing data on edge devices before transmitting it to the cloud, reducing
bandwidth and storage requirements.
• Sparse Data Compression: Data from IoT sensors and large datasets often contain a lot of sparse or
redundant information. New algorithms focus on compressing sparse datasets more efficiently, using
techniques like run-length encoding, delta encoding, and Huffman coding.
• Distributed Data Compression: In distributed systems, compressing data across multiple nodes is
gaining attention to minimize network overhead and speed up data processing.
• HEVC (High-Efficiency Video Coding): Used for 4K and 8K video compression, it provides
significantly higher compression ratios than H.264 while maintaining the same video quality.
• AV1: A newer video compression standard that is open-source and royalty-free. AV1 promises better
compression efficiency than HEVC, especially for streaming applications.
• Deep Learning-based Compression: AI techniques like Generative Adversarial Networks
(GANs) are also being used to improve image and video compression, particularly in applications
requiring high visual fidelity.
• Zstandard: A fast compression algorithm developed at Facebook that provides a good balance between compression ratio and speed; it is widely used in modern applications, alongside Google's Brotli (which targets web content compression).
• LZ4: Known for its fast compression and decompression speeds, LZ4 is widely used in database
compression and log file compression.
• Hybrid Video Compression: In many modern streaming applications, both lossy and lossless
compression methods are used. For example, lossy methods are used for initial compression,
followed by a lossless compression stage to further optimize storage.
• Brotli Compression: Brotli is gaining traction as a replacement for Gzip and deflate, offering better
compression ratios and speeds for web content, especially in HTTPS traffic.
• Cloud Compression Algorithms: Cloud service providers like AWS, Google Cloud, and Azure are
focusing on compression algorithms tailored to their specific environments, enabling more efficient
storage and faster data retrieval.