HTCS501 Unit 5

The document outlines a Cyber Security Honours Degree program focusing on key concepts and techniques for protecting digital assets from cyber threats, including network security, encryption, and ethical hacking. It details various data compression techniques, particularly Run-Length Encoding (RLE) and Zero/Blank Encoding, explaining their processes, advantages, and disadvantages. The report emphasizes the importance of applying theoretical knowledge to practical scenarios in cybersecurity to enhance data integrity and confidentiality.

CYBER SECURITY HONOURS’ DEGREE

Ansh Kasaudhan

2024-2025

Cyber Security

INTRODUCTION

This cybersecurity report explores key concepts and
techniques for protecting digital assets from
cyber threats. Through practical exercises,
we examine network security, encryption,
intrusion detection, and ethical hacking. The
report aims to enhance understanding of
vulnerabilities, threat mitigation, and security
protocols essential for safeguarding
information systems. By applying theoretical
knowledge to real-world scenarios, this lab
underscores the critical importance of robust
cybersecurity measures in maintaining the
integrity and confidentiality of digital data.


HTCS 501: DATA ENCRYPTION AND COMPRESSION ANSH KASAUDHAN



Syllabus


Unit 5

Entropy Encoding
Entropy encoding is a lossless data compression technique that encodes data based on probabilities of
occurrences of symbols. It replaces frequently occurring symbols with shorter codes and less frequent
symbols with longer codes, optimizing the storage and transmission of data.

Key Features of Entropy Encoding

✔ Lossless compression technique (no data loss).


✔ Assigns variable-length codes based on symbol frequency.
✔ Works well with text, binary data, and images.
✔ Commonly used in Huffman Coding and Arithmetic Coding.

Repetitive Character Encoding (Run-Length Encoding - RLE)
Repetitive Character Encoding, commonly known as Run-Length Encoding (RLE), is a lossless
compression technique used to reduce the size of data containing consecutive repeating characters.
Instead of storing each repeated character separately, RLE stores the character once along with its count.

Key Features of RLE


✔ Simple and efficient for repetitive data.
✔ Lossless compression – the original data can be perfectly reconstructed.
✔ Works best for monochrome images, text files with repeated characters, and uncompressed bitmap
images (BMP, TIFF, etc.).
✔ Not effective for data without long sequences of repetition.

Steps in Run-Length Encoding (RLE)


1. Identify consecutive repeating characters in the data.
2. Replace each sequence with the character followed by the repetition count.
3. Store the compressed data in the new format.
4. Decompression expands the encoded values back to the original data.

Example 1: Text Compression


Original String:

AAAABBBCCDAA



Step-by-Step Encoding:

• "AAAA" → "4A"
• "BBB" → "3B"
• "CC" → "2C"
• "D" → "1D"
• "AA" → "2A"

Compressed Output:

4A3B2C1D2A

Decompression simply expands back to "AAAABBBCCDAA".
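The encoding and decoding steps above can be sketched in Python (a minimal illustration using the count-then-character format from the example; the function names are ours, not part of any standard library):

```python
import re

def rle_encode(text):
    """Compress each run of repeated characters as count + character,
    e.g. 'AAAA' -> '4A'."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                      # extend the current run
        out.append(f"{j - i}{text[i]}")
        i = j
    return "".join(out)

def rle_decode(encoded):
    """Expand count + character pairs back to the original string."""
    return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", encoded))

print(rle_encode("AAAABBBCCDAA"))  # 4A3B2C1D2A
print(rle_decode("4A3B2C1D2A"))    # AAAABBBCCDAA
```

Note how a string with no repetition, such as "ABAB", expands to "1A1B1A1B" — exactly the weakness noted in the disadvantages.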

Example 2: Image Compression


Consider a black-and-white image stored in a 1D array:

0000000000111111111100000000

Encoded as:

10_0 10_1 8_0

(10 black pixels, 10 white pixels, 8 black pixels).

Used in fax transmission, BMP images, and TIFF format.

Advantages of RLE
✔ Simple & Fast: Requires minimal computation.
✔ Efficient for repetitive data: Works well for images with solid colors and long sequences.
✔ Lossless: 100% data recovery is possible.

Disadvantages of RLE
Ineffective for diverse data: If characters change frequently, RLE increases file size instead of
reducing it.
Not suitable for complex images or unstructured text.

Use Cases of RLE


Text compression (repetitive characters in logs, documents).
Image compression (fax, BMP, TIFF formats).
Video encoding (run-length encoding in motion pictures).



Run-Length Encoding (RLE)

Run-Length Encoding (RLE) is a lossless data compression algorithm that is efficient for data with
repeated sequences of characters or pixels. The primary goal of RLE is to reduce the size of data by
encoding consecutive occurrences of the same character as a single character and a count.

Key Features of RLE


• Lossless compression: No data is lost during the compression process, and the original data can be
perfectly reconstructed.
• Simple and efficient: Especially effective for compressing data with many repeated characters (e.g.,
images with large areas of a single color).
• Storage reduction: Significantly reduces storage requirements when data contains long runs of
repeated characters or symbols.

Steps in Run-Length Encoding (RLE)


1. Identify Consecutive Repeated Characters

The first step is to scan the data and look for sequences of consecutive, identical characters or pixels.

2. Replace Each Run with a Count and the Character

For each sequence of repeated characters (or pixels), RLE replaces the sequence with a single character and
a count of how many times that character appears in the run.

3. Store the Encoded Data

The encoded data is stored as pairs of character and its run length.

4. Decompression

To reconstruct the original data, the encoded data is expanded by repeating the character according to the
specified count.

Example 1: Text Compression


Let’s take an example where we want to compress the string:

AAAABBBCCDAA

Step-by-Step Process:

1. Find the repeated characters:


o "AAAA" → 4 occurrences of A
o "BBB" → 3 occurrences of B
o "CC" → 2 occurrences of C
o "D" → 1 occurrence of D
o "AA" → 2 occurrences of A



2. Replace the sequences with character + count:
o "AAAA" → "4A"
o "BBB" → "3B"
o "CC" → "2C"
o "D" → "1D"
o "AA" → "2A"
3. Compressed Output:

4A3B2C1D2A

4. Decompression:
To decompress the encoded string, you repeat each character by the specified count:

4A3B2C1D2A → AAAABBBCCDAA

Example 2: Image Compression


Suppose we have a black-and-white image represented as a 1D array:

0000000000111111111100000000

Step-by-Step Process:

1. Identify the runs:


o Ten 0s (black pixels)
o Ten 1s (white pixels)
o Eight 0s (black pixels)
2. Replace with run lengths:
o "0000000000" → "10_0"
o "1111111111" → "10_1"
o "00000000" → "8_0"
3. Compressed Output:

10_0 10_1 8_0

4. Decompression: Reconstructing the image would result in:

0000000000111111111100000000

Advantages of Run-Length Encoding (RLE)


• Simple and fast: RLE is a straightforward algorithm that is easy to implement and execute.
• Efficient for repetitive data: It provides significant compression when the data contains long
sequences of repeating characters (e.g., monochrome images, simple graphics, or text).
• Lossless: The original data can be restored exactly from the compressed format.

Disadvantages of Run-Length Encoding (RLE)


• Inefficient for non-repetitive data: RLE performs poorly with data that has little to no repetition. In
such cases, RLE could increase the size of the data instead of compressing it.
• Limited to specific data types: Best suited for images or text with long runs of repeated characters
(e.g., black-and-white images, documents with repeated words/characters).



Use Cases of Run-Length Encoding (RLE)

• Image Compression: RLE is commonly used in image formats such as BMP, TIFF, and FAX,
where areas of solid colors or repeated pixel values are common.
• Simple Text Compression: Text with repeated characters, like "AAAAAAAAAAAAA", can be efficiently
compressed using RLE.
• Data Transmission: FAX machines and some early graphics file formats use RLE for transmitting
data over limited bandwidth.

Comparison: RLE vs Other Compression Techniques


Feature: RLE | Huffman Coding | LZW (Lempel-Ziv-Welch)
Compression Type: Lossless | Lossless | Lossless
Best for: Repetitive data | Data with varied frequency | Data with repeating patterns
Complexity: Simple | Moderate | Moderate
Compression Ratio: Low for non-repetitive data | Variable (better for varied data) | Good for structured data
Decompression Speed: Fast | Moderate | Moderate

Note: -

The terms Repetitive Character Encoding and Run-Length Encoding (RLE) are often used
interchangeably, but they generally refer to the same concept of compressing repeated characters. However,
there may be subtle differences in how they are applied in various contexts. Here's a breakdown of the
distinctions and similarities between them:

1. Repetitive Character Encoding

Repetitive Character Encoding is a more general term used to describe any encoding technique where
repeated characters or sequences are replaced by a single character and a count of how many times it
repeats. It’s often used as a high-level description for techniques like Run-Length Encoding.

• Focus: Identifying repeated characters or sequences in data and replacing them with a compressed
form.
• Context: It is generally used when describing the principle of compressing repeated characters and
can include variations of Run-Length Encoding or other techniques that work similarly.

2. Run-Length Encoding (RLE)

Run-Length Encoding (RLE) is a specific implementation of repetitive character encoding. It is a formal,


well-defined lossless compression algorithm that directly replaces runs of repeated characters with the
character and the count of repetitions.

• Focus: It strictly uses a character followed by a number indicating the number of repetitions.
• Context: RLE is a specific method commonly used in compression algorithms, especially for images
(e.g., bitmap) and fax transmission.

Key Differences

Feature: Repetitive Character Encoding | Run-Length Encoding (RLE)

Definition: A general term for encoding repeated characters or sequences. | A specific algorithm for compressing repeated characters with a count.
Application: Describes various methods of encoding repeated data, including RLE and others. | A particular algorithm with a strict method of encoding.
Representation: A broad concept that may involve different ways of encoding repeated characters. | Typically represented as (count, character). Example: "4A" for "AAAA".
Example: Can involve custom schemes (not always a strict count-based system). | "AAAABBBCC" → "4A3B2C".
Use Case: Any kind of encoding that addresses repetition. | Primarily image compression, fax, and monochrome images.
Complexity: Varies, depending on the encoding scheme. | Very simple algorithm with well-defined steps.

Zero/Blank Encoding
Zero/Blank Encoding is a simple and specialized form of lossless data compression primarily used to
handle and compress sequences of zeroes (0) or blank spaces in a dataset, especially in text and binary
data. The idea is to replace long runs of zeroes or blanks with a compressed representation, thus reducing
the overall data size.

This technique is particularly useful when data contains many consecutive zeros or blank characters, such as
in certain image formats, matrices, or sparse data.

Key Features of Zero/Blank Encoding


• Efficient for Sparse Data: Primarily useful when the data consists of a large number of zeroes or
blank spaces, such as in image or matrix representations where empty regions are common.
• Lossless Compression: No information is lost during the compression process. The original data can
be perfectly reconstructed.
• Simple: The encoding technique is straightforward, requiring only the identification of zeroes or
blanks in a sequence.
• Space-saving: Helps to reduce storage and transmission costs in data that contains long sequences of
zeroes or blanks.

Steps in Zero/Blank Encoding


1. Identify Sequences of Zeroes or Blank Characters

The first step is to scan the dataset and find consecutive zeroes (0s) or blank spaces. These sequences are
the main targets for encoding.



2. Replace Runs with a Count

For each sequence of consecutive zeroes or blank spaces, replace it with the count of occurrences. For
example, a sequence of 5 zeroes would be replaced with the value 5.

3. Store the Encoded Data

The compressed data consists of runs of zeroes or blanks followed by their respective count.

4. Decompression

To restore the original data, each count is expanded by repeating the zero or blank character for the specified
number of times.

Example of Zero/Blank Encoding


Example 1: Encoding Zeroes in Binary Data

Original Data:

0000000011100000000000000

Step-by-Step Process:

1. Identify the sequences:


o 8 consecutive 0s.
o 3 consecutive 1s.
o 14 consecutive 0s.
2. Replace with count:
o "00000000" → "8_0"
o "111" → "3_1"
o "00000000000000" → "14_0"
3. Compressed Output:

8_0 3_1 14_0

4. Decompression:
Expanding the counts will give us the original sequence:

0000000011100000000000000
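The count-and-expand steps above can be sketched in Python (an illustrative addition using the notes' count_symbol notation; it works for visible symbols such as '0' and '1' — runs of blank spaces would need a printable escape for the symbol):

```python
from itertools import groupby

def run_encode(data):
    """Encode each run as a 'count_symbol' token (e.g. '8_0'), space-separated."""
    return " ".join(f"{len(list(g))}_{k}" for k, g in groupby(data))

def run_decode(encoded):
    """Expand each 'count_symbol' token back into its run."""
    return "".join(sym * int(count)
                   for count, sym in (tok.split("_") for tok in encoded.split()))

bits = "0000000011100000000000000"
print(run_encode(bits))                 # encodes the runs actually present
print(run_decode(run_encode(bits)) == bits)  # True: lossless round trip
```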

Example 2: Encoding Blank Spaces in Text

Original Text:

Hello World

(There are 5 spaces between "Hello" and "World")



Step-by-Step Process:

1. Identify the sequence:


o 5 consecutive blank spaces between "Hello" and "World".
2. Replace with count:
o " " → "5_" (indicating 5 blank spaces)
3. Compressed Output:

Hello5_World

4. Decompression:
Expanding the 5_ will restore the original text:

Hello World

Applications of Zero/Blank Encoding


• Sparse Data Representation: Zero/Blank encoding is commonly used in applications involving
matrices, grids, and sparse arrays, where large portions of the data are zeroes (e.g., sparse matrices
in scientific computing).
• Image Compression: In some bitmap or image formats, blank spaces (like white pixels) or zero
values (in grayscale or binary images) are common. Zero/Blank encoding is useful in compressing
these regions.
• Data Transmission: It is used in situations where data, such as network protocols or
communication systems, includes a large number of unused or empty values (zeros or blanks).

Advantages of Zero/Blank Encoding


• Simple and Efficient: Zero/Blank encoding is a very simple compression method that doesn't
require complex algorithms.
• Highly Effective for Sparse Data: Works extremely well when data contains large areas of empty
space, such as in images or matrix data with large blocks of zeroes.
• Lossless Compression: The original data can be reconstructed exactly.

Disadvantages of Zero/Blank Encoding


• Ineffective for Dense Data: If the data has no long sequences of zeros or blanks, Zero/Blank
encoding may not offer any compression benefits and could increase the size.
• Limited Use Case: This method is not widely applicable outside of data with lots of zeros or blank
spaces.



Statistical Encoding

Statistical Encoding is a class of lossless data compression techniques that rely on the probability
distribution of the input symbols. The core idea of statistical encoding is to assign shorter codes to more
frequent symbols and longer codes to less frequent ones, effectively reducing the overall size of the data.

Key Principles of Statistical Encoding

• Frequency-based encoding: More frequent symbols are encoded with shorter codes, while less
frequent symbols get longer codes.
• Entropy: The efficiency of a statistical encoding algorithm is closely related to the entropy of the
data, which represents the amount of unpredictability or information content. The closer the
encoding approaches the entropy, the more efficient it is.
• Lossless compression: The original data can be perfectly reconstructed from the encoded data
without any loss of information.
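The entropy bound mentioned above can be computed directly; a minimal Python sketch (an illustrative addition, not part of the original notes):

```python
from collections import Counter
from math import log2

def entropy(data):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    n = len(data)
    return -sum((c / n) * log2(c / n) for c in Counter(data).values())

# 'ABRACADABRA' has symbol probabilities 5/11, 2/11, 2/11, 1/11, 1/11
print(round(entropy("ABRACADABRA"), 2))  # ≈ 2.04 bits per symbol
```

No lossless statistical code can average fewer bits per symbol than this value, which is why encoders are judged by how closely they approach it.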

Types of Statistical Encoding Techniques

1. Huffman Coding
2. Arithmetic Coding
3. Lempel-Ziv Coding

1. Huffman Coding
Huffman Coding is one of the most widely used lossless data compression algorithms that utilizes
variable-length encoding based on the frequency of symbols in the input data. The basic idea is to assign
shorter codes to more frequent symbols and longer codes to less frequent symbols, thereby reducing the
overall size of the data.

Steps in Huffman Coding:

1. Frequency Analysis:
The first step in Huffman coding is to calculate the frequency (or probability) of each symbol
(character or byte) in the input data.
2. Building a Huffman Tree:
o Create leaf nodes: Each symbol is represented by a leaf node with its frequency.
o Build the tree: The two nodes with the lowest frequencies are merged into a new node. This
process repeats until a single tree is built, with the root node representing the entire dataset.
o Internal nodes are assigned a frequency equal to the sum of the frequencies of their child
nodes.
3. Assigning Binary Codes:
o Traverse the tree from the root to each leaf, assigning a binary code for each symbol.
o Left branches are typically assigned 0 and right branches are assigned 1.
4. Encoding:
o Replace each symbol in the input data with its corresponding binary code from the tree.
5. Decoding:
o The decoder uses the same Huffman tree to decode the binary data back into the original
symbols.



Example of Huffman Coding:

Let's consider the string:


ABRACADABRA

1. Frequency of each symbol:

A: 5, B: 2, R: 2, C: 1, D: 1

2. Build the Huffman Tree:


o Combine C and D into a node CD with frequency 2.
o Combine B and R into a node BR with frequency 4.
o Combine the two nodes CD and BR into a node CD+BR with frequency 6.
o Finally, combine A with CD+BR into the root node.
3. Assign Binary Codes (one valid assignment, using 0 for left branches and 1 for right):
o A: 0
o B: 100
o R: 101
o C: 110
o D: 111
4. Encoded Data: The encoded version of ABRACADABRA (23 bits) is:

0 100 101 0 110 0 111 0 100 101 0
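The tree-building steps can be sketched in Python (a minimal illustration; the exact bit patterns depend on how ties between equal frequencies are broken, but the total encoded length is optimal either way):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Greedy Huffman: repeatedly merge the two lowest-frequency nodes,
    prefixing '0' to codes in one subtree and '1' in the other."""
    freqs = Counter(text)
    if len(freqs) == 1:                       # degenerate single-symbol input
        return {next(iter(freqs)): "0"}
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    codes = {s: "" for s in freqs}
    tiebreak = len(heap)                      # unique ids keep tuple comparison safe
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1:
            codes[s] = "0" + codes[s]
        for s in syms2:
            codes[s] = "1" + codes[s]
        heapq.heappush(heap, (f1 + f2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return codes

codes = huffman_codes("ABRACADABRA")
encoded = "".join(codes[c] for c in "ABRACADABRA")
print(len(encoded))  # 23 bits: A gets a 1-bit code, the other four symbols 3 bits
```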

Advantages of Huffman Coding:

• Efficient for large datasets with skewed symbol distributions.


• Widely used in various formats like ZIP, JPEG, and MP3.
• Simple and fast to implement, with O(n log n) tree construction and linear-time encoding and decoding.

Disadvantages of Huffman Coding:

• Ineffective for datasets with similar frequencies (e.g., if all symbols have nearly equal
frequencies).
• Fixed codebook: The encoding scheme is static and does not adapt to changes in symbol frequencies
during the compression process.

2. Arithmetic Coding
Arithmetic Coding is a statistical encoding technique that encodes an entire message as a single number
between 0 and 1. Unlike Huffman coding, which assigns a binary code to each symbol, Arithmetic Coding
assigns a range to the entire message, and progressively narrows down this range based on the probability of
each symbol.

Steps in Arithmetic Coding:

1. Frequency Calculation:
Determine the probability of each symbol in the input data.
2. Assign Probability Ranges:
Each symbol is assigned a subrange of the interval [0, 1] based on its probability. For example:
o A: 0.5 (range [0, 0.5))
o B: 0.25 (range [0.5, 0.75))
o C: 0.25 (range [0.75, 1))



3. Encoding:
o Start with the interval [0, 1].
o For each symbol in the message, narrow the range by multiplying the current range by the
symbol's probability.
o After processing all symbols, the final interval represents the encoded message.
4. Final Representation:
Choose any number in the final range and use its binary representation as the encoded message.
5. Decoding:
The decoder uses the same probability ranges to reverse the process and recover the original
message.

Example of Arithmetic Coding:

Consider a dataset with the following symbol probabilities:

A: 0.5, B: 0.25, C: 0.25

For the string "ABC":

1. Initial Range: [0, 1)

2. After encoding "A": Narrow to A's sub-range of [0, 1), giving [0, 0.5).
3. After encoding "B": Narrow to B's sub-range [0.5, 0.75) of the current interval [0, 0.5), giving [0.25, 0.375).
4. After encoding "C": Narrow to C's sub-range [0.75, 1) of [0.25, 0.375), giving [0.34375, 0.375).

At the end, the final interval is [0.34375, 0.375), and any number in this range (e.g., 0.35) can represent the encoded message.
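The interval-narrowing steps can be sketched in Python (illustrative only; production arithmetic coders use integer arithmetic and renormalization to avoid the precision limits of floating point):

```python
def arithmetic_encode(message, ranges):
    """Repeatedly narrow [low, high) by each symbol's probability sub-range."""
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        sym_lo, sym_hi = ranges[sym]
        low, high = low + span * sym_lo, low + span * sym_hi
    return low, high

def arithmetic_decode(value, ranges, length):
    """Reverse the process: find which sub-range the value falls in,
    emit that symbol, then rescale the value and repeat."""
    out = []
    for _ in range(length):
        for sym, (lo, hi) in ranges.items():
            if lo <= value < hi:
                out.append(sym)
                value = (value - lo) / (hi - lo)
                break
    return "".join(out)

ranges = {"A": (0.0, 0.5), "B": (0.5, 0.75), "C": (0.75, 1.0)}
low, high = arithmetic_encode("ABC", ranges)
print(low, high)                               # 0.34375 0.375
print(arithmetic_decode(0.359375, ranges, 3))  # ABC
```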

Advantages of Arithmetic Coding:

• More efficient than Huffman coding for datasets with a small alphabet and non-integer
probabilities.
• Achieves better compression ratios than Huffman coding.
• Adaptable to changes in symbol frequencies during encoding.

Disadvantages of Arithmetic Coding:

• Complex: The algorithm is more computationally intensive and requires higher precision for
managing ranges.
• Slower than Huffman coding for small datasets.

3. Lempel-Ziv Coding (LZ Coding)


Lempel-Ziv Coding (LZ) is a dictionary-based compression technique that works by replacing repeated
occurrences of data with references to earlier occurrences. It is the basis for many compression algorithms,
including LZW (Lempel-Ziv-Welch) and LZ77, and is widely used in formats like GIF, TIFF, and ZIP.

Steps in Lempel-Ziv Coding (LZ77):

1. Sliding Window:
LZ77 maintains a sliding window over the input data. The window is split into two parts: a search
buffer (containing previously encoded data) and a look-ahead buffer (containing new data being
processed).
2. Find Matches:
The algorithm scans the look-ahead buffer and searches for the longest match in the search buffer.


3. Encoding:
o A match is encoded as a triplet (offset, length, next symbol):
▪ Offset: The position of the match in the search buffer.
▪ Length: The length of the matched string.
▪ Next symbol: The symbol following the matched string.
4. Update the Window:
The sliding window is updated, and the process continues for the next set of symbols.

Example of LZ77 Encoding:

Original string:
ABABABA

1. Step 1: Search for the longest match of the look-ahead buffer in the search buffer.
o First, A has no earlier match, so it is encoded as the literal (0, 0, A).
o Next, B has no earlier match, so it is encoded as (0, 0, B).
o The remaining text is "ABABA". Starting two positions back, the match "ABAB" can be copied (the copy is allowed to overlap the current position), followed by the final A. This is encoded as (2, 4, A).
2. Encoded Output:
The encoded string is a sequence of triplets:

(0, 0, A), (0, 0, B), (2, 4, A)
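The sliding-window search can be sketched in Python (a simplified illustration; real LZ77 implementations bound the match length, pack the triplets into bits, and handle the end of input more carefully):

```python
def lz77_encode(data, window=255):
    """Emit (offset, length, next_symbol) triplets. The offset counts back
    from the current position; a match may overlap the cursor (self-copy)."""
    i, out = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        for off in range(1, min(i, window) + 1):
            length = 0
            # keep one symbol in reserve for the 'next symbol' slot
            while (i + length < len(data) - 1
                   and data[i + length - off] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triplets):
    out = []
    for off, length, nxt in triplets:
        for _ in range(length):
            out.append(out[-off])  # overlapping copies work one symbol at a time
        out.append(nxt)
    return "".join(out)

print(lz77_encode("ABABABA"))  # [(0, 0, 'A'), (0, 0, 'B'), (2, 4, 'A')]
```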

Advantages of Lempel-Ziv Coding:

• Efficient for repetitive data and commonly used in real-time applications like file compression.
• No need for prior frequency analysis, unlike statistical methods such as Huffman or Arithmetic
coding.
• Adaptive: It adapts to the input data and doesn't require an initial analysis phase.

Disadvantages of Lempel-Ziv Coding:

• Less efficient for data with few repeated patterns.


• More complex than Huffman coding and requires maintaining a sliding window.

Comparison of Huffman, Arithmetic, and Lempel-Ziv Coding

Feature: Huffman Coding | Arithmetic Coding | Lempel-Ziv (LZ77) Coding
Compression Efficiency: Good for data with uneven frequencies | Very efficient for small alphabets | Efficient for repetitive patterns
Complexity: Simple, fast | Computationally intensive | More complex but efficient for certain types of data
Adaptability: Static codebook | Highly adaptable | Adaptive, works with any data type
Use Cases: Text, JPEG, MP3 | Multimedia, text compression | ZIP, GIF, TIFF, LZW-based algorithms
Encoding/Decoding Speed: Fast | Slower due to complexity | Moderate, with a sliding window

Source Encoding: Vector Quantization
Vector Quantization (VQ) is a lossy data compression technique used in source encoding that
compresses data by representing large data vectors with smaller, discrete sets of vectors called codewords.
VQ works by mapping continuous data points to a finite set of discrete vectors, significantly reducing the
data's size.

The central concept of VQ is that it quantizes (maps) each input vector to a representative vector drawn from a small, finite codebook. The goal is to minimize the difference between the original data and the quantized data, i.e., to minimize the distortion between the two.

Types of Vector Quantization

1. Simple Vector Quantization (SVQ)


2. Vector Quantization with Error Term

1. Simple Vector Quantization (SVQ)


Simple Vector Quantization (SVQ) is the basic form of vector quantization. The idea is to represent data
vectors with a set of predefined representative vectors known as codewords. These codewords form the
codebook, and the encoding process involves finding the closest codeword for a given input vector.

Steps in Simple Vector Quantization:

1. Create a Codebook:
o A codebook is a collection of representative vectors or codewords. These vectors are
obtained by partitioning the input space into regions, and each region is represented by a
codeword.
o Codewords can be determined through techniques like Lloyd's algorithm (a form of k-
means clustering).
2. Encoding Process:
o For a given input vector, find the codeword in the codebook that is nearest to the input
vector, typically using the Euclidean distance.
o Replace the input vector with the index of the corresponding codeword from the codebook.
3. Decoding Process:
o To decode, simply use the index of the selected codeword to look up the corresponding
vector in the codebook and reconstruct the data.

Example of Simple Vector Quantization:

Consider a 2-dimensional input space, and assume we have a codebook with two codewords, C1 = [1, 2]
and C2 = [5, 6].

Input Vector: [3, 4]

• Calculate the Euclidean distance from [3, 4] to each codeword:


o Distance to C1 = [1, 2]: √((3−1)² + (4−2)²) = √(4+4) = √8 ≈ 2.83
o Distance to C2 = [5, 6]: √((3−5)² + (4−6)²) = √(4+4) = √8 ≈ 2.83
• The distances are the same, so either codeword can be chosen, but typically the first codeword (C1)
is selected.



Result:

• The input vector [3, 4] is encoded as C1 or simply the index 1 (representing the codeword in the
codebook).

Decoding:

• To decode, simply replace the index with the vector in the codebook corresponding to C1 = [1, 2].
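The nearest-codeword search and lookup-table decoding above can be sketched in Python (a minimal illustration with the two-codeword codebook from the example; the function names are ours):

```python
import math

def vq_encode(vector, codebook):
    """Return the index of the nearest codeword (Euclidean distance);
    ties resolve to the earlier codeword, as in the example."""
    def dist(v, c):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(v, c)))
    return min(range(len(codebook)), key=lambda i: dist(vector, codebook[i]))

def vq_decode(index, codebook):
    """Decoding is a simple table lookup."""
    return codebook[index]

codebook = [[1, 2], [5, 6]]           # C1 and C2 from the example
idx = vq_encode([3, 4], codebook)
print(idx, vq_decode(idx, codebook))  # 0 [1, 2]
```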

Advantages of Simple Vector Quantization:

• Efficient: The method reduces the size of the data by quantizing large vectors into smaller
codewords.
• Simple and easy to implement: The process involves only finding the closest codeword and
replacing the input vector with the codeword index.

Disadvantages of Simple Vector Quantization:

• Lossy compression: Because the data is quantized, some loss of information occurs during
encoding.
• Codebook design: The performance heavily depends on how well the codebook is designed. A
poorly designed codebook may lead to higher distortion.
• High computational cost: If the codebook has a large number of codewords, the encoding process
can become computationally expensive.

2. Vector Quantization with Error Term


In Vector Quantization with Error Term, the encoding process is modified to take into account the
quantization error and minimize the distortion (error) between the original input vector and the encoded
vector. Instead of using just a single codeword to represent the vector, we add an error term (or residual) to
account for the difference between the original vector and the chosen codeword.

Steps in Vector Quantization with Error Term:

1. Create a Codebook:
o As with simple VQ, a codebook of representative vectors is created. However, in this case,
the encoding process accounts for the quantization error.
2. Encoding Process with Error Term:
o Find the nearest codeword in the codebook as usual.
o Calculate the quantization error (or residual) as the difference between the input vector and
the chosen codeword.

E=X−C

where X is the input vector, and C is the selected codeword.

oThe encoded data consists of:


▪ The index of the selected codeword.
▪ The residual vector representing the error.
3. Decoding Process with Error Term:
o To decode, use the index to look up the codeword from the codebook.
o Add the residual vector (the error term) back to the codeword to approximate the original
vector.



Example of Vector Quantization with Error Term:

Assume a codebook with codewords C1 = [1, 2] and C2 = [5, 6] again. Let's say the input vector is [3,
4].

1. Calculate the distance from the input vector to the codewords:


o Distance to C1 = [1, 2] and C2 = [5, 6] is calculated the same way as in SVQ.
2. Suppose C1 is chosen (since the distances are equal).
3. Calculate the error term (residual):

E = [3, 4] − [1, 2] = [2, 2]

4. The encoded data now consists of:


o Index of the codeword (C1 = 1).
o Error vector [2, 2].
5. Decoding:
o Use the index 1 to get the codeword C1 = [1, 2].
o Add the error vector [2, 2] to C1: [1,2]+[2,2]=[3,4]
o This approximately reconstructs the original vector.
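The index-plus-residual scheme can be sketched in Python (illustrative only; in practice the residual is itself coarsely quantized, which is where the bit savings come from — storing it at full precision, as here, makes reconstruction exact but saves nothing):

```python
def nearest_codeword(vector, codebook):
    """Index of the codeword minimizing squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((x - y) ** 2 for x, y in zip(vector, codebook[i])))

def vq_encode_residual(vector, codebook):
    """Encode as (codeword index, residual E = X - C)."""
    idx = nearest_codeword(vector, codebook)
    residual = [x - c for x, c in zip(vector, codebook[idx])]
    return idx, residual

def vq_decode_residual(idx, residual, codebook):
    """Reconstruct by adding the residual back to the codeword."""
    return [c + e for c, e in zip(codebook[idx], residual)]

codebook = [[1, 2], [5, 6]]
idx, err = vq_encode_residual([3, 4], codebook)
print(idx, err)                                 # 0 [2, 2]
print(vq_decode_residual(idx, err, codebook))   # [3, 4]
```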

Advantages of Vector Quantization with Error Term:

• Reduced distortion: The error term helps reduce the quantization error, improving the
approximation of the original data.
• Efficient compression: By combining codebook lookup and error correction, the compression ratio
can be higher without significant quality loss.

Disadvantages of Vector Quantization with Error Term:

• Complexity: The encoding and decoding processes are more complex compared to simple vector
quantization.
• Requires error storage: The error term must be stored or transmitted along with the codeword
index, which can add overhead.
• Lossy compression: Like simple vector quantization, some loss of information occurs, but with less
noticeable distortion.

Conclusion

• Simple Vector Quantization (SVQ) is a straightforward and efficient technique for lossy data
compression. It maps input vectors to the closest codeword in a predefined codebook.
• Vector Quantization with Error Term takes the quantization error into account, improving the
quality of the approximation by adding a residual error vector to the codeword. This method helps
reduce distortion and achieves a better reconstruction of the original data.

Both techniques are commonly used in image compression, speech encoding, and audio compression,
among other applications. While SVQ is simpler and faster, vector quantization with an error term
provides better accuracy at the cost of increased complexity.


Recent Trends in Encryption and Data Compression Techniques


The fields of encryption and data compression have seen significant advancements in recent years. These
advancements aim to address the growing needs for data security, efficiency, and the handling of massive
amounts of data in modern applications such as cloud storage, big data analytics, and IoT (Internet of
Things). Below are some of the recent trends in these areas:

Recent Trends in Encryption Techniques

1. Quantum Encryption
Large-scale quantum computers running Shor's algorithm could break traditional public-key algorithms such as RSA and ECC (Elliptic Curve Cryptography). As a result, quantum-safe, or post-quantum, encryption is gaining
attention.

• Quantum Key Distribution (QKD): This encryption technique leverages the principles of quantum
mechanics to create secure communication channels. The most notable protocol is BB84, which
allows for the secure exchange of encryption keys.
• Lattice-Based Cryptography: A class of cryptographic algorithms that is believed to be resistant to
quantum attacks. Examples include NTRU and FrodoKEM.
• Code-Based Cryptography: Uses error-correcting codes to create cryptographic systems that are
resistant to quantum computers. McEliece is a prominent example.
• Multivariate Polynomial Cryptography: Based on the hardness of solving systems of multivariate
quadratic equations, this approach is another potential quantum-safe technique.
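The key-sifting step of BB84 can be simulated classically. In this sketch the function name `bb84_sift` and the coin-flip model of a mismatched-basis measurement are illustrative assumptions; it shows why Alice and Bob keep only the positions where their randomly chosen bases happen to agree.

```python
import secrets

def bb84_sift(n_bits=32):
    """Toy BB84 sifting (no eavesdropper, no real quantum channel): Alice sends
    random bits in random bases; Bob measures in random bases; both keep only
    the positions where the bases matched."""
    alice_bits  = [secrets.randbelow(2) for _ in range(n_bits)]
    alice_bases = [secrets.randbelow(2) for _ in range(n_bits)]  # 0 = rectilinear, 1 = diagonal
    bob_bases   = [secrets.randbelow(2) for _ in range(n_bits)]
    # With matching bases Bob reads the bit correctly; with mismatched bases his
    # result is random, and that position is discarded during sifting anyway.
    bob_bits = [b if ab == bb else secrets.randbelow(2)
                for b, ab, bb in zip(alice_bits, alice_bases, bob_bases)]
    return [(a, b) for a, b, ab, bb in zip(alice_bits, bob_bits, alice_bases, bob_bases)
            if ab == bb]

key_pairs = bb84_sift()
# On every kept position Alice's and Bob's bits agree, forming the shared key.
print(all(a == b for a, b in key_pairs))  # True
```

On average about half the positions survive sifting; in the real protocol a sample of the surviving bits is then compared publicly to detect an eavesdropper.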

2. Homomorphic Encryption (HE)


Homomorphic encryption allows computations to be performed on encrypted data without decrypting it first.
This technique is crucial for privacy-preserving computations in cloud environments, enabling data to
remain encrypted even while being processed. Recent advancements include:

• Fully Homomorphic Encryption (FHE): supports arbitrary computations on encrypted data, combining addition and multiplication an unlimited number of times.
• Partially Homomorphic Encryption (PHE): supports either addition or multiplication, but not both; Paillier (additive) and unpadded RSA (multiplicative) are classic examples.

Applications: Cloud computing, secure data sharing, privacy-preserving machine learning.
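As a concrete instance of partial homomorphism, unpadded ("textbook") RSA is multiplicatively homomorphic: multiplying two ciphertexts yields a valid ciphertext of the product of the plaintexts. The key below is absurdly small and the scheme insecure; this only demonstrates the algebraic property.

```python
# Toy RSA parameters -- for illustration only, never for real use.
p, q = 61, 53
n = p * q                  # modulus 3233
phi = (p - 1) * (q - 1)    # 3120
e = 17
d = pow(e, -1, phi)        # modular inverse (Python 3.8+)

def enc(m): return pow(m, e, n)
def dec(c): return pow(c, d, n)

m1, m2 = 7, 9
c_product = (enc(m1) * enc(m2)) % n      # computation on ciphertexts only
print(dec(c_product) == (m1 * m2) % n)   # True: decrypts to the product
```

Fully homomorphic schemes extend this idea so that both addition and multiplication (and hence arbitrary circuits) can be evaluated under encryption.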

3. Blockchain and Encryption


Blockchain technology continues to evolve, providing strong encryption and data integrity features in
decentralized applications.

• Zero-Knowledge Proofs (ZKPs): Used in blockchain to prove that a statement is true without
revealing any additional information. This is important for privacy and anonymity in blockchain
transactions.
• Advanced Cryptographic Hash Functions: New and more secure hash functions are emerging to
strengthen blockchain systems, such as SHA-3 and BLAKE2.
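A minimal flavour of a zero-knowledge proof is the Schnorr identification protocol: the prover demonstrates knowledge of a secret exponent x with y = g^x mod p without revealing x. The sketch below uses deliberately tiny parameters (p = 23, subgroup order q = 11); the function names and the lambda standing in for the verifier's challenge are illustrative assumptions, and real deployments use large primes or elliptic-curve groups.

```python
import secrets

p, q, g = 23, 11, 2      # g = 2 has order 11 modulo 23
x = 7                    # prover's secret
y = pow(g, x, p)         # public key

def prove(challenge_fn):
    r = secrets.randbelow(q)     # prover's random nonce
    t = pow(g, r, p)             # commitment sent first
    c = challenge_fn(t)          # verifier's random challenge
    s = (r + c * x) % q          # response (uses the secret)
    return t, c, s

def verify(t, c, s):
    # g^s = g^(r + c*x) = t * y^c, so equality holds iff the prover knows x.
    return pow(g, s, p) == (t * pow(y, c, p)) % p

t, c, s = prove(lambda t: secrets.randbelow(q))
print(verify(t, c, s))   # True
```

The transcript (t, c, s) reveals nothing about x on its own, which is the zero-knowledge property that blockchain systems exploit for private transactions.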

4. Post-Quantum Cryptography (PQC)


As quantum computers become a realistic threat, researchers are developing quantum-resistant algorithms. The NIST Post-Quantum Cryptography Standardization process has selected its first algorithms to replace current
public-key cryptosystems.

• Algorithms such as Kyber (standardized by NIST as ML-KEM for key encapsulation), together with alternatives like NTRU and FrodoKEM, are being adopted as viable replacements.
5. Advanced Secure Multi-Party Computation (SMPC)
This technique enables multiple parties to compute a function while keeping their inputs private. The
applications are widespread, particularly in data analytics and finance.

• Secure multiparty computation protocols, typically built on secret sharing or garbled circuits, let the parties evaluate a joint function over their combined inputs while each individual input stays private.
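Additive secret sharing, one of the simplest SMPC building blocks, can be sketched briefly. The field modulus and helper names below are assumptions for illustration: each party splits its input into random shares, the shares are added locally, and only the combined shares are ever published.

```python
import secrets

PRIME = 2_147_483_647  # field modulus (a Mersenne prime)

def share(value, n_parties):
    """Split value into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two parties jointly compute a sum without revealing their inputs: each
# splits its value, shares are added pairwise, and only the sums are revealed.
a_shares = share(1000, 3)
b_shares = share(234, 3)
sum_shares = [(sa + sb) % PRIME for sa, sb in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 1234
```

Any subset of fewer than all the shares is statistically independent of the secret, which is what keeps each party's input private.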

Recent Trends in Data Compression Techniques


1. Machine Learning and AI-Based Compression
The use of machine learning (ML) and artificial intelligence (AI) in data compression has opened up new
avenues for compression algorithms.

• Deep Learning for Compression: Neural networks, particularly autoencoders, are being used to
learn optimal compression techniques for image, video, and text data. Algorithms like Deep Image
Prior and Variational Autoencoders (VAEs) are improving compression efficiency in tasks like
image compression.
• Reinforcement Learning for Adaptive Compression: RL algorithms are used to adapt the
compression techniques dynamically based on the characteristics of the data.

2. Compression for Big Data and IoT


The explosion of IoT devices and the growth of big data have led to new compression methods optimized
for these environments.

• Edge Compression: Compressing data on edge devices before transmitting it to the cloud, reducing
bandwidth and storage requirements.
• Sparse Data Compression: Data from IoT sensors and large datasets often contain a lot of sparse or
redundant information. New algorithms focus on compressing sparse datasets more efficiently, using
techniques like run-length encoding, delta encoding, and Huffman coding.
• Distributed Data Compression: In distributed systems, compressing data across multiple nodes is
gaining attention to minimize network overhead and speed up data processing.
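Run-length and delta encoding, the two sparse-data techniques named above, can be sketched directly; the helper names are illustrative.

```python
def run_length_encode(data):
    """Collapse runs of repeated values into (value, count) pairs."""
    out = []
    for v in data:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

def delta_encode(samples):
    """Store the first sample, then successive differences -- effective for
    slowly changing IoT sensor readings."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

readings = [20, 20, 20, 20, 21, 21, 22]
print(run_length_encode(readings))  # [(20, 4), (21, 2), (22, 1)]
print(delta_encode(readings))       # [20, 0, 0, 0, 1, 0, 1]
```

The delta stream is dominated by small values, which a subsequent entropy coder such as Huffman coding can then compress very effectively.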

3. Video and Image Compression (HEVC & AV1)


Video and image compression continue to evolve with more efficient codecs and algorithms:

• HEVC (High-Efficiency Video Coding): Used for 4K and 8K video compression, it provides
significantly higher compression ratios than H.264 while maintaining the same video quality.
• AV1: A newer video compression standard that is open-source and royalty-free. AV1 promises better
compression efficiency than HEVC, especially for streaming applications.
• Deep Learning-based Compression: AI techniques like Generative Adversarial Networks
(GANs) are also being used to improve image and video compression, particularly in applications
requiring high visual fidelity.

4. Lossless Compression for Text and Data


While lossy compression techniques are becoming more prevalent, lossless compression remains essential
for certain applications, such as data storage and transmission where fidelity is critical.

• Zstandard (zstd): A fast compression algorithm, developed at Facebook, that provides a good balance between compression ratio and speed; it is widely used in storage systems, databases, and network protocols. Google's Brotli is a comparable modern algorithm aimed at web content compression.
• LZ4: Known for its fast compression and decompression speeds, LZ4 is widely used in database
compression and log file compression.
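Python's standard library ships neither Zstandard nor LZ4 bindings, so the sketch below uses the built-in `zlib` module as a stand-in to illustrate the same speed-versus-ratio dial that those algorithms expose through their compression levels.

```python
import zlib

# Repetitive log-style data, typical of the workloads zstd and LZ4 target.
data = b"timestamp=1700000000 level=INFO msg=ok\n" * 1000

fast  = zlib.compress(data, level=1)   # favors speed, LZ4's design point
small = zlib.compress(data, level=9)   # favors ratio, like zstd's high levels
print(len(data), len(fast), len(small))
```

Both outputs round-trip exactly through `zlib.decompress`, which is what makes the scheme lossless; the level only trades CPU time against output size.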



5. Hybrid Compression Techniques
Combining lossless and lossy techniques is becoming common, especially when dealing with complex
datasets (e.g., multimedia content, big data, or scientific data).

• Hybrid Video Compression: In many modern streaming applications, both lossy and lossless
compression methods are used. For example, lossy methods are used for initial compression,
followed by a lossless compression stage to further optimize storage.
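A hybrid pipeline can be sketched as a lossy quantization stage followed by a lossless stage; the grid step of 0.5 and the helper names below are illustrative assumptions.

```python
import struct
import zlib

def hybrid_compress(samples, step=0.5):
    """Lossy stage: snap floats to a coarse grid (information is discarded).
    Lossless stage: zlib-compress the quantized integers (exactly reversible)."""
    quantized = [round(s / step) for s in samples]   # lossy
    packed = struct.pack(f"{len(quantized)}i", *quantized)
    return zlib.compress(packed)                     # lossless

def hybrid_decompress(blob, n, step=0.5):
    ints = struct.unpack(f"{n}i", zlib.decompress(blob))
    return [i * step for i in ints]  # values recovered to within step/2

samples = [0.12, 0.49, 0.51, 1.02, 1.48]
blob = hybrid_compress(samples)
print(hybrid_decompress(blob, len(samples)))  # [0.0, 0.5, 0.5, 1.0, 1.5]
```

All of the loss happens in the quantization step; everything after it is exactly reversible, which mirrors how video codecs pair a lossy transform stage with a lossless entropy-coding stage.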

6. Web and Cloud Compression


With the increasing reliance on cloud-based storage and the web, there has been a shift toward client-side
compression and cloud-optimized compression techniques.

• Brotli Compression: Brotli is gaining traction as a successor to gzip and DEFLATE, offering better compression ratios for web content at comparable decompression speeds, especially over HTTPS.
• Cloud Compression Algorithms: Cloud service providers like AWS, Google Cloud, and Azure are
focusing on compression algorithms tailored to their specific environments, enabling more efficient
storage and faster data retrieval.

Summary of Key Trends

• Encryption is moving towards quantum-safe algorithms and privacy-preserving technologies, including homomorphic encryption and blockchain-based encryption.
• Machine learning is becoming central to data compression and encryption, leading to more
adaptive and efficient algorithms.
• Advances in video and image compression focus on new codecs (AV1, HEVC) and AI-based
approaches for maintaining high quality with reduced storage requirements.
• The need for compression in big data, IoT, and cloud computing is pushing forward edge
compression, distributed compression, and hybrid techniques to optimize both storage and
network bandwidth.
• The development of post-quantum cryptography ensures that encryption remains secure in the age
of quantum computing, while blockchain and zero-knowledge proofs are pushing the boundaries of
privacy and security in decentralized applications.

