
UNIT 3 SUPER-DETAILED NOTES — Data Compression & Dynamic Inverted Indices

1. General-Purpose Data Compression

 Definition: Data compression is the process of transforming data into a compact form such that it occupies less space and can be transmitted more efficiently, without losing the original information (in the case of lossless compression).

 Objectives:

o Reduce data storage requirements.

o Improve transmission efficiency.

o Optimize performance for both storage and retrieval.

 Types:

o Lossless Compression:

 Ensures that the original data can be perfectly reconstructed.

 Examples: ZIP (file compression), PNG (image compression), FLAC (audio compression).

o Lossy Compression:

 Irreversibly removes less-important information to achieve higher compression.

 Examples: JPEG (images), MP3 (audio), MPEG (video).

 Applications:

o Archival storage (backup systems, databases)

o Multimedia systems (images, audio, video streaming)

o Web services (reduced data transfer time)

o Scientific and medical imaging

2. Data Compression: Modeling and Coding

 Definition: A two-step process in data compression that aims to represent data compactly by exploiting statistical redundancies.

 Modeling:

o Identifies statistical patterns in the data.


o Determines the probability distribution of symbols.

o Example: In English text, letters like ‘e’ and ‘t’ appear more frequently.

 Coding:

o Assigns binary codes to symbols based on their probabilities.

o Frequent symbols → shorter codes

o Rare symbols → longer codes

 Steps:

1. Analyze data to create a statistical model.

2. Design a coding scheme based on the model (Huffman, Arithmetic, etc.).
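
The modeling step (step 1) can be sketched as a simple frequency count turned into a probability distribution. A minimal illustration in Python; the sample text is invented:

```python
from collections import Counter

text = "the theatre"
freq = Counter(text)                 # step 1a: count symbol occurrences
total = len(text)
# step 1b: turn counts into a probability model over the alphabet
model = {sym: count / total for sym, count in freq.items()}
# frequent symbols like 't' and 'e' get higher probability,
# which the coding step will reward with shorter codes
```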

3. Huffman Coding

 Definition: A lossless data compression algorithm that assigns variable-length codes to input characters, with shorter codes assigned to more frequent characters.

 Working:

1. Count frequency of each symbol.

2. Build priority queue of symbols based on frequency.

3. Combine least frequent nodes into a binary tree.

4. Assign binary codes to branches (0-left, 1-right).

 Advantages:

o Simple and widely used.

o Optimal when symbol probabilities are known.

 Disadvantages:

o Inefficient for small alphabets or skewed distributions, since every code must use a whole number of bits.

o Requires symbol table to decode.

 Applications:

o File compression (ZIP)

o Text compression

o Multimedia (JPEG, MP3)
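
The four working steps above can be sketched in Python using a priority queue. This is a minimal illustration (tie-breaking and the single-symbol edge case are handled simply), not a production encoder:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Step 1: count the frequency of each symbol.
    freq = Counter(text)
    # Step 2: build a priority queue keyed on frequency
    # (the index is a tie-breaker so tuples never compare trees).
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    # Step 3: repeatedly merge the two least frequent nodes into a tree.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    # Step 4: assign 0 to left branches and 1 to right branches.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or "0"   # single-symbol alphabet
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("abracadabra")
# 'a' (most frequent) receives the shortest code
```

With this input the encoded message needs 23 bits instead of the 88 bits of 8-bit ASCII.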


4. Arithmetic Coding

 Definition: A form of entropy encoding used in lossless data compression that represents an entire message as a single number between 0 and 1.

 Working:

1. Assign probability ranges to symbols.

2. Narrow interval for each symbol.

3. Final fractional number represents full message.

 Advantages:

o Better compression than Huffman in some cases.

o Handles fractional probabilities efficiently.

 Disadvantages:

o Computationally intensive.

o Sensitive to floating-point precision.

 Applications:

o JPEG 2000

o Multimedia file compression
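
The interval-narrowing steps above can be illustrated with a toy encoder. This sketch uses ordinary floating point (real implementations use scaled integer arithmetic precisely because of the precision issue noted under disadvantages); the message and probabilities are invented:

```python
def arithmetic_encode(message, probs):
    # Step 1: assign each symbol a sub-range of [0, 1)
    # proportional to its probability.
    ranges, c = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (c, c + p)
        c += p
    # Step 2: narrow the interval once per symbol in the message.
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        r_low, r_high = ranges[sym]
        high = low + span * r_high
        low = low + span * r_low
    # Step 3: any number inside [low, high) identifies the message.
    return (low + high) / 2

# "aab" with P(a)=0.6, P(b)=0.4 narrows to [0.216, 0.36)
x = arithmetic_encode("aab", {"a": 0.6, "b": 0.4})
```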

5. Symbolwise Text Compression

 Definition: Text compression methods that assign codes to individual symbols rather
than blocks of text.

 Key Methods:

o Huffman Coding

o Arithmetic Coding

 Importance:

o Essential for search engines, file storage, communication protocols.

6. Compressing Postings Lists (Inverted Index Compression)


 Definition: Postings lists store document IDs where terms occur. Compressing them
reduces storage and accelerates query processing.

Nonparametric Gap Compression

 Concept: Store gaps (difference between consecutive docIDs) instead of raw IDs.

 Techniques:

o Variable Byte Encoding

o Elias Gamma & Delta Coding

 Advantages: Simple, adaptive, good for small gaps.

 Disadvantages: Less efficient for large gaps.
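
Variable byte encoding, the first technique above, can be sketched as follows: each gap is split into 7-bit groups, and the high bit of a byte marks the end of a gap. The docIDs below are invented for illustration:

```python
def vbyte_encode(gaps):
    # Encode each gap as 7-bit groups, most significant group first;
    # the high bit set to 1 marks the final byte of a gap.
    out = bytearray()
    for g in gaps:
        groups = [g % 128]
        g //= 128
        while g > 0:
            groups.append(g % 128)
            g //= 128
        groups.reverse()            # most significant group first
        groups[-1] |= 0x80          # terminator flag on the last byte
        out.extend(groups)
    return bytes(out)

def vbyte_decode(data):
    gaps, g = [], 0
    for b in data:
        if b < 128:
            g = g * 128 + b                    # continuation byte
        else:
            gaps.append(g * 128 + (b & 0x7F))  # final byte of this gap
            g = 0
    return gaps

doc_ids = [3, 7, 21, 150, 1000]
# store gaps instead of raw IDs: [3, 4, 14, 129, 850]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = vbyte_encode(gaps)       # 7 bytes vs. 20 bytes of raw 32-bit IDs
decoded_gaps = vbyte_decode(encoded)
```

Small gaps fit in a single byte, which is why gap encoding and reordering (below) pay off.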

Parametric Gap Compression

 Concept: Encode gaps using statistical models assuming known distribution.

 Example: Golomb Coding (for geometric distribution)

 Advantages: Compact if model fits data well.

 Disadvantages: Less flexible; needs parameter tuning.
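
Golomb coding can be sketched for the special case where the parameter M is a power of two (the Golomb–Rice variant; general Golomb coding uses a truncated binary remainder instead). The gap value and parameter below are chosen for illustration:

```python
def rice_encode(n, k):
    # Golomb-Rice with M = 2**k: quotient q = n // M in unary
    # (q ones then a zero), remainder r = n % M in k binary bits.
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

# gap = 9 with k = 2 (M = 4): q = 2, r = 1 -> "110" + "01"
code = rice_encode(9, 2)
```

Choosing M close to the mean gap (tuned to the assumed geometric distribution) keeps the unary part short, which is the parameter tuning mentioned above.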

Context-Aware Compression Methods

 Concept: Tailor compression method based on local properties of postings list.

 Example: PForDelta (Partitioned Frame of Reference + Delta Encoding)

 Advantages: High decompression speed, hardware-friendly.

Index Compression for High Query Performance

 Techniques:

o Block-wise Compression (compress blocks instead of whole list)

o SIMD-friendly Methods (Single Instruction Multiple Data)

 Advantages:

o Faster decompression during queries.

o Balance between space and speed.

Compression Effectiveness

 Metric: Compression Ratio = Original Size / Compressed Size (a higher ratio means better compression).

 Goal: Achieve high compression ratio without sacrificing performance.


Decoding Performance

 Metric: Time to decompress compressed postings.

 Trade-Off: Higher compression often slows down decoding; balance needed.

Document Reordering

 Concept: Reorder documents to minimize gaps between document IDs.

 Techniques:

o Sorting by similarity, URL, frequency.

 Benefits:

o Smaller gaps → higher compression.

o Better cache locality → faster queries.

7. Dynamic Inverted Indices

 Definition: Index structures that can be efficiently updated with additions, deletions,
or modifications.

Incremental Index Updates

 Concept: Update index continuously without full rebuild.

 Techniques:

o Maintain auxiliary index for new documents.

o Periodically merge with main index.

 Advantages:

o Real-time updates possible.

o Reduced downtime.
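
The auxiliary-index strategy above can be sketched with a toy in-memory structure. All names and the merge policy here are illustrative, not a real search-engine API:

```python
class DynamicIndex:
    # New documents go into a small auxiliary index; it is folded
    # into the main index once it grows past a threshold.
    def __init__(self, merge_threshold=2):
        self.main = {}                 # term -> sorted list of docIDs
        self.aux = {}                  # recent postings, not yet merged
        self.merge_threshold = merge_threshold

    def add_document(self, doc_id, terms):
        for t in terms:
            self.aux.setdefault(t, []).append(doc_id)
        if len(self.aux) >= self.merge_threshold:
            self.merge()               # periodic merge, no full rebuild

    def merge(self):
        for t, postings in self.aux.items():
            self.main[t] = sorted(self.main.get(t, []) + postings)
        self.aux.clear()

    def lookup(self, term):
        # Until the next merge, queries must consult both indices.
        return sorted(self.main.get(term, []) + self.aux.get(term, []))

idx = DynamicIndex()
idx.add_document(1, ["data", "compression"])
idx.add_document(2, ["data", "index"])
```

Real systems merge on memory pressure rather than term count, but the query path (main plus auxiliary) is the same idea.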

Contiguous Inverted Lists

 Concept: Store postings contiguously in memory.

 Advantages:

o Faster sequential access.

o Better cache utilization.

 Disadvantages:
o Harder to update incrementally.

Noncontiguous Inverted Lists

 Concept: Store postings in linked segments.

 Advantages:

o Flexible for updates.

 Disadvantages:

o Slower query performance due to pointer chasing.

Document Deletions: Invalidation List

 Concept: Maintain a list of invalid (deleted) document IDs.

 Advantages:

o Avoids immediate expensive updates.

o Easy to exclude deleted docs during query.
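
The query-time exclusion works as a simple filter: deleted docIDs stay in the postings list until garbage collection, and queries skip them. A minimal sketch with invented docIDs:

```python
def filter_postings(postings, invalidated):
    # Deleted documents remain in the compressed postings list;
    # the query simply drops any docID found in the invalidation set.
    return [d for d in postings if d not in invalidated]

invalidated = {4, 9}                       # deleted documents
result = filter_postings([1, 4, 7, 9, 12], invalidated)
```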

Garbage Collection

 Concept: Periodically clean up deleted document data.

 Advantages:

o Frees space.

o Optimizes index performance.

Document Modifications

 Concept: A document is modified by deleting the old version and inserting the new one.

 Advantages:

o Maintains index consistency.

 Disadvantages:

o Slightly more overhead than direct modification.
