Index Compression
Lecture 4
Why compression (in general)?
• Use less disk space
– Save a little money; give users more space
• Keep more stuff in memory
– Increases speed
• Increase speed of data transfer from disk to
memory
– [read compressed data | decompress] is faster than
[read uncompressed data]
• Premise: Decompression algorithms are fast
– True of the decompression algorithms we use
Postings compression
• A posting for our purposes is a docID.
• For Reuters (800,000 documents), we would
use 32 bits per docID when using 4-byte
integers.
• Alternatively, we can use log2 800,000 ≈ 20
bits per docID.
• Our goal: use far fewer than 20 bits per docID.
Compression Example
• Fixed-length encoding – decoding is unambiguous
since the number of bits per code is fixed

Number   Fixed-Length Code (5 bits)
1        00001
2        00010
15       01111
17       10001
25       11001
31       11111

• Decode 01010011101011001101:
01010, 01110, 10110, 01101 = 10, 14, 22, 13
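The fixed-width scheme above can be sketched in a few lines of Python (the 5-bit width matches the table; function names are illustrative):

```python
def fixed_encode(nums, width=5):
    """Concatenate each number's fixed-width binary representation."""
    return "".join(format(n, f"0{width}b") for n in nums)

def fixed_decode(bits, width=5):
    """Decoding is unambiguous: split into fixed-width chunks."""
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]

print(fixed_encode([10, 14, 22, 13]))        # 01010011101011001101
print(fixed_decode("01010011101011001101"))  # [10, 14, 22, 13]
```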
Fixed Length Encoding
• How many total bits are required for encoding
for 10 million numbers using 20 bits for each
number?
• 200 million bits = 2 × 10^8 bits
• What if most numbers are small ?
• We are wasting many bits by storing small
numbers in 20 bits.
Variable Length Encoding

Number   Variable-Length Code
1        1
2        10
10       1010
16       10000

• Decode 10110011 – several parses are possible:
• 1011, 0, 0, 11 = 11, 0, 0, 3
• 10, 1100, 11 = 2, 12, 3
• 101, 100, 11 = 5, 4, 3
• 10, 11, 0, 0, 11 = 2, 3, 0, 0, 3
Compression Example
• Decode 0101011101100
• Using an unambiguous (prefix-free) code,
the decoding is unique:
• 0, 1, 0, 3, 0, 2, 0
Postings: two conflicting forces
• A term like arachnocentric occurs in maybe
one doc out of a million – we would like to
store this posting using log2 1M ≈ 20 bits.
• A term like the occurs in virtually every doc, so
20 bits/posting × 800,000 docs ≈ 2 MB is too expensive.
Gap encoding of postings file entries
• We store the list of docs containing a term in
increasing order of docID.
– computer: 33,47,154,159,202 …
• Consequence: it suffices to store gaps.
– 33,14,107,5,43 …
• Hope: most gaps can be encoded/stored with
far fewer than 20 bits.
– Especially for common words
Three postings entries
[Figure omitted: example gap-encoded postings lists]
Index compression
• Observation of posting files
– Instead of storing docID in posting, we store gap
between docIDs, since they are ordered
– Zipf’s law again:
• The more frequent a word is, the smaller the gaps are
• The less frequent a word is, the shorter the posting list
is
– This heavily biased distribution gives us a great
opportunity for compression!
Information theory: entropy measures compression difficulty.
Delta Encoding
• Word count data is good candidate for
compression
– many small numbers and few larger numbers
– encode small numbers with small codes
• Document numbers are less predictable
– but differences between numbers in an ordered
list are smaller and more predictable
• Delta encoding:
– encoding differences between document numbers
(d-gaps)
Delta Encoding
– Inverted list (without counts): docIDs in increasing order
– Store the differences between adjacent numbers
– Differences for a high-frequency word are small and
easy to compress
– Differences for a low-frequency word are large
Practice Question
• Delta encode the following numbers:
40, 45, 405, 411, 416
→ 40, 5, 360, 6, 5
• Decode the following delta-encoded numbers:
20, 10, 30, 4, 8
→ 20, 30, 60, 64, 72
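A minimal sketch of delta encoding and decoding in Python, applied to the practice numbers above (function names are illustrative):

```python
def delta_encode(docids):
    """Keep the first docID; replace each later one by its gap."""
    return docids[:1] + [b - a for a, b in zip(docids, docids[1:])]

def delta_decode(gaps):
    """A running sum over the gaps recovers the original docIDs."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(delta_encode([40, 45, 405, 411, 416]))  # [40, 5, 360, 6, 5]
print(delta_decode([20, 10, 30, 4, 8]))       # [20, 30, 60, 64, 72]
```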
Variable length encoding
• Aim:
– For arachnocentric, we will use ~20 bits/gap entry.
– For the, we will use ~1 bit/gap entry.
• If the average gap for a term is G, we want to use
~log2 G bits/gap entry.
• Key challenge: encode every integer (gap) with
about as few bits as needed for that integer.
• This requires a variable length encoding.
• Variable length codes achieve this by using short
codes for small numbers.
Unary Codes
• Breaks between encoded numbers can occur
after any bit position
• Unary code
– Encode k by k 1s followed by 0
– 0 at end makes code unambiguous
1110111010110
1110, 1110, 10, 110
3, 3, 1, 2
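The unary scheme above can be sketched in Python; splitting on the 0 terminators makes decoding unambiguous (function names are illustrative):

```python
def unary_encode(k):
    """Encode k as k ones followed by a terminating zero."""
    return "1" * k + "0"

def unary_decode(bits):
    """Each run of 1s before a 0 terminator is one number."""
    return [len(run) for run in bits.split("0")[:-1]]

code = "".join(unary_encode(k) for k in [3, 3, 1, 2])
print(code)                # 1110111010110
print(unary_decode(code))  # [3, 3, 1, 2]
```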
Unary Codes
• Unary is very efficient for small numbers such
as 0 and 1, but quickly becomes very
expensive
– 1023 can be represented in 10 binary bits, but
requires 1024 bits in unary
Binary Codes
• Variable Length Binary Codes:
– Variable-length binary is more efficient for large
numbers, but it may be ambiguous
– E.g., 2 encoded as 10 and 5 encoded as 101
– Given 10101, we cannot decode it: the codeword
boundaries are unknown
• Fixed Length Binary Codes:
– For example, use 15 bits to encode each number
– Cannot encode large numbers that require more
than 15 bits
– Too much space is wasted encoding small numbers
Elias-γ Code (Elias Gamma Code)
• Encode the number in the minimum number of bits in binary
• 12
• 1100, 4 bits
• Encode the length of the code in unary at the beginning
• Code of 12 = 111101100
• 23
• 10111, 5 bits
• Code of 23 = 11111010111
Elias-γ Code (Elias Gamma Code)
• Decode 111010111010
• 1110 101, 110 10
• 5, 2
Elias-γ Code (Elias Gamma Code)
• Encode number in minimum bits in binary
• 12
• 1100, 4 bits
• Drop the leftmost 1 bit: 100, length 3 bits
• Code of 12 = 1110100
• 23
• 10111, 5 bits
• Drop the leftmost 1 bit: 0111, length 4 bits
• Code of 23 = 111100111
Elias-γ Code (Elias Gamma Code)
• Decode 111001111000
• 1110 011, 110 00
• 011 was actually 1011 = 11
• 00 was actually 100 = 4
• 11, 4
Elias-γ Code (Elias Gamma Code)
• To encode a number k:
– Let kd = ⌊log2 k⌋ (one less than the number of bits in k)
– Since the leftmost bit is always 1 in the binary code,
we do not encode it
– The remaining number becomes kr = k − 2^⌊log2 k⌋
– We use unary code for kd and a kd-bit binary code for kr
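A sketch of the γ codec in Python matching the formulas above, for k ≥ 1 (function names are illustrative):

```python
def gamma_encode(k):
    """Unary code for kd = floor(log2 k), then the kd low-order bits of k."""
    b = bin(k)[2:]                  # binary of k, leading 1 first
    kd = len(b) - 1                 # floor(log2 k)
    return "1" * kd + "0" + b[1:]   # drop the implicit leading 1

def gamma_decode(bits):
    """Read the unary length prefix, then restore the dropped leading 1."""
    out, i = [], 0
    while i < len(bits):
        kd = 0
        while bits[i] == "1":       # unary prefix: count the 1s
            kd, i = kd + 1, i + 1
        i += 1                      # skip the terminating 0
        out.append(int("1" + bits[i:i + kd], 2))
        i += kd
    return out

print(gamma_encode(12))              # 1110100
print(gamma_encode(23))              # 111100111
print(gamma_decode("111001111000"))  # [11, 4]
```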
Gamma seldom used in practice
• Machines have word boundaries – 8, 16, 32, 64
bits
– Operations that cross word boundaries are slower
• Compressing and manipulating at the granularity
of bits can be too slow
• All modern practice is to use byte or word aligned
codes
– Variable byte encoding is a faster, conceptually
simpler compression scheme, with decent
compression
Variable Byte (VB) codes
• For a gap value G, we want to use close to the
fewest bytes needed to hold log2 G bits
• Begin with one byte to store G and dedicate 1 bit
in it to be a continuation bit c
• If G ≤ 127, binary-encode it in the 7 available bits
and set c = 1
• Else encode G's lower-order 7 bits and then use
additional bytes to encode the higher-order bits
using the same algorithm
• At the end, set the continuation bit of the last
byte to 1 (c = 1) and, for the other bytes, c = 0
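A sketch of VB coding in Python, following the convention described above (lower-order 7 bits emitted first; the continuation bit flags the final byte of each number; function names are illustrative):

```python
def vb_encode(g):
    """Emit the low-order 7 bits first; c = 1 flags the final byte."""
    out = []
    while g >= 128:
        out.append(g % 128)   # c = 0: more bytes follow
        g //= 128
    out.append(g + 128)       # c = 1: last byte of this number
    return out

def vb_decode(byte_stream):
    """Accumulate 7-bit payloads until a byte with c = 1 ends the number."""
    nums, g, shift = [], 0, 0
    for b in byte_stream:
        g |= (b % 128) << shift
        shift += 7
        if b >= 128:
            nums.append(g)
            g, shift = 0, 0
    return nums

gaps = [80, 320, 31, 255]
stream = [b for g in gaps for b in vb_encode(g)]
print([f"{b:08b}" for b in stream])
# ['11010000', '01000000', '10000010', '10011111', '01111111', '10000001']
print(vb_decode(stream))  # [80, 320, 31, 255]
```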
Variable Byte Code
Consider the vByte representation of the postings list L = ⟨80, 400, 431, 686⟩:
Δ(L) = ⟨80, 320, 31, 255⟩
vByte(L) = 11010000 01000000 10000010 10011111 01111111 10000001
Example
docIDs    824                 829        215406
gaps                          5          214577
VB code   00000110 10111000   10000101   00001101 00001100 10110001
RCV1 compression
Data structure                           Size in MB
collection (text, xml markup etc.)       3,600.0
collection (text)                        960.0
term-doc incidence matrix                40,000.0
postings, uncompressed (32-bit words)    400.0
postings, uncompressed (20 bits)         250.0
postings, variable byte encoded          116.0
postings, γ-encoded                      101.0
Index and dictionary compression for Reuters-RCV1.
(Manning et al., Introduction to Information Retrieval)
Group Variable Integer code
• Used by Google around the turn of the millennium
– Jeff Dean, keynote at WSDM 2009
– Encodes 4 integers in blocks of 5–17 bytes
• First byte: four 2-bit binary length fields
• L1 L2 L3 L4, with Lj ∈ {1, 2, 3, 4}
• Then L1+L2+L3+L4 bytes (between 4 and 16) hold the 4 numbers
– Each number can use 8/16/24/32 bits; max gap length ~4 billion
• It was suggested that this was about twice as fast as VB encoding
– Decoding gaps is much simpler – no bit masking
– First byte can be decoded with a lookup table or switch
Group Variable Integer code
Consider the vByte representation of the postings list L = ⟨80, 400, 431, 686⟩:
Δ(L) = ⟨80, 320, 31, 255⟩
vByte(L) = 11010000 01000000 10000010 10011111 01111111 10000001
For the same postings list, the Group VarInt representation is:
GroupVarInt(L) = 00010000 01010000 01000000 00000001 00011111 11111111
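A sketch of the Group VarInt block format in Python, matching the example above (exactly 4 numbers per group, each stored in 1–4 little-endian bytes; function names are illustrative):

```python
def gvi_encode(four_nums):
    """One selector byte holding four 2-bit fields (length - 1),
    then each number in 1-4 little-endian bytes."""
    selector, body = 0, bytearray()
    for n in four_nums:
        nbytes = max(1, (n.bit_length() + 7) // 8)
        selector = (selector << 2) | (nbytes - 1)
        body += n.to_bytes(nbytes, "little")
    return bytes([selector]) + bytes(body)

def gvi_decode(block):
    """Read the four lengths from the selector byte, then the numbers."""
    lengths = [((block[0] >> s) & 3) + 1 for s in (6, 4, 2, 0)]
    nums, i = [], 1
    for ln in lengths:
        nums.append(int.from_bytes(block[i:i + ln], "little"))
        i += ln
    return nums

block = gvi_encode([80, 320, 31, 255])
print(" ".join(f"{b:08b}" for b in block))
# 00010000 01010000 01000000 00000001 00011111 11111111
print(gvi_decode(block))  # [80, 320, 31, 255]
```

Because the selector byte gives all four lengths up front, decoding needs no per-bit branching, which is the speed advantage the slide describes.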
Compression Techniques Effectiveness
(All Data from Gov2 Collection)

Method         Decoding (ns per position)   Cumulative overhead (decoding + disk I/O)
Gamma          12.81                        32.11 ns
vByte          4.34                         20.82 ns
Group VarInt   1.90                         19.85 ns

Chapter 6, Information Retrieval: Implementing and Evaluating
Search Engines, by S. Büttcher, C. Clarke, and G. Cormack.