Lecture 4 Index Compression

The document discusses different techniques for compressing postings lists in inverted indexes, including gap encoding, delta encoding, unary coding, binary coding, and variable byte encoding. It provides examples of how to encode and decode numbers using different variable-length encoding schemes. The goal is to use fewer bits to store more common numbers and gaps between document IDs to save space.

Index Compression

Lecture 4
Why compression (in general)?
• Use less disk space
– Save a little money; give users more space
• Keep more stuff in memory
– Increases speed
• Increase speed of data transfer from disk to memory
– [read compressed data | decompress] is faster than [read uncompressed data]
• Premise: decompression algorithms are fast
– True of the decompression algorithms we use
2
Postings compression
• A posting for our purposes is a docID.
• For Reuters (800,000 documents), we would
use 32 bits per docID when using 4-byte
integers.
• Alternatively, we can use log2 800,000 ≈ 20
bits per docID.
• Our goal: use far fewer than 20 bits per docID.

3
Compression Example
• Fixed-length encoding – clear how to decode, since the number of bits per codeword is fixed

Number   Fixed-length code (5 bits)
1        00001
2        00010
15       01111
17       10001
25       11001
31       11111

• Decode 01010011101011001101: split into 5-bit groups
– 01010, 01110, 10110, 01101
– 10, 14, 22, 13
4
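The fixed-width decoding step above can be sketched in a few lines of Python (a minimal illustration; the function name is my own, not from the slides):

```python
def fixed_decode(bits, width=5):
    """Split a bitstring into fixed-width groups and read each as binary."""
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]

print(fixed_decode("01010011101011001101"))  # [10, 14, 22, 13]
```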
Fixed Length Encoding
• How many total bits are required to encode 10 million numbers using 20 bits for each number?
• 200 million bits = 2×10⁸ bits
• What if most numbers are small?
• We waste many bits by storing small numbers in 20 bits.
5
Variable Length Encoding

Number   Variable-length code
1        1
2        10
10       1010
16       10000

• Decode 10110011 – without knowing where codewords end, many parses are possible:
– 1011, 0, 0, 11 = 11, 0, 0, 3
– 10, 1100, 11 = 2, 12, 3
– 101, 100, 11 = 5, 4, 3
– 10, 11, 0, 0, 11 = 2, 3, 0, 0, 3
6
Compression Example
• Decode 0101011101100
• Use an unambiguous code: a prefix-free code, in which no codeword is a prefix of another [code table shown on slide]
• which gives: 0, 1, 0, 3, 0, 2, 0
7
Postings: two conflicting forces
• A term like arachnocentric occurs in maybe
one doc out of a million – we would like to
store this posting using log2 1M ≈ 20 bits.
• A term like the occurs in virtually every doc, so
20 bits/posting ≈ 2MB is too expensive.

8
Gap encoding of postings file entries
• We store the list of docs containing a term in
increasing order of docID.
– computer: 33,47,154,159,202 …
• Consequence: it suffices to store gaps.
– 33,14,107,5,43 …
• Hope: most gaps can be encoded/stored with
far fewer than 20 bits.
– Especially for common words

9
Three postings entries

[figure showing three postings lists and their gaps – not reproduced]

10
Index compression
• Observations about postings files
– Instead of storing docIDs in postings, we store the gaps between docIDs, since they are ordered
– Zipf’s law again:
• The more frequent a word is, the smaller its gaps are
• The less frequent a word is, the shorter its postings list is
– This heavily biased distribution gives us a great opportunity for compression!
• Information theory: entropy measures compression difficulty.
11
Delta Encoding
• Word count data is a good candidate for compression
– many small numbers and few larger numbers
– encode small numbers with small codes
• Document numbers are less predictable
– but differences between numbers in an ordered list are smaller and more predictable
• Delta encoding:
– encoding differences between document numbers (d-gaps)
12
Delta Encoding
– Inverted list (without counts) [example shown on slide]
– Differences between adjacent numbers [example shown on slide]
– Differences for a high-frequency word are small and easier to compress
– Differences for a low-frequency word are large
13
Practice Question
• Delta encode the following numbers:
– 40, 45, 405, 411, 416
– → 40, 5, 360, 6, 5
• Decode the following delta-encoded numbers:
– 20, 10, 30, 4, 8
– → 20, 30, 60, 64, 72
14
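The practice question above can be checked with a short Python sketch (function names are my own; the logic is the d-gap scheme described on the slides):

```python
def delta_encode(doc_ids):
    """Keep the first docID; replace each later one with its gap
    from the previous docID."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    """A running sum over the gaps restores the original docIDs."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(delta_encode([40, 45, 405, 411, 416]))  # [40, 5, 360, 6, 5]
print(delta_decode([20, 10, 30, 4, 8]))       # [20, 30, 60, 64, 72]
```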
Variable length encoding
• Aim:
– For arachnocentric, we will use ~20 bits/gap entry.
– For the, we will use ~1 bit/gap entry.
• If the average gap for a term is G, we want to use ~log2 G bits/gap entry.
• Key challenge: encode every integer (gap) with about as few bits as needed for that integer.
• This requires a variable-length encoding.
• Variable-length codes achieve this by using short codes for small numbers.
15
Unary Codes
• Breaks between encoded numbers can occur after any bit position
• Unary code
– Encode k as k 1s followed by a 0
– The 0 at the end makes the code unambiguous
• Decode 1110111010110:
– 1110, 1110, 10, 110
– 3, 3, 1, 2
16
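The unary scheme above is simple enough to sketch directly (a minimal illustration; function names are my own):

```python
def unary_encode(k):
    """k ones followed by a terminating 0."""
    return "1" * k + "0"

def unary_decode(bits):
    """Count 1s in each run; each 0 terminates one codeword."""
    out, run = [], 0
    for b in bits:
        if b == "1":
            run += 1
        else:
            out.append(run)
            run = 0
    return out

print(unary_encode(3))                # 1110
print(unary_decode("1110111010110"))  # [3, 3, 1, 2]
```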
Unary Codes
• Unary is very efficient for small numbers such as 0 and 1, but quickly becomes very expensive
– 1023 can be represented in 10 binary bits, but requires 1024 bits in unary
17
Binary Codes
• Variable-length binary codes:
– Variable-length binary is more efficient for large numbers, but it may be ambiguous
– E.g. 2 encoded as 10 and 5 encoded as 101
– Given 10101, how do we decode? We don’t know the codeword boundaries
• Fixed-length binary codes:
– For example, use 15 bits to encode each number
– Cannot encode large numbers that require more than 15 bits
– Too much space is wasted when encoding small numbers
18
Elias-γ Code (Elias Gamma Code)
• Encode the number in the minimum number of bits in binary, preceded by the length of that code in unary
• 12
– 1100, 4 bits
– Length 4 in unary at the beginning: 11110
– Code of 12 = 111101100
• 23
– 10111, 5 bits
– Code of 23 = 11111010111
19
Elias-γ Code (Elias Gamma Code)
• Decode 111010111010
– Split into codewords: 1110 101, 110 10
– 1110101, 11010
– 5, 2
20
Elias-γ Code (Elias Gamma Code)
• Improvement: encode the number in the minimum number of bits in binary, but drop the leftmost bit, which is always 1
• 12
– 1100, 4 bits
– Leave out the leftmost 1 bit: 100, length 3 bits
– Code of 12 = 1110100
• 23
– 10111, 5 bits
– Leave out the leftmost 1 bit: 0111, 4 bits
– Code of 23 = 111100111
21
Elias-γ Code (Elias Gamma Code)
• Decode 111001111000
– 1110 011, 110 00
– 011 was actually 1011 = 11
– 00 was actually 100 = 4
– 11, 4
22
Elias-γ Code (Elias Gamma Code)
• To encode a number k, compute kd = ⌊log2 k⌋
– Since the leftmost bit is always 1 in the binary code, we do not encode it
– The remaining number becomes kr = k − 2^⌊log2 k⌋
– We use unary code for kd and binary code (in kd bits) for kr
23
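The kd/kr construction above can be sketched as follows (a minimal illustration of the slides' scheme; function names are my own):

```python
def gamma_encode(k):
    """Elias-gamma: unary code for kd = number of bits after the
    leading 1, then those kd bits. The leading 1 is implied."""
    b = bin(k)[2:]        # binary form without the '0b' prefix
    kd = len(b) - 1       # bits remaining after the implicit leading 1
    return "1" * kd + "0" + b[1:]

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        kd = 0
        while bits[i] == "1":   # read the unary length prefix
            kd += 1
            i += 1
        i += 1                  # skip the terminating 0
        # restore the implicit leading 1 before the kd remaining bits
        out.append(int("1" + bits[i:i + kd], 2))
        i += kd
    return out

print(gamma_encode(12))              # 1110100
print(gamma_encode(23))              # 111100111
print(gamma_decode("111001111000"))  # [11, 4]
```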
Elias-γ Code (Elias Gamma Code)
• To encode a number k, compute kd = ⌊log2 k⌋ and kr = k − 2^⌊log2 k⌋
– Unary code for kd and binary code for kr [worked table shown on slide]
24
Gamma seldom used in practice
• Machines have word boundaries – 8, 16, 32, 64 bits
– Operations that cross word boundaries are slower
• Compressing and manipulating at the granularity of bits can be too slow
• All modern practice is to use byte- or word-aligned codes
– Variable byte encoding is a faster, conceptually simpler compression scheme, with decent compression
25
Variable Byte (VB) codes
• For a gap value G, we want to use close to the
fewest bytes needed to hold log2 G bits
• Begin with one byte to store G and dedicate 1 bit
in it to be a continuation bit c
• If G ≤127, binary-encode it in the 7 available bits
and set c =1
• Else encode G’s lower-order 7 bits and then use
additional bytes to encode the higher order bits
using the same algorithm
• At the end set the continuation bit of the last
byte to 1 (c =1) – and for the other bytes c = 0.
26
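The algorithm above can be sketched in Python. This follows the convention used on the slides (low-order 7 bits in the first byte, continuation bit c = 1 marking the last byte of each number); some presentations flip the meaning of c, so treat this as one concrete reading. Function names are my own:

```python
def vbyte_encode(n):
    """Emit low-order 7-bit groups first; set the high (continuation)
    bit only on the final byte of the number."""
    out = []
    while n > 127:
        out.append(n & 0x7F)   # c = 0: more bytes follow
        n >>= 7
    out.append(n | 0x80)       # c = 1: final byte
    return bytes(out)

def vbyte_decode(data):
    nums, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:        # final byte of this number
            nums.append(n)
            n, shift = 0, 0
        else:
            shift += 7
    return nums

encoded = b"".join(vbyte_encode(g) for g in [80, 320, 31, 255])
print(" ".join(f"{b:08b}" for b in encoded))
# 11010000 01000000 10000010 10011111 01111111 10000001
print(vbyte_decode(encoded))  # [80, 320, 31, 255]
```

The printed bytes match the vByte(L) example on the next slide.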
Variable Byte Code

Consider the vByte representation of the postings list L = ⟨80, 400, 431, 686⟩:

Δ(L) = ⟨80, 320, 31, 255⟩

vByte(L) = 11010000 01000000 10000010 10011111 01111111 10000001

27
Example
docIDs    824                  829        215406
gaps                           5          214577
VB code   00000110 10111000    10000101   00001101 00001100 10110001
28
RCV1 compression

Data structure                             Size in MB
collection (text, xml markup etc)           3,600.0
collection (text)                             960.0
term-document incidence matrix             40,000.0
postings, uncompressed (32-bit words)         400.0
postings, uncompressed (20 bits)              250.0
postings, variable byte encoded               116.0
postings, γ-encoded                           101.0

Index and dictionary compression for Reuters-RCV1.
(Manning et al., Introduction to Information Retrieval)
29
Group Variable Integer code
• Used by Google around the turn of the millennium…
– Jeff Dean, keynote at WSDM 2009
– Encodes 4 integers in blocks of 5–17 bytes
• First byte: four 2-bit binary length fields
• L1 L2 L3 L4, with Lj ∈ {1, 2, 3, 4}
• Then L1+L2+L3+L4 bytes (between 4 and 16) hold the 4 numbers
– Each number can use 8/16/24/32 bits. Max gap length ~4 billion
• It was suggested that this was about twice as fast as VB encoding
– Decoding gaps is much simpler – no bit masking
– First byte can be decoded with a lookup table or switch
30
Group Variable Integer code

Consider the vByte representation of the postings list L = ⟨80, 400, 431, 686⟩:

Δ(L) = ⟨80, 320, 31, 255⟩

vByte(L) = 11010000 01000000 10000010 10011111 10000001

For the same postings list as before, the Group VarInt representation is:

GroupVarInt(L) = 00010000 01010000 01000000 00000001 00011111 11111111

31
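A sketch of the Group VarInt layout described above, assuming 2-bit fields storing (length − 1) with L1 in the high bits and little-endian number bytes; this reproduces the slide's example, but Google's exact production format is not specified here, and function names are my own:

```python
def group_varint_encode(nums):
    """Encode exactly 4 integers: one prefix byte holding four 2-bit
    (length - 1) fields, then each number in that many little-endian bytes."""
    assert len(nums) == 4
    prefix, body = 0, b""
    for n in nums:
        length = max(1, (n.bit_length() + 7) // 8)   # 1..4 bytes
        prefix = (prefix << 2) | (length - 1)
        body += n.to_bytes(length, "little")
    return bytes([prefix]) + body

def group_varint_decode(data):
    """Decode one group: read the four lengths from the prefix byte,
    then slice the body accordingly -- no per-bit masking needed."""
    lengths = [((data[0] >> s) & 3) + 1 for s in (6, 4, 2, 0)]
    nums, i = [], 1
    for length in lengths:
        nums.append(int.from_bytes(data[i:i + length], "little"))
        i += length
    return nums

encoded = group_varint_encode([80, 320, 31, 255])
print(" ".join(f"{b:08b}" for b in encoded))
# 00010000 01010000 01000000 00000001 00011111 11111111
print(group_varint_decode(encoded))  # [80, 320, 31, 255]
```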
Compression Techniques Effectiveness
(All data from the Gov2 collection)

Method         Decoding (ns per posting)   Cumulative overhead (decoding + disk I/O)
Gamma          12.81                       32.11 ns
vByte           4.34                       20.82 ns
Group VarInt    1.90                       19.85 ns

Chapter 6, Information Retrieval: Implementing and Evaluating Search Engines, by S. Büttcher, C. Clarke, and G. Cormack. 32
