Lecture 4 Index Compression
Why compression (in general)?
• Use less disk space
– Save a little money; give users more space
• Keep more stuff in memory
– Increases speed
• Increase speed of data transfer from disk to
memory
– [read compressed data + decompress] is faster than
[read uncompressed data]
Compression Example
• Fixed-length encoding – it is clear how to decode,
since the number of bits per code is fixed

Number   Fixed-Length Code (5 bits)
1        00001
2        00010
15       01111
17       10001
25       11001
31       11111

• Decode 01010011101011001101:
01010, 01110, 10110, 01101
= 10, 14, 22, 13
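The fixed-length decode above can be sketched in a few lines of Python (decode_fixed is an illustrative helper name, not from the slides): split the bitstring into fixed-width chunks and read each as binary.

```python
# Decode a fixed-length encoded bitstring by cutting it into
# equal-width chunks and interpreting each chunk as binary.
def decode_fixed(bits, width=5):
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]

print(decode_fixed("01010011101011001101"))  # [10, 14, 22, 13]
```

Because every code has the same width, no delimiter is needed between numbers; that is exactly what makes the code unambiguous.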
Fixed Length Encoding
• How many total bits are required to encode
10 million numbers using 20 bits each?
• 200 million bits = 2×10^8 bits
• What if most numbers are small?
• We waste many bits by storing small
numbers in 20 bits.
Variable Length Encoding
• Decode 10110011
Compression Example
• Decode 0101011101100
• using an unambiguous code (the unary code: k 1s
followed by a 0): 0 | 10 | 10 | 1110 | 110 | 0
• which gives: 0, 1, 1, 3, 2, 0
Postings: two conflicting forces
• A term like arachnocentric occurs in maybe
one doc out of a million – we would like to
store this posting using log2 1M ≈ 20 bits.
• A term like the occurs in virtually every doc, so
20 bits/posting ≈ 2MB is too expensive.
Gap encoding of postings file entries
• We store the list of docs containing a term in
increasing order of docID.
– computer: 33,47,154,159,202 …
• Consequence: it suffices to store gaps.
– 33,14,107,5,43 …
• Hope: most gaps can be encoded/stored with
far fewer than 20 bits.
– Especially for common words
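The gap transformation above is a one-liner in each direction; a sketch in Python (to_gaps / from_gaps are illustrative names):

```python
# Convert a sorted docID list into gaps and back.
# The first entry is stored as-is; each later entry becomes
# the difference to its predecessor.
def to_gaps(docids):
    return [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]

def from_gaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g          # running sum restores the original docIDs
        out.append(total)
    return out

print(to_gaps([33, 47, 154, 159, 202]))   # [33, 14, 107, 5, 43]
print(from_gaps([33, 14, 107, 5, 43]))    # [33, 47, 154, 159, 202]
```

The transformation is lossless; its payoff comes only when the gaps are then stored with a variable-length code.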
Three postings entries
Index compression
• Observations about postings files
– Instead of storing docIDs in postings, we store the
gaps between docIDs, since they are ordered
– Zipf’s law again:
• The more frequent a word is, the smaller its gaps are
• The less frequent a word is, the shorter its postings
list is
– This heavily biased distribution gives us a great
opportunity for compression!
Information theory: entropy measures compression difficulty.
Delta Encoding
• Word count data is a good candidate for
compression
– many small numbers and few larger numbers
– encode small numbers with short codes
• Document numbers are less predictable
– but differences between numbers in an ordered
list are smaller and more predictable
• Delta encoding:
– encoding differences between document numbers
(d-gaps)
Delta Encoding
– Inverted list (without counts)
Practice Question
• Delta encode the following numbers:
• 40, 45, 405, 411, 416
• 40, 5, 360, 6, 5
Variable length encoding
• Aim:
– For arachnocentric, we will use ~20 bits/gap entry.
– For the, we will use ~1 bit/gap entry.
• If the average gap for a term is G, we want to use
~log2 G bits/gap entry.
• Key challenge: encode every integer (gap) with
about as few bits as needed for that integer.
• This requires a variable-length encoding
• Variable-length codes achieve this by using short
codes for small numbers
Unary Codes
• Breaks between encoded numbers can occur
after any bit position
• Unary code
– Encode k as k 1s followed by a 0
– The 0 at the end makes the code unambiguous
• 1110111010110 = 1110, 1110, 10, 110 → 3, 3, 1, 2
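A sketch of unary encoding and decoding in Python (unary_encode / unary_decode are illustrative names):

```python
# Unary code: k is written as k ones followed by a terminating zero.
def unary_encode(k):
    return "1" * k + "0"

def unary_decode(bits):
    out, run = [], 0
    for b in bits:
        if b == "1":
            run += 1        # still inside the current number
        else:
            out.append(run) # a 0 terminates one number
            run = 0
    return out

print(unary_decode("1110111010110"))  # [3, 3, 1, 2]
```

Unary is extremely cheap for very small numbers (0 takes one bit) but grows linearly, so it is only used on its own when values are expected to be tiny.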
Binary Codes
• Variable-length binary codes:
– Variable-length binary is more efficient for large
numbers, but it may be ambiguous
– E.g. 2 encoded as 10 and 5 encoded as 101
– Given 10101, we cannot decode: we don’t know where
the word boundaries are
• Fixed-length binary codes:
– For example, use 15 bits to encode each number
– Cannot encode large numbers that require more
than 15 bits
– Too much space is wasted encoding small numbers
Elias-γ Code (Elias Gamma Code)
• Encode the number in the minimum number of bits in binary
– 12 → 1100 (4 bits)
• Prepend the length of that code in unary (1s then a 0)
– Code of 12 = 11110 1100 = 111101100
• 23 → 10111 (5 bits)
– Code of 23 = 111110 10111 = 11111010111
Elias-γ Code (Elias Gamma Code)
• Decode 111010111010
– 1110 101 → 101 = 5
– 110 10 → 10 = 2
• Result: 5, 2
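The length-prefixed variant on the two slides above can be sketched as follows (encode_len_prefixed / decode_len_prefixed are illustrative names):

```python
# Length-prefixed code: the length of the binary representation
# written in unary (1s then a 0), followed by the full binary form.
def encode_len_prefixed(k):
    b = bin(k)[2:]
    return "1" * len(b) + "0" + b

def decode_len_prefixed(bits):
    out, i = [], 0
    while i < len(bits):
        n = 0
        while bits[i] == "1":      # unary: count 1s to get the binary length
            n += 1
            i += 1
        i += 1                     # skip the 0 terminator
        out.append(int(bits[i:i + n], 2))
        i += n
    return out

print(encode_len_prefixed(12))            # 111101100
print(decode_len_prefixed("111010111010"))  # [5, 2]
```

Note the leading bit of the binary part is always 1, which the refinement on the next slides exploits to save one bit per number.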
Elias-γ Code (Elias Gamma Code)
• Encode the number in the minimum number of bits in
binary, drop the leftmost 1 bit (it is always 1), and
prepend the remaining length in unary
• 12 → 1100 (4 bits) → 100 (3 bits, leading 1 dropped)
– Code of 12 = 1110 100 = 1110100
• 23 → 10111 (5 bits) → 0111 (4 bits, leading 1 dropped)
– Code of 23 = 11110 0111 = 111100111
Elias-γ Code (Elias Gamma Code)
• Decode 111001111000
– 1110 011 → 1011 = 11
– 110 00 → 100 = 4
• Result: 11, 4
Elias-γ Code (Elias Gamma Code)
• To encode a number k, compute
– kd = ⌊log2 k⌋ (the number of bits after the leading 1)
– kr = k − 2^kd (the remainder once the leading 1 is removed)
• The code is kd in unary (kd 1s and a 0), followed by
kr in binary written in kd bits
– e.g. k = 12: kd = 3, kr = 4 → 1110 100 = 1110100
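Under the slides' convention (unary written as 1s followed by a 0), the full Elias-γ scheme can be sketched in Python (gamma_encode / gamma_decode are illustrative names):

```python
# Elias-gamma: unary(kd) where kd = floor(log2 k), then the kd bits
# of k that follow its leading 1.
def gamma_encode(k):
    kd = k.bit_length() - 1          # kd = floor(log2 k)
    if kd == 0:
        return "0"                   # k = 1 has no residual bits
    kr = k - (1 << kd)               # k with its leading 1 removed
    return "1" * kd + "0" + format(kr, f"0{kd}b")

def gamma_decode(bits):
    out, i = [], 0
    while i < len(bits):
        kd = 0
        while bits[i] == "1":        # read the unary length prefix
            kd += 1
            i += 1
        i += 1                       # skip the 0 terminator
        kr = int(bits[i:i + kd], 2) if kd else 0
        out.append((1 << kd) + kr)   # restore the implicit leading 1
        i += kd
    return out

print(gamma_encode(12))              # 1110100
print(gamma_decode("111001111000"))  # [11, 4]
```

A number k costs 2⌊log2 k⌋ + 1 bits in total, close to the ~log2 G target discussed earlier.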
Gamma seldom used in practice
• Machines have word boundaries – 8, 16, 32, 64
bits
– Operations that cross word boundaries are slower
• Compressing and manipulating at the granularity
of bits can be too slow
Variable Byte (VB) codes
• For a gap value G, we want to use close to the
fewest bytes needed to hold log2 G bits
• Begin with one byte to store G and dedicate 1 bit
in it to be a continuation bit c
• If G ≤127, binary-encode it in the 7 available bits
and set c =1
• Else encode G’s lower-order 7 bits and then use
additional bytes to encode the higher order bits
using the same algorithm
• At the end set the continuation bit of the last
byte to 1 (c =1) – and for the other bytes c = 0.
Variable Byte Code
Consider the vByte representation of the postings list L = ⟨80, 400, 431, 686⟩:
Example
docIDs    824                 829        215406
gaps                          5          214577
VB code   00000110 10111000   10000101   00001101 00001100 10110001
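A sketch of the VB scheme described above in Python (vb_encode_number / vb_decode are illustrative names); it reproduces the codes in the table, printed as byte values rather than bit strings:

```python
# Variable Byte code: 7 payload bits per byte; the continuation
# bit (high bit) is 1 only on the last byte of each number.
def vb_encode_number(n):
    out = []
    while True:
        out.insert(0, n % 128)   # low-order 7 bits go in the last byte
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # set c = 1 on the final byte
    return out

def vb_decode(stream):
    out, n = [], 0
    for byte in stream:
        if byte < 128:
            n = n * 128 + byte             # continuation: accumulate
        else:
            out.append(n * 128 + byte - 128)  # last byte: emit number
            n = 0
    return out

print(vb_encode_number(824))  # [6, 184] = 00000110 10111000
print(vb_decode(vb_encode_number(5) + vb_encode_number(214577)))  # [5, 214577]
```

Working a byte at a time sacrifices a little compression compared with bit-level codes such as γ, but keeps encoding and decoding fast on real machines.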
RCV1 compression
Data structure Size in MB
Group Variable Integer code
For the same postings list as before, L = ⟨80, 400, 431, 686⟩, the
Group VarInt representation is:
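Group VarInt replaces per-byte continuation bits with a single tag byte whose four 2-bit fields give the byte lengths (1 to 4) of the next four values. The slides do not give the exact byte layout, so the sketch below makes common assumptions (little-endian value bytes, values below 2^32); gvi_encode_group / gvi_decode_group are illustrative names.

```python
# Group VarInt (sketch): one tag byte with four 2-bit length fields
# (00 = 1 byte ... 11 = 4 bytes), then the four values packed in
# that many little-endian bytes each.
def gvi_encode_group(nums):              # exactly four ints < 2**32
    tag, body = 0, b""
    for n in nums:
        nbytes = max(1, (n.bit_length() + 7) // 8)
        tag = (tag << 2) | (nbytes - 1)  # record this value's length
        body += n.to_bytes(nbytes, "little")
    return bytes([tag]) + body

def gvi_decode_group(data):
    tag, pos, out = data[0], 1, []
    for shift in (6, 4, 2, 0):           # read the four length fields
        nbytes = ((tag >> shift) & 3) + 1
        out.append(int.from_bytes(data[pos:pos + nbytes], "little"))
        pos += nbytes
    return out

gaps = [80, 320, 31, 255]                # d-gaps of L = <80, 400, 431, 686>
print(gvi_decode_group(gvi_encode_group(gaps)))  # [80, 320, 31, 255]
```

Because all four lengths are known after reading one tag byte, a decoder avoids the per-byte branch that VB needs, which is why this style of code tends to decode faster in practice.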
Compression Techniques Effectiveness
(All Data from Gov2 Collection)