Introduction to Information Retrieval
Lecture 5: Index Compression
Today
▪ Exercise (this refers to the Reuters RCV1 collection statistics table, not reproduced here): give intuitions for all the ‘0’ entries. Why do some zero entries correspond to big deltas in other columns?
Exercises
▪ What is the effect of including spelling errors, versus
automatically correcting them, on Heaps’
law?
▪ Compute the vocabulary size M for this scenario:
▪ Looking at a collection of web pages, you find that there
are 3000 different terms in the first 10,000 tokens and
30,000 different terms in the first 1,000,000 tokens.
▪ Assume a search engine indexes a total of 20,000,000,000
(2 × 1010) pages, containing 200 tokens on average
▪ What is the size of the vocabulary of the indexed collection
as predicted by Heaps’ law?
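The two data points above pin down Heaps’ law M = k·T^b, which can then be extrapolated to the full collection. A minimal sketch of the arithmetic in Python (the names k and b follow the usual statement of the law):

    # Heaps' law: M = k * T^b (M = vocabulary size, T = number of tokens).
    # Fit k and b from the two observations, then extrapolate.
    import math

    T1, M1 = 10_000, 3_000
    T2, M2 = 1_000_000, 30_000

    b = math.log(M2 / M1) / math.log(T2 / T1)  # = log 10 / log 100 = 0.5
    k = M1 / T1 ** b                           # = 3000 / 100 = 30

    T = 20_000_000_000 * 200                   # 4 * 10^12 tokens in total
    M = k * T ** b                             # = 30 * 2 * 10^6 = 60,000,000
    print(f"b = {b}, k = {k}, M = {M:,.0f}")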
Zipf’s law
▪ Heaps’ law gives the vocabulary size in collections.
▪ We also study the relative frequencies of terms.
▪ In natural language, there are a few very frequent
terms and very many very rare terms.
▪ Zipf’s law: The ith most frequent term has frequency
proportional to 1/i .
▪ cfi ∝ 1/i, i.e. cfi = K/i where K is a normalizing constant
▪ cfi is collection frequency: the number of
occurrences of the term ti in the collection.
Zipf consequences
▪ If the most frequent term (the) occurs cf1 times
▪ then the second most frequent term (of) occurs cf1/2 times
▪ the third most frequent term (and) occurs cf1/3 times …
▪ Equivalent: cfi = K/i where K is a normalizing factor, so
▪ log cfi = log K - log i
▪ Linear relationship between log cfi and log i
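A minimal numeric sketch of this linearity (K here is an arbitrary illustrative constant, not fit to any real collection):

    # Under Zipf's law cf_i = K / i, so log cf_i = log K - log i:
    # plotting log cf_i against log i gives a straight line of slope -1.
    import math

    K = 1_000_000  # hypothetical cf of the most frequent term
    for i in (1, 2, 3, 10, 100, 1000):
        cf = K / i
        print(i, int(cf), round(math.log10(cf), 2))
    # Each tenfold increase in rank i lowers log10(cf_i) by exactly 1.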
Compression
▪ Now, we will consider compressing the space for the dictionary and postings
▪ Basic Boolean index only
▪ No study of positional indexes, etc.
▪ We will consider compression schemes
DICTIONARY COMPRESSION
Dictionary stored as one long string (terms concatenated, with term pointers into the string):
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Blocking
▪ Store pointers to every kth term string.
▪ Example below: k=4.
▪ Need to store term lengths (1 extra byte per term)
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Net
▪ Example for block size k = 4
▪ Where we used 3 bytes/pointer without blocking (3 × 4 = 12 bytes per block), we now use 3 + 4 = 7 bytes per block: one 3-byte pointer plus four 1-byte term lengths.
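A small sketch of this bookkeeping for other block sizes (same assumptions as above: 3-byte term pointers, 1-byte term lengths):

    # Bytes needed per block of k terms, with and without blocking.
    def bytes_per_block(k, ptr_bytes=3, len_bytes=1):
        unblocked = ptr_bytes * k            # one pointer per term
        blocked = ptr_bytes + len_bytes * k  # one pointer + k length bytes
        return unblocked, blocked

    for k in (4, 8, 16):
        print(k, bytes_per_block(k))
    # k=4: (12, 7)   k=8: (24, 11)   k=16: (48, 19)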
Exercise
▪ Estimate the space usage (and savings compared to
7.6 MB) with blocking, for block sizes of k = 4, 8 and
16.
Dictionary search without blocking (binary search tree over 8 terms; tree figure not shown)
▪ Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons is (1 + 2∙2 + 4∙3 + 4)/8 ≈ 2.6: one term at depth 1, two at depth 2, four at depth 3, and one at depth 4.
Exercise
▪ Estimate the impact on search performance (and
slowdown compared to k=1) with blocking, for block
sizes of k = 4, 8 and 16.
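One way to set up the estimate is a direct simulation under a simple model (an assumption, not necessarily the slides’ exact setup): binary search over the first term of each block, then a linear scan within the block.

    # Average comparisons to find a term, averaged over all equally
    # likely queries, under the block-heads + linear-scan model.
    def avg_comparisons(n_terms, k):
        terms = list(range(n_terms))   # stand-ins for the sorted terms
        heads = terms[::k]             # first term of each block
        total = 0
        for q in terms:
            comps, lo, hi, block = 0, 0, len(heads) - 1, 0
            while lo <= hi:            # binary search over block heads
                mid = (lo + hi) // 2
                comps += 1
                if heads[mid] <= q:
                    block, lo = mid, mid + 1
                else:
                    hi = mid - 1
            for t in terms[block * k:(block + 1) * k]:  # scan the block
                comps += 1
                if t == q:
                    break
            total += comps
        return total / n_terms

    for k in (1, 4, 8, 16):
        print(k, round(avg_comparisons(32_768, k), 2))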
Front coding
▪ Sorted words commonly have a long common prefix – store differences only
▪ (for the last k−1 terms in a block of k)
8automata8automate9automatic10automation
→8automat*a1e2ic3ion
▪ Here * marks the end of the shared prefix; each remaining term is stored as its suffix length followed by the suffix (1e, 2ic, 3ion).
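A minimal front-coding sketch (an illustrative variant, not byte-identical to the format above: it stores, per term, the length of the prefix shared with the block’s first term plus the remaining suffix):

    def front_code(block):
        first = block[0]
        out = [(len(first), first)]          # first term stored in full
        for term in block[1:]:
            p = 0
            while p < min(len(term), len(first)) and term[p] == first[p]:
                p += 1
            out.append((p, term[p:]))        # (shared prefix len, suffix)
        return out

    print(front_code(["automata", "automate", "automatic", "automation"]))
    # [(8, 'automata'), (7, 'e'), (7, 'ic'), (7, 'ion')]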
POSTINGS COMPRESSION
Postings compression
▪ The postings file is much larger than the dictionary,
factor of at least 10.
▪ Key desideratum: store each posting compactly.
▪ A posting for our purposes is a docID.
▪ For Reuters (800,000 documents), we would use 32
bits per docID when using 4-byte integers.
▪ Alternatively, we can use log2 800,000 ≈ 20 bits per
docID.
▪ Our goal: use a lot less than 20 bits per docID.
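The 20-bit figure is just the base-2 logarithm rounded up, e.g.:

    import math
    print(math.ceil(math.log2(800_000)))  # 20 bits can number 800,000 docs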
Example
▪ Postings are stored as gaps from the previous docID, with the first docID kept in full: 829 − 824 = 5 and 215406 − 829 = 214577.

docIDs    824                  829        215406
gaps                           5          214577
VB code   00000110 10111000    10000101   00001101 00001100 10110001
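A sketch of VB encoding as used in the table above: each byte carries 7 bits of the gap, and the high bit is 0 on continuation bytes and 1 on the last byte of a code.

    def vb_encode_number(n):
        out = []
        while True:
            out.insert(0, n % 128)   # low 7 bits first, prepended
            if n < 128:
                break
            n //= 128
        out[-1] += 128               # set the high bit on the final byte
        return out

    for gap in (824, 5, 214577):
        print(gap, " ".join(f"{b:08b}" for b in vb_encode_number(gap)))
    # 824    -> 00000110 10111000
    # 5      -> 10000101
    # 214577 -> 00001101 00001100 10110001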
Unary code
▪ Represent n as n 1s with a final 0.
▪ Unary code for 3 is 1110.
▪ Unary code for 40 is
11111111111111111111111111111111111111110.
▪ Unary code for 80 is
111111111111111111111111111111111111111111111111111111111111111111111111111111110.
Gamma codes
▪ We can compress better with bit-level codes
▪ The Gamma code is the best known of these.
▪ Represent a gap G as a pair (length, offset)
▪ offset is G in binary, with the leading bit cut off
▪ For example 13 → 1101 → 101
▪ length is the length of offset
▪ For 13 (offset 101), this is 3.
▪ We encode length with unary code: 1110.
▪ Gamma code of 13 is the concatenation of length
and offset: 1110101
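A direct transcription of this definition (a sketch; gamma codes are defined for gaps G ≥ 1):

    def gamma_encode(G):
        assert G >= 1
        offset = bin(G)[3:]                 # binary of G minus the leading 1
        length = "1" * len(offset) + "0"    # unary code for len(offset)
        return length + offset

    print(gamma_encode(13))  # 1110101 (length 1110, offset 101)
    print(gamma_encode(1))   # 0       (empty offset, unary(0) = 0)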
RCV1 compression
Data structure Size in MB
dictionary, fixed-width 11.2
dictionary, term pointers into string 7.6
with blocking, k = 4 7.1
with blocking & front coding 5.9
collection (text, xml markup etc) 3,600.0
collection (text) 960.0
Term-doc incidence matrix 40,000.0
postings, uncompressed (32-bit words) 400.0
postings, uncompressed (20 bits) 250.0
postings, variable byte encoded 116.0
postings, γ-encoded 101.0