Lecture4 Compression
David Kauchak
cs160
Fall 2009
adapted from:
http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt
Administrative
Homework 2
Assignment 1
Assignment 2
Pair programming?
                 Size (K)   Δ%    cumul %
Unfiltered       484
No numbers       474        -2    -2
Case folding     392        -17   -19
30 stopwords     391        -0    -19
150 stopwords    391        -0    -19
stemming         322        -17   -33
                  terms   % change
none              120K
number folding    117K    3%
lowercasing       100K    17%
stemming          95K     25%
stoplist          120K    0%
                  97K     20%
all               78K     35%
Sizes are in thousands: terms (dictionary), non-positional postings (non-positional index), positional postings (positional index).

                 dictionary                  non-positional index        positional index
                 Size (K)  Δ%    cumul %     Size (K)  Δ%    cumul %     Size (K)  Δ%    cumul %
Unfiltered       484                         109,971                     197,879
No numbers       474       -2    -2          100,680   -8    -8          179,158   -9    -9
Case folding     392       -17   -19         96,969    -3    -12         179,158   -0    -9
30 stopwords     391       -0    -19         83,390    -14   -24         121,858   -31   -38
150 stopwords    391       -0    -19         67,002    -30   -39         94,517    -47   -52
stemming         322       -17   -33         63,812    -4    -42         94,517    -0    -52
Corpora statistics
statistic                   TDT     Reuters RCV1
documents                   16K     800K
avg. # of tokens per doc    400     200
terms                       100K    400K
non-positional postings     ?       100M
[Plot: vocabulary size vs. number of documents]
Heaps' law
Vocab size = k (tokens)^b
M = k T^b
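As a rough illustration of the formula, the sketch below evaluates M = k T^b for a given token count. The function name is hypothetical, and the default values k = 44 and b = 0.49 are an assumption: they are the fit that IIR chapter 5 reports for Reuters RCV1, and other collections will need their own parameters.

```python
def heaps_vocab_size(num_tokens, k=44.0, b=0.49):
    """Estimate vocabulary size M = k * T^b (Heaps' law).

    k and b are collection-dependent; the defaults are an assumed
    fit for Reuters RCV1 and are not universal constants.
    """
    return k * num_tokens ** b


if __name__ == "__main__":
    for tokens in (10_000, 1_000_000, 100_000_000):
        print(f"T = {tokens:>11,}  ->  M ~ {heaps_vocab_size(tokens):,.0f}")
```

Because b is below 1, the vocabulary keeps growing with the collection but ever more slowly, which is why the dictionary stays small relative to the postings.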
Discussion
word frequency
Zipf's law
frequency_i ∝ c / i
where c is a constant and i is the rank of the word by frequency
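A small sketch of what Zipf's law predicts, assuming an idealized collection in which frequency is exactly c / i. The constant c = 1000 and the function name are hypothetical, chosen only to make the numbers easy to read.

```python
def zipf_frequencies(c, num_ranks):
    """Predicted frequency of the i-th most common word: c / i."""
    return [c / i for i in range(1, num_ranks + 1)]


if __name__ == "__main__":
    freqs = zipf_frequencies(c=1000, num_ranks=5)
    for rank, freq in enumerate(freqs, start=1):
        # Under the idealized law, rank * frequency stays equal to c.
        print(f"rank {rank}: frequency {freq:7.1f}  rank*freq = {rank * freq:.0f}")
```

The consequence for compression is that a few terms have very long postings lists with small gaps, while most terms are rare and have large gaps.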
Index compression
Inverted index
word1
word2
wordn
20 bytes
4 bytes each
Any ideas?
Dictionary-as-a-String
Fixed-width
As a string
20% reduction!
Still a long way from 60%. Is there any way we can store fewer pointers?
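A minimal sketch of the dictionary-as-a-string idea: all sorted terms are concatenated into one string, and each dictionary entry keeps only an offset into it. The function names and the sample terms are illustrative assumptions; a real dictionary entry would also carry the frequency and the postings pointer.

```python
def build_string_dictionary(terms):
    """Return (one long string of sorted terms, start offset of each term)."""
    ordered = sorted(terms)
    offsets, position = [], 0
    for term in ordered:
        offsets.append(position)
        position += len(term)
    return "".join(ordered), offsets


def term_at(string, offsets, i):
    """A term ends where the next one starts (or at the end of the string)."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(string)
    return string[offsets[i]:end]


def lookup(string, offsets, target):
    """Binary search over the offsets; returns the term index or -1."""
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        term = term_at(string, offsets, mid)
        if term == target:
            return mid
        if term < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


if __name__ == "__main__":
    s, offs = build_string_dictionary(
        ["syzygy", "systile", "szomo", "syzygial", "syzygetic"])
    print(s, offs, lookup(s, offs, "syzygy"))
```

No per-term length is stored here, because the next term's offset marks where the current term ends.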
Blocking
Example below: k = 4
Save 9 bytes on 3 pointers.
Lose 4 bytes on term lengths.
Net: 9 - 4 = 5 bytes saved per block of 4 terms.
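The dictionary sizes quoted in these slides can be reproduced with back-of-the-envelope arithmetic. The sketch below is one way to do it under stated assumptions: 400K terms (the RCV1 figure above), 4 bytes each for the frequency and the postings pointer, 3-byte term pointers (consistent with saving 9 bytes on 3 pointers), an average term length of 8 characters, 1 byte per stored term length, and 1 MB = 10^6 bytes. The 8-character average and the constant names are assumptions, not values stated on the slide.

```python
NUM_TERMS = 400_000        # RCV1 vocabulary size
FREQ_BYTES = 4             # assumed: term frequency field
POSTINGS_PTR_BYTES = 4     # assumed: pointer to the postings list
FIXED_TERM_BYTES = 20      # fixed-width term field
TERM_PTR_BYTES = 3         # assumed: pointer into the term string
AVG_TERM_CHARS = 8         # assumed: average characters per term
LENGTH_BYTES = 1           # one length byte per term when blocking

# Fixed-width entries: every term takes the full 20-byte field.
fixed_width = NUM_TERMS * (FIXED_TERM_BYTES + FREQ_BYTES + POSTINGS_PTR_BYTES)

# Dictionary-as-a-string: a term pointer plus the characters actually used.
as_string = NUM_TERMS * (
    FREQ_BYTES + POSTINGS_PTR_BYTES + TERM_PTR_BYTES + AVG_TERM_CHARS)


def blocked_size(k):
    """Keep one term pointer per block of k terms, add a length byte per term."""
    blocks = NUM_TERMS // k
    saved = blocks * (k - 1) * TERM_PTR_BYTES
    lost = blocks * k * LENGTH_BYTES
    return as_string - saved + lost


if __name__ == "__main__":
    for label, size in [("fixed-width", fixed_width),
                        ("as a string", as_string),
                        ("blocking, k = 4", blocked_size(4))]:
        print(f"{label:16s} {size / 1e6:.1f} MB")
```

Larger k saves more pointer space but makes lookups slower, since each block has to be scanned linearly after the binary search.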
Binary search
Assuming each dictionary term is equally likely in a query (not really so in practice!), average number of comparisons = ?
(1 + 2·2 + 4·3 + 4)/8 ≈ 2.6
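The average can be checked by counting probes directly. The sketch below runs a standard binary search over eight sorted terms and averages the number of probes over all of them, counting one three-way comparison per probe. The eight sample terms and the function name are illustrative assumptions.

```python
def probes_to_find(sorted_terms, target):
    """Number of terms inspected by binary search before finding target."""
    lo, hi = 0, len(sorted_terms) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_terms[mid] == target:
            return probes
        if sorted_terms[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    raise KeyError(target)


terms = ["aid", "box", "den", "ex", "job", "ox", "pit", "win"]
probe_counts = [probes_to_find(terms, t) for t in terms]
average = sum(probe_counts) / len(terms)

if __name__ == "__main__":
    print(sorted(probe_counts))  # one term at depth 1, two at 2, four at 3, one at 4
    print(average)
```

The probe counts match the terms in the slide's sum: one term found in 1 step, two in 2, four in 3 and one in 4.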
More improvements
8automata8automate9automatic10automation
Front coding
Front-coding:
Sorted words commonly have a long common prefix, so store differences only
(for the last k-1 terms in a block of k)
8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion
"8automat*" encodes the shared prefix automat
1, 2, 3: extra length beyond automat
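A minimal sketch of front coding for a single block, assuming the notation of the automat example: the first term is written with its full length and a `*` after the shared prefix, and each later term is written as the length of its remaining suffix, a `◊` marker, and the suffix. The function name is hypothetical, and a real implementation would store lengths as bytes, not as decimal digits.

```python
import os


def front_code_block(terms):
    """Front-code one block of sorted terms that share a common prefix."""
    prefix = os.path.commonprefix(terms)  # longest character-wise common prefix
    first = terms[0]
    parts = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for term in terms[1:]:
        suffix = term[len(prefix):]
        parts.append(f"{len(suffix)}◊{suffix}")
    return "".join(parts)


if __name__ == "__main__":
    block = ["automata", "automate", "automatic", "automation"]
    print(front_code_block(block))
```

The saving comes from storing the prefix automat once per block instead of once per term.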
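Data structure                      Size in MB
dictionary, fixed-width             11.2
dictionary, as a string             7.6
with blocking, k = 4                7.1
with blocking and front coding      5.9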
Postings compression
computer: 33,47,154,159,202
14 = 47 - 33
107 = 154 - 47
5 = 159 - 154
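The gap computation above can be sketched in a few lines. The function names are hypothetical. The first entry is kept as the docID itself (a gap from 0) so that the list can be rebuilt exactly.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Replace each docID (sorted ascending) by its distance from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Running sum of the gaps restores the original docIDs."""
    return list(accumulate(gaps))


if __name__ == "__main__":
    postings = [33, 47, 154, 159, 202]   # postings list for "computer"
    gaps = to_gaps(postings)
    print(gaps)
    print(from_gaps(gaps) == postings)
```

Gaps are smaller than docIDs, especially for frequent terms, so they are a better input for variable-length codes.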
Fixed-width
Aim:
1 byte
2 bytes
00000001000001010000001001110001
?
VB codes
For each byte used, how many bits of the gap are
we storing?
Example
docIDs     824                  829        215406
gaps                            5          214577
VB code    00000110 10111000    10000101   00001101 00001100 10110001
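A minimal sketch of variable byte encoding that reproduces the example values: each byte carries 7 bits of the number, and the high bit is set to 1 only on the last byte of each number. The function names are hypothetical, and bytes are kept as plain integers for readability.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as a list of byte values."""
    out = []
    while True:
        out.insert(0, n % 128)   # low 7 bits go in front of what we have so far
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # continuation bit marks the final byte
    return out


def vb_encode(numbers):
    stream = []
    for n in numbers:
        stream.extend(vb_encode_number(n))
    return stream


def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:           # high bit 0: more bytes follow
            n = 128 * n + byte
        else:                    # high bit 1: last byte of this number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


if __name__ == "__main__":
    for value in (824, 5, 214577):
        print(value, " ".join(f"{b:08b}" for b in vb_encode_number(value)))
```

Each byte stores 7 bits of the gap, so a gap below 128 costs one byte and the example gap 214577 costs three.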
More codes
100000011000010100000100 11110001
Gamma codes
13 → 1101 → 101
17 → 10001 → 0001
50 → 110010 → 10010
For 13 (offset 101), the length is 3
For 17 (offset 0001), the length is 4
For 50 (offset 10010), the length is 5
Any ideas?
Unary code
number   length         offset        γ-code
0                                     none
1        0                            0
2        10             0             10,0
3        10             1             10,1
4        110            00            110,00
9        1110           001           1110,001
13       1110           101           1110,101
24       11110          1000          11110,1000
511      111111110      11111111      111111110,11111111
1025     11111111110    0000000001    11111111110,0000000001
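A minimal sketch of gamma encoding and decoding that follows the table: the offset is the binary form of the number with its leading 1 removed, and the length of the offset is written in unary (that many 1s, then a 0). The function names are hypothetical, and bits are kept as text strings for readability instead of being packed.

```python
def gamma_encode(n):
    """Gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    offset = bin(n)[3:]                  # drop the '0b' prefix and the leading 1
    length = "1" * len(offset) + "0"     # unary code for len(offset)
    return length + offset


def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":            # read the unary length
            length += 1
            i += 1
        i += 1                           # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        numbers.append(int("1" + offset, 2))  # restore the leading 1
    return numbers


if __name__ == "__main__":
    for value in (1, 2, 3, 4, 9, 13, 24, 511, 1025):
        print(value, gamma_encode(value))
```

No delimiters are needed between codes, because the unary length tells the decoder exactly how many offset bits follow.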
log2 (gap)
RCV1 compression
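Data structure                             Size in MB
dictionary, fixed-width                    11.2
dictionary, as a string                    7.6
with blocking, k = 4                       7.1
with blocking and front coding             5.9
collection (text, xml markup etc)          3,600.0
collection (text)                          960.0
term-document incidence matrix             40,000.0
postings, uncompressed (32-bit words)      400.0
postings, uncompressed (20 bits)           250.0
postings, variable byte encoded            116.0
postings, γ-encoded                        101.0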
Resources
IIR 5
F. Scholer, H.E. Williams and J. Zobel. 2002. Compression of Inverted Indexes For Fast Query Evaluation. Proc. ACM-SIGIR 2002.
V. N. Anh and A. Moffat. 2005. Inverted Index Compression Using Word-Aligned Binary Codes. Information Retrieval 8: 151-166.