Indexing 1
Index Construction
Locations, dates, usernames, and other metadata are common search criteria,
especially in search functions of web and mobile applications.
When these fields contain text, they are ultimately stored using the same inverted list
structure.
Next, we’ll see how to compress inverted lists to reduce storage needs and
filesystem I/O.
Postings may be sorted from largest to smallest score in order to quickly find
the most relevant documents. This is especially useful when you want
approximately the best documents rather than exactly the best.
Indexing scores makes queries much faster, but gives less flexibility in updating
your retrieval function. It is particularly efficient for single-term queries.
For machine-learning-based retrieval, it's common to store per-term scores
such as BM25 as features.
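As a quick sketch of why this helps (the list layout and the top_k helper below are illustrative, not taken from any particular engine), a score-sorted posting list lets a single-term query stop after reading its first k entries:

# Sketch: postings for one term stored as (score, docid) pairs, already sorted
# by descending score ("impact-ordered"). For a single-term query, the first k
# postings are exactly the k best documents, so we can stop reading early.
postings = {
    "apple": [(3.2, 17), (2.9, 4), (1.1, 52), (0.4, 8)],
}

def top_k(term, k):
    """Return the k highest-scoring (score, docid) pairs for a single-term query."""
    return postings.get(term, [])[:k]

print(top_k("apple", 2))   # [(3.2, 17), (2.9, 4)]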
Fields and Extents
Some indexes have distinct fields with their own inverted lists. For instance,
an index of e-mails may contain fields for common e-mail headers (from,
subject, date, …).
Others store document regions such as the title or headers using extent lists.
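As a small illustrative sketch (the title_extents and term_positions structures below are hypothetical), an extent list records the word-position range a region such as the title covers, so a term match can be checked against the title without a separate title index:

# Sketch: an extent list maps each docid to the (start, end) word positions
# covered by a region such as the title; positional postings stay unchanged.
title_extents = {1: (0, 4), 2: (0, 7)}            # doc 1's title spans positions 0..3
term_positions = {"fish": {1: [2, 10], 2: [9]}}   # positional postings for "fish"

def in_title(term, docid):
    """Does the term occur inside the document's title extent?"""
    start, end = title_extents[docid]
    return any(start <= p < end for p in term_positions[term].get(docid, []))

print(in_title("fish", 1))   # True: position 2 falls inside the extent (0, 4)
print(in_title("fish", 2))   # False: position 9 falls outside the extent (0, 7)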
Index Schemas
As the information stored in an inverted
index grows more complex, it becomes
useful to represent it using some form of
schema.
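As a purely illustrative sketch of such a schema (every field name and option below is hypothetical, not from a particular engine), it can simply record, for each field, what its inverted lists store:

# Hypothetical schema: for each field, record what its inverted lists store.
# None of these names come from a particular engine; this is just a sketch.
schema = {
    "body":    {"store": ["docids", "term_counts", "positions"]},
    "title":   {"store": ["docids", "positions"], "extents": True},
    "from":    {"store": ["docids"]},                 # e-mail header fields
    "subject": {"store": ["docids", "positions"]},
    "date":    {"store": ["docids"], "type": "date"},
}

for field, spec in schema.items():
    print(field, spec)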
An indexing algorithm needs to address hardware limitations (e.g., memory usage), OS limitations
(the maximum number of files the filesystem can efficiently handle), and algorithmic concerns.
When considering whether your algorithm is sufficient, consider how it would perform on a
document collection a few orders of magnitude larger than it was designed for.
Updating Indexes
If each term's inverted list is stored in a separate file, updating the index is
straightforward: we simply merge the postings from the old and new index.
However, inverted lists are usually stored together in large files, so merging
postings in place is expensive.
Instead, we can run a search against both old and new indexes and merge the
result lists at search time. Once enough changes have accumulated, we can
merge the old and new indexes in a large batch.
Deleted documents are tracked on a delete list and filtered from results; if a
document is modified, we place its docid into the delete list and place the new
version in the new index.
There are ways to update live indexes efficiently, but it's often simpler to
write a new index, then redirect queries to the new index and delete the old one.
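A minimal sketch of the search-time merge described above (the dict-based result lists and the merge_results name are assumptions):

# Sketch: combine result lists from an old and a new index at query time.
# Docids on the delete list are dropped; a modified document's old docid is on
# the delete list, so only its re-indexed version is returned.
def merge_results(old_results, new_results, deleted):
    """old_results / new_results: {docid: score}; deleted: set of docids."""
    merged = {d: s for d, s in old_results.items() if d not in deleted}
    merged.update(new_results)                      # new index wins on overlap
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

old = {1: 2.0, 2: 1.5, 3: 0.9}
new = {2: 1.8, 4: 2.4}                              # doc 2 was modified and re-indexed
print(merge_results(old, new, deleted={2}))         # [(4, 2.4), (1, 2.0), (2, 1.8), (3, 0.9)]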
Compressing Indexes
The best any compression scheme can do depends on the entropy of the
probability distribution over the data. More random data is less compressible.
Huffman Codes approach the entropy limit (within one bit per symbol) and can be
built in linear time from sorted frequencies, so they are a common choice.
Other schemes can do better, generally by interpreting the
input sequence differently (e.g. encoding sequences of characters as if they
were a single input symbol – different distribution, different entropy limit).
• e.g., 25-50% of the size of the raw documents for TREC collections
with the Indri search engine
• Expected code length is approximately 5.4 bits per character, for a 32.8%
compression ratio.
• average code length on WSJ89 is 5.8 bits per character, for a 27.9% compression ratio
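As a toy illustration of the point above about re-interpreting the input sequence (the text and the pairing scheme here are made up), modeling character pairs instead of single characters changes the distribution and, with it, the entropy limit:

import math
from collections import Counter

def entropy(counts):
    """Entropy in bits of the distribution given by a Counter of frequencies."""
    total = sum(counts.values())
    return sum(c / total * math.log2(total / c) for c in counts.values())

text = "abababababab"
chars = Counter(text)                                           # single-character model
pairs = Counter(text[i:i + 2] for i in range(0, len(text), 2))  # character-pair model

print(entropy(chars))       # 1.0 bit per character
print(entropy(pairs) / 2)   # 0.0 bits per character: the pair model captures the pattern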
Restricted Variable-Length Codes: More Symbols
• Use more than 2 cases.
• 1xxx for the 2^3 = 8 most frequent symbols, and
• 0xxx1xxx for the next 2^6 = 64 symbols, and
• 0xxx0xxx1xxx for the next 2^9 = 512 symbols, and
• ...
• average code length on WSJ89 is 8.0 bits per symbol, for a 0.0%
compression ratio (!!).
• simple
• all but the basic method can handle any size dictionary
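A sketch of the "more symbols" scheme above (the symbol-to-rank mapping by frequency is assumed to be given): each 4-bit group carries three payload bits, and a leading 1 marks the final group.

# Sketch of the "more symbols" scheme: symbols are ranked by frequency; ranks
# 0-7 take one 4-bit group (1xxx), ranks 8-71 take two groups (0xxx1xxx),
# ranks 72-583 take three groups, and so on. A leading 1 marks the last group.
def encode_rank(rank):
    groups, offset = 1, 0
    while rank >= offset + (1 << (3 * groups)):    # find the tier this rank falls in
        offset += 1 << (3 * groups)
        groups += 1
    v = rank - offset                              # position within the tier
    bits = ""
    for i in reversed(range(groups)):
        flag = "1" if i == 0 else "0"
        bits += flag + format((v >> (3 * i)) & 0b111, "03b")
    return bits

print(encode_rank(0))    # '1000'      (most frequent symbol)
print(encode_rank(7))    # '1111'
print(encode_rank(8))    # '00001000'  (first symbol of the 64-symbol tier)
print(encode_rank(71))   # '01111111'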
Entropy and Compressibility
The entropy of a probability distribution is a measure of its randomness:

H(p) = Σ_x p(x) · log2(1/p(x))

The more probable a symbol is to occur, the smaller its code should be. By this
view, UTF-32 assumes a uniform distribution over all Unicode symbols; UTF-8
assumes ASCII characters are more common.
Huffman Codes achieve the best possible compression ratio when the distribution
is known and when no code can stand for multiple symbols.

Symbol   Probability   Code   p(x) · log2(1/p(x))
a        1/2           0      0.5
b        1/4           10     0.5
c        1/8           110    0.375
d        1/16          1110   0.25
e        1/16          1111   0.25

Plaintext: aedbbaae (64 bits in UTF-8)
Ciphertext: 0111111101010001111 (19 bits)
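To make the numbers concrete, a short computation with the distribution and code from the table above shows the expected code length meeting the entropy, 1.875 bits per symbol:

import math

# Distribution and Huffman code from the table above.
probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/16, "e": 1/16}
code  = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "1111"}

entropy = sum(p * math.log2(1 / p) for p in probs.values())
expected_len = sum(probs[s] * len(code[s]) for s in probs)
print(entropy, expected_len)    # 1.875 1.875: the code meets the entropy limit here

plaintext = "aedbbaae"
print("".join(code[c] for c in plaintext))   # 0111111101010001111 (19 bits vs. 64 in UTF-8)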
Building Huffman Codes
Huffman Codes are built using a binary tree which always joins the two least
probable remaining nodes.
• No need to store dictionary; encoder and decoder each know how to build it on the fly.
• decode references: decode the pointers with log(?) bits:
0|1, 1|01, 1|00, 1|011, 0|001, 1|011, 1|101, 0|0010, ?
• note that Σ_l c_l log c_l is roughly the Lempel-Ziv encoding length, so the
inequality compares nH with the LZ encoding, which is to say H with the LZ rate
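A minimal sketch of the construction described above, using Python's heapq (tie-breaking and the tree representation are implementation choices): repeatedly join the two least probable remaining nodes, then read each code off its root-to-leaf path.

import heapq

def huffman_code(probs):
    """Build a Huffman code for {symbol: probability}; returns {symbol: bitstring}."""
    # Heap entries are (probability, tie-breaker, tree); a tree is either a
    # symbol or a (left, right) pair. We always pop the two smallest entries.
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, count, (left, right)))
        count += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"    # single-symbol edge case
    walk(heap[0][2], "")
    return codes

print(huffman_code({"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/16, "e": 1/16}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '1110', 'e': '1111'} -- matches the table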
Bit-aligned Codes
Bit-aligned codes allow us to minimize the storage used to encode integers.
We can use just a few bits for small integers, and still represent arbitrarily large
numbers.
Next, we’ll see how to encode integers using a variable byte code, which is
more convenient for processing.
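One classic bit-aligned scheme with exactly this property is the Elias gamma code (used here only as an example; the text above does not commit to a particular code):

# Sketch of one bit-aligned code, Elias gamma: write the length of k's binary
# form in unary (as leading zeros), then the binary form itself. Small integers
# take few bits, and arbitrarily large integers remain encodable.
def gamma_encode(k):
    assert k >= 1
    binary = format(k, "b")                      # e.g. 9 -> '1001'
    return "0" * (len(binary) - 1) + binary

def gamma_decode(bits):
    zeros = len(bits) - len(bits.lstrip("0"))    # unary length prefix
    return int(bits[zeros:2 * zeros + 1], 2)

for k in (1, 5, 9, 1000):
    print(k, gamma_encode(k))                    # 1 '1', 5 '00101', 9 '0001001', ...
    assert gamma_decode(gamma_encode(k)) == k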
To further reduce the index size, we want to ensure that docids, positions, etc.
in our lists are small (for smaller encodings) and repetitive (for better
compression). We can do this by sorting the lists and encoding the difference,
or delta, between the current number and the last.

High-frequency words compress more easily: 1, 1, 2, 1, 5, 1, 4, 1, 1, 3, ...
Low-frequency words have larger deltas: 109, 3766, 453, 1867, 992, ...
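A sketch of delta encoding for a sorted docid list (the function names are illustrative); decoding simply re-accumulates the gaps:

# Sketch: store the gap between consecutive docids rather than the docids
# themselves. Sorted lists make every delta positive, and frequent terms
# produce many small, repetitive deltas.
def delta_encode(docids):
    prev, deltas = 0, []
    for d in sorted(docids):
        deltas.append(d - prev)
        prev = d
    return deltas

def delta_decode(deltas):
    total, docids = 0, []
    for gap in deltas:
        total += gap
        docids.append(total)
    return docids

docids = [109, 3875, 4328, 6195, 7187]
print(delta_encode(docids))                      # [109, 3766, 453, 1867, 992]
assert delta_decode(delta_encode(docids)) == docids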
Byte-Aligned Codes
In production systems, inverted lists are commonly stored using byte-aligned
codes for delta-encoded integer sequences.
k      bytes   binary      hex
1      1       1 0000001   81
6      1       1 0000110   86
127    1       1 1111111   FF

Example encoded sequence: 81 82 81 86 81 82 86 8B 01 B4 81 81 81
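A sketch of one common vbyte convention, consistent with the table above: seven payload bits per byte, high-order groups first, and the high bit set only on the final byte of each number.

# Sketch of a vbyte code: each byte carries seven payload bits; continuation
# bytes have a high bit of 0 and the final byte of each number has a high bit
# of 1, with high-order 7-bit groups written first.
def vbyte_encode(k):
    groups = []
    while True:
        groups.append(k & 0x7F)
        k >>= 7
        if k == 0:
            break
    groups.reverse()                             # most significant group first
    return bytes(groups[:-1]) + bytes([groups[-1] | 0x80])

def vbyte_decode(data):
    numbers, value = [], 0
    for byte in data:
        if byte & 0x80:                          # terminator byte: number is complete
            numbers.append((value << 7) | (byte & 0x7F))
            value = 0
        else:                                    # continuation byte
            value = (value << 7) | byte
    return numbers

print(vbyte_encode(1).hex(), vbyte_encode(6).hex(), vbyte_encode(127).hex(), vbyte_encode(180).hex())
# 81 86 ff 01b4
print(vbyte_decode(bytes.fromhex("81 82 81 86 81 82 86 8B 01 B4 81 81 81")))
# [1, 2, 1, 6, 1, 2, 6, 11, 180, 1, 1, 1]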
Alternative Codes
Although vbyte is often adequate, we can do better for high-performance
decoding.
Vbyte requires a conditional branch at every byte and a lot of bit shifting.