
MODERN INFORMATION RETRIEVAL

DISTRIBUTED INDEXING AND INDEX COMPRESSION

UNIT – 3

Prepared by: SRIDHAR U
Outline

- Distributed Indexing
- Index Compression
  - Dictionary compression
  - Postings compression
Distributed indexing
Distributed indexing

- For web-scale indexing, we must use a distributed computing cluster.
- Individual machines are fault-prone: they can unpredictably slow down or fail.
- How do we exploit such a pool of machines?
Google data centers (2007 estimates; Gartner)

- Google data centers mainly contain commodity machines and are distributed around the world.
- Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007).
- Estimate: Google installs 100,000 servers each quarter.
  - Based on expenditures of 200–250 million dollars per year.
- This would be 10% of the computing capacity of the world!?!
Distributed indexing

- Maintain a master machine directing the indexing job – considered "safe".
- Break up indexing into sets of (parallel) tasks.
- The master machine assigns each task to an idle machine from a pool.
Parallel tasks

- We will use two sets of parallel tasks:
  - Parsers
  - Inverters
- Break the input document corpus into splits.
- Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).
Parsers

- The master assigns a split to an idle parser machine.
- A parser reads a document at a time and emits (term, doc) pairs.
- The parser writes the pairs into j partitions.
- Each partition is for a range of terms' first letters (e.g., a–f, g–p, q–z) – here j = 3.
- Now to complete the index inversion…
Inverters

- An inverter collects all (term, doc) pairs (= postings) for one term partition.
- It sorts them and writes them to postings lists (see the sketch below).
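A minimal single-process sketch of the two task types described above, assuming whitespace tokenization (the function names and the in-memory lists standing in for partition files are ours, not from any particular framework). In the real setting, the master hands each split to an idle parser machine and each term partition to an idle inverter machine.

from collections import defaultdict

# Term partitions by first letter, as on the slides: a-f, g-p, q-z (j = 3).
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]

def partition_of(term):
    """Index of the term partition a term falls into."""
    first = term[0]
    for i, (lo, hi) in enumerate(PARTITIONS):
        if lo <= first <= hi:
            return i
    return 0  # non-alphabetic terms: lump into the first partition

def parse(split):
    """Parser task: read each (docID, text) document in a split and
    emit (term, docID) pairs into j partitions."""
    partitions = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            partitions[partition_of(term)].append((term, doc_id))
    return partitions

def invert(pairs):
    """Inverter task: collect all (term, docID) pairs for one term
    partition, sort them, and build the postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)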
Data flow

[Figure: the master assigns splits to parsers; parsers write (term, doc) pairs into the term partitions a–f, g–p, q–z; inverters turn each partition into postings lists.]
Index Compression
Why compression? (in general)

- Keep more stuff in memory (increases speed via caching effects).
- Enables deployment on smaller/mobile devices.
- Increase the speed of data transfer from disk to memory:
  - [read compressed data and decompress] is faster than [read uncompressed data].
  - Premise: decompression algorithms are fast.
  - True of the decompression algorithms we use.
Why compression in information retrieval?

- First, we will consider space for the dictionary:
  - Make it small enough to keep in main memory.
- Then the postings:
  - Reduce the disk space needed and the time to read postings from disk.
  - Large search engines keep a significant part of the postings in memory.
  - (Each postings entry is a docID.)
Lossy vs. lossless compression

- Lossless compression: all information is preserved.
  - This is what we mostly do in IR.
- Lossy compression: discard some information.
  - Several of the preprocessing steps can be viewed as lossy compression: case folding, stop word removal, stemming, number elimination.
Model collection: The Reuters collection

[Table: Reuters collection statistics. The figures used in the rest of this unit are ~800,000 documents and ~400,000 distinct terms.]
Dictionary compression
Why dictionary compression?

- Recall the goal above: make the dictionary small enough to keep in main memory.
Recall: Dictionary as array of fixed-width entries

- Array of fixed-width entries:
  - ~400,000 terms; 28 bytes/term = 11.2 MB.

Term (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
a                 656,265           →
aachen            65                →
…                 …                 …
zulu              221               →

- This array is the dictionary search structure (searched, e.g., by binary search over the sorted terms).
Fixed-width entries are bad.

- Most of the bytes in the Term column are wasted – we allot 20 bytes even for 1-letter terms.
  - And we still can't handle supercalifragilisticexpialidocious.
- Written English averages ~4.5 characters per word.
  - Short words dominate token counts but not the average over types (terms).
- The average dictionary word in English is ~8 characters.
- How do we get to ~8 characters per dictionary term?
Dictionary as a string

- Store the dictionary as one (long) string of characters (see the sketch below):
  - The pointer to the next word marks the end of the current word.
  - Hope to save up to 60% of dictionary space.

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Freq.   Postings ptr.   Term ptr.
33      →               →
29      →               →
44      →               →
126     →               →

- Total string length = 400K × 8 B = 3.2 MB.
- Term pointers resolve 3.2M positions: log2(3.2M) ≈ 22 bits ≈ 3 bytes per pointer.
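A minimal sketch of this layout, assuming an in-memory Python string (class and method names are illustrative): terms are concatenated into one string, each entry keeps only an offset, and a term's end is simply the next entry's offset.

class StringDictionary:
    """Dictionary-as-a-string: one long string plus term pointers."""

    def __init__(self, sorted_terms):
        self.string = "".join(sorted_terms)
        self.starts = []                  # term pointer of each entry
        pos = 0
        for t in sorted_terms:
            self.starts.append(pos)
            pos += len(t)
        self.starts.append(pos)           # sentinel: end of last term

    def term(self, i):
        """Recover term i from its pointer and the next pointer."""
        return self.string[self.starts[i]:self.starts[i + 1]]

    def find(self, term):
        """Binary search, reconstructing terms on the fly."""
        lo, hi = 0, len(self.starts) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.term(mid) < term:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(self.starts) - 1 and self.term(lo) == term:
            return lo
        return -1

d = StringDictionary(["systile", "syzygetic", "syzygial", "syzygy"])
assert d.term(1) == "syzygetic" and d.find("syzygy") == 3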
Space for dictionary as a string

- 4 bytes per term for Freq.
- 4 bytes per term for the pointer to Postings.
- 3 bytes per term pointer.
- Avg. 8 bytes per term in the term string.
- 400K terms × 19 bytes ≈ 7.6 MB (against 11.2 MB for fixed width).
Dictionary as a string with blocking

- Store term pointers to every kth term string only.
  - Example below: k = 4.
- Need to store term lengths (1 extra byte per term), as shown in the sketch below.

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

Freq.   Postings ptr.   Term ptr.
33      →               → (one pointer per block of 4)
29      →
44      →
126     →

- Per block: save 9 bytes on 3 term pointers; lose 4 bytes on term lengths.
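A sketch of blocked storage with k = 4 under the same illustrative conventions as before (the length byte is modeled here with chr/ord on a Python string; a real index would use raw bytes): only every kth term keeps a pointer, and a lookup walks the length bytes within the block.

def block_encode(sorted_terms, k=4):
    """Build the term string with a 1-byte length before each term;
    keep a pointer only for every kth term (the block starts)."""
    string, block_ptrs = [], []
    for i, t in enumerate(sorted_terms):
        if i % k == 0:
            block_ptrs.append(sum(len(s) for s in string))
        string.append(chr(len(t)) + t)   # length byte + the term
    return "".join(string), block_ptrs

def block_get(string, block_ptrs, i, k=4):
    """Start at the block pointer and skip i % k terms to reach term i."""
    pos = block_ptrs[i // k]
    for _ in range(i % k):
        pos += 1 + ord(string[pos])      # skip length byte + term
    length = ord(string[pos])
    return string[pos + 1:pos + 1 + length]

s, ptrs = block_encode(["systile", "syzygetic", "syzygial", "syzygy",
                        "szaibelyite", "szczecin"])
assert block_get(s, ptrs, 2) == "syzygial"
assert block_get(s, ptrs, 4) == "szaibelyite"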
Space for dictionary as a string with blocking

- With k = 4: 9 bytes saved minus 4 bytes lost = 5 bytes saved per block of 4 terms.
- 400K terms / 4 per block × 5 bytes = 0.5 MB saved: 7.6 MB → 7.1 MB.
Front coding

- Front coding: sorted words commonly share a long common prefix – store differences only (for the last k − 1 terms in a block of k).

8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion

- The shared prefix automat is encoded once (the * marks its end); each following entry stores only its extra length beyond automat and the differing suffix (see the sketch below).
- Begins to resemble general string compression.
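A sketch of front coding for one block, reproducing the example above (the ◊ separator and * follow the slide's format; the helper name is ours):

import os

def front_code_block(terms):
    """Front-code one block: the first term is stored in full as
    <length><prefix>*<suffix>; each later term stores only its extra
    length beyond the common prefix and the differing suffix."""
    prefix = os.path.commonprefix(terms)
    first = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
    rest = "".join(f"{len(t) - len(prefix)}◊{t[len(prefix):]}"
                   for t in terms[1:])
    return first + rest

block = ["automata", "automate", "automatic", "automation"]
assert front_code_block(block) == "8automat*a1◊e2◊ic3◊ion"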
Dictionary compression for Reuters: Summary

Technique                               Size in MB
Fixed width                             11.2
String with pointers to every term      7.6
Blocking, k = 4                         7.1
Blocking + front coding                 5.9
Postings compression
Postings compression

- The postings file is much larger than the dictionary – by a factor of at least 10.
- Key desideratum: store each posting compactly.
- A posting, for our purposes, is a docID.
- For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
- Alternatively, we can use log2(800,000) ≈ 20 bits per docID.
- Our goal: use a lot less than 20 bits per docID.
Key idea: Store gaps instead of docIDs

- A postings list is sorted by ascending docID, so it is enough to store the first docID and then the gaps (differences) between successive docIDs.

Gap encoding

- Example: docIDs 824, 829, 215406 are stored as 824, 5, 214577 (see the sketch below).
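A minimal sketch of the docID/gap conversion (function names are ours; the numbers are the ones reused in the VB example later):

def to_gaps(doc_ids):
    """Ascending docIDs -> first docID followed by the gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Gaps -> docIDs, by prefix-summing."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

assert to_gaps([824, 829, 215406]) == [824, 5, 214577]
assert from_gaps([824, 5, 214577]) == [824, 829, 215406]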
Variable length encoding

- Aim:
  - For arachnocentric, we will use ~20 bits per gap entry.
  - For the, we will use ~1 bit per gap entry.
- If the average gap for a term is G, we want to use ~log2(G) bits per gap entry.
- Key challenge: encode every integer (gap) with about as few bits as needed for that integer.
- Variable length codes achieve this by using short codes for small numbers.
Variable byte (VB) code

- For a gap value G, use close to the fewest bytes needed to hold log2(G) bits.
- Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c.
- If G ≤ 127, binary-encode it in the 7 available bits and set c = 1.
- Else encode G's lower-order 7 bits and then use additional bytes to encode the higher-order bits with the same algorithm.
- At the end, set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0). (See the sketch below.)
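A sketch of the scheme just described (function names are ours; each gap is encoded base 128, most significant byte first, with the continuation bit set only on its last byte):

def vb_encode_number(g):
    """VB-encode one gap: 7 payload bits per byte; the continuation
    bit (high bit) is set on the last byte only."""
    out = []
    while True:
        out.insert(0, g % 128)
        if g < 128:
            break
        g //= 128
    out[-1] += 128                      # set continuation bit c = 1
    return bytes(out)

def vb_encode(gaps):
    return b"".join(vb_encode_number(g) for g in gaps)

def vb_decode(data):
    """Decode a concatenation of VB-encoded gaps."""
    gaps, n = [], 0
    for byte in data:
        if byte < 128:                  # c = 0: more bytes follow
            n = 128 * n + byte
        else:                           # c = 1: last byte of this gap
            gaps.append(128 * n + (byte - 128))
            n = 0
    return gaps

assert vb_encode([824]) == bytes([0b00000110, 0b10111000])
assert vb_decode(vb_encode([824, 5, 214577])) == [824, 5, 214577]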
VB code examples

docIDs     824                 829        215406
gaps                           5          214577
VB code    00000110 10111000   10000101   00001101 00001100 10110001

- Postings are stored as the byte concatenation 00000110 10111000 10000101 00001101 00001100 10110001.
- Key property: VB-encoded postings are uniquely prefix-decodable.
- For a small gap (5), VB uses a whole byte.
Other variable codes

- Instead of bytes, we can also use a different "unit of alignment": 32 bits (words), 16 bits, 4 bits (nibbles), etc.
- Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.
Gamma codes for gap encoding

- We can compress better with bit-level codes.
  - The gamma code is the best known of these.
- Represent a gap G as a pair (length, offset):
  - offset is G in binary, with the leading bit cut off.
    - For example, 13 → 1101 → 101.
  - length is the length of offset.
    - For 13 (offset 101), this is 3.
  - Encode length in unary code: 1110.
- The gamma code of 13 is the concatenation of length and offset: 1110101 (see the sketch below).
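A sketch of gamma encoding and decoding over bit strings (the bit stream is represented as a Python str of '0'/'1' for clarity; names are ours):

def gamma_encode(g):
    """Gamma code of g >= 1: unary(len(offset)) followed by offset,
    where offset is g in binary with the leading 1 removed."""
    assert g >= 1, "the gamma code cannot represent 0"
    offset = bin(g)[3:]                 # strip '0b' and the leading 1
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of gaps."""
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":           # unary part: count the 1s
            length += 1
            i += 1
        i += 1                          # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        gaps.append(int("1" + offset, 2) if offset else 1)
    return gaps

assert gamma_encode(13) == "1110101"    # as on the slide
assert gamma_encode(1) == "0"           # empty offset
assert gamma_decode("1110101" + "0" + "101") == [13, 1, 3]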
Gamma code examples

number   length        offset        gamma code
0        –             –             none
1        0                           0
2        10            0             10,0
3        10            1             10,1
4        110           00            110,00
9        1110          001           1110,001
13       1110          101           1110,101
24       11110         1000          11110,1000
511      111111110     11111111      111111110,11111111
1025     11111111110   0000000001    11111111110,0000000001
Compression of Reuters

Data structure                               Size in MB
dictionary, fixed-width                      11.2
dictionary, term pointers into string        7.6
  with blocking, k = 4                       7.1
  with blocking & front coding               5.9
collection (text, XML markup etc.)           3,600.0
collection (text)                            960.0
term-document incidence matrix               40,000.0
postings, uncompressed (32-bit words)        400.0
postings, uncompressed (20 bits)             250.0
postings, variable byte encoded              116.0
postings, gamma-encoded                      101.0
