
MODERN INFORMATION RETRIEVAL

DISTRIBUTED INDEXING AND INDEX COMPRESSION

UNIT – 3

Prepared by: SRIDHAR U
Outline

- Distributed Indexing
- Index Compression
  - Dictionary compression
  - Postings compression
Distributed indexing
Distributed indexing

- For web-scale indexing, we must use a distributed computing cluster.
- Individual machines are fault-prone: they can unpredictably slow down or fail.
- How do we exploit such a pool of machines?
Google data centers (2007 estimates; Gartner)

- Google data centers mainly contain commodity machines and are distributed around the world.
- Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007).
- Estimate: Google installs 100,000 servers each quarter.
  - Based on expenditures of 200–250 million dollars per year.
- This would be 10% of the computing capacity of the world!?!
Distributed indexing

- Maintain a master machine directing the indexing job – considered "safe".
- Break up indexing into sets of (parallel) tasks.
- The master machine assigns each task to an idle machine from a pool.
Parallel tasks

- We will use two sets of parallel tasks:
  - Parsers
  - Inverters
- Break the input document corpus into splits.
- Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).
Parsers

- The master assigns a split to an idle parser machine.
- A parser reads a document at a time and emits (term, doc) pairs.
- The parser writes the pairs into j partitions.
- Each partition is for a range of terms' first letters (e.g., a–f, g–p, q–z) – here j = 3.
- Now to complete the index inversion…
Inverters

- An inverter collects all (term, doc) pairs (= postings) for one term partition.
- It sorts them and writes them to postings lists (see the sketch below).
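A minimal single-process sketch of the two task types described above, assuming whitespace tokenization (the function names and the in-memory lists standing in for partition files are ours, not from any particular framework). In the real setting, the master hands each split to an idle parser machine and each term partition to an idle inverter machine.

from collections import defaultdict

# Term partitions by first letter, as on the slides: a-f, g-p, q-z (j = 3).
PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]

def partition_of(term):
    """Index of the term partition a term falls into."""
    first = term[0]
    for i, (lo, hi) in enumerate(PARTITIONS):
        if lo <= first <= hi:
            return i
    return 0  # non-alphabetic terms: lump into the first partition

def parse(split):
    """Parser task: read each (docID, text) document in a split and
    emit (term, docID) pairs into j partitions."""
    partitions = [[] for _ in PARTITIONS]
    for doc_id, text in split:
        for term in text.lower().split():
            partitions[partition_of(term)].append((term, doc_id))
    return partitions

def invert(pairs):
    """Inverter task: collect all (term, docID) pairs for one term
    partition, sort them, and build the postings lists."""
    postings = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return dict(postings)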
Data flow

[Figure: the master assigns splits to parsers; parsers write (term, doc) pairs into the term partitions a–f, g–p, q–z; inverters turn each partition into postings lists.]
Index Compression
Why compression? (in general)

- Keep more stuff in memory (increases speed via caching effects).
- Enables deployment on smaller/mobile devices.
- Increase the speed of data transfer from disk to memory:
  - [read compressed data and decompress] is faster than [read uncompressed data].
  - Premise: decompression algorithms are fast.
  - True of the decompression algorithms we use.
Why compression in information retrieval?

- First, we will consider space for the dictionary:
  - Make it small enough to keep in main memory.
- Then the postings:
  - Reduce the disk space needed and the time to read postings from disk.
  - Large search engines keep a significant part of the postings in memory.
  - (Each postings entry is a docID.)
Lossy vs. lossless compression

- Lossless compression: all information is preserved.
  - This is what we mostly do in IR.
- Lossy compression: discard some information.
  - Several of the preprocessing steps can be viewed as lossy compression: case folding, stop word removal, stemming, number elimination.
Model collection: The Reuters collection

[Table: Reuters collection statistics. The figures used in the rest of this unit are ~800,000 documents and ~400,000 distinct terms.]
Dictionary compression
Why dictionary compression?

- Recall the goal above: make the dictionary small enough to keep in main memory.
Recall: Dictionary as array of fixed-width entries

- Array of fixed-width entries:
  - ~400,000 terms; 28 bytes/term = 11.2 MB.

Term (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
a                 656,265           →
aachen            65                →
…                 …                 …
zulu              221               →

- This array is the dictionary search structure (searched, e.g., by binary search over the sorted terms).
Fixed-width entries are bad.

- Most of the bytes in the Term column are wasted – we allot 20 bytes even for 1-letter terms.
  - And we still can't handle supercalifragilisticexpialidocious.
- Written English averages ~4.5 characters per word.
  - Short words dominate token counts but not the average over types (terms).
- The average dictionary word in English is ~8 characters.
- How do we get to ~8 characters per dictionary term?
Dictionary as a string

- Store the dictionary as one (long) string of characters (see the sketch below):
  - The pointer to the next word marks the end of the current word.
  - Hope to save up to 60% of dictionary space.

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Freq.   Postings ptr.   Term ptr.
33      →               →
29      →               →
44      →               →
126     →               →

- Total string length = 400K × 8 B = 3.2 MB.
- Term pointers resolve 3.2M positions: log2(3.2M) ≈ 22 bits ≈ 3 bytes per pointer.
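A minimal sketch of this layout, assuming an in-memory Python string (class and method names are illustrative): terms are concatenated into one string, each entry keeps only an offset, and a term's end is simply the next entry's offset.

class StringDictionary:
    """Dictionary-as-a-string: one long string plus term pointers."""

    def __init__(self, sorted_terms):
        self.string = "".join(sorted_terms)
        self.starts = []                  # term pointer of each entry
        pos = 0
        for t in sorted_terms:
            self.starts.append(pos)
            pos += len(t)
        self.starts.append(pos)           # sentinel: end of last term

    def term(self, i):
        """Recover term i from its pointer and the next pointer."""
        return self.string[self.starts[i]:self.starts[i + 1]]

    def find(self, term):
        """Binary search, reconstructing terms on the fly."""
        lo, hi = 0, len(self.starts) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.term(mid) < term:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(self.starts) - 1 and self.term(lo) == term:
            return lo
        return -1

d = StringDictionary(["systile", "syzygetic", "syzygial", "syzygy"])
assert d.term(1) == "syzygetic" and d.find("syzygy") == 3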
Space for dictionary as a string

- 4 bytes per term for Freq.
- 4 bytes per term for the pointer to Postings.
- 3 bytes per term pointer.
- Avg. 8 bytes per term in the term string.
- 400K terms × 19 bytes ≈ 7.6 MB (against 11.2 MB for fixed width).
Dictionary as a string with blocking

- Store term pointers to every kth term string only.
  - Example below: k = 4.
- Need to store term lengths (1 extra byte per term), as shown in the sketch below.

….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

Freq.   Postings ptr.   Term ptr.
33      →               → (one pointer per block of 4)
29      →
44      →
126     →

- Per block: save 9 bytes on 3 term pointers; lose 4 bytes on term lengths.
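A sketch of blocked storage with k = 4 under the same illustrative conventions as before (the length byte is modeled here with chr/ord on a Python string; a real index would use raw bytes): only every kth term keeps a pointer, and a lookup walks the length bytes within the block.

def block_encode(sorted_terms, k=4):
    """Build the term string with a 1-byte length before each term;
    keep a pointer only for every kth term (the block starts)."""
    string, block_ptrs = [], []
    for i, t in enumerate(sorted_terms):
        if i % k == 0:
            block_ptrs.append(sum(len(s) for s in string))
        string.append(chr(len(t)) + t)   # length byte + the term
    return "".join(string), block_ptrs

def block_get(string, block_ptrs, i, k=4):
    """Start at the block pointer and skip i % k terms to reach term i."""
    pos = block_ptrs[i // k]
    for _ in range(i % k):
        pos += 1 + ord(string[pos])      # skip length byte + term
    length = ord(string[pos])
    return string[pos + 1:pos + 1 + length]

s, ptrs = block_encode(["systile", "syzygetic", "syzygial", "syzygy",
                        "szaibelyite", "szczecin"])
assert block_get(s, ptrs, 2) == "syzygial"
assert block_get(s, ptrs, 4) == "szaibelyite"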
Space for dictionary as a string with blocking

- With k = 4: 9 bytes saved minus 4 bytes lost = 5 bytes saved per block of 4 terms.
- 400K terms / 4 per block × 5 bytes = 0.5 MB saved: 7.6 MB → 7.1 MB.
Front coding

- Front coding: sorted words commonly share a long common prefix – store differences only (for the last k − 1 terms in a block of k).

8automata8automate9automatic10automation
→ 8automat*a1◊e2◊ic3◊ion

- The shared prefix automat is encoded once (the * marks its end); each following entry stores only its extra length beyond automat and the differing suffix (see the sketch below).
- Begins to resemble general string compression.
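A sketch of front coding for one block, reproducing the example above (the ◊ separator and * follow the slide's format; the helper name is ours):

import os

def front_code_block(terms):
    """Front-code one block: the first term is stored in full as
    <length><prefix>*<suffix>; each later term stores only its extra
    length beyond the common prefix and the differing suffix."""
    prefix = os.path.commonprefix(terms)
    first = f"{len(terms[0])}{prefix}*{terms[0][len(prefix):]}"
    rest = "".join(f"{len(t) - len(prefix)}◊{t[len(prefix):]}"
                   for t in terms[1:])
    return first + rest

block = ["automata", "automate", "automatic", "automation"]
assert front_code_block(block) == "8automat*a1◊e2◊ic3◊ion"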
Dictionary compression for Reuters: Summary

Technique                               Size in MB
Fixed width                             11.2
String with pointers to every term      7.6
Blocking, k = 4                         7.1
Blocking + front coding                 5.9
Postings compression
Postings compression

- The postings file is much larger than the dictionary – by a factor of at least 10.
- Key desideratum: store each posting compactly.
- A posting, for our purposes, is a docID.
- For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
- Alternatively, we can use log2(800,000) ≈ 20 bits per docID.
- Our goal: use a lot less than 20 bits per docID.
Key idea: Store gaps instead of docIDs

- A postings list is sorted by ascending docID, so it is enough to store the first docID and then the gaps (differences) between successive docIDs.

Gap encoding

- Example: docIDs 824, 829, 215406 are stored as 824, 5, 214577 (see the sketch below).
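A minimal sketch of the docID/gap conversion (function names are ours; the numbers are the ones reused in the VB example later):

def to_gaps(doc_ids):
    """Ascending docIDs -> first docID followed by the gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Gaps -> docIDs, by prefix-summing."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

assert to_gaps([824, 829, 215406]) == [824, 5, 214577]
assert from_gaps([824, 5, 214577]) == [824, 829, 215406]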
Variable length encoding

- Aim:
  - For arachnocentric, we will use ~20 bits per gap entry.
  - For the, we will use ~1 bit per gap entry.
- If the average gap for a term is G, we want to use ~log2(G) bits per gap entry.
- Key challenge: encode every integer (gap) with about as few bits as needed for that integer.
- Variable length codes achieve this by using short codes for small numbers.
Variable byte (VB) code

- For a gap value G, use close to the fewest bytes needed to hold log2(G) bits.
- Begin with one byte to store G and dedicate 1 bit in it to be a continuation bit c.
- If G ≤ 127, binary-encode it in the 7 available bits and set c = 1.
- Else encode G's lower-order 7 bits and then use additional bytes to encode the higher-order bits with the same algorithm.
- At the end, set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0). (See the sketch below.)
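A sketch of the scheme just described (function names are ours; each gap is encoded base 128, most significant byte first, with the continuation bit set only on its last byte):

def vb_encode_number(g):
    """VB-encode one gap: 7 payload bits per byte; the continuation
    bit (high bit) is set on the last byte only."""
    out = []
    while True:
        out.insert(0, g % 128)
        if g < 128:
            break
        g //= 128
    out[-1] += 128                      # set continuation bit c = 1
    return bytes(out)

def vb_encode(gaps):
    return b"".join(vb_encode_number(g) for g in gaps)

def vb_decode(data):
    """Decode a concatenation of VB-encoded gaps."""
    gaps, n = [], 0
    for byte in data:
        if byte < 128:                  # c = 0: more bytes follow
            n = 128 * n + byte
        else:                           # c = 1: last byte of this gap
            gaps.append(128 * n + (byte - 128))
            n = 0
    return gaps

assert vb_encode([824]) == bytes([0b00000110, 0b10111000])
assert vb_decode(vb_encode([824, 5, 214577])) == [824, 5, 214577]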
VB code examples

docIDs     824                 829        215406
gaps                           5          214577
VB code    00000110 10111000   10000101   00001101 00001100 10110001

- Postings are stored as the byte concatenation 00000110 10111000 10000101 00001101 00001100 10110001.
- Key property: VB-encoded postings are uniquely prefix-decodable.
- For a small gap (5), VB uses a whole byte.
Other variable codes

- Instead of bytes, we can also use a different "unit of alignment": 32 bits (words), 16 bits, 4 bits (nibbles), etc.
- Variable byte alignment wastes space if you have many small gaps – nibbles do better in such cases.
Gamma codes for gap encoding

- We can compress better with bit-level codes.
  - The gamma code is the best known of these.
- Represent a gap G as a pair (length, offset):
  - offset is G in binary, with the leading bit cut off.
    - For example, 13 → 1101 → 101.
  - length is the length of offset.
    - For 13 (offset 101), this is 3.
  - Encode length in unary code: 1110.
- The gamma code of 13 is the concatenation of length and offset: 1110101 (see the sketch below).
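A sketch of gamma encoding and decoding over bit strings (the bit stream is represented as a Python str of '0'/'1' for clarity; names are ours):

def gamma_encode(g):
    """Gamma code of g >= 1: unary(len(offset)) followed by offset,
    where offset is g in binary with the leading 1 removed."""
    assert g >= 1, "the gamma code cannot represent 0"
    offset = bin(g)[3:]                 # strip '0b' and the leading 1
    return "1" * len(offset) + "0" + offset

def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of gaps."""
    gaps, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":           # unary part: count the 1s
            length += 1
            i += 1
        i += 1                          # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        gaps.append(int("1" + offset, 2) if offset else 1)
    return gaps

assert gamma_encode(13) == "1110101"    # as on the slide
assert gamma_encode(1) == "0"           # empty offset
assert gamma_decode("1110101" + "0" + "101") == [13, 1, 3]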
Gamma code examples

number   length        offset        gamma code
0        –             –             none
1        0                           0
2        10            0             10,0
3        10            1             10,1
4        110           00            110,00
9        1110          001           1110,001
13       1110          101           1110,101
24       11110         1000          11110,1000
511      111111110     11111111      111111110,11111111
1025     11111111110   0000000001    11111111110,0000000001
Compression of Reuters

Data structure                               Size in MB
dictionary, fixed-width                      11.2
dictionary, term pointers into string        7.6
  with blocking, k = 4                       7.1
  with blocking & front coding               5.9
collection (text, XML markup etc.)           3,600.0
collection (text)                            960.0
term-document incidence matrix               40,000.0
postings, uncompressed (32-bit words)        400.0
postings, uncompressed (20 bits)             250.0
postings, variable byte encoded              116.0
postings, gamma-encoded                      101.0
