Week 4 - Information Retrieval Indexing
Information Retrieval- 1
Architecture of Text Retrieval Systems
[Figure: the user interface passes the user need as text to the text operations (1. feature extraction); searching is then performed against the index]
Term Search
Problem: text retrieval algorithms need to find words in
documents efficiently
– Boolean, probabilistic, and vector space retrieval
– Given an index term ki, find the documents dj containing it
1.3.1 Inverted Files
An inverted file is a word-oriented mechanism for indexing a
text collection in order to speed up the term search task
– Addressing of documents and word positions within documents
– Most frequently used indexing technique for large text databases
– Appropriate when text collection is large and semi-static
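As a minimal sketch of the idea (a plain Python dict standing in for the index file, using a few of the book titles from the example that follows):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Word-oriented index: map each term to the sorted list of documents
    containing it, so term search no longer scans the full text."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

docs = {
    1: "A Course on Integral Equations",
    2: "Attractors for Semigroups and Evolution Equations",
    10: "Ordinary Differential Equations",
}
index = build_inverted_index(docs)
print(index["equations"])  # [1, 2, 10]
```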
Inverted Files
Inverted list for a term
Example: Documents
B1 A Course on Integral Equations
B2 Attractors for Semigroups and Evolution Equations
B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4 Geometrical Aspects of Partial Differential Equations
B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and
Commutative Algebra
B6 Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7 Knapsack Problems: Algorithms and Computer Implementations
B8 Methods of Solving Singular Systems of Ordinary Differential Equations
B9 Nonlinear Systems
B10 Ordinary Differential Equations
B11 Oscillation Theory for Neutral Differential Equations with Delay
B12 Oscillation Theory of Delay Differential Equations
B13 Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14 Sinc Methods for Quadrature and Differential Equations
B15 Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16 The Boundary Integral Approach to Static and Dynamic Contact Problems
B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory
Example
(term number, term, frequency : list of documents)
1 Algorithms 3 : 3 5 7
2 Application 2 : 3 17
3 Delay 2 : 11 12
4 Differential 8 : 4 8 10 11 12 13 14 15
5 Equations 10 : 1 2 4 8 10 11 12 13 14 15
6 Implementation 2 : 3 7
7 Integral 2 : 16 17
8 Introduction 2 : 5 6
9 Methods 2 : 8 14
10 Nonlinear 2 : 9 13
11 Ordinary 2 : 8 10
12 Oscillation 2 : 11 12
13 Partial 2 : 4 13
14 Problem 2 : 6 7
15 Systems 3 : 6 8 9
16 Theory 4 : 3 11 12 17
Physical Organization of Inverted Files
Index file (vocabulary): one entry per term, storing the key ki, the number of documents fi, and a pointer pi into the posting file
– The access structure to the vocabulary can be a B+-tree, hashing, or a sorted array
– Kept in main memory; space requirement O(n^β) with 0.4 < β < 0.6 (Heaps' law)
Posting file: the occurrences of the words, stored ordered lexicographically by term
– Space requirement O(n)
Document file: the documents D1, ..., Dn, stored in a contiguous file on secondary storage
– Space requirement O(n)
Heaps’ Law
An empirical law that describes the relation between the size n
of a collection and the size m of its vocabulary: m = K·n^β
[Figure: log10 m (vocabulary size) grows linearly in log10 n]
• Stemming and lower-casing decrease the vocabulary size
• Numbers and spelling errors increase it
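A quick numerical illustration (K and β below are made-up values within the stated range, not measurements):

```python
# Heaps' law: vocabulary size m = K * n**beta for a collection of size n.
K, beta = 10.0, 0.5          # illustrative constants, 0.4 < beta < 0.6

def vocabulary_size(n):
    return K * n ** beta

# Sub-linear growth: doubling the collection grows the vocabulary by ~41%.
print(vocabulary_size(1_000_000))                                # 10000.0
print(vocabulary_size(2_000_000) / vocabulary_size(1_000_000))   # ~1.414
```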
Example
[Figure: an index file, posting file, and document file for the example collection]
Addressing Granularity
By default, the entries in the postings file point to the
documents in the document file
Payload of Postings
Postings consist of the document identifier
•For Boolean retrieval this is sufficient
A posting indicates ...
1. The frequency of a term in the vocabulary
2. The frequency of a term in a document
3. The occurrence of a term in a document
4. The list of terms occurring in a document
When indexing a document collection using an inverted file,
the main space requirement is implied by ...
1. The access structure
2. The vocabulary
3. The index file
4. The postings file
Searching the Inverted File
Step 1: Vocabulary search
– the words present in the query are searched in the index file
Step 2: Retrieval of occurrences
– the lists of the occurrences of all words found are retrieved from
the posting file
Step 3: Manipulation of occurrences
– the occurrences are processed in the document file to answer
the query
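The three steps can be sketched as follows for a conjunctive (AND) Boolean query; the dict literal stands in for the index and posting files:

```python
def search(index, query):
    """Conjunctive (AND) Boolean search over an inverted file.

    Step 1: look up every query term in the index (vocabulary search).
    Step 2: retrieve the posting list of each term found.
    Step 3: intersect the posting lists to answer the query.
    """
    postings = []
    for term in query.lower().split():
        if term not in index:        # a missing term means an empty result
            return []
        postings.append(set(index[term]))
    return sorted(set.intersection(*postings))

index = {
    "differential": [4, 8, 10, 11, 12, 13, 14, 15],
    "ordinary": [8, 10],
}
print(search(index, "Ordinary Differential"))  # [8, 10]
```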
Example
Searching “Differential”: Step 1 finds the term in the index file with frequency 8; Step 2 retrieves its posting list 4, 8, 10, 11, 12, 13, 14, 15 from the posting file; Step 3 accesses the corresponding documents B4, B8, B10, B11, B12, B13, B14, B15 in the document file.
Construction of the Inverted File – Step 1
Step 1: Search phase
– The vocabulary is kept in an ordered data structure, e.g., a trie or
sorted array, storing for each word a list of its occurrences
– Each word of the text is read sequentially and searched in the
vocabulary
– If a word is not found in the ordered data structure, it is added to
the vocabulary with an empty list of occurrences
– If a word is found, the word position is added to the end of its list
of occurrences
Construction of the Inverted File – Step 2
Step 2: Storage phase (once the text is exhausted)
– The list of occurrences is written contiguously to the disk (posting
file)
– The vocabulary is stored in lexicographical order (index file) in
main memory together with a pointer for each word to its list in
the posting file
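A compact sketch of the two phases, with a plain dict in place of the trie and word counts as positions:

```python
def construct_inverted_file(text):
    """Two-phase construction: search phase, then storage phase.

    Phase 1 scans the text word by word, recording occurrences; phase 2
    writes all occurrence lists contiguously (posting file) and keeps the
    vocabulary in lexicographic order with a pointer per word (index file).
    """
    vocabulary = {}                              # stand-in for the trie
    for position, word in enumerate(text.lower().split(), start=1):
        vocabulary.setdefault(word, []).append(position)
    index_file, posting_file = {}, []
    for word in sorted(vocabulary):
        index_file[word] = (len(vocabulary[word]), len(posting_file))
        posting_file.extend(vocabulary[word])
    return index_file, posting_file

idx, post = construct_inverted_file("the house has a garden")
print(idx["the"], post)  # (1, 4) [4, 5, 3, 2, 1]
```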
Tries
A trie is a tree data structure to index strings
• For each prefix of each length in the set of strings a
separate path is created
• Strings are looked up by following the prefix path
Example: trie for {a, to, tea, ted, ten, i, in, inn}
[Figure: trie for {a, to, tea, ted, ten, i, in, inn} — shared prefixes such as te- are stored on a single path; each word ends at a marked node]
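A minimal trie sketch (one node per character of each prefix):

```python
class Trie:
    """Minimal trie: strings are inserted and looked up character by
    character along a shared prefix path."""
    def __init__(self):
        self.children = {}   # next character -> child node
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:                       # one edge per prefix character
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def lookup(self, word):
        node = self
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

trie = Trie()
for w in ["a", "to", "tea", "ted", "ten", "i", "in", "inn"]:
    trie.insert(w)
print(trie.lookup("ten"), trie.lookup("te"))  # True False
```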
Example
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful
(each word = one document, position = document identifier)
[Figure: the trie is built incrementally while scanning the text word by word]
Resulting inverted file I:
a: 16
are: 66
beautiful: 70
flowers: 45, 58
garden: 18, 29
has: 12, 36
house: 6
many: 40
the: 1, 25, 54
Postings file (occurrence lists written contiguously, in vocabulary order):
16, 66, 70, 45, 58, 18, 29, 12, 36, 6, 40, 1, 25, 54
Example
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful
The index file stores, for each term, a pointer (1-based offset) to its occurrence list in the postings file, e.g.:
a: 1, beautiful: 3, flowers: 4, garden: 6, many: 11, the: 12
Postings file:
16, 66, 70, 45, 58, 18, 29, 12, 36, 6, 40, 1, 25, 54
Index Construction in Practice
When using a single node, not all index information can be kept in
main memory → index merging
– When no more memory is available, a partial index Ii is written to disk
– The main memory is erased before continuing with the rest of the text
– Once the text is exhausted, a number of partial indices Ii exist on disk
– The partial indices are merged to obtain the final index
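A sketch of the final merge step, assuming each partial index was written to disk as a term-sorted list of (term, postings) pairs:

```python
import heapq

def merge_partial_indices(partials):
    """Merge term-sorted partial inverted indices into one final index.

    Posting lists for the same term are concatenated; document order is
    preserved because the blocks were processed in document order.
    """
    merged = {}
    for term, postings in heapq.merge(*partials):
        merged.setdefault(term, []).extend(postings)
    return merged

I1 = [("garden", [18, 29]), ("the", [1, 25])]
I2 = [("beautiful", [70]), ("the", [54])]
print(merge_partial_indices([I1, I2])["the"])  # [1, 25, 54]
```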
Index Merging
BSBI: Blocked Sort-Based Indexing
[Figure: binary merge tree — the initial partial indices I1..I8, built from document blocks D1..D8, are merged pairwise (merges 1, 2, 4, 5 at level 1), the results are merged into I1...4 and I5...8 (merges 3, 6 at level 2), and a final merge (7, at level 3) produces the final index I1...8]
Example
Word positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70
the house has a garden. the garden has many flowers. the flowers are beautiful
Partial inverted file I1 (first part of the text):
a: 16 | garden: 18, 29 | has: 12, 36 | house: 6 | the: 1, 25
Partial inverted file I2 (rest of the text):
are: 66 | beautiful: 70 | flowers: 45, 58 | many: 40 | the: 54
Merging concatenates the inverted lists of each term, e.g., the: 1, 25 + 54 → 1, 25, 54
Merged index:
a: 16
are: 66
beautiful: 70
flowers: 45, 58
garden: 18, 29
has: 12, 36
house: 6
many: 40
the: 1, 25, 54
Total cost: O(n log2(n/M)), where M is the size of main memory
Addressing Granularity
Documents can be addressed at different granularities
– coarser: text blocks spanning multiple documents
– finer: paragraph, sentence, word level
General rule
– the finer the granularity the less post-processing but the larger
the index
Example: index size in % of the document collection size

Index                     Small (1 Mb)   Medium (200 Mb)   Large (2 Gb)
Addressing words          73%            64%               63%
Addressing documents      26%            32%               47%
Addressing 256K blocks    25%            2.4%              0.7%
Index Compression
Documents are ordered and each document identifier dij is
replaced by the difference to the preceding document
identifier (gap encoding)
– Document identifiers are encoded using fewer bits for smaller,
common numbers

Inverted list for term lk:   lk   fk : d_i1, ..., d_ifk
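A sketch of gap encoding combined with a variable-byte code (one common encoding choice; the function names are illustrative), using the posting list of “Differential” from the earlier example:

```python
def gap_encode(doc_ids):
    """Replace each sorted document id by the gap to its predecessor."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte(n):
    """Variable-byte code: 7 payload bits per byte; in this sketch the
    high bit marks the final byte of a number."""
    out = []
    while True:
        out.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    out.reverse()
    out[-1] |= 0x80          # mark the last byte
    return bytes(out)

postings = [4, 8, 10, 11, 12, 13, 14, 15]
gaps = gap_encode(postings)          # [4, 4, 2, 1, 1, 1, 1, 1]
encoded = b"".join(vbyte(g) for g in gaps)
print(gaps, len(encoded))            # small gaps -> one byte each
```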
Using a trie in index construction …
1. Helps to quickly find words that have been seen before
2. Helps to quickly decide whether a word has not been seen
before
3. Helps to maintain the lexicographic order of words seen in
the documents
4. All of the above
1.3.2 Web-Scale Indexing: Map-Reduce
Pioneered by Google: 20PB of data per day (2008)
– Scan 100 TB on 1 node @ 50 MB/s = 23 days
– Scan on 1000-node cluster = 33 minutes
Cost-efficiency
– Commodity nodes, network (cheap, but unreliable)
– Automatic fault-tolerance (fewer admins)
– Easy to use (fewer programmers)
Map-Reduce Programming Model
Data type: key-value pairs
Map function:
Analyses part of the input and produces a list of intermediate
key-value pairs
Reduce function:
Takes all intermediate results belonging to one key and
computes an aggregate
Example
Basic word counter program
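The program itself is not reproduced on the slide; a self-contained sketch in map-reduce style, with a small driver simulating the platform’s shuffle phase, could look like this:

```python
from itertools import groupby
from operator import itemgetter

def mapper(document, text):
    for word in text.split():
        yield (word, 1)                    # emit one pair per occurrence

def reducer(word, counts):
    yield (word, sum(counts))              # aggregate all values for one key

def run(docs):
    """Driver that simulates map, shuffle (sort + group by key), reduce."""
    pairs = [kv for doc, text in docs for kv in mapper(doc, text)]
    pairs.sort(key=itemgetter(0))          # shuffle: group pairs by key
    out = {}
    for word, group in groupby(pairs, key=itemgetter(0)):
        for w, total in reducer(word, (c for _, c in group)):
            out[w] = total
    return out

print(run([("d1", "to be or not to be")]))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```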
Map-Reduce Processing Model
[Figure: the input data is partitioned into subsets; mappers extract word occurrences, emitting pairs with count 1; reducers aggregate the occurrences per word (e.g., totals 2, 2, 3); the output is written to stable storage]
Important: the reducers can only start after all mappers have finished!
Refined Map-Reduce Programming Model
[Figure not reproduced]
What the Programmer Controls (and not)
The programmer controls
• Key-value data structures (can be complex)
• Maintenance of state in mappers and reducers
• Sort order of intermediate key-value pairs
• Partitioning scheme on the key space
The map-reduce platform controls
• Where the mappers and reducers run
• When a mapper and a reducer start and terminate
• Which input data is assigned to a specific mapper
• Which intermediate key-value pairs are processed by a specific
reducer
Inverted File Construction Using Map-Reduce
Inverted File Construction Program

def mapper(document, text):
    f = {}
    for word in text.split():
        f[word] = f.get(word, 0) + 1
    for word in f:
        output(word, (document, f[word]))
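The slides show only the mapper; a matching reducer (a sketch — `output` is the platform’s emit function, stubbed out below for illustration) could be:

```python
def reducer(word, values):
    # values: all (document, frequency) pairs emitted for this word.
    # Sorting by document identifier yields the term's posting list.
    postings = sorted(values)
    output(word, postings)

# Illustrative driver stub (not part of a real map-reduce platform):
collected = {}
def output(key, value):
    collected[key] = value

reducer("differential", [(10, 1), (4, 2), (8, 1)])
print(collected["differential"])  # [(4, 2), (8, 1), (10, 1)]
```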
Maintaining the order of document identifiers for
vocabulary construction when partitioning the document
collection is important ...
1.3.3 Distributed Retrieval
Centralized retrieval
– Aggregate the weights for ALL documents by scanning the
posting lists of the query terms
– Scanning is relatively efficient
– Computationally quite expensive (memory, processing)
[Figure: the posting lists of t1, t2, and t3 are scanned in full]
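A sketch of the accumulator-based scan (the documents and weights are made up):

```python
def centralized_retrieval(index, query, k):
    """Aggregate weights for ALL documents by scanning the posting lists
    of the query terms; scanning is efficient, but an accumulator entry
    is needed per scored document (memory, processing)."""
    scores = {}
    for term in query:
        for doc, weight in index.get(term, []):   # scan one posting list
            scores[doc] = scores.get(doc, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

index = {
    "t1": [("d1", 0.75), ("d2", 0.5)],
    "t2": [("d2", 0.5), ("d3", 0.25)],
}
print(centralized_retrieval(index, ["t1", "t2"], 2))  # [('d2', 1.0), ('d1', 0.75)]
```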
Distributed Retrieval
Distributed retrieval
– Posting lists for different terms stored on different nodes
– The transfer of complete posting lists can become prohibitively
expensive in terms of bandwidth consumption
Example 1
Finding the top-2 elements for a two-term query
[Figure not reproduced]
Example 2
Finding the top-2 elements for a two-term query
[Figure not reproduced; visible entries include d2: 0.1 and d7: 0.0 in one list, d7: 0.15 and d4: 0.0 in the other]
Example 3
Finding the top-2 elements for a two-term query
[Figure not reproduced]
Discussion
Complexity
– O((k·n)^(1/2)) entries are read in each list for n documents,
assuming that entries are uncorrelated
– Improves if they are positively correlated
In distributed settings, optimizations reduce the number of
roundtrips
– Send a longer prefix of one list to the other node
Fagin’s algorithm may behave poorly in practical cases
– Alternative: the Threshold Algorithm
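A sketch of Fagin’s algorithm with sum aggregation (the lists must be sorted by descending weight; the document ids and weights below are made up):

```python
def fagin(lists, k):
    """Fagin's algorithm sketch: sequential round-robin access until k
    documents have been seen in ALL lists, then random-access lookups
    of the missing weights for every document seen so far."""
    weights = [dict(l) for l in lists]      # for random-access lookups
    seen = [set() for _ in lists]
    for pos in range(max(len(l) for l in lists)):
        for i, l in enumerate(lists):       # round-robin sequential access
            if pos < len(l):
                seen[i].add(l[pos][0])
        if len(set.intersection(*seen)) >= k:
            break                           # k docs seen in every list
    candidates = set.union(*seen)           # random-access phase
    scored = [(sum(w.get(d, 0.0) for w in weights), d) for d in candidates]
    return sorted(scored, reverse=True)[:k]

l1 = [("d1", 0.75), ("d2", 0.5), ("d3", 0.25)]
l2 = [("d2", 0.75), ("d3", 0.5), ("d1", 0.25)]
print(fagin([l1, l2], 2))  # [(1.25, 'd2'), (1.0, 'd1')]
```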
1.3.3.2 Threshold Algorithm
Threshold Algorithm
– Access the elements of each list sequentially
– At each round
– look up the missing weights of the current elements in the other
lists using random access
– keep the top-k elements seen so far
– compute the threshold as the aggregate value of the elements
seen in the current round
– If at least k documents have an aggregate value equal to or higher
than the threshold, halt
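A sketch of the Threshold Algorithm with sum aggregation (same made-up lists as in the Fagin sketch above; a full random-access lookup per seen document keeps the code short):

```python
def threshold_algorithm(lists, k):
    """Threshold Algorithm sketch: per round, read one element from each
    list sequentially, fetch its missing weights by random access, and
    halt once the k-th best aggregate reaches the round's threshold."""
    weights = [dict(l) for l in lists]           # random-access lookups
    best = {}                                    # doc -> aggregate weight
    for pos in range(max(len(l) for l in lists)):
        threshold = 0.0
        for i, l in enumerate(lists):
            if pos >= len(l):
                continue
            doc, w = l[pos]
            threshold += w                       # aggregate of current row
            best[doc] = sum(d.get(doc, 0.0) for d in weights)
        top = sorted(best.items(), key=lambda kv: -kv[1])[:k]
        if len(top) == k and top[-1][1] >= threshold:
            return top                           # k docs reach the threshold
    return sorted(best.items(), key=lambda kv: -kv[1])[:k]

l1 = [("d1", 0.75), ("d2", 0.5), ("d3", 0.25)]
l2 = [("d2", 0.75), ("d3", 0.5), ("d1", 0.25)]
print(threshold_algorithm([l1, l2], 2))  # [('d2', 1.25), ('d1', 1.0)]
```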
Example
Finding the top-2 elements for a two-term query
[Figure: in the first rounds, d1 and d6 have aggregate weights lower than the threshold, therefore the algorithm continues; it halts once the top-2 documents reach the threshold]
Applications
Beyond distributed document retrieval these algorithms
have wider applications
– Multimedia, image retrieval
– Top-k processing in relational databases
– Document filtering
– Sensor data processing
When applying Fagin’s algorithm for a query with
three different terms for finding the k top
documents, the algorithm will scan ...
1. 2 different lists
2. 3 different lists
3. k different lists
4. It depends on how many
rounds are taken
With Fagin’s algorithm, once k documents have
been identified that occur in all of the lists ...
1. These are the top-k
documents
2. The top-k documents are
among the documents seen
so far
3. The search has to continue
in round-robin till the top-k
documents are identified
4. Other documents have to be
searched to complete the
top-k list
References
Lin, J., Dyer, C. (2010). Inverted Indexing for Text Retrieval. In: Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Springer, Cham.