Week 4 - Information Retrieval Indexing

The document discusses the architecture and mechanisms of text retrieval systems, focusing on indexing methods such as inverted files to enhance search efficiency. It explains the construction and organization of inverted files, including vocabulary management and posting files, as well as the process of searching through these files. Additionally, it covers practical aspects of index construction and merging techniques for large text collections.

1.3 Indexing for Information Retrieval

Information Retrieval- 1
Architecture of Text Retrieval Systems
[Figure: system architecture. The user interface passes the user's information need through text operations (1. feature extraction) to produce a query representation; documents undergo the same text operations to produce document representations, which the indexing module and DB manager turn into an inverted file. The searching module accesses the index (3. efficient data access) to retrieve documents from the text database; the ranking module (2. ranking system) returns ranked documents, and user feedback refines the query via the query operations module.]

Information Retrieval- 2
Term Search
Problem: text retrieval algorithms need to find words in documents efficiently
– Boolean, probabilistic and vector space retrieval
– Given index term ki, find document dj

Example: searching for the term "application" in the document collection B1–B17 (listed on the following slide) returns the documents B3 and B17.

Information Retrieval- 3
1.3.1 Inverted Files
An inverted file is a word-oriented mechanism for indexing a
text collection in order to speed up the term search task
– Addressing of documents and word positions within documents
– Most frequently used indexing technique for large text databases
– Appropriate when text collection is large and semi-static

Information Retrieval- 4
Inverted Files
Inverted list for a term kk:  lk = ⟨ fk : di1, …, difk ⟩

– fk: number of documents in which the term kk occurs
– di1, …, difk: list of identifiers of the documents containing kk

Inverted File: lexicographically ordered sequence of inverted lists

Information Retrieval- 5
Example: Documents
B1 A Course on Integral Equations
B2 Attractors for Semigroups and Evolution Equations
B3 Automatic Differentiation of Algorithms: Theory, Implementation, and Application
B4 Geometrical Aspects of Partial Differential Equations
B5 Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and
Commutative Algebra
B6 Introduction to Hamiltonian Dynamical Systems and the N-Body Problem
B7 Knapsack Problems: Algorithms and Computer Implementations
B8 Methods of Solving Singular Systems of Ordinary Differential Equations
B9 Nonlinear Systems
B10 Ordinary Differential Equations
B11 Oscillation Theory for Neutral Differential Equations with Delay
B12 Oscillation Theory of Delay Differential Equations
B13 Pseudodifferential Operators and Nonlinear Partial Differential Equations
B14 Sinc Methods for Quadrature and Differential Equations
B15 Stability of Stochastic Differential Equations with Respect to Semi-Martingales
B16 The Boundary Integral Approach to Static and Dynamic Contact Problems
B17 The Double Mellin-Barnes Type Integrals and Their Applications to Convolution Theory

Information Retrieval- 6
Example
1 Algorithms 3 : 3 5 7
2 Application 2 : 3 17
3 Delay 2 : 11 12
4 Differential 8 : 4 8 10 11 12 13 14 15
5 Equations 10 : 1 2 4 8 10 11 12 13 14 15
6 Implementation 2 : 3 7
7 Integral 2 : 16 17
8 Introduction 2 : 5 6
9 Methods 2 : 8 14
10 Nonlinear 2 : 9 13
11 Ordinary 2 : 8 10
12 Oscillation 2 : 11 12
13 Partial 2 : 4 13
14 Problem 2 : 6 7
15 Systems 3 : 6 8 9
16 Theory 4 : 3 11 12 17

Information Retrieval- 7
Physical Organization of Inverted Files
The index consists of several physical structures:
– Access structure to the vocabulary (main memory): one entry per term of the vocabulary; can be a B+-Tree, hashing, or a sorted array. Space requirement O(n^β).
– Index file (main memory): for each term k, its document frequency f and a pointer p into the posting file. Space requirement O(n^β), with 0.4 < β < 0.6 (Heaps' law).
– Posting file (secondary storage): the occurrences of the words, stored ordered lexicographically by term. Space requirement O(n).
– Document file (secondary storage): the documents D1, …, Dn, stored in a contiguous file. Space requirement O(n).

Information Retrieval- 8
Heaps' Law
An empirical law that describes the relation between the size n of a collection and the size m of its vocabulary:

  m = K · n^β,  with typical values 0.4 < β < 0.6

Parameters depend on collection type and preprocessing
• Stemming, lower-casing decrease vocabulary size
• Numbers, spelling errors increase vocabulary size

[Figure: plotting log10 m against log10 n gives an approximately straight line.]
Information Retrieval- 9
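As a quick illustration, Heaps' law can be evaluated directly; the parameter values K = 50 and β = 0.5 below are assumptions for the sketch, not values from the slides.

```python
# Heaps' law: vocabulary size m = K * n^beta for a collection of n words.
# K = 50 and beta = 0.5 are illustrative assumptions (typical beta: 0.4-0.6).
def heaps(n, K=50, beta=0.5):
    return int(K * n ** beta)

# Doubling the collection size grows the vocabulary by far less than 2x.
print(heaps(1_000_000))   # 50000
print(heaps(2_000_000))   # 70710
```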
Example
Index file (term, f, pointer), posting file, and document file:

  Algorithms    3 → 3, 5, 7
  Application   2 → 3, 17
  Delay         2 → 11, 12
  Differential  8 → 4, 8, 10, 11, 12, 13, 14, 15
  …

The posting file stores these lists contiguously (3 5 7 3 17 11 12 4 8 10 11 12 13 14 15 …); the document file holds the documents B1–B17 themselves.

Information Retrieval- 10
Addressing Granularity
By default, the entries in the postings file point to documents in the document file

Other granularities can be used


• Pointing to a specific position within the document
• Example: (B1, 4) points to “Integral”
• Grouping multiple documents to one
• Example: create groups of 4 documents, then G1 points to the
group (B1,B2,B3,B4)

Information Retrieval- 11
Payload of Postings
Postings consist of the document identifier
• For Boolean retrieval this is sufficient

In addition, other data can be stored with the posting
• For VS retrieval the term frequency can be stored
• Other data: positions of occurrence, term properties

Information Retrieval- 12
A posting indicates ...
1. The frequency of a term in the vocabulary
2. The frequency of a term in a document
3. The occurrence of a term in a document
4. The list of terms occurring in a document

Information Retrieval- 13
When indexing a document collection using an inverted file,
the main space requirement is implied by ...
1. The access structure
2. The vocabulary
3. The index file
4. The postings file

Information Retrieval- 14
Searching the Inverted File
Step 1: Vocabulary search
– the words present in the query are searched in the index file
Step 2: Retrieval of occurrences
– the lists of the occurrences of all words found are retrieved from
the posting file
Step 3: Manipulation of occurrences
– the occurrences are processed in the document file to process
the query

Information Retrieval- 15
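The three steps can be sketched over a toy in-memory index; the dictionaries below are illustrative stand-ins for the index and posting files, not the slides' actual data structures.

```python
# Step 1: look up the query terms in the index file (here a dict).
# Step 2: retrieve the posting lists of all terms found.
# Step 3: process the occurrences, e.g., intersect them for an AND query.
index_file = {
    "differential": [4, 8, 10, 11, 12, 13, 14, 15],
    "equations":    [1, 2, 4, 8, 10, 11, 12, 13, 14, 15],
}

def search_and(query_terms):
    postings = [set(index_file.get(t, [])) for t in query_terms]  # steps 1+2
    return sorted(set.intersection(*postings)) if postings else []  # step 3

print(search_and(["differential", "equations"]))  # [4, 8, 10, 11, 12, 13, 14, 15]
```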
Example
Searching "Differential":
– Step 1: the term is located in the index file (Differential, f = 8)
– Step 2: its posting list 4, 8, 10, 11, 12, 13, 14, 15 is retrieved from the posting file
– Step 3: the corresponding documents (B4, B8, B10, …) are processed in the document file to answer the query

Information Retrieval- 16
Construction of the Inverted File – Step 1
Step 1: Search phase
– The vocabulary is kept in an ordered data structure, e.g., a trie or
sorted array, storing for each word a list of its occurrences
– Each word of the text is read sequentially and searched in the
vocabulary
– If a word is not found in the ordered data structure, it is added to
the vocabulary with an empty list of occurrences
– If a word is found, the word position is added to the end of its list
of occurrences

Information Retrieval- 17
Construction of the Inverted File – Step 2
Step 2: Storage phase (once the text is exhausted)
– The list of occurrences is written contiguously to the disk (posting
file)
– The vocabulary is stored in lexicographical order (index file) in
main memory together with a pointer for each word to its list in
the posting file

Overall cost O(n)

Information Retrieval- 18
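The two phases can be sketched as follows; a plain dict stands in for the ordered vocabulary structure, and word positions (rather than character offsets) serve as occurrence identifiers in this sketch.

```python
def build_inverted_file(text):
    # Search phase: read words sequentially; unseen words are added to the
    # vocabulary, and each occurrence is appended to the word's list.
    vocab = {}
    for pos, word in enumerate(text.replace(".", "").lower().split(), start=1):
        vocab.setdefault(word, []).append(pos)
    # Storage phase: write occurrence lists contiguously (posting file) and
    # store the vocabulary in lexicographic order with pointers (index file).
    index_file, posting_file = [], []
    for word in sorted(vocab):
        index_file.append((word, len(vocab[word]), len(posting_file)))
        posting_file.extend(vocab[word])
    return index_file, posting_file

idx, post = build_inverted_file("the house has a garden. the garden has many flowers.")
print(idx[0], len(post))   # ('a', 1, 0) 10
```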
Tries
A trie is a tree data structure to index strings
• For each prefix of each length in the set of strings a separate path is created
• Strings are looked up by following the prefix path
Example: trie for {a, to, tea, ted, ten, i, in, inn}

[Figure: trie with root branches a, i, t. Node a spells "a"; node i spells "i" and has child n ("in"), which has child n ("inn"); node t has children o ("to") and e, where e has children a ("tea"), d ("ted") and n ("ten").]

Information Retrieval- 19
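A minimal dictionary-of-dictionaries trie in Python; the `END` sentinel key is an assumption of this sketch, used to distinguish complete words from mere prefixes.

```python
END = "$"  # sentinel key marking that a node completes a word (assumed here)

def trie_insert(root, word):
    node = root
    for ch in word:                 # create/follow the prefix path
        node = node.setdefault(ch, {})
    node[END] = True

def trie_contains(root, word):
    node = root
    for ch in word:                 # follow the prefix path
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = {}
for w in ["a", "to", "tea", "ted", "ten", "i", "in", "inn"]:
    trie_insert(trie, w)
print(trie_contains(trie, "ten"), trie_contains(trie, "te"))  # True False
```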
Example
Document: "the house has a garden. the garden has many flowers. the flowers are beautiful"
Word positions: the 1, house 6, has 12, a 16, garden 18, the 25, garden 29, has 36, many 40, flowers 45, the 54, flowers 58, are 66, beautiful 70
(each word = one document, position = document identifier)

[Figure: the trie grows as the text is read. After "the house has", it holds the: 1, house: 6, has: 12; adding "a" inserts a: 16; adding "garden. the" inserts garden: 18 and extends the list of "the" to 1, 25; and so on.]


Information Retrieval- 20
Example
the house has a garden. the garden has many flowers. the flowers are beautiful

[Figure: final trie with root branches a, b, f, g, h, m, t.]

inverted file I:
  a: 16
  are: 66
  beautiful: 70
  flowers: 45, 58
  garden: 18, 29
  has: 12, 36
  house: 6
  many: 40
  the: 1, 25, 54

postings file: 16, 66, 70, 45, 58, 18, 29, 12, 36, 6, 40, 1, 25, 54
Information Retrieval- 21
Example
the house has a garden. the garden has many flowers. the flowers are beautiful

In the final index, each term stores a pointer (offset) into the postings file instead of the list itself:
  a: 1, are: 2, beautiful: 3, flowers: 4, garden: 6, has: 8, house: 10, many: 11, the: 12

postings file: 16, 66, 70, 45, 58, 18, 29, 12, 36, 6, 40, 1, 25, 54

Information Retrieval- 22
Index Construction in Practice
When using a single node not all index information can be kept in
main memory → Index merging
– When no more memory is available, a partial index Ii is written to disk
– The main memory is erased before continuing with the rest of the text
– Once the text is exhausted, a number of partial indices Ii exist on disk
– The partial indices are merged to obtain the final index

Information Retrieval- 23
Index Merging
BSBI: Blocked sort-based Indexing

[Figure: binary merge tree. Initial partial indices I1 … I8 are built from document partitions D1 … D8. Merge steps 1, 2, 4, 5 produce the level-1 indices I1...2, I3...4, I5...6, I7...8; steps 3 and 6 produce the level-2 indices I1...4 and I5...8; step 7 produces the final index I1...8 at level 3.]

Information Retrieval- 24
Example
the house has a garden. the garden has many flowers. the flowers are beautiful

inverted file I1:  a: 16;  garden: 18, 29;  has: 12, 36;  house: 6;  the: 1, 25
inverted file I2:  are: 66;  beautiful: 70;  flowers: 45, 58;  many: 40;  the: 54

Merging concatenates the inverted lists of terms occurring in both indices, e.g., the: 1, 25 + 54 → 1, 25, 54. The merged index:
  a: 16
  are: 66
  beautiful: 70
  flowers: 45, 58
  garden: 18, 29
  has: 12, 36
  house: 6
  many: 40
  the: 1, 25, 54

total cost: O(n log2(n/M)), where M is the size of memory

Information Retrieval- 25
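A single merge step can be sketched as a sorted merge of two in-memory partial indices; since I1 covers documents that precede those of I2, concatenation keeps each posting list sorted.

```python
def merge_indices(i1, i2):
    # Merge the vocabularies in lexicographic order and concatenate the
    # inverted lists of terms occurring in both partial indices.
    merged = {}
    for term in sorted(set(i1) | set(i2)):
        merged[term] = i1.get(term, []) + i2.get(term, [])
    return merged

I1 = {"a": [16], "garden": [18, 29], "has": [12, 36], "house": [6], "the": [1, 25]}
I2 = {"are": [66], "beautiful": [70], "flowers": [45, 58], "many": [40], "the": [54]}
merged = merge_indices(I1, I2)
print(merged["the"])   # [1, 25, 54]
```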
Addressing Granularity
Documents can be addressed at different granularities
– coarser: text blocks spanning multiple documents
– finer: paragraph, sentence, word level
General rule
– the finer the granularity the less post-processing but the larger
the index
Example: index size in % of document collection size

  Index                    Small (1Mb)   Medium (200Mb)   Large (2Gb)
  Addressing words             73%            64%             63%
  Addressing documents         26%            32%             47%
  Addressing 256K blocks       25%            2.4%            0.7%

Information Retrieval- 26
Index Compression
Documents are ordered and each document identifier dij is
replaced by the difference to the preceding document
identifier
– Document identifiers are encoded using fewer bits for smaller,
common numbers
  lk  = ⟨ fk : di1, di2, …, difk ⟩
  lk' = ⟨ fk : di1, di2 − di1, …, difk − difk−1 ⟩

– Use of variable-length compression further reduces the space requirement
– In practice the index is reduced to 10-15% of the database size

Information Retrieval- 27
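Gap encoding plus a byte-aligned variable-length code can be sketched as follows; the 7-bits-per-byte scheme is one common choice of variable-length code, not necessarily the exact one the slides have in mind.

```python
def to_gaps(doc_ids):
    # Replace each identifier (except the first) by the difference to its
    # predecessor; sorted lists yield small, compressible numbers.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(n):
    # 7 payload bits per byte; the high bit marks the final byte of a number.
    out = []
    while n >= 128:
        out.append(n % 128)
        n //= 128
    out.append(n + 128)
    return out

ids = [3, 17, 45, 812]
gaps = to_gaps(ids)                               # [3, 14, 28, 767]
encoded = [b for g in gaps for b in vbyte_encode(g)]
print(len(encoded))                               # 5 bytes for four numbers
```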
Using a trie in index construction …
1. Helps to quickly find words that have been seen before
2. Helps to quickly decide whether a word has not been seen before
3. Helps to maintain the lexicographic order of words seen in
the documents
4. All of the above

Information Retrieval- 28
1.3.2 Web-Scale Indexing: Map-Reduce
Pioneered by Google: 20PB of data per day (2008)
– Scan 100 TB on 1 node @ 50 MB/s = 23 days
– Scan on 1000-node cluster = 33 minutes

Cost-efficiency
– Commodity nodes, network (cheap, but unreliable)
– Automatic fault-tolerance (fewer admins)
– Easy to use (fewer programmers)

Information Retrieval- 29
Map-Reduce Programming Model
Data type: key-value pairs

Map function:
Analyses some input, and produces a list of results

Reduce function:
Takes all results belonging to one key, and computes
aggregates

Information Retrieval- 30
Example
Basic word counter program

def mapper(document, line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))

Information Retrieval- 31
Map-Reduce Processing Model
– The input data is partitioned into subsets
– Mappers extract word occurrences, emitting (word, 1) pairs
– For each intermediate key, the assigned reduce process is chosen
– Reducers aggregate the word occurrences per key
– The output is written to stable storage

Important: the reducers can only start after all mappers have finished!
Information Retrieval- 32
Refined Map-Reduce Programming Model
Combiners work like reducers, but only on the local data of a mapper

def combiner(key, values):
    output(key, sum(values))

Partitioners allow controlling the strategy for distributing keys to reducers

Information Retrieval- 33
What the Programmer Controls (and not)
The programmer controls
• Key-value data structures (can be complex)
• Maintenance of state in mappers and reducers
• Sort order of intermediate key-value pairs
• Partitioning scheme on the key space
The map-reduce platform controls
• where the mappers and reducers run
• when a mapper and reducer starts and terminates
• which input data is assigned to a specific mapper
• which intermediate key-value pairs are processed by a specific
reducer
Information Retrieval- 34
Inverted File Construction Using Map-Reduce

Mappers extract postings from document

Postings are provided to reducers

Reducers aggregate posting lists

Information Retrieval- 35
Inverted File Construction Program
def mapper(document, text):
    f = {}
    for word in text.split():
        f[word] = f.get(word, 0) + 1
    for word in f.keys():
        output(word, (document, f[word]))

def reducer(key, postings):
    p = []
    for d, f in postings:
        p.append((d, f))
    p.sort()
    output(key, p)
Information Retrieval- 36
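The mapper/reducer pair above can be exercised with a tiny in-memory driver that imitates the map, shuffle, and reduce phases; as an assumption of this sketch, `output` is modeled by having mapper and reducer return lists of key-value pairs.

```python
from collections import defaultdict

def run_map_reduce(mapper, reducer, inputs):
    intermediate = defaultdict(list)
    for key, value in inputs:                 # map phase
        for k, v in mapper(key, value):
            intermediate[k].append(v)
    results = []
    for k in sorted(intermediate):            # shuffle + reduce phase
        results.extend(reducer(k, intermediate[k]))
    return results

def mapper(document, text):
    f = {}
    for word in text.split():
        f[word] = f.get(word, 0) + 1
    return [(word, (document, f[word])) for word in f]

def reducer(key, postings):
    return [(key, sorted(postings))]

docs = [("B9", "nonlinear systems"), ("B10", "ordinary differential equations")]
index = dict(run_map_reduce(mapper, reducer, docs))
print(index["systems"])   # [('B9', 1)]
```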
Other Applications of Map-Reduce
The framework is used in many other tasks, particularly for text and Web data processing
• Graph processing (e.g., PageRank)
• Processing relational joins
• Learning probabilistic models

Information Retrieval- 37
Maintaining the order of document identifiers for
vocabulary construction when partitioning the document
collection is important ...

1. in the index merging approach for single node machines


2. in the map-reduce approach for parallel clusters
3. in both
4. in neither of the two

Information Retrieval- 38
1.3.3 Distributed Retrieval
Centralized retrieval
– Aggregate the weights for ALL documents by scanning the
posting lists of the query terms
– Scanning is relatively efficient
– Computationally quite expensive (memory, processing)

Query = t1, t2, t3 …

[Figure: the posting lists of t1, t2 and t3 are scanned to produce the list of documents containing t1 or t2 or t3, together with their weights.]

Information Retrieval- 39
Distributed Retrieval
Distributed retrieval
– Posting lists for different terms stored on different nodes
– The transfer of complete posting lists can become prohibitively
expensive in terms of bandwidth consumption

Query = t1, t2, t3 …

[Figure: the posting lists of t1, t2 and t3 reside on different nodes; to evaluate the query, posting lists (e.g., of t3) must be sent over the network to one node.]

Is it necessary to transfer the complete posting lists to identify the top-k documents?
Information Retrieval- 40
1.3.3.1 Fagin’s Algorithm
Entries in the posting lists are sorted according to their tf-idf weights
– Scan all lists in parallel, round-robin, until k documents are detected that occur in all lists
– Look up the missing weights for documents that have not been seen in all lists
– Select the top-k elements
The algorithm provably returns the top-k documents for monotone aggregation functions!

Information Retrieval- 41
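The procedure can be sketched for sum aggregation over lists sorted by descending weight; the two lists below reproduce the collection used in the slides' examples.

```python
def fagin_top_k(lists, k):
    # Sorted access: scan all lists round-robin until k documents have
    # been seen in every list.
    seen = {}                                   # doc -> set of lists seen in
    depth = 0
    while sum(1 for s in seen.values() if len(s) == len(lists)) < k:
        for i, lst in enumerate(lists):
            doc, _ = lst[depth]
            seen.setdefault(doc, set()).add(i)
        depth += 1
    # Random access: look up missing weights of every document seen,
    # then select the top-k by aggregate (sum) weight.
    agg = {d: sum(dict(lst).get(d, 0.0) for lst in lists) for d in seen}
    return sorted(agg.items(), key=lambda kv: -kv[1])[:k]

L1 = [("d1", .9), ("d4", .82), ("d3", .8), ("d5", .65), ("d6", .51), ("d2", .1), ("d7", .0)]
L2 = [("d6", .81), ("d2", .7), ("d5", .66), ("d1", .45), ("d3", .33), ("d7", .15), ("d4", .0)]
print([d for d, w in fagin_top_k([L1, L2], 2)])   # ['d1', 'd6']
```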
Example 1
Finding the top-2 elements for a two-term query

List 1: d1 0.9, d4 0.82, d3 0.8, d5 0.65, …, d6 0.51, d2 0.1, d7 0.0
List 2: d6 0.81, d2 0.7, d5 0.66, d1 0.45, …, d3 0.33, d7 0.15, d4 0.0

Round-robin scan; documents seen so far (with their currently known weight):
  d1 0.9, d6 0.81, d4 0.82, d2 0.7, d3 0.8, d5 0.66

Information Retrieval- 42
Example 2
The scan stops after the fourth round: d5 (0.65 in list 1, 0.66 in list 2) and d1 (0.9 and 0.45) have now been seen in both lists, so their full aggregate weights are known:
  d1: 0.9 + 0.45 = 1.35
  d5: 0.65 + 0.66 = 1.31

Information Retrieval- 43
Example 3
The missing weights of all other documents seen are looked up:
  d1: 0.9 + 0.45 = 1.35
  d6: 0.51 + 0.81 = 1.32
  d4: 0.82 + 0.0 = 0.82
  d2: 0.1 + 0.7 = 0.8
  d3: 0.8 + 0.33 = 1.13
  d5: 0.65 + 0.66 = 1.31

Top-2 result: d1 (1.35) and d6 (1.32)

Information Retrieval- 44
Discussion
Complexity
– O((k·n)^(1/2)) entries are read in each list for n documents
– Assuming that entries are uncorrelated
– Improves if they are positively correlated
In distributed settings optimizations to reduce the
number of roundtrips
– Send a longer prefix of one list to the other node
Fagin’s algorithm may behave poorly in practical cases
– Alternative algorithm: Threshold Algorithm

Information Retrieval- 45
1.3.3.2 Threshold Algorithm
Threshold Algorithm
- Access sequentially elements in each list
- At each round
- lookup missing weights of current elements in other lists using
random access
- Keep the top k elements seen so far
- Compute threshold as aggregate value of the different elements
seen in current round
- If at least k documents have aggregate value higher than
threshold, halt

Information Retrieval- 46
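The rounds above can be sketched for sum aggregation over two descending lists; the halting test uses ≥ (at least the threshold), and the example data matches the slides.

```python
def ta_top_k(lists, k):
    best = {}                                    # doc -> full aggregate weight
    for depth in range(len(lists[0])):
        threshold = 0.0
        for lst in lists:                        # sorted access per round
            doc, w = lst[depth]
            threshold += w
            if doc not in best:                  # random-access lookups
                best[doc] = sum(dict(other).get(doc, 0.0) for other in lists)
        top = sorted(best.values(), reverse=True)[:k]
        # Halt once k documents have aggregate value at least the threshold.
        if len(top) == k and top[-1] >= threshold:
            return sorted(best.items(), key=lambda kv: -kv[1])[:k]

L1 = [("d1", .9), ("d4", .82), ("d3", .8), ("d5", .65), ("d6", .51), ("d2", .1), ("d7", .0)]
L2 = [("d6", .81), ("d2", .7), ("d5", .66), ("d1", .45), ("d3", .33), ("d7", .15), ("d4", .0)]
print([d for d, w in ta_top_k([L1, L2], 2)])   # ['d1', 'd6']
```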
Example
Finding the top-2 elements for a two-term query
Round 1: sorted access reads d1 (0.9) and d6 (0.81); threshold = 0.9 + 0.81 = 1.71
Random-access lookups: d1: 0.9 + 0.45 = 1.35;  d6: 0.51 + 0.81 = 1.32
d1 and d6 have aggregate weights lower than the threshold, therefore continue

Information Retrieval- 47
Example
Round 2: sorted access reads d4 (0.82) and d2 (0.7); threshold = 0.82 + 0.7 = 1.52
Lookups: d4: 0.82 + 0.0 = 0.82;  d2: 0.1 + 0.7 = 0.8
The documents d4 and d2 have lower aggregate weights than the current top-2 and are therefore dismissed
d1 and d6 still have aggregate weights lower than the threshold, therefore continue
Information Retrieval- 48
Example
Round 3: sorted access reads d3 (0.8) and d5 (0.66); threshold = 0.8 + 0.66 = 1.46
Lookups: d3: 0.8 + 0.33 = 1.13;  d5: 0.65 + 0.66 = 1.31
The documents d3 and d5 have lower aggregate weights than the current top-2 and are therefore dismissed
d1 and d6 still have aggregate weights lower than the threshold, therefore continue
Information Retrieval- 49
Example
Round 4: sorted access reads d5 (0.65) and d1 (0.45); threshold = 0.65 + 0.45 = 1.1
The document d5 (1.31) has a lower aggregate weight than the current top-2 and is therefore dismissed

In this round d1 (1.35) and d6 (1.32) have aggregate weights higher than the threshold, and the algorithm terminates with the top-2 result d1, d6
Information Retrieval- 50
Discussion
TA in general performs fewer rounds than FA
- Therefore, fewer document accesses
- But more random accesses
TA is also provably correct for monotone aggregation
functions

Information Retrieval- 51
Applications
Beyond distributed document retrieval these algorithms
have wider applications
– Multimedia, image retrieval
– Top-k processing in relational databases
– Document filtering
– Sensor data processing

Information Retrieval- 52
When applying Fagin’s algorithm for a query with
three different terms for finding the k top
documents, the algorithm will scan ...

1. 2 different lists
2. 3 different lists
3. k different lists
4. it depends on how many rounds are taken

Information Retrieval- 53
With Fagin’s algorithm, once k documents have
been identified that occur in all of the lists ...
1. These are the top-k
documents
2. The top-k documents are
among the documents seen
so far
3. The search has to continue
in round-robin till the top-k
documents are identified
4. Other documents have to be
searched to complete the
top-k list

Information Retrieval- 54
References
Lin, J., Dyer, C. (2010). Inverted Indexing for Text Retrieval. In: Data-
Intensive Text Processing with MapReduce. Synthesis Lectures on
Human Language Technologies. Springer, Cham.

Ronald Fagin, Amnon Lotem, Moni Naor. Optimal aggregation algorithms for middleware. PODS 2001.

Information Retrieval- 55
