IR - Lecture 2
● A simple IR task
● Boolean Retrieval models
● Indexing
[Figure: Query and Unstructured Corpus -> IR System -> Retrieved Documents]
Example of query:
Virat AND Anushka
Linear scan: For every query, scan the corpus and find
relevant documents.
Major limitation:
● Every query must process the whole corpus, i.e., N * Awpd words,
where N is the number of documents and Awpd is the average number
of words per document.
● If N is large (a million documents or more), it is not feasible
to build a practical IR system this way on a decent computer.
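To make the cost concrete, the linear scan can be sketched as below. This is a minimal illustration, not the lecture's implementation: the toy corpus, the whitespace tokenization, and the function name `linear_scan_and` are all assumptions made for the example.

```python
# Hypothetical toy corpus; ids and texts are invented for illustration.
corpus = {
    1: "Virat scored a century",
    2: "Anushka and Virat attended the match",
    3: "Anushka released a new film",
}


def linear_scan_and(corpus, terms):
    """Return sorted ids of documents that contain every query term."""
    wanted = {t.lower() for t in terms}
    hits = []
    for doc_id, text in corpus.items():    # visits all N documents
        words = set(text.lower().split())  # reads about Awpd words per document
        if wanted <= words:                # every query term is present
            hits.append(doc_id)
    return sorted(hits)


result = linear_scan_and(corpus, ["Virat", "Anushka"])
print(result)  # [2]
```

Every call rereads the full corpus, so the work per query grows with N * Awpd regardless of how rare the query words are.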
BITS Pilani, Pilani Campus
Naive solution: Linear scan
Query
          Doc 1  Doc 2  Doc 3  Doc 4  ...  Doc N
Word 1      1      0      1      0    ...    1
Word 2      0      1      0      0    ...    0
Word 3      0      0      0      0    ...    0
Word 4      0      1      1      0    ...    0
Word 5      1      0      1      1    ...    1
Word 6      1      1      0      1    ...    1
...
Word M      0      0      1      0    ...    1
Word 1 1 0 1 0 ... 1
AND
Word 6 1 1 0 1 ... 1
=
Result 1 0 0 0 ... 1
Limitations:
● The matrix (M words x N documents) will be huge for a general corpus.
● Every new document adds a column, so the matrix grows by at
least M entries.
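The Boolean AND over incidence-matrix rows can be sketched as follows. The rows are copied from the slide's matrix; treating the five visible columns as documents 1 to 5 is an assumption for the example, as are the helper names `boolean_and` and `matching_docs`.

```python
# Rows taken from the slide; each column is assumed to be one document.
incidence = {
    "word 1": [1, 0, 1, 0, 1],
    "word 2": [0, 1, 0, 0, 0],
    "word 5": [1, 0, 1, 1, 1],
    "word 6": [1, 1, 0, 1, 1],
}


def boolean_and(matrix, term_a, term_b):
    """Bitwise AND of two term rows; a 1 marks a document with both terms."""
    return [a & b for a, b in zip(matrix[term_a], matrix[term_b])]


def matching_docs(row):
    """Convert a 0/1 row into 1-based document numbers."""
    return [j + 1 for j, bit in enumerate(row) if bit]


row = boolean_and(incidence, "word 1", "word 6")
print(row)                 # [1, 0, 0, 0, 1]
print(matching_docs(row))  # [1, 5]
```

Each row stores one entry per document, including all the zeros, which is why the full matrix needs M * N entries.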
intersection
Word 3 Doc 3 Doc 40 Doc 44 Doc 55 Doc 90
=
Result Doc 90
Observations:
● The query processing time for the inverted index is
asymptotically the same as that of the incidence matrix, since in
the worst case a posting list has close to N entries.
● In practice it takes less time, because the posting lists of
most query words are far shorter than N.
● Query optimization can reduce the time further.
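A minimal sketch of an inverted index and of posting-list intersection is given below. It assumes posting lists are kept sorted by document id and uses the standard two-pointer merge; the function names and the second posting list are hypothetical, while the first list is Word 3's list from the slide.

```python
def build_inverted_index(corpus):
    """Map each word to a sorted posting list of document ids."""
    index = {}
    for doc_id in sorted(corpus):  # ascending ids keep posting lists sorted
        for word in set(corpus[doc_id].lower().split()):
            index.setdefault(word, []).append(doc_id)
    return index


def intersect(p1, p2):
    """Merge two sorted posting lists in at most len(p1) + len(p2) steps."""
    i, j, answer = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


word_3 = [3, 40, 44, 55, 90]  # posting list shown on the slide
other = [7, 90, 120]          # hypothetical second posting list
print(intersect(word_3, other))  # [90]
```

The merge touches only the entries in the two posting lists, so its cost depends on their lengths rather than on N, which matches the observation that typical queries run faster than the worst case.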
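Query Optimization:
E.g., query: word 1 AND word 2 AND word 3
Let the posting list sizes for word 1, word 2 and word 3 be 100,
50 and 10, respectively. Processing the shortest lists first
(word 3 AND word 2, then AND word 1) keeps every intermediate
result at 10 entries or fewer.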
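The ordering idea can be sketched as below: sort the posting lists by length and intersect from shortest to longest. The synthetic lists only reproduce the sizes 100, 50 and 10 from the example; their contents and the name `intersect_many` are assumptions for illustration.

```python
def intersect(p1, p2):
    """Two-pointer merge of two sorted posting lists."""
    i, j, answer = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(posting_lists):
    """AND several posting lists, shortest first, stopping early when empty."""
    if not posting_lists:
        return []
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:  # nothing left to match
            break
        result = intersect(result, plist)
    return result


# Synthetic posting lists with the sizes from the example (contents invented).
word_1 = list(range(0, 200, 2))   # 100 entries
word_2 = list(range(0, 200, 4))   # 50 entries
word_3 = list(range(0, 300, 30))  # 10 entries

print(intersect_many([word_1, word_2, word_3]))  # [0, 60, 120, 180]
```

Starting from the 10-entry list means no intermediate result can exceed 10 entries, whereas starting with word 1 AND word 2 could produce up to 50.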
https://nlp.stanford.edu/IR-book/
Chapter 1