IR - Lecture 2

The document describes a lecture on information retrieval systems. It discusses the key components of an IR system including documents, queries, and retrieved documents. It then covers different models for IR including boolean retrieval and introduces solutions for indexing documents like incidence matrices and inverted indexes to improve search efficiency over a linear scan. The benefits and limitations of incidence matrices and inverted indexes are described. Finally, it provides an example of building an inverted index from sample documents.

Information Retrieval

Abhishek
BITS Pilani, Pilani Campus
January 2020

CS F469, Information Retrieval


Lecture No. 2
Recap of Lecture 1

● What is Information Retrieval (IR)?
● Why do we need IR?
● Why is IR important?
● Course overview

BITS Pilani, Pilani Campus


Today’s Lecture

● A simple IR task
● Boolean Retrieval models
● Indexing

Overview of IR system

Unstructured Corpus → IR System → Retrieved Documents
Query → IR System

● Depends on the IR model
● Specific need of the IR system


Key terminologies

● Document: The unit over which we have decided to build the retrieval system. It can be a web page, a book chapter, a research paper, etc.
● Collection/Corpus: The group of documents over which we perform retrieval.


Key terminologies

● Information Need: A topic about which the user wants to know more.
● Query: The words or phrases that the user enters into the system in an attempt to communicate the information need.


Boolean Retrieval Model

An IR model in which the user poses queries as Boolean expressions.

Example of a query:
Virat AND Anushka


An Example IR problem

Documents: D1, D2, ..., DN
Average words per document: Awpd
Unique words: M
Model: Boolean retrieval model


Naive solution: Linear scan

Linear scan: For every query, scan the whole corpus and find the relevant documents.

Major limitation:
● Every query must process the whole corpus, i.e., N * Awpd words.
● If N is large (millions of documents or more), it is not feasible to build a practical IR system this way on a typical computer.
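
A minimal sketch of a linear scan for an AND query is given here. It assumes a toy in-memory corpus and whitespace tokenization. The function name `linear_scan_and` and the two sample sentences are illustrative assumptions only.

```python
def linear_scan_and(corpus, query_terms):
    """Return the IDs of documents that contain every query term.

    Every call reads every word of every document, which is the
    N * Awpd cost noted above.
    """
    wanted = {term.lower() for term in query_terms}
    hits = []
    for doc_id, text in corpus.items():
        words = set(text.lower().split())  # naive whitespace tokenization
        if wanted <= words:                # all query terms are present
            hits.append(doc_id)
    return hits


# Hypothetical toy corpus: document ID -> text
corpus = {
    1: "a quick brown fox jumps over a lazy dog",
    2: "the quick sly fox jumped over the lazy brown dog",
}

print(linear_scan_and(corpus, ["quick", "sly"]))  # [2]
```

The work is repeated from scratch for each query, since nothing is remembered between calls.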

Naive solution: Linear scan

Advantage: Linear scan is the only solution if we have access only to the corpus, with no additional memory or storage space.

Better solutions require an intermediate data structure.


Overview of IR system

Unstructured Corpus → Intermediate Data Structure → IR System → Retrieved Documents
Query → IR System


Solution 2: Incidence Matrix

          Document 1  Document 2  Document 3  Document 4  ...  Document N
Word 1        1           0           1           0       ...      1
Word 2        0           1           0           0       ...      0
Word 3        0           0           0           0       ...      0
Word 4        0           1           1           0       ...      0
Word 5        1           0           1           1       ...      1
Word 6        1           1           0           1       ...      1
...
Word M        0           0           1           0       ...      1


Solution 2: Incidence Matrix

Advantage: For every query, the system needs to access only a few rows (one per query word) and perform Boolean operations on them.

Solution 2: Incidence Matrix

E.g., query: word 1 AND word 6

Word 1    1  0  1  0  ...  1
  AND
Word 6    1  1  0  1  ...  1
  =
Result    1  0  0  0  ...  1

For every query: N * (query words - 1) AND operations
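
The row-wise AND can be sketched in Python as follows. The matrix is stored here as a dictionary of 0/1 lists, which is an assumption made for illustration. Only the five columns shown on the slide are kept, and the name `boolean_and` is hypothetical.

```python
# Rows of the incidence matrix for the two query words
# (columns: Document 1, 2, 3, 4 and N from the example above).
incidence = {
    "word 1": [1, 0, 1, 0, 1],
    "word 6": [1, 1, 0, 1, 1],
}


def boolean_and(matrix, terms):
    """AND together the rows of all query terms, column by column."""
    rows = [matrix[term] for term in terms]
    result = list(rows[0])
    for row in rows[1:]:
        # One pass of N AND operations for each additional query word
        result = [a & b for a, b in zip(result, row)]
    return result


result = boolean_and(incidence, ["word 1", "word 6"])
print(result)  # [1, 0, 0, 0, 1]
```

A 1 in position i of the result means that the i-th document contains both words.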



Solution 2: Incidence Matrix

Limitations:
● The matrix will be huge for a general corpus: it has M * N cells.
● For every new document, the matrix grows by at least M cells (one new column).

Observation in text corpora: apart from a few common words, most words appear in only a small fraction of the documents, so the matrix is mostly zeros.
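
A rough calculation shows how sparse the matrix can be. The corpus sizes below are hypothetical values chosen for illustration, not figures from the lecture. The bound relies on the fact that a document cannot contain more distinct words than it has words, so the matrix holds at most N * Awpd ones.

```python
def matrix_density_bound(n_docs, n_terms, avg_words_per_doc):
    """Return (cells, max_ones, max_density) for an incidence matrix.

    cells       = M * N entries in the full matrix
    max_ones    = N * Awpd, an upper bound on the number of 1 entries
    max_density = max_ones / cells = Awpd / M
    """
    cells = n_docs * n_terms
    max_ones = n_docs * avg_words_per_doc
    return cells, max_ones, max_ones / cells


# Hypothetical corpus: N = 1 million documents, M = 500,000 unique words,
# Awpd = 1,000 words per document on average.
cells, max_ones, max_density = matrix_density_bound(1_000_000, 500_000, 1_000)
print(cells)        # 500000000000
print(max_ones)     # 1000000000
print(max_density)  # 0.002
```

Under these assumed numbers at least 99.8% of the cells are zero, which is the motivation for storing only the non-zero entries.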

Solution 3: Inverted Index

Dictionary/Vocabulary   Posting List
Word 1                  Doc 1, Doc 20, Doc 34, Doc 90, ...
Word 2                  Doc 44, Doc 99
Word 3                  Doc 3, Doc 40, Doc 44, Doc 55, Doc 90


Solution 3: Inverted Index

Advantage: For every query, the system needs to access only a few entries of the dictionary (one per query word) and intersect their posting lists.


Solution 3: Inverted Index

E.g., query: word 1 AND word 3

Word 1    Doc 1, Doc 20, Doc 34, Doc 90, ...
  intersection
Word 3    Doc 3, Doc 40, Doc 44, Doc 55, Doc 90
  =
Result    Doc 90

If the posting lists are sorted, each intersection is a single merge pass that costs O(N) in the worst case, so every query needs O(N) * (query words) operations to find the intersection.


Posting List Intersection Algorithm

Source: Figure 1.6, Introduction to Information Retrieval
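
A merge-style intersection of two sorted posting lists can be sketched in Python as follows. This is an illustrative version written for these notes, not a copy of the cited figure. The function name `intersect` and the use of plain integer lists for document IDs are assumptions.

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by ascending document ID.

    Each step advances at least one pointer, so the loop runs at most
    len(p1) + len(p2) times.
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # p1[i] cannot occur later in p2
        else:
            j += 1                # p2[j] cannot occur later in p1
    return answer


# Posting lists from the earlier example (word 1 cut off at the IDs shown)
word_1 = [1, 20, 34, 90]
word_3 = [3, 40, 44, 55, 90]
print(intersect(word_1, word_3))  # [90]
```

The single pass works only because both lists are sorted. On unsorted lists the pointers could skip past a match.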



Solution 3: Inverted Index

Observations:
● The worst-case query processing time with an inverted index is asymptotically the same as with an incidence matrix.
● In practice, however, it takes less time, because the posting lists of most query words are much shorter than N.
● Query optimization can reduce the time further.


Solution 3: Inverted Index

Query Optimization:
E.g., query: word 1 AND word 2 AND word 3

Let the posting list sizes for word 1, word 2 and word 3 be 100, 50 and 10, respectively.

In what order should we process this query?
1. (word 1 AND word 2) AND word 3
2. word 1 AND (word 2 AND word 3)
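
One way to reason about the two orders: merging sorted lists of sizes x and y takes at most x + y steps, and the result is no longer than the shorter list. Order 1 therefore needs at most (100 + 50) + (50 + 10) = 210 steps, and order 2 at most (50 + 10) + (10 + 100) = 170. A common heuristic, described in Chapter 1 of the reference, is to process terms in order of increasing posting list size. The sketch below applies it to made-up posting lists of sizes 100, 50 and 10. The document IDs and function names are hypothetical.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted posting lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(posting_lists):
    """AND several posting lists, shortest first.

    Starting with the shortest lists keeps the intermediate result small,
    so the later merges have less to scan.
    """
    if not posting_lists:
        return []
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:  # nothing left to match, stop early
            break
        result = intersect(result, plist)
    return result


# Hypothetical posting lists with 100, 50 and 10 entries
word_1 = list(range(2, 201, 2))
word_2 = list(range(4, 201, 4))
word_3 = list(range(20, 201, 20))

print(intersect_many([word_1, word_2, word_3]))
# [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]
```

The answer is the same in any order, since AND is associative and commutative. Only the amount of work changes.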



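Inverted Index with Document Frequency

Dictionary/Vocabulary   Frequency   Posting List
Word 1                  45          Doc 1, Doc 20, Doc 34, Doc 90, ...
Word 2                  2           Doc 44, Doc 99
Word 3                  5           Doc 3, Doc 40, Doc 44, Doc 55, Doc 90

The number stored with each term is the length of its posting list, i.e., the number of documents that contain the term.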


Building an Inverted Index

Step 1: Collect the documents to be indexed.

Example with two documents:
Doc 1:
A quick brown fox jumps over a lazy dog.
Doc 2:
The quick sly fox jumped over the lazy brown dog.


Building an Inverted Index

Step 2: Tokenize each document into a list of tokens.

Doc 1:
A quick brown fox jumps over a lazy dog .
Doc 2:
The quick sly fox jumped over the lazy brown dog .


Building an Inverted Index

Step 3: Do some linguistic preprocessing, e.g., lowercase the tokens and drop punctuation.

Doc 1:
a quick brown fox jumps over a lazy dog
Doc 2:
the quick sly fox jumped over the lazy brown dog
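
Steps 2 and 3 can be sketched with Python's standard `re` module. The regular expression and the helper names `tokenize` and `normalize` are assumptions made for this example. Real systems usually apply more elaborate rules.

```python
import re


def tokenize(text):
    """Step 2: split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(tokens):
    """Step 3: lowercase the tokens and drop pure punctuation."""
    return [token.lower() for token in tokens if token.isalnum()]


doc_1 = "A quick brown fox jumps over a lazy dog."
tokens = tokenize(doc_1)
print(tokens)
# ['A', 'quick', 'brown', 'fox', 'jumps', 'over', 'a', 'lazy', 'dog', '.']
print(normalize(tokens))
# ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'a', 'lazy', 'dog']
```

Keeping the two steps in separate functions makes it easy to swap in other preprocessing later without touching the tokenizer.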



Building an Inverted Index

Step 4: Build the inverted index, treating the tokens as terms. First, list every (term, document ID) pair.

Doc 1        Doc 2
a 1          the 2
quick 1      quick 2
brown 1      sly 2
fox 1        fox 2
jumps 1      jumped 2
over 1       over 2
a 1          the 2
lazy 1       lazy 2
dog 1        brown 2
             dog 2


Building an Inverted Index

Step 4 (continued): Sort the (term, document ID) pairs by term and merge duplicates, so that each term maps to its posting list.

a 1
brown 1, 2
dog 1, 2
fox 1, 2
jumped 2
jumps 1
lazy 1, 2
over 1, 2
quick 1, 2
sly 2
the 2
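
The sort-and-merge construction can be sketched in Python as follows. It assumes the documents have already been tokenized and lowercased, and the function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build term -> sorted list of document IDs from preprocessed docs."""
    # List every (term, document ID) pair.
    pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]

    # Drop duplicate pairs, sort by term and then by document ID,
    # and append each document ID to its term's posting list.
    index = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        index[term].append(doc_id)
    return dict(index)


docs = {
    1: "a quick brown fox jumps over a lazy dog".split(),
    2: "the quick sly fox jumped over the lazy brown dog".split(),
}

index = build_inverted_index(docs)
for term, postings in index.items():
    print(term, postings)
```

Because the pairs are sorted before merging, the dictionary is in alphabetical order and every posting list is in ascending document ID order, which is what the merge-based intersection requires. For example, `index["brown"]` is `[1, 2]` and `index["sly"]` is `[2]`.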



Boolean Retrieval system using Google

Exercise: What is the syntax of the AND, OR and NOT operators in Google search?


Reference

Introduction to Information Retrieval, Chapter 1
https://nlp.stanford.edu/IR-book/


Thank You!
