IR - Lecture 2

The document describes a lecture on information retrieval systems. It discusses the key components of an IR system including documents, queries, and retrieved documents. It then covers different models for IR including boolean retrieval and introduces solutions for indexing documents like incidence matrices and inverted indexes to improve search efficiency over a linear scan. The benefits and limitations of incidence matrices and inverted indexes are described. Finally, it provides an example of building an inverted index from sample documents.

Uploaded by

Shagufta Gurmukhdas


Information Retrieval

Abhishek
BITS Pilani, Pilani Campus
January 2020

CS F469, Information Retrieval

Lecture No. 2

Recap of Lecture 1

● What is Information Retrieval (IR)?

● Why do we need IR?
● Why is IR important?
● Course overview

BITS Pilani, Pilani Campus

Today’s Lecture

● A simple IR task
● Boolean Retrieval models
● Indexing


Overview of IR system

Query, Unstructured Corpus → IR System → Retrieved Documents


● Depends on the IR model
● Specific need of the IR system


Key terminologies

● Document: The unit over which we have decided to build a retrieval system. It can be a web page, a book chapter, a research paper, etc.
● Collection/Corpus: The group of documents over which we perform retrieval.

● Information Need: A topic about which a user wants to know more.
● Query: The words or phrases that the user types into the computer in an attempt to communicate the information need.


Boolean Retrieval Model

An IR model in which the user poses queries as Boolean expressions.

Example of query:
Virat AND Anushka


An Example IR problem

Documents: D1, D2, …, DN (N documents in total)
Average words per document: Awpd
Unique words in the corpus: M
Model: Boolean retrieval model.


Naive solution: Linear scan

Linear scan: For every query, scan the whole corpus and find the relevant documents.

Major limitations:
● Every query has to process the whole corpus, i.e., about N * Awpd words.
● If N is large (millions of documents or more), a practical IR system cannot be built this way on an ordinary computer.
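
To make the cost concrete, here is a minimal Python sketch of a linear scan for an AND query. The function name `linear_scan` and the three-document corpus are made up for illustration; they are not part of the lecture.

```python
def linear_scan(corpus, query_terms):
    """Return IDs of documents that contain every query term (Boolean AND)."""
    matches = []
    for doc_id, text in corpus.items():
        # Each query re-reads every word of every document: about N * Awpd words.
        words = set(text.lower().split())
        if all(term.lower() in words for term in query_terms):
            matches.append(doc_id)
    return matches


# Hypothetical three-document corpus, for illustration only.
corpus = {
    1: "Virat scored a century",
    2: "Anushka attended the match with Virat",
    3: "Anushka released a new film",
}
print(linear_scan(corpus, ["Virat", "Anushka"]))  # [2]
```

Nothing is remembered between queries, so the second query costs exactly as much as the first.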

Advantage: Linear scan is the only option when we have access to the corpus alone, with no additional memory or storage space.

Better solutions require an intermediate data structure.


Overview of IR system

Unstructured Corpus → Intermediate Data Structure → IR System → Retrieved Documents
Query → IR System


Solution 2: Incidence Matrix

          Doc 1   Doc 2   Doc 3   Doc 4   ...   Doc N
Word 1      1       0       1       0     ...     1
Word 2      0       1       0       0     ...     0
Word 3      0       0       0       0     ...     0
Word 4      0       1       1       0     ...     0
Word 5      1       0       1       1     ...     1
Word 6      1       1       0       1     ...     1
Word M      0       0       1       0     ...     1

Advantage: For every query, the system needs to access only a few rows and perform Boolean operations on those rows.

Eg. query: word 1 AND word 6

Word 1    1   0   1   0   ...   1
  AND
Word 6    1   1   0   1   ...   1

Result    1   0   0   0   ...   1

For every query: N * (number of query words - 1) AND operations, since each additional query word needs one AND across all N columns.
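
The row-wise AND can be sketched in a few lines of Python. The two rows are copied from the slide's example; treating the five visible columns as the whole matrix (so N = 5) and the helper name `and_rows` are assumptions made for illustration.

```python
# Rows taken from the example above; only the five visible columns are used,
# so N = 5 here is an assumption.
incidence = {
    "word 1": [1, 0, 1, 0, 1],
    "word 6": [1, 1, 0, 1, 1],
}


def and_rows(matrix, terms):
    """AND the rows of the given terms, one column (document) at a time."""
    result = list(matrix[terms[0]])
    for term in terms[1:]:
        # One pass over N columns per extra term: N * (len(terms) - 1) ANDs.
        result = [a & b for a, b in zip(result, matrix[term])]
    return result


row = and_rows(incidence, ["word 1", "word 6"])
print(row)  # [1, 0, 0, 0, 1]
```

The positions holding 1 in the result identify the matching documents, here the first and the last column.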

Limitations:
● The matrix will be huge for a general corpus: it has M * N entries.
● Every new document adds a column of at least M entries.

Observation about text corpora: apart from a few common words, most words appear in only a small fraction of the documents, so most matrix entries are 0.


Solution 3: Inverted Index

Dictionary/Vocabulary    Posting List
Word 1                   Doc 1, Doc 20, Doc 34, Doc 90, ...
Word 2                   Doc 44, Doc 99
Word 3                   Doc 3, Doc 40, Doc 44, Doc 55, Doc 90
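
One way to see how the inverted index relates to the incidence matrix is to keep, for each word, only the IDs of the documents whose entry is 1. The sketch below does this for a small hypothetical matrix; the data and the function name `to_postings` are illustrative assumptions, not the lecture's own.

```python
def to_postings(matrix):
    """Keep only the document IDs whose matrix entry is 1, in ascending order."""
    return {
        term: [doc_id for doc_id, bit in enumerate(row, start=1) if bit]
        for term, row in matrix.items()
    }


# Hypothetical 3-word, 5-document incidence matrix, for illustration only.
matrix = {
    "word 1": [1, 0, 1, 0, 1],
    "word 2": [0, 1, 0, 0, 0],
    "word 4": [0, 1, 1, 0, 0],
}
index = to_postings(matrix)
print(index)  # {'word 1': [1, 3, 5], 'word 2': [2], 'word 4': [2, 3]}
```

The matrix stores 15 cells, while the posting lists store only 6 document IDs; the gap grows as the matrix gets sparser.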

Advantage: For every query, the system needs to access only a few dictionary entries and intersect their posting lists.

Eg. query: word 1 AND word 3

Word 1    Doc 1, Doc 20, Doc 34, Doc 90, ...
  intersection
Word 3    Doc 3, Doc 40, Doc 44, Doc 55, Doc 90
  =
Result    Doc 90

If the posting lists are sorted, then for every query: O(N) * (number of query words) operations to find the intersection, because each posting list holds at most N document IDs.


Posting List Intersection Algo.

Source: Figure 1.6, Introduction to Information Retrieval
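
The figure itself is not reproduced in this text. As a stand-in, here is a minimal Python sketch of the usual two-pointer merge for posting lists sorted by document ID; it may differ in detail from the cited figure. The function name `intersect` is an assumption, and the list for word 1 is cut off where the slide shows "...".

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by ascending document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: it satisfies the AND.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return answer


word1 = [1, 20, 34, 90]        # truncated where the slide shows "..."
word3 = [3, 40, 44, 55, 90]
print(intersect(word1, word3))  # [90]
```

Each loop iteration advances at least one pointer, so the work is at most len(p1) + len(p2) steps. This depends on both lists being sorted.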


Solution 3: Inverted Index

Observations:
● The query processing time for an inverted index is asymptotically the same as that for an incidence matrix.
● However, in practice it takes less time, because the posting lists involved in a query are rarely close to N in size.
● Query optimization can reduce the time further.

Query Optimization:
Eg. Query: word 1 AND word 2 AND word 3

Let the posting list sizes for word 1, word 2 and word 3 be 100, 50 and 10, respectively.

In what order should we process this query?

1. (word 1 AND word 2) AND word 3
2. word 1 AND (word 2 AND word 3)
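
A common heuristic is to intersect the shortest posting lists first, so that every intermediate result stays small; for the sizes above that favours option 2. The sketch below illustrates the idea. The posting lists are synthetic (only their sizes 100, 50 and 10 come from the slide), and the names `intersect` and `intersect_many` are assumptions.

```python
def intersect(p1, p2):
    """Two-pointer intersection of two sorted posting lists."""
    out = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_many(posting_lists):
    """AND several posting lists, processing the shortest lists first."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, plist)
    return result


# Synthetic lists with the sizes from the slide: 100, 50 and 10.
word1 = list(range(1, 101))        # 100 document IDs
word2 = list(range(2, 101, 2))     # 50 document IDs
word3 = list(range(10, 101, 10))   # 10 document IDs
print(len(intersect_many([word1, word2, word3])))  # 10
```

Starting with word 3 means no intermediate result can exceed 10 entries, whereas starting with word 1 AND word 2 can produce up to 50.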


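Inverted Index with Term Frequency

Dictionary/Vocabulary    Frequency    Posting List
Word 1                   45           Doc 1, Doc 20, Doc 34, Doc 90, ...
Word 2                   2            Doc 44, Doc 99
Word 3                   5            Doc 3, Doc 40, Doc 44, Doc 55, Doc 90

The number stored with each word is the length of its posting list, i.e., the number of documents that contain the word.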
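
Storing the frequency next to each word lets the system order a query's terms without reading the posting lists. A minimal sketch, assuming the frequency is simply the posting-list length; the `Entry` type and `with_frequencies` helper are hypothetical names, and word 1 is left out because its list is elided on the slide.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    doc_freq: int    # number of documents in the posting list
    postings: list   # sorted document IDs


def with_frequencies(index):
    """Attach to every term the length of its posting list."""
    return {term: Entry(len(p), p) for term, p in index.items()}


# Word 2 and word 3 from the slide; word 1 is omitted because its list is elided.
index = {"word 2": [44, 99], "word 3": [3, 40, 44, 55, 90]}
dictionary = with_frequencies(index)
print(dictionary["word 3"].doc_freq)  # 5

# Terms ordered from shortest to longest posting list, using only the dictionary.
cheapest_first = sorted(dictionary, key=lambda t: dictionary[t].doc_freq)
print(cheapest_first)  # ['word 2', 'word 3']
```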


Building an Inverted Index

Step 1: Collect the documents to be indexed.

Example with two documents.
Doc 1:
A quick brown fox jumps over a lazy dog.
Doc 2:
The quick sly fox jumped over the lazy brown dog.

Step 2: Tokenize each document into a list of tokens.

Doc 1:
A quick brown fox jumps over a lazy dog .
Doc 2:
The quick sly fox jumped over the lazy brown dog .

Step 3: Do some linguistic preprocessing, e.g., lowercasing. In this example the punctuation tokens are also dropped.

Doc 1:
a quick brown fox jumps over a lazy dog
Doc 2:
the quick sly fox jumped over the lazy brown dog
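
Steps 2 and 3 can be sketched with Python's standard `re` module. The regular expression and the rule for dropping punctuation are assumptions chosen to reproduce the token lists shown for Doc 1; real systems use more elaborate tokenizers and normalizers.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(tokens):
    """Lowercase every token and drop tokens that are not purely alphanumeric."""
    return [t.lower() for t in tokens if t.isalnum()]


doc1 = "A quick brown fox jumps over a lazy dog."
tokens = tokenize(doc1)
print(tokens)
# ['A', 'quick', 'brown', 'fox', 'jumps', 'over', 'a', 'lazy', 'dog', '.']
print(normalize(tokens))
# ['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'a', 'lazy', 'dog']
```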

Step 4: Build the inverted index considering the tokens as terms.

Term     Doc ID      Term     Doc ID
a        1           the      2
quick    1           quick    2
brown    1           sly      2
fox      1           fox      2
jumps    1           jumped   2
over     1           over     2
a        1           the      2
lazy     1           lazy     2
dog      1           brown    2
                     dog      2

Step 4 (continued): Sort the (term, document ID) pairs by term and merge the document IDs of each term into its posting list.

a        1
brown    1, 2
dog      1, 2
fox      1, 2
jumped   2
jumps    1
lazy     1, 2
over     1, 2
quick    1, 2
sly      2
the      2
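
The four steps can be put together in a short Python sketch, using the two example documents from Step 1. The function name `build_inverted_index` and the simple letters-only tokenizer are assumptions made for illustration; this is one possible implementation, not the lecture's own.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Steps 2 and 3: lowercase, then keep runs of letters as terms.
        for term in re.findall(r"[a-z]+", text.lower()):
            # A set merges repeated occurrences of a term within one document.
            postings[term].add(doc_id)
    # Step 4: sort the terms and the document IDs inside each posting list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    1: "A quick brown fox jumps over a lazy dog.",
    2: "The quick sly fox jumped over the lazy brown dog.",
}
index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, ids)
```

For these two documents the index has 11 terms. Because no stemming is applied, "jumps" and "jumped" remain separate terms, each with a single-document posting list.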


Boolean Retrieval system using Google

Exercise: What is the syntax of the AND, OR and NOT operators in Google Search?


Reference

Introduction to Information Retrieval, Chapter 1: https://nlp.stanford.edu/IR-book/


Thank You!
