Information Retrieval: Prof: Ehab Ezzat Hassanein

The document outlines the process of constructing and querying inverted indexes in information retrieval systems. Key stages include tokenization, normalization, stemming, and handling stop words, followed by indexing steps that involve sorting and creating a dictionary and postings. Additionally, it discusses efficient storage requirements and query processing techniques, particularly for AND queries.

Information Retrieval

Prof: Ehab Ezzat Hassanein


Constructing And Querying Inverted Indexes

Inverted Index construction

Initial stages of text processing

Tokenization
– Cut the character sequence into word tokens
– Deal with cases such as “John’s” and “a state-of-the-art solution”

Normalization
– Map text and query terms to the same form
– e.g. USA and U.S.A should match

Stemming
– We may wish different forms of a root to match
– e.g. authorize and authorization

Stop words
– We may omit very common words (or not!)
– e.g. the, a, to, of
– Counterexample: a query for the song “To be or not to be” is made up almost entirely of very common words
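
The four stages above can be sketched as a small pipeline. This is a minimal illustration, not the course's implementation: the function names, the regular expression, and the two suffix rules are assumptions made for the example, and `toy_stem` is a deliberately crude stand-in for a real stemmer.

```python
import re

# Stop words taken from the slide's example list.
STOP_WORDS = {"the", "a", "to", "of"}

# Toy suffix rules (an assumption for this sketch, not a real stemmer).
SUFFIX_RULES = ("ization", "ize")


def tokenize(text):
    """Cut a character sequence into word tokens, keeping internal
    apostrophes, periods and hyphens (John's, U.S.A, state-of-the-art)."""
    return re.findall(r"[A-Za-z0-9]+(?:['’.\-][A-Za-z0-9]+)*", text)


def normalize(token):
    """Map a token to one canonical form: lowercase, no periods,
    no possessive 's (dropping the possessive is an assumption)."""
    token = token.lower().replace("’", "'").replace(".", "")
    if token.endswith("'s"):
        token = token[:-2]
    return token


def toy_stem(token):
    """Strip one known suffix so that related forms share a stem."""
    for suffix in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stop_words=True):
    """Tokenize, normalize, optionally drop stop words, then stem."""
    terms = []
    for token in tokenize(text):
        term = normalize(token)
        if remove_stop_words and term in STOP_WORDS:
            continue
        terms.append(toy_stem(term))
    return terms


print(preprocess("The U.S.A authorization"))   # ['usa', 'author']
print(preprocess("USA authorize"))             # ['usa', 'author']
print(preprocess("to be or not to be"))        # ['be', 'or', 'not', 'be']
```

Passing `remove_stop_words=False` keeps every word of "to be or not to be", which is the "(or not!)" case from the slide.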
Indexer Steps: Token Sequence

Sequence of (modified token, document ID) pairs
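
A minimal sketch of this step, assuming documents are held in a dictionary keyed by document ID. The two documents and the function name `token_sequence` are hypothetical, and lowercasing plus whitespace splitting stands in for the full text processing.

```python
def token_sequence(docs):
    """Return (term, doc_id) pairs in the order the tokens are read."""
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    return pairs


# Hypothetical two-document collection.
docs = {
    1: "Brutus killed Caesar",
    2: "Caesar was ambitious Caesar",
}

pairs = token_sequence(docs)
for pair in pairs:
    print(pair)
```

Note that "caesar" appears twice for document 2 at this stage; duplicates are only merged later.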

Indexer Steps: Sort
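
Only the title of this slide survives, so the following is an assumption about what it showed: in the usual sort-based construction, the pairs are sorted by term and then by document ID. The input pairs below are hypothetical.

```python
# Hypothetical (term, doc_id) pairs in reading order.
pairs = [
    ("brutus", 1), ("killed", 1), ("caesar", 1),
    ("caesar", 2), ("was", 2), ("ambitious", 2), ("caesar", 2),
]

# Tuples compare element by element, so this sorts by term first
# and by document ID second.
sorted_pairs = sorted(pairs)

for pair in sorted_pairs:
    print(pair)
```

After sorting, all occurrences of a term are adjacent, which is what makes the next step (merging and splitting) a single pass.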

Indexer Steps: Dictionary And Postings

Multiple term entries in a single document are merged

Split into Dictionary and Postings

Doc Frequency information is added
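
The three operations above can be sketched in one pass over the sorted pairs. The layout chosen here (a dictionary mapping each term to its document frequency and the position of its postings list) is an assumption for illustration, as are the name `build_index` and the sample pairs.

```python
from itertools import groupby


def build_index(sorted_pairs):
    """Build a dictionary and postings from (term, doc_id) pairs
    that are already sorted by term and then by doc_id."""
    dictionary = {}  # term -> (document frequency, index into postings)
    postings = []    # one sorted list of doc IDs per term
    for term, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        doc_ids = []
        for _, doc_id in group:
            # Merge multiple entries of the term in the same document.
            if not doc_ids or doc_ids[-1] != doc_id:
                doc_ids.append(doc_id)
        dictionary[term] = (len(doc_ids), len(postings))
        postings.append(doc_ids)
    return dictionary, postings


# Hypothetical sorted pairs.
sorted_pairs = [
    ("ambitious", 2), ("brutus", 1), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("killed", 1), ("was", 2),
]

dictionary, postings = build_index(sorted_pairs)
doc_freq, pointer = dictionary["caesar"]
print(doc_freq, postings[pointer])   # 2 [1, 2]
```

The integer stored next to the document frequency plays the role of the pointer from a dictionary entry to its postings list.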

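Where do we pay in Storage?

Terms: ~500 K

Pointers: ~500 K

The total length of the postings lists is bounded by the number of tokens in the collection. In our example, 1M documents × 1,000 words per document on average = 1 billion tokens, so there are fewer than 1 billion postings, since repeated terms within a document are merged.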
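
The bound can be checked with a few lines of arithmetic. The helper names and the two toy documents are assumptions for illustration; only the 1M documents and 1,000 words per document come from the slide.

```python
def postings_upper_bound(num_docs, avg_tokens_per_doc):
    """Every posting comes from at least one token, so the token
    count is an upper bound on the number of postings."""
    return num_docs * avg_tokens_per_doc


def count_tokens(docs):
    """Total number of word occurrences in the collection."""
    return sum(len(text.split()) for text in docs)


def count_postings(docs):
    """Number of distinct (term, document) pairs."""
    return sum(len(set(text.split())) for text in docs)


print(postings_upper_bound(1_000_000, 1_000))  # 1000000000

# Toy collection: repeated words inside a document produce one posting.
toy_docs = ["to be or not to be", "to sleep"]
print(count_tokens(toy_docs), count_postings(toy_docs))  # 8 6
```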
Efficient IR System Implementation


How do we index efficiently?

How much storage do we need?

Query Processing with an Inverted Index

Query Processing: AND

Consider processing the query:
Brutus AND Caesar

1. Locate Brutus in the Dictionary

2. Retrieve its postings

3. Locate Caesar in the Dictionary

4. Retrieve its postings

5. Merge the two postings lists (intersect the document sets).
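
The five steps can be sketched against a small in-memory index. The index contents and the name `and_query` are hypothetical, and the dictionary and postings are collapsed into one Python `dict` for brevity. Step 5 uses Python's set intersection as a stand-in; it is not the linear merge of sorted lists.

```python
def and_query(index, term1, term2):
    """Answer `term1 AND term2` over an index mapping term -> sorted doc IDs."""
    postings1 = index.get(term1, [])  # steps 1-2: locate term, retrieve postings
    postings2 = index.get(term2, [])  # steps 3-4: locate term, retrieve postings
    # Step 5: intersect the two document sets (set-based stand-in).
    return sorted(set(postings1) & set(postings2))


# Hypothetical index, not taken from the slides.
index = {
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "caesar": [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16],
}

print(and_query(index, "brutus", "caesar"))     # [2, 8]
print(and_query(index, "brutus", "calpurnia"))  # [16]
print(and_query(index, "brutus", "cleopatra"))  # []
```

A term that is missing from the dictionary yields an empty postings list, so the AND query correctly returns no documents.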

Algorithm for the merging of two postings lists

Result of the merge: 2, 8
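
The slide's own figure did not survive extraction. A common way to merge two sorted postings lists, assumed here, is to walk both lists with one pointer each and advance whichever pointer sits on the smaller document ID. The two input lists are illustrative, not recovered from the slide; they were chosen so that the result matches the 2, 8 shown above.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by increasing doc ID."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Document contains both terms: keep it, advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] cannot occur later in p2, skip it
        else:
            j += 1  # p2[j] cannot occur later in p1, skip it
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]      # illustrative postings
caesar = [1, 2, 3, 5, 8, 13, 21, 34]     # illustrative postings
print(intersect(brutus, caesar))         # [2, 8]
```

Each iteration advances at least one pointer, so for lists of lengths x and y the loop runs at most x + y times. This relies on both lists being sorted by document ID.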
