Information Retrieval: Prof. Ehab Ezzat Hassanein
Inverted Index construction
Initial stages of text processing
● Tokenization
– Cut the character sequence into word tokens
– Deal with “John’s”, a state-of-the-art solution
● Normalization
– Map text and query terms to the same form
– We want USA and U.S.A. to match
● Stemming
– We may wish different forms of a root to match
– authorize and authorization
● Stop words
– We may omit very common words (or not!)
– The, a, to, of
– Counterexample: a query for the song “To be or not to be”!
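The four stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the function names, the regular expression, the four-word stop list (taken from the slide's examples) and especially the toy suffix-stripping `stem` are assumptions made for this example. A real system would use an established stemmer such as Porter's.

```python
import re

# Illustrative stop list built from the slide's examples; real lists are longer.
STOP_WORDS = {"the", "a", "to", "of"}

def tokenize(text):
    """Cut a character sequence into word tokens, keeping internal ' . and -."""
    return re.findall(r"[A-Za-z0-9]+(?:['’.\-][A-Za-z0-9]+)*", text)

def normalize(token):
    """Map a token to a canonical form: lowercase, periods removed."""
    return token.lower().replace(".", "")

def stem(token):
    """Toy stemmer (an assumption, not Porter): strip two known suffixes."""
    for suffix in ("ization", "ize"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process(text, remove_stop_words=True):
    """Run tokenization, normalization, stemming and optional stop-word removal."""
    terms = []
    for token in tokenize(text):
        term = stem(normalize(token))
        if remove_stop_words and term in STOP_WORDS:
            continue
        terms.append(term)
    return terms
```

With this sketch, `normalize("U.S.A")` and `normalize("USA")` agree, and `stem("authorize")` and `stem("authorization")` agree. Calling `process("to be or not to be")` drops both occurrences of "to", which is exactly why stop-word removal is marked "(or not!)".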
Indexer Steps: Token Sequence
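As a rough sketch of this step, the indexer can scan every document and emit a sequence of (term, docID) pairs. The two short documents and the function name `token_sequence` are hypothetical, and tokenization is reduced to lowercasing and whitespace splitting to keep the example small.

```python
# Hypothetical two-document collection, keyed by docID.
docs = {
    1: "I did enact Julius Caesar",
    2: "The noble Brutus hath told you Caesar was ambitious",
}

def token_sequence(docs):
    """Return the list of (term, docID) pairs in document order."""
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))
    return pairs

pairs = token_sequence(docs)
```

The same term may appear in several pairs (here "caesar" occurs with docID 1 and docID 2); the later steps take care of grouping them.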
Indexer Steps: Sort
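A minimal sketch of the sort step, assuming the token sequence is held as a list of (term, docID) tuples: sorting the tuples orders them by term first and by docID second. The sample pairs and the name `sort_pairs` are made up for illustration.

```python
def sort_pairs(pairs):
    """Sort (term, docID) pairs by term, then by docID."""
    # Python compares tuples element by element, so a plain sort suffices.
    return sorted(pairs)

# Hypothetical unsorted token sequence.
pairs = [("i", 1), ("did", 1), ("caesar", 1), ("brutus", 2), ("caesar", 2), ("brutus", 1)]
sorted_pairs = sort_pairs(pairs)
```

After sorting, all pairs for the same term are adjacent, which makes the grouping into postings lists a single linear pass.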
Indexer Steps: Dictionary and Postings
● Multiple term entries in a single document are merged
● Split into Dictionary and Postings
● Doc Frequency information is added
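The three bullets above can be sketched as one pass over the sorted pairs. This is an illustrative layout only: the dictionary is modelled as a plain Python dict mapping each term to its document frequency, the postings as a dict mapping each term to a sorted list of docIDs, and the name `build_index` is an assumption.

```python
def build_index(sorted_pairs):
    """Build (dictionary, postings) from (term, docID) pairs sorted by term, docID."""
    postings = {}
    for term, doc_id in sorted_pairs:
        plist = postings.setdefault(term, [])
        # Merge multiple entries of a term in the same document.
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)
    # Document frequency = number of documents containing the term.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings

# Hypothetical sorted input; "caesar" occurs twice in document 1.
sorted_pairs = [("brutus", 1), ("brutus", 2), ("caesar", 1), ("caesar", 1), ("caesar", 2), ("did", 1)]
dictionary, postings = build_index(sorted_pairs)
```

Here the duplicate ("caesar", 1) pair collapses into a single posting, so "caesar" ends up with document frequency 2.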
Where do we pay in Storage?
● Terms ~ 500 K
● Pointers ~ 500 K
● Postings lists are bounded by the number of tokens, so in our example 1M documents * 1000 average words per document ==>> fewer than 1 billion items
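The bound can be checked with a few lines of arithmetic. The constant names are made up for this sketch; the figures are the ones from the slide.

```python
NUM_DOCS = 1_000_000        # documents in the example collection
AVG_WORDS_PER_DOC = 1_000   # average words per document
NUM_TERMS = 500_000         # distinct terms, one pointer each

# Every posting comes from at least one token, so the token count is an upper bound.
max_postings = NUM_DOCS * AVG_WORDS_PER_DOC
dictionary_items = NUM_TERMS + NUM_TERMS  # terms plus pointers

print(f"postings upper bound: {max_postings:,}")
print(f"dictionary items (terms + pointers): {dictionary_items:,}")
```

The actual number of postings is smaller than the bound because repeated occurrences of a term within one document are merged into a single posting. Either way, the postings dominate the storage cost.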
Efficient IR System Implementation
● How do we index efficiently?
● How much storage do we need?
Query Processing with an Inverted Index
Query Processing: AND
● Consider processing the query: Brutus AND Caesar
1. Locate Brutus in the Dictionary
2. Retrieve its postings
3. Locate Caesar in the Dictionary
4. Retrieve its postings
5. Merge the two postings lists (intersect the document sets)
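The five steps can be sketched as follows, assuming the index is a plain dict from term to a sorted list of docIDs. The postings values and the name `process_and_query` are illustrative assumptions. For brevity, step 5 here uses Python's built-in set intersection as a stand-in, not the linear merge of sorted lists.

```python
# Hypothetical index: term -> sorted list of docIDs.
index = {
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "caesar": [1, 2, 3, 5, 8, 13, 21, 34],
}

def process_and_query(index, term1, term2):
    """Answer 'term1 AND term2' against a dict-based inverted index."""
    # Steps 1-2: locate the first term and retrieve its postings.
    postings1 = index.get(term1.lower(), [])
    # Steps 3-4: locate the second term and retrieve its postings.
    postings2 = index.get(term2.lower(), [])
    # Step 5: intersect the document sets (set-based stand-in for the merge).
    return sorted(set(postings1) & set(postings2))

result = process_and_query(index, "Brutus", "Caesar")
```

A term that is missing from the dictionary yields an empty postings list, so the AND query returns no documents.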
Algorithm for the merging of two postings lists
Result: 2, 8
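A common way to merge two postings lists is a two-pointer walk over both lists, which requires that each list is sorted by docID. The sketch below is one such version; the function name `intersect` and the sample lists are assumptions, chosen so that the output matches the result 2, 8 shown above.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller docID
        else:
            j += 1
    return answer

# Illustrative postings lists (assumed, not taken from the slide).
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
merged = intersect(brutus, caesar)
```

Each step advances at least one pointer, so the number of comparisons is linear in the total length of the two lists. This only works because the postings are kept sorted.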