Preprocessing, Inverted Index
PART 2
Text and Web Page Pre-Processing
Text preprocessing tasks
For text documents: stopword removal, stemming, and handling of digits, hyphens,
punctuation, and letter case.
For Web pages: HTML tag removal and identification of the main content blocks.
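The tasks above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the stopword list is a tiny hypothetical sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, the regular expression assumes English text, and the names `TextExtractor`, `strip_html`, and `preprocess` are made up for this example. Identification of main content blocks is not attempted here.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "for"}


class TextExtractor(HTMLParser):
    """Collects text outside <script>/<style> elements, discarding the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def naive_stem(word):
    """Crude suffix stripping; NOT the Porter algorithm, just a stand-in."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split hyphens, drop digits/punctuation, remove stopwords, stem."""
    text = text.lower()                      # case folding
    text = text.replace("-", " ")            # treat hyphens as word breaks
    text = re.sub(r"[^a-z\s]", " ", text)    # drop digits and punctuation
    tokens = text.split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]
```

For a Web page, one would call `preprocess(strip_html(page))`. Each choice here (splitting on hyphens, discarding digits) is just one of several reasonable policies; a real system picks them to suit its collection.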
"The inverted index of a document collection is basically a data structure that attaches
each distinctive term with a list of all documents that contains the term."
Given a set of documents D = {d1, d2, …, dN}, each document has a unique identifier (ID).
Inverted Index
An inverted index consists of two parts:
1. Vocabulary V: contains all the distinct terms (ti) in the document set.
2. Postings: for each term ti, a list storing the ID (denoted by idj) of every
document dj that contains ti.
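As a minimal sketch of this two-part structure, the Python code below builds a vocabulary and posting lists for a small collection. The three documents, the whitespace tokenization, and the function names `build_inverted_index` and `search` are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each distinct term to a sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # The dictionary keys form the vocabulary V; the values are the posting lists.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search(index, query):
    """Return IDs of documents containing every query term (Boolean AND)."""
    result = None
    for term in query.lower().split():
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])


# Hypothetical three-document collection, keyed by document ID.
docs = {
    1: "information retrieval with inverted index",
    2: "index compression",
    3: "web information search",
}

index = build_inverted_index(docs)
vocabulary = list(index)

print(index["information"])                  # [1, 3]
print(search(index, "information index"))    # [1]
```

Answering a query then only requires looking up each query term in the vocabulary and intersecting the posting lists, rather than scanning every document.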
Assume that we have three documents in our database.
We obtain the final ranking of (c3, c1, c2, c4, c5, m4, m2, m3, m1).