3 Indexing
Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
–The concern here is retrieving more documents that are relevant to the
user's query
–Effectiveness of the system is measured in terms of precision and recall
–Main emphasis: stemming, stopword removal, weighting schemes, matching
algorithms
• Searching
–An online process that scans the document corpus to find the relevant
documents that match the user's query
Indexing Subsystem
[Figure: indexing pipeline]
documents -> assign document identifier -> document IDs
documents -> tokenization -> tokens
tokens -> stopword removal -> non-stoplist tokens
non-stoplist tokens -> stemming & normalization -> stemmed terms
stemmed terms -> term weighting -> weighted index terms -> index file
Searching Subsystem
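[Figure: searching pipeline]
query -> parse query -> query tokens
query tokens -> stopword removal -> non-stoplist tokens
non-stoplist tokens -> stemming & normalization -> stemmed terms
stemmed terms -> query term weighting -> query terms
query terms + index terms (from the index) -> similarity measure -> relevant document set
relevant document set -> ranking -> ranked document set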
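The searching steps can be illustrated with a minimal Python sketch. It assumes a tiny in-memory index (term -> {document ID: TF}) using values from the Caesar example in these slides. The stoplist, the stem table, and the scoring rule (sum of TF over matching query terms) are hypothetical stand-ins, since the slides do not fix a particular similarity measure.

```python
# Toy index: term -> {doc_id: term frequency}; values follow the Caesar example.
INDEX = {
    "brutus": {1: 1, 2: 1},
    "caesar": {1: 1, 2: 2},
    "kill": {1: 2},
    "noble": {2: 1},
}
STOPWORDS = {"who", "the", "was"}   # hypothetical stoplist
STEMS = {"killed": "kill"}          # hypothetical stem lookup, not a real stemmer


def query_terms(query):
    """Apply the same text operations to the query as to the documents."""
    tokens = query.lower().split()
    return [STEMS.get(t, t) for t in tokens if t not in STOPWORDS]


def search(query):
    """Score each document by summed TF of matching terms, then rank."""
    scores = {}
    for term in query_terms(query):
        for doc_id, tf in INDEX.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + tf
    # Highest score first; ties broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = search("who killed caesar")
```

Here `ranked` is a list of (document ID, score) pairs. A real system would replace the summed-TF score with a proper similarity measure over weighted terms.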
Basic assertion
Indexing and searching are inexorably connected
– you cannot search what was not first indexed in some manner or other
– documents or objects are indexed in order to be searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing language
[Figure: the indexer produces the index file (inverted file); sample entries:]
friend       2   4
roman        1   2
countryman  13  16
Building Index file
•An index file for a document collection is a file consisting of a list of
index terms, each with a link to one or more documents that contain the term
–A good index file maps each keyword Ki to the set of documents Di that
contain the keyword
• Space overhead
–Computer storage space consumed by the index file.
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are removed and
normalization and stemming are applied to generate index terms
• These index terms are stored in the sequential file, sorted in
alphabetical order

Tokens in order of occurrence     Sequential file
Term       Doc #                  No.  Term      Doc #
I          1                      1    ambition  2
did        1                      2    brutus    1
enact      1                      3    brutus    2
julius     1                      4    caesar    1
caesar     1                      5    caesar    2
I          1                      6    caesar    2
was        1                      7    capitol   1
killed     1                      8    enact     1
I          1                      9    julius    1
the        1                      10   kill      1
capitol    1                      11   kill      1
brutus     1                      12   noble     2
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2
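The steps above (tokenize, remove stopwords, stem, sort) can be sketched in Python as follows. This is only an illustration: the Doc 1 text is reconstructed from the token list, and the stoplist and stem table are hypothetical, hand-picked so that the output matches the terms shown on the slide. A real system would use a standard stoplist and a stemming algorithm.

```python
import re

DOCS = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}
# Hypothetical stoplist and stem lookup, chosen to reproduce the slide's terms.
STOPWORDS = {"i", "did", "was", "the", "me", "so", "let", "it", "be",
             "with", "hath", "told", "you"}
STEMS = {"killed": "kill", "ambitious": "ambition"}


def index_terms(text):
    """Tokenize, lowercase, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [STEMS.get(t, t) for t in tokens if t not in STOPWORDS]


def build_sequential_file(docs):
    """Return (term, doc_id) records sorted alphabetically by term."""
    records = [(term, doc_id)
               for doc_id, text in docs.items()
               for term in index_terms(text)]
    return sorted(records)


records = build_sequential_file(DOCS)
for number, (term, doc_id) in enumerate(records, start=1):
    print(number, term, doc_id)
```

Each printed line corresponds to one record of the sequential file: record number, term, and document ID.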
Sequential File
• Its main advantages are:
– easy to implement;
– provides fast access to the next record using lexicographic
order.
• Its disadvantages:
– difficult to update. Index must be rebuilt if a new term is
added. Inserting a new record may require moving a large
proportion of the file;
– random access is extremely slow.
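A small Python sketch can make the update cost concrete. It models the sequential file as a sorted in-memory list, which is a simplifying assumption (a real sequential file lives on disk); the helper names are hypothetical. Moving to the next record is a simple step, while inserting a record shifts every record that follows it.

```python
import bisect

# A sorted sequential file of (term, doc_id) records, modelled as a list.
records = [("ambition", 2), ("brutus", 1), ("brutus", 2),
           ("caesar", 1), ("kill", 1), ("noble", 2)]


def next_record(records, i):
    """Access to the next record in lexicographic order is a single step."""
    return records[i + 1] if i + 1 < len(records) else None


def insert_record(records, record):
    """Insert in sorted position; report how many records had to move."""
    pos = bisect.bisect_left(records, record)
    moved = len(records) - pos
    records.insert(pos, record)
    return pos, moved


pos, moved = insert_record(records, ("capitol", 1))
```

In this run the new record lands at position `pos` and `moved` later records shift by one place; for a term near the start of a large file, nearly the whole file would move.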
Inverted file
• A word-oriented indexing mechanism based on a sorted list of
keywords, with each keyword having links to the documents
containing it.
• Content of the inverted file:
–Data to be held in the inverted file includes:
• The vocabulary (list of terms)
• The occurrences (location and frequency of terms in the
document collection)
Inverted file
• The occurrences: contain one record per term, listing
–Frequency of each term in a document, i.e. the number of
occurrences of the keyword in a document
• TFij, number of occurrences of term tj in document di
• DFj, number of documents containing tj
• maxi, maximum frequency of any term in di
• N, total number of documents in the collection
• CFj, collection frequency of tj, i.e. the total number of occurrences
of tj in the collection
• ….
•Why location?
– Having information about the location of each term
within the document helps to:
•highlight the location of the search term
•Why frequencies?
– Having information about frequency helps with:
•calculating term weights (like TF, TF*IDF, …)
•optimizing query processing
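To show how the stored frequencies feed term weighting, here is a minimal Python sketch of one common TF*IDF variant: TF normalized by the maximum term frequency in the document, multiplied by log2(N / DF). The exact formula is an assumption, since the slides name TF*IDF without fixing a variant, and the function name is hypothetical.

```python
import math


def tf_idf(tf, max_tf, df, n_docs):
    """Weight of a term in a document.

    tf      -- TFij, occurrences of the term in the document
    max_tf  -- maxi, highest frequency of any term in that document
    df      -- DFj, number of documents containing the term
    n_docs  -- N, total number of documents in the collection
    """
    return (tf / max_tf) * math.log2(n_docs / df)


# A term that occurs in only one of two documents gets a positive weight...
rare = tf_idf(tf=2, max_tf=2, df=1, n_docs=2)
# ...while a term that occurs in every document gets weight zero.
common = tf_idf(tf=1, max_tf=2, df=2, n_docs=2)
```

Under this variant a term found in every document carries no discriminating weight, which is why DF and N are kept alongside TF.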
Inverted File
Documents are organized by the terms/words they contain. This is called an
index file. Text operations are performed before building the index.

Term   CF  Document ID  TF  Location
auto   3   2            1   66
           19           1   213
           29           1   45
bus    4   3            1   94
           19           2   7, 212
           22           1   56
taxi   1   5            1   43
train  3   11           2   3, 70
           34           1   40
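The table above can be held in a nested mapping, sketched here in Python. The layout (term -> document ID -> list of locations) and the helper names are assumptions for illustration; TF, DF and CF are derived from the stored locations rather than stored separately.

```python
# term -> {doc_id: [locations of the term in that document]}
INDEX = {
    "auto": {2: [66], 19: [213], 29: [45]},
    "bus": {3: [94], 19: [7, 212], 22: [56]},
    "taxi": {5: [43]},
    "train": {11: [3, 70], 34: [40]},
}


def term_frequency(term, doc_id):
    """TF: number of occurrences of the term in one document."""
    return len(INDEX.get(term, {}).get(doc_id, []))


def document_frequency(term):
    """DF: number of documents that contain the term."""
    return len(INDEX.get(term, {}))


def collection_frequency(term):
    """CF: total occurrences of the term across the collection."""
    return sum(len(locations) for locations in INDEX.get(term, {}).values())


bus_cf = collection_frequency("bus")
bus_tf_in_19 = term_frequency("bus", 19)
```

With this data `bus_cf` agrees with the CF column of the table, and `bus_tf_in_19` with the TF entry for document 19.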
Construction of Inverted file
• An inverted index consists of two files:
–Vocabulary file
–Posting file
Advantage of dividing the inverted file:
• Keeping a pointer in the vocabulary to the list in
the posting file allows:
– the vocabulary to be kept in memory at search
time, even for a large text collection, and
– the posting file to be kept on disk and accessed only to
retrieve the documents for a term
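The division can be sketched in Python with the vocabulary held in a dictionary and the postings written to a binary file. The record layout (two unsigned 32-bit integers per posting), the file name, and the function names are all assumptions made for this illustration.

```python
import os
import struct
import tempfile

POSTING = struct.Struct("<II")  # one posting = (doc_id, tf), 8 bytes


def write_index(postings_by_term, path):
    """Write postings to disk; return the in-memory vocabulary.

    The vocabulary maps term -> (byte offset in posting file, DF).
    """
    vocabulary = {}
    with open(path, "wb") as f:
        for term in sorted(postings_by_term):
            postings = postings_by_term[term]
            vocabulary[term] = (f.tell(), len(postings))
            for doc_id, tf in postings:
                f.write(POSTING.pack(doc_id, tf))
    return vocabulary


def read_postings(vocabulary, path, term):
    """Follow the pointer for one term and read only its postings."""
    if term not in vocabulary:
        return []
    offset, df = vocabulary[term]
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(df * POSTING.size)
    return [POSTING.unpack_from(data, i * POSTING.size) for i in range(df)]


postings_by_term = {"caesar": [(1, 1), (2, 2)], "kill": [(1, 2)]}
path = os.path.join(tempfile.mkdtemp(), "postings.bin")
vocabulary = write_index(postings_by_term, path)
caesar_postings = read_postings(vocabulary, path, "caesar")
```

Only the small `vocabulary` dictionary stays in memory; a lookup seeks directly to the term's list on disk instead of scanning the whole posting file.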
Vocabulary file
• A vocabulary file (Word list):
–stores all of the distinct terms (keywords) that appear in any
of the documents (in lexicographical order) and
–for each word, a pointer to the posting file
• The record kept for each term j in the word list contains the
following:
–term j
–number of documents in which term j occurs (DFj)
–Total frequency of term j (CFj)
–pointer to postings (inverted) list for term j
Postings File (Inverted List)
• For each distinct term in the vocabulary, stores
a list of pointers to the documents that contain
that term.
• Each element in an inverted list is called a
posting, i.e., the occurrence of a term in a
document
• It is stored as a separate inverted list for each
term, i.e., one list corresponding to each term
in the index file.
– Each list consists of one or many individual
postings, each holding the document ID, TF and location
information for the given term
Organization of Index File
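[Figure: the vocabulary (word list) points to the postings (inverted lists), which point to the actual documents]

Vocabulary (word list)
Term   No. of docs  Total freq  Pointer to posting
Act    3            3           -> inverted list
Bus    3            4           -> inverted list
pen    1            1           -> inverted list
total  2            3           -> inverted list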
Example:
• The documents in a collection are parsed to extract
words, and each word is saved together with its
document ID.
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by
terms

Tokens in order of occurrence     Sorted by term
Term       Doc #                  Term       Doc #
I          1                      ambitious  2
did        1                      be         2
enact      1                      brutus     1
julius     1                      brutus     2
caesar     1                      caesar     1
I          1                      caesar     2
was        1                      caesar     2
killed     1                      capitol    1
I          1                      did        1
the        1                      enact      1
capitol    1                      hath       2
brutus     1                      I          1
killed     1                      I          1
me         1                      I          1
so         2                      it         2
let        2                      julius     1
it         2                      killed     1
be         2                      killed     1
with       2                      let        2
caesar     2                      me         1
the        2                      noble      2
noble      2                      so         2
brutus     2                      the        1
hath       2                      the        2
told       2                      told       2
you        2                      was        1
caesar     2                      was        2
was        2                      with       2
ambitious  2                      you        2
Remove stopwords, apply stemming &
compute term frequency
•Multiple entries for a term in a single document are merged and
frequency information is added
•Counting the number of occurrences of each term in a document gives its TF

Sorted index terms        With term frequency
Term      Doc #           Term      Doc #  TF
ambition  2               ambition  2      1
brutus    1               brutus    1      1
brutus    2               brutus    2      1
caesar    1               caesar    1      1
caesar    2               caesar    2      2
caesar    2               capitol   1      1
capitol   1               enact     1      1
enact     1               julius    1      1
julius    1               kill      1      2
kill      1               noble     2      1
kill      1
noble     2
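The merging step can be sketched in Python with `collections.Counter`: identical (term, document) records collapse into one entry whose count is the TF. The record list below is copied from the stemmed, stopword-free terms of the example; the variable names are illustrative.

```python
from collections import Counter

# Sorted (term, doc_id) records after stopword removal and stemming.
records = [
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
    ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2),
]

# Counting identical records merges duplicates and yields TF per (term, doc).
tf = Counter(records)
tf_table = sorted((term, doc_id, count) for (term, doc_id), count in tf.items())

for term, doc_id, count in tf_table:
    print(term, doc_id, count)
```

The twelve input records shrink to ten rows, because "caesar" in Doc 2 and "kill" in Doc 1 each occur twice.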
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) file and a posting
file. Each vocabulary entry holds a pointer to its list in the posting file.

Term, Doc #, TF records
Term      Doc #  TF
ambition  2      1
brutus    1      1
brutus    2      1
caesar    1      1
caesar    2      2
capitol   1      1
enact     1      1
julius    1      1
kill      1      2
noble     2      1

Vocabulary               Posting (Doc #, TF)
Term      DF  CF
ambition  1   1    ->    (2, 1)
brutus    2   2    ->    (1, 1), (2, 1)
caesar    2   3    ->    (1, 1), (2, 2)
capitol   1   1    ->    (1, 1)
enact     1   1    ->    (1, 1)
julius    1   1    ->    (1, 1)
kill      1   2    ->    (1, 2)
noble     1   1    ->    (2, 1)
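As a rough illustration of the split, the Python sketch below derives DF, CF and a pointer for each term from the (term, Doc #, TF) records. It assumes the records are already sorted by term, and it models the pointer as a position in a flat postings list; the field and function names are hypothetical.

```python
# (term, doc_id, tf) records, sorted by term, from the example.
tf_table = [
    ("ambition", 2, 1), ("brutus", 1, 1), ("brutus", 2, 1),
    ("caesar", 1, 1), ("caesar", 2, 2), ("capitol", 1, 1),
    ("enact", 1, 1), ("julius", 1, 1), ("kill", 1, 2), ("noble", 2, 1),
]


def split_index(tf_table):
    """Split sorted records into a vocabulary and a flat postings list."""
    vocabulary = {}
    postings = []
    for term, doc_id, tf in tf_table:
        if term not in vocabulary:
            # The pointer is where this term's postings start.
            vocabulary[term] = {"df": 0, "cf": 0, "pointer": len(postings)}
        entry = vocabulary[term]
        entry["df"] += 1      # one more document contains the term
        entry["cf"] += tf     # add this document's occurrences
        postings.append((doc_id, tf))
    return vocabulary, postings


def postings_for(term, vocabulary, postings):
    """Follow the pointer and read DF postings for the term."""
    entry = vocabulary.get(term)
    if entry is None:
        return []
    start = entry["pointer"]
    return postings[start:start + entry["df"]]


vocabulary, postings = split_index(tf_table)
caesar_entry = vocabulary["caesar"]
caesar_postings = postings_for("caesar", vocabulary, postings)
```

Because DF is stored next to the pointer, a lookup knows exactly how many postings to read without scanning for the start of the next term.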
Exercises
1) Construct the inverted index for the following
document collection.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise