Indexing Structure: Chapter Four
Indexing: Basic Concepts
Indexing is an arrangement of index terms that permits fast
searching and reduces memory space requirements.
It is used to speed up access to desired information in a
document collection in response to a user's query.
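[Figure: indexing pipeline. Documents to be indexed ("Friends, Romans, countrymen.") are tokenized into a token stream (Friends, Romans, countrymen), whose index terms are stored in an inverted file with postings: friend: 2, 4; roman: 1, 2; countryman: 13, 16.]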
Index file Evaluation Metrics
Running time of the main operations
Access/search time
How long does it take to find the required search
key in the list?
Update time (Insertion time, Deletion time)
How long does it take to update existing records when
adding new terms or deleting unnecessary
terms?
Does the indexing structure allow incremental update, or
does it require re-indexing?
Space overhead
Computer storage space consumed for keeping the list.
Building Index file
An index file of a document is a file consisting of a list of
index terms and a link to one or more documents that contain
the index term.
An index file is a list of search terms that are organized
for associative lookup, i.e., to answer a user’s query.
There are various options for organizing the index file of a
collection of documents.
Decide what data structure and/or file structure to use:
a sequential file, an inverted file, a suffix tree, etc.
Sequential File
A sequential file is the most primitive file structure.
• It has neither a vocabulary nor linking pointers.
The records are generally arranged serially, one after
another, in lexicographic order on the value of some
key field.
• A particular attribute is chosen as the primary key, whose
value determines the order of the records.
• When the first key fails to discriminate among records, a
second key is chosen to order them.
Example:
Given a collection of documents, they are parsed to
extract words and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Terms in order of occurrence, per document:
Doc 1: I, did, enact, julius, caesar, I, was, killed, I, the, capitol, brutus, killed, me
Doc 2: so, let, it, be, with, caesar, the, noble, brutus, hath, told, you, caesar, was, ambitious

Sorting the Vocabulary
After all documents have been tokenized, stop words are
removed, and normalization and stemming are applied to
generate index terms.
These index terms are stored in the sequential file, sorted in
alphabetical order:

No.  Term      Doc
1    ambition  2
2    brutus    1
3    brutus    2
4    caesar    1
5    caesar    2
6    caesar    2
7    capitol   1
8    enact     1
9    julius    1
10   kill      1
11   kill      1
12   noble     2
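The steps of this example (tokenize, remove stop words, stem, sort) can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the stop-word set and the stem table are hypothetical and were chosen only so that the output reproduces the twelve records of the example, and the text of Doc 1 is an assumption reconstructed from the token list.

```python
import re

# Hypothetical stop list and stem table, chosen only to reproduce the example.
STOP_WORDS = {"i", "did", "was", "the", "me", "so", "let", "it", "be",
              "with", "hath", "told", "you"}
STEMS = {"killed": "kill", "ambitious": "ambition"}

def index_terms(text):
    """Tokenize, drop stop words, and apply the toy stem table."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [STEMS.get(t, t) for t in tokens if t not in STOP_WORDS]

def build_sequential_file(docs):
    """Return (term, doc_id) records sorted by term, then by doc_id."""
    records = [(term, doc_id)
               for doc_id, text in docs.items()
               for term in index_terms(text)]
    return sorted(records)

docs = {
    # Doc 1 text is an assumption reconstructed from the token list.
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}
sequential_file = build_sequential_file(docs)
for number, (term, doc_id) in enumerate(sequential_file, start=1):
    print(number, term, doc_id)
```

Sorting tuples orders the records by term first and by document ID second, which matches the idea of a primary key with a second key to break ties.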
To access a record, search serially:
start at the first record, then read and examine each
succeeding record until the required record is found or the end
of the file is reached.
Its main advantages:
Easy to implement
Provides fast access to the next record using lexicographic
order.
Can be searched quickly, using binary search.
Its disadvantages:
No weights attached to terms.
Random access is slow, since similar terms are indexed
individually.
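The two access methods mentioned above, a serial scan and a binary search over the sorted records, can be compared with a short Python sketch. The record list and the function names are illustrative assumptions; the binary search relies on the standard library's `bisect_left`.

```python
from bisect import bisect_left

# Records of a sequential file, sorted by term and then by document ID.
records = [("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
           ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
           ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2)]

def serial_search(records, term):
    """Scan from the first record; return the index of the first match or -1."""
    for position, (record_term, _doc_id) in enumerate(records):
        if record_term == term:
            return position
    return -1

def binary_search(records, term):
    """Use the sort order to find the first match in O(log n) comparisons."""
    # The 1-tuple (term,) sorts just before every (term, doc_id) record.
    position = bisect_left(records, (term,))
    if position < len(records) and records[position][0] == term:
        return position
    return -1

print(serial_search(records, "kill"), binary_search(records, "kill"))
```

Both functions return the same position; the serial scan examines up to n records, while the binary search only works because the file is kept in lexicographic order.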
Inverted file
A word-oriented indexing mechanism based on a sorted list of
keywords, with each keyword having links to the documents
containing it.
Building and maintaining an inverted index is a relatively low-cost
task.
On a text of n words, an inverted index can be built in O(n) time.
The list is inverted from a list of terms in location order to a list of
terms in alphabetical order.
[Figure: the original documents, identified by document IDs, are inverted into inverted files that list the documents for each word:
•W1: d1, d2, d3
•W2: d2, d4, d7, d9
•…
•Wn: di, …, dn]
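The inversion described above, from terms in location order to an alphabetical list of terms with their documents, can be sketched in Python. This is an illustrative sketch under assumptions: the documents are already reduced to index terms, and `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the ascending list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):              # visit documents in ID order
        for term in docs[doc_id]:            # one step per word occurrence
            postings = index[term]
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)      # record each document only once
    # Sorting the vocabulary gives the alphabetical order of the inverted file.
    return dict(sorted(index.items()))

docs = {
    1: ["enact", "julius", "caesar", "kill", "capitol", "brutus", "kill"],
    2: ["caesar", "noble", "brutus", "caesar", "ambition"],
}
inverted = build_inverted_index(docs)
for term, doc_ids in inverted.items():
    print(term, doc_ids)
```

The single pass over the words does a constant expected amount of work per word with a hash table, which is in line with the O(n) bound; the final sort of the vocabulary is an extra step whose cost depends on the number of distinct terms rather than on n.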
Data to be held in the inverted file includes:
The vocabulary (list of terms): the set of all distinct
words (index terms) in the text collection.
Location: all the text locations/positions where the word
occurs.
Frequency of occurrence of terms in a document
collection:
TFij, number of occurrences of term tj in document di
DFj, number of documents containing tj
TCF, total frequency of tj in the corpus
mi, maximum frequency of any term in di
n, total number of documents in the collection
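As a worked illustration of these statistics, the following Python sketch computes them for a small collection. The two documents are assumed to be already reduced to index terms, and the variable names (`tf`, `df`, `cf`, `max_tf`, `n`) are illustrative choices, not notation from the slides.

```python
from collections import Counter

# Two documents, assumed to be already reduced to index terms.
docs = {
    1: ["enact", "julius", "caesar", "kill", "capitol", "brutus", "kill"],
    2: ["caesar", "noble", "brutus", "caesar", "ambition"],
}

# TFij: occurrences of term tj in document di.
tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
# DFj: number of documents containing tj.
df = Counter(term for counts in tf.values() for term in counts)
# Total frequency of tj in the whole corpus.
cf = Counter(term for terms in docs.values() for term in terms)
# mi: maximum frequency of any term in di.
max_tf = {doc_id: max(counts.values()) for doc_id, counts in tf.items()}
# n: total number of documents in the collection.
n = len(docs)

print(tf[2]["caesar"], df["caesar"], cf["caesar"], max_tf, n)
```

For the term caesar this gives a TF of 2 in document 2, a DF of 2, and a corpus frequency of 3, which is the sum of its TF values over all documents.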
The record kept for each term j in the word list contains the following:
term j
Number of documents in which term j occurs (DFj)
Collection frequency of term j
Pointer to the inverted (postings) list for term j
Postings File (Inverted List)
For each distinct term in the vocabulary, it stores a list of pointers to
the documents that contain that term.
Each element in an inverted list is called a posting, i.e., the
occurrence of a term in a document.
It is stored as a separate inverted list for each column, i.e., a list
corresponding to each term in the index file.
Each list consists of one or more individual postings.
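[Figure: the vocabulary (word list) holds, for each term, a pointer to its postings (inverted list), which refers to the documents.]

Term     DF   TF   Pointer to posting
term 1   3    3    -> inverted list
term 2   3    4    -> inverted list
term 3   1    1    -> inverted list
term 4   2    3    -> inverted list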
Example:
Given a collection of documents, they are parsed to extract
words and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Sorting the Vocabulary
After all documents have been tokenized, the inverted file is
sorted by terms.

In order of occurrence      Sorted by term
Term       Doc #            Term       Doc #
I          1                ambitious  2
did        1                be         2
enact      1                brutus     1
julius     1                brutus     2
caesar     1                caesar     1
I          1                caesar     2
was        1                caesar     2
killed     1                capitol    1
I          1                did        1
the        1                enact      1
capitol    1                hath       2
brutus     1                I          1
killed     1                I          1
me         1                I          1
so         2                it         2
let        2                julius     1
it         2                killed     1
be         2                killed     1
with       2                let        2
caesar     2                me         1
the        2                noble      2
noble      2                so         2
brutus     2                the        1
hath       2                the        2
told       2                told       2
you        2                was        1
caesar     2                was        2
was        2                with       2
ambitious  2                you        2
Remove stop words, stemming & compute
frequency
Multiple term entries in a single document are merged and
frequency information is added.
Counting the number of occurrences of terms in the collection
helps to compute TF.

Before merging        After merging
Term      Doc #       Term      Doc #  TF
ambition  2           ambition  2      1
brutus    1           brutus    1      1
brutus    2           brutus    2      1
caesar    1           caesar    1      1
caesar    2           caesar    2      2
caesar    2           capitol   1      1
capitol   1           enact     1      1
enact     1           julius    1      1
julius    1           kill      1      2
kill      1           noble     2      1
kill      1
noble     2
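The merging step can be sketched in Python with `itertools.groupby`, which groups runs of equal consecutive items. This is a minimal sketch assuming the (term, doc) records are already sorted, as in a sorted index file; `merge_records` is a hypothetical helper name.

```python
from itertools import groupby

# Sorted (term, doc_id) records after stop-word removal and stemming.
records = [("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
           ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
           ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2)]

def merge_records(sorted_records):
    """Collapse runs of identical (term, doc_id) records into (term, doc_id, tf)."""
    return [(term, doc_id, len(list(group)))
            for (term, doc_id), group in groupby(sorted_records)]

merged = merge_records(records)
for term, doc_id, tf in merged:
    print(term, doc_id, tf)
```

Because equal records are adjacent after sorting, one pass is enough: the twelve records collapse to ten entries, with a TF of 2 for caesar in document 2 and for kill in document 1.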
Vocabulary and postings file
The file is commonly split into a Dictionary and a Posting file.

Merged entries:
Term      Doc #  TF
ambition  2      1
brutus    1      1
brutus    2      1
caesar    1      1
caesar    2      2
capitol   1      1
enact     1      1
julius    1      1
kill      1      2
noble     2      1

Vocabulary                  Posting
Term      DF  CF  Pointer   Doc #  TF
ambition  1   1   ->        2      1
brutus    2   2   ->        1      1
                            2      1
caesar    2   3   ->        1      1
                            2      2
capitol   1   1   ->        1      1
enact     1   1   ->        1      1
julius    1   1   ->        1      1
kill      1   2   ->        1      2
noble     1   1   ->        2      1
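The split into a dictionary and a posting file can be sketched in Python as follows. The layout is an assumption made for illustration: the vocabulary is a dict whose entries hold DF, CF, and a pointer, and the pointer is simply an offset into one flat postings list; `split_index` is a hypothetical helper name.

```python
# Merged (term, doc_id, tf) entries, sorted by term and then by doc_id.
merged = [("ambition", 2, 1), ("brutus", 1, 1), ("brutus", 2, 1),
          ("caesar", 1, 1), ("caesar", 2, 2), ("capitol", 1, 1),
          ("enact", 1, 1), ("julius", 1, 1), ("kill", 1, 2), ("noble", 2, 1)]

def split_index(merged):
    """Return (vocabulary, postings) built from sorted (term, doc_id, tf) entries."""
    vocabulary = {}   # term -> {"df": ..., "cf": ..., "pointer": ...}
    postings = []     # flat list of (doc_id, tf), grouped by term
    for term, doc_id, tf in merged:
        # The pointer is fixed when the term is first seen.
        entry = vocabulary.setdefault(
            term, {"df": 0, "cf": 0, "pointer": len(postings)})
        entry["df"] += 1      # one more document contains the term
        entry["cf"] += tf     # add this document's occurrences
        postings.append((doc_id, tf))
    return vocabulary, postings

vocabulary, postings = split_index(merged)
entry = vocabulary["caesar"]
print(entry, postings[entry["pointer"]:entry["pointer"] + entry["df"]])
```

Since the postings of one term are contiguous, the pointer and DF together are enough to locate the whole inverted list of that term.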
Searching on Inverted File
Since the whole index file is divided into two parts, searching can be
done faster by loading only the vocabulary list, which takes little memory
even for a large document collection.
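A lookup over the split index can be sketched in Python as follows. This is a sketch under assumptions: the vocabulary is held in memory as a dict mapping each term to a (DF, pointer) pair, an in-memory list stands in for the postings file, and `search` is a hypothetical function name.

```python
# In-memory vocabulary: term -> (DF, pointer into the postings list).
vocabulary = {
    "ambition": (1, 0), "brutus": (2, 1), "caesar": (2, 3), "capitol": (1, 5),
    "enact": (1, 6), "julius": (1, 7), "kill": (1, 8), "noble": (1, 9),
}
# Stand-in for the postings file: (doc_id, tf) entries grouped by term.
postings = [(2, 1), (1, 1), (2, 1), (1, 1), (2, 2),
            (1, 1), (1, 1), (1, 1), (1, 2), (2, 1)]

def search(term, vocabulary, postings):
    """Look the term up in the vocabulary, then follow its pointer."""
    entry = vocabulary.get(term)
    if entry is None:
        return []                      # term is not in the collection
    df, pointer = entry
    return postings[pointer:pointer + df]

print(search("caesar", vocabulary, postings))
print(search("rome", vocabulary, postings))
```

Only the vocabulary is consulted to decide whether a term exists; the postings are touched just for the DF entries that the pointer identifies.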