IR Chapter Three
Indexing structure
Indexing: Basic Concepts
• Indexing is an arrangement of index terms that permits fast searching and
reduces memory space requirements. It is an offline process of organizing
documents using keywords extracted from the collection.
• It is used to speed up access to desired information in a document collection, as per
the user's query, such that
– It enhances efficiency in terms of retrieval time.
o Relevant documents are searched and retrieved quickly.
– The index file usually has index terms in sorted order.
Example: the author catalog in a library
• Example: Consider the following list of terms:
A: fox pig zebra hen ant cat dog lion ox
B: ant cat dog fox hen lion ox pig zebra
Which list is easier to search? A or B?
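The advantage of the sorted list B can be sketched in a few lines of Python. This is only an illustration: the helper names `linear_search` and `binary_search` are hypothetical, and the sketch relies on the standard `bisect` module.

```python
from bisect import bisect_left

list_a = ["fox", "pig", "zebra", "hen", "ant", "cat", "dog", "lion", "ox"]
list_b = sorted(list_a)  # same terms as list B, in alphabetical order

def linear_search(terms, key):
    """Scan an unsorted list term by term: O(n) comparisons."""
    for i, term in enumerate(terms):
        if term == key:
            return i
    return -1

def binary_search(sorted_terms, key):
    """Halve the search range at each step on a sorted list: O(log n) comparisons."""
    i = bisect_left(sorted_terms, key)
    if i < len(sorted_terms) and sorted_terms[i] == key:
        return i
    return -1

print(linear_search(list_a, "lion"))  # 7
print(binary_search(list_b, "lion"))  # 5
```

With nine terms the difference is negligible, but for a vocabulary of a million terms binary search needs about 20 comparisons instead of up to a million.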
Cont….
• An index file consists of records, called index entries.
• Index files are much smaller than the original file.
– The size is reduced by linguistic pre-processing (or text
operations).
Tokenizer → token stream: Friends Romans countrymen
Inverted file:
friend 2 4
roman 1 2
countryman 13 16
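The path from raw text to index terms can be sketched as follows. This is a minimal illustration, not a real linguistic module: `NORMAL_FORMS` is a hypothetical hand-written table standing in for a stemmer or lemmatizer, and `tokenize` and `normalize` are names chosen for this sketch.

```python
import re

# Hypothetical normalisation table; a real system would use a stemmer
# or lemmatizer here instead of a hand-written dictionary.
NORMAL_FORMS = {
    "friends": "friend",
    "romans": "roman",
    "countrymen": "countryman",
}

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def normalize(tokens):
    """Map each token to its index term; unknown tokens are kept unchanged."""
    return [NORMAL_FORMS.get(tok, tok) for tok in tokens]

tokens = tokenize("Friends, Romans, countrymen")
terms = normalize(tokens)
print(tokens)  # ['friends', 'romans', 'countrymen']
print(terms)   # ['friend', 'roman', 'countryman']
```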
Indexing structure By: Abenet A.
Building Index file
• An index file is a file consisting of a list of index terms, each with
a link to one or more documents that contain the index term.
Example:
• There are various options for organizing the index file of a collection
of documents:
• Indexing time
• How long does it take to find the required search key in the list?
– How long does it take to update existing records when adding new
terms or deleting unnecessary existing terms?
3) Space overhead
• The records kept for each term j in the word list contain the following:
– term j
Cont…
Example:
Inverted index storage
• Separating the inverted file into a vocabulary and a posting
file is a good idea.
• Vocabulary: For searching purposes we need only the word list. This
allows the vocabulary to be kept in memory at search time, since
the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: from 1,000,000,000 documents, there may be 1,000,000 distinct
words. Hence, the size of the index is 100 MB, which can easily be held in
the memory of a dedicated computer.
• The posting file requires much more space.
• For each word appearing in the text we keep statistical information
related to the word's occurrence in documents.
• The pointers from postings to documents require extra space of O(n).
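One way to picture the separation into a vocabulary and a posting file is the sketch below. It is an assumption-laden illustration: the posting file is modelled as a flat in-memory list, the sample `index` data and the `lookup` helper are hypothetical, and a real system would keep the postings on disk.

```python
# Hypothetical sample data: term -> list of document IDs.
index = {"brutus": [1, 2], "caesar": [1, 2], "noble": [2]}

postings = []    # posting file: document IDs, grouped term by term
vocabulary = {}  # vocabulary: term -> (offset into postings, number of postings)

for term in sorted(index):
    vocabulary[term] = (len(postings), len(index[term]))
    postings.extend(index[term])

def lookup(term):
    """Find the term in the small vocabulary, then read only its slice of postings."""
    if term not in vocabulary:
        return []
    offset, count = vocabulary[term]
    return postings[offset:offset + count]

print(vocabulary)       # {'brutus': (0, 2), 'caesar': (2, 2), 'noble': (4, 1)}
print(lookup("noble"))  # [2]
```

Only the vocabulary has to stay in memory; each lookup touches a single contiguous slice of the much larger posting file.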
• How can access to the inverted file be sped up?
Example: Inverted file
• The documents in a collection are parsed to
extract words, and these are saved with the document
ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus has told you Caesar was ambitious
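Parsing the two documents into (term, document ID) pairs and grouping them by term can be sketched as below. This is a minimal illustration: the function name `build_inverted_index` is hypothetical, and tokenization is reduced to lowercasing and splitting on non-letters.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus has told you Caesar was ambitious",
}

def build_inverted_index(documents):
    """Map each lowercase term to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    # Sort the terms so the vocabulary is in alphabetical order.
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index(docs)
print(index["caesar"])  # [1, 2]
print(index["killed"])  # [1]
print(index["noble"])   # [2]
```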
Example(Cont…)
Steps to build Inverted Files
• To build inverted files, follow these steps:
Complexity Analysis
The inverted index can be built in O(n) + O(n log n) time, where
Doc 2 : The Department launched its first BSc in Computer Studies in 2002.
Doc 3 : Followed by the MSc in Information Technology which was started in 2003.
Doc 4 : The Department also produced its first MSc graduate in 2005.
Doc 5: Our staff have contributed intellectually and professionally to the
TEXT: G O O G O L $
POSITION: 1 2 3 4 5 6 7
Suffix tree (Cont…)
• We store pointers rather than words
in the leaves.
Example:
(1,2) for GO
(7,7) for $, as shown in the figure.
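The pointer pairs can be read as 1-based, inclusive (start, end) positions into the text GOOGOL$. A small sketch, with `label` as a hypothetical helper name, shows how a label is recovered from a pointer pair without storing the word itself:

```python
text = "GOOGOL$"

def label(start, end):
    """Return the substring for a 1-based, inclusive (start, end) pointer pair."""
    return text[start - 1:end]

print(label(1, 2))  # GO
print(label(7, 7))  # $
```

Each pointer pair takes constant space regardless of how long the label is, which is why pointers are stored instead of words.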
Example: Suffix tree
Let s = abab. A suffix tree of s is a compressed trie of all suffixes
of s$ = abab$.
• We label each leaf with the starting point of the corresponding suffix.
Suffixes of abab$ and their starting points:
{ $ (5), b$ (4), ab$ (3), bab$ (2), abab$ (1) }
[Figure: suffix tree of abab$ with leaves labelled 1 to 5]
• The suffix tree for a string of length n can be built in O(n^2) time.
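The quadratic bound can be seen in a simple construction that inserts the suffixes one by one. The sketch below builds an uncompressed suffix trie rather than the compressed suffix tree, which is a simplification made here for brevity; `build_suffix_trie` and the `"pos"` key are names chosen for this illustration.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a nested-dict trie.

    Inserting n suffixes of length at most n takes O(n^2) steps in total.
    Each leaf stores the 1-based starting point of its suffix under "pos".
    """
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1
    return root

trie = build_suffix_trie("abab$")
print(sorted(trie))                # ['$', 'a', 'b']
print(trie["a"]["b"]["$"]["pos"])  # 3, the starting point of the suffix ab$
```

Compressing every chain of single-child nodes into one edge would turn this trie into the suffix tree.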
• Build a suffix tree for the string s1$s2#, where `$' and `#' are
special terminators for s1 and s2.
Suffixes for s1 = abab and s2 = aab:
{ $, b$, ab$, bab$, abab$, #, b#, ab#, aab# }
[Figure: suffix tree for abab$aab#; leaves are labelled with the suffix starting points, 1 to 5 for s1 and 1 to 4 for s2]
– Start at the root.
• The places where S can be found are given by the pointers in all the
leaves in the subtree rooted at x, the node reached at the end of S.
– If a NIL pointer is encountered before reaching the end of S, then S is not in
the tree.
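The search procedure can be sketched on an uncompressed suffix trie, a simplification of the suffix tree assumed here for brevity. The names `build_suffix_trie` and `find_occurrences` are hypothetical, and a missing dictionary key plays the role of the NIL pointer.

```python
def build_suffix_trie(s):
    """Nested-dict trie of all suffixes; leaves store 1-based starts under "pos"."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1
    return root

def find_occurrences(root, pattern):
    """Return the sorted 1-based positions where pattern occurs in the text."""
    node = root  # start at the root
    for ch in pattern:
        if ch not in node:  # NIL pointer: pattern is not in the tree
            return []
        node = node[ch]
    # Collect the pointers in all leaves of the subtree rooted at the node reached.
    positions, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "pos":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)

trie = build_suffix_trie("GOOGOL$")
print(find_occurrences(trie, "GO"))   # [1, 4]
print(find_occurrences(trie, "GOL"))  # [4]
print(find_occurrences(trie, "LOG"))  # []
```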
Cont…
Example:
GOOGOL$,GOL$.
–text-editing,
–free-text search,
–etc.
–String matching
–Palindromes
–etc.
Suffix tree Drawbacks
Suffix trees consume a lot of space
– Even if only word beginnings are indexed, a space overhead of 120% to
240% over the text size is produced.
– How much space is required at each node for English word indexing
based on the letters a to z?
Example:
The suffix array gives the indices of the suffixes in sorted order
Example: Building suffix array
Example 2: Building a suffix array
Given the string S = "GOOGOL", construct its suffix array.
• Sort the suffixes in lexicographical order and store their starting
indices in S in a table.
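The construction can be sketched by sorting the suffixes directly. This is the naive approach, O(n^2 log n) in the worst case, and not one of the faster specialised algorithms; `build_suffix_array` is a hypothetical name and positions are 1-based.

```python
def build_suffix_array(s):
    """Return the 1-based starting positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

S = "GOOGOL"
sa = build_suffix_array(S)
for i in sa:
    print(i, S[i - 1:])
# 4 GOL
# 1 GOOGOL
# 6 L
# 3 OGOL
# 5 OL
# 2 OOGOL
```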
How do we search for a pattern?
If P occurs in T, then all its occurrences are consecutive (sequential) in the
suffix array.
Example
• Let S = mississippi
• Let P = issa
L  11 i
   8 ippi
   5 issippi
   2 ississippi
   1 mississippi
M  10 pi
   9 ppi
   7 sippi
   4 sissippi
   6 ssippi
R  3 ssissippi
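Because all occurrences of P are consecutive in the suffix array, two binary searches are enough to find them. The sketch below is one possible way to write this, with hypothetical names `build_suffix_array` and `search`, a naively built suffix array, and 1-based positions.

```python
def build_suffix_array(s):
    """1-based starting positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def search(s, sa, p):
    """Return the sorted 1-based positions where p occurs in s."""
    def prefix(k):
        # First len(p) characters of the k-th suffix in sorted order.
        return s[sa[k] - 1:sa[k] - 1 + len(p)]

    # Lower bound: first suffix whose prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

S = "mississippi"
sa = build_suffix_array(S)
print(sa)                     # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(S, sa, "issa"))  # []
print(search(S, sa, "ssi"))   # [3, 6]
```

Each binary search makes O(log n) probes and each probe compares at most |P| characters, so a query costs O(|P| log n).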
Thank you