Chapter 4 IR
Chapter Four
Indexing structure
Target Group –IT 3rd year students
Injibara, Ethiopia
Brain Storming Questions
Introduction
File structures:
A fundamental decision in the design of IR systems is which
type of file structure to use for the underlying document
database.
The file structures used in IR systems include:
– flat files,
– inverted files,
– signature files,
– PAT trees, and
– graphs.
Though it is possible to keep file structures in main memory, in practice IR databases are usually stored on disk because of their size.
Designing an IR System
Our focus during IR system design is:
1. Improving the effectiveness of the system
– The concern here is retrieving more documents that are relevant to the user's query
– Effectiveness of the system is measured in terms of precision, recall, …
– Main emphasis: stemming, stop word removal, weighting schemes, matching algorithms
2. Improving the efficiency of the system
– The concern here is reducing the storage space requirement and reducing searching time, indexing time, access time, …
– Main emphasis: compression, indexing structures, space–time tradeoffs
Subsystems of IR system
An IR system has two subsystems:
–Indexing
–Searching
Indexing subsystem (pipeline figure):
documents
→ assign document identifier → document IDs
→ tokenization → tokens
→ stopword removal → non-stoplist tokens
→ stemming & normalization → stemmed terms
→ term weighting → weighted index terms
→ index file
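The indexing pipeline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stoplist and the stem table are tiny hand-made stand-ins (not a real stopword list or stemming algorithm), the function names are hypothetical, and raw term frequency is used as the term weight.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stoplist; real systems use much longer lists.
STOPWORDS = {"i", "did", "was", "the", "me", "so", "let", "it", "be",
             "with", "hath", "told", "you"}

# Assumption: a hand-made lookup table standing in for a real stemmer.
STEM_TABLE = {"killed": "kill", "ambitious": "ambition"}

# Sample documents, keyed by document identifier.
DOCS = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

def tokenize(text):
    """Tokenization: split the text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def index_terms(text):
    """Tokenize, drop stopwords, then normalize/stem the remaining tokens."""
    kept = [t for t in tokenize(text) if t not in STOPWORDS]
    return [STEM_TABLE.get(t, t) for t in kept]

def build_weighted_index(docs):
    """Term weighting: here simply the term frequency within each document."""
    return {doc_id: Counter(index_terms(text)) for doc_id, text in docs.items()}

index = build_weighted_index(DOCS)
print(index_terms(DOCS[2]))   # index terms of document 2, in order of occurrence
print(index[1])               # weighted index terms of document 1
```

A production system would swap in a full stopword list, a real stemmer, and a weighting scheme such as TF-IDF; the stages and their order stay the same.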
Searching Subsystem
Token stream → linguistic preprocessing → modified tokens (e.g. roman, countryman)
Building Index file
•An index file for a document collection consists of a list of index terms, each with a link to the one or more documents that contain the term
–A good index file maps each keyword Ki to the set of documents Di that contain the keyword
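Example documents:
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.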
Sorting the Vocabulary
After all documents have been tokenized, stopwords are removed, and normalization and stemming are applied to generate index terms. These index terms are stored in a sequential file, sorted in alphabetical order.
Tokens (Term, Doc #), in order of occurrence:
Doc 1: I, did, enact, julius, caesar, I, was, killed, I, the, capitol, brutus, killed, me
Doc 2: so, let, it, be, with, caesar, the, noble, brutus, hath, told, you, caesar, was, ambitious
Sequential file
No.  Term      Doc #
1    ambition  2
2    brutus    1
3    brutus    2
4    caesar    1
5    caesar    2
6    caesar    2
7    capitol   1
8    enact     1
9    julius    1
10   kill      1
11   kill      1
12   noble     2
Sequential File
• Its main advantages are:
– easy to implement;
– provides fast access to the next record using lexicographic order.
– because the records are sorted, a term can be found in logarithmic time using binary search instead of a linear-time scan
• Its disadvantages:
– difficult to update. Index must be rebuilt if a new term is added.
Inserting a new record may require moving a large proportion of the
file;
– random access is extremely slow.
• The update problem can be solved:
– by ordering records by date of acquisition rather than by key value; the newest entries are then added at the end of the file and pose no difficulty for updating.
– But searching then becomes expensive: it requires linear time
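The trade-off above can be illustrated with a short sketch. It assumes the sequential file is held in memory as a list of (term, document ID) records; the function names are hypothetical. The sorted file is searched with binary search from Python's standard `bisect` module, while the file kept in acquisition order can only be scanned linearly.

```python
from bisect import bisect_left, bisect_right

# Sequential file sorted by term (key value).
sorted_records = [
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
    ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2),
]
# Keys extracted once so that bisect can compare plain strings.
sorted_keys = [term for term, _ in sorted_records]

def lookup_sorted(term):
    """Binary search: O(log n) comparisons to locate the run of matching records."""
    lo = bisect_left(sorted_keys, term)
    hi = bisect_right(sorted_keys, term)
    return [doc for _, doc in sorted_records[lo:hi]]

def lookup_unsorted(records, term):
    """Records kept in acquisition order: every record must be inspected, O(n)."""
    return [doc for t, doc in records if t == term]

# Appending to an acquisition-ordered file is trivial, but lookups are linear.
acquisition_order = [("enact", 1), ("julius", 1), ("caesar", 1), ("caesar", 2)]
acquisition_order.append(("noble", 2))

print(lookup_sorted("caesar"))
print(lookup_unsorted(acquisition_order, "caesar"))
```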
Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
–Building and maintaining an inverted index is a relatively low-cost task. For a text of n words, an inverted index can be built in O(n) time.
• Content of the inverted file:
–vocabulary file
–Posting file
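Structure: the vocabulary (word list) holds one entry per term, with a pointer to that term's postings (inverted list), which in turn point to the actual documents.
Term    No. of docs   Total freq.   Pointer to posting
Act     3             3             → inverted list
Bus     3             4             → inverted list
pen     1             1             → inverted list
total   2             3             → inverted list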
Example:
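• Given a collection of documents, each document is parsed to extract its words, and each word is saved together with the document ID.
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.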
Sorting the Vocabulary
Before sorting (first rows)    After sorting (first rows)
Term     Doc #                 Term       Doc #
I        1                     ambitious  2
did      1                     be         2
enact    1                     brutus     1
julius   1                     brutus     2
caesar   1                     caesar     1
I        1                     caesar     2
was      1                     caesar     2
killed   1                     capitol    1
Multiple entries for the same term in a single document are merged, and frequency information is added. Counting the number of occurrences of each term in a document gives its term frequency (TF).
Before merging         After merging
Term      Doc #        Term      Doc #  TF
ambition  2            ambition  2      1
brutus    1            brutus    1      1
brutus    2            brutus    2      1
caesar    1            caesar    1      1
caesar    2            caesar    2      2
caesar    2            capitol   1      1
capitol   1            enact     1      1
enact     1            julius    1      1
julius    1            kill      1      2
kill      1            noble     2      1
kill      1
noble     2
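The merging step can be sketched with a counter over (term, document ID) pairs. This is a minimal illustration using the index terms of the two example documents; the function name is hypothetical.

```python
from collections import Counter

# (term, doc_id) pairs after stopword removal and stemming.
pairs = [
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
    ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2),
]

def merge_with_tf(pairs):
    """Merge duplicate (term, doc_id) entries and attach the term frequency."""
    counts = Counter(pairs)
    # Sorting the keys keeps the result in (term, doc_id) order.
    return [(term, doc_id, tf) for (term, doc_id), tf in sorted(counts.items())]

for term, doc_id, tf in merge_with_tf(pairs):
    print(f"{term:<10}{doc_id:<4}{tf}")
```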
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) file and a postings file. Each vocabulary entry stores the term, its document frequency (DF), its collection frequency (CF), and a pointer to the term's entries in the postings file.
Vocabulary               Postings
Term      DF  CF         Doc #  TF
ambition  1   1     →    2      1
brutus    2   2     →    1      1
                         2      1
caesar    2   3     →    1      1
                         2      2
capitol   1   1     →    1      1
enact     1   1     →    1      1
julius    1   1     →    1      1
kill      1   2     →    1      2
noble     1   1     →    2      1
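The split into a vocabulary and a postings file can be sketched as follows. The layout is an assumption made for illustration: the postings file is modelled as one flat list, and each vocabulary entry holds DF, CF, and the offset (pointer) of the term's first posting. The function names are hypothetical.

```python
# Merged (term, doc_id, tf) entries, sorted by term and then document ID.
entries = [
    ("ambition", 2, 1), ("brutus", 1, 1), ("brutus", 2, 1),
    ("caesar", 1, 1), ("caesar", 2, 2), ("capitol", 1, 1),
    ("enact", 1, 1), ("julius", 1, 1), ("kill", 1, 2), ("noble", 2, 1),
]

def split_index(entries):
    """Split sorted entries into a vocabulary and a flat postings list."""
    vocabulary = {}   # term -> [DF, CF, pointer into postings]
    postings = []     # (doc_id, tf) records; each term's records are contiguous
    for term, doc_id, tf in entries:
        if term not in vocabulary:
            vocabulary[term] = [0, 0, len(postings)]
        vocabulary[term][0] += 1    # DF: one more document contains the term
        vocabulary[term][1] += tf   # CF: total occurrences in the collection
        postings.append((doc_id, tf))
    return vocabulary, postings

def postings_for(term, vocabulary, postings):
    """Follow the pointer and read DF consecutive postings."""
    if term not in vocabulary:
        return []
    df, _cf, pointer = vocabulary[term]
    return postings[pointer:pointer + df]

vocabulary, postings = split_index(entries)
print(vocabulary["caesar"])
print(postings_for("caesar", vocabulary, postings))
```

Keeping the vocabulary small and separate means it can often stay in memory, while the larger postings list is read from disk only for the terms a query actually uses.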
Exercises
• Construct the inverted index for the following
document collections.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise
Suffix trie
What is a suffix? A suffix of a string is a substring that ends at the end of the string.
– Each position in the text is considered the start of a text suffix
– If txt = t_1 t_2 ... t_i ... t_n is a string, then T_i = t_i t_(i+1) ... t_n is the suffix of txt that starts at position i
• Example: txt = mississippi txt = GOOGOL
T1 = mississippi; T1 = GOOGOL
T2 = ississippi; T2 = OOGOL
T3 = ssissippi; T3 = OGOL
T4 = sissippi; T4 = GOL
T5 = issippi; T5 = OL
T6 = ssippi; T6 = L
T7 = sippi;
T8 = ippi;
T9 = ppi;
T10 = pi;
T11 = i;
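The suffix lists above can be reproduced with a one-line slice per position. A minimal sketch, using 1-based positions as in the definition of T_i; the function name is hypothetical.

```python
def suffixes(txt):
    """Return (i, T_i) for every 1-based position i of txt."""
    return [(i + 1, txt[i:]) for i in range(len(txt))]

for position, suffix in suffixes("GOOGOL"):
    print(f"T{position} = {suffix}")
```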
Suffix trie
A suffix trie is an ordinary trie in which the input strings
are all possible suffixes.
–Principle: the idea behind the suffix trie is to assign to each symbol in the text an index corresponding to its position in the text (i.e. the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
• To build the suffix trie we use these indices instead of the actual object.
The structure has several advantages:
–It requires less storage space.
–We do not have to worry how the text is represented (binary,
ASCII, etc).
–We do not have to store the same object twice (no duplicate).
Suffix Trie
•Construct SUFFIX TRIE for the following string: GOOGOL
•We begin by assigning a position to each character of the text, from left to right; each suffix is identified by the position at which it starts. A terminator symbol $ is appended to mark the end of the text.
TEXT : GOOGOL$
POSITION : 1 2 3 4 5 6 7
•Build a suffix trie for all n suffixes of the text.
•Note: The resulting trie has n leaves and height n.
•This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching, since every substring of the text is a prefix of some suffix.
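A suffix trie for a short text can be sketched as nested nodes, one per character. This is an uncompressed illustration, not an efficient implementation: it uses O(n^2) nodes in the worst case, assumes the text does not already contain the terminator `$`, and the class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.position = None   # 1-based start of the suffix ending at this leaf

def build_suffix_trie(text):
    """Insert every suffix of text + '$' into an ordinary trie."""
    text += "$"                # terminator: every suffix ends at its own leaf
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
        node.position = start + 1
    return root

def occurrences(root, pattern):
    """Return the 1-based positions at which pattern occurs in the text."""
    node = root
    for ch in pattern:         # follow the pattern down from the root
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [node]  # every leaf below marks one occurrence
    while stack:
        current = stack.pop()
        if current.position is not None:
            found.append(current.position)
        stack.extend(current.children.values())
    return sorted(found)

trie = build_suffix_trie("GOOGOL")
print(occurrences(trie, "GO"))
print(occurrences(trie, "O"))
```

Matching a pattern costs time proportional to the pattern length plus the size of the subtree that is collected, regardless of how long the text is.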
Suffix tree
A suffix tree is a member of the trie family. It is a trie of all the suffixes of a string S.
The suffix tree is created by compacting the unary nodes (nodes with a single child) of the suffix trie.
We store pointers rather than words in the leaves.
It is also possible to replace the string on every edge by a pair (a,b), where a and b are the beginning and end indices of the string in the text. For the text GOOGOL$, for example:
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
Example: Suffix tree
•Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$:
1 abab$
2 bab$
3 ab$
4 b$
5 $
• We label each leaf with the starting point of the corresponding suffix.
Resulting tree (edge label → leaf number):
root
├─ $ → 5
├─ ab
│   ├─ $ → 3
│   └─ ab$ → 1
└─ b
    ├─ $ → 4
    └─ ab$ → 2
Complexity Analysis
Built naively, by inserting one suffix at a time, the suffix tree for a string of length n takes O(n^2) time to construct.
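The quadratic construction can be sketched by inserting the suffixes one at a time and splitting an edge wherever a new suffix diverges from it. This is an illustrative sketch, not an optimized algorithm: the class and function names are hypothetical, the text is assumed not to contain `$`, and each edge label is stored as an index pair rather than as a string.

```python
class Node:
    """Suffix tree node; the edge entering it is labelled text[start:end]."""
    def __init__(self, start=0, end=0, leaf=None):
        self.start = start     # 0-based, inclusive
        self.end = end         # 0-based, exclusive
        self.children = {}     # first character of an edge label -> child Node
        self.leaf = leaf       # 1-based start of the suffix (leaves only)

    def pair(self):
        """Edge label as a 1-based inclusive (a, b) index pair."""
        return (self.start + 1, self.end)

def build_suffix_tree(s):
    text = s + "$"             # unique terminator: no suffix is a prefix of another
    root = Node()
    for i in range(len(text)):             # insert suffix text[i:]
        node, j = root, i
        while True:
            child = node.children.get(text[j])
            if child is None:              # no edge starts with text[j]: new leaf
                node.children[text[j]] = Node(j, len(text), leaf=i + 1)
                break
            k = child.start                # walk along the edge while it matches
            while k < child.end and text[k] == text[j]:
                k += 1
                j += 1
            if k == child.end:             # whole edge matched: continue below it
                node = child
                continue
            # Mismatch inside the edge: split it at k and hang a new leaf there.
            mid = Node(child.start, k)
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[j]] = Node(j, len(text), leaf=i + 1)
            break
    return text, root

text, root = build_suffix_tree("GOOGOL")
go = root.children["G"]
print(go.pair(), text[go.start:go.end])
print(go.children["O"].pair(), root.children["$"].pair())
```

Each of the n suffixes may be compared character by character against existing edges, which is where the O(n^2) bound comes from.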
Structure of Signature File
Figure: the signature file holds one F-bit signature per text block (N blocks in total), each paired with a pointer into the text file that holds the document.
Example
• Given a text: "A text has many words. Words are made from letters"
• Text signature: 1110101 0111100 1011111