Chapter-4 - Data Structure-File Structure
rithms for IR
Introduction
• A program is written in order to solve a problem.
• A solution to a problem actually consists of two things:
– A way to organize the data
– A sequence of steps to solve the problem
• The way data are organized in a computer's memory is called a data structure, and the sequence of computational steps to solve a problem is called an algorithm.
• Therefore, a program is nothing but data structures plus algorithms.
Sequential File
• The sequential file is the most primitive file structure.
• It has neither a vocabulary nor linking pointers.
• The records are generally arranged serially, one after another, in lexicographic order on the value of some key field.
• A particular attribute is chosen as the primary key, whose value determines the order of the records.
• When the primary key fails to discriminate among records, a secondary key is chosen to order them.
• To access a record, search serially:
– start at the first record, then read and examine each succeeding record until the required record is found or the end of the file is reached.
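The serial access described above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the record layout, the `term` key field, and the sample records are hypothetical and chosen only for the example.

```python
def sequential_search(records, key, key_field="term"):
    """Scan records from the first one until the key is found or the file ends."""
    for position, record in enumerate(records):
        if record[key_field] == key:
            return position, record   # required record found
    return None                       # end of file reached without a match


# Hypothetical records, kept in lexicographic order of the key field.
records = [
    {"term": "act", "doc": 3},
    {"term": "bus", "doc": 3},
    {"term": "pen", "doc": 1},
    {"term": "total", "doc": 2},
]

print(sequential_search(records, "pen"))
print(sequential_search(records, "zebra"))
```

In the worst case every record is read, which is why sequential files become slow as the collection grows.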
[Figure: vocabulary entries (Act 3 3; Bus 3 4; pen 1 1; total 2 3), each pointing to its inverted list]
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sorting the Vocabulary
• After all documents have been parsed the inverted file is sorted
by terms
– Inverted index may record term locations within document during parsing
Parsed order            Sorted by term
Term        Doc #       Term        Doc #
I           1           ambitious   2
did         1           be          2
enact       1           brutus      1
julius      1           brutus      2
caesar      1           caesar      1
I           1           caesar      2
was         1           caesar      2
killed      1           capitol     1
i'          1           did         1
the         1           enact       1
capitol     1           hath        2
brutus      1           I           1
killed      1           I           1
me          1           i'          1
so          2           it          2
let         2           julius      1
it          2           killed      1
be          2           killed      1
with        2           let         2
caesar      2           me          1
the         2           noble       2
noble       2           so          2
brutus      2           the         1
hath        2           the         2
told        2           told        2
you         2           was         1
caesar      2           was         2
was         2           with        2
ambitious   2           you         2
Remove duplicate terms & add frequency
• Multiple term entries in a single document are merged.
• The number of occurrences of each term in the collection is counted.

Term        Doc #   Freq
ambitious   2       1
be          2       1
brutus      1       1
brutus      2       1
caesar      1       1
caesar      2       2
capitol     1       1
did         1       1
enact       1       1
hath        2       1
I           1       2
i'          1       1
it          2       1
julius      1       1
killed      1       2
let         2       1
me          1       1
noble       2       1
so          2       1
the         1       1
the         2       1
told        2       1
was         1       1
was         2       1
with        2       1
you         2       1
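The parse, sort, and merge steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the two document texts are reconstructed from the term lists of the example, the `tokenize` helper is hypothetical, and all terms are lowercased (so "I" appears as "i").

```python
import re
from collections import Counter

# Document texts reconstructed from the example's term lists (assumption).
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}


def tokenize(text):
    """Hypothetical tokenizer: lowercase words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


# Step 1: parse each document into (term, doc ID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# Step 2: sort the pairs by term, then by doc ID.
pairs.sort()

# Step 3: merge duplicate pairs and record the within-document frequency.
frequencies = Counter(pairs)
index = sorted(frequencies.items())

for (term, doc_id), freq in index:
    print(f"{term:<10} {doc_id}  {freq}")
```

Using a `Counter` on the sorted pairs does the duplicate removal and the frequency count in one pass.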
Inverted index storage
• Separating the inverted file into a vocabulary and a posting file is a good idea.
– Vocabulary: for searching purposes we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
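One way to picture the separation is a small in-memory vocabulary that stores, for each term, where its postings start in a separate posting file. The sketch below is only an illustration: the `(offset, count)` layout and the function names are assumptions, and the posting file is modelled as a Python list rather than a file on disk.

```python
def split_index(sorted_postings):
    """Split (term, doc_id, freq) triples, sorted by term, into two structures."""
    vocabulary = {}    # term -> (offset into posting file, number of postings)
    posting_file = []  # flat list of (doc_id, freq) entries
    for term, doc_id, freq in sorted_postings:
        offset, count = vocabulary.get(term, (len(posting_file), 0))
        posting_file.append((doc_id, freq))
        vocabulary[term] = (offset, count + 1)
    return vocabulary, posting_file


def lookup(term, vocabulary, posting_file):
    """Search the vocabulary first, then read only that term's postings."""
    if term not in vocabulary:
        return []
    offset, count = vocabulary[term]
    return posting_file[offset:offset + count]


sorted_postings = [
    ("brutus", 1, 1), ("brutus", 2, 1),
    ("caesar", 1, 1), ("caesar", 2, 2),
    ("capitol", 1, 1),
]
vocabulary, posting_file = split_index(sorted_postings)
print(vocabulary["caesar"])
print(lookup("caesar", vocabulary, posting_file))
```

Only the vocabulary has to stay in memory; the postings for a term are fetched after the term has been found.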
– If txt = t1 t2 ... ti ... tn is a string, then Ti = ti ti+1 ... tn is the suffix of txt that starts at position i.
• Example:
txt = mississippi       txt = GOOGOL
T1  = mississippi       T1 = GOOGOL
T2  = ississippi        T2 = OOGOL
T3  = ssissippi         T3 = OGOL
T4  = sissippi          T4 = GOL
T5  = issippi           T5 = OL
T6  = ssippi            T6 = L
T7  = sippi
T8  = ippi
T9  = ppi
T10 = pi
T11 = i
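The definition translates directly into a short Python helper. This is a small sketch; the function name `suffixes` is chosen here for illustration, and positions are 1-based to match the notation Ti.

```python
def suffixes(txt):
    """Return (i, Ti) pairs, where Ti is the suffix of txt starting at position i."""
    return [(i + 1, txt[i:]) for i in range(len(txt))]


for position, suffix in suffixes("GOOGOL"):
    print(f"T{position} = {suffix}")
```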
Suffix trie
• A suffix trie is an ordinary trie in which the input strings are all possible suffixes.
• Principle: the idea behind the suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
• To build the suffix trie we use these indices instead of the actual object.
• The structure has several advantages:
• It requires less storage space.
• We do not have to worry about how the text is represented (binary, ASCII, etc.).
• We do not have to store the same object twice (no duplicates).
• Construct the suffix trie for the following string: GOOGOL
• We begin by giving a position to every suffix in the text, from left to right, according to where its first character occurs in the string.
• TEXT:     G O O G O L $
  POSITION: 1 2 3 4 5 6 7
• Build a suffix trie for all n suffixes of the text.
• Note: the resulting tree has n leaves and height n.
• This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
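The construction can be sketched in Python by inserting every suffix of GOOGOL$ into an ordinary trie, character by character. This is a naive illustrative build, not an optimized one; the class and function names are assumptions made for the example, and each leaf stores the starting position of its suffix rather than the suffix itself.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.position = None   # start position of the suffix ending here (leaves only)


def build_suffix_trie(text):
    """Insert every suffix of text; text is expected to end with a unique '$'."""
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.position = start + 1   # 1-based position, as in the text
    return root


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children.values())


def height(node):
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children.values())


trie = build_suffix_trie("GOOGOL$")
print(count_leaves(trie), height(trie))
```

For GOOGOL$ (n = 7 symbols including the terminator) the trie has 7 leaves and height 7, which agrees with the note above.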
Suffix tree
• A suffix tree is a member of the trie family. It is a trie of all the suffixes of S.
– The suffix tree is created by compacting unary nodes of the suffix trie.
• We store pointers rather than words in the leaves.
– It is also possible to replace the string on every edge by a pair (a,b), where a and b are the beginning and end index of the string, i.e.
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
• To make suffixes prefix-free we add a special character, $, at the end of s. To associate each suffix with a unique string in a set of strings S, add a different special symbol to each string s.
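The compaction of unary nodes and the (a,b) edge labels can be sketched in Python as follows. This is a naive illustration that first builds the full suffix trie and then collapses chains of single-child nodes; it is not one of the efficient suffix tree construction algorithms, and the class and function names are assumptions.

```python
class Node:
    def __init__(self, index=None):
        self.children = {}   # character -> Node
        self.index = index   # 1-based text position of the character leading into this node


def build_suffix_trie(text):
    root = Node()
    for start in range(len(text)):
        node = root
        for i in range(start, len(text)):
            if text[i] not in node.children:
                node.children[text[i]] = Node(index=i + 1)
            node = node.children[text[i]]
    return root


def compact_edges(node, text, edges=None):
    """Collapse unary chains and label each resulting edge with (a, b, substring)."""
    if edges is None:
        edges = []
    for child in node.children.values():
        a = child.index
        end = child
        while len(end.children) == 1:          # follow the chain of unary nodes
            end = next(iter(end.children.values()))
        b = end.index
        edges.append((a, b, text[a - 1:b]))
        compact_edges(end, text, edges)
    return edges


text = "GOOGOL$"
edges = compact_edges(build_suffix_trie(text), text)
for a, b, label in edges:
    print((a, b), label)
```

Each edge needs only two integers regardless of how long its label is, which is where the space saving comes from.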
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of the text is the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
– Start at the root
– Go down the tree, each time taking the path that matches the next characters of S
– If S corresponds to a node x, then return all leaves in the subtree rooted at x
• the places where S can be found are given by the pointers in all the leaves in the subtree rooted at x.
– If a NIL pointer is encountered before reaching the end of S, then S is not in the tree
Example:
• If S = "GO" we take the GO path and return: GOOGOL$, GOL$.
• If S = "OR" we take the O path and then we hit a NIL pointer, so "OR" is not in the tree.
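The search procedure can be sketched in Python. As a simplifying assumption, the sketch walks an uncompacted suffix trie rather than a true suffix tree; the traversal logic is the same, only the edges carry single characters. The names used are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.position = None   # start position of the suffix (set on leaves)


def build_suffix_trie(text):
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.position = start + 1
    return root


def find_all(root, pattern):
    """Return the start positions of every occurrence of pattern in the text."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:            # NIL pointer: pattern is not in the tree
            return []
    positions = []
    stack = [node]                  # collect all leaves below the matched node
    while stack:
        current = stack.pop()
        if current.position is not None:
            positions.append(current.position)
        stack.extend(current.children.values())
    return sorted(positions)


trie = build_suffix_trie("GOOGOL$")
print(find_all(trie, "GO"))
print(find_all(trie, "OR"))
```

The positions returned for a pattern are the starting positions of the suffixes stored in the leaves below the matched node.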
Suffix Tree Applications
• Suffix trees can be used to solve a large number of string problems that occur in:
– text editing,
– free-text search,
– etc.
• Main drawbacks:
– The costly construction process,
– The need for the text to be readily available at query time
Building suffix array
• Procedure:
– Identify suffixes of the given string
– Sort the suffixes lexicographically
– Store indices of all the suffixes in a table.
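The three steps of the procedure can be sketched in Python. This is a straightforward version that sorts the suffixes directly, which is fine for short strings but is not one of the specialized suffix array construction algorithms; the function name is an assumption.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of the suffixes in lexicographic order."""
    # Step 1: identify the suffixes, each paired with its start position.
    suffixes = [(text[i:], i + 1) for i in range(len(text))]
    # Step 2: sort the suffixes lexicographically.
    suffixes.sort()
    # Step 3: store only the indices in a table.
    return [position for _, position in suffixes]


print(build_suffix_array("GOOGOL"))   # → [4, 1, 6, 3, 5, 2]
```

For GOOGOL the sorted suffixes are GOL, GOOGOL, L, OGOL, OL, OOGOL, so the table holds the positions 4, 1, 6, 3, 5, 2.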
• Text signature: 1110101 0111100 1011111