Chapter Four: Indexing Structure
Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
  – The concern here is retrieving more documents relevant to the
    user's query
  – Effectiveness of the system is measured in terms of precision,
    recall, …
  – Main emphasis: stemming, stopword removal, weighting
    schemes, matching algorithms
• Searching
  – An online process that scans the document corpus to find
    documents that match the user's query
Indexing Subsystem
[Diagram: documents → assign document identifier → … → weighted index
terms → index file]
Searching Subsystem
[Diagram: query → parse query → query tokens → stop-word removal →
non-stoplist tokens → stemming & normalization → stemmed terms →
query term weighting → query terms → similarity measure (against the
index terms in the index file) → ranking → ranked relevant document set]
Basic assertion
Indexing and searching are:
– inexorably connected
– you cannot search what was not first indexed in some
  manner or other
– indexing of documents or objects is done in order to make them
  searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing language
Knowing searching is knowing indexing.
Implementation Issues
• Storage of text:
  – The need for text compression: to reduce storage space
• Indexing text
  – Organizing indexes
    • What techniques to use? How to select them?
  – Storage of indexes
    • Is compression required? Do we store them in memory or on disk?
• Accessing text
  – Accessing indexes
    • How to access the indexes? What data/file structure to use?
  – Processing indexes
    • How to search a given query in the index? How to update the index?
  – Accessing documents
Text Compression
• Advantages:
  – Saves storage space
  – Speeds up document transmission
  – Searching the compressed text can take less time
• Disadvantages:
  – Consumes computational resources (both memory space and
    processor running time)
Common compression methods
• Static methods:
  – Statistical methods, which require statistical information about the
    frequency of occurrence of symbols in the document
    E.g. Huffman coding
  – Two-pass algorithm:
    • First pass: estimate probabilities of symbols
    • Second pass: encode symbols, generating codewords; usually shorter
      codes for symbols with high probabilities
• Adaptive methods:
  – Dictionary-based methods, which construct the dictionary in the
    course of compression
    E.g. Ziv-Lempel compression
  – One-pass algorithm:
    • Encode symbols to generate codewords
    • Replace words or symbols with a pointer to dictionary entries
Huffman coding
• The Huffman encoding algorithm repeatedly picks the two symbols
  with the smallest frequency and combines them.

  Symbol   Probability
  a        0.05
  b        0.05
  c        0.1
  d        0.2
  e        0.3
  f        0.2
  g        0.1
Huffman code tree
[Code tree: the root (1.0) splits into subtrees of weight 0.4 and 0.6.
The 0.4 subtree splits into d (0.2) and f (0.2). The 0.6 subtree splits
into a 0.3 subtree and e (0.3). The 0.3 subtree splits into g (0.1) and
a 0.2 subtree; the 0.2 subtree splits into a 0.1 subtree and c (0.1);
the 0.1 subtree splits into a (0.05) and b (0.05).]
Reading 0 for a left branch and 1 for a right branch gives the codes:
d = 00, f = 01, e = 11, g = 100, a = 10100, b = 10101, c = 1011
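The merge-the-two-smallest procedure above can be sketched in Python (an illustrative implementation, not code from the slides). Tie-breaking may produce a different tree than the one drawn, but every Huffman tree for these probabilities has the same average code length.

```python
import heapq

def huffman_codes(probs):
    """Build Huffman codes for a {symbol: probability} table by
    repeatedly merging the two least-probable subtrees."""
    # Heap entries: (probability, tie_breaker, {symbol: code_so_far})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # Prepend 0 to every code in the left subtree, 1 in the right.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
         "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(probs)
# e (most probable) gets one of the shortest codes; a and b the longest.
```

The resulting code is prefix-free, so a compressed bit stream can be decoded unambiguously by walking the tree.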
A Simple Coding Example
Example
• Sort this list by frequency and make the two lowest-frequency
  elements into leaves, creating a parent node with a frequency that
  is the sum of the two elements' frequencies.
• The two elements are removed from the list and the new parent node,
  with frequency 12, is inserted into the list in frequency order.
Example
• You repeat until there is only one element left in the list.
Exercise
1. Given the following, apply the Huffman algorithm
   to find an optimal binary code:
   Character:  a   b   c   d   e   t
   Frequency: 16   5  12  17  10  25
2. Given the text:
   "for each rose, a rose is a rose"
   compress it at the word level using Huffman coding.
Lempel-Ziv compression
• Ziv-Lempel compression:
  – Does not rely on previous knowledge about the data
  – Rather, it builds this knowledge in the course of data
    transmission/data storage
  – The Ziv-Lempel algorithm (called LZ) uses a table of codewords
    created during data transmission;
    • each time, it replaces strings of characters with a reference to a
      previous occurrence of the string.
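The table-of-codewords idea can be sketched with an LZ78-style encoder (a simplified illustration; real Ziv-Lempel variants such as LZ77 and LZW differ in detail). Each output pair references a previously seen phrase and extends it by one character:

```python
def lz78_compress(text):
    """LZ78-style compression: build the dictionary while scanning.

    Output is a list of (index, char) pairs, where index points to a
    previously seen phrase (0 = empty phrase) and char extends it.
    """
    dictionary = {"": 0}            # phrase -> index
    result = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch            # keep extending a known phrase
        else:
            result.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                      # flush a trailing known phrase
        result.append((dictionary[phrase[:-1]], phrase[-1]))
    return result

def lz78_decompress(pairs):
    """Rebuild the text by replaying the dictionary construction."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

Note that the decompressor needs no side information: it reconstructs the same dictionary from the pairs alone, which is what makes the method adaptive.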
Lempel-Ziv Compression Algorithm
1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA.
Indexing: Basic Concepts
• Indexing is used to speed up access to desired information
  from a document collection as per the user's query, such that
  – it enhances efficiency in terms of retrieval time: relevant
    documents are searched and retrieved quickly.
  Example: the author catalog in a library
• An index file consists of records, called index entries.
  – The usual unit for indexing is the word.
• Index terms are used to look up records in a file.
• Index files are much smaller than the original file. Do you
  agree?
• Remember…
  – Remember Heaps' law: in a 1 GB text collection, the size of the
    vocabulary is only about 5 MB (Baeza-Yates and Ribeiro-Neto, 2005).
  – This size may be further reduced by linguistic pre-processing (like
    stemming & other normalization methods).
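The 1 GB → ~5 MB observation is roughly what Heaps' law predicts. A quick sketch (the constants k = 44 and b = 0.49 are commonly cited empirical values for English text and are assumed here, not taken from the slides):

```python
# Heaps' law: vocabulary size V grows as V = k * n^b for n tokens.
def heaps_vocabulary(n_tokens, k=44, b=0.49):
    """Estimated number of distinct terms in a collection of n tokens."""
    return k * n_tokens ** b

# ~1 GB of text at roughly 6 bytes per token is ~180 million tokens:
v = heaps_vocabulary(180_000_000)
# v comes out in the hundreds of thousands of terms; at ~10 bytes per
# stored term that is a few megabytes, consistent with the 5 MB figure.
```

The sub-linear exponent b < 1 is the key point: the vocabulary grows far more slowly than the collection itself.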
Major Steps in Index Construction
• Source file: collection of text documents
  – A document can be described by a set of representative keywords
    called index terms.
• Index term selection:
  – Tokenize: identify words in a document, so that each document is
    represented by a list of keywords or attributes
  – Stop words: removal of high-frequency words
    • A stop list of words is used for comparing against the input text
  – Stemming and normalization: reduce words with similar meaning
    into their stem/root word
    • Suffix stripping is the common method
  – Weighting terms: different index terms have varying importance
    when used to describe document contents.
    • This effect is captured through the assignment of numerical weights
      to each index term of a document.
    • There are different term-frequency statistics (TF, DF, CF) from which
      the TF*IDF weight can be calculated during searching.
• Output: a set of index terms (vocabulary) to be used for indexing the
  documents that each term occurs in.
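The steps above can be sketched as a small pipeline. The stoplist and the suffix-stripping rules below are tiny illustrative stand-ins for a real stoplist and a real stemmer (such as Porter's algorithm):

```python
import re

# Minimal index-term selection: tokenize, remove stopwords,
# crude suffix stripping, then count term frequency (TF).
STOPWORDS = {"a", "an", "the", "is", "was", "has", "you", "of", "and", "to"}

def stem(word):
    # Very crude suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(document):
    tokens = re.findall(r"[a-z]+", document.lower())   # tokenize + normalize
    terms = [stem(t) for t in tokens if t not in STOPWORDS]
    tf = {}
    for t in terms:                                    # term frequency
        tf[t] = tf.get(t, 0) + 1
    return tf

tf = index_terms("The noble Brutus has told you Caesar was ambitious")
```

The output dictionary is exactly the per-document term/weight information the indexing subsystem writes into the index file.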
Basic Indexing Process
[Diagram: documents to be indexed ("Friends, Romans, countrymen.") →
tokenizer → token stream (Friends, Romans, countrymen) → linguistic
modules → modified tokens → indexer → inverted index (e.g. the posting
countryman → 13, 16)]
Building Index file
• An index file of a document collection is a file consisting of a list of
  index terms and a link to one or more documents that contain each term.
  – A good index file maps each keyword Ki to the set of documents Di that
    contain that keyword.
• An index file is a list of search terms that are organized for associative
  look-up, i.e., to answer the user's query:
  – In which documents does a specified search term appear?
  – Where within each document does each term appear? (There may be several
    occurrences.)
• For organizing the index file for a collection of documents, there are
  various options available:
  – Decide what data structure and/or file structure to use: a sequential
    file, inverted file, suffix array, signature file, etc.?
Index file Evaluation Metrics
• Running time
  – Indexing time
  – Access/search time: does it allow sequential or random
    searching/access?
  – Update time (insertion time, deletion time, modification
    time, …): can the indexing structure support re-indexing or
    incremental indexing?
• Space overhead
  – Computer storage space consumed
• Access types supported efficiently
  – Does the indexing structure allow access to:
    • records with a specified term, or
    • records with terms falling in a specified range of values?
Sequential File
• A sequential file is the most primitive file structure.
  It has no vocabulary, nor linking pointers.
• The records are generally arranged serially, one after another,
  but in lexicographic order on the value of some key field:
  – a particular attribute is chosen as the primary key, whose value
    determines the order of the records;
  – when the first key fails to discriminate among records, a second
    key is chosen to give an order.
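Because the records are kept in key order, a sequential file can be searched with binary search on the (primary key, secondary key) pair. A minimal sketch using Python's bisect module (the sample records are illustrative):

```python
import bisect

# A sequential file as a sorted list of (term, doc_id) records, keyed
# primarily on term and secondarily on doc_id (the "second key").
records = [
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("capitol", 1), ("enact", 1), ("julius", 1),
]
records.sort()  # lexicographic order on (term, doc_id)

def lookup(term):
    """Binary-search the sorted records for all postings of one term."""
    lo = bisect.bisect_left(records, (term, float("-inf")))
    hi = bisect.bisect_right(records, (term, float("inf")))
    return records[lo:hi]
```

Binary search gives O(log n) access, but insertions still require shifting records, which is one reason more elaborate structures (inverted files, suffix arrays) are preferred.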
Example:
• Given a collection of documents, they are parsed to
  extract words, and these are saved with the document
  ID.

  Doc 1: "I did enact Julius Caesar: I was killed
  i' the Capitol; Brutus killed me."

  Doc 2: "So let it be with Caesar. The noble
  Brutus has told you Caesar was ambitious."
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are removed, and
  normalization and stemming are applied, to generate index terms.
• These index terms are then sorted in alphabetical order in the
  sequential file.

  Tokens (term, doc #), in order of occurrence:
  I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1,
  the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2,
  with 2, caesar 2, the 2, noble 2, brutus 2, has 2, told 2, you 2,
  caesar 2, was 2, ambitious 2

  Sequential file (after stopword removal, stemming, and sorting):
  No.  Term      Doc
  1    ambition  2
  2    brutus    1
  3    brutus    2
  4    caesar    1
  5    caesar    2
  6    caesar    2
  7    capitol   1
  8    enact     1
  9    julius    1
  10   kill      1
  11   kill      1
  12   noble     2
Complexity Analysis
Sequential File
• Why location?
  – Having information about the location of each term within the
    document helps for:
    • user interface design: highlighting the location of search terms
    • proximity-based ranking: adjacency and NEAR operators (in
      Boolean searching)
• Why frequencies?
  – Having information about frequency is used for:
    • calculating term weights (like IDF, TF*IDF, …)
    • optimizing query processing
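The stored frequencies feed directly into term weighting. A sketch assuming the common w = TF × log2(N/DF) formulation (the collection size and document frequencies below are made up for illustration):

```python
import math

N = 10                                   # assumed total number of documents
df = {"auto": 3, "bus": 3, "taxi": 1}    # assumed document frequencies

def tf_idf(term, tf):
    """TF*IDF weight: terms frequent in a document but rare in the
    collection score highest."""
    return tf * math.log2(N / df[term])

# At equal TF, the rare term "taxi" outweighs the common term "auto".
```

This is why the index stores DF (and CF) alongside the postings: the weight can then be computed at query time without rescanning the documents.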
Inverted File
Documents are organized by the terms/words they contain. This is
called an index file. Text operations are performed before building
the index.

  Term    CF   Doc ID   TF   Location
  auto     3        2    1   66
                  19    1   213
                  29    1   45
  bus      4        3    1   94
                  19    2   7, 212
                  22    1   56
  taxi     1        5    1   43
  train    3       11    2   3, 70
                  34    1   40
Organization of Index File
• An inverted index consists of two files:
  – a vocabulary file
  – a posting file

  Vocabulary (word list):
  Term    No of Doc   Tot freq   Pointer to posting
  Act          3          3      → inverted list
  Bus          3          4      → inverted list
  pen          1          1      → inverted list
  total        2          3      → inverted list

  Each pointer leads to that term's inverted list in the postings file,
  and each posting in turn points to the actual documents.
Inverted File
• Vocabulary file
  – A vocabulary file (word list):
    • stores all of the distinct terms (keywords) that appear in any of
      the documents (in lexicographical order), and
    • for each word, a pointer to the posting file
  – The record kept for each term j in the word list contains the
    following: term j, DFj, CFj, and a pointer to the posting file
• Postings file (inverted list)
  – For each distinct term in the vocabulary, it stores a list of pointers
    to the documents that contain that term.
  – Each element in an inverted list is called a posting, i.e., the
    occurrence of a term in a document.
  – It is stored as a separate inverted list for each column, i.e., a list
    corresponding to each term in the index file.
    • Each list consists of one or many individual postings with the
      document ID, TF, and location information for a given term i.
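A vocabulary-plus-postings structure of this kind can be sketched as a dictionary mapping each term to its postings of (doc ID, TF, positions); DF and CF then fall out of the postings lists. The sample documents are illustrative:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build an inverted index from {doc_id: text}.

    Returns {term: [(doc_id, tf, [positions]), ...]}. The document
    frequency DF of a term is len(index[term]); its collection
    frequency CF is the sum of the tf values.
    """
    postings = defaultdict(dict)            # term -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            postings[token].setdefault(doc_id, []).append(pos)
    return {
        term: [(doc_id, len(locs), locs) for doc_id, locs in by_doc.items()]
        for term, by_doc in sorted(postings.items())   # lexicographic order
    }

index = build_inverted_index({
    1: "I did enact Julius Caesar",
    2: "so let it be with Caesar the noble Brutus",
})
```

Answering "in which documents does Caesar appear, and where?" is then a single dictionary lookup, with no scan of the documents.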
Construction of Inverted file
• Exercise:
  – For a terabyte of text collection, if one page is 100 KB and each
    page contains 250 words on average, calculate the memory space
    requirement of the vocabulary words. Assume one word contains
    10 characters.
Inverted index storage

  Doc 1: "I did enact Julius Caesar: I was killed
  i' the Capitol; Brutus killed me."

  Doc 2: "So let it be with Caesar. The noble
  Brutus has told you Caesar was ambitious."
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is
  sorted by terms.

  Tokens (term, doc #), in order of occurrence:
  I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1,
  the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2,
  with 2, caesar 2, the 2, noble 2, brutus 2, has 2, told 2, you 2,
  caesar 2, was 2, ambitious 2

  Sorted by term (term, doc #):
  ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2,
  capitol 1, did 1, enact 1, has 2, I 1, I 1, I 1, it 2, julius 1,
  killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2,
  was 1, was 2, with 2, you 2
Remove stop-words, apply stemming & compute term frequency
Pointers
Complexity Analysis

Exercises/Assignment
Suffix trie
• What is a suffix? A suffix is a substring that extends to the end of
  the given string.
  – Each position in the text is considered a text suffix.
  – If txt = t1 t2 ... ti ... tn is a string, then Ti = ti ti+1 ... tn is
    the suffix of txt that starts at position i.
• Example:
  txt = mississippi       txt = GOOGOL
  T1 = mississippi;       T1 = GOOGOL
  T2 = ississippi;        T2 = OOGOL
  T3 = ssissippi;         T3 = OGOL
  T4 = sissippi;          T4 = GOL
  T5 = issippi;           T5 = OL
  T6 = ssippi;            T6 = L
  T7 = sippi;
  T8 = ippi;
  T9 = ppi;
  T10 = pi;
  T11 = i;
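Enumerating the suffixes Ti is straightforward; a quick sketch:

```python
def suffixes(txt):
    """Return the suffixes T1..Tn of txt, where Ti starts at position i."""
    return [txt[i:] for i in range(len(txt))]

print(suffixes("GOOGOL"))
```

These n suffixes are exactly the strings inserted into the suffix trie below.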
Suffix trie
• This structure is particularly useful for any application requiring
  prefix-based ("starts with") pattern matching.
Suffix tree
• A suffix tree is a member of the trie family: it is a trie of all
  the suffixes of S.
  – The suffix tree is created by compacting the unary nodes of the
    suffix trie.
• We store pointers rather than words in the leaves.
  – It is also possible to replace the string on every edge by a pair
    (a, b), where a and b are the beginning and end indexes of the
    string, e.g. for GOOGOL$:
    (3, 7) for OGOL$
    (1, 2) for GO
    (7, 7) for $
Example: Suffix tree
• Let s = abab. A suffix tree of s is a compressed trie of all
  suffixes of s$ = abab$:
  1. abab$
  2. bab$
  3. ab$
  4. b$
  5. $
• We label each leaf with the starting position of the
  corresponding suffix.
[Diagram: the root has edges "ab", "b", and "$" (leaf 5); the "ab" node
has edges "ab$" (leaf 1) and "$" (leaf 3); the "b" node has edges "ab$"
(leaf 2) and "$" (leaf 4).]
Complexity Analysis
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a
  compressed trie of all suffixes of every s ∈ S.
  – To make suffixes prefix-free, we add a special character, $, at the
    end of s. To associate each suffix with a unique string in S, we add
    a different special symbol to each s.
  – Build a suffix tree for the string s1$s2#, where '$' and '#' are
    special terminators for s1 and s2.
• Example: let s1 = abab and s2 = aab. A generalized suffix tree for
  s1 and s2 is the compressed trie of:
  suffixes of abab$:  1. abab$   2. bab$   3. ab$   4. b$   5. $
  suffixes of aab#:   1. aab#    2. ab#    3. b#    4. #
[Diagram: the generalized suffix tree combining both suffix sets, with
each leaf labeled by the starting position of its suffix in s1 or s2.]
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is
  easy, since every occurrence of S is the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
  – Start at the root.
  – Go down the tree, taking at each step the path corresponding to the
    next characters of S.
  – If S corresponds to a node x, then return all leaves in the sub-tree
    rooted at x.
    • The places where S can be found are given by the pointers in all
      the leaves in the sub-tree rooted at x.
  – If a NIL pointer is encountered before reaching the end of S, then
    S is not in the tree.
Example (for the suffix tree of GOOGOL$):
• If S = "GO", we take the GO path and return: GOOGOL$, GOL$.
• If S = "OR", we take the O path and then hit a NIL pointer, so
  "OR" is not in the tree.
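The same prefix-of-a-suffix idea can be demonstrated without building a tree, by binary-searching a sorted list of suffixes (a suffix-array-style sketch; a real suffix tree answers in time proportional to the pattern length). The "\uffff" sentinel assumes pattern characters stay below that code point:

```python
import bisect

def find_all(text, pattern):
    """Find all occurrences of pattern via sorted suffixes:
    pattern occurs at position i iff it is a prefix of suffix i."""
    suffix_starts = sorted(range(len(text)), key=lambda i: text[i:])
    sorted_suffixes = [text[i:] for i in suffix_starts]
    # All suffixes with prefix `pattern` form one contiguous run.
    lo = bisect.bisect_left(sorted_suffixes, pattern)
    hi = bisect.bisect_right(sorted_suffixes, pattern + "\uffff")
    return sorted(suffix_starts[lo:hi])

positions = find_all("GOOGOL", "GO")
```

For "GO" in GOOGOL this returns positions 0 and 3, matching the two leaves (GOOGOL$ and GOL$) returned by the suffix-tree search above; for "OR" the run is empty.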
Drawbacks