Chapter Four Indexing Structure

The document discusses indexing and searching in information retrieval systems. It explains that indexing is an offline process that organizes documents using extracted keywords to speed up searching. Searching is an online process that scans documents to find relevant matches to user queries. Compression techniques like Huffman coding are used to reduce the storage space needed for indexing by assigning shorter codes to more frequent symbols. The document provides an example of how Huffman coding assigns variable-length binary codes to symbols based on their frequency in a collection of text.

Chapter Four

Indexing Structure
Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
– The concern here is retrieving more relevant documents for the user's query
– Effectiveness of the system is measured in terms of precision, recall, …
– Main emphasis: stemming, stopword removal, weighting schemes, matching algorithms

• Improving the efficiency of the system
– The concern here is reducing the storage space requirement and enhancing searching time, indexing time, access time, …
– Main emphasis: compression, indexing structures, space-time tradeoffs
Subsystems of IR system

The two subsystems of an IR system: Indexing and Searching
– Indexing:
• is an offline process of organizing documents using keywords extracted from the collection
• is used to speed up access to desired information from the document collection as per the user's query
– Searching:
• is an online process that scans the document corpus to find relevant documents that match the user's query
Indexing Subsystem

documents
→ Assign document identifier → documents with document IDs
→ Tokenization → tokens
→ Stopword removal → non-stoplist tokens
→ Stemming & normalization → stemmed terms
→ Term weighting → weighted index terms
→ Index File
Searching Subsystem

query
→ Parse query → query tokens
→ Stopword removal → non-stoplist tokens
→ Stemming & normalization → stemmed terms
→ Term weighting → query terms
→ Similarity measure (against the index terms in the index file) → relevant document set
→ Ranking → ranked document set
Basic assertion
Indexing and searching are:
– inexorably connected
– you cannot search what was not first indexed in some manner or other
– indexing of documents or objects is done in order to make them searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing language
Knowing searching is knowing indexing
Implementation Issues
• Storage of text:
– The need for text compression: to reduce storage space
• Indexing text
– Organizing indexes
• What techniques to use? How to select them?
– Storage of indexes
• Is compression required? Do we store in memory or on disk?
• Accessing text
– Accessing indexes
• How to access indexes? What data/file structure to use?
– Processing indexes
• How to search a given query in the index? How to update the index?
– Accessing documents
Text Compression

• Text compression is about finding ways to represent the text in fewer bits or bytes so that the file size is reduced.

• Advantages:
– Saves storage space.
– Speeds up document transmission.
– Searching the compressed text can take less time.

• Disadvantages:
– Consumes computational resources (both memory space and processor running time).
Common compression methods

• Static methods:
– Statistical methods which require statistical information about the frequency of occurrence of symbols in the document.
E.g. Huffman coding
– Two-pass algorithm:
• Estimate probabilities of symbols,
• Encode symbols, generating codewords; usually shorter codes for symbols with high probabilities.

• Adaptive methods:
– Dictionary-based methods which construct the dictionary in the course of compression.
E.g. Ziv-Lempel compression:
– One-pass algorithm:
• Encode symbols to generate codewords,
• Replacing words or symbols with a pointer to dictionary entries.
Huffman coding

• Developed in the 1950s by David Huffman; widely used for text compression, multimedia codecs and message transmission.
• The problem: given a set of n symbols and their weights (or frequencies), construct a tree structure (a binary tree for binary codes) with the objective of reducing memory space and decoding time per symbol.
• Huffman coding is constructed based on the frequency of occurrence of letters in text documents.
• Example (tree shown in the slide figure, with 0 on left branches and 1 on right branches), codewords:
D1 = 000
D2 = 001
D3 = 01
D4 = 1
How to construct Huffman coding

Step 1: Create a forest of trees, one for each symbol: t1, t2, … tn
Step 2: WHILE more than one tree exists DO
– Sort the forest of trees according to falling probabilities of symbol occurrence
– Merge the two trees t1 and t2 with the least probabilities p1 and p2
– Label their root with the sum p1 + p2
– Associate binary codes: 1 with the right branch and 0 with the left branch
Step 3: Create a unique codeword for each symbol by traversing the tree from the root to the leaf.
– Concatenate all 0s and 1s encountered during the traversal
• The resulting tree has a probability of 1 in its root and the symbols in its leaf nodes.
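The steps above can be sketched in Python. This is a hedged illustration using a binary heap instead of re-sorting the whole forest each pass (the result is the same); the helper name `huffman_codes` and the use of the probability table from the following example slide are assumptions for concreteness.

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman codewords from a symbol -> frequency mapping.

    A tree is either a symbol (leaf) or a (left, right) pair. Heap
    entries carry a unique tie-breaker so trees are never compared.
    """
    heap = [(w, i, s) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two least-probable trees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))  # label root p1+p2
        count += 1

    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: recurse
            walk(tree[0], prefix + "0")   # 0 on the left branch
            walk(tree[1], prefix + "1")   # 1 on the right branch
        else:
            codes[tree] = prefix or "0"   # single-symbol edge case
    _, _, root = heap[0]
    walk(root, "")
    return codes

# Probabilities from the 7-symbol example on the next slide.
freqs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
         "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(freqs)
```

Tie-breaking can vary the individual codewords, but any valid Huffman tree for this table yields the same optimal average code length (2.6 bits per symbol).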
Example
• Consider the 7-symbol alphabet given in the following table and construct the Huffman coding.

Symbol  Probability
a       0.05
b       0.05
c       0.1
d       0.2
e       0.3
f       0.2
g       0.1

• The Huffman encoding algorithm picks, each time, the two symbols with the smallest frequencies to combine.
Huffman code tree

[Figure: the Huffman tree for the table above. Internal nodes carry the summed probabilities (0.1, 0.2, 0.3, 0.4, 0.6 and 1 at the root), with 0 on left branches and 1 on right branches; the leaves are the symbols a, b, c, d, e, f, g.]

• Using the Huffman tree, a codeword can be generated by working down the tree from the root to each leaf. This gives the binary equivalent for each symbol in terms of 1s and 0s.
A Simple Coding Example
• We'll look at how the string "go go gophers" is encoded in ASCII, how we might save bits using a simpler coding scheme, and how Huffman coding is used to compress the data, resulting in still more savings.
• With an ASCII encoding (8 bits per character), the 13-character string "go go gophers" requires 104 bits.

• The string "go go gophers" would be written (coded numerically, low 7 bits shown) as:
• 1100111 1101111 0100000 1100111 1101111 0100000 1100111 1101111 1110000 1101000 1100101 1110010 1110011
A Simple Coding Example

• With a Huffman code built from the character frequencies (here g = 10, o = 11, space = 001, s = 000, p = 0100, h = 0101, e = 0110, r = 0111), the same string becomes:
• 10 11 001 10 11 001 10 11 0100 0101 0110 0111 000

• This is a total of 37 bits.
• The bits are saved by coding frequently occurring characters like 'g' and 'o' with fewer bits (here two bits) than characters that occur less frequently, like 'p', 'h', 'e', and 'r'.
Example

• Consider the symbols given in the following table (shown as a figure in the original slide) and construct the Huffman coding.
Example
• Sort this list by frequency and make the two lowest elements into leaves, creating a parent node with a frequency that is the sum of the two lower elements' frequencies.
• The two elements are removed from the list and the new parent node, with frequency 12, is inserted into the list by frequency. So now the list, sorted by frequency, is:
Example

• You then repeat the loop, combining the two lowest elements. This results in a new parent node, and the list is updated accordingly.
• You repeat this until there is only one element left in the list.
Exercise
1. Given the following, apply the Huffman algorithm
to find an optimal binary code:

Character: a b c d e t
Frequency: 16 5 12 17 10 25

2. Given text:
“for each rose, a rose is a rose”
Compress the above text at word level using
Huffman coding
Lempel-Ziv compression

• The problem with Huffman coding is that it requires knowledge about the data before encoding takes place.
– Huffman coding requires the frequencies of symbol occurrence before codewords are assigned to symbols.

• Ziv-Lempel compression
– does not rely on previous knowledge about the data,
– rather, it builds this knowledge in the course of data transmission/data storage.
– The Ziv-Lempel algorithm (called LZ) uses a table of codewords created during data transmission;
• each time, it replaces strings of characters with a reference to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm

• The multi-symbol patterns are of the form C0C1…Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1…Cn-1.

• Lempel-Ziv output: there are three options in assigning a code to each symbol in the list:
– If a one-symbol pattern is not in the dictionary, assign (0, symbol)
– If a multi-symbol pattern is not in the dictionary, assign (dictionary prefix index, last pattern symbol)
– If the last input symbol or the last pattern is in the dictionary, assign (dictionary prefix index)
Example: LZ Compression

Encode (i.e., compress) the string ABBCBCABABCAABCAAB using the LZ algorithm.

The compressed message is: 0A0B2C3A2A4A6B
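The three output rules can be sketched as a small LZ78-style encoder. This is a hedged illustration, not the exact lecture procedure; the helper name `lz78_encode` is an assumption. On the slide's input it reproduces the compressed message shown above.

```python
def lz78_encode(text):
    """LZ78-style encoding: emit (dictionary prefix index, symbol) pairs.

    Index 0 means "empty prefix"; dictionary entries are numbered from 1.
    If the input ends inside a known phrase, a bare (index,) is emitted.
    """
    dictionary = {}          # phrase -> dictionary index
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                              # grow the current match
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                                        # third rule: pattern known
        output.append((dictionary[phrase],))
    return output

encoded = lz78_encode("ABBCBCABABCAABCAAB")
rendered = "".join(str(p[0]) + (p[1] if len(p) > 1 else "") for p in encoded)
print(rendered)  # 0A0B2C3A2A4A6B
```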
Example: Decompression
Decode (i.e., decompress) the sequence: 0A0B2C3A2A4A6B

The decompressed message is: ABBCBCABABCAABCAAB
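Decoding mirrors encoding: each (index, symbol) pair expands to a stored phrase plus the new symbol, and is itself added to the dictionary. A minimal sketch (the helper name `lz78_decode` is an assumption):

```python
def lz78_decode(pairs):
    """Invert (prefix index, symbol) pairs back into the original text."""
    dictionary = [""]                 # index 0 is the empty prefix
    out = []
    for pair in pairs:
        prefix = dictionary[pair[0]]
        phrase = prefix + (pair[1] if len(pair) > 1 else "")
        dictionary.append(phrase)     # new entry gets the next index
        out.append(phrase)
    return "".join(out)

# The pairs behind the sequence 0A0B2C3A2A4A6B from the slide:
pairs = [(0, "A"), (0, "B"), (2, "C"), (3, "A"), (2, "A"), (4, "A"), (6, "B")]
decoded = lz78_decode(pairs)
print(decoded)  # ABBCBCABABCAABCAAB
```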
Exercise

• Encode (i.e., compress) the following strings using the Lempel-Ziv algorithm.

1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA.
Indexing: Basic Concepts
• Indexing is used to speed up access to desired information from the document collection as per the user's query, such that
– it enhances efficiency in terms of retrieval time: relevant documents are searched and retrieved quickly.
Example: an author catalog in a library
• An index file consists of records, called index entries.
– The usual unit for indexing is the word.
• Index terms are used to look up records in a file.

• Index files are much smaller than the original file. Do you agree?
• Remember…
– Remember Heaps' law: in a 1 GB text collection the size of the vocabulary is only 5 MB (Baeza-Yates and Ribeiro-Neto, 2005).
– This size may be further reduced by linguistic pre-processing (like stemming & other normalization methods).
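Heaps' law says vocabulary size grows as V = K·n^β in the number of tokens n. A back-of-the-envelope sketch of the 1 GB claim above; the constants K = 50, β = 0.5 and the ~7 bytes-per-token figure are illustrative assumptions, not values from the slides:

```python
def heaps_vocabulary_size(n_tokens, K=50, beta=0.5):
    """Heaps' law estimate V = K * n^beta.

    K and beta are collection-dependent; typical English values fall
    around K in 10..100 and beta in 0.4..0.6 (defaults are illustrative).
    """
    return K * n_tokens ** beta

# Assume 1 GB of text is roughly 150 million word tokens
# (about 7 bytes per token including the separator).
v = heaps_vocabulary_size(150_000_000)
print(round(v))
```

At roughly 600k distinct words and a handful of bytes per entry, the vocabulary indeed lands in the single-digit-megabyte range the slide cites.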
Major Steps in Index Construction
• Source file: collection of text documents
– A document can be described by a set of representative keywords called index terms.
• Index term selection:
– Tokenize: identify words in a document, so that each document is represented by a list of keywords or attributes.
– Stop words: removal of high-frequency words.
• A stop list of words is used for comparing against the input text.
– Stemming and normalization: reduce words with similar meaning into their stem/root word.
• Suffix stripping is the common method.
– Weighting terms: different index terms have varying importance when used to describe document contents.
• This effect is captured through the assignment of numerical weights to each index term of a document.
• There are different index term weighting methods (TF, DF, CF) based on which the TF*IDF weight can be calculated during searching.
• Output: a set of index terms (vocabulary) to be used for indexing the documents that each term occurs in.
Basic Indexing Process

Documents to be indexed: "Friends, Romans, countrymen."
→ Tokenizer → token stream: Friends Romans countrymen
→ Linguistic preprocessing → modified tokens: friend roman countryman
→ Indexer → index file (inverted file):
friend → 2, 4
roman → 1, 2
countryman → 13, 16
Building Index file
• An index file of a document collection is a file consisting of a list of index terms and a link to one or more documents that contain the index term.
– A good index file maps each keyword Ki to a set of documents Di that contain the keyword.

• An index file usually has its index terms in sorted order.
– The sort order of the terms in the index file provides an order on the physical file.

• An index file is a list of search terms that are organized for associative look-up, i.e., to answer the user's query:
– In which documents does a specified search term appear?
– Where within each document does each term appear? (There may be several occurrences.)

• For organizing an index file for a collection of documents, there are various options available:
– Decide what data structure and/or file structure to use. Is it a sequential file, inverted file, suffix array, signature file, etc.?
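The keyword-to-documents mapping described above can be sketched in a few lines. This is a minimal illustration: tokenization is a bare lowercase split standing in for the full pipeline (stopword removal, stemming), and the toy documents and the helper name `build_inverted_index` are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    # Index files usually keep terms in sorted order.
    return dict(sorted(index.items()))

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
index = build_inverted_index(docs)
# index["home"] -> {1: 1, 2: 1, 3: 1}; index["in"] -> {2: 1, 3: 2}
```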
Index file Evaluation Metrics
• Running time
– Indexing time
– Access/search time: does it allow sequential or random searching/access?
– Update time (insertion time, deletion time, modification time, …): can the indexing structure support re-indexing or incremental indexing?

• Space overhead
– Computer storage space consumed.
• Access types supported efficiently.
– Does the indexing structure allow access to:
• records with a specified term, or
• records with terms falling in a specified range of values?
Sequential File
• A sequential file is the most primitive file structure.
It has neither a vocabulary nor linking pointers.
• The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field.
– A particular attribute is chosen as the primary key, whose value determines the order of the records.
– When the first key fails to discriminate among records, a second key is chosen to give an order.
Example:
• Given a collection of documents, they are parsed to extract words and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are removed, and normalization and stemming are applied, to generate index terms.
• These index terms in the sequential file are sorted in alphabetical order.

Tokens (Term, Doc #): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sequential file (No., Term, Doc No.):
1  ambition  2
2  brutus    1
3  brutus    2
4  caesar    1
5  caesar    2
6  caesar    2
7  capitol   1
8  enact     1
9  julius    1
10 kill      1
11 kill      1
12 noble     2
Complexity Analysis

• Creating a sequential file requires O(n log n) time, where n is the total number of content-bearing words identified from the corpus.
• Since terms in a sequential file are sorted, the search time is logarithmic using binary search.
• Updating the index file needs re-indexing; that means incremental indexing is not possible.
Sequential File

• Its main advantages are:
– easy to implement;
– provides fast access to the next record using lexicographic order;
– instead of a linear-time search, one can search in logarithmic time using binary search.
• Its disadvantages:
– difficult to update. The index must be rebuilt if a new term is added. Inserting a new record may require moving a large proportion of the file;
– random access is extremely slow.
• The problem of update can be solved:
– by ordering records by date of acquisition, rather than by key value; hence, the newest entries are added at the end of the file and therefore pose no difficulty to updating. But searching becomes very tough: it requires linear time.
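The logarithmic search on a sorted sequential file can be illustrated with Python's `bisect` module. A hedged sketch: the (term, doc id) records are made up, and a real sequential file would live on disk rather than in a list.

```python
import bisect

# A toy sequential index file: (term, doc_id) records in sorted order.
records = sorted([("brutus", 1), ("brutus", 2), ("caesar", 1),
                  ("caesar", 2), ("capitol", 1), ("julius", 1)])

def lookup(term):
    """Binary search, O(log n): return all records with the given term."""
    lo = bisect.bisect_left(records, (term,))             # first >= (term,)
    hi = bisect.bisect_left(records, (term, float("inf")))  # past last match
    return records[lo:hi]

print(lookup("caesar"))  # [('caesar', 1), ('caesar', 2)]
```

Note the update problem the slide describes: inserting a new term into `records` while keeping it sorted shifts every later record, which is exactly why appending by acquisition date trades away the logarithmic search.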
Inverted file
• A technique that indexes based on a sorted list of terms, with each term having links to the documents containing it.
– Building and maintaining an inverted index is relatively low-cost and low-risk. On a text of n words an inverted index can be built in O(n) time.
• Content of the inverted file: data to be held in the inverted file includes:
• The vocabulary (list of terms)
• The occurrences (location and frequency of terms in the document collection)
• The occurrences: one record per term, listing
– Frequency of each term in a document:
• TFij, the number of occurrences of term tj in document di
• DFj, the number of documents containing tj
• maxi, the maximum frequency of any term in di
• N, the total number of documents in the collection
• CFj, the collection frequency of tj (its total number of occurrences in the collection)
– Locations/positions of words in the text
Inverted file
• Why vocabulary?
– Having information about the vocabulary (the list of terms) speeds searching for relevant documents.

• Why location?
– Having information about the location of each term within the document helps for:
• user interface design: highlighting the location of search terms
• proximity-based ranking: adjacency and NEAR operators (in Boolean searching)

• Why frequencies?
– Having information about frequency is used for:
• calculating term weights (like IDF, TF*IDF, …)
• optimizing query processing
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term   CF  Document ID  TF  Location
auto   3   2            1   66
           19           1   213
           29           1   45
bus    4   3            1   94
           19           2   7, 212
           22           1   56
taxi   1   5            1   43
train  3   11           2   3, 70
           34           1   40
Organization of Index File
• An inverted index consists of two files:
• a vocabulary file
• a postings file

Vocabulary (word list): each entry holds the term, the number of documents, the total frequency, and a pointer to the postings (inverted lists), which in turn point to the actual documents. E.g.:

Term   No of Doc  Tot freq
Act    3          3
Bus    3          4
pen    1          1
total  2          3
Inverted File
• Vocabulary file
– A vocabulary file (word list):
• stores all of the distinct terms (keywords) that appear in any of the documents (in lexicographical order), and
• for each word, a pointer to the postings file.
– The record kept for each term j in the word list contains: term j, DFj, CFj and a pointer to the postings file.
• Postings file (inverted list)
– For each distinct term in the vocabulary, it stores a list of pointers to the documents that contain that term.
– Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
– It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file.
• Each list consists of one or many individual postings holding the Document ID, TF and location information for a given term ti.
Construction of Inverted file

Advantage of dividing the inverted file:
• Keeping a pointer in the vocabulary to the list in the postings file allows:
– the vocabulary to be kept in memory at search time, even for a large text collection, and
– the postings file to be kept on disk for accessing the documents.

• Exercise:
– In a terabyte text collection, if one page is 100 KB and each page contains 250 words on average, calculate the memory space required for the vocabulary. Assume one word contains 10 characters.
Inverted index storage

• Separation of the inverted file into a vocabulary and a postings file is a good idea.
– Vocabulary: for searching purposes we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Hence, the size of the index is 100 MB, which can easily be held in the memory of a dedicated computer.
– The postings file requires much more space.
• For each word appearing in the text we keep statistical information related to the word's occurrence in documents.
• Each of the postings pointers to the documents requires an extra space of O(n).
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms.

Before sorting (Term, Doc #): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting (Term, Doc #): ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1, did 1, enact 1, hath 2, I 1, I 1, I 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Remove stop-words, apply stemming & compute term frequency

• Multiple entries of a term in a single document are merged and frequency information is added.
• Counting the number of occurrences of terms in the collection helps to compute TF.

Before merging (Term, Doc #): ambition 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, enact 1, julius 1, kill 1, kill 1, noble 2

After merging (Term, Doc #, TF):
ambition  2  1
brutus    1  1
brutus    2  1
capitol   1  1
caesar    1  1
caesar    2  2
enact     1  1
julius    1  1
kill      1  2
noble     2  1
Vocabulary and postings file
The file is commonly split into a Dictionary (vocabulary) and a Postings file. Each vocabulary entry holds a pointer into the postings file.

Vocabulary (Term, DF, CF):
ambition  1  1
brutus    2  2
capitol   1  1
caesar    2  3
enact     1  1
julius    1  1
kill      1  2
noble     1  1

Postings (Doc #, TF):
ambition → (2, 1)
brutus → (1, 1), (2, 1)
capitol → (1, 1)
caesar → (1, 1), (2, 2)
enact → (1, 1)
julius → (1, 1)
kill → (1, 2)
noble → (2, 1)
Complexity Analysis

• The inverted index can be built in O(n) + O(n log n) time, where n is the number of vocabulary terms.
• Since the terms in the vocabulary file are sorted, searching takes logarithmic time.
• To update the inverted index it is possible to apply incremental indexing, which requires O(k) time, where k is the number of new index terms.
Exercises/Assignment

• Construct the inverted index for the following document collection.

Doc 1: New home to home sales forecasts
Doc 2: Rise in home sales in July
Doc 3: Home sales rise in July for new homes
Doc 4: July new home sales rise
Suffix trie
• What is a suffix? A suffix is a substring that exists at the end of the given string.
– Each position in the text is considered as a text suffix.
– If txt = t1t2...ti...tn is a string, then Ti = ti ti+1...tn is the suffix of txt that starts at position i.
• Examples:
txt = mississippi: T1 = mississippi; T2 = ississippi; T3 = ssissippi; T4 = sissippi; T5 = issippi; T6 = ssippi; T7 = sippi; T8 = ippi; T9 = ppi; T10 = pi; T11 = i
txt = GOOGOL: T1 = GOOGOL; T2 = OOGOL; T3 = OGOL; T4 = GOL; T5 = OL; T6 = L
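The definition Ti = ti ti+1...tn translates directly into code. A minimal sketch (the helper name `suffixes` is an assumption) that enumerates the (position, suffix) pairs shown above:

```python
def suffixes(txt):
    """All text suffixes T_i = t_i ... t_n, as (1-based position, suffix)."""
    return [(i + 1, txt[i:]) for i in range(len(txt))]

print(suffixes("GOOGOL"))
# [(1, 'GOOGOL'), (2, 'OOGOL'), (3, 'OGOL'), (4, 'GOL'), (5, 'OL'), (6, 'L')]
```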
Suffix trie

• A suffix trie is an ordinary trie in which the input strings are all possible suffixes.
– Principle: the idea behind the suffix TRIE is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1, the last symbol has index n, the number of symbols in the text).
• To build the suffix TRIE we use these indices instead of the actual object.

• The structure has several advantages:
– We do not have to store the same object twice (no duplicates).
– Whatever the size of the index terms, the search time is linear in the length of the search string S.
Suffix Trie
• Construct a suffix TRIE for the following string: GOOGOL
• We begin by giving a position to every suffix in the text, starting from left to right, as per each character's position in the string:
TEXT:     G O O G O L $
POSITION: 1 2 3 4 5 6 7
• Build a suffix TRIE for all n suffixes of the text.
• Note: the resulting tree has n leaves and height n.
• This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
Suffix tree
• A suffix tree is a member of the trie family: it is a trie of all the proper suffixes of S.
– The suffix tree is created by compacting unary nodes of the suffix TRIE.
• We store pointers rather than words in the leaves.
– It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end indexes of the string, e.g. for GOOGOL$:
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
Example: Suffix tree
• Let s = abab; a suffix tree of s is a compressed trie of all suffixes of s = abab$:
1. abab$
2. bab$
3. ab$
4. b$
5. $
• We label each leaf with the starting position of the corresponding suffix.
Complexity Analysis

• The suffix tree for a string can be built in O(n²) time with this simple construction.
• The search time is proportional to the length of the search string S, i.e. O(|S|).
• Searching for a substring[1..m] in string[1..n] can be solved in O(m) time.
• Updating the index file can be done incrementally without affecting the existing index.
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
• To make the suffixes prefix-free we add a special character, $, at the end of s. To associate each suffix with a unique string in S, add a different special symbol to each s.
• Build a suffix tree for the string s1$s2#, where '$' and '#' are special terminators for s1 and s2.
• Example: let s1 = abab and s2 = aab; a generalized suffix tree for s1 and s2 indexes the suffixes:
s1: 1. abab$  2. bab$  3. ab$  4. b$  5. $
s2: 1. aab#  2. ab#  3. b#  4. #
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of S is the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
– Start at the root.
– Go down the tree, taking each time the corresponding path.
– If S corresponds to a node x, then return all leaves in the sub-tree rooted at x.
• The places where S can be found are given by the pointers in all the leaves in the sub-tree rooted at x.
– If S encounters a NIL pointer before reaching the end, then S is not in the tree.
Example:
• If S = "GO" we take the GO path and return: GOOGOL$, GOL$.
• If S = "OR" we take the O path and then we hit a NIL pointer, so "OR" is not in the tree.
Drawbacks

• Suffix trees consume a lot of space.
– Even if only word beginnings are indexed, a space overhead of 120% - 240% over the text size is produced, because, depending on the implementation, each node of the suffix tree takes space (in bytes) equivalent to the number of symbols used.