Chapter Four Indexing Structure

The document discusses indexing and searching in information retrieval systems. It explains that indexing is an offline process that organizes documents using extracted keywords to speed up searching. Searching is an online process that scans documents to find relevant matches to user queries. Compression techniques like Huffman coding are used to reduce the storage space needed for indexing by assigning shorter codes to more frequent symbols. The document provides an example of how Huffman coding assigns variable-length binary codes to symbols based on their frequency in a collection of text.

Chapter Four

Indexing Structure
Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
– The concern here is retrieving more relevant documents for the user's query
– Effectiveness of the system is measured in terms of precision, recall, …
– Main emphasis: stemming, stopword removal, weighting schemes, matching algorithms

• Improving the efficiency of the system
– The concern here is reducing the storage space requirement and enhancing searching time, indexing time, access time, …
– Main emphasis: compression, indexing structures, space-time tradeoffs
Subsystems of IR system

The two subsystems of an IR system: Indexing and Searching
– Indexing:
• is an offline process of organizing documents using keywords extracted from the collection
• is used to speed up access to desired information from the document collection as per the user's query
– Searching:
• is an online process that scans the document corpus to find relevant documents that match the user's query
Indexing Subsystem

documents
→ Assign document identifier → documents with document IDs
→ Tokenization → tokens
→ Stopword removal → non-stoplist tokens
→ Stemming & normalization → stemmed terms
→ Term weighting → weighted index terms
→ Index File
Searching Subsystem

query
→ Parse query → query tokens
→ Stopword removal → non-stoplist tokens
→ Stemming & normalization → stemmed terms
→ Term weighting → query terms
→ Similarity measure (against the index terms in the index file) → relevant document set
→ Ranking → ranked document set
Basic assertion
Indexing and searching are:
– inexorably connected
– you cannot search what was not first indexed in some manner or other
– indexing of documents or objects is done in order to make them searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing language
Knowing searching is knowing indexing
Implementation Issues
• Storage of text:
– The need for text compression: to reduce storage space
• Indexing text
– Organizing indexes
• What techniques to use? How to select them?
– Storage of indexes
• Is compression required? Do we store in memory or on disk?
• Accessing text
– Accessing indexes
• How to access indexes? What data/file structure to use?
– Processing indexes
• How to search a given query in the index? How to update the index?
– Accessing documents
Text Compression

• Text compression is about finding ways to represent the text in fewer bits or bytes so that the file size is reduced.

• Advantages:
– Saves storage space.
– Speeds up document transmission.
– Searching the compressed text can take less time.

• Disadvantages:
– Consumes computational resources (both memory space and processor running time).
Common compression methods

• Static methods:
– Statistical methods which require statistical information about the frequency of occurrence of symbols in the document.
E.g. Huffman coding
– Two-pass algorithm:
• Estimate probabilities of symbols,
• Encode symbols, generating codewords; usually shorter codes for symbols with high probabilities.

• Adaptive methods:
– Dictionary-based methods which construct the dictionary in the course of compression.
E.g. Ziv-Lempel compression:
– One-pass algorithm:
• Encode symbols to generate codewords,
• Replacing words or symbols with a pointer to dictionary entries.
Huffman coding

• Developed in the 1950s by David Huffman; widely used for text compression, multimedia codecs and message transmission.
• The problem: given a set of n symbols and their weights (or frequencies), construct a tree structure (a binary tree for binary codes) with the objective of reducing memory space and decoding time per symbol.
• Huffman coding is constructed based on the frequency of occurrence of letters in text documents.
• Example (tree shown in the slide figure, with 0 on left branches and 1 on right branches), codewords:
D1 = 000
D2 = 001
D3 = 01
D4 = 1
How to construct Huffman coding

Step 1: Create a forest of trees, one for each symbol: t1, t2, … tn
Step 2: WHILE more than one tree exists DO
– Sort the forest of trees according to falling probabilities of symbol occurrence
– Merge the two trees t1 and t2 with the least probabilities p1 and p2
– Label their root with the sum p1 + p2
– Associate binary codes: 1 with the right branch and 0 with the left branch
Step 3: Create a unique codeword for each symbol by traversing the tree from the root to the leaf.
– Concatenate all 0s and 1s encountered during the traversal
• The resulting tree has a probability of 1 in its root and the symbols in its leaf nodes.
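The steps above can be sketched in Python. This is a hedged illustration using a binary heap instead of re-sorting the whole forest each pass (the result is the same); the helper name `huffman_codes` and the use of the probability table from the following example slide are assumptions for concreteness.

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman codewords from a symbol -> frequency mapping.

    A tree is either a symbol (leaf) or a (left, right) pair. Heap
    entries carry a unique tie-breaker so trees are never compared.
    """
    heap = [(w, i, s) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two least-probable trees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))  # label root p1+p2
        count += 1

    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: recurse
            walk(tree[0], prefix + "0")   # 0 on the left branch
            walk(tree[1], prefix + "1")   # 1 on the right branch
        else:
            codes[tree] = prefix or "0"   # single-symbol edge case
    _, _, root = heap[0]
    walk(root, "")
    return codes

# Probabilities from the 7-symbol example on the next slide.
freqs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2,
         "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(freqs)
```

Tie-breaking can vary the individual codewords, but any valid Huffman tree for this table yields the same optimal average code length (2.6 bits per symbol).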
Example
• Consider the 7-symbol alphabet given in the following table and construct the Huffman coding.

Symbol  Probability
a       0.05
b       0.05
c       0.1
d       0.2
e       0.3
f       0.2
g       0.1

• The Huffman encoding algorithm picks, each time, the two symbols with the smallest frequencies to combine.
Huffman code tree

[Figure: the Huffman tree for the table above. Internal nodes carry the summed probabilities (0.1, 0.2, 0.3, 0.4, 0.6 and 1 at the root), with 0 on left branches and 1 on right branches; the leaves are the symbols a, b, c, d, e, f, g.]

• Using the Huffman tree, a codeword can be generated by working down the tree from the root to each leaf. This gives the binary equivalent for each symbol in terms of 1s and 0s.
A Simple Coding Example
• We'll look at how the string "go go gophers" is encoded in ASCII, how we might save bits using a simpler coding scheme, and how Huffman coding is used to compress the data, resulting in still more savings.
• With an ASCII encoding (8 bits per character), the 13-character string "go go gophers" requires 104 bits.

• The string "go go gophers" would be written (coded numerically, low 7 bits shown) as:
• 1100111 1101111 0100000 1100111 1101111 0100000 1100111 1101111 1110000 1101000 1100101 1110010 1110011
A Simple Coding Example

• With a Huffman code built from the character frequencies (here g = 10, o = 11, space = 001, s = 000, p = 0100, h = 0101, e = 0110, r = 0111), the same string becomes:
• 10 11 001 10 11 001 10 11 0100 0101 0110 0111 000

• This is a total of 37 bits.
• The bits are saved by coding frequently occurring characters like 'g' and 'o' with fewer bits (here two bits) than characters that occur less frequently, like 'p', 'h', 'e', and 'r'.
Example

• Consider the symbols given in the following table (shown as a figure in the original slide) and construct the Huffman coding.
Example
• Sort this list by frequency and make the two lowest elements into leaves, creating a parent node with a frequency that is the sum of the two lower elements' frequencies.
• The two elements are removed from the list and the new parent node, with frequency 12, is inserted into the list by frequency. So now the list, sorted by frequency, is:
Example

• You then repeat the loop, combining the two lowest elements. This results in a new parent node, and the list is updated accordingly.
• You repeat this until there is only one element left in the list.
Exercise
1. Given the following, apply the Huffman algorithm
to find an optimal binary code:

Character: a b c d e t
Frequency: 16 5 12 17 10 25

2. Given text:
“for each rose, a rose is a rose”
Compress the above text at word level using
Huffman coding
Lempel-Ziv compression

• The problem with Huffman coding is that it requires knowledge about the data before encoding takes place.
– Huffman coding requires the frequencies of symbol occurrence before codewords are assigned to symbols.

• Ziv-Lempel compression
– does not rely on previous knowledge about the data,
– rather, it builds this knowledge in the course of data transmission/data storage.
– The Ziv-Lempel algorithm (called LZ) uses a table of codewords created during data transmission;
• each time, it replaces strings of characters with a reference to a previous occurrence of the string.
Lempel-Ziv Compression Algorithm

• The multi-symbol patterns are of the form C0C1…Cn-1Cn. The prefix of a pattern consists of all the pattern symbols except the last: C0C1…Cn-1.

• Lempel-Ziv output: there are three options in assigning a code to each symbol in the list:
– If a one-symbol pattern is not in the dictionary, assign (0, symbol)
– If a multi-symbol pattern is not in the dictionary, assign (dictionary prefix index, last pattern symbol)
– If the last input symbol or the last pattern is in the dictionary, assign (dictionary prefix index)
Example: LZ Compression

Encode (i.e., compress) the string ABBCBCABABCAABCAAB using the LZ algorithm.

The compressed message is: 0A0B2C3A2A4A6B
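The three output rules can be sketched as a small LZ78-style encoder. This is a hedged illustration, not the exact lecture procedure; the helper name `lz78_encode` is an assumption. On the slide's input it reproduces the compressed message shown above.

```python
def lz78_encode(text):
    """LZ78-style encoding: emit (dictionary prefix index, symbol) pairs.

    Index 0 means "empty prefix"; dictionary entries are numbered from 1.
    If the input ends inside a known phrase, a bare (index,) is emitted.
    """
    dictionary = {}          # phrase -> dictionary index
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                              # grow the current match
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                                        # third rule: pattern known
        output.append((dictionary[phrase],))
    return output

encoded = lz78_encode("ABBCBCABABCAABCAAB")
rendered = "".join(str(p[0]) + (p[1] if len(p) > 1 else "") for p in encoded)
print(rendered)  # 0A0B2C3A2A4A6B
```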
Example: Decompression
Decode (i.e., decompress) the sequence: 0A0B2C3A2A4A6B

The decompressed message is: ABBCBCABABCAABCAAB
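Decoding mirrors encoding: each (index, symbol) pair expands to a stored phrase plus the new symbol, and is itself added to the dictionary. A minimal sketch (the helper name `lz78_decode` is an assumption):

```python
def lz78_decode(pairs):
    """Invert (prefix index, symbol) pairs back into the original text."""
    dictionary = [""]                 # index 0 is the empty prefix
    out = []
    for pair in pairs:
        prefix = dictionary[pair[0]]
        phrase = prefix + (pair[1] if len(pair) > 1 else "")
        dictionary.append(phrase)     # new entry gets the next index
        out.append(phrase)
    return "".join(out)

# The pairs behind the sequence 0A0B2C3A2A4A6B from the slide:
pairs = [(0, "A"), (0, "B"), (2, "C"), (3, "A"), (2, "A"), (4, "A"), (6, "B")]
decoded = lz78_decode(pairs)
print(decoded)  # ABBCBCABABCAABCAAB
```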
Exercise

• Encode (i.e., compress) the following strings using the Lempel-Ziv algorithm.

1. Mississippi
2. ABBCBCABABCAABCAAB
3. SATATASACITASA.
Indexing: Basic Concepts
• Indexing is used to speed up access to desired information from the document collection as per the user's query, such that
– it enhances efficiency in terms of retrieval time: relevant documents are searched and retrieved quickly.
Example: an author catalog in a library
• An index file consists of records, called index entries.
– The usual unit for indexing is the word.
• Index terms are used to look up records in a file.

• Index files are much smaller than the original file. Do you agree?
• Remember…
– Remember Heaps' law: in a 1 GB text collection the size of the vocabulary is only 5 MB (Baeza-Yates and Ribeiro-Neto, 2005).
– This size may be further reduced by linguistic pre-processing (like stemming & other normalization methods).
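Heaps' law says vocabulary size grows as V = K·n^β in the number of tokens n. A back-of-the-envelope sketch of the 1 GB claim above; the constants K = 50, β = 0.5 and the ~7 bytes-per-token figure are illustrative assumptions, not values from the slides:

```python
def heaps_vocabulary_size(n_tokens, K=50, beta=0.5):
    """Heaps' law estimate V = K * n^beta.

    K and beta are collection-dependent; typical English values fall
    around K in 10..100 and beta in 0.4..0.6 (defaults are illustrative).
    """
    return K * n_tokens ** beta

# Assume 1 GB of text is roughly 150 million word tokens
# (about 7 bytes per token including the separator).
v = heaps_vocabulary_size(150_000_000)
print(round(v))
```

At roughly 600k distinct words and a handful of bytes per entry, the vocabulary indeed lands in the single-digit-megabyte range the slide cites.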
Major Steps in Index Construction
• Source file: collection of text documents
– A document can be described by a set of representative keywords called index terms.
• Index term selection:
– Tokenize: identify words in a document, so that each document is represented by a list of keywords or attributes.
– Stop words: removal of high-frequency words.
• A stop list of words is used for comparing against the input text.
– Stemming and normalization: reduce words with similar meaning into their stem/root word.
• Suffix stripping is the common method.
– Weighting terms: different index terms have varying importance when used to describe document contents.
• This effect is captured through the assignment of numerical weights to each index term of a document.
• There are different index term weighting methods (TF, DF, CF) based on which the TF*IDF weight can be calculated during searching.
• Output: a set of index terms (vocabulary) to be used for indexing the documents that each term occurs in.
Basic Indexing Process

Documents to be indexed: "Friends, Romans, countrymen."
→ Tokenizer → token stream: Friends Romans countrymen
→ Linguistic preprocessing → modified tokens: friend roman countryman
→ Indexer → index file (inverted file):
friend → 2, 4
roman → 1, 2
countryman → 13, 16
Building Index file
• An index file of a document collection is a file consisting of a list of index terms and a link to one or more documents that contain the index term.
– A good index file maps each keyword Ki to a set of documents Di that contain the keyword.

• An index file usually has its index terms in sorted order.
– The sort order of the terms in the index file provides an order on the physical file.

• An index file is a list of search terms that are organized for associative look-up, i.e., to answer the user's query:
– In which documents does a specified search term appear?
– Where within each document does each term appear? (There may be several occurrences.)

• For organizing an index file for a collection of documents, there are various options available:
– Decide what data structure and/or file structure to use. Is it a sequential file, inverted file, suffix array, signature file, etc.?
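The keyword-to-documents mapping described above can be sketched in a few lines. This is a minimal illustration: tokenization is a bare lowercase split standing in for the full pipeline (stopword removal, stemming), and the toy documents and the helper name `build_inverted_index` are assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    # Index files usually keep terms in sorted order.
    return dict(sorted(index.items()))

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}
index = build_inverted_index(docs)
# index["home"] -> {1: 1, 2: 1, 3: 1}; index["in"] -> {2: 1, 3: 2}
```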
Index file Evaluation Metrics
• Running time
– Indexing time
– Access/search time: does it allow sequential or random searching/access?
– Update time (insertion time, deletion time, modification time, …): can the indexing structure support re-indexing or incremental indexing?

• Space overhead
– Computer storage space consumed.
• Access types supported efficiently.
– Does the indexing structure allow access to:
• records with a specified term, or
• records with terms falling in a specified range of values?
Sequential File
• A sequential file is the most primitive file structure.
It has neither a vocabulary nor linking pointers.
• The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field.
– A particular attribute is chosen as the primary key, whose value determines the order of the records.
– When the first key fails to discriminate among records, a second key is chosen to give an order.
Example:
• Given a collection of documents, they are parsed to extract words and these are saved with the Document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are removed, and normalization and stemming are applied, to generate index terms.
• These index terms in the sequential file are sorted in alphabetical order.

Tokens (Term, Doc #): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

Sequential file (No., Term, Doc No.):
1  ambition  2
2  brutus    1
3  brutus    2
4  caesar    1
5  caesar    2
6  caesar    2
7  capitol   1
8  enact     1
9  julius    1
10 kill      1
11 kill      1
12 noble     2
Complexity Analysis

• Creating a sequential file requires O(n log n) time, where n is the total number of content-bearing words identified from the corpus.
• Since terms in a sequential file are sorted, the search time is logarithmic using binary search.
• Updating the index file needs re-indexing; that means incremental indexing is not possible.
Sequential File

• Its main advantages are:
– easy to implement;
– provides fast access to the next record using lexicographic order;
– instead of a linear-time search, one can search in logarithmic time using binary search.
• Its disadvantages:
– difficult to update. The index must be rebuilt if a new term is added. Inserting a new record may require moving a large proportion of the file;
– random access is extremely slow.
• The problem of update can be solved:
– by ordering records by date of acquisition, rather than by key value; hence, the newest entries are added at the end of the file and therefore pose no difficulty to updating. But searching becomes very tough: it requires linear time.
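The logarithmic search on a sorted sequential file can be illustrated with Python's `bisect` module. A hedged sketch: the (term, doc id) records are made up, and a real sequential file would live on disk rather than in a list.

```python
import bisect

# A toy sequential index file: (term, doc_id) records in sorted order.
records = sorted([("brutus", 1), ("brutus", 2), ("caesar", 1),
                  ("caesar", 2), ("capitol", 1), ("julius", 1)])

def lookup(term):
    """Binary search, O(log n): return all records with the given term."""
    lo = bisect.bisect_left(records, (term,))             # first >= (term,)
    hi = bisect.bisect_left(records, (term, float("inf")))  # past last match
    return records[lo:hi]

print(lookup("caesar"))  # [('caesar', 1), ('caesar', 2)]
```

Note the update problem the slide describes: inserting a new term into `records` while keeping it sorted shifts every later record, which is exactly why appending by acquisition date trades away the logarithmic search.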
Inverted file
• A technique that indexes based on a sorted list of terms, with each term having links to the documents containing it.
– Building and maintaining an inverted index is relatively low-cost and low-risk. On a text of n words an inverted index can be built in O(n) time.
• Content of the inverted file: data to be held in the inverted file includes:
• The vocabulary (list of terms)
• The occurrences (location and frequency of terms in the document collection)
• The occurrences: one record per term, listing
– Frequency of each term in a document:
• TFij, the number of occurrences of term tj in document di
• DFj, the number of documents containing tj
• maxi, the maximum frequency of any term in di
• N, the total number of documents in the collection
• CFj, the collection frequency of tj (its total number of occurrences in the collection)
– Locations/positions of words in the text
Inverted file
• Why vocabulary?
– Having information about the vocabulary (the list of terms) speeds searching for relevant documents.

• Why location?
– Having information about the location of each term within the document helps for:
• user interface design: highlighting the location of search terms
• proximity-based ranking: adjacency and NEAR operators (in Boolean searching)

• Why frequencies?
– Having information about frequency is used for:
• calculating term weights (like IDF, TF*IDF, …)
• optimizing query processing
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term   CF  Document ID  TF  Location
auto   3   2            1   66
           19           1   213
           29           1   45
bus    4   3            1   94
           19           2   7, 212
           22           1   56
taxi   1   5            1   43
train  3   11           2   3, 70
           34           1   40
Organization of Index File
• An inverted index consists of two files:
• a vocabulary file
• a postings file

Vocabulary (word list): each entry holds the term, the number of documents, the total frequency, and a pointer to the postings (inverted lists), which in turn point to the actual documents. E.g.:

Term   No of Doc  Tot freq
Act    3          3
Bus    3          4
pen    1          1
total  2          3
Inverted File
• Vocabulary file
– A vocabulary file (word list):
• stores all of the distinct terms (keywords) that appear in any of the documents (in lexicographical order), and
• for each word, a pointer to the postings file.
– The record kept for each term j in the word list contains: term j, DFj, CFj and a pointer to the postings file.
• Postings file (inverted list)
– For each distinct term in the vocabulary, it stores a list of pointers to the documents that contain that term.
– Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
– It is stored as a separate inverted list for each column, i.e., a list corresponding to each term in the index file.
• Each list consists of one or many individual postings holding the Document ID, TF and location information for a given term ti.
Construction of Inverted file

Advantage of dividing the inverted file:
• Keeping a pointer in the vocabulary to the list in the postings file allows:
– the vocabulary to be kept in memory at search time, even for a large text collection, and
– the postings file to be kept on disk for accessing the documents.

• Exercise:
– In a terabyte text collection, if one page is 100 KB and each page contains 250 words on average, calculate the memory space required for the vocabulary. Assume one word contains 10 characters.
Inverted index storage

• Separation of the inverted file into a vocabulary and a postings file is a good idea.
– Vocabulary: for searching purposes we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Hence, the size of the index is 100 MB, which can easily be held in the memory of a dedicated computer.
– The postings file requires much more space.
• For each word appearing in the text we keep statistical information related to the word's occurrence in documents.
• Each of the postings pointers to the documents requires an extra space of O(n).
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms.

Before sorting (Term, Doc #): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, I 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

After sorting (Term, Doc #): ambitious 2, be 2, brutus 1, brutus 2, caesar 1, caesar 2, caesar 2, capitol 1, did 1, enact 1, hath 2, I 1, I 1, I 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2
Remove stop-words, apply stemming & compute term frequency

• Multiple entries of a term in a single document are merged and frequency information is added.
• Counting the number of occurrences of terms in the collection helps to compute TF.

Before merging (Term, Doc #): ambition 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, enact 1, julius 1, kill 1, kill 1, noble 2

After merging (Term, Doc #, TF):
ambition  2  1
brutus    1  1
brutus    2  1
capitol   1  1
caesar    1  1
caesar    2  2
enact     1  1
julius    1  1
kill      1  2
noble     2  1
Vocabulary and postings file
The file is commonly split into a Dictionary (vocabulary) and a Postings file. Each vocabulary entry holds a pointer into the postings file.

Vocabulary (Term, DF, CF):
ambition  1  1
brutus    2  2
capitol   1  1
caesar    2  3
enact     1  1
julius    1  1
kill      1  2
noble     1  1

Postings (Doc #, TF):
ambition → (2, 1)
brutus → (1, 1), (2, 1)
capitol → (1, 1)
caesar → (1, 1), (2, 2)
enact → (1, 1)
julius → (1, 1)
kill → (1, 2)
noble → (2, 1)
Complexity Analysis

• The inverted index can be built in O(n) + O(n log n) time, where n is the number of vocabulary terms.
• Since the terms in the vocabulary file are sorted, searching takes logarithmic time.
• To update the inverted index it is possible to apply incremental indexing, which requires O(k) time, where k is the number of new index terms.
Exercises/Assignment

• Construct the inverted index for the following document collection.

Doc 1: New home to home sales forecasts
Doc 2: Rise in home sales in July
Doc 3: Home sales rise in July for new homes
Doc 4: July new home sales rise
Suffix trie
• What is a suffix? A suffix is a substring that exists at the end of the given string.
– Each position in the text is considered as a text suffix.
– If txt = t1t2...ti...tn is a string, then Ti = ti ti+1...tn is the suffix of txt that starts at position i.
• Examples:
txt = mississippi: T1 = mississippi; T2 = ississippi; T3 = ssissippi; T4 = sissippi; T5 = issippi; T6 = ssippi; T7 = sippi; T8 = ippi; T9 = ppi; T10 = pi; T11 = i
txt = GOOGOL: T1 = GOOGOL; T2 = OOGOL; T3 = OGOL; T4 = GOL; T5 = OL; T6 = L
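The definition Ti = ti ti+1...tn translates directly into code. A minimal sketch (the helper name `suffixes` is an assumption) that enumerates the (position, suffix) pairs shown above:

```python
def suffixes(txt):
    """All text suffixes T_i = t_i ... t_n, as (1-based position, suffix)."""
    return [(i + 1, txt[i:]) for i in range(len(txt))]

print(suffixes("GOOGOL"))
# [(1, 'GOOGOL'), (2, 'OOGOL'), (3, 'OGOL'), (4, 'GOL'), (5, 'OL'), (6, 'L')]
```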
Suffix trie

• A suffix trie is an ordinary trie in which the input strings are all possible suffixes.
– Principle: the idea behind the suffix TRIE is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1, the last symbol has index n, the number of symbols in the text).
• To build the suffix TRIE we use these indices instead of the actual object.

• The structure has several advantages:
– We do not have to store the same object twice (no duplicates).
– Whatever the size of the index terms, the search time is linear in the length of the search string S.
Suffix Trie
• Construct a suffix TRIE for the following string: GOOGOL
• We begin by giving a position to every suffix in the text, starting from left to right, as per each character's position in the string:
TEXT:     G O O G O L $
POSITION: 1 2 3 4 5 6 7
• Build a suffix TRIE for all n suffixes of the text.
• Note: the resulting tree has n leaves and height n.
• This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
Suffix tree
• A suffix tree is a member of the trie family: it is a trie of all the proper suffixes of S.
– The suffix tree is created by compacting unary nodes of the suffix TRIE.
• We store pointers rather than words in the leaves.
– It is also possible to replace the string on every edge by a pair (a, b), where a and b are the beginning and end indexes of the string, e.g. for GOOGOL$:
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
Example: Suffix tree
• Let s = abab; a suffix tree of s is a compressed trie of all suffixes of s = abab$:
1. abab$
2. bab$
3. ab$
4. b$
5. $
• We label each leaf with the starting position of the corresponding suffix.
Complexity Analysis

• The suffix tree for a string can be built in O(n²) time with this simple construction.
• The search time is proportional to the length of the search string S, i.e. O(|S|).
• Searching for a substring[1..m] in string[1..n] can be solved in O(m) time.
• Updating the index file can be done incrementally without affecting the existing index.
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
• To make the suffixes prefix-free we add a special character, $, at the end of s. To associate each suffix with a unique string in S, add a different special symbol to each s.
• Build a suffix tree for the string s1$s2#, where '$' and '#' are special terminators for s1 and s2.
• Example: let s1 = abab and s2 = aab; a generalized suffix tree for s1 and s2 indexes the suffixes:
s1: 1. abab$  2. bab$  3. ab$  4. b$  5. $
s2: 1. aab#  2. ab#  3. b#  4. #
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of S is the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
– Start at the root.
– Go down the tree, taking each time the corresponding path.
– If S corresponds to a node x, then return all leaves in the sub-tree rooted at x.
• The places where S can be found are given by the pointers in all the leaves in the sub-tree rooted at x.
– If S encounters a NIL pointer before reaching the end, then S is not in the tree.
Example:
• If S = "GO" we take the GO path and return: GOOGOL$, GOL$.
• If S = "OR" we take the O path and then we hit a NIL pointer, so "OR" is not in the tree.
Drawbacks

• Suffix trees consume a lot of space.
– Even if only word beginnings are indexed, a space overhead of 120% - 240% over the text size is produced, because, depending on the implementation, each node of the suffix tree takes space (in bytes) equivalent to the number of symbols used.