Chapter 4 IR

Chapter Four discusses indexing structures in information retrieval systems, focusing on the design and organization of document databases. It highlights the importance of indexing for improving retrieval efficiency and effectiveness, detailing various file structures like inverted files and suffix trees. The chapter also outlines the major steps in index construction, including tokenization, stopword removal, and term weighting.

Uploaded by

bekeletamirat931


Information Retrieval and Storage

Chapter Four
Indexing structure
Target Group –IT 3rd year students

Injibara, Ethiopia
Brainstorming Questions

1. What is an indexing structure?

2. What is the difference between term weighting and an indexing structure?

3. What are inverted files?

4. What are suffix trees and suffix arrays?
Introduction
File structures:
• A fundamental decision in the design of IR systems is which type of file structure to use for the underlying document database.
• The file structures used in IR systems are:
– flat files,
– inverted files,
– signature files,
– PAT trees, and
– graphs.
• Though it is possible to keep file structures in main memory, in practice IR databases are usually stored on disk because of their size.
Designing an IR System
Our focus during IR system design is:
1. Improving the effectiveness of the system
– The concern here is retrieving more relevant documents for the user's query
– Effectiveness of the system is measured in terms of precision, recall, …
– Main emphasis: stemming, stop word removal, weighting schemes, matching algorithms
2. Improving the efficiency of the system
– The concern here is reducing the storage space requirement and shortening search time, indexing time, access time, …
– Main emphasis: compression, indexing structures, space-time tradeoffs
Subsystems of IR system
The two subsystems of an IR system:

–Indexing:

• is an offline process of organizing documents using keywords extracted from the collection

• is used to speed up access to desired information from the document collection as per the user's query

–Searching:

• is an online process that scans the document collection (through its index) to find relevant documents that match the user's query
Indexing Subsystem

documents
  → Assign document identifier → document IDs
  → Tokenization → tokens
  → Stopword removal → non-stoplist tokens
  → Stemming & Normalization → stemmed terms
  → Term weighting → weighted index terms
  → Index File
Searching Subsystem

query
  → Parse query → query tokens
  → Stop word removal → non-stoplist tokens
  → Stemming & Normalize → stemmed terms
  → Term weighting → query terms
  → Similarity Measure (query terms against index terms from the Index) → relevant document set
  → Ranking → ranked document set
Basic assertion
Indexing and searching are inexorably connected:
– you cannot search what was not first indexed in some manner or other
– documents or objects are indexed in order to be searchable
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing language
• Knowing searching is knowing indexing
Indexing: Basic Concepts
• Indexing is used to speed up access to desired information from a document collection as per the user's query, such that
– it enhances efficiency in terms of retrieval time
– relevant documents are searched and retrieved quickly
Example: author catalog in a library
• An index file consists of records, called index entries.
• Index files are much smaller than the original file.
– Remember Heaps' Law: in a 1 GB text collection the size of the vocabulary is only 5 MB (Baeza-Yates and Ribeiro-Neto, 2005)
– This size may be further reduced by linguistic pre-processing (like stemming and other normalization methods).
• The usual unit for indexing is the word
– Index terms are used to look up records in a file.
Major Steps in Index Construction
1. Source file: a collection of text documents
–A document can be described by a set of representative keywords called index terms.
2. Index term selection:
–Tokenize: identify the words in a document, so that each document is represented by a list of keywords or attributes
–Stop words: remove high-frequency words
• The input text is compared against a stop list of words
–Word stemming and normalization: reduce words with similar meaning to their stem/root word
• Suffix stripping is the common method
–Term relevance weight: different index terms have varying relevance when used to describe document contents.
• This effect is captured by assigning a numerical weight to each index term of a document.
• There are different index term weighting methods: TF, TF*IDF, …
3. Output: a set of index terms (vocabulary) to be used for indexing the documents that each term occurs in.
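The index term selection steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list `STOPWORDS` and the suffix rules in `SUFFIXES` are hypothetical and far smaller than what a real system (for example one using a Porter-style stemmer) would use, and raw term frequency stands in for the weighting step.

```python
import re
from collections import Counter

# Hypothetical stop list and suffix rules, chosen only for this illustration.
STOPWORDS = {"i", "did", "was", "the", "me", "so", "let", "it", "be",
             "with", "hath", "told", "you"}
SUFFIXES = ("ing", "ed")

def tokenize(text):
    """Step 1: lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Step 3: naive suffix stripping that keeps at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, drop stop words, stem, and count term frequency (TF)."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return Counter(stem(t) for t in tokens)
```

Calling `index_terms` on a document returns its index terms together with their raw TF, which is the input a weighting scheme such as TF*IDF would need.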
Basic Indexing Process
Documents to be indexed
  → Token stream
  → Linguistic preprocessing
  → Modified tokens (e.g., friend, roman, countryman)
  → Index File (Inverted file)
Building Index file
•An index file of a document collection is a file consisting of a list of index terms, each with a link to one or more documents that contain the index term
–A good index file maps each keyword Ki to the set of documents Di that contain the keyword
•An index file usually has its index terms in sorted order.
–The sort order of the terms in the index file provides an order on a physical file
•An index file is a list of search terms that are organized for associative look-up, i.e., to answer the user's queries:
–In which documents does a specified search term appear?
–Where within each document does each term appear? (There may be several occurrences.)
•For organizing the index file for a collection of documents, there are various options available:
–Decide what data structure and/or file structure to use: sequential file, inverted file, suffix array, signature file, etc.
Index file Evaluation Metrics
• Running time
–Indexing time
–Access/search time
–Update time (insertion time, deletion time, modification time, …)
• Space overhead
–Computer storage space consumed.
• Access types supported efficiently.
–Does the indexing structure allow access to:
• records with a specified term, or
• records with terms falling in a specified range of values?
Sequential File
•A sequential file is the most primitive file structure.
• It has no vocabulary and no linking pointers.
•The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field.
• A particular attribute is chosen as the primary key, whose value determines the order of the records.
• When the first key fails to discriminate among records, a second key is chosen to give an order.
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sorting the Vocabulary
After all documents have been tokenized, stopwords are removed and normalization and stemming are applied to generate index terms. These index terms are stored in the sequential file in alphabetical order.

Tokens in order of occurrence:

Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
I          1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2

Sequential file:

No.  Term      Doc No.
1    ambition  2
2    brutus    1
3    brutus    2
4    caesar    1
5    caesar    2
6    caesar    2
7    capitol   1
8    enact     1
9    julius    1
10   kill      1
11   kill      1
12   noble     2
Sequential File
• Its main advantages are:
– easy to implement;
– provides fast access to the next record using lexicographic order;
– instead of linear-time search, one can search in logarithmic time using binary search.
• Its disadvantages:
– difficult to update. The index must be rebuilt if a new term is added. Inserting a new record may require moving a large proportion of the file;
– random access is extremely slow.
• The problem of update can be solved:
– by ordering records by date of acquisition rather than by key value; hence, the newest entries are added at the end of the file and therefore pose no difficulty for updating.
– But searching then becomes very costly; it requires linear time.
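As a rough illustration of the logarithmic-time lookup, the sketch below keeps the sequential file as a sorted in-memory list of (term, doc) records and uses Python's standard `bisect` module. The names `RECORDS` and `lookup` are hypothetical, and a list in memory is only a stand-in for a sorted file on disk.

```python
from bisect import bisect_left

# Sequential-file records (term, doc), kept sorted by term and then by doc.
RECORDS = sorted([
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
    ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2),
])

def lookup(records, term):
    """Return the sorted doc IDs for term using two binary searches."""
    # (term,) sorts before every (term, doc) record for that term.
    lo = bisect_left(records, (term,))
    # (term, inf) sorts after every (term, doc) record for that term.
    hi = bisect_left(records, (term, float("inf")))
    return sorted({doc for _, doc in records[lo:hi]})
```

For example, `lookup(RECORDS, "caesar")` touches only the three adjacent caesar records. Inserting a new record into the middle of `RECORDS` would shift everything after it, which mirrors the update cost listed among the disadvantages.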
Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
–Building and maintaining an inverted index is a relatively low-cost task. On a text of n words, an inverted index can be built in O(n) time.
• Content of the inverted file:
–Data to be held in the inverted file includes:
• The vocabulary (list of terms)
• The occurrences (location and frequency of terms in a document collection)
Inverted file
• The occurrences: contain one record per term, listing
–Frequency of each term in a document, i.e., the number of occurrences of a keyword in a document
• TFij, number of occurrences of term tj in document di
• DFj, number of documents containing tj
• maxi, maximum frequency of any term in di
• N, total number of documents in the collection
• CFj, collection frequency of tj (its total number of occurrences in the collection)
• …
–Locations/positions of words in the text
Inverted file
•Why vocabulary?
–Having information about vocabulary (list of terms) speeds
searching for relevant documents
•Why location?
– Having information about the location of each term within the
document helps for:
•user interface design: highlight location of search term
•proximity based ranking: adjacency and near operators (in
Boolean searching)
•Why frequencies?
•Having information about frequency is used for:
–calculating term weighting (like TF, TF*IDF, …)
–optimizing query processing
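To show how the stored frequencies feed term weighting, here is a minimal sketch of one common TF*IDF variant: TF normalized by the maximum term frequency in the document, multiplied by log2(N / DF). The exact formula is an assumption, since the slides name TF*IDF without fixing a variant, and the function name `tf_idf` is hypothetical.

```python
import math

def tf_idf(tf, max_tf, df, n_docs):
    """Weight of a term in a document: (TF / max TF) * log2(N / DF)."""
    return (tf / max_tf) * math.log2(n_docs / df)
```

With two documents, a term that occurs in both (DF = 2, N = 2) gets weight 0 under this variant because it does not discriminate between them, while a term found in only one document keeps its normalized TF.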
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term   CF  Document ID  TF  Location
auto   3   2            1   66
           19           1   213
           29           1   45
bus    4   3            1   94
           19           2   7, 212
           22           1   56
taxi   1   5            1   43
train  3   11           2   3, 70
           34           1   40
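A table of this shape can be produced with a small positional index. The sketch below is a simplified illustration: it assumes whitespace tokenization and 1-based word positions as the "location", and the function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} using 1-based word positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def stats(index, term):
    """Return (CF, {doc_id: (TF, locations)}) for one term."""
    postings = index.get(term, {})
    cf = sum(len(positions) for positions in postings.values())
    return cf, {doc: (len(p), p) for doc, p in postings.items()}

# Hypothetical two-document collection keyed by document ID.
docs = {19: "bus stops near the bus depot", 22: "the bus is late"}
index = build_positional_index(docs)
```

Here `stats(index, "bus")` reports the collection frequency of "bus" together with its TF and locations in each document, i.e., one row group of the table.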
Construction of Inverted file
• An inverted index consists of two files:
–Vocabulary file
–Posting file

Advantage of dividing the inverted file:
• Keeping a pointer in the vocabulary to the list in the posting file allows:
– the vocabulary to be kept in memory at search time, even for a large text collection, and
– the posting file to be kept on disk for access to the documents
Inverted index storage
•Separating the inverted file into a vocabulary and a posting file is a good idea.
–Vocabulary: for searching purposes we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Hence, the size of the vocabulary is about 100 MB, which can easily be held in the memory of a dedicated computer.

–The posting file requires much more space.

• For each word appearing in the text we keep statistical information related to the word's occurrences in documents.
• The postings (pointers to documents) require extra space of O(n).
•How to speed up access to inverted file?
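The sublinear vocabulary growth can be made concrete with a small calculation based on Heaps' law, V = K · n^β. The constants below (K = 40, β = 0.5) are assumptions picked only for illustration; real values are fitted to each collection.

```python
def heaps_vocabulary(n_tokens, k=40.0, beta=0.5):
    """Estimate vocabulary size V = K * n^beta (K and beta are assumed)."""
    return k * n_tokens ** beta
```

Under these assumed constants, quadrupling the text size only doubles the estimated vocabulary, which is why the vocabulary can stay in memory while the posting file grows with the text.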
Vocabulary file
A vocabulary file (Word list):
–stores all of the distinct terms (keywords) that
appear in any of the documents (in lexicographical
order) and
–For each word a pointer to posting file
The record kept for each term j in the word list contains the following:
–term j
–number of documents in which term j occurs (DFj)
–Total frequency of term j (CFj)
–pointer to postings (inverted) list for term j
Postings File (Inverted List)
• For each distinct term in the vocabulary, the postings file stores a list of pointers to the documents that contain that term.
• Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
• It is stored as a separate inverted list for each term, i.e., a list corresponding to each term in the index file.
• Each list consists of one or many individual postings holding the document ID, TF and location information for a given term.
Organization of Index File

Vocabulary (word list) → Postings (inverted list) → Actual documents

Vocabulary:
Term   No. of docs  Total freq  Pointer to posting
Act    3            3           → inverted list
Bus    3            4           → inverted list
pen    1            1           → inverted list
total  2            3           → inverted list
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by terms.

Tokens in order of occurrence      Sorted by term
Term       Doc #                   Term       Doc #
I          1                       ambitious  2
did        1                       be         2
enact      1                       brutus     1
julius     1                       brutus     2
caesar     1                       caesar     1
I          1                       caesar     2
was        1                       caesar     2
killed     1                       capitol    1
I          1                       did        1
the        1                       enact      1
capitol    1                       hath       2
brutus     1                       I          1
killed     1                       I          1
me         1                       I          1
so         2                       it         2
let        2                       julius     1
it         2                       killed     1
be         2                       killed     1
with       2                       let        2
caesar     2                       me         1
the        2                       noble      2
noble      2                       so         2
brutus     2                       the        1
hath       2                       the        2
told       2                       told       2
you        2                       was        1
caesar     2                       was        2
was        2                       with       2
ambitious  2                       you        2
Remove stopwords, apply stemming & compute term frequency

Multiple term entries in a single document are merged and frequency information is added. Counting the number of occurrences of terms in the collection helps to compute TF.

Term      Doc #          Term      Doc #  TF
ambition  2              ambition  2      1
brutus    1              brutus    1      1
brutus    2              brutus    2      1
caesar    1              caesar    1      1
caesar    2              caesar    2      2
caesar    2              capitol   1      1
capitol   1              enact     1      1
enact     1              julius    1      1
julius    1              kill      1      2
kill      1              noble     2      1
kill      1
noble     2
Vocabulary and postings file
The file is commonly split into a Dictionary (vocabulary) and a Posting file; each vocabulary entry holds a pointer to its postings.

Merged entries:
Term      Doc #  TF
ambition  2      1
brutus    1      1
brutus    2      1
caesar    1      1
caesar    2      2
capitol   1      1
enact     1      1
julius    1      1
kill      1      2
noble     2      1

Vocabulary            Posting
Term      DF  CF      Doc #  TF
ambition  1   1   →   2      1
brutus    2   2   →   1      1
                      2      1
caesar    2   3   →   1      1
                      2      2
capitol   1   1   →   1      1
enact     1   1   →   1      1
julius    1   1   →   1      1
kill      1   2   →   1      2
noble     1   1   →   2      1
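The split into a vocabulary and a posting file can be sketched as follows. This is an illustrative layout, not a prescribed file format: the vocabulary is a dictionary holding (DF, CF, offset) per term, the "pointer" is simply an offset into a flat postings list, and the names `split_index` and `postings_for` are hypothetical.

```python
def split_index(records):
    """Split sorted (term, doc_id, tf) triples into vocabulary and postings."""
    vocabulary = {}   # term -> (DF, CF, offset of first posting)
    postings = []     # flat list of (doc_id, tf), grouped by term
    for term, doc_id, tf in records:
        if term not in vocabulary:
            vocabulary[term] = (0, 0, len(postings))
        df, cf, offset = vocabulary[term]
        vocabulary[term] = (df + 1, cf + tf, offset)
        postings.append((doc_id, tf))
    return vocabulary, postings

def postings_for(term, vocabulary, postings):
    """Follow the pointer stored in the vocabulary to the term's postings."""
    df, _, offset = vocabulary[term]
    return postings[offset:offset + df]

# Merged (term, doc, TF) entries from the running example, sorted by term.
RECORDS = [
    ("ambition", 2, 1), ("brutus", 1, 1), ("brutus", 2, 1), ("caesar", 1, 1),
    ("caesar", 2, 2), ("capitol", 1, 1), ("enact", 1, 1), ("julius", 1, 1),
    ("kill", 1, 2), ("noble", 2, 1),
]
vocabulary, postings = split_index(RECORDS)
```

Only `vocabulary` needs to live in memory at search time; `postings` plays the role of the on-disk posting file and is read starting at the stored offset.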
Exercises
• Construct the inverted index for the following
document collections.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise
Suffix trie
What is a suffix? A suffix is a substring that ends at the end of the given string.
– Each position in the text is considered as a text suffix
– If txt = t1 t2 ... ti ... tn is a string, then Ti = ti ti+1 ... tn is the suffix of txt that starts at position i
• Example:
txt = mississippi        txt = GOOGOL
T1  = mississippi        T1 = GOOGOL
T2  = ississippi         T2 = OOGOL
T3  = ssissippi          T3 = OGOL
T4  = sissippi           T4 = GOL
T5  = issippi            T5 = OL
T6  = ssippi             T6 = L
T7  = sippi
T8  = ippi
T9  = ppi
T10 = pi
T11 = i
Suffix trie
• A suffix trie is an ordinary trie in which the input strings are all possible suffixes.
–Principle: the idea behind the suffix TRIE is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
• To build the suffix TRIE we use these indices instead of the actual object.
• The structure has several advantages:
–It requires less storage space.
–We do not have to worry about how the text is represented (binary, ASCII, etc.).
–We do not have to store the same object twice (no duplicates).
Suffix Trie
•Construct a SUFFIX TRIE for the following string: GOOGOL
•We begin by giving a position to every suffix in the text, from left to right, following the order of the characters in the string.
TEXT     : G O O G O L $
POSITION : 1 2 3 4 5 6 7
•Build a SUFFIX TRIE for all n suffixes of the text.
•Note: The resulting tree has n leaves and height n.
•This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
Suffix tree
• A suffix tree is a member of the trie family. It is a trie of all the suffixes of a string S.
• The suffix tree is created by compacting the unary nodes of the suffix TRIE.
• We store pointers rather than words in the leaves.
• It is also possible to replace the string on every edge by a pair (a,b), where a and b are the beginning and end index of the string in the text, e.g., for GOOGOL$:
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
Example: Suffix tree
•Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:

Position  Suffix
1         abab$
2         bab$
3         ab$
4         b$
5         $

root
  $ → leaf 5
  ab
    $ → leaf 3
    ab$ → leaf 1
  b
    $ → leaf 4
    ab$ → leaf 2

• We label each leaf with the starting point of the corresponding suffix.
Complexity Analysis
• The suffix tree for a string of length n can be built in O(n²) time by inserting the suffixes one at a time (linear-time construction algorithms also exist).
• The search time is linear in the length of the pattern S: searching for substring[1..m] in string[1..n] can be solved in O(m) time, i.e., O(|S|).
Generalized suffix tree
• Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S
•To make suffixes prefix-free we add a special character, $, at the end of s. To associate each suffix with a unique string in S, add a different special symbol to each s
• Build a suffix tree for the string s1$s2#, where '$' and '#' are the special terminators for s1 and s2.
•Ex.: Let s1 = abab and s2 = aab. The suffixes are:

Position  Suffix of s1  Suffix of s2
1         abab$         aab#
2         bab$          ab#
3         ab$           b#
4         b$            #
5         $

The generalized suffix tree for s1 and s2 (each leaf is labelled string:position):
root
  $ → s1:5
  # → s2:4
  a
    ab# → s2:1
    b
      $ → s1:3
      # → s2:2
      ab$ → s1:1
  b
    $ → s1:4
    # → s2:3
    ab$ → s1:2
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of the text is the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
–Start at the root
–Go down the tree, each time taking the path that matches the next symbols of S
–If S corresponds to a node, then return all leaves in its sub-tree
• the places where S can be found are given by the pointers in all the leaves of the subtree rooted at that node.
–If a NIL pointer is encountered before reaching the end of S, then S is not in the tree
Example:
• If S = "GO" we take the GO path and return: GOOGOL$, GOL$.
• If S = "OR" we take the O path and then we hit a NIL pointer, so "OR" is not in the tree.
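The search procedure can be tried out on an uncompressed suffix trie, which is simpler to code than a suffix tree but follows the same walk-down-then-collect-leaves logic. The sketch below is an assumption-laden illustration: nodes are nested dictionaries, the key `"leaf"` (a hypothetical convention) stores the 1-based start of a suffix, and construction is the naive quadratic one.

```python
def build_suffix_trie(text):
    """Insert every suffix of text (which should end in '$') into a trie."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1   # 1-based starting position of the suffix
    return root

def search(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = root
    for ch in pattern:
        if ch not in node:         # NIL pointer: pattern is not in the text
            return []
        node = node[ch]
    found, stack = [], [node]      # collect all leaves below the node reached
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("GOOGOL$")
```

Searching `trie` for "GO" reaches a node whose leaves carry positions 1 and 4, i.e., the suffixes GOOGOL$ and GOL$, while "OR" fails at the second symbol.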
Suffix Tree Applications
Suffix Tree can be used to solve a large number of string
problems that occur in:
– text-editing,
– free-text search,
– etc.

Some examples of string problems are given below.

– String matching
– Longest Common Substring
– Longest Repeated Substring
– Palindromes
– etc..
Drawbacks
• Suffix trees consume a lot of space
• Even if only word beginnings are indexed, a space overhead of 120% - 240% over the text size is produced, because, depending on the implementation, each node of the suffix tree takes space (in bytes) equivalent to the number of symbols used.
• How much space is required at each node for English word indexing based on the alphabet a to z?
• How many bytes are required to store MISSISSIPPI?

Suffix array
• A suffix array is more compact than a suffix tree.
• Suffix arrays are a space-efficient implementation of suffix trees.
• A suffix array is a sorted list of the suffixes of a given string in lexicographical order.
• The sorted list is represented as an array of integers that identify the suffixes in order.
• This allows binary search, and hence fast substring search.
Main drawbacks:
• Its costly construction process.
• The need for the document/text to be readily available at query time.
Building suffix array
Procedure:
– Identify the suffixes of the given string
– Sort the suffixes lexicographically
– Store the indices of all the suffixes in a table.
The suffix array gives the indices of the suffixes in sorted order.
A suffix array can be constructed in O(n log n) time, where n is the length of the string, using specialized suffix-sorting algorithms; plain comparison sorting of the suffixes costs more because each comparison may inspect up to n characters.
Example: Consider the string "good".
– At the end, a special character ($) is usually appended to the string.
– In lexicographical order, the suffixes are "d$", "good$", "od$" and "ood$", with the lone "$" listed last.
– The suffix array is [4, 1, 3, 2, 5].
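The procedure can be written directly as a naive sort of the suffixes. The sketch below follows the convention used in these slides, where the suffixes of the plain string are sorted and the lone "$" suffix is listed last with index n + 1; many textbooks instead treat "$" as the smallest symbol, which would put it first. The function name is hypothetical, and the approach is the simple comparison sort, not an O(n log n) algorithm.

```python
def build_suffix_array(text):
    """Return 1-based suffix start positions in sorted order ('$' last)."""
    n = len(text)
    # Sort the start positions by the suffix that begins there.
    order = sorted(range(n), key=lambda i: text[i:])
    # Convert to 1-based indices and append the position of the lone '$'.
    return [i + 1 for i in order] + [n + 1]
```

For the string "good" this yields the array given in the example, and the same function can be used to check the GOOGOL and abab exercises.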
Building a suffix array
•Example:
•Given the string S = GOOGOL, construct the suffix array
• Sort the suffixes in lexicographical order and store all the indices in a table.

Suffix (not stored)  Index  Ptr (stored)
GOL$                 S[0]   4
GOOGOL$              S[1]   1
L$                   S[2]   6
OGOL$                S[3]   3
OL$                  S[4]   5
OOGOL$               S[5]   2
$                    S[6]   7
Building suffix array
• How can we build a suffix array for multiple strings, like GOOD and GOOGOL?
• Exercise
• Construct the suffix array for the string s = abab
• Identify the suffixes and sort them lexicographically: ab$, abab$, b$, bab$, $
• The suffix array gives the indices of the suffixes in sorted order: [3, 1, 4, 2, 5]
How do we search for a pattern?
• If P occurs in T then all its occurrences are consecutive in the suffix array.
• Do a binary search on the suffix array
• Takes O(log n) suffix comparisons; since each comparison examines up to m = |P| characters, the total is O(m log n) time
• Example 1: search for 'good' in the suffix array constructed for 'GOOGOL'.
• Exercise: Let the given string be S = mississippi; construct the suffix array and search for
(i) ppi
(ii) issa
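Example
•Let S = mississippi
•Let P = issa
•L, M and R mark the left, middle and right positions of the first binary search step.

     Index  Suffix
L    11     i
     8      ippi
     5      issippi
     2      ississippi
     1      mississippi
M    10     pi
     9      ppi
     7      sippi
     4      sissippi
     6      ssippi
R    3      ssissippi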
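The binary search over this table can be sketched as below. It is a minimal illustration with hypothetical function names: the suffix array is built naively without a terminator, each probe compares only the first |P| characters of a suffix with the pattern, and all matches are then read off consecutively.

```python
def suffix_array(text):
    """0-based suffix start positions in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    # Binary search for the first suffix whose m-character prefix >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All occurrences are consecutive in the suffix array from this point on.
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo] + 1)
        lo += 1
    return sorted(hits)

text = "mississippi"
sa = suffix_array(text)
```

With this setup, searching for "ppi" lands on the single suffix starting at position 9, whereas "issa" narrows down to the "iss..." suffixes and then fails the final comparison, so no occurrence is reported.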
Signature file
• A word-oriented index structure based on hashing
• How to build a signature file:
–Hash each word to a fixed-size vector of F bits (the word signature)
–Divide the text into blocks of N words each
–Assign an F-bit mask to each text block of size N (the document signature)
• This is obtained by bitwise ORing the signatures of all the words in the text block.
• Efficient to search for phrases
• Hence the signature file is no more than the sequence of bit masks of all blocks (plus a pointer to each block).
Signature file (continued)
– Signature files contain signatures (bit patterns) that represent documents.
– There are various ways of constructing signatures.
– In one common method, for example, documents are split into logical blocks, each containing a fixed number of distinct significant (that is, non-stoplist) words.
– Each word in the block is hashed to give a signature: a bit pattern with some of the bits set to 1.
– The block signatures are then concatenated to produce the document signature. Searching is done by comparing the signatures of queries with document signatures.
Structure of Signature File
[Figure: the document signature file holds one entry per text block (N blocks); each entry consists of an F-bit signature and a pointer into the text file.]
Example
• Given a text: "A text has many words. Words are made from letters"

• Signature (hash) function:
h(text)   = 1000101
h(many)   = 0110101
h(word)   = 0111100
h(made)   = 0010111
h(letter) = 1001011

• Text signature, one bit mask per block:
Block 1  "A text has many"     h(text) OR h(many)   = 1110101
Block 2  "words. Words are"    h(word)              = 0111100
Block 3  "made from letters"   h(made) OR h(letter) = 1011111

• Bitwise OR of two signatures (shown for block 4 of a separate 6-bit example):
   001100
OR 100001
=  101101
Searching
During query processing:
–Hash the query to an F-bit mask Q
–Compare the query signature with the signature of each block, that is,
• bitwise AND Q with the bit mask Bi of every text block
–If all corresponding 1-bits are "on" in the block signature, the block probably contains that term, that is,
• if Q & Bi = Q, all the bits set in Q are also set in Bi, and therefore the text block may contain the word

The main idea of the signature file is that if a word is present in a text block, then all the bits set in its signature are also set in the bit mask of the text block
–Hence if a bit is set in the mask of the query word and not in the mask of the text block, then the word is not present in the text block
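The build and search steps can be tried with the word signatures from the example. In this sketch the hash function is replaced by a fixed lookup table holding the slide's values, the blocks are given as lists of already stemmed, non-stoplist terms, and the names `SIGNATURES`, `BLOCKS`, `block_signature` and `candidate_blocks` are hypothetical.

```python
# Word signatures taken from the example (F = 7 bits).
SIGNATURES = {
    "text":   0b1000101,
    "many":   0b0110101,
    "word":   0b0111100,
    "made":   0b0010111,
    "letter": 0b1001011,
}

# Stemmed, non-stoplist terms of each text block.
BLOCKS = [["text", "many"], ["word", "word"], ["made", "letter"]]

def block_signature(terms):
    """Bitwise OR of the signatures of all terms in one block."""
    sig = 0
    for term in terms:
        sig |= SIGNATURES[term]
    return sig

def candidate_blocks(term, masks):
    """1-based numbers of blocks whose mask covers the query bits (Q & Bi = Q)."""
    q = SIGNATURES[term]
    return [i + 1 for i, b in enumerate(masks) if q & b == q]

MASKS = [block_signature(block) for block in BLOCKS]
```

Querying for "many" returns only block 1. Querying for "text" returns blocks 1 and 3, even though block 3 does not contain the word: its mask happens to cover every bit of h(text), so each candidate block still has to be checked against the actual text.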
Signature file trivia
Signature files lead to possible mismatches.
–It is possible that all the corresponding bits are set even though the word is not there. This is called a false drop.

False drop or false positive

–In a signature file, a false drop happens when the bits set by other words in the block cover all the bits of the query signature.
–More generally, a false drop is a document that is retrieved by a search but is not relevant to the searcher's needs
–Such false drops can also occur because of words that are written the same but have different meanings.
–Example: 'squash' can refer to a game, a vegetable or an action