3 Indexing

Uploaded by

gosatilahun2017

Indexing structure

Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
–The concern here is retrieving more documents that are relevant
to the user's query
–Effectiveness of the system is measured in terms of
precision and recall.
–Main emphasis: stemming, stop word removal, weighting
schemes, matching algorithms

• Improving the efficiency of the system
–The concern here is reducing the storage space requirement and
the searching time, indexing time, access time…
–Main emphasis: Compression, indexing structures, space
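The two effectiveness measures named above can be sketched in a few lines. This is a minimal illustration, assuming the retrieved and relevant documents are given as collections of document IDs; the function name `precision_recall` and the sample IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 documents are relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.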
Subsystems of IR system
The two subsystems of an IR system:
–Indexing:
• Is an offline process of organizing documents
using keywords extracted from the collection
• Indexing is used to speed up access to desired
information from the document collection as per the
user's query

–Searching:
• Is an online process that scans the document corpus to find
relevant documents that match the user's query
Indexing Subsystem
Documents
  → Assign document identifier (output: document IDs)
  → Tokenization (output: tokens)
  → Stopword removal (output: non-stoplist tokens)
  → Stemming & Normalization (output: stemmed terms)
  → Term weighting (output: weighted index terms)
  → Index File
Searching Subsystem
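Query
  → Parse query (output: query tokens)
  → Stop word removal (output: non-stoplist tokens)
  → Stemming & Normalize (output: stemmed terms)
  → Term weighting (output: query terms)
  → Similarity Measure, matching query terms against index terms from the Index (output: relevant document set)
  → Ranking (output: ranked document set)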
Basic assertion
Indexing and searching: inexorably connected
– you cannot search what was not first indexed
in some manner or other
– indexing of documents or objects is done in
order to make them searchable
• there are many ways to do indexing
– to index one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing
language

Knowing searching is knowing indexing


Indexing: Basic Concepts
• Indexing is used to speed up access to desired
information from the document collection as per the user's
query, such that
– It enhances efficiency in terms of retrieval time: relevant
documents are searched and retrieved quickly.
Example: author catalog in a library
• An index file consists of records, called index
entries.
• Index files are much smaller than the original file.
– The size may be further reduced by linguistic
preprocessing (like stemming & other normalization methods).
• The usual unit for indexing is the word
– Index terms are used to look up records in a file.
Major Steps in Index Construction
• Source file: collection of text documents
–A document can be described by a set of representative keywords called
index terms.
• Index terms selection:
–Tokenize: identify words in a document, so that each document is
represented by a list of keywords or attributes
–Stop words: removal of high-frequency words
• A stop list of words is used for comparison against the input text
–Word stemming and normalization: reduce variant forms of a word to
their stem/root word
• Suffix stripping is the common method
–Term relevance weight: different index terms have varying relevance
when used to describe document contents.
• This effect is captured through the assignment of numerical weights to
each index term of a document.
• There are different index term weighting methods: TF, TF*IDF, …

• Output: a set of index terms (vocabulary) to be used for
indexing the documents that each term occurs in.
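The index term selection steps above can be sketched as a small pipeline. This is only an illustration: the stop list `STOP_WORDS` and the suffix list `SUFFIXES` are hypothetical toy values, and the crude suffix stripper stands in for a real stemmer (it is not the Porter algorithm and will over-stem or under-stem many words).

```python
import re

# Hypothetical toy stop list; real systems use a much longer one.
STOP_WORDS = {"i", "the", "was", "did", "me", "so", "let", "it", "be", "with", "you"}
# Hypothetical suffixes for a very crude suffix-stripping stemmer.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(word):
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]


terms = index_terms("Friends, Romans, countrymen.")
print(terms)
```

Note that this toy stemmer leaves "countrymen" unchanged; mapping it to "countryman" needs a real stemmer or lemmatizer.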
Basic Indexing Process
Documents to be indexed:     Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:                Friends   Romans   countrymen
        ↓ Linguistic preprocessing
Modified tokens:             friend    roman    countryman
        ↓ Indexer
Index file (inverted file):  friend      → 2, 4
                             roman       → 1, 2
                             countryman  → 13, 16
Building Index file
•An index file of a document collection is a file consisting of a list of index terms
and a link to one or more documents that contain each index term
–A good index file maps each keyword Ki to the set of documents Di that contain
the keyword

•An index file usually has its index terms in sorted order.

•An index file is a list of search terms that are organized for associative
look-up, i.e., to answer the user's query.
•For organizing the index file for a collection of documents, there are
various options available:
–Decide what data structure and/or file structure to use: sequential file,
inverted file, suffix array, signature file, etc.
Index file Evaluation Metrics
• Running time
–Indexing time
–Access/search time
–Update time (Insertion time, Deletion time, modification
time….)

• Space overhead
–Computer storage space consumed.

• Access types supported efficiently.
–Does the indexing structure allow access to:
• records with a specified term, or
• records with terms falling in a specified range of values?
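The two access types can be illustrated on a sorted vocabulary. This is a minimal sketch, assuming the terms fit in an in-memory sorted list; the helper names `has_term` and `terms_in_range` are hypothetical. A hash table would answer the first kind of query but not the second, which is why sorted order matters for range access.

```python
from bisect import bisect_left, bisect_right

# Sorted vocabulary (sample terms taken from the Julius Caesar example).
vocabulary = sorted(
    ["ambition", "brutus", "caesar", "capitol", "enact", "julius", "kill", "noble"]
)


def has_term(term):
    """Exact-match access: binary search for one specified term."""
    i = bisect_left(vocabulary, term)
    return i < len(vocabulary) and vocabulary[i] == term


def terms_in_range(low, high):
    """Range access: all terms t with low <= t <= high."""
    return vocabulary[bisect_left(vocabulary, low):bisect_right(vocabulary, high)]


print(has_term("kill"))
print(terms_in_range("b", "d"))
```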
Sequential File

•The sequential file is the most primitive file structure.
• It has no linking pointers.
•The records are generally arranged serially, one after
another, in lexicographic order.
Example:
• Given a collection of documents, they are parsed
to extract words, and these are saved with the
document ID.

Doc 1: I did enact Julius Caesar I was killed
I the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious
Sorting the Vocabulary
• After all documents have been tokenized, stopwords are
removed, and normalization and stemming are applied to
generate index terms
• These index terms in the sequential file are sorted in
alphabetical order

Tokens extracted from the documents:
Term       Doc #
I          1
did        1
enact      1
julius     1
caesar     1
I          1
was        1
killed     1
I          1
the        1
capitol    1
brutus     1
killed     1
me         1
so         2
let        2
it         2
be         2
with       2
caesar     2
the        2
noble      2
brutus     2
hath       2
told       2
you        2
caesar     2
was        2
ambitious  2

Sequential file:
No.  Term      Doc No.
1    ambition  2
2    brutus    1
3    brutus    2
4    caesar    1
5    caesar    2
6    caesar    2
7    capitol   1
8    enact     1
9    julius    1
10   kill      1
11   kill      1
12   noble     2
Sequential File
• Its main advantages are:
– easy to implement;
– provides fast access to the next record using lexicographic
order.
• Its disadvantages:
– difficult to update. Index must be rebuilt if a new term is
added. Inserting a new record may require moving a large
proportion of the file;
– random access is extremely slow.
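The trade-off listed above can be sketched with a sorted list of (term, document ID) records standing in for the sequential file. This is an illustrative model, not a file-level implementation; the helper names `lookup` and `insert` are hypothetical. Lookup scans the records in order, and inserting a new record shifts every record that follows it.

```python
# Sorted (term, doc_id) records, as in the sequential file example.
records = [
    ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("kill", 1), ("noble", 2),
]


def lookup(records, term):
    """Scan serially; the sort order lets the scan stop early."""
    hits = []
    for rec_term, doc_id in records:
        if rec_term == term:
            hits.append(doc_id)
        elif rec_term > term:
            break
    return hits


def insert(records, term, doc_id):
    """Insert in sorted position; return how many records had to shift."""
    pos = 0
    while pos < len(records) and records[pos] < (term, doc_id):
        pos += 1
    records.insert(pos, (term, doc_id))
    return len(records) - 1 - pos


found = lookup(records, "caesar")
shifted = insert(records, "ambition", 2)
print(found, shifted)
```

Adding "ambition" at the front moves all six existing records, which is the update cost the slide warns about.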
Inverted file
• A word-oriented indexing mechanism based on a sorted list of
keywords, with each keyword having links to the documents
containing it.
• Content of the inverted file:
–Data to be held in the inverted file includes:
• The vocabulary (list of terms)
• The occurrences (location and frequency of terms in a
document collection)
Inverted file
• The occurrences: contain one record per term,
listing
–Frequency of each term in a document, i.e. the number of
occurrences of the keyword in a document
• TFij, number of occurrences of term tj in document di
• DFj, number of documents containing tj
• maxi, maximum frequency of any term in di
• N, total number of documents in the collection
• CFj, collection frequency of tj, i.e. its total number of occurrences in the collection
• …

–Locations/positions of words in the text


Inverted file
•Why vocabulary?
–Having information about the vocabulary (list of terms) speeds up
searching for relevant documents

•Why location?
–Having information about the location of each term
within the document helps to:
•highlight the location of the search term
•Why frequencies?
–Having information about frequency is used for:
•calculating term weights (like TF, TF*IDF, …)
•optimizing query processing
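As an illustration of how the stored frequencies feed term weighting, here is a minimal TF*IDF sketch. It assumes one common variant, weight = TF × log2(N / DF); the slides do not say which variant is intended, so treat the formula and the function name `tf_idf` as assumptions.

```python
import math


def tf_idf(tf, df, n_docs):
    """Weight of a term in a document: TF * log2(N / DF) (assumed variant)."""
    return tf * math.log2(n_docs / df)


# Two-document collection (N = 2), values from the Julius Caesar example:
# "kill" occurs twice in Doc 1 and in no other document (DF = 1).
w_kill = tf_idf(tf=2, df=1, n_docs=2)
# "caesar" occurs twice in Doc 2 but appears in both documents (DF = 2).
w_caesar = tf_idf(tf=2, df=2, n_docs=2)
print(w_kill, w_caesar)
```

A term that appears in every document gets weight 0 under this variant, since it does not help distinguish one document from another.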
Inverted File
Documents are organized by the terms/words they contain.
This is called an index file.
Text operations are performed before building the index.

Term   CF  Document ID  TF  Location
auto   3   2            1   66
           19           1   213
           29           1   45
bus    4   3            1   94
           19           2   7, 212
           22           1   56
taxi   1   5            1   43
train  3   11           2   3, 70
           34           1   40
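A table like the one above can be produced with a nested mapping from term to document ID to list of locations. This is a minimal sketch under stated assumptions: locations are word positions counted from 1 (the slide does not say whether its locations are word or character offsets), the two toy documents are invented, and the names `build_positional_index` and `collection_frequency` are hypothetical.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def collection_frequency(index, term):
    """CF: total occurrences of the term over all documents."""
    return sum(len(positions) for positions in index.get(term, {}).values())


# Hypothetical toy collection.
docs = {1: "bus taxi bus", 2: "train bus"}
index = build_positional_index(docs)

for term in sorted(index):
    for doc_id, positions in sorted(index[term].items()):
        # Columns: term, CF, document ID, TF, locations
        print(term, collection_frequency(index, term), doc_id, len(positions), positions)
```

TF is simply the length of each location list, so storing locations makes the frequency columns derivable.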
Construction of Inverted file
• An inverted index consists of two files:
–Vocabulary file
–Posting file
Advantage of dividing the inverted file:
• Keeping a pointer in the vocabulary to the list in
the posting file allows:
– the vocabulary to be kept in memory at search
time, even for a large text collection, and
– the posting file to be kept on disk for access to
documents
Vocabulary file
• A vocabulary file (word list):
–stores all of the distinct terms (keywords) that appear in any
of the documents (in lexicographical order), and
–for each word, a pointer to the posting file

• The record kept for each term j in the word list contains the
following:
–term j
–number of documents in which term j occurs (DFj)
–Total frequency of term j (CFj)
–pointer to postings (inverted) list for term j
Postings File (Inverted List)
• For each distinct term in the vocabulary, the postings file stores
a list of pointers to the documents that contain
that term.
• Each element in an inverted list is called a
posting, i.e., the occurrence of a term in a
document
• The postings are stored as a separate inverted list for each
term in the index file.
– Each list consists of one or many individual
postings holding the document ID, TF and location
information for a given term i
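The split between an in-memory vocabulary and an on-disk postings file can be sketched as follows. This is an illustration only: an `io.BytesIO` buffer stands in for the disk file, postings are serialized as JSON for readability, and the record layout (`df`, `cf`, `offset`, `length`) and function names are assumptions rather than a standard format.

```python
import io
import json


def write_index(index, postings_file):
    """Write each postings list to the file; return the in-memory vocabulary."""
    vocabulary = {}
    for term in sorted(index):
        postings = index[term]  # list of (doc_id, tf) pairs
        data = json.dumps(postings).encode("utf-8")
        vocabulary[term] = {
            "df": len(postings),                   # documents containing the term
            "cf": sum(tf for _, tf in postings),   # total occurrences
            "offset": postings_file.tell(),        # pointer into the postings file
            "length": len(data),
        }
        postings_file.write(data)
    return vocabulary


def read_postings(vocabulary, postings_file, term):
    """Follow the vocabulary pointer and read only that term's postings."""
    entry = vocabulary.get(term)
    if entry is None:
        return []
    postings_file.seek(entry["offset"])
    data = postings_file.read(entry["length"])
    return [tuple(p) for p in json.loads(data)]


index = {"caesar": [(1, 1), (2, 2)], "kill": [(1, 2)]}
disk = io.BytesIO()  # stand-in for a postings file on disk
vocab = write_index(index, disk)
print(vocab["caesar"]["df"], vocab["caesar"]["cf"], read_postings(vocab, disk, "caesar"))
```

Only the small vocabulary dictionary needs to stay in memory; answering a query touches just the byte range for the requested term.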
Organization of Index File
Vocabulary (word list) → Postings (inverted lists) → Actual Documents

Term   No. of Doc  Tot freq  Pointer to posting
Act    3           3         → inverted list
Bus    3           4         → inverted list
pen    1           1         → inverted list
total  2           3         → inverted list
Example:
• Given a collection of documents, they are parsed
to extract words and these are saved with the
Document ID.

Doc 1: I did enact Julius Caesar I was killed
I the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious
Sorting the Vocabulary
• After all documents have been tokenized, the
inverted file is sorted by terms

Tokens in document order:    Sorted by term:
Term       Doc #             Term       Doc #
I          1                 ambitious  2
did        1                 be         2
enact      1                 brutus     1
julius     1                 brutus     2
caesar     1                 caesar     1
I          1                 caesar     2
was        1                 caesar     2
killed     1                 capitol    1
I          1                 did        1
the        1                 enact      1
capitol    1                 hath       2
brutus     1                 I          1
killed     1                 I          1
me         1                 I          1
so         2                 it         2
let        2                 julius     1
it         2                 killed     1
be         2                 killed     1
with       2                 let        2
caesar     2                 me         1
the        2                 noble      2
noble      2                 so         2
brutus     2                 the        1
hath       2                 the        2
told       2                 told       2
you        2                 was        1
caesar     2                 was        2
was        2                 with       2
ambitious  2                 you        2
Remove stopwords, apply stemming &
compute term frequency
•Multiple term entries in a single document are
merged and frequency information is added
•Counting the number of occurrences of terms in the
collection helps to compute TF

Before merging:        After merging:
Term      Doc #        Term      Doc #  TF
ambition  2            ambition  2      1
brutus    1            brutus    1      1
brutus    2            brutus    2      1
caesar    1            caesar    1      1
caesar    2            caesar    2      2
caesar    2            capitol   1      1
capitol   1            enact     1      1
enact     1            julius    1      1
julius    1            kill      1      2
kill      1            noble     2      1
kill      1
noble     2
Vocabulary and postings file
The file is commonly split into a Dictionary (vocabulary) and a Posting file

Index terms with frequencies:
Term      Doc #  TF
ambition  2      1
brutus    1      1
brutus    2      1
caesar    1      1
caesar    2      2
capitol   1      1
enact     1      1
julius    1      1
kill      1      2
noble     2      1

Vocabulary              Posting
Term      DF  CF        Doc #  TF
ambition  1   1    →    2      1
brutus    2   2    →    1      1
                        2      1
caesar    2   3    →    1      1
                        2      2
capitol   1   1    →    1      1
enact     1   1    →    1      1
julius    1   1    →    1      1
kill      1   2    →    1      2
noble     1   1    →    2      1

Pointers (→) link each vocabulary entry to its postings.
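The merge and split steps of this example can be reproduced in a few lines. This is a minimal sketch, assuming the stemmed, stopword-free (term, Doc #) pairs are already available; the variable names are hypothetical.

```python
from collections import Counter

# Stemmed index terms with their document numbers (from the example above).
pairs = [
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
    ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2),
]

# Merge duplicate (term, doc) entries: the count is the term frequency TF.
tf = Counter(pairs)

# Posting lists: term -> [(doc_id, tf), ...], ordered by document number.
postings = {}
for (term, doc_id), count in sorted(tf.items()):
    postings.setdefault(term, []).append((doc_id, count))

# Vocabulary: term -> (DF, CF), derived from the posting lists.
vocabulary = {
    term: (len(plist), sum(count for _, count in plist))
    for term, plist in postings.items()
}

for term in sorted(vocabulary):
    df, cf = vocabulary[term]
    print(term, df, cf, postings[term])
```

DF is the length of a term's posting list and CF is the sum of its TF values, so both vocabulary columns follow directly from the postings.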
Exercises
1) Construct the inverted index for the following
document collection.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise

2) Doc 1: "I am not going there to be imprisoned," said Dantes.
Doc 2: "You are Edmond Dantes," cried Villefort, seizing the count
by the wrist; "then come here!"
