3 Indexing

Uploaded by

gosatilahun2017

Indexing structure

Designing an IR System
Our focus during IR system design:
• Improving the effectiveness of the system
–The concern here is retrieving more documents that are relevant to the
user's query
–Effectiveness of the system is measured in terms of
precision and recall.
–Main emphasis: stemming, stop word removal, weighting
schemes, matching algorithms

• Improving the efficiency of the system

–The concern here is reducing storage space requirements and
reducing searching time, indexing time, access time…
–Main emphasis: Compression, indexing structures, space
Subsystems of IR system
The two subsystems of an IR system:
–Indexing:
• is an offline process of organizing documents
using keywords extracted from the collection
• is used to speed up access to desired
information in the document collection in response to the
user's query

–Searching:
• is an online process that searches the document corpus (through the index) to find
relevant documents that match the user's query
Indexing Subsystem
documents
→ Assign document identifier → document IDs
→ Tokenization → tokens
→ Stopword removal → non-stoplist tokens
→ Stemming & Normalization → stemmed terms
→ Term weighting → weighted index terms
→ Index File
Searching Subsystem
query
→ Parse query → query tokens
→ Stop word removal → non-stoplist tokens
→ Stemming & Normalize → stemmed terms
→ Term weighting → query terms
→ Similarity Measure (query terms against index terms from the Index) → relevant document set
→ Ranking → ranked document set
Basic assertion
Indexing and searching: inexorably connected
– you cannot search for what was not first indexed
in some manner or other
– documents or objects are indexed so that
they can be searched
• there are many ways to do indexing
– to index, one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing
language

Knowing searching is knowing indexing

Indexing: Basic Concepts
• Indexing is used to speed up access to desired
information in a document collection in response to the user's
query:
– It improves efficiency in terms of retrieval time. Relevant
documents are searched and retrieved quickly.
Example: the author catalog in a library
• An index file consists of records, called index
entries.
• Index files are much smaller than the original file.
– Their size may be further reduced by linguistic
pre-processing (such as stemming and other normalization methods).
• The usual unit for indexing is the word
– Index terms are used to look up records in a file.
Major Steps in Index Construction
• Source file: a collection of text documents
–A document can be described by a set of representative keywords called
index terms.
• Index term selection:
–Tokenize: identify the words in a document, so that each document is
represented by a list of keywords or attributes
–Stop words: remove high-frequency words
• The input text is compared against a stop list of words
–Word stemming and normalization: reduce variants of a word to
their common stem/root word
• Suffix stripping is the common method
–Term relevance weight: different index terms have varying relevance
when used to describe document contents.
• This effect is captured by assigning a numerical weight to
each index term of a document.
• There are different index term weighting methods: TF, TF*IDF, …

• Output: a set of index terms (vocabulary) used for
indexing the documents that each term occurs in.
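The index term selection steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the slides: the stop list `STOP_WORDS`, the suffix rules in `SUFFIXES` and the function names are all assumptions made for the example, and the suffix stripper is far cruder than a real stemmer.

```python
import re
from collections import Counter

# Hypothetical stop list and suffix rules, chosen only for this illustration.
STOP_WORDS = {"i", "the", "was", "did", "me", "so", "let", "it", "be", "with", "you"}
SUFFIXES = ("ing", "ed")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(token):
    """Crude suffix stripping; keeps at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Return a Counter mapping each index term of one document to its raw TF."""
    tokens = tokenize(text)
    kept = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return Counter(strip_suffix(t) for t in kept)       # stemming + counting


doc1 = "I did enact Julius Caesar I was killed I the Capitol; Brutus killed me."
terms = index_terms(doc1)
```

Here `terms` holds the index terms of the document together with their raw frequencies; a weighting scheme such as TF*IDF would be applied on top of these counts.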
Basic Indexing Process
Documents to be indexed:   Friends, Romans, countrymen.
→ Tokenizer
Token stream:              Friends   Romans   countrymen
→ Linguistic preprocessing
Modified tokens:           friend   roman   countryman
→ Indexer
Index file (inverted file):
friend       2   4
roman        1   2
countryman   13  16
Building Index file
•An index file for a document collection consists of a list of index terms,
each with a link to the one or more documents that contain the term
–A good index file maps each keyword Ki to the set of documents Di that contain
the keyword

•An index file usually keeps its index terms in sorted order.

•An index file is a list of search terms that are organized for associative
look-up, i.e., to answer user’s query:
•For organizing the index file for a collection of documents, there are
various options available:
–Decide what data structure and/or file structure to use: sequential file,
inverted file, suffix array, signature file, etc.
Index file Evaluation Metrics
• Running time
–Indexing time
–Access/search time
–Update time (insertion time, deletion time, modification
time…)

• Space overhead
–Computer storage space consumed.

• Access types supported efficiently.

–Does the indexing structure allow access to:
• records with a specified term, or
• records with terms falling in a specified range of values?
Sequential File

•The sequential file is the most primitive file structure.

• It has no linking pointers.
•The records are arranged serially, one after
another, in lexicographic order.
Example:
• Given a collection of documents, they are parsed
to extract words, and these are saved with the
document ID.

Doc 1: I did enact Julius Caesar I was killed I the Capitol;
Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you
Caesar was ambitious
Sorting the Vocabulary
• After all documents have been tokenized, stop words are removed and
normalization and stemming are applied to generate index terms.
• These index terms are then sorted in alphabetical order in the
sequential file.

Tokens in order of occurrence:
Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
I           1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2

Sequential file:
No.   Term       Doc No.
1     ambition   2
2     brutus     1
3     brutus     2
4     caesar     1
5     caesar     2
6     caesar     2
7     capitol    1
8     enact      1
9     julius     1
10    kill       1
11    kill       1
12    noble      2
Sequential File
• Its main advantages are:
– easy to implement;
– provides fast access to the next record in lexicographic
order.
• Its disadvantages are:
– difficult to update: inserting a new record may require moving a large
proportion of the file, so the index must in effect be rebuilt when a new term is
added;
– random access is extremely slow.
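The trade-off can be made concrete with a small sketch. It models the sequential file as an in-memory Python list of `(term, doc_id)` records in lexicographic order; the function names and the new term `"antony"` are hypothetical, and a list only approximates a file on disk.

```python
import bisect

# Sequential file modelled as (term, doc_id) records in lexicographic order.
# The records are the index terms of the two-document Caesar example.
records = sorted([
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("capitol", 1), ("enact", 1), ("julius", 1),
    ("kill", 1), ("noble", 2),
])


def lookup(records, term):
    """Scan from the start of the file; sorted order lets the scan stop early."""
    docs = []
    for rec_term, doc_id in records:
        if rec_term > term:
            break
        if rec_term == term:
            docs.append(doc_id)
    return docs


def insert(records, term, doc_id):
    """Insert a record at its sorted position.

    Returns how many existing records had to move one slot to make room.
    """
    pos = bisect.bisect_left(records, (term, doc_id))
    records.insert(pos, (term, doc_id))
    return len(records) - 1 - pos


brutus_docs = lookup(records, "brutus")
moved = insert(records, "antony", 2)
```

Adding the single term `"antony"` near the front of the file forces almost every record to shift, which is the update cost the slide describes.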
Inverted file
• A word-oriented indexing mechanism based on a sorted list of
keywords, with each keyword having links to the documents
containing it.
• Content of the inverted file:
–Data held in the inverted file includes:
• The vocabulary (list of terms)
• The occurrences (location and frequency of terms in the
document collection)
Inverted file
• The occurrences: contain one record per term,
listing
–Frequency of each term in a document, i.e. the number of
occurrences of the keyword in a document
• TFij, number of occurrences of term tj in document di
• DFj, number of documents containing tj
• maxi, maximum frequency of any term in di
• N, total number of documents in the collection
• CFj, collection frequency of tj, i.e. its total number of occurrences in the collection
• ….

–Locations/positions of words in the text
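As a rough illustration, the statistics listed above can be computed from already preprocessed documents as follows. The term lists are the index terms of the two-document Caesar example; the variable names are assumptions made for this sketch.

```python
from collections import Counter

# Index terms per document after stop word removal and stemming.
docs = {
    1: ["enact", "julius", "caesar", "kill", "capitol", "brutus", "kill"],
    2: ["caesar", "noble", "brutus", "caesar", "ambition"],
}

N = len(docs)                                                    # N
tf = {d: Counter(terms) for d, terms in docs.items()}            # TF_ij
df = Counter(t for counts in tf.values() for t in counts)        # DF_j
cf = Counter(t for terms in docs.values() for t in terms)        # CF_j
max_tf = {d: max(counts.values()) for d, counts in tf.items()}   # max_i
```

For example, "caesar" occurs in both documents (DF = 2) three times in total (CF = 3), while "kill" occurs twice in Doc 1 only.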

Inverted file
•Why vocabulary?
–Having information about the vocabulary (list of terms) speeds up
searching for relevant documents

•Why location?
–Having information about the location of each term
within the document helps to:
•highlight the location of the search term
•Why frequencies?
–Frequency information is used for:
•calculating term weights (like TF, TF*IDF, …)
•optimizing query processing
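The slides name TF*IDF without giving a formula. The sketch below assumes one widely used variant, weight = TF × log2(N / DF); other variants (different log base, normalized TF) exist, and the function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(tf, df, n_docs):
    """Weight of a term in a document: TF * log2(N / DF) (assumed variant)."""
    return tf * math.log2(n_docs / df)


N = 2  # two documents in the Caesar example

# "kill" occurs twice in Doc 1 and in no other document.
w_kill_doc1 = tf_idf(tf=2, df=1, n_docs=N)

# "caesar" occurs twice in Doc 2 but appears in every document.
w_caesar_doc2 = tf_idf(tf=2, df=2, n_docs=N)
```

Under this variant a term that appears in every document gets weight 0, however frequent it is, because it does not help to tell documents apart.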
Inverted File
Documents are organized by the terms/words they contain. This is called an
index file. Text operations are performed before building the index.

Term    CF   Document ID   TF   Location
auto    3    2             1    66
             19            1    213
             29            1    45
bus     4    3             1    94
             19            2    7, 212
             22            1    56
taxi    1    5             1    43
train   3    11            2    3, 70
             34            1    40
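A table of this shape can be produced by a positional inverted index. The sketch below is an assumption-laden illustration: it records word positions starting at 1 (the slide does not say whether its locations are word or character offsets), skips all text operations, and uses a made-up mini collection rather than the one behind the table.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]}, positions starting at 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def collection_frequency(index, term):
    """CF: total number of occurrences of `term` over all documents."""
    return sum(len(positions) for positions in index[term].values())


# Hypothetical mini collection, keyed by document ID.
docs = {
    2: "take the auto to town",
    19: "bus or auto then bus again",
}
index = build_positional_index(docs)
auto_cf = collection_frequency(index, "auto")
```

For each term, the per-document TF is simply the length of its position list, so the CF, TF and Location columns all come from the same structure.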
Construction of Inverted file
• An inverted index consists of two files:
–Vocabulary file
–Posting file
Advantage of dividing the inverted file:
• Keeping a pointer in the vocabulary to the list in
the posting file allows:
– the vocabulary to be kept in memory at search
time even for a large text collection, and
– the posting file to be kept on disk and read when the
documents for a term are needed
Vocabulary file
• A vocabulary file (word list):
–stores all of the distinct terms (keywords) that appear in any
of the documents (in lexicographical order), and
–for each word, a pointer to the posting file

• The record kept for each term j in the word list contains the
following:
–term j
–number of documents in which term j occurs (DFj)
–Total frequency of term j (CFj)
–pointer to postings (inverted) list for term j
Postings File (Inverted List)
• For each distinct term in the vocabulary, it stores
a list of pointers to the documents that contain
that term.
• Each element in an inverted list is called a
posting, i.e., the occurrence of a term in a
document.
• It is stored as a separate inverted list for each
term, i.e., one list corresponding to each term
in the index file.
– Each list consists of one or more individual
postings holding the document ID, TF and location
information for a given term i
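The split into a vocabulary file and a postings file can be sketched as two in-memory structures. In this illustration the "pointer" is just an offset into a flat Python list, standing in for a byte offset into a file on disk; location information is left out, and all function names are hypothetical.

```python
from collections import Counter


def build_inverted_file(docs):
    """Return (vocabulary, postings).

    vocabulary: term -> (DF, CF, pointer); pointer is the start offset of the
                term's inverted list inside `postings`
    postings:   flat list of (doc_id, TF) pairs, grouped by term in sorted order
    """
    per_term = {}
    for doc_id in sorted(docs):
        for term, tf in Counter(docs[doc_id]).items():
            per_term.setdefault(term, []).append((doc_id, tf))

    vocabulary, postings = {}, []
    for term in sorted(per_term):
        plist = per_term[term]
        df = len(plist)
        cf = sum(tf for _, tf in plist)
        vocabulary[term] = (df, cf, len(postings))
        postings.extend(plist)
    return vocabulary, postings


def postings_for(term, vocabulary, postings):
    """Follow the pointer from the vocabulary to the term's inverted list."""
    if term not in vocabulary:
        return []
    df, _cf, pointer = vocabulary[term]
    return postings[pointer:pointer + df]


# Index terms of the Caesar example after preprocessing.
docs = {
    1: ["enact", "julius", "caesar", "kill", "capitol", "brutus", "kill"],
    2: ["caesar", "noble", "brutus", "caesar", "ambition"],
}
vocabulary, postings = build_inverted_file(docs)
caesar_postings = postings_for("caesar", vocabulary, postings)
```

Only `vocabulary` needs to stay in memory at search time; a lookup reads DF postings starting at the stored pointer.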
Organization of Index File
Vocabulary (word list) → Postings (inverted list) → Actual Documents

Term    No of Doc   Tot freq   Pointer to posting
Act     3           3          → inverted list
Bus     3           4          → inverted list
pen     1           1          → inverted list
total   2           3          → inverted list
Example:
• Given a collection of documents, they are parsed
to extract words, and these are saved with the
document ID.

Doc 1: I did enact Julius Caesar I was killed I the Capitol;
Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you
Caesar was ambitious
Sorting the Vocabulary
• After all documents have been tokenized, the inverted file is sorted by
terms.

In order of occurrence      Sorted by term
Term        Doc #           Term        Doc #
I           1               ambitious   2
did         1               be          2
enact       1               brutus      1
julius      1               brutus      2
caesar      1               caesar      1
I           1               caesar      2
was         1               caesar      2
killed      1               capitol     1
I           1               did         1
the         1               enact       1
capitol     1               hath        2
brutus      1               I           1
killed      1               I           1
me          1               I           1
so          2               it          2
let         2               julius      1
it          2               killed      1
be          2               killed      1
with        2               let         2
caesar      2               me          1
the         2               noble       2
noble       2               so          2
brutus      2               the         1
hath        2               the         2
told        2               told        2
you         2               was         1
caesar      2               was         2
was         2               with        2
ambitious   2               you         2
Remove stopwords, apply stemming &
compute term frequency
•Multiple entries for a term in a single document are merged and
frequency information is added
•Counting the number of occurrences of terms in the collection
helps to compute TF

Before merging              After merging
Term       Doc #            Term       Doc #   TF
ambition   2                ambition   2       1
brutus     1                brutus     1       1
brutus     2                brutus     2       1
caesar     1                caesar     1       1
caesar     2                caesar     2       2
caesar     2                capitol    1       1
capitol    1                enact      1       1
enact      1                julius     1       1
julius     1                kill       1       2
kill       1                noble      2       1
kill       1
noble      2
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a posting file.

Merged file:
Term       Doc #   TF
ambition   2       1
brutus     1       1
brutus     2       1
caesar     1       1
caesar     2       2
capitol    1       1
enact      1       1
julius     1       1
kill       1       2
noble      2       1

Vocabulary                  Posting (each → is a pointer)
Term       DF   CF          Doc #, TF
ambition   1    1      →    (2, 1)
brutus     2    2      →    (1, 1), (2, 1)
caesar     2    3      →    (1, 1), (2, 2)
capitol    1    1      →    (1, 1)
enact      1    1      →    (1, 1)
julius     1    1      →    (1, 1)
kill       1    2      →    (1, 2)
noble      1    1      →    (2, 1)
Exercises
1) Construct the inverted index for the following
document collection.
Doc 1 : New home to home sales forecasts
Doc 2 : Rise in home sales in July
Doc 3 : Home sales rise in July for new homes
Doc 4 : July new home sales rise

2) Doc 1: "I am not going there to be imprisoned," said Dantes.

Doc 2: "You are Edmond Dantes," cried Villefort, seizing the count
by the wrist; "then come here!"
