Indexing Structure: Chapter Four

This document discusses indexing structures and the process of building an index file. It defines indexing as arranging index terms to allow fast searching and reduce memory usage, and explains that indexing enhances retrieval efficiency and speed by allowing relevant documents to be searched and retrieved quickly. The major steps of index construction discussed are selecting index terms through preprocessing, organizing the terms and their occurrences in an inverted or sequential file structure, and evaluating the performance of the indexing structure.

Uploaded by

milkikoo shifera

CHAPTER FOUR

Indexing structure
Indexing: Basic Concepts
Indexing is an arrangement of index terms that permits fast searching and reduces the memory space requirement.
It is used to speed up access to the desired information in a document collection in response to a user's query, such that:
• It enhances efficiency in terms of retrieval time.
• Relevant documents are searched and retrieved quickly.
An index file usually has its index terms in sorted order.
An index file consists of records, called index entries.

Index files are much smaller than the original file.

Remember Heaps' Law: in a 1 GB text collection the vocabulary has a size of only 5 MB. This size may be further reduced by linguistic pre-processing (or text operations).
Index terms are used to look up records in a file.
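Heaps' Law is usually written V = K * n^beta, where n is the number of tokens in the collection and V is the number of distinct terms. The sketch below only illustrates how slowly the vocabulary grows. The function name and the constants K = 44 and beta = 0.49 are assumptions chosen for illustration, not values from this chapter; in practice both constants are fitted to each collection.

```python
def heaps_vocabulary_size(num_tokens, k=44.0, beta=0.49):
    """Estimate the number of distinct terms as V = k * n**beta.

    k and beta are illustrative assumptions; real values are fitted
    to a specific collection.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return k * num_tokens ** beta


if __name__ == "__main__":
    # The vocabulary grows far more slowly than the collection itself.
    for n in (10**4, 10**6, 10**8):
        print(n, round(heaps_vocabulary_size(n)))
```

With beta below 1, making the collection 100 times larger multiplies the estimated vocabulary by less than 10, which is why the index vocabulary stays small relative to the text.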
Major Steps in Index Construction
Source file: a collection of text documents
• A document can be described by a set of representative keywords called index terms.
Index terms selection: apply text operations (preprocessing)
• Tokenize: identify the words in a document.
• Stop words removal.
• Word stemming: reduce variant forms of a word to their common stem/root word.
• Term relevance weight: different index terms have varying relevance when used to describe document contents. This effect is captured by assigning a numerical weight to each index term of a document.
Indexing structure: the set of index terms (vocabulary) is organized in an index file so that the documents in which each term occurs can be identified easily.
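The term selection steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stop list is a tiny hypothetical one, and naive_stem is a toy suffix stripper written for this example, not a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop list; real systems use longer lists.
STOP_WORDS = {"i", "the", "was", "me", "so", "let", "it", "be", "with", "did", "you"}


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Toy stemmer (assumption): strips a trailing 'ed' or plural 's'."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith(("ss", "us")) and len(word) > 3:
        return word[:-1]
    return word


def index_terms(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


if __name__ == "__main__":
    print(index_terms("Friends, Romans, countrymen."))
    print(index_terms("Brutus killed me."))
```

The toy stemmer leaves irregular forms such as "countrymen" unchanged; mapping them to "countryman" needs a dictionary-based step that this sketch does not attempt.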
Basic Indexing Process

Documents to be indexed:   Friends, Romans, countrymen.
   -> Tokenizer
Token stream:              Friends  Romans  countrymen
   -> Linguistic preprocessor
Modified tokens:           friend  roman  countryman
   -> Indexer
Index file (inverted file):
   friend       2, 4
   roman        1, 2
   countryman   13, 16
Index file Evaluation Metrics
Running time of the main operations
• Access/search time: how long does it take to find the required search key in the list?
• Update time (insertion time, deletion time): how long does it take to update existing records when adding new terms or deleting unnecessary terms?
• Does the indexing structure allow incremental update, or does it require re-indexing?
Space overhead
• Computer storage space consumed for keeping the list.
Building Index file
An index file of a document is a file consisting of a list of index terms, each with a link to the one or more documents that contain the index term.
An index file is a list of search terms that are organized for associative lookup, i.e., to answer a user's query.
Various options are available for organizing the index file of a document collection.
Decide what data structure and/or file structure to use.
• Is it a sequential file, an inverted file, a suffix tree, etc.?
Sequential File
The sequential file is the most primitive file structure.
• It has no vocabulary and no linking pointers.
The records are generally arranged serially, one after another, but in lexicographic order on the value of some key field.
• A particular attribute is chosen as the primary key, whose value determines the order of the records.
• When the first key fails to discriminate among records, a second key is chosen to give an order.
Example:
Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sorting the Vocabulary
After all documents have been tokenized, stop words are removed, and normalization and stemming are applied to generate index terms. In a sequential file these index terms are sorted in alphabetical order.

Tokens in document order:

Term        Doc #
I           1
did         1
enact       1
julius      1
caesar      1
I           1
was         1
killed      1
I           1
the         1
capitol     1
brutus      1
killed      1
me          1
so          2
let         2
it          2
be          2
with        2
caesar      2
the         2
noble       2
brutus      2
hath        2
told        2
you         2
caesar      2
was         2
ambitious   2

Sequential file:

No.   Term       Doc
1     ambition   2
2     brutus     1
3     brutus     2
4     caesar     1
5     caesar     2
6     caesar     2
7     capitol    1
8     enact      1
9     julius     1
10    kill       1
11    kill       1
12    noble      2
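As a rough sketch of how such a sequential file could be produced, the code below flattens already preprocessed documents into (term, doc ID) records and sorts them, using the term as primary key and the document ID as secondary key. The function name and the input layout are assumptions made for this example.

```python
def build_sequential_file(docs):
    """Return (term, doc_id) records sorted by term, then by doc_id.

    docs maps a document ID to its list of index terms, assumed to be
    already tokenized, stop-word filtered, and stemmed.
    """
    records = [(term, doc_id) for doc_id, terms in docs.items() for term in terms]
    records.sort()  # tuples compare by term first, then by doc_id
    return records


# Index terms of the two example documents after preprocessing.
DOCS = {
    1: ["enact", "julius", "caesar", "kill", "capitol", "brutus", "kill"],
    2: ["caesar", "noble", "brutus", "caesar", "ambition"],
}

if __name__ == "__main__":
    for number, (term, doc_id) in enumerate(build_sequential_file(DOCS), start=1):
        print(number, term, doc_id)
```

Note that every occurrence keeps its own record, so "caesar" in Doc 2 and "kill" in Doc 1 each appear twice.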
Sequential File
To access records, search serially: starting at the first record, read and examine each succeeding record until the required record is found or the end of the file is reached.
Its main advantages:
• Easy to implement.
• Provides fast access to the next record using lexicographic order.
• Can be searched quickly using binary search, because the records are kept in sorted order.
Its disadvantages:
• No weights are attached to terms.
• Random access is slow, since similar terms are indexed individually.
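The two access strategies can be compared with a short sketch. It assumes the sequential file is held in memory as a sorted list of (term, doc ID) tuples; the function names are hypothetical, and bisect_left comes from Python's standard bisect module.

```python
from bisect import bisect_left

# Sorted sequential file for the two example documents.
RECORDS = [
    ("ambition", 2), ("brutus", 1), ("brutus", 2), ("caesar", 1),
    ("caesar", 2), ("caesar", 2), ("capitol", 1), ("enact", 1),
    ("julius", 1), ("kill", 1), ("kill", 1), ("noble", 2),
]


def serial_search(records, term):
    """Scan from the first record; return the first match position or -1."""
    for position, (current, _doc_id) in enumerate(records):
        if current == term:
            return position
    return -1


def binary_search(records, term):
    """Find the first record for term in O(log n); requires sorted records."""
    # The 1-tuple (term,) sorts just before every (term, doc_id) record.
    position = bisect_left(records, (term,))
    if position < len(records) and records[position][0] == term:
        return position
    return -1


if __name__ == "__main__":
    print(serial_search(RECORDS, "caesar"), binary_search(RECORDS, "caesar"))
```

Both functions return the same position; the difference is that the serial scan may read every record, while the binary search halves the remaining range at each step.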
Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it.
• Building and maintaining an inverted index is a relatively low-cost task.
• On a text of n words, an inverted index can be built in O(n) time.
• The list is inverted from a list of terms in location order to a list of terms in alphabetical order.

[Figure: original documents pass through word extraction to produce inverted files that map word IDs to document IDs, e.g. W1: d1, d2, d3; W2: d2, d4, d7, d9; …; Wn: di, …, dn]
Inverted file
Data to be held in the inverted file includes:
• The vocabulary (list of terms): the set of all distinct words (index terms) in the text collection.
• Location: all the text locations/positions where the word occurs.
• Frequency of occurrence of terms in the document collection:
   TFij: number of occurrences of term tj in document di
   DFj: number of documents containing tj
   TCF (also written CF): total frequency of tj in the corpus
   mi: maximum frequency of any term in di
   n: total number of documents in the collection
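The frequency figures listed above can all be derived in one pass over the preprocessed documents. The following is a minimal sketch; the function name and the dictionary-based input are assumptions for illustration.

```python
from collections import Counter


def collection_statistics(docs):
    """Compute TF, DF, CF, per-document maximum TF, and n.

    docs maps a document ID to its list of index terms.
    """
    tf = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    df = Counter()  # number of documents containing each term
    cf = Counter()  # total occurrences of each term in the corpus
    for counts in tf.values():
        df.update(counts.keys())  # each document counts once per term
        cf.update(counts)         # adds the per-document frequencies
    max_tf = {doc_id: max(counts.values()) for doc_id, counts in tf.items() if counts}
    return tf, df, cf, max_tf, len(docs)


DOCS = {
    1: ["enact", "julius", "caesar", "kill", "capitol", "brutus", "kill"],
    2: ["caesar", "noble", "brutus", "caesar", "ambition"],
}

if __name__ == "__main__":
    tf, df, cf, max_tf, n = collection_statistics(DOCS)
    print(tf[2]["caesar"], df["caesar"], cf["caesar"], max_tf[1], n)
```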
Inverted file

Having information about the location of each term within the document helps with:
• User interface design: highlighting the location of the search term.
• Proximity-based ranking: adjacency and near operators (in Boolean searching).
Having information about frequency is used for:
• Calculating term weights (like TF, TF*IDF, …).
• Optimizing query processing.
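To show how the stored frequencies feed a term weight, here is a sketch of one common TF*IDF formulation: the term frequency is normalized by the maximum frequency in the document and multiplied by log2(n / DF). The chapter names TF*IDF without giving a formula, so this particular variant and the function name are assumptions.

```python
import math


def tf_idf(tf, df, n_docs, max_tf):
    """Weight = (tf / max_tf) * log2(n_docs / df).

    One common variant (an assumption here); other normalizations and
    logarithm bases are also used.
    """
    if df == 0 or max_tf == 0:
        return 0.0
    return (tf / max_tf) * math.log2(n_docs / df)


if __name__ == "__main__":
    # "kill": 2 occurrences in Doc 1, found in 1 of the 2 documents.
    print(tf_idf(tf=2, df=1, n_docs=2, max_tf=2))
    # "caesar": found in both documents, so its IDF factor is zero.
    print(tf_idf(tf=2, df=2, n_docs=2, max_tf=2))
```

A term that occurs in every document gets weight zero under this variant, reflecting that it does not help distinguish one document from another.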
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Term     CF   Doc ID   TF   Location
term 1   3    2        1    66
              19       1    213
              29       1    45
term 2   4    3        1    94
              19       2    7, 212
              22       1    56
term 3   1    5        1    43
term 4   3    11       2    3, 70
              34       1    40

Is it possible to keep all this information during searching?


Construction of Inverted file
• An inverted index consists of two files: the vocabulary file and the posting file.
• A vocabulary file (word list) stores all of the distinct terms (keywords) that appear in any of the documents, in lexicographical order, and for each word a pointer to the posting file.
• The record kept for each term j in the word list contains the following:
   term j
   Number of documents in which term j occurs (DFj)
   Collection frequency of term j
   Pointer to the inverted (postings) list for term j
Postings File (Inverted List)
• For each distinct term in the vocabulary, it stores a list of pointers to the documents that contain that term.
• Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document.
• It is stored as a separate inverted list for each term, i.e., one list corresponding to each term in the index file.
• Each list consists of one or more individual postings.

Advantage of dividing inverted file:

Keeping a pointer in the vocabulary to the list in the posting file allows:
• The vocabulary to be kept in memory at search time, even for a large text collection, and
• The posting file to be kept on disk, from which the documents are accessed.
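The split can be sketched as follows. As a simplifying assumption, the posting file is modelled as one flat Python list standing in for a file on disk, and the "pointer" kept in the vocabulary is an offset into that list. All names are hypothetical.

```python
def split_index(inverted):
    """Split term -> [(doc_id, tf), ...] into a vocabulary and a posting file.

    Returns (vocabulary, posting_file), where vocabulary[term] is
    (df, cf, offset) and posting_file is a flat list of postings.
    """
    vocabulary = {}
    posting_file = []
    for term in sorted(inverted):
        postings = inverted[term]
        offset = len(posting_file)  # pointer to this term's first posting
        posting_file.extend(postings)
        cf = sum(tf for _doc_id, tf in postings)
        vocabulary[term] = (len(postings), cf, offset)
    return vocabulary, posting_file


def read_postings(vocabulary, posting_file, term):
    """Follow the pointer and read exactly df postings for the term."""
    if term not in vocabulary:
        return []
    df, _cf, offset = vocabulary[term]
    return posting_file[offset:offset + df]


if __name__ == "__main__":
    inverted = {"caesar": [(1, 1), (2, 2)], "brutus": [(1, 1), (2, 1)], "kill": [(1, 2)]}
    vocabulary, posting_file = split_index(inverted)
    print(vocabulary["caesar"], read_postings(vocabulary, posting_file, "caesar"))
```

Only the small vocabulary dictionary has to stay in memory; a lookup touches the posting file once, at the stored offset, and reads just the postings of the requested term.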
General structure of Inverted File
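The following figure shows the general structure of an inverted index file.
Organization of Index File

[Figure: the vocabulary (word list) holds one record per term with its DF, its TF, and a pointer to its posting; the pointers lead to the inverted lists in the postings file, which in turn refer to the documents.]

Term     DF   TF
term 1   3    3
term 2   3    4
term 3   1    1
term 4   2    3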
Example:
Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sorting the Vocabulary
After all documents have been tokenized, the inverted file is sorted by terms.

Tokens in document order      Sorted by term
Term        Doc #             Term        Doc #
I           1                 ambitious   2
did         1                 be          2
enact       1                 brutus      1
julius      1                 brutus      2
caesar      1                 caesar      1
I           1                 caesar      2
was         1                 caesar      2
killed      1                 capitol     1
I           1                 did         1
the         1                 enact       1
capitol     1                 hath        2
brutus      1                 I           1
killed      1                 I           1
me          1                 I           1
so          2                 it          2
let         2                 julius      1
it          2                 killed      1
be          2                 killed      1
with        2                 let         2
caesar      2                 me          1
the         2                 noble       2
noble       2                 so          2
brutus      2                 the         1
hath        2                 the         2
told        2                 told        2
you         2                 was         1
caesar      2                 was         2
was         2                 with        2
ambitious   2                 you         2
Remove stop words, stemming & compute frequency
Multiple term entries in a single document are merged and frequency information is added. Counting the number of occurrences of terms in the collection helps to compute TF.

Before merging           After merging
Term       Doc #         Term       Doc #   TF
ambition   2             ambition   2       1
brutus     1             brutus     1       1
brutus     2             brutus     2       1
caesar     1             caesar     1       1
caesar     2             caesar     2       2
caesar     2             capitol    1       1
capitol    1             enact      1       1
enact      1             julius     1       1
julius     1             kill       1       2
kill       1             noble      2       1
kill       1
noble      2
Vocabulary and postings file
The file is commonly split into a dictionary (vocabulary) and a posting file. Each vocabulary record holds a pointer to the term's postings.

Merged entries:

Term       Doc #   TF
ambition   2       1
brutus     1       1
brutus     2       1
caesar     1       1
caesar     2       2
capitol    1       1
enact      1       1
julius     1       1
kill       1       2
noble      2       1

Vocabulary                      Posting
Term       DF   CF   Pointer    (Doc #, TF)
ambition   1    1    ->         (2, 1)
brutus     2    2    ->         (1, 1), (2, 1)
caesar     2    3    ->         (1, 1), (2, 2)
capitol    1    1    ->         (1, 1)
enact      1    1    ->         (1, 1)
julius     1    1    ->         (1, 1)
kill       1    2    ->         (1, 2)
noble      1    1    ->         (2, 1)
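The whole worked example can be reproduced with a short sketch that merges repeated terms per document into (doc ID, TF) postings and derives DF and CF for the vocabulary. The function name and the in-memory dictionaries are assumptions for illustration; the input is the list of index terms left after stop word removal and stemming.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Return (vocabulary, postings) for docs: doc_id -> list of index terms.

    vocabulary[term] is (df, cf); postings[term] is a list of
    (doc_id, tf) pairs in increasing doc_id order.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Counter merges multiple entries of a term within one document.
        for term, tf in Counter(docs[doc_id]).items():
            postings[term].append((doc_id, tf))
    vocabulary = {
        term: (len(plist), sum(tf for _doc_id, tf in plist))
        for term, plist in postings.items()
    }
    return vocabulary, dict(postings)


DOCS = {
    1: ["enact", "julius", "caesar", "kill", "capitol", "brutus", "kill"],
    2: ["caesar", "noble", "brutus", "caesar", "ambition"],
}

if __name__ == "__main__":
    vocabulary, postings = build_inverted_index(DOCS)
    for term in sorted(vocabulary):
        print(term, vocabulary[term], postings[term])
```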
Searching on Inverted File
• Since the whole index file is divided into two, searching can be done faster by loading only the vocabulary list, which takes less memory even for a large document collection.
• Using binary search, the search takes logarithmic time. The search is done in the vocabulary list.
• Updating an inverted file is very complex, because both the vocabulary and posting files need to be updated.
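A minimal sketch of both operations follows, assuming the vocabulary is a sorted Python list of terms with a parallel list of postings lists. The names lookup and add_posting are hypothetical; bisect_left is from the standard bisect module.

```python
from bisect import bisect_left


def lookup(terms, postings_lists, query_term):
    """Binary-search the sorted vocabulary, then return the term's postings."""
    i = bisect_left(terms, query_term)
    if i < len(terms) and terms[i] == query_term:
        return postings_lists[i]
    return []


def add_posting(terms, postings_lists, term, doc_id, tf):
    """Add one posting; a new term changes the vocabulary and the postings."""
    i = bisect_left(terms, term)
    if i == len(terms) or terms[i] != term:
        terms.insert(i, term)          # keep the vocabulary sorted
        postings_lists.insert(i, [])   # keep both lists aligned
    postings_lists[i].append((doc_id, tf))


if __name__ == "__main__":
    terms = ["brutus", "caesar", "kill"]
    postings_lists = [[(1, 1), (2, 1)], [(1, 1), (2, 2)], [(1, 2)]]
    print(lookup(terms, postings_lists, "caesar"))
    add_posting(terms, postings_lists, "capitol", 1, 1)
    print(terms)
```

Inserting into the middle of a sorted list shifts every later entry, which hints at why updating a large on-disk index is expensive.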
Exercise: construct Inverted file
Consider the following original documents and construct the inverted file.
D1: The Department of Computer Science was established in 1984.
D2: The Department launched its first BSc in Computer Studies in 1987.
D3: Followed by the MSc in Computer Science which was started in 1991.
D4: The Department also produced its first PhD graduate in 1994.
D5: Our staff have contributed intellectually and professionally to the advancements in these fields.
Thank you
