Chapter-4 - Data Structure-File Structure

The document discusses data structures and algorithms for information retrieval. It describes sequential files and inverted files. A sequential file stores records sequentially with no linking pointers, requiring serial searching. An inverted file indexes documents by terms, with each term listing documents that contain it and term frequencies, allowing faster searching. The construction of an inverted file involves creating a vocabulary list of unique terms and corresponding posting files that contain document pointers and term locations.

Uploaded by

abraham getu


Data/File Structures and Algorithms for IR
Introduction
• A program is written in order to solve a problem.
• A solution to a problem actually consists of two things:
– A way to organize the data
– A sequence of steps to solve the problem
• The way data are organized in a computer's memory is called a data structure, and the sequence of computational steps to solve a problem is called an algorithm.
• Therefore, a program is nothing but data structures plus algorithms.
Sequential File
• A sequential file is the most primitive file structure.
• It has no vocabulary and no linking pointers.
• The records are generally arranged serially, one after another, in lexicographic order on the value of some key field.
– A particular attribute is chosen as the primary key, whose value determines the order of the records.
– When the primary key fails to discriminate among records, a second key is chosen to break ties.
Sequential File
• To access records, search serially:
– starting at the first record, read and examine each succeeding record until the required record is found or the end of the file is reached.

• Its main advantages are:

– easy to implement;
– provides fast access to the next record in lexicographic order;
– can be searched quickly, e.g., by binary search on the key field.
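The two access paths described above can be contrasted in a short sketch. It assumes the file is held in memory as a list of (key, payload) pairs sorted on the key; the record contents and the function names `serial_search` and `binary_search` are illustrative only and do not come from the slides.

```python
from bisect import bisect_left

# Hypothetical records, kept in lexicographic order of the key field.
records = [("act", "r1"), ("bus", "r2"), ("pen", "r3"), ("total", "r4")]

def serial_search(records, key):
    """Read records from the first one until the key is found or the file ends."""
    for position, (record_key, _payload) in enumerate(records):
        if record_key == key:
            return position
    return -1  # end of file reached without a match

def binary_search(records, key):
    """Exploit the sort order on the key field to halve the range at each step."""
    keys = [record_key for record_key, _payload in records]
    i = bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1
```

Both functions return the position of the matching record, or -1 when the key is absent. Binary search only works because the records are ordered on the key being searched.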
Sequential File
• Its disadvantages:
– difficult to update: inserting a new record may require moving a large proportion of the file, and any index must be rebuilt if a new term is added;
– random access is extremely slow.

• The update problem can be solved by ordering records by date of acquisition rather than by key value; the newest entries are then added to the end of the file and pose no difficulty for updating.
Inverted file
• A word-oriented indexing mechanism based on a sorted list of keywords, with each keyword having links to the documents containing it
– Building and maintaining an inverted index is a relatively low-cost task.

• Data to be held in the inverted file includes the list of index terms and, for each term:
– fij, the number of occurrences of term tj in document di
– nj, the number of documents containing tj
– mi, the maximum frequency of any term in di
– n, the total number of documents in the collection
– tf, the total frequency of tj across the nj documents that contain it
– ….
Inverted file
• The inverted file contains:
– The vocabulary (list of terms)
– The occurrences (location and frequency of terms in a document collection)

• The vocabulary is the set of all distinct words (index terms) in the text collection.
– The collection is organized by terms

• The occurrences contain one record per term, listing:
– all the text locations/positions where the word occurs
– the frequency of each term in a document, i.e. the number of occurrences of the keyword in that document
Inverted file
• Having information about the vocabulary (list of terms)
– speeds up searching for relevant documents
• Having information about the location of each term within the document helps with:
– user interface design: highlighting the location of the search term
– proximity-based ranking: adjacency and near operators (in Boolean searching)

• Having information about frequency is used for:
– calculating term weights (like TF, TF*IDF, …)
– optimizing query processing
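As an illustration of how the stored statistics feed term weighting, here is a minimal sketch of one common TF*IDF variant. The slides do not state a formula, so the normalisation chosen here (term frequency divided by the maximum frequency in the document, and a base-2 logarithm for IDF) is an assumption, as is the function name `tf_idf`.

```python
import math

def tf_idf(f_ij, m_i, n, n_j):
    """Weight of term tj in document di under one common TF*IDF variant.

    f_ij: occurrences of tj in di
    m_i:  maximum frequency of any term in di
    n:    total number of documents in the collection
    n_j:  number of documents containing tj
    """
    tf = f_ij / m_i            # normalised term frequency (assumed variant)
    idf = math.log2(n / n_j)   # inverse document frequency (assumed variant)
    return tf * idf
```

Under this variant a term that occurs in every document gets weight 0, because n / nj = 1 and its logarithm is 0.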
Inverted File
Documents are organized by the terms/words they contain. This is called an index file. Text operations are performed before building the index.

Word    Tot Freq   Document ID   Term Freq   Location
Act     3          2             1           66
                   19            1           213
                   29            1           45
bus     4          3             1           94
                   19            2           7, 212
                   22            1           56
Pen     1          5             1           43
total   3          11            2           3, 70
                   34            1           40
Construction of Inverted file
An inverted index consists of two files: the vocabulary file and the posting file.
• A vocabulary file (word list):
– stores all of the distinct terms (keywords) that appear in any of the documents (in lexicographical order), and
– for each word, a pointer to the posting file

• The record kept for each term j in the word list contains the following:
– term j
– frequency of the term in a given document
– number of documents in which term j occurs (nj)
– total frequency of term j
– pointer to the inverted (postings) list for term j
Postings File (Inverted List)
• For each distinct term in the vocabulary, the postings file stores a list of pointers to the documents that contain that term.
• Each element in an inverted list is called a posting, i.e., the occurrence of a term in a document
• It is stored as a separate inverted list for each term in the index file.
– Each list consists of one or many individual postings

Advantage of dividing the inverted file:

• Keeping a pointer in the vocabulary to the list in the posting file allows:
– the vocabulary to be kept in memory at search time, even for a large text collection, and
– the posting file to be kept on disk for access to the documents
Organization of Index File
Vocabulary (word list) → Postings (inverted list) → Documents

Term    No of Doc   Tot freq   Pointer to posting
Act     3           3          → inverted list
Bus     3           4          → inverted list
pen     1           1          → inverted list
total   2           3          → inverted list
Example:
• Given a collection of documents, they are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sorting the Vocabulary
• After all documents have been parsed the inverted file is sorted
by terms
– Inverted index may record term locations within document during parsing
Parsing order          Sorted by term
Term       Doc #       Term       Doc #
I          1           ambitious  2
did        1           be         2
enact      1           brutus     1
julius     1           brutus     2
caesar     1           caesar     1
I          1           caesar     2
was        1           caesar     2
killed     1           capitol    1
i'         1           did        1
the        1           enact      1
capitol    1           hath       2
brutus     1           I          1
killed     1           I          1
me         1           i'         1
so         2           it         2
let        2           julius     1
it         2           killed     1
be         2           killed     1
with       2           let        2
caesar     2           me         1
the        2           noble      2
noble      2           so         2
brutus     2           the        1
hath       2           the        2
told       2           told       2
you        2           was        1
caesar     2           was        2
was        2           with       2
ambitious  2           you        2
Remove duplicate terms & add frequency
• Multiple term entries in a single document are merged and frequency information is added
• Counting the number of occurrences of terms in the collection helps to compute TF

Term       Doc #   Freq
ambitious  2       1
be         2       1
brutus     1       1
brutus     2       1
caesar     1       1
caesar     2       2
capitol    1       1
did        1       1
enact      1       1
hath       2       1
I          1       2
i'         1       1
it         2       1
julius     1       1
killed     1       2
let        2       1
me         1       1
noble      2       1
so         2       1
the        1       1
the        2       1
told       2       1
was        1       1
was        2       1
with       2       1
you        2       1
Vocabulary and postings file
The file is commonly split into a Dictionary and a Postings file. Each dictionary entry holds a pointer to its postings list.

Dictionary                               Postings
Term        No. of docs   Tot Freq       (Doc #, Freq)
ambitious   1             1          →   (2, 1)
be          1             1          →   (2, 1)
brutus      2             2          →   (1, 1), (2, 1)
caesar      2             3          →   (1, 1), (2, 2)
capitol     1             1          →   (1, 1)
did         1             1          →   (1, 1)
enact       1             1          →   (1, 1)
hath        1             1          →   (2, 1)
I           1             2          →   (1, 2)
i'          1             1          →   (1, 1)
it          1             1          →   (2, 1)
julius      1             1          →   (1, 1)
killed      1             2          →   (1, 2)
let         1             1          →   (2, 1)
me          1             1          →   (1, 1)
noble       1             1          →   (2, 1)
so          1             1          →   (2, 1)
the         2             2          →   (1, 1), (2, 1)
told        1             1          →   (2, 1)
was         2             2          →   (1, 1), (2, 1)
with        1             1          →   (2, 1)
you         1             1          →   (2, 1)
Inverted index storage
• Separating the inverted file into a vocabulary and a posting file is a good idea.
– Vocabulary: for searching purposes we need only the word list. This allows the vocabulary to be kept in memory at search time, since the space required for the vocabulary is small.

– The posting file requires much more space.

• For each word appearing in the text we keep statistical information related to word occurrence in documents.
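The construction steps (parse, sort, merge duplicates, split into vocabulary and postings) can be sketched on the two Julius Caesar documents. This is a minimal in-memory sketch, not a disk-based implementation: the tokenizer, the function names and the dictionary field names `n_docs` and `tot_freq` are assumptions made for illustration, and all terms are lowercased, so "I" appears as "i".

```python
import re
from collections import Counter, defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Simplistic text operation: lowercase, keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def build_inverted_index(docs):
    postings = defaultdict(list)  # term -> [(doc_id, freq), ...]
    for doc_id in sorted(docs):
        # Merge repeated terms within one document into a frequency count.
        for term, freq in Counter(tokenize(docs[doc_id])).items():
            postings[term].append((doc_id, freq))
    vocabulary = {}
    for term in sorted(postings):  # vocabulary kept in lexicographical order
        plist = postings[term]
        vocabulary[term] = {
            "n_docs": len(plist),                   # documents containing the term
            "tot_freq": sum(f for _, f in plist),   # total frequency in the collection
        }
    return vocabulary, dict(postings)

vocabulary, postings = build_inverted_index(docs)
```

Here the term itself plays the role of the pointer from the vocabulary to its postings list. In a real system the vocabulary entry would instead hold an offset into the postings file on disk.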
Suffix trees and suffix arrays
Suffix trie
• What is a suffix? A suffix is a substring that exists at the end of the given string.
– Each position in the text is considered as a text suffix
– If txt = t1 t2 ... ti ... tn is a string, then Ti = ti ti+1 ... tn is the suffix of txt that starts at position i
• Example:
txt = mississippi        txt = GOOGOL
T1  = mississippi        T1 = GOOGOL
T2  = ississippi         T2 = OOGOL
T3  = ssissippi          T3 = OGOL
T4  = sissippi           T4 = GOL
T5  = issippi            T5 = OL
T6  = ssippi             T6 = L
T7  = sippi
T8  = ippi
T9  = ppi
T10 = pi
T11 = i
Suffix trie
• A suffix trie is an ordinary trie in which the input strings are all possible suffixes.
• Principle: the idea behind a suffix trie is to assign to each symbol in a text an index corresponding to its position in the text (i.e., the first symbol has index 1 and the last symbol has index n, the number of symbols in the text).
• To build the suffix trie we use these indices instead of the actual object.
• The structure has several advantages:
– It requires less storage space.
– We do not have to worry about how the text is represented (binary, ASCII, etc.).
– We do not have to store the same object twice (no duplicates).
Suffix Trie
• Construct a suffix trie for the following string: GOOGOL
• We begin by giving a position to every suffix in the text, from left to right, following the order of the characters in the string.
• TEXT:     G O O G O L $
  POSITION: 1 2 3 4 5 6 7
• Build a suffix trie for all n suffixes of the text.
• Note: the resulting tree has n leaves and height n.

• This structure is particularly useful for any application requiring prefix-based ("starts with") pattern matching.
Suffix tree
• A suffix tree is a member of the trie family. It is a trie of all the suffixes of S
– The suffix tree is created by compacting the unary nodes of the suffix trie.
• We store pointers rather than words in the leaves.
– It is also possible to replace the string on every edge by a pair (a,b), where a and b are the beginning and end index of the string, e.g., for GOOGOL$:
(3,7) for OGOL$
(1,2) for GO
(7,7) for $
• To make suffixes prefix-free we add a special character, $, at the end of s. When the tree indexes a set of strings S, add a different special symbol to each string s so that every suffix is associated with a unique string in S.
Search in suffix tree
• Searching for all instances of a substring S in a suffix tree is easy, since any substring of the text is the prefix of some suffix.
• Pseudo-code for searching in a suffix tree:
– Start at the root
– Go down the tree, each time taking the path that corresponds to the next symbols of S
– If S corresponds to a node x, then return all leaves in the subtree rooted at x
• the places where S can be found are given by the pointers in all the leaves in the subtree rooted at x.
– If a NIL pointer is encountered before reaching the end of S, then S is not in the tree
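The search steps above can be sketched in Python. To keep the example short it uses an uncompacted suffix trie (nested dictionaries) rather than a true suffix tree, so it needs space quadratic in the text length. The function names and the "leaf" key are assumptions made for illustration.

```python
def build_suffix_trie(text):
    """Insert every suffix of text; each leaf stores the 1-based start position."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # "leaf" cannot clash with an edge label, since edge labels are single characters.
        node["leaf"] = start + 1
    return root

def collect_leaves(node):
    """Return the positions stored in all leaves of the subtree rooted at node."""
    found = []
    for key, child in node.items():
        if key == "leaf":
            found.append(child)
        else:
            found.extend(collect_leaves(child))
    return found

def search(root, pattern):
    node = root  # start at the root
    for ch in pattern:
        if ch not in node:  # NIL pointer: the pattern is not in the text
            return []
        node = node[ch]
    return sorted(collect_leaves(node))

trie = build_suffix_trie("GOOGOL$")
```

With this sketch, `search(trie, "GO")` returns positions 1 and 4, the starts of GOOGOL$ and GOL$, and `search(trie, "OR")` returns an empty list.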
Example:
• If S = "GO" we take the GO path and return: GOOGOL$, GOL$.
• If S = "OR" we take the O path and then hit a NIL pointer, so "OR" is not in the tree.
Suffix Tree Applications
• Suffix trees can be used to solve a large number of string problems that occur in:
– text editing,
– free-text search,
– etc.

• Some examples of string problems are given below.
– String matching
– Longest common substring
– Longest repeated substring
– etc.
Drawbacks
• Suffix trees consume a lot of space

• How many bytes are required to store MISSISSIPPI?
Suffix array
• A suffix array is more compact than a suffix tree.
– Suffix arrays are a space-efficient implementation of suffix trees

• Like a suffix tree, a suffix array indexes all the suffixes of a given string; it is a list of those suffixes sorted in lexicographical order.
– The sorted list is represented as an array of integers that identify the suffixes in order.
– This allows binary search, and hence fast substring search.

• Main drawbacks:
– its costly construction process,
– the need for the text to be readily available at query time
Building suffix array
• Procedure:
– Identify the suffixes of the given string
– Sort the suffixes lexicographically
– Store the indices of all the suffixes in a table.

• The suffix array gives the indices of the suffixes in sorted order

• Consider the string "good".
– In lexicographical order, the suffixes are "d", "good", "od", and "ood".
– The suffix array is [4, 1, 3, 2]. A special character is usually appended to the end of the string.
Building a suffix array
• Example:
• Given the string S = GOOGOL, construct its suffix array
• Sort the suffixes in lexicographical order and store all the indices in a table.
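The procedure can be sketched directly in Python. This is a naive construction that sorts the suffixes as strings, which is fine for short examples but is not one of the efficient construction algorithms. The function names are assumptions, positions are 1-based as in the slides, and no terminator symbol is appended.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of the suffixes in lexicographical order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def find_occurrences(text, suffix_array, pattern):
    """Binary-search the suffix array for the suffixes that start with pattern."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # find the first suffix that is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid] - 1:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    matches = []
    # Suffixes sharing the prefix are adjacent in the sorted order.
    while lo < len(suffix_array) and text[suffix_array[lo] - 1:].startswith(pattern):
        matches.append(suffix_array[lo])
        lo += 1
    return sorted(matches)

googol_sa = build_suffix_array("GOOGOL")  # [4, 1, 6, 3, 5, 2]
```

For GOOGOL the sorted suffixes are GOL, GOOGOL, L, OGOL, OL, OOGOL, which gives the array [4, 1, 6, 3, 5, 2]. Note that `find_occurrences` needs the text itself at query time, which is the drawback noted for suffix arrays.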

Signature file
• Word-oriented index structure based on hashing
• How to build a signature file:
– Hash each word to a fixed-size F-bit vector (word signature)
– Divide the text into blocks of N words each
– Assign an F-bit mask to each text block of size N (document signature)
• This is obtained by bitwise ORing the signatures of all the words in the text block.
• Hence the signature file is no more than the sequence of bit masks of all blocks (plus a pointer to each block).
Structure of Signature File
[Figure: the signature file holds one F-bit document signature and one pointer per block, N blocks in all; each pointer leads to the corresponding block in the text file.]
Example
• Given a text:
A text has many words. Words are made from letters

• Text signature:
1110101 0111100 1011111

• Signature (hash) function:

• h(text) = 1000101
• h(many) = 0110101
• h(word) = 0111100
• h(made) = 0010111
• h(letter) = 1001011

• Each block signature is the bitwise OR of its word signatures, e.g., block 1 is h(text) OR h(many) = 1110101.
• Block 4: 001100 OR 100001 = 101101
Searching
• During query processing:
– Hash the query to an F-bit mask Q
– Compare the query signature with the document signature of each block, that is,
• bitwise AND the query mask with the bit mask Bi of every text block
– If all corresponding 1-bits are “on” in the document signature, the document probably contains that term, that is,
• if Q & Bi = Q, all the bits set in Q are also set in Bi and therefore the text block may contain the word
• The main idea of a signature file is that if a word is present in a text block, then all the bits set in its signature are also set in the bit mask of the text block
– Hence if a bit is set in the mask of the query word and not in the mask of the text block, then the word is not present in the text block
Signature file trivia
• Signature files can lead to mismatches.
– It is possible that all the corresponding bits are set even though the word is not there. This is called a false drop.

• False drop or false positive

– A document that is retrieved by a search but is not relevant to the searcher’s needs
– In retrieval generally, false drops also occur because of words that are written the same but have different meanings.
– Example: ‘squash’ can refer to a game, a vegetable or an action
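Building and querying a signature file can be sketched with the 7-bit word signatures from the example. In this sketch the signatures are looked up in a table instead of being computed by a hash function. The split of the text into three blocks of preprocessed words (stopwords dropped, plurals reduced to "word" and "letter") is an assumption chosen because it reproduces the text signature given in the example.

```python
# Word signatures copied from the example (F = 7 bits).
WORD_SIGNATURES = {
    "text":   0b1000101,
    "many":   0b0110101,
    "word":   0b0111100,
    "made":   0b0010111,
    "letter": 0b1001011,
}

# Assumed block contents after text operations.
blocks = [["text", "many"], ["word", "word"], ["made", "letter"]]

def block_signature(words):
    """Bitwise OR of the signatures of all words in the block."""
    signature = 0
    for word in words:
        signature |= WORD_SIGNATURES[word]
    return signature

signature_file = [block_signature(block) for block in blocks]

def candidate_blocks(query_word):
    """Return the 1-based numbers of blocks whose mask Bi satisfies Q & Bi == Q."""
    q = WORD_SIGNATURES[query_word]
    return [number for number, b in enumerate(signature_file, start=1) if q & b == q]
```

With these signatures, `candidate_blocks("word")` returns only block 2, while `candidate_blocks("text")` returns blocks 1 and 3. Block 3 does not contain "text", so it is a false drop produced by overlapping bits, which is why candidate blocks must still be checked against the text.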
