
IR ch4 - Inverted-Index

The document discusses inverted indexes, which are a common data structure used in information retrieval systems like search engines. An inverted index stores a list of documents that contain each word in the vocabulary. It discusses how inverted indexes are constructed by parsing documents, building term-document matrices, sorting the postings lists, merging entries, and writing the results to a dictionary file and postings file. It also compares different implementations of inverted indexes using arrays, linked lists, B-trees, hash tables, and discusses issues like dynamic indexing when documents are frequently added, deleted or updated.

Uploaded by

Bushra Mamoud

CS444: Information Retrieval

and Web Search


Fall 2021

CHAPTER 4:
INVERTED INDEX
Abstraction of search engine architecture
[Figure: search engine architecture. Components shown: Crawler, Indexed corpus, Doc Analyzer, Doc Representation, Indexer, Index, Ranker, Ranking procedure, (Query), Query Rep, User, results, Feedback, Evaluation.]

CS444 INFORMATION RETRIVAL & WEB SEARCH ENGIN BY ZAINAB AHMED MOHAMMED 2
INVERTED
INDEX
Inverted Files: Main Concepts
In practice, document vectors are not stored directly; an inverted organization provides much
better efficiency.
Inverted index: a word-oriented mechanism for indexing a text collection to speed up the
searching task
The inverted index structure is composed of two elements: the vocabulary and the
occurrences
The vocabulary is the set of all distinct words in the text
For each word in the vocabulary, the index stores the documents that contain that word
(hence "inverted" index)
The keyword-to-document index can be implemented as:
◦ a sorted array, a tree-based data structure (trie, B-tree), or a hash table
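
A minimal sketch of the vocabulary/occurrences idea, using a Python dict (a hash table) as the keyword-to-document index. The function name `build_inverted_index` and the three sample documents are illustrative assumptions, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # The keys form the vocabulary; the values are the occurrences.
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample collection: docID -> text.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_inverted_index(docs)
print(index["home"])  # [1, 2, 3]
print(index["july"])  # [2, 3]
```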

Inverted Files: Main Concepts
Term-document matrix: the simplest way to represent the documents that contain each word of
the vocabulary

Inverted Files: Main Concepts
The main problem of this simple solution is that it requires too much space
As this is a sparse matrix, the solution is to associate a list of documents with
each word
The set of all those lists is called the occurrences
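
To make the space argument concrete, the following sketch builds a small term-document matrix and then the equivalent occurrence lists. The three documents are made up for illustration; the counts at the end compare the number of matrix cells with the number of list entries actually stored.

```python
# Hypothetical sample collection: docID -> text.
docs = {
    0: "to be or not to be",
    1: "to do is to be",
    2: "do be do",
}
tokenized = {d: text.split() for d, text in docs.items()}
doc_ids = sorted(docs)
vocabulary = sorted({w for words in tokenized.values() for w in words})

# Term-document matrix: one row per word, one column per document.
matrix = [[1 if w in tokenized[d] else 0 for d in doc_ids] for w in vocabulary]

# Occurrences: for each word, keep only the documents whose cell is 1.
occurrences = {
    w: [d for d, cell in zip(doc_ids, row) if cell]
    for w, row in zip(vocabulary, matrix)
}

cells = len(vocabulary) * len(docs)                      # size of the matrix
stored = sum(len(lst) for lst in occurrences.values())   # size of the lists
print(cells, stored)  # 18 10
```

On a real collection the gap is far larger, because the matrix grows with (vocabulary size × number of documents) while the lists grow only with the number of non-zero cells.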

Inverted Index Construction: Steps
For each term T, we must store a list of all documents that contain T.
• Do we use an array or a list for this?
[Figure: keyword-to-documents relation, mapping each keyword to the documents that contain it.]

Inverted Index Construction: Steps
Linked lists are generally preferred to arrays
• Advantage: dynamic space allocation
• Advantage: inserting new entries into a list is easy
• Drawback: space overhead of pointers

Inverted Index Construction: Steps
• Parse the documents into a sequence of (modified token, document ID) pairs.

Inverted Index Construction:
Sorting
Sort by terms

Inverted Index Construction:
Merge
Multiple term entries in a single document are merged.
• Frequency information is added.

Inverted Index Construction:
Steps
• The result is split into
a Dictionary file and
a Postings file.

[Figure: the merged entries split into a dictionary file and a postings file.]
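
The four construction steps (pairs, sort, merge, split) can be sketched in a few lines. This is an in-memory illustration under simplifying assumptions: the "modified token" is just the lowercased word, the dictionary stores a document frequency and an offset into the postings, and the names `build_index` and `postings_for` are hypothetical.

```python
from itertools import groupby


def build_index(docs):
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [(tok.lower(), doc_id)
             for doc_id, text in docs.items() for tok in text.split()]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: merge repeated (term, docID) entries and add the frequency.
    merged = [(term, doc_id, len(list(group)))
              for (term, doc_id), group in groupby(pairs)]
    # Step 4: split into a dictionary and a postings list.
    dictionary = {}  # term -> (document frequency, offset into postings)
    postings = []    # flat list of (docID, term frequency)
    for term, group in groupby(merged, key=lambda entry: entry[0]):
        entries = [(doc_id, tf) for _, doc_id, tf in group]
        dictionary[term] = (len(entries), len(postings))
        postings.extend(entries)
    return dictionary, postings


def postings_for(term, dictionary, postings):
    """Follow the dictionary entry into the postings list."""
    doc_freq, offset = dictionary[term]
    return postings[offset:offset + doc_freq]


dictionary, postings = build_index({1: "home sales home", 2: "sales rise"})
print(dictionary)  # {'home': (1, 0), 'rise': (1, 1), 'sales': (2, 2)}
print(postings_for("sales", dictionary, postings))  # [(1, 1), (2, 1)]
```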

Inverted Index Construction:
Sorting issues
As we build the index, we parse documents one at a time.
▶ The final postings for any term are incomplete until the end.
▶ Can we keep all postings in memory and then do the sort in-memory at the end?
▶ No, not for large collections
▶ At 10–12 bytes per postings entry, we need a lot of space for large collections.
▶ In-memory index construction does not scale for large collections.
▶ Thus: We need to store intermediate results on disk.
▶ Running the same sorting algorithm directly on disk is too slow: it needs too many disk seeks.
▶ We need an external sorting algorithm (using few disk seeks).

“External” sorting algorithm
To reduce space requirements, a technique called block addressing is used
The documents are divided into blocks, and the occurrences point to the blocks where the
word appears
Each block is chosen small enough that its postings fit easily into memory.
Basic idea of algorithm:
For each block:
accumulate postings,
sort in memory,
write to disk

Then merge the sorted blocks into one long sorted sequence.
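
A minimal sketch of the block-by-block idea, assuming postings arrive as (term, docID) pairs and that each sorted block is written to a tab-separated text file. The function names, the file naming scheme, and the tiny block size are illustrative assumptions; `heapq.merge` performs the final merge while reading every file sequentially.

```python
import heapq
import os
import tempfile
from itertools import islice


def write_sorted_runs(pairs, block_size, directory):
    """Accumulate one block at a time, sort it in memory, write it to disk."""
    pairs = iter(pairs)
    run_paths = []
    while True:
        block = sorted(islice(pairs, block_size))
        if not block:
            break
        path = os.path.join(directory, f"run{len(run_paths)}.txt")
        with open(path, "w") as f:
            for term, doc_id in block:
                f.write(f"{term}\t{doc_id}\n")
        run_paths.append(path)
    return run_paths


def read_run(path):
    """Stream one sorted run back from disk."""
    with open(path) as f:
        for line in f:
            term, doc_id = line.rstrip("\n").split("\t")
            yield term, int(doc_id)


def merge_runs(run_paths):
    """Merge the sorted runs into one sorted stream of postings."""
    return heapq.merge(*(read_run(p) for p in run_paths))


pairs = [("caesar", 2), ("brutus", 1), ("caesar", 1),
         ("brutus", 4), ("calpurnia", 2)]
with tempfile.TemporaryDirectory() as tmp:
    runs = write_sorted_runs(pairs, block_size=2, directory=tmp)
    merged = list(merge_runs(runs))
print(merged[:2])  # [('brutus', 1), ('brutus', 4)]
```

Only one block is held in memory at a time during the first phase, and the merge phase keeps just one pending entry per run.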

“External” sorting algorithm

“External” sorting algorithm PROBLEM
Our assumption was: we can keep the dictionary in memory.
We need the dictionary (which grows dynamically) in order to implement a term to termID
mapping.
Actually, we could work with term,docID postings instead of termID,docID postings . . .
. . . but then intermediate files become very large. (We would end up with a scalable, but very
slow index construction method.)

“External” sorting algorithm SOLUTION
Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID
mapping across blocks.
Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block.
These separate indexes can then be merged into one big index.
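
The two key ideas can be sketched as follows. This is a simplified in-memory illustration: a real implementation would write each block index to disk, whereas here the blocks are simply yielded. It assumes the token stream arrives in docID order; the names `single_pass_invert`, `merge_blocks`, and `max_postings` are hypothetical.

```python
def single_pass_invert(token_stream, max_postings):
    """Yield one complete block index at a time, keyed by term (no termIDs)."""
    block, count = {}, 0
    for term, doc_id in token_stream:
        postings = block.setdefault(term, [])    # per-block dictionary
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)              # no sorting: append as seen
            count += 1
        if count >= max_postings:                # block is "full"
            yield block
            block, count = {}, 0
    if block:
        yield block


def merge_blocks(blocks):
    """Merge the per-block indexes into one big index."""
    merged = {}
    for block in blocks:
        for term, postings in block.items():
            target = merged.setdefault(term, [])
            for doc_id in postings:
                # A document may straddle two blocks; skip the repeated docID.
                if not target or target[-1] != doc_id:
                    target.append(doc_id)
    return {term: merged[term] for term in sorted(merged)}


stream = [("brutus", 1), ("caesar", 1), ("brutus", 2),
          ("noble", 2), ("brutus", 2), ("caesar", 3)]
blocks = list(single_pass_invert(stream, max_postings=3))
index = merge_blocks(blocks)
print(index)  # {'brutus': [1, 2], 'caesar': [1, 3], 'noble': [2]}
```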

Other index implementations
Vocabulary and therefore dimensionality of vectors can be very large, ~10^4.
However, most documents and queries do not contain most words, so vectors are sparse (i.e.
most entries are 0).
Need efficient methods for storing and computing with sparse vectors.
So far we have shown sparse vectors as linked lists (or sorted arrays)
Store vectors as linked lists:
◦ Space proportional to number of unique tokens (n) in document.
◦ Requires linear search of the list to find (or change) a specific token.
◦ Requires quadratic time in worst case to compute vector for a document: O(n^2)

Inverted Index as B Trees
Index tokens in a document in a balanced binary tree with weights stored with tokens at the
leaves.

B+ Tree
A B+ tree stores data pointers only at the leaf nodes of the tree.
Because data pointers are present only at the leaf nodes:
◦ the leaf nodes must store all the key values along with their corresponding data pointers to
the disk file blocks, in order to access them.
◦ Moreover, the leaf nodes are linked together to provide ordered access to the records.
◦ The leaf nodes therefore form the first level of the index, with the internal nodes forming the other
levels of a multilevel index.
◦ Some of the key values of the leaf nodes also appear in the internal nodes, simply to guide
the search for a record.

B tree
Space overhead for tree structure: ~2n nodes.
O(log n) time to find or update weight of a specific token.
O(n log n) time to construct vector.
Need software package to support such data structures.

Inverted Index as HashTables
Store tokens in hashtable, with token string as key and weight as value.
◦ Storage overhead for hashtable ~1.5n.
◦ Table must fit in main memory.
◦ Constant time to find or update weight of a specific token.
◦ O(n) time to construct vector
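
A short sketch of the hashtable representation, using a Python dict with the token string as key. The weight is assumed here to be the raw token count, which is only one possible weighting; the name `token_vector` is hypothetical.

```python
def token_vector(tokens):
    """Build a sparse vector as a hashtable: token string -> weight."""
    weights = {}
    for tok in tokens:
        # Expected constant time to find or update one token's weight,
        # so the whole vector is built in O(n) time.
        weights[tok] = weights.get(tok, 0) + 1
    return weights


vec = token_vector("to be or not to be".split())
print(vec["to"])          # 2
print(vec.get("is", 0))   # 0 (absent tokens are simply not stored)
```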

Hash index

B trees VS Hash index
You can work out the differences from the previous slides!

Dynamic indexing
Thus far, we have assumed that the document collection is static.
But most collections are modified frequently (documents are added, deleted, and updated).
This means that new terms need to be added to the dictionary, and postings lists need to be updated
for existing terms.

The simplest way to achieve this is to periodically reconstruct the index from scratch.
This is a good solution if the number of changes over time is small and a delay in making new
documents searchable is acceptable.
If new documents must become searchable quickly, one solution is to maintain two indexes: a large main index and a small auxiliary index that
stores new documents.
The auxiliary index is kept in memory.
Searches are run across both indexes and results merged.
Deletions are stored in an invalidation bit vector.
We can then filter out deleted documents before returning the search result.
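
A minimal sketch of the main-plus-auxiliary scheme. The class name `DynamicIndex` and its methods are hypothetical, both indexes are plain dicts for brevity, and the invalidation bit vector is a list of booleans indexed by docID. An update can be treated as a deletion followed by an addition under a new docID.

```python
class DynamicIndex:
    def __init__(self, main_index, num_docs):
        self.main = main_index               # large main index: term -> docIDs
        self.aux = {}                        # small in-memory auxiliary index
        self.deleted = [False] * num_docs    # invalidation bit vector

    def add(self, doc_id, text):
        """New documents go into the auxiliary index only."""
        while len(self.deleted) <= doc_id:
            self.deleted.append(False)
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        """Deletion only flips a bit; the postings stay where they are."""
        self.deleted[doc_id] = True

    def search(self, term):
        """Search both indexes, merge, then filter out deleted documents."""
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in sorted(set(hits)) if not self.deleted[d]]


idx = DynamicIndex({"home": [0, 1], "sales": [1]}, num_docs=2)
idx.add(2, "home prices")
idx.delete(0)
print(idx.search("home"))    # [1, 2]
print(idx.search("prices"))  # [2]
```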
Distributed indexing
Collections are often so large that we cannot perform index construction efficiently on a single
machine.
This is particularly true of the World Wide Web for which we need large computer clusters to
construct any reasonably sized web index.
Web search engines, therefore, use distributed indexing algorithms for index construction.
The result of the construction process is a distributed index that is partitioned across several
machines - either according to term or according to document.
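
The two partitioning choices can be illustrated on a toy index. This is only a sketch of how postings might be assigned to machines, not a distributed construction algorithm: the assignment rules (a CRC32 hash of the term, and docID modulo the number of machines) and the function names are assumptions made for the example.

```python
import zlib


def partition_by_term(index, num_machines):
    """Each machine holds the complete postings lists of a subset of terms."""
    shards = [{} for _ in range(num_machines)]
    for term, postings in index.items():
        machine = zlib.crc32(term.encode("utf-8")) % num_machines
        shards[machine][term] = postings
    return shards


def partition_by_document(index, num_machines):
    """Each machine holds a full index over a subset of the documents."""
    shards = [{} for _ in range(num_machines)]
    for term, postings in index.items():
        for doc_id in postings:
            shards[doc_id % num_machines].setdefault(term, []).append(doc_id)
    return shards


index = {"brutus": [1, 2, 4], "caesar": [1, 2, 5, 6], "calpurnia": [2]}
term_shards = partition_by_term(index, 2)
doc_shards = partition_by_document(index, 2)
print(doc_shards[0])  # {'brutus': [2, 4], 'caesar': [2, 6], 'calpurnia': [2]}
print(doc_shards[1])  # {'brutus': [1], 'caesar': [1, 5]}
```

With term partitioning a single-term query touches one machine; with document partitioning every machine answers every query over its own documents and the results are combined.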

Signature Files
Signature files are word-oriented index structures based on hashing
They impose a low space overhead, at the cost of forcing a sequential search over the index
Since their search complexity is linear, they are suitable only for texts that are not very large
Nevertheless, inverted indexes outperform signature files for most applications

Signature Files Structure
A signature file divides the text into blocks of b words each, and maps each word to a bit mask (signature) of B bits
The mask of a text block is obtained by bit-wise ORing the signatures of all the words in the block

Signature Files Structure
If a word is present in a text block, then its signature is also set in the bit mask of the text block
Hence, if a query signature is not in the mask of the text block, then the word is not present in
the text block
However, it is possible that all the corresponding bits are set even though the word is not there
This is called a false drop

A delicate part of the design of a signature file is:
◦ to ensure the probability of a false drop is low, and
◦ to keep the signature file as short as possible

The hash function is forced to deliver bit masks which have at least ℓ bits set
A good model assumes that ℓ bits are randomly set in the mask (with possible repetition)
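
A small sketch of a signature file, under stated assumptions: masks have B = 16 bits, each word sets up to 3 bit positions derived from a SHA-256 digest (positions may coincide, matching the "possible repetition" model), and blocks hold b = 4 words. These parameter values and all function names are illustrative choices, not values from the slides.

```python
import hashlib

B = 16             # bits per mask (assumed value)
BITS_PER_WORD = 3  # bit positions drawn per word; repeats are possible


def word_signature(word):
    """Hash a word to a B-bit mask."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_WORD):
        mask |= 1 << (digest[i] % B)
    return mask


def block_masks(words, b):
    """OR together the signatures of the words in each block of b words."""
    masks = []
    for start in range(0, len(words), b):
        mask = 0
        for w in words[start:start + b]:
            mask |= word_signature(w)
        masks.append(mask)
    return masks


def candidate_blocks(masks, word):
    """Sequentially scan all masks; a match may still be a false drop."""
    sig = word_signature(word)
    return [i for i, m in enumerate(masks) if (m & sig) == sig]


def search(words, masks, b, word):
    """Check each candidate block against the text to discard false drops."""
    return [i for i in candidate_blocks(masks, word)
            if word in words[i * b:(i + 1) * b]]


words = ("this is a text a text has many "
         "words words are made from letters").split()
masks = block_masks(words, b=4)
print(search(words, masks, 4, "text"))     # [0, 1]
print(search(words, masks, 4, "letters"))  # [3]
```

A block whose mask lacks any bit of the query signature is safely skipped; only blocks that pass the mask test need to be read and verified.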

Search within inverted index
Query processing
Procedures
◦ Apply the same processing steps to the input query as were applied to the documents
◦ Tokenization -> normalization -> stemming -> stopword removal
Look up each query term in the dictionary
◦ Retrieve the postings lists
Operation
◦ AND: intersect the postings lists
◦ OR: take the union of the postings lists
◦ NOT: take the difference of the postings lists
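
A hedged sketch of the query side: a simplified preprocessing function followed by linear-time OR and NOT over docID-sorted postings lists. The stop list is a made-up example, and stemming is left out because Python's standard library has no stemmer; the function names are hypothetical.

```python
import re

STOPWORDS = {"the", "a", "of", "and"}  # hypothetical stop list


def preprocess(text):
    """Tokenize, normalize to lowercase, and remove stopwords (no stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def union(p1, p2):
    """OR: every docID that appears in either sorted list."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        else:
            result.append(p2[j])
            j += 1
    return result + p1[i:] + p2[j:]


def difference(p1, p2):
    """NOT: docIDs in p1 that do not appear in p2 (p1 AND NOT p2)."""
    result, j = [], 0
    for doc_id in p1:
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            result.append(doc_id)
    return result


print(preprocess("The Fall of Brutus and Caesar"))  # ['fall', 'brutus', 'caesar']
print(union([1, 2, 4], [2, 3, 5]))                  # [1, 2, 3, 4, 5]
print(difference([1, 2, 4, 11], [2, 4, 8]))         # [1, 11]
```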

Single Word Queries:
The simplest type of search is that for the occurrences of a single word
The vocabulary search can be carried out using any suitable data structure
Ex: hashing, tries, or B-trees
We note that the vocabulary is in most cases sufficiently small so as to stay in
main memory
The occurrence lists, on the other hand, are usually fetched from disk

Multiple Word Queries:
If the query has more than one word, we have to consider three cases:
conjunctive (AND operator) queries
disjunctive (OR operator) queries
difference (NOT operator) queries
Conjunctive queries require searching for all the words in the query, obtaining
one inverted list for each word
Next, we have to intersect all the inverted lists to obtain the documents
that contain all these words
For disjunctive queries the lists must be merged
The first case is popular on the Web due to the size of the document collection

The Boolean Model
The user describes their information need using Boolean constraints (e.g., AND, OR, and AND
NOT)
• Unranked Boolean Retrieval Model: retrieves documents that satisfy the constraints in no
particular order
• Ranked Boolean Retrieval Model: retrieves documents that satisfy the constraints and ranks
them based on the number of ways they satisfy the constraints
• Also known as ‘exact-match’ retrieval models
• Advantages and disadvantages?

The Boolean Model
Advantages:
‣ Easy for the system
‣ Users get transparency: it is easy to understand why a document was or was not retrieved
‣ Users get control: it is easy to determine whether the query is too specific (few results) or too
broad (many results)
Disadvantages:
‣ The burden is on the user to formulate a good Boolean query

Query processing
Consider processing the query:
Brutus AND Caesar
• Locate Brutus in the Dictionary; Retrieve its postings.
• Locate Caesar in the Dictionary; Retrieve its postings.
• “Merge” the two postings lists:

If the list lengths are x and y, the merge takes O(x+y) operations.
This requires the postings to be sorted by docID.

Intersecting Algorithm
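
A minimal sketch of the linear-time intersection of two postings lists sorted by docID. The two sample lists for Brutus and Caesar are illustrative values, not data from the slides.

```python
def intersect(p1, p2):
    """Walk both sorted lists in step; at most len(p1) + len(p2) iterations."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]   # illustrative postings
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
print(intersect(brutus, caesar))  # [1, 2, 4]
```

The pointer that lags behind is always the one advanced, which is only correct because both lists are sorted by docID.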

Exercise

Phrase Queries
Phrase queries are more difficult to solve with inverted indexes
The lists of all elements must be traversed to find places where
all the words appear in sequence (for a phrase)
This algorithm is similar to a list intersection algorithm
Another solution for phrase queries is based on indexing two-word phrases
and using similar algorithms over pairs of words
However, the index will be much larger, as the number of distinct word pairs does not grow
linearly with the vocabulary size
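
A sketch of the two-word-phrase approach, assuming documents are supplied in ascending docID order and tokenized by whitespace. The function names and sample documents are hypothetical. For phrases longer than two words, intersecting the pair lists can return documents where the pairs occur separately, so the result is a candidate set rather than an exact answer.

```python
def build_biword_index(docs):
    """Index every pair of adjacent words as a single vocabulary entry."""
    index = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        for pair in zip(words, words[1:]):
            postings = index.setdefault(pair, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index


def biword_phrase_query(index, phrase):
    """Intersect the postings of every adjacent word pair in the phrase."""
    words = phrase.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return []
    result = set(index.get(pairs[0], []))
    for pair in pairs[1:]:
        result &= set(index.get(pair, []))
    return sorted(result)


docs = {
    1: "stanford university palo alto",
    2: "palo alto university",
    3: "university of palo alto",
}
biwords = build_biword_index(docs)
print(biword_phrase_query(biwords, "palo alto"))             # [1, 2, 3]
print(biword_phrase_query(biwords, "university palo alto"))  # [1]
```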

Positional indexes
For each term in the vocabulary, we store postings of the form
docID: <position1, position2, ...>,
where each position is a token index in the document.
Each posting will also usually record the term frequency

Positional indexes
Postings lists in a positional index: each posting is a docID and a list of positions
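
A minimal sketch of a positional index and a phrase query over it. Positions are token offsets starting at 0, the term frequency is recovered as the length of the position list, and the names `build_positional_index` and `positional_phrase_query` are hypothetical. The membership test on the position lists is kept simple rather than efficient.

```python
def build_positional_index(docs):
    """term -> {docID: [position1, position2, ...]}"""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index


def positional_phrase_query(index, phrase):
    """Return docIDs where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # The k-th phrase term must sit exactly k tokens after the first.
            if all(start + k in index[t].get(doc_id, [])
                   for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {1: "to be or not to be", 2: "be to or to not be"}
pos_index = build_positional_index(docs)
print(pos_index["to"])                               # {1: [0, 4], 2: [1, 3]}
print(len(pos_index["to"][1]))                       # term frequency: 2
print(positional_phrase_query(pos_index, "to be"))   # [1]
```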

Exercise
