
Assignment No: 3: Aim: Objective: Theory:-Inverted Index

The document discusses implementing an inverted index to allow for fast retrieval of documents. It defines an inverted index as a data structure that maps words or numbers to their locations in a database or documents. An example inverted index is provided using the texts "it is what it is", "what is it", and "it is a banana". The document also discusses applications of inverted indexes in search engines and DNA sequence assembly.

Uploaded by

Pratik B

Assignment No: 3

Aim: To implement a program for the retrieval of documents using inverted files.

Objective: To study and implement the concept of an inverted index.

Theory:-

Inverted index:-

In computer science, an inverted index (also referred to as a postings file or inverted
file) is an index data structure storing a mapping from content, such as words or numbers, to
its locations in a database file, or in a document or a set of documents. The purpose of an
inverted index is to allow fast full-text searches, at a cost of increased processing when a
document is added to the database. The inverted file may be the database file itself, rather
than its index. It is the most popular data structure used in document retrieval systems,[1] used
on a large scale, for example, in search engines. Several significant general-purpose
mainframe-based database management systems have used inverted list architectures,
including ADABAS, DATACOM/DB, and Model 204.

There are two main variants of inverted indexes. A record-level inverted index (or
inverted file index, or just inverted file) contains a list of references to documents for each
word. A word-level inverted index (or full inverted index, or inverted list) additionally
contains the positions of each word within a document. The latter form offers more
functionality (such as phrase searches), but needs more time and space to be created.

Example:-

Given the texts T0 = "it is what it is", T1 = "what is it" and T2 = "it is a banana", we
have the following inverted file index (where the integers in the set notation brackets refer to
the subscripts of the text symbols T0, T1, etc.):

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

A term search for the terms "what", "is" and "it" would give the set
{0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}.
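The record-level index and the term search above can be sketched in a few lines of Python. This is only a minimal illustration, assuming words are separated by whitespace and that a multi-term search means "documents containing every term"; the function names build_record_level_index and search_all_terms are hypothetical and not part of the assignment.

```python
from collections import defaultdict


def build_record_level_index(texts):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for word in text.split():  # assumption: whitespace tokenization
            index[word].add(doc_id)
    return dict(index)


def search_all_terms(index, terms):
    """Return the documents that contain every term (set intersection)."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


texts = ["it is what it is", "what is it", "it is a banana"]
index = build_record_level_index(texts)
result = search_all_terms(index, ["what", "is", "it"])
print(sorted(result))  # [0, 1]
```

A term that does not occur in any document contributes an empty set, so the intersection is empty as well.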

With the same texts, we get the following full inverted index, where the pairs are document
numbers and local word numbers. Like the document numbers, local word numbers also
begin with zero. So, "banana": {(2, 3)} means the word "banana" is in the third document
(document 2), and it is the fourth word in that document (position 3).

"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}

If we run a phrase search for "what is it", we get hits for all the words in both
documents 0 and 1, but the terms occur consecutively only in document 1.
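A phrase search over the full (word-level) index can be sketched as follows. This is one simple approach under stated assumptions, not the only one: words are split on whitespace, and a phrase matches when each later word appears at the position right after the previous one in the same document. The names build_positional_index and phrase_search are hypothetical.

```python
from collections import defaultdict


def build_positional_index(texts):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(texts):
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return dict(index)


def phrase_search(index, phrase):
    """Return the documents in which the words of phrase occur consecutively."""
    words = phrase.split()
    if not words or any(word not in index for word in words):
        return set()
    # Postings of the second, third, ... word, as sets for fast lookup.
    later_postings = [set(index[word]) for word in words[1:]]
    hits = set()
    for doc_id, start in index[words[0]]:
        # The word at offset i must sit at position start + i in the same document.
        if all((doc_id, start + offset) in postings
               for offset, postings in enumerate(later_postings, start=1)):
            hits.add(doc_id)
    return hits


texts = ["it is what it is", "what is it", "it is a banana"]
positional_index = build_positional_index(texts)
print(positional_index["is"])  # [(0, 1), (0, 4), (1, 1), (2, 1)]
print(phrase_search(positional_index, "what is it"))  # {1}
```

Document 0 contains all three words, but "what" at position 2 is followed by "it" rather than "is", so only document 1 is returned.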

Applications:-

The inverted index data structure is a central component of a typical search engine
indexing algorithm. A goal of a search engine implementation is to optimize the speed of the
query: find the documents where word X occurs. Once a forward index is developed, which
stores lists of words per document, it is next inverted to develop an inverted index. Querying
the forward index would require sequential iteration through each document and through each
of its words to verify a matching document. The time, memory, and processing resources to
perform such a query are not always technically realistic. Instead of listing the words per
document, as the forward index does, the inverted index lists the documents per
word.

With the inverted index created, the query can now be resolved by jumping to the
word id (via random access) in the inverted index.
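The difference between the two structures can be illustrated with a small sketch. It assumes the forward index is a dictionary from document number to word list, and it uses a Python dictionary lookup to stand in for the "jump to the word id"; the names build_forward_index, invert and scan_forward are hypothetical.

```python
def build_forward_index(texts):
    """Forward index: document number -> list of words in that document."""
    return {doc_id: text.split() for doc_id, text in enumerate(texts)}


def invert(forward_index):
    """Invert the forward index: word -> set of document numbers."""
    inverted = {}
    for doc_id, words in forward_index.items():
        for word in words:
            inverted.setdefault(word, set()).add(doc_id)
    return inverted


def scan_forward(forward_index, word):
    """Answer a query without an inverted index by scanning every document."""
    return {doc_id for doc_id, words in forward_index.items() if word in words}


texts = ["it is what it is", "what is it", "it is a banana"]
forward = build_forward_index(texts)
inverted = invert(forward)

# Both approaches give the same answer; the inverted index needs one lookup,
# while the scan has to visit every word of every document.
print(scan_forward(forward, "what"))  # {0, 1}
print(inverted["what"])               # {0, 1}
```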

In pre-computer times, concordances to important books were manually assembled.
These were effectively inverted indexes with a small amount of accompanying commentary
that required a tremendous amount of effort to produce.

In bioinformatics, inverted indexes are very important in the sequence assembly of
short fragments of sequenced DNA. One way to find the source of a fragment is to search for
it against a reference DNA sequence. A small number of mismatches (due to differences
between the sequenced DNA and reference DNA, or errors) can be accounted for by dividing
the fragment into smaller fragments: at least one subfragment is likely to match the
reference DNA sequence. The matching requires constructing an inverted index of all
substrings of a certain length from the reference DNA sequence. Since human DNA
contains more than 3 billion base pairs, and we need to store a DNA substring for every
index, and a 32-bit integer for the index itself, the storage requirement for such an inverted index
would probably be in the tens of gigabytes, just beyond the available RAM capacity of most
personal computers today.
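The substring index described above can be sketched on a toy scale. The reference string, the fragment and the substring length k = 4 below are made-up values chosen for illustration; real tools use much longer substrings and more compact storage. The names build_kmer_index and candidate_positions are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return dict(index)


def candidate_positions(index, fragment, k):
    """Split the fragment into non-overlapping length-k pieces, look up each
    piece, and return the reference positions where the fragment may start."""
    candidates = set()
    for offset in range(0, len(fragment) - k + 1, k):
        piece = fragment[offset:offset + k]
        for hit in index.get(piece, []):
            if hit - offset >= 0:  # the fragment cannot start before the reference
                candidates.add(hit - offset)
    return candidates


reference = "ACGTTGCATGACCGTA"  # toy reference sequence (assumption)
kmer_index = build_kmer_index(reference, 4)

# The reference contains "TGCATGAC" at position 4; this fragment has one mismatch.
fragment = "TGCAAGAC"
print(candidate_positions(kmer_index, fragment, 4))  # {4}
```

The second piece, "AGAC", does not occur in the reference because of the mismatch, but the first piece, "TGCA", still matches exactly and points back to position 4.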

Conclusion:- In this way, we studied and successfully implemented an inverted index.
