Information Retrieval

Information retrieval involves representing, storing, and searching large collections of data to discover knowledge in response to user queries. The main goal is to find documents relevant to a user's information needs. Key processes include indexing documents, filtering stop words, searching, and ranking results by relevance. Precision measures the percentage of retrieved documents that are relevant, while recall measures the percentage of relevant documents that were retrieved. Major IR models include Boolean, vector, probabilistic, and inference network models. Common applications of IR include digital libraries and search engines.

Uploaded by

NB

INTRODUCTION

Information retrieval (IR) is generally considered a subfield of computer science that deals
with the representation, storage, and access of information. It is concerned with the
organization and retrieval of information from large database collections. More precisely, IR
is the process by which a collection of data is represented, stored, and searched for the
purpose of knowledge discovery in response to a user request (query). This process involves
several stages, beginning with the representation of data and ending with the return of
relevant information to the user.
Intermediate stages include filtering, searching, matching, and ranking operations. The main
goal of an information retrieval system (IRS) is “finding relevant information or a document
that satisfies user information needs”. To achieve this goal, IRSs usually implement the
following processes:

• In the indexing process, the documents are represented in a summarized content form.
• In the filtering process, all stop words and common words are removed.
• Searching is the core process of an IRS. There are various techniques for retrieving
documents that match the user's needs.
There are two basic measures for assessing the quality of information retrieval:
Precision: the percentage of retrieved documents that are in fact relevant to the query.
Recall: the percentage of documents relevant to the query that were in fact retrieved.
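As a minimal sketch of these two measures, the Python function below computes precision and recall for a single query. The function name and the document identifiers are hypothetical and chosen only for illustration; the values are returned as fractions, which can be multiplied by 100 to obtain percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Guard against empty sets so that we never divide by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# while the collection holds 6 relevant documents in total.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d2", "d3", "d7", "d8", "d9"]

p, r = precision_recall(retrieved_docs, relevant_docs)
print(p, r)  # prints 0.75 0.5
```

In this example three of the four retrieved documents are relevant (precision 75%), but only three of the six relevant documents were found (recall 50%).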
There are three basic processes an information retrieval system has to support: the
representation of the content of the documents, the representation of the user's information
need, and the comparison of the two representations. The processes are visualized in Figure 1.
In the figure, square boxes represent data and rounded boxes represent processes.
Representing the documents is usually called the indexing process. The process takes place off-
line, that is, the end user of the information retrieval system is not directly involved. The
indexing process results in a representation of the document.
Users do not search just for fun; they have a need for information. The process of representing
their information need is often referred to as the query formulation process, and the resulting
representation is the query.
Comparing the two representations is known as the matching process. Retrieval of documents
is the result of this process.
IR MODELS

An IR model specifies the details of the document representation, the query representation and
the retrieval functionality. The fundamental IR models can be classified into Boolean, vector,
probabilistic, and inference network models. The rest of this section briefly describes these
models.

Boolean Model

The Boolean model is the first model of information retrieval and probably also the most
criticised model. The model can be explained by thinking of a query term as an
unambiguous definition of a set of documents. For instance, the query term economic simply
defines the set of all documents that are indexed with the term economic. Using the operators
of George Boole's mathematical logic, query terms and their corresponding sets of documents
can be combined to form new sets of documents. The Boolean model allows the use of the
operators of Boolean algebra, AND, OR and NOT, for query formulation, but has one major
disadvantage: a Boolean system is not able to rank the returned list of documents. In the
Boolean model, a document is associated with a set of keywords. Queries are likewise
expressions of keywords separated by AND, OR, or NOT/BUT.
The retrieval function in this model treats a document as either relevant or irrelevant. In Figure
2, the retrieved sets are visualised by the shaded areas.
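The set view of the Boolean model can be sketched with Python's built-in set type. The toy index, the document numbers, and the helper name docs are assumptions made for illustration only; BUT is interpreted here as AND NOT.

```python
# Hypothetical toy index: each term maps to the set of documents indexed with it.
index = {
    "economic": {1, 2, 4},
    "social": {2, 3},
    "political": {3, 4, 5},
}
all_docs = {1, 2, 3, 4, 5}  # the whole (assumed) collection


def docs(term):
    """Return the set of documents defined by a single query term."""
    return index.get(term, set())


and_result = docs("economic") & docs("social")     # economic AND social
or_result = docs("economic") | docs("social")      # economic OR social
not_result = all_docs - docs("political")          # NOT political
but_result = docs("economic") - docs("political")  # economic BUT political

print(sorted(and_result))  # prints [2]
print(sorted(or_result))   # prints [1, 2, 3, 4]
```

Each result is an unordered set: a document is either in it or not, which illustrates why a Boolean system cannot rank what it returns.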
Inference Network Model

In this model, document retrieval is modeled as an inference process in an inference network.
Most techniques used by IR systems can be implemented under this model. In the simplest
implementation of this model, a document instantiates a term with a certain strength, and the
credit from multiple terms is accumulated given a query to compute the equivalent of a
numeric score for the document.
From an operational perspective, the strength of instantiation of a term for a document can be
considered as the weight of the term in the document, and document ranking in the simplest
form of this model becomes similar to ranking in the vector space model and the probabilistic
models described above. The strength of instantiation of a term for a document is not defined
by the model, and any formulation can be used.
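The simplest form described above can be sketched as follows. Because the model does not define the strength of instantiation, the weights below are invented, and accumulating credit by plain summation is an assumption made for this sketch, not something the model prescribes.

```python
# Hypothetical strengths with which each document instantiates each term.
strength = {
    "d1": {"retrieval": 0.8, "index": 0.3},
    "d2": {"retrieval": 0.4, "query": 0.6},
    "d3": {"index": 0.9},
}


def rank(query_terms):
    """Accumulate the credit of the query terms per document and rank by it."""
    scores = {}
    for doc, terms in strength.items():
        # Summation is an assumed way of combining the evidence.
        score = sum(terms.get(term, 0.0) for term in query_terms)
        if score > 0:
            scores[doc] = score
    # Highest accumulated score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank(["retrieval", "index"])
print([doc for doc, _ in ranking])  # prints ['d1', 'd3', 'd2']
```

Unlike the Boolean model, every matching document receives a numeric score, so the result list can be ordered.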

Basic Crawling and Indexing Strategies

Documents can be crawled using several strategies:

• Orthographic, where words are treated just as strings of characters
• Semantic, where words are connected with the concepts they express
• Statistical, where the term frequency is systematically compared with a frequency
lexicon

In general, a document can be represented as a set of keywords (or key phrases) that
contribute to the description of its content. During the indexing phase, when the collection and
the storage of the data are performed, texts are usually preprocessed in order to remove stop
words and to perform stemming.
Additionally, the obtained words and stems can be reconnected to their synonyms in order to
create relations between words and concept classes.
IR systems try to retrieve all the documents that are relevant to a user query while minimizing
the number of nonrelevant documents retrieved.
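The preprocessing step mentioned in this section (stop word removal followed by stemming) can be sketched as below. The stop word list is a tiny assumed sample, and the stem function is a crude suffix stripper written only for illustration; it is not a real stemming algorithm such as Porter's.

```python
import re

# Assumed, deliberately tiny stop word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Crude suffix stripping, checked in this order; not a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


keywords = preprocess("The documents are indexed and the queries are matched")
print(keywords)  # prints ['document', 'index', 'queri', 'match']
```

The stem "queri" shows that stems need not be dictionary words; they only have to map related word forms to the same index term.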

Inversion Indices

Each document can be represented by a list of keywords that describe the contents of the
document for retrieval purposes. Fast retrieval can be achieved if we invert on those keywords.
The keywords are stored, e.g. alphabetically, in the index file; for each keyword we maintain a
list of pointers to the qualifying documents in the postings file. This method is followed by
almost all commercial systems.
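A minimal in-memory sketch of this inversion is shown below. The sample collection is hypothetical, and a single Python dictionary stands in for both the index file (its keys) and the postings file (its values); a real system would keep these on disk.

```python
from collections import defaultdict

# Hypothetical collection: document id -> list of keywords describing it.
documents = {
    1: ["retrieval", "index", "query"],
    2: ["index", "ranking"],
    3: ["query", "ranking", "retrieval"],
}


def build_inverted_index(docs):
    """Map each keyword to the sorted list of documents that contain it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for keyword in set(docs[doc_id]):  # count each keyword once per document
            postings[keyword].append(doc_id)
    # Keep the keywords in alphabetical order, as suggested in the text.
    return {keyword: postings[keyword] for keyword in sorted(postings)}


inverted = build_inverted_index(documents)
print(inverted["retrieval"])  # prints [1, 3]
```

Answering a one-keyword query then requires a single dictionary lookup instead of a scan over every document.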

SEARCHING TECHNIQUES

There are various searching algorithms, including linear search, binary search, and brute force
search. Some general searching algorithms are described below:

• Linear search is a method of finding a particular element or keyword in a list or array
by checking every element, one at a time and in sequence. It is the simplest search
algorithm. One of its most important drawbacks is slow searching speed: even in an
ordered list it may have to examine every element. This search is also known as
sequential search.
• Brute force search is a very general problem-solving technique that consists of
systematically enumerating all possible candidates for the solution and checking
whether each candidate satisfies the problem's statement. A brute force algorithm is
simple to implement and will always find a solution if one exists.
• Binary search finds the position of an element with a given key value within a sorted
array. In each step, the algorithm compares the search key with the key of the middle
element of the array. If the keys match, a matching element has been found and its index,
or position, is returned. Otherwise, if the search key is less than the middle element's
key, the algorithm repeats its action on the sub-array to the left of the middle element
or, if the search key is greater, on the sub-array to the right.
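Linear and binary search as described above can be sketched as follows. The function names, the sample keyword list, and the convention of returning -1 when the key is absent are assumptions made for this illustration.

```python
def linear_search(items, key):
    """Check every element in sequence; return its position or -1."""
    for position, item in enumerate(items):
        if item == key:
            return position
    return -1


def binary_search(sorted_items, key):
    """Repeatedly halve a sorted list; return the key's position or -1."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        middle = (low + high) // 2
        if sorted_items[middle] == key:
            return middle
        if key < sorted_items[middle]:
            high = middle - 1  # continue in the left sub-array
        else:
            low = middle + 1   # continue in the right sub-array
    return -1


# Hypothetical keyword list, already sorted alphabetically.
keywords = ["boolean", "index", "query", "ranking", "vector"]

print(linear_search(keywords, "vector"))  # prints 4
print(binary_search(keywords, "vector"))  # prints 4
print(binary_search(keywords, "model"))   # prints -1
```

On this list linear search inspects all five elements before reaching "vector", whereas binary search needs three comparisons; the gap widens as the list grows, but binary search works only when the list is sorted.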
AREAS OF IR APPLICATION

Digital Library

A digital library is a library in which collections are stored in digital formats and accessible by
computers. The digital content may be stored locally, or accessed remotely via computer
networks. A digital library is a type of information retrieval system.

Search Engines

A search engine is one of the most practical applications of information retrieval techniques
to large-scale text collections. Web search engines are the best-known examples, but many other
kinds of search exist, such as desktop search, enterprise search, federated search, mobile
search, and social search.

CONCLUSION

In conclusion, information retrieval is the process of searching for and retrieving
knowledge-based information from a collection of documents.
