IR Notes
Information Retrieval (IR) deals with the organization, storage, retrieval, and evaluation of
information from document repositories, particularly textual information. It is the activity of
obtaining material of an unstructured nature, usually text, that satisfies an information need
from within large collections stored on computers. For example, information retrieval takes
place when a user enters a query into a search system.
An IR system can represent, store, organize, and access information items. A set of keywords is
required to search. Keywords are the terms people enter into search engines; they summarize
the description of the information being sought.
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that the user has asked for
in the form of a query. The documents and the queries are represented in a similar manner, so
that document selection and ranking can be formalized by a matching function that returns a
retrieval status value (RSV) for each document in the collection. Many Information Retrieval
systems represent document contents by a set of descriptors, called terms, belonging to a
vocabulary V. An IR model determines the query-document matching function according to
four main approaches:
The estimation of the probability of the user's relevance rel for each document d and query q
with respect to a set Rq of training documents: Prob(rel | d, q, Rq)
Types of IR Models
Components of Information Retrieval/ IR Model
● Acquisition: In this step, documents and other objects are selected from various web
resources that consist of text-based documents. The required data is collected by web
crawlers and stored in a database.
● Representation: This consists of indexing, which uses free-text terms, a controlled
vocabulary, and both manual and automatic techniques. Example: abstracting involves
summarizing, and the bibliographic description contains the author, title, sources, data,
and metadata.
● File Organization: There are two basic file organization methods. Sequential: stores the
data document by document. Inverted: stores the data term by term, with a list of records
under each term. A combination of both can also be used.
● Query: An IR process starts when a user enters a query into the system. Queries are
formal statements of information needs, for example, search strings in web search
engines. In information retrieval, a query does not uniquely identify a single object in the
collection. Instead, several objects may match the query, perhaps with different degrees
of relevancy.
Difference Between Information Retrieval and Data Retrieval
Information Retrieval | Data Retrieval
The software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Data retrieval deals with obtaining data from a database management system such as ODBMS. It is a process of identifying and retrieving the data from the database, based on the query provided by the user or application.
Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
Small errors are likely to go unnoticed. | A single error object means total failure.
Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
The results obtained are approximate matches. | The results obtained are exact matches.
Results are ordered by relevance. | Results are unordered by relevance.
It is a probabilistic model. | It is a deterministic model.
User Interaction With Information Retrieval System
The User Task: The information need first has to be translated into a query by the user. In an
information retrieval system, the query is a set of words that conveys the semantics of the
information required, whereas in a data retrieval system, a query expression conveys the
constraints that the objects must satisfy. Example: a user starts out searching for one thing but
ends up looking at another. This means that the user is browsing and not searching.
The above figure shows the interaction of the user through different tasks.
● Logical View of the Documents: In the past, documents were represented through a set
of index terms or keywords. Modern computers can represent a document by its full set
of words, and the set of representative keywords can then be reduced by eliminating
stopwords, i.e. articles and connectives. These operations are called text operations.
They reduce the complexity of the document representation from the full text to a set of
index terms.
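The stopword-elimination step described above can be sketched in a few lines of Python. This is only a minimal illustration: the stopword list and the function name index_terms are assumptions made for the example, and real systems use much larger lists and better tokenization.

```python
# Hypothetical, deliberately tiny stopword list (articles and connectives only).
STOPWORDS = {"a", "an", "the", "and", "or", "but", "is", "to", "of"}

def index_terms(text):
    """Lowercase the text, split it on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

full_text = "The index is a data structure"
print(index_terms(full_text))  # ['index', 'data', 'structure']
```

The full text of six words is reduced to three index terms, which is the reduction in representation complexity that the text operations aim for.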
Past, Present, and Future of Information Retrieval
1. Early Developments: As the need for information grew, it became necessary to build data
structures that give faster access. The index is the data structure used for faster retrieval of
information. For centuries, indexes were built by manually categorizing items into hierarchies.
2. Information Retrieval in Libraries: Libraries were the first to adopt IR systems. The first
generation automated previous technologies, and search was based on author name and title.
The second generation added searching by subject heading, keywords, etc. The third generation
added graphical interfaces, electronic forms, hypertext features, etc.
3. The Web and Digital Libraries: The web is cheaper than various other sources of information,
it provides greater access to networks due to digital communication, and it gives free access to
publish on a larger medium.
Advantages of Information Retrieval
1. Efficient Access: Information retrieval techniques make it possible for users to easily locate
and retrieve vast amounts of data or information.
2. Personalization of Results: User profiling and personalization techniques are used in
information retrieval models to tailor search results to individual preferences and behaviors.
3. Scalability: Information retrieval models are capable of handling increasing data volumes.
4. Precision: These systems can provide highly accurate and relevant search results, reducing
the likelihood of irrelevant information appearing in search results.
Disadvantages of Information Retrieval
1. Information Overload: When a lot of information is available, users often face information
overload, making it difficult to find the most useful and relevant material.
2. Lack of Context: Information retrieval systems may fail to understand the context of a user’s
query, potentially leading to inaccurate results.
3. Privacy and Security Concerns: As information retrieval systems often access sensitive user
data, they can raise privacy and security concerns.
4. Maintenance Challenges: Keeping these systems up-to-date and effective requires ongoing
efforts, including regular updates, data cleaning, and algorithm adjustments.
5. Bias and Fairness: Ensuring that information retrieval systems provide fair and unbiased
results is a crucial challenge, especially in contexts like web search engines and
recommendation systems.
Applications of IR
Information retrieval (IR) systems were first developed to help manage huge amounts of
information. Many university, corporate, and public libraries now use IR systems to
provide access to books, journals, and other documents. Information retrieval is used today in
many applications. General applications of information retrieval systems are as follows:
IR System evaluation
IR evaluation is basically determining the accuracy of an IR system (Anwar.A, 2014). The two
basic measures used to evaluate an IR system are:
√ Precision - the fraction of retrieved documents that are relevant to the user's information
need.
√ Recall - the fraction of relevant documents in the collection that are retrieved. It answers the
question of whether all the relevant documents were retrieved.
The higher the precision and recall, the better the system.
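The two definitions above translate directly into set arithmetic. The following is a minimal sketch, assuming documents are identified by integer IDs; the function name precision_recall and the sample IDs are made up for the illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the system returned
    relevant:  IDs of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 3 relevant in the collection, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r)  # precision = 2/4, recall = 2/3
```

Returning 0.0 when nothing is retrieved (or nothing is relevant) is a convention chosen here to avoid division by zero; other conventions are possible.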
The approach of managing and organizing large collections of information originally came from
librarianship. It can be claimed that cataloguing is the primordial soup from which Information
Retrieval was born. In earlier days, books, documents, sacred manuscripts, scriptures, epics,
and spiritual documents were kept and indexed using cataloguing schemes. Eliot and Rose
claimed that in the 3rd century B.C. the Greek poet Callimachus was the first to create his own
cataloguing scheme for managing his personal collection. Some big libraries were built in
ancient times. For example, the library at Alexandria (280 B.C.) had more than 700,000
documents, and Nalanda University had a huge library for document storage. However, whether
any mechanism existed to organize, classify, or retrieve these documents is still unknown.
In 1891, Rudolph filed a patent with the US patent office for a machine composed of catalogue
cards joined together, which could be wound past a viewing window, enabling rapid manual
scanning of the catalogue. In 1918, Soper filed another patent for a device in which catalogue
cards with holes, related to categories, were aligned in front of each other to determine whether
a collection contained entries with a particular combination of categories. If light could be seen
through the arrangement of cards, a match was found.
Over the years, the need was felt for mechanical devices that could search a catalogue for a
particular entry. Emanuel Goldberg was the first person to work on that problem, in the 1920s
and '30s, with a device of his own. It was an optical device that searched for a pattern of dots
or letters within the catalogue entries on a roll of microfilm. Goldberg patented many of his
inventions in photography. Figure 1 shows the diagram of the patent filed in USPTO in 1928.
“Here it can be seen that catalogue entries were
stored on a roll of film (figure 1). A query (2) was also on film showing a negative image of the
part of the catalogue being searched for; in this case the 1st and 6th entries on the roll. A light
source (7) was shone through the catalogue roll and query film, focused onto a photocell (6). If
an exact match was found, all light was blocked to the cell causing a relay to move a counter
forward (12) and for an image of the match to be shown via a half silvered mirror (3), reflecting
the match onto a screen or photographic plate.”
After this invention, in 1935, Davis and Draeger carried out several experiments along similar
lines on microfilm-based searching. According to Mooers, their work influenced Vannevar Bush,
who proposed the famous Memex system in 1945.
In 1950, Luhn built a selector using punch cards, light, and photocells that could search over
600 cards per minute. Another important feature of this system was that it could search for a
pattern of consecutive characters within a long string. Calvin Mooers first coined the term
“Information Retrieval” at a conference in 1950.
Introduction
Document Retrieval in Machine Learning is part of a larger field known as Information
Retrieval: given a query from the user, the system tries to find the documents relevant to the
query and to rank them in order of relevance or match.
There are different ways of doing document retrieval; two popular ones are −
● Boolean Model
● Vector Space Model
Let us briefly look at each of the above methods.
Boolean Model
It is a set-based retrieval model. The user query is in Boolean form: query terms are joined
using AND, OR, NOT, etc. A document can be visualized as a set of keywords. A document is
retrieved only if it satisfies the query exactly. Partial matches and ranking are not supported.
Example (Boolean query) −
[[[America & France] | [Honduras & London]] & restaurants & !Manhattan]
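To make the set-based view concrete, here is a rough sketch that evaluates the example query against documents stored as keyword sets. The three documents and the function name matches are hypothetical and exist only for this illustration.

```python
# Hypothetical documents, each represented as a set of keywords.
docs = {
    "d1": {"america", "france", "restaurants"},
    "d2": {"honduras", "london", "restaurants", "manhattan"},
    "d3": {"america", "restaurants"},
}

def matches(keywords):
    """Evaluate the example Boolean query against one document's keyword set."""
    def has(term):
        return term in keywords

    return (
        ((has("america") and has("france")) or (has("honduras") and has("london")))
        and has("restaurants")
        and not has("manhattan")
    )

# A document is either returned or not; there is no score and no ranking.
result = [doc_id for doc_id, keywords in docs.items() if matches(keywords)]
print(result)  # ['d1']
```

Document d2 satisfies the first two clauses but is excluded by NOT Manhattan, and d3 is excluded because it only partially matches [America & France].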
Steps and Flow diagram of Boolean Model
The Boolean model uses an inverted index search to find whether a document is relevant or not.
It does not return the rank of the document.
Suppose we have 3 documents in our corpus.
document_id document_text
1. Taj Mahal is a beautiful monument
2. Victoria Memorial is also a monument
3. I like to visit Agra
The term-document matrix is created as below; a 1 means the term occurs in that document.
term doc_1 doc_2 doc_3
taj 1 0 0
mahal 1 0 0
is 1 1 0
a 1 1 0
beautiful 1 0 0
monument 1 1 0
victoria 0 1 0
memorial 0 1 0
also 0 1 0
i 0 0 1
like 0 0 1
to 0 0 1
visit 0 0 1
agra 0 0 1
Let us take a query like "taj mahal agra".
The query will be created as −
taj [100] & mahal [100] & agra [001]
or 100 & 100 & 001 = 000. None of the documents contains all three terms, so no document is
relevant when the terms are joined with AND. We can then try other operators such as OR, or
use different keywords in addition to these.
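The bitwise evaluation of this query can be reproduced with a short Python sketch. It assumes the three-document corpus from the table above and simple lowercase whitespace tokenization; the helper names incidence_vector and to_bits are invented for the example.

```python
docs = {
    1: "Taj Mahal is a beautiful monument",
    2: "Victoria Memorial is also a monument",
    3: "I like to visit Agra",
}
doc_ids = sorted(docs)

def incidence_vector(term):
    """Return one bit per document, leftmost bit = doc 1 (e.g. 0b100 = only doc 1)."""
    bits = 0
    for doc_id in doc_ids:
        present = term in docs[doc_id].lower().split()
        bits = (bits << 1) | int(present)
    return bits

def to_bits(vector):
    """Format a vector as a fixed-width bit string such as '100'."""
    return format(vector, f"0{len(doc_ids)}b")

taj, mahal, agra = (incidence_vector(t) for t in ("taj", "mahal", "agra"))
print(to_bits(taj), to_bits(mahal), to_bits(agra))  # 100 100 001
print(to_bits(taj & mahal & agra))                  # 000
print(to_bits((taj & mahal) | agra))                # 101
```

Relaxing the last operator to OR gives 101, that is, documents 1 and 3 are returned.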
The inverted index can be created for this corpus as −
taj - set(1)
mahal - set(1)
is - set(1,2)
a - set(1,2)
beautiful - set(1)
monument - set(1,2)
victoria - set(2)
memorial - set(2)
also - set(2)
i - set(3)
like - set(3)
to - set(3)
visit - set(3)
agra - set(3)
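An inverted index like the one listed above could be built as follows. This is a minimal sketch rather than a production indexer: it assumes the same three-document corpus and lowercase whitespace tokenization, and the function name build_inverted_index is chosen for the example.

```python
from collections import defaultdict

docs = {
    1: "Taj Mahal is a beautiful monument",
    2: "Victoria Memorial is also a monument",
    3: "I like to visit Agra",
}

def build_inverted_index(docs):
    """Map each term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_index(docs)
print(sorted(index["monument"]))  # [1, 2]

# With posting sets, AND becomes set intersection and OR becomes set union.
print(index["taj"] & index["mahal"] & index["agra"])            # set()
print(sorted((index["taj"] & index["mahal"]) | index["agra"]))  # [1, 3]
```

Compared with the term-document matrix, the inverted index stores only the documents in which a term actually occurs, so it stays compact when most matrix entries would be 0.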