
IR Notes

Information Retrieval (IR) is a software program focused on organizing, storing, and retrieving textual information from document repositories based on user queries. IR models rank documents based on their relevance to a user's query, utilizing various methods such as Boolean and Vector Space Models. While IR systems offer efficient access and personalized results, they also face challenges like information overload and privacy concerns.

Uploaded by

sam.varman
Copyright
© All Rights Reserved

What is Information Retrieval?

An Information Retrieval (IR) system can be defined as a software program that deals with the
organization, storage, retrieval, and evaluation of information from document repositories,
particularly textual information. Information Retrieval is the activity of obtaining material of
an unstructured nature (usually text) that satisfies an information need from within large
collections stored on computers. For example, an IR process begins when a user enters a query
into the system.
An IR system has the ability to represent, store, organize, and access information items. A
search requires a set of keywords. Keywords are the words people type into search engines, and
they summarize the description of the information being sought.
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that the user has asked for
in the form of a query. Documents and queries are represented in a similar manner, so that
document selection and ranking can be formalized by a matching function that returns a retrieval
status value (RSV) for each document in the collection. Many Information Retrieval systems
represent document contents by a set of descriptors, called terms, belonging to a vocabulary V.
An IR model determines the query-document matching function according to four main approaches.
One of them is the estimation of the probability of the user's relevance rel for each document d
and query q with respect to a set R_q of training documents: Prob(rel | d, q, R_q)
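The idea of a matching function that returns an RSV for every document can be sketched in a few lines of Python. This is only an illustration: the scoring rule (fraction of query terms found in the document), the names overlap_rsv and rank, and the toy collection are assumptions made for the example, not one of the four approaches referred to above.

```python
def overlap_rsv(query_terms: set[str], doc_terms: set[str]) -> float:
    """Toy matching function: fraction of query terms that occur in the document."""
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


def rank(query: str, collection: dict[str, str]) -> list[tuple[str, float]]:
    """Return (doc_id, RSV) pairs sorted from highest to lowest RSV."""
    # Queries and documents share one representation: a set of lowercase terms.
    q_terms = set(query.lower().split())
    scored = [
        (doc_id, overlap_rsv(q_terms, set(text.lower().split())))
        for doc_id, text in collection.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection used only for this sketch.
collection = {
    "d1": "taj mahal is a beautiful monument",
    "d2": "victoria memorial is also a monument",
    "d3": "i like to visit agra",
}
ranking = rank("beautiful monument", collection)
for doc_id, rsv in ranking:
    print(doc_id, rsv)
```

Any real model replaces overlap_rsv with its own scoring function; the surrounding ranking loop stays the same.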
Types of IR Models
Components of Information Retrieval / IR Model
● Acquisition: In this step, documents and other objects are selected from various web
resources that consist of text-based documents. The required data is collected by web
crawlers and stored in the database.
● Representation: It consists of indexing, which involves free-text terms, controlled
vocabulary, and both manual and automatic techniques. Example: abstracting involves
summarizing, and bibliographic description covers author, title, sources, data, and
metadata.
● File Organization: There are two basic file organization methods, plus a combination of
both. Sequential: records are stored document by document. Inverted: records are stored
term by term, with a list of records under each term.
● Query: An IR process starts when a user enters a query into the system. Queries are
formal statements of information needs, for example, search strings in web search
engines. In information retrieval, a query does not uniquely identify a single object in the
collection. Instead, several objects may match the query, perhaps with different degrees
of relevancy.
Difference Between Information Retrieval and Data Retrieval
Information Retrieval | Data Retrieval
Deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Deals with obtaining data from a database management system such as an ODBMS. It is a process of identifying and retrieving data from the database, based on the query provided by the user or application.
Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
Small errors are likely to go unnoticed. | A single error object means total failure.
Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
The results obtained are approximate matches. | The results obtained are exact matches.
Results are ordered by relevance. | Results are not ordered by relevance.
It is a probabilistic model. | It is a deterministic model.
User Interaction With Information Retrieval System

The User Task: The user first has to translate an information need into a query. In an
information retrieval system, the query is a set of words that conveys the semantics of the
required information, whereas in a data retrieval system, a query expression conveys the
constraints that the result objects must satisfy. Example: A user who sets out to search for one
thing but ends up looking at another is browsing, not searching.
The above figure shows the interaction of the user through different tasks.
● Logical View of the Documents: In the past, documents were represented through a set
of index terms or keywords. Modern computers can represent a document by its full set
of words and then reduce that set to representative keywords, for example by
eliminating stopwords such as articles and connectives. These operations are called text
operations. Text operations reduce the complexity of the document representation from
full text to a set of index terms.
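The text operations described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stopword list is a tiny hand-picked set (real systems use much longer lists), tokenization is a simple regular expression, and the function name index_terms is hypothetical.

```python
import re

# Assumed, deliberately tiny stopword list: articles, connectives,
# and a few other common function words.
STOPWORDS = {"a", "an", "the", "and", "or", "but", "is", "of", "to"}


def index_terms(text: str) -> list[str]:
    """Reduce full text to a list of distinct index terms, in order of first use."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    seen = set()
    terms = []
    for token in tokens:
        if token not in STOPWORDS and token not in seen:
            seen.add(token)
            terms.append(token)
    return terms


terms = index_terms("The Taj Mahal is a beautiful monument, and a monument to visit.")
print(terms)
```

Twelve words of full text shrink to five index terms here, which is the reduction in representation complexity the notes describe.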
Past, Present, and Future of Information Retrieval
1. Early Developments: As the need for large amounts of information grew, it became
necessary to build data structures that give faster access. The index is the data structure for
faster retrieval of information. For centuries, indexes were built through manual categorization
into hierarchies.
2. Information Retrieval In Libraries: Libraries were the first to adopt IR systems for
information retrieval. The first generation automated previous technologies, and search was
based on author name and title. The second generation added searching by subject heading,
keywords, etc. The third generation added graphical interfaces, electronic forms, hypertext
features, etc.
3. The Web and Digital Libraries: The web is cheaper than various other sources of information,
it provides greater access to networks due to digital communication, and it gives free access to
publish on a larger medium.
Advantages of Information Retrieval
1. Efficient Access: Information retrieval techniques make it possible for users to easily locate
and retrieve vast amounts of data or information.
2. Personalization of Results: User profiling and personalization techniques are used in
information retrieval models to tailor search results to individual preferences and behaviors.
3. Scalability: Information retrieval models are capable of handling increasing data volumes.
4. Precision: These systems can provide highly accurate and relevant search results, reducing
the likelihood of irrelevant information appearing in search results.
Disadvantages of Information Retrieval
1. Information Overload: When a lot of information is available, users often face information
overload, making it difficult to find the most useful and relevant material.
2. Lack of Context: Information retrieval systems may fail to understand the context of a user’s
query, potentially leading to inaccurate results.
3. Privacy and Security Concerns: As information retrieval systems often access sensitive user
data, they can raise privacy and security concerns.
4. Maintenance Challenges: Keeping these systems up-to-date and effective requires ongoing
efforts, including regular updates, data cleaning, and algorithm adjustments.
5. Bias and Fairness: Ensuring that information retrieval systems do not exhibit biases and
provide fair results is a crucial challenge, especially in contexts like web search engines and
recommendation systems.
Applications of IR
Information retrieval (IR) systems were first developed to help manage huge amounts of
information. Many university, corporate, and public libraries now use IR systems to
provide access to books, journals, and other documents. Information retrieval is used today in
many applications. General applications of information retrieval systems are as follows:

1. Digital library

A digital library is a library in which collections are stored in digital formats and are
accessible by computers. The digital content may be stored locally, or accessed remotely via
computer networks. A digital library is a type of information retrieval system. An upcoming
field of library and information science is focused on the human user aspects of information
retrieval.

2. Semantic web

The current web is primarily composed of pages with information in the form of natural
language texts and images intended for human viewing and understanding. Machines are
used primarily to render this information, laying it out on the screen or printed page. The idea
behind the semantic web is to augment these web pages with markup that captures some of the
meaning of the content on the pages and encodes it in a form that is suitable for machine
understanding.

3. Search engines

A search engine is one of the most practical applications of information retrieval techniques
to large-scale text collections. Web search engines are the best-known examples, but many
other kinds of search exist, such as desktop search, enterprise search, federated search, mobile
search, and social search.
A web search engine is designed to search for information on the World Wide Web. The search
results are usually presented as a list, and the individual results are commonly called hits. The
information may consist of web pages, images, and other types of files.
4. Natural language processing
Natural language processing is focused on the syntactic, semantic, and pragmatic analysis of
natural language text and discourse. It involves:
- The ability to analyze syntax (phrase structure) and semantics, which could allow retrieval
based on meaning rather than keywords.
- Methods for determining the sense of an ambiguous word based on context (word sense
disambiguation).
- Methods for identifying specific pieces of information in a document
(information extraction).

5. Search Engine Marketing

Search engine marketing (SEM) is a form of Internet marketing that involves the promotion of
websites by increasing their visibility in search engine results pages. SEM may incorporate
search engine optimization (SEO), which adjusts or rewrites website content and site
architecture to achieve a higher ranking in search engine results pages or to enhance
pay-per-click (PPC) listings.

6. Machine learning

According to Russell and Norvig (2013), machine learning focuses on the development of
computational systems that improve their performance with experience. Machine learning
has been successfully applied in recommendation systems to improve product sales.

7. Artificial intelligence

Artificial intelligence involves representation of knowledge, reasoning, and intelligent
action. AI uses formalisms for representing knowledge and queries, such as first-order
logic, predicate logic, and Bayesian networks.
Web ontologies and intelligent information agents are some of the most recent applications
of IR.
Figure 1: A simplified IR architecture in search engines.

IR System evaluation
IR evaluation means determining the accuracy of an IR system (Anwar.A, 2014). Two basic
measures for evaluating an IR system are:
√ Precision - the fraction of retrieved documents that are relevant to the user's information
need.
√ Recall - the fraction of relevant documents in the collection that are retrieved. It answers
the question of whether all the relevant documents were retrieved.
The higher the precision and recall, the better the system.
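The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below assumes documents are identified by string IDs; the function name precision_recall and the example sets are illustrative, not taken from the notes.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical outcome of a single query.
retrieved = {"d1", "d2", "d3", "d4"}  # what the system returned
relevant = {"d1", "d3", "d5"}         # what the user actually needed
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).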

Brief History of Information Retrieval

The approach to managing and organizing large collections of information came from
librarianship. Cataloguing can fairly be called the primordial soup from which Information
Retrieval emerged. In earlier days, books, documents, sacred manuscripts, scriptures, epics,
and spiritual documents were kept and indexed using cataloguing schemes. Eliot and Rose
claimed that in the 3rd century B.C. the Greek poet Callimachus was the first to create his own
cataloguing scheme for managing his personal collections. In ancient times, some big libraries
were built. For example, the library at Alexandria (280 B.C.) had more than 700,000 documents.
Nalanda University had one huge library for document storage. However, whether any
mechanism existed to organize, classify, or retrieve them is still unknown.

In 1891, Rudolph filed a patent with the US patent office for a machine composed of catalogue
cards joined together, which could be wound past a viewing window, enabling rapid manual
scanning of the catalogues. Soper in 1918 filed another patent for a device in which catalogue
cards with holes, related to categories, were aligned in front of each other to determine if there
were entries in a collection with a particular combination of categories. If light could be seen
through the arrangement of cards, a match was found.
Over the years, the need was felt for mechanical devices that could search a catalogue for a
particular entry. Emanuel Goldberg was the first person to work on that problem, in the 1920s
and '30s, and he developed his own device. By nature, it is an optical device that searches for
a pattern of dots or letters within the catalogue entries on a roll of microfilm. Goldberg
patented many of his inventions in photography. Figure 1 shows the diagram of the patent filed
in USPTO in 1928. “Here it can be seen that catalogue entries were
stored on a roll of film (figure 1). A query (2) was also on film showing a negative image of the
part of the catalogue being searched for; in this case the 1st and 6th entries on the roll. A light
source (7) was shone through the catalogue roll and query film, focused onto a photocell (6). If
an exact match was found, all light was blocked to the cell causing a relay to move a counter
forward (12) and for an image of the match to be shown via a half silvered mirror (3), reflecting
the match onto a screen or photographic plate.”
After this invention, in 1935, Davis and Draeger also carried out several experiments along
similar lines on microfilm-based searching. According to Mooers, their work influenced
Vannevar Bush, who described the famous Memex system in 1945.

Ralph Shaw implemented the Rapid Selector in the US Department of Agriculture (USDA)
library. This machine was developed under the supervision of engineers at MIT, who worked
from the earlier version of the Rapid Selector with the consent of Vannevar Bush, and it was
delivered to the USDA in 1949. “It was reported to search through a 2,000 foot reel of film.
Each half of the film’s frames had a different purpose: one half for ‘frames of material’; the
other for ‘index entries’. It is stated that 72,000 frames were stored on the film, which in total
were indexed by 430,000 entries. Shaw reported that the selector was able to search at the rate
of 78,000 entries per minute.”

In 1950, Luhn also built a selector using punch cards, light, and photocells. This system could
search over 600 cards per minute. Another important feature of this system is that it could
search for a pattern of consecutive characters within a long string. At a conference in 1950,
Calvin Mooers coined the term “Information Retrieval”.

Introduction
Document Retrieval in Machine Learning is part of a larger field known as Information
Retrieval, where, given a query from the user, the system tries to find documents relevant to
the query and to rank them in order of relevance or match.
There are different ways of doing document retrieval. Two popular ones are −
● Boolean Model
● Vector Space Model
Let us have a brief look at each of the above methods.
Boolean Model
It is a set-based retrieval model. The user query is in Boolean form. Query terms are joined
using AND, OR, NOT, etc. A document can be visualized as a keyword set. A document is
retrieved only if it satisfies the query expression. Partial matches and ranking are not supported.
Example (Boolean query) −
[[[America & France] | [Honduras & London]] & restaurants & !Manhattan]
Steps and Flow diagram of Boolean Model

The Boolean model uses an inverted index search to find whether a document is relevant or
not. It does not return a rank for the document.
Let us consider we have 3 documents in our corpus.
document_id document_text
1. Taj Mahal is a beautiful monument
2. Victoria Memorial is also a monument
3. I like to visit Agra
The term matrix will be created as below.
term doc_1 doc_2 doc_3
taj 1 0 0
mahal 1 0 0
is 1 1 0
a 1 1 0
beautiful 1 0 0
monument 1 1 0
victoria 0 1 0
memorial 0 1 0
also 0 1 0
i 0 0 1
like 0 0 1
to 0 0 1
visit 0 0 1
agra 0 0 1
Let us take a query like "taj mahal agra".
The query will be created as −
taj [100] & mahal [100] & agra [001]
or 100 & 100 & 001 = 000, so none of the documents is relevant when the terms are combined
with AND. We can then try other operators such as OR, or use different keywords in addition to
these.
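The bitwise evaluation above can be reproduced with a short Python sketch. It rebuilds each term's row of the term matrix on demand from the three example documents; the helper names incidence_vector, and_query, and or_query are hypothetical, and whitespace tokenization with lowercasing is an assumption.

```python
docs = {
    1: "Taj Mahal is a beautiful monument",
    2: "Victoria Memorial is also a monument",
    3: "I like to visit Agra",
}
doc_ids = sorted(docs)


def incidence_vector(term: str) -> list[int]:
    """One row of the term matrix: 1 if the document contains the term, else 0."""
    return [1 if term in docs[d].lower().split() else 0 for d in doc_ids]


def and_query(terms: list[str]) -> list[int]:
    """Bitwise AND of the incidence vectors of all query terms."""
    result = [1] * len(doc_ids)
    for term in terms:
        result = [a & b for a, b in zip(result, incidence_vector(term))]
    return result


def or_query(terms: list[str]) -> list[int]:
    """Bitwise OR of the incidence vectors of all query terms."""
    result = [0] * len(doc_ids)
    for term in terms:
        result = [a | b for a, b in zip(result, incidence_vector(term))]
    return result


query = ["taj", "mahal", "agra"]
and_result = and_query(query)  # 100 & 100 & 001
or_result = or_query(query)    # 100 | 100 | 001
print(and_result, or_result)
```

With AND the result vector is all zeros, matching the hand calculation; with OR, documents 1 and 3 match.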
The inverted index can be created for this corpus as −
taj - set(1)
mahal - set(1)
is - set(1,2)
a - set(1,2)
beautiful - set(1)
monument - set(1,2)
victoria - set(2)
memorial - set(2)
also - set(2)
i - set(3)
like - set(3)
to - set(3)
visit - set(3)
agra - set(3)
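An inverted index like the one listed above can be built and queried with Python sets, where AND, OR, and NOT map to set intersection, union, and difference. This is a minimal sketch: the function name build_inverted_index and the example queries are illustrative, and tokenization is plain whitespace splitting after lowercasing.

```python
from collections import defaultdict

docs = {
    1: "Taj Mahal is a beautiful monument",
    2: "Victoria Memorial is also a monument",
    3: "I like to visit Agra",
}


def build_inverted_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


index = build_inverted_index(docs)
all_docs = set(docs)

# AND -> intersection, OR -> union, NOT -> complement against all documents.
monument_and_is = index["monument"] & index["is"]
taj_or_agra = index["taj"] | index["agra"]
monument_not_beautiful = index["monument"] & (all_docs - index["beautiful"])

print(monument_and_is, taj_or_agra, monument_not_beautiful)
```

Each query returns a plain set of document ids with no ordering, which reflects the point that the Boolean model does not rank its results.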

Vector Space Model


The vector space model is a kind of statistical model of retrieval.
● In this model, the documents are represented as a bag of words.
● The bag allows words to occur more than once.
● The user can attach weights to the search query, like q = < ecommerce 0.5; products 0.8; price 0.2 >
● It is based on the similarity between the query and documents.
● Output is ranked documents.
● It can also encompass the multiple occurrences of words.
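The points above can be made concrete with a small sketch that ranks documents by cosine similarity between a weighted query and raw term-frequency vectors. Cosine similarity is one common choice of similarity measure and is an assumption here, as are the three toy documents and the helper names tf_vector and cosine; the query weights are the ones from the example above.

```python
import math
from collections import Counter


def tf_vector(text: str) -> Counter:
    """Bag of words: term -> number of occurrences in the text."""
    return Counter(text.lower().split())


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents for illustration only.
docs = {
    "d1": "cheap ecommerce products with low price",
    "d2": "ecommerce products products catalogue",
    "d3": "history of libraries",
}
query = {"ecommerce": 0.5, "products": 0.8, "price": 0.2}

scores = {doc_id: cosine(query, tf_vector(text)) for doc_id, text in docs.items()}
ranked = sorted(docs, key=lambda doc_id: scores[doc_id], reverse=True)
print(ranked)
```

Document d2 ranks first because "products", the most heavily weighted query term, occurs twice in it, which shows how multiple occurrences and query weights both affect the ranking.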
Graphical Representation
