IR Notes

Information Retrieval (IR) is a software program focused on organizing, storing, and retrieving textual information from document repositories based on user queries. IR models rank documents based on their relevance to a user's query, utilizing various methods such as Boolean and Vector Space Models. While IR systems offer efficient access and personalized results, they also face challenges like information overload and privacy concerns.


What is Information Retrieval?

Information Retrieval (IR) can be defined as the software that deals with the organization,
storage, retrieval, and evaluation of information from document repositories, particularly
textual information. Information Retrieval is the activity of obtaining material of an
unstructured nature (usually text documents) that satisfies an information need from within
large collections stored on computers. For example, information retrieval takes place when a
user enters a query into a search system.
An IR system has the ability to represent, store, organize, and access information items. A set of
keywords is required to search. Keywords are what people type into search engines; they
summarize the description of the information being sought.
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that the user has asked for
in the form of a query. The documents and the queries are represented in a similar manner, so
that document selection and ranking can be formalized by a matching function that returns a
retrieval status value (RSV) for each document in the collection. Many Information Retrieval
systems represent document contents by a set of descriptors, called terms, belonging to a
vocabulary V. An IR model determines the query-document matching function according to four
main approaches:
The estimation of the probability of the user’s relevance rel for each document d and query q
with respect to a set R_q of training documents: Prob(rel | d, q, R_q)
Types of IR Models
Components of Information Retrieval/ IR Model
● Acquisition: In this step, the selection of documents and other objects from various web
resources that consist of text-based documents takes place. The required data is collected
by web crawlers and stored in the database.
● Representation: It consists of indexing, which may use free-text terms or a controlled
vocabulary and may be done by manual or automatic techniques. Examples: abstracting, which
summarizes the document, and bibliographic description, which records the author, title,
sources, data, and metadata.
● File Organization: There are two basic file organization methods. Sequential: records are
stored document by document. Inverted: records are stored term by term, with a list of
records under each term. A combination of both can also be used.
● Query: An IR process starts when a user enters a query into the system. Queries are
formal statements of information needs, for example, search strings in web search
engines. In information retrieval, a query does not uniquely identify a single object in the
collection. Instead, several objects may match the query, perhaps with different degrees
of relevancy.
Difference Between Information Retrieval and Data Retrieval
Information Retrieval | Data Retrieval
The software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Data retrieval deals with obtaining data from a database management system such as ODBMS. It is a process of identifying and retrieving the data from the database, based on the query provided by user or application.
Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data.
Small errors are likely to go unnoticed. | A single error object means total failure.
Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics.
Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system.
The results obtained are approximate matches. | The results obtained are exact matches.
Results are ordered by relevance. | Results are unordered by relevance.
It is a probabilistic model. | It is a deterministic model.
User Interaction With Information Retrieval System

● The User Task: The user must first translate the information need into a query. In an
information retrieval system, the query is a set of words that conveys the semantics of the
information required, whereas in a data retrieval system a query expression conveys the
constraints that the objects must satisfy. Example: a user starts out searching for one thing
but ends up looking at something else. This means that the user is browsing and not searching.
The above figure shows the interaction of the user through different tasks.
● Logical View of the Documents: Historically, documents were represented through a set of
index terms or keywords. Modern computers can represent a document by its full set of
words, and the set of representative keywords is then reduced by eliminating stopwords,
i.e. articles and connectives. These operations are called text operations. They reduce
the complexity of the document representation from full text to a set of index terms.
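The stopword-elimination step described above can be sketched as follows. This is a minimal illustration, not a production tokenizer: the stopword list and the function name index_terms are assumptions made for the example, and real systems use much larger lists and more careful tokenization.

```python
# Hypothetical, deliberately tiny stopword list (articles and connectives).
STOPWORDS = {"a", "an", "the", "is", "and", "or", "but", "to", "of", "also"}

def index_terms(text):
    """Reduce full text to index terms: lowercase, split on whitespace,
    and drop every token that appears in the stopword list."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

terms = index_terms("Taj Mahal is a beautiful monument")
print(terms)  # ['taj', 'mahal', 'beautiful', 'monument']
```

The six-word sentence is reduced to four index terms, which is the reduction in representation complexity that text operations aim for.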
Past, Present, and Future of Information Retrieval
1. Early Developments: As the need for information grew, it became necessary to build data
structures for faster access. The index is the data structure for faster retrieval of
information. For centuries, indexes were built by manual categorization into hierarchies.
2. Information Retrieval in Libraries: Libraries were the first to adopt IR systems. The first
generation automated previous technologies, and search was based on author name and title. The
second generation added searching by subject heading, keywords, etc. The third generation added
graphical interfaces, electronic forms, hypertext features, etc.
3. The Web and Digital Libraries: The web is cheaper than many other sources of information, it
provides greater access to networks due to digital communication, and it gives free access to
publish on a larger medium.
Advantages of Information Retrieval
1. Efficient Access: Information retrieval techniques make it possible for users to easily locate
and retrieve vast amounts of data or information.
2. Personalization of Results: User profiling and personalization techniques are used in
information retrieval models to tailor search results to individual preferences and behaviors.
3. Scalability: Information retrieval models are capable of handling increasing data volumes.
4. Precision: These systems can provide highly accurate and relevant search results, reducing
the likelihood of irrelevant information appearing in search results.
Disadvantages of Information Retrieval
1. Information Overload: When a lot of information is available, users often face information
overload, making it difficult to find the most useful and relevant material.
2. Lack of Context: Information retrieval systems may fail to understand the context of a user’s
query, potentially leading to inaccurate results.
3. Privacy and Security Concerns: As information retrieval systems often access sensitive user
data, they can raise privacy and security concerns.
4. Maintenance Challenges: Keeping these systems up-to-date and effective requires ongoing
efforts, including regular updates, data cleaning, and algorithm adjustments.
5. Bias and fairness: Ensuring that information retrieval systems do not exhibit biases and
provide fair and unbiased results is a crucial challenge, especially in contexts like web search
engines and recommendation systems.
Applications of IR
Information retrieval (IR) systems were first developed to help manage huge amounts of
information. Many university, corporate, and public libraries now use IR systems to
provide access to books, journals, and other documents. Information retrieval is used today in
many applications. General applications of information retrieval systems are as follows:

1. Digital library

A digital library is a library in which collections are stored in digital formats and are
accessible by computers. The digital content may be stored locally, or accessed remotely via
computer networks. A digital library is a type of information retrieval system. An emerging
field of library and information science is focused on the human user aspects of information
retrieval.

2. Semantic web

The current web is primarily composed of pages with information in the form of
natural language texts and images intended for human viewing and understanding. Machines are
used primarily to render this information, laying it out on the screen or printed page. The idea
behind the semantic web is to augment these web pages with markup that captures some of the
meaning of the content on the pages and encodes it in a form that is suitable for machine
understanding.

3. Search engines

A search engine is one of the most practical applications of information retrieval techniques
to large-scale text collections. Web search engines are the best-known examples, but many other
kinds of search exist, such as desktop search, enterprise search, federated search, mobile
search, and social search.
A web search engine is designed to search for information on the World Wide Web. The search
results are usually presented in a list of results and are commonly called hits. The information
may consist of web pages, images, and other types of files.
4. Natural language processing
Natural language processing is focused on the syntactic, semantic, and pragmatic analysis of
natural language text and discourse. It involves:
- The ability to analyze syntax (phrase structure) and semantics, which could allow retrieval
based on meaning rather than keywords.
- Methods for determining the sense of an ambiguous word based on context (word sense
disambiguation).
- Methods for identifying specific pieces of information in a document
(information extraction).

5. Search Engine Marketing

Search engine marketing (SEM) is a form of Internet marketing that involves the promotion of
websites by increasing their visibility in search engine results pages. SEM may incorporate
search engine optimization (SEO), which adjusts or rewrites website content and site
architecture to achieve a higher ranking in search engine results pages, to enhance pay per click
(PPC) listings.

6. Machine learning

According to Russell and Norvig (2013), machine learning focuses on the development of
computational systems that improve their performance with experience. Machine learning
has been successfully implemented in recommendation systems to improve product
sales.

7. Artificial intelligence

Artificial intelligence is focused on the representation of knowledge, reasoning,
and intelligent action. AI uses formalisms for representing knowledge and queries, such as
first-order predicate logic and Bayesian networks.
Work on web ontologies and intelligent information agents is among the most recent
applications of IR.
Figure 1: A simplified IR architecture in search engines.

IR System evaluation
IR evaluation means determining how accurate an IR system is (Anwar A., 2014). The two basic
measures for evaluating an IR system are:
● Precision - the fraction of retrieved documents that are relevant to the user’s information
need.
● Recall - the fraction of relevant documents in the collection that are retrieved. It answers
the question of whether all the relevant documents were retrieved.
The higher the precision and recall, the better the system.
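The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below is a minimal illustration; the function name precision_recall and the document ids are hypothetical and are not taken from the notes.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved and relevant document ids.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    An empty denominator is reported as 0.0 to avoid division by zero.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: the system returned 4 documents, 5 are truly relevant,
# and 2 documents (ids 2 and 4) appear in both sets.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7})
print(p, r)  # 0.5 0.4
```

Here half of what was returned is relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).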

Brief History of Information Retrieval

The approach of managing and organizing large collections of information came from
librarianship. It can be claimed that cataloguing is the primordial soup from which
Information Retrieval was born. In earlier days, books, documents, sacred
manuscripts, scriptures, epics, and spiritual documents were kept and indexed using cataloguing
schemes. Eliot and Rose claim that in the 3rd century B.C. the Greek poet Callimachus first
created his own cataloguing scheme for managing his personal collections. In ancient periods,
some big libraries were built. For example, the library at Alexandria (280 B.C.) had more than
700,000 documents. Nalanda University had one huge library for document storage. However, it is
still unknown whether any mechanism existed to organize, classify, or retrieve them.

In 1891, Rudolph filed a patent with the US patent office for a machine composed of catalogue
cards joined together, which could be wound past a viewing window, enabling rapid manual scanning
of the catalogue. In 1918, Soper filed another patent for a device in which catalogue cards with
holes, related to categories, were aligned in front of each other to determine whether a
collection had entries with a particular combination of categories. If light could be seen
through the arrangement of cards, a match was found.
Over the following years, the need was felt for mechanical devices that could search a catalogue
for a particular entry. Emanuel Goldberg was the first person who worked on his own
to solve that problem, in the 1920s and ’30s. His solution was an optical device
that searched for a pattern of dots or letters within catalogue entries stored on a roll of
microfilm. Goldberg patented many of his inventions in photography. Figure 1 shows the
diagram of the patent filed in USPTO in 1928. “Here it can be seen that catalogue entries were
stored on a roll of film (figure 1). A query (2) was also on film showing a negative image of the
part of the catalogue being searched for; in this case the 1st and 6th entries on the roll. A light
source (7) was shone through the catalogue roll and query film, focused onto a photocell (6). If
an exact match was found, all light was blocked to the cell causing a relay to move a counter
forward (12) and for an image of the match to be shown via a half silvered mirror (3), reflecting
the match onto a screen or photographic plate.”
After this invention, in 1935, Davis and Draeger also carried out several experiments along
similar lines on microfilm-based searching. According to Mooers, their work influenced Vannevar
Bush, who proposed the famous Memex system in 1945.

Ralph Shaw implemented the Rapid Selector in the US Department of Agriculture (USDA) library.
This machine was developed under the supervision of engineers at MIT, who worked from the
earlier version of the Rapid Selector with the consent of Vannevar Bush, and it was delivered to
USDA in 1949. “It was reported to search through a 2,000 foot reel of film. Each half of the film’s frames
had a different purpose: one half for ‘frames of material’; the other for ‘index entries’. It is stated
that 72,000 frames were stored on the film, which in total were indexed by 430,000 entries.
Shaw reported that the selector was able to search at the rate of 78,000 entries per minute.”

In 1950, Luhn also built a selector using punch cards, light, and photocells; this system could
search over 600 cards per minute. Another important feature of this system was that it could
search for a pattern of consecutive characters within a long string. At a conference in 1950,
Calvin Mooers first coined the term “Information Retrieval”.

Introduction
Document retrieval in machine learning is part of a larger field known as Information
Retrieval: given a query from the user, the system tries to find the documents relevant to the
query and to rank them in order of relevance or match.
There are different approaches to document retrieval; two popular ones are −
● Boolean Model
● Vector Space Model
Let us have a brief understanding of each of the above methods.
Boolean Model
It is a set-based retrieval model. The user query is in Boolean form: query terms are joined
using AND, OR, NOT, etc. A document can be viewed as a set of keywords. A document is
retrieved only if it satisfies the query expression. Partial matches and ranking are not supported.
Example (Boolean query) −
[[America & France] | [Honduras & London]] & restaurants & !Manhattan
Steps and Flow diagram of Boolean Model

The Boolean model uses an inverted index search to decide whether a document is relevant or
not. It does not return a rank for the document.
Suppose we have 3 documents in our corpus.
document_id document_text
1. Taj Mahal is a beautiful monument
2. Victoria Memorial is also a monument
3. I like to visit Agra
The term matrix will be created as below.
term doc_1 doc_2 doc_3
taj 1 0 0
mahal 1 0 0
is 1 1 0
a 1 1 0
beautiful 1 0 0
monument 1 1 0
victoria 0 1 0
memorial 0 1 0
also 0 1 0
i 0 0 1
like 0 0 1
to 0 0 1
visit 0 0 1
agra 0 0 1
Let us take a query like "taj mahal agra".
The query will be evaluated as −
taj [100] & mahal [100] & agra [001]
or 100 & 100 & 001 = 000, so none of the documents is relevant when the terms are joined with
AND. We can then try other operators such as OR, or use different keywords in addition to
these.
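The bitwise evaluation above can be reproduced with a short sketch. It is only an illustration under simple assumptions: terms are lowercased and split on whitespace, and the helper names incidence, boolean_and, and boolean_or are hypothetical.

```python
# The three-document corpus from the example above.
docs = {
    1: "Taj Mahal is a beautiful monument",
    2: "Victoria Memorial is also a monument",
    3: "I like to visit Agra",
}
doc_ids = sorted(docs)

def incidence(term):
    """Row of the term matrix for `term`: one 0/1 flag per document."""
    return [1 if term in docs[d].lower().split() else 0 for d in doc_ids]

def boolean_and(terms):
    """AND the incidence rows of all query terms, position by position."""
    result = [1] * len(doc_ids)
    for term in terms:
        result = [a & b for a, b in zip(result, incidence(term))]
    return result

def boolean_or(terms):
    """OR the incidence rows of all query terms, position by position."""
    result = [0] * len(doc_ids)
    for term in terms:
        result = [a | b for a, b in zip(result, incidence(term))]
    return result

query = ["taj", "mahal", "agra"]
print(boolean_and(query))  # [0, 0, 0]  no document contains all three terms
print(boolean_or(query))   # [1, 0, 1]  documents 1 and 3 contain at least one
```

Switching from AND to OR turns the empty result into documents 1 and 3, which is the kind of query relaxation the text suggests.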
The inverted index can be created for this corpus as −
taj - set(1)
mahal - set(1)
is - set(1,2)
a - set(1,2)
beautiful - set(1)
monument - set(1,2)
victoria - set(2)
memorial - set(2)
also - set(2)
i - set(3)
like - set(3)
to - set(3)
visit - set(3)
agra - set(3)
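An inverted index like the one listed above can be built in a few lines. The sketch below is a minimal illustration that assumes whitespace tokenization and lowercasing; the function names build_inverted_index and postings are hypothetical. Boolean operators then map onto set operations: AND is intersection, OR is union, and NOT is set difference.

```python
from collections import defaultdict

docs = {
    1: "Taj Mahal is a beautiful monument",
    2: "Victoria Memorial is also a monument",
    3: "I like to visit Agra",
}

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

index = build_inverted_index(docs)

def postings(term):
    """Posting set for a term; unknown terms match no document."""
    return index.get(term, set())

print(postings("monument"))                          # {1, 2}
print(postings("taj") & postings("agra"))            # set()   taj AND agra
print(postings("taj") | postings("agra"))            # {1, 3}  taj OR agra
print(postings("monument") - postings("beautiful"))  # {2}     monument AND NOT beautiful
```

Compared with scanning the term matrix, the index only touches the posting sets of the query terms, which is why it is the usual file organization for Boolean search.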

Vector Space Model

The vector space model is a kind of statistical model of retrieval.
● In this model, the documents are represented as a bag of words.
● The bag allows words to occur more than once.
● The user can attach weights to the search query, e.g. q = < ecommerce 0.5; products 0.8; price 0.2 >
● It is based on the similarity between the query and documents.
● Output is ranked documents.
● It can also encompass the multiple occurrences of words.
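The points above can be illustrated with a small ranking sketch. Cosine similarity is used here as the similarity measure, which is a common choice but an assumption on our part, since the notes do not name one. The three documents are made up for the example; only the weighted query is taken from the text.

```python
import math
from collections import Counter

def cosine(query_weights, doc_counts):
    """Cosine similarity between a weighted query and a bag-of-words document."""
    dot = sum(w * doc_counts.get(t, 0) for t, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(c * c for c in doc_counts.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

# Hypothetical documents; repeated words are kept, as in a bag of words.
docs = {
    "d1": "ecommerce products products price",
    "d2": "price price price list",
    "d3": "travel guide",
}
# Weighted query from the text above.
query = {"ecommerce": 0.5, "products": 0.8, "price": 0.2}

bags = {d: Counter(text.split()) for d, text in docs.items()}
scores = {d: cosine(query, bags[d]) for d in docs}
ranking = sorted(docs, key=lambda d: scores[d], reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

Unlike the Boolean model, d2 is still returned even though it matches only one query term: it is a partial match and is simply ranked below d1.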
Graphical Representation
