CMP 312 - 2
Regular users of Web search engines casually expect to receive accurate and near-
instantaneous answers to questions and requests merely by entering a short query,
i.e., a few words, into a text box and clicking on a search button. Underlying this
simple and intuitive interface are clusters of computers, comprising thousands of
machines, working cooperatively to generate a ranked list of those Web pages that
are likely to satisfy the information need embodied in the query.
These machines identify a set of Web pages containing the terms in the query,
compute a score for each page, eliminate duplicate and redundant pages, generate
summaries of the remaining pages, and finally return the summaries and links back
to the user for browsing.
The effectiveness of such a retrieval system is commonly measured by precision, the
fraction of retrieved documents that are actually relevant, and recall, the fraction of
the relevant documents that are retrieved:

Precision = Number_Retrieved_Relevant / Number_Total_Retrieved

Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
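As an illustrative sketch (not from the text itself), precision and recall can be computed directly from a set of retrieved document IDs and a set of relevant document IDs; the document IDs below are made up for illustration.

def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document IDs."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    retrieved_relevant = retrieved & relevant   # documents that are both retrieved and relevant
    precision = len(retrieved_relevant) / len(retrieved) if retrieved else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved documents are relevant, out of 6 relevant documents overall.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d5", "d6", "d7"})
print(p, r)   # 0.75 0.5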
The Boolean model is the first model of information retrieval. The model can be
explained by thinking of a query term as an unambiguous definition of a set of
documents. For instance, the query term computer simply defines the set of all
documents that are indexed with the term computer. Using the operators of
mathematical logic, query terms and their corresponding sets of documents can be
combined to form new sets of documents.
Mathematical logic defines three basic operators: the logical product, called AND;
the logical sum, called OR; and the logical difference, called NOT. Combining terms
with the AND operator will define a document set that is smaller than or equal to
the document sets of any of the single terms. For instance, the query computer AND
science will produce the set of documents that are indexed both with the term
computer and the term science, i.e. the intersection of both sets. Combining terms
with the OR operator will define a document set that is bigger than or equal to the
document sets of any of the single terms. So, the query computer OR science will
produce the set of documents that are indexed with either the term computer or the
term science, or both, i.e. the union of both sets.
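A minimal sketch of Boolean retrieval over an inverted index, assuming a toy index that maps each term to the set of document IDs indexed with it (the index contents here are made up for illustration):

# Toy inverted index: term -> set of document IDs indexed with that term.
index = {
    "computer": {"d1", "d2", "d3"},
    "science": {"d2", "d3", "d4"},
}

# computer AND science: intersection of the two document sets.
print(sorted(index["computer"] & index["science"]))   # ['d2', 'd3']

# computer OR science: union of the two document sets.
print(sorted(index["computer"] | index["science"]))   # ['d1', 'd2', 'd3', 'd4']

# computer NOT science: set difference.
print(sorted(index["computer"] - index["science"]))   # ['d1']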
2. Using TF-IDF
The first question to address is: given a particular term t, how relevant is a particular
document d to that term? One approach is to use the number of occurrences of the
term in the document as a measure of its relevance, on the assumption that relevant
terms are likely to be mentioned many times in a document. Just counting the
number of occurrences of a term is usually not a good indicator: first, the number of
occurrences depends on the length of the document, and second, a document
containing 10 occurrences of a term may not be 10 times as relevant as a document
containing one occurrence.
One way of measuring TF(d, t), the relevance of document d to term t, that addresses
both of these issues is

TF(d, t) = log(1 + n(d, t) / n(d))
where n(d) denotes the number of term occurrences in the document and n(d, t)
denotes the number of occurrences of term t in the document d. Observe that this
metric takes the length of the document into account. The relevance grows with
more occurrences of a term in the document, although it is not directly proportional
to the number of occurrences.
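A short sketch of this term-frequency measure, assuming a document is given simply as a list of already-tokenized terms (the natural logarithm is used here; the base is not specified in the text):

import math
from collections import Counter

def tf(document_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)), where n(d) is the total number of term
    occurrences in the document and n(d, t) is the number of occurrences of t."""
    n_d = len(document_terms)               # n(d)
    n_d_t = Counter(document_terms)[term]   # n(d, t)
    return math.log(1 + n_d_t / n_d)

doc = ["computer", "science", "computer", "systems"]
print(tf(doc, "computer"))   # log(1 + 2/4), roughly 0.405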
Many systems refine the preceding metric by using other information. For instance,
if the term occurs in the title, or the author list, or the abstract, the document would
be considered more relevant to the term. Similarly, if the first occurrence of a term is
late in the document, the document may be considered less relevant than if the first
occurrence is early in the document. These notions can be formalized by extensions
of the formula we have shown for TF(d, t). In the information retrieval community,
the relevance of a document to a term is referred to as term frequency (TF),
regardless of the exact formula used.
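As a purely hypothetical illustration of such an extension (the title boost and the first-occurrence discount below are assumptions for illustration, not formulas from the text), occurrences could be weighted by where they appear, and the score reduced when the term first appears late in the body:

import math

def tf_extended(body_terms, title_terms, term, title_boost=2.0):
    """Hypothetical extension of TF(d, t): title occurrences count extra, and the
    score is reduced the later the term first appears in the body."""
    n_d = len(body_terms) + len(title_terms)
    n_d_t = body_terms.count(term) + title_boost * title_terms.count(term)
    score = math.log(1 + n_d_t / n_d)
    if term in body_terms:
        first_pos = body_terms.index(term)
        score *= 1.0 - 0.5 * (first_pos / len(body_terms))   # earlier first occurrence, smaller discount
    return score

title = ["database", "systems"]
body = ["a", "database", "stores", "data", "for", "many", "systems"]
print(tf_extended(body, title, "database"))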
To give less weight to terms that occur in many documents, an inverse document
frequency can be defined as

IDF(t) = 1 / n(t)

where n(t) denotes the number of documents (among those indexed by the system)
that contain the term t. The relevance of a document d to a set of terms Q is then
defined as

r(d, Q) = Σ_{t ∈ Q} TF(d, t) * IDF(t)

This measure can be further refined if the user is permitted to specify weights w(t)
for terms in the query, in which case the user-specified weights are also taken into
account by multiplying TF(d, t) by w(t) in the above formula.
The above approach of using term frequency and inverse document frequency as a
measure of the relevance of a document is called the TF–IDF approach.
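Putting the pieces together, a minimal sketch of TF-IDF ranking over a small in-memory collection (the documents are made up for illustration, and IDF is taken as 1/n(t) as above):

import math
from collections import Counter

def tf(doc_terms, term):
    return math.log(1 + Counter(doc_terms)[term] / len(doc_terms))

def idf(docs, term):
    n_t = sum(1 for d in docs.values() if term in d)   # n(t): documents containing the term
    return 1 / n_t if n_t else 0.0

def relevance(docs, doc_id, query_terms, weights=None):
    """r(d, Q) = sum over t in Q of TF(d, t) * IDF(t), optionally scaled by user weights w(t)."""
    weights = weights or {}
    return sum(tf(docs[doc_id], t) * idf(docs, t) * weights.get(t, 1.0) for t in query_terms)

docs = {
    "d1": ["computer", "science", "computer"],
    "d2": ["computer", "systems"],
    "d3": ["database", "systems"],
}
query = ["computer", "science"]
ranked = sorted(docs, key=lambda d: relevance(docs, d, query), reverse=True)
print(ranked)   # d1 ranks first: it contains both query terms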
Also, almost all text documents (in English) contain words such as and, or, a, and so
on, and hence these words are useless for querying purposes since their inverse
document frequency is extremely low. Information-retrieval systems define a set of
words, called stop words, containing 100 or so of the most common words, and
ignore these words when indexing a document. Such words are not used as
keywords and are discarded if present in the keywords supplied by the user.
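A small sketch of stop-word removal during indexing, assuming a hypothetical (and much shorter than real) stop-word list:

# Hypothetical, abbreviated stop-word list; real systems use roughly 100 of the most common words.
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "is"}

def index_terms(text):
    """Tokenize a document and drop stop words before indexing."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(index_terms("The design of a database and an information retrieval system"))
# ['design', 'database', 'information', 'retrieval', 'system']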
3. Similarity-Based Retrieval
The similarity between two documents d and e can be measured by the cosine
similarity metric

sim(d, e) = Σ_t r(d, t) * r(e, t) / ( sqrt(Σ_t r(d, t)^2) * sqrt(Σ_t r(e, t)^2) )

where r(d, t) denotes the relevance of document d to term t. You can easily verify
that the cosine similarity metric of a document with itself is 1, while that between
two documents that do not share any terms is 0. The name “cosine similarity”
comes from the fact that the above formula computes the cosine of the angle
between two vectors, one representing each document, defined as
follows: Let there be n words overall across all the documents being considered. An
n-dimensional space is defined, with each word as one of the dimensions. A
document d is represented by a point in this space, with the value of the ith
coordinate of the point being r(d, ti). The vector for document d connects the origin
(all coordinates = 0) to the point representing the document. The model of
documents as points and vectors in an n-dimensional space is called the vector space
model.
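A minimal sketch of the vector space model and cosine similarity, assuming each document vector is given as a dict mapping terms to relevance values r(d, t) (raw term counts are used here purely for illustration):

import math

def cosine_similarity(vec_d, vec_e):
    """Cosine of the angle between two term-weight vectors (dicts of term -> r(d, t))."""
    shared = set(vec_d) & set(vec_e)
    dot = sum(vec_d[t] * vec_e[t] for t in shared)
    norm_d = math.sqrt(sum(v * v for v in vec_d.values()))
    norm_e = math.sqrt(sum(v * v for v in vec_e.values()))
    if norm_d == 0 or norm_e == 0:
        return 0.0
    return dot / (norm_d * norm_e)

d = {"computer": 2, "science": 1}
e = {"computer": 1, "systems": 3}
print(cosine_similarity(d, d))                  # 1.0: a document compared with itself
print(cosine_similarity(d, e))                  # between 0 and 1: only "computer" is shared
print(cosine_similarity(d, {"database": 4}))    # 0.0: no shared terms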
4. Popularity Ranking
The basic idea of popularity ranking (also called prestige ranking) is to find pages
that are popular and to rank them higher than other pages that contain the specified
keywords. Since most searches are intended to find information from popular pages,
ranking such pages higher is generally a good idea. For instance, the term google may
occur in vast numbers of pages, but the page google.com is the most popular among
the pages that contain the term google. The page google.com should therefore be
ranked as the most relevant answer to a query consisting of the term google.
Traditional measures of relevance of a page such as the TF–IDF-based measures
can be combined with the popularity of the page to get an overall measure of the
relevance of the page to the query. Pages with the highest overall relevance value are
returned as the top answers to a query.
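As a hedged sketch of this final step (the specific combination rule, a simple weighted sum with parameter alpha, is an assumption for illustration; real engines combine many more signals), a query-dependent TF-IDF score and a query-independent popularity score can be merged into one overall relevance value:

def overall_relevance(tfidf_score, popularity, alpha=0.7):
    """Combine a query-dependent TF-IDF score with a query-independent popularity
    score using a weighted sum; alpha controls the balance between the two."""
    return alpha * tfidf_score + (1 - alpha) * popularity

# Hypothetical scores for pages matching the query "google": the highly popular
# page wins overall even though its TF-IDF score is slightly lower.
pages = {
    "google.com":        {"tfidf": 0.60, "popularity": 0.99},
    "some-blog.example": {"tfidf": 0.70, "popularity": 0.05},
}
ranked = sorted(pages, key=lambda p: overall_relevance(pages[p]["tfidf"], pages[p]["popularity"]),
                reverse=True)
print(ranked)   # ['google.com', 'some-blog.example']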