CMP 312 - 2
Regular users of Web search engines casually expect to receive accurate and near-
instantaneous answers to questions and requests merely by entering a short query,
i.e., a few words, into a text box and clicking on a search button. Underlying this
simple and intuitive interface are clusters of computers, comprising thousands of
machines, working cooperatively to generate a ranked list of those Web pages that
are likely to satisfy the information need embodied in the query.
These machines identify a set of Web pages containing the terms in the query,
compute a score for each page, eliminate duplicate and redundant pages, generate
summaries of the remaining pages, and finally return the summaries and links back
to the user for browsing.
The effectiveness of such a retrieval system is commonly measured by precision, the
fraction of retrieved documents that are actually relevant, and recall, the fraction of
the relevant documents that are retrieved:

Precision = Number_Retrieved_Relevant / Number_Total_Retrieved

Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
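As an illustrative sketch (not from the text itself), precision and recall can be computed directly from a set of retrieved document IDs and a set of relevant document IDs; the document IDs below are made up for illustration.

def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document IDs."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    retrieved_relevant = retrieved & relevant   # documents that are both retrieved and relevant
    precision = len(retrieved_relevant) / len(retrieved) if retrieved else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved documents are relevant, out of 6 relevant documents overall.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d5", "d6", "d7"})
print(p, r)   # 0.75 0.5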
The Boolean model is the first model of information retrieval. The model can be
explained by thinking of a query term as an unambiguous definition of a set of
documents. For instance, the query term computer simply defines the set of all
documents that are indexed with the term computer. Using the operators of
mathematical logic, query terms and their corresponding sets of documents can be
combined to form new sets of documents.
Mathematical logic defines three basic operators: the logical product, called AND;
the logical sum, called OR; and the logical difference, called NOT. Combining terms
with the AND operator will define a document set that is smaller than or equal to
the document sets of any of the single terms. For instance, the query computer AND
science will produce the set of documents that are indexed both with the term
computer and the term science, i.e. the intersection of both sets. Combining terms
with the OR operator will define a document set that is bigger than or equal to the
document sets of any of the single terms. So, the query computer OR science will
produce the set of documents that are indexed with either the term computer or the
term science, or both, i.e. the union of both sets.
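A minimal sketch of Boolean retrieval over an inverted index, assuming a toy index that maps each term to the set of document IDs indexed with it (the index contents here are made up for illustration):

# Toy inverted index: term -> set of document IDs indexed with that term.
index = {
    "computer": {"d1", "d2", "d3"},
    "science": {"d2", "d3", "d4"},
}

# computer AND science: intersection of the two document sets.
print(sorted(index["computer"] & index["science"]))   # ['d2', 'd3']

# computer OR science: union of the two document sets.
print(sorted(index["computer"] | index["science"]))   # ['d1', 'd2', 'd3', 'd4']

# computer NOT science: set difference.
print(sorted(index["computer"] - index["science"]))   # ['d1']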
2. Using TF-IDF
The first question to address is: given a particular term t, how relevant is a particular
document d to that term? One approach is to use the number of occurrences of the
term in the document as a measure of its relevance, on the assumption that relevant
terms are likely to be mentioned many times in a document. Just counting the
number of occurrences of a term is usually not a good indicator: first, the number of
occurrences depends on the length of the document, and second, a document
containing 10 occurrences of a term may not be 10 times as relevant as a document
containing one occurrence.
One way of measuring TF(d, t), the relevance of document d to term t, that addresses
both of these issues is

TF(d, t) = log(1 + n(d, t) / n(d))
where n(d) denotes the number of term occurrences in the document and n(d, t)
denotes the number of occurrences of term t in the document d. Observe that this
metric takes the length of the document into account. The relevance grows with
more occurrences of a term in the document, although it is not directly proportional
to the number of occurrences.
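A short sketch of this term-frequency measure, assuming a document is given simply as a list of already-tokenized terms (the natural logarithm is used here; the base is not specified in the text):

import math
from collections import Counter

def tf(document_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)), where n(d) is the total number of term
    occurrences in the document and n(d, t) is the number of occurrences of t."""
    n_d = len(document_terms)               # n(d)
    n_d_t = Counter(document_terms)[term]   # n(d, t)
    return math.log(1 + n_d_t / n_d)

doc = ["computer", "science", "computer", "systems"]
print(tf(doc, "computer"))   # log(1 + 2/4), roughly 0.405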
Many systems refine the preceding metric by using other information. For instance,
if the term occurs in the title, or the author list, or the abstract, the document would
be considered more relevant to the term. Similarly, if the first occurrence of a term is
late in the document, the document may be considered less relevant than if the first
occurrence is early in the document. These notions can be formalized by extensions
of the formula we have shown for TF(d, t). In the information retrieval community,
the relevance of a document to a term is referred to as term frequency (TF),
regardless of the exact formula used.
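As a purely hypothetical illustration of such an extension (the title boost and the first-occurrence discount below are assumptions for illustration, not formulas from the text), occurrences could be weighted by where they appear, and the score reduced when the term first appears late in the body:

import math

def tf_extended(body_terms, title_terms, term, title_boost=2.0):
    """Hypothetical extension of TF(d, t): title occurrences count extra, and the
    score is reduced the later the term first appears in the body."""
    n_d = len(body_terms) + len(title_terms)
    n_d_t = body_terms.count(term) + title_boost * title_terms.count(term)
    score = math.log(1 + n_d_t / n_d)
    if term in body_terms:
        first_pos = body_terms.index(term)
        score *= 1.0 - 0.5 * (first_pos / len(body_terms))   # earlier first occurrence, smaller discount
    return score

title = ["database", "systems"]
body = ["a", "database", "stores", "data", "for", "many", "systems"]
print(tf_extended(body, title, "database"))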
To give less weight to terms that occur in many documents, an inverse document
frequency can be defined as

IDF(t) = 1 / n(t)

where n(t) denotes the number of documents (among those indexed by the system)
that contain the term t. The relevance of a document d to a set of terms Q is then
defined as

r(d, Q) = Σ_{t ∈ Q} TF(d, t) * IDF(t)

This measure can be further refined if the user is permitted to specify weights w(t)
for terms in the query, in which case the user-specified weights are also taken into
account by multiplying TF(d, t) by w(t) in the above formula.
The above approach of using term frequency and inverse document frequency as a
measure of the relevance of a document is called the TF–IDF approach.
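Putting the pieces together, a minimal sketch of TF-IDF ranking over a small in-memory collection (the documents are made up for illustration, and IDF is taken as 1/n(t) as above):

import math
from collections import Counter

def tf(doc_terms, term):
    return math.log(1 + Counter(doc_terms)[term] / len(doc_terms))

def idf(docs, term):
    n_t = sum(1 for d in docs.values() if term in d)   # n(t): documents containing the term
    return 1 / n_t if n_t else 0.0

def relevance(docs, doc_id, query_terms, weights=None):
    """r(d, Q) = sum over t in Q of TF(d, t) * IDF(t), optionally scaled by user weights w(t)."""
    weights = weights or {}
    return sum(tf(docs[doc_id], t) * idf(docs, t) * weights.get(t, 1.0) for t in query_terms)

docs = {
    "d1": ["computer", "science", "computer"],
    "d2": ["computer", "systems"],
    "d3": ["database", "systems"],
}
query = ["computer", "science"]
ranked = sorted(docs, key=lambda d: relevance(docs, d, query), reverse=True)
print(ranked)   # d1 ranks first: it contains both query terms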
Also, almost all text documents (in English) contain words such as and, or, a, and so
on, and hence these words are useless for querying purposes since their inverse
document frequency is extremely low. Information-retrieval systems define a set of
words, called stop words, containing 100 or so of the most common words, and
ignore these words when indexing a document. Such words are not used as
keywords and are discarded if present in the keywords supplied by the user.
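A small sketch of stop-word removal during indexing, assuming a hypothetical (and much shorter than real) stop-word list:

# Hypothetical, abbreviated stop-word list; real systems use roughly 100 of the most common words.
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to", "in", "is"}

def index_terms(text):
    """Tokenize a document and drop stop words before indexing."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(index_terms("The design of a database and an information retrieval system"))
# ['design', 'database', 'information', 'retrieval', 'system']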
3. Similarity-Based Retrieval
The similarity between two documents d and e can be measured by the cosine
similarity metric

sim(d, e) = Σ_t r(d, t) * r(e, t) / ( sqrt(Σ_t r(d, t)^2) * sqrt(Σ_t r(e, t)^2) )

where r(d, t) denotes the relevance of document d to term t. You can easily verify
that the cosine similarity metric of a document with itself is 1, while that between
two documents that do not share any terms is 0. The name “cosine similarity”
comes from the fact that the above formula computes the cosine of the angle
between two vectors, one representing each document, defined as
follows: Let there be n words overall across all the documents being considered. An
n-dimensional space is defined, with each word as one of the dimensions. A
document d is represented by a point in this space, with the value of the ith
coordinate of the point being r(d, ti). The vector for document d connects the origin
(all coordinates = 0) to the point representing the document. The model of
documents as points and vectors in an n-dimensional space is called the vector space
model.
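A minimal sketch of the vector space model and cosine similarity, assuming each document vector is given as a dict mapping terms to relevance values r(d, t) (raw term counts are used here purely for illustration):

import math

def cosine_similarity(vec_d, vec_e):
    """Cosine of the angle between two term-weight vectors (dicts of term -> r(d, t))."""
    shared = set(vec_d) & set(vec_e)
    dot = sum(vec_d[t] * vec_e[t] for t in shared)
    norm_d = math.sqrt(sum(v * v for v in vec_d.values()))
    norm_e = math.sqrt(sum(v * v for v in vec_e.values()))
    if norm_d == 0 or norm_e == 0:
        return 0.0
    return dot / (norm_d * norm_e)

d = {"computer": 2, "science": 1}
e = {"computer": 1, "systems": 3}
print(cosine_similarity(d, d))                  # 1.0: a document compared with itself
print(cosine_similarity(d, e))                  # between 0 and 1: only "computer" is shared
print(cosine_similarity(d, {"database": 4}))    # 0.0: no shared terms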
4. Popularity Ranking
The basic idea of popularity ranking (also called prestige ranking) is to find pages
that are popular and to rank them higher than other pages that contain the specified
keywords. Since most searches are intended to find information from popular pages,
ranking such pages higher is generally a good idea. For instance, the term google may
occur in vast numbers of pages, but the page google.com is the most popular among
the pages that contain the term google. The page google.com should therefore be
ranked as the most relevant answer to a query consisting of the term google.
Traditional measures of relevance of a page such as the TF–IDF-based measures
can be combined with the popularity of the page to get an overall measure of the
relevance of the page to the query. Pages with the highest overall relevance value are
returned as the top answers to a query.
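As a hedged sketch of this final step (the specific combination rule, a simple weighted sum with parameter alpha, is an assumption for illustration; real engines combine many more signals), a query-dependent TF-IDF score and a query-independent popularity score can be merged into one overall relevance value:

def overall_relevance(tfidf_score, popularity, alpha=0.7):
    """Combine a query-dependent TF-IDF score with a query-independent popularity
    score using a weighted sum; alpha controls the balance between the two."""
    return alpha * tfidf_score + (1 - alpha) * popularity

# Hypothetical scores for pages matching the query "google": the highly popular
# page wins overall even though its TF-IDF score is slightly lower.
pages = {
    "google.com":        {"tfidf": 0.60, "popularity": 0.99},
    "some-blog.example": {"tfidf": 0.70, "popularity": 0.05},
}
ranked = sorted(pages, key=lambda p: overall_relevance(pages[p]["tfidf"], pages[p]["popularity"]),
                reverse=True)
print(ranked)   # ['google.com', 'some-blog.example']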