Adt Unit 5
What is an IR Model?
An Information Retrieval (IR) model selects and ranks the documents that match the need a user expresses in the form of a query. Documents and queries are represented in a similar manner, so that document selection and ranking can be formalized by a matching function that returns a retrieval status value (RSV) for each document in the collection. Many information retrieval systems represent document contents by a set of descriptors, called terms, belonging to a vocabulary V. An IR model determines the query-document matching function. One main approach is the estimation of the probability of the user's relevance rel for each document d and query q with respect to a set R_q of training documents:
P(rel | d, q, R_q)
Types of IR Models
Information Retrieval vs. Data Retrieval
Information Retrieval: the software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. It does not provide a solution to the user of the database system.
Data Retrieval: deals with obtaining data from a database management system such as an ODBMS. It is a process of identifying and retrieving data from the database based on the query provided by the user or application. It provides solutions to the user of the database system.
Components of Information Retrieval / IR Model
The User Task: The user first has to translate the information need into a query. In an information retrieval system, the query is a set of words that conveys the semantics of the required information, whereas in a data retrieval system a query expression conveys the constraints that must be satisfied by the retrieved objects. Example: a user who sets out to search for one thing but ends up following other, related material is browsing rather than searching. Retrieval (searching) and browsing are therefore the two main user tasks.
Logical View of the Documents: Historically, documents were represented by a small set of index terms or keywords. Nowadays, modern computers can represent a document by its full set of words, and the retrieval system then reduces this to a smaller set of representative keywords. This can be done by eliminating stopwords (articles and connectives), among other text operations. These text operations reduce the complexity of the document representation from full text to a set of index terms.
Information Retrieval Models
2.1 Introduction
The purpose of this chapter is two-fold: First, we want to set the stage for the problems in
information retrieval that we try to address in this thesis. Second, we want to give the reader a quick
overview of the major textual retrieval methods, because the InfoCrystal can help to visualize the
output from any of them. We begin by providing a general model of the information retrieval
process. We then briefly describe the major retrieval methods and characterize them in terms of
their strengths and shortcomings.
The goal of information retrieval (IR) is to provide users with those documents that will satisfy their
information need. We use the word "document" as a general term that could also include non-
textual information, such as multimedia objects. Figure 2.1 provides a general overview of the information retrieval process, which has been adapted from Lancaster and Warner (1993). Users
have to formulate their information need in a form that can be understood by the retrieval
mechanism. There are several steps involved in this translation process that we will briefly discuss
below. Likewise, the contents of large document collections need to be described in a form that
allows the retrieval mechanism to identify the potentially relevant documents quickly. In both cases,
information may be lost in the transformation process leading to a computer-usable representation.
Hence, the matching process is inherently imperfect.
Information seeking is a form of problem solving [Marcus 1994, Marchionini 1992]. It proceeds
according to the interaction among eight subprocesses: problem recognition and acceptance,
problem definition, search system selection, query formulation, query execution, examination of
results (including relevance feedback), information extraction, and reflection/iteration/termination.
To be able to perform effective searches, users have to develop the following expertise: knowledge
about various sources of information, skills in defining search problems and applying search
strategies, and competence in using electronic search tools.
Marchionini (1992) contends that some sort of spreadsheet is needed that supports users in the
problem definition as well as other information seeking tasks. The InfoCrystal is such a spreadsheet
because it assists users in the formulation of their information needs and the exploration of the
retrieved documents, using a visual interface that supports a "what-if" functionality. He further
predicts that advances in computing power and speed, together with improved information retrieval
procedures, will continue to blur the distinctions between problem articulation and examination of
results. The InfoCrystal is both a visual query language and a tool for visualizing retrieval results.
The information need can be understood as forming a pyramid, where only its peak is made visible
by users in the form of a conceptual query (see Figure 2.1). The conceptual query captures the key
concepts and the relationships among them. It is the result of a conceptual analysis that operates on
the information need, which may be well or vaguely defined in the user's mind. This analysis can be
challenging, because users are faced with the general "vocabulary problem" as they are trying to
translate their information need into a conceptual query. This problem refers to the fact that a single
word can have more than one meaning, and, conversely, the same concept can be described by
surprisingly many different words. Furnas, Landauer, Gomez and Dumais (1983) have shown that
two people use the same main word to describe an object only 10 to 20% of the time. Further, the
concepts used to represent the documents can be different from the concepts used by the user. The
conceptual query can take the form of a natural language statement, a list of concepts that can have
degrees of importance assigned to them, or it can be a statement that coordinates the concepts using
Boolean operators. Finally, the conceptual query has to be translated into a query surrogate that can
be understood by the retrieval system.
Figure 2.1: represents a general model of the information retrieval process, where both the user's
information need and the document collection have to be translated into the form of surrogates to
enable the matching process to be performed. This figure has been adapted from Lancaster and
Warner (1993).
Similarly, the meanings of documents need to be represented in the form of text surrogates that can
be processed by computer. A typical surrogate can consist of a set of index terms or descriptors. The
text surrogate can consist of multiple fields, such as the title, abstract, descriptor fields to capture
the meaning of a document at different levels of resolution or focusing on different characteristic
aspects of a document. Once the specified query has been executed by the IR system, the user is presented with the retrieved document surrogates. Either the user is satisfied by the retrieved information, or the user evaluates the retrieved documents and modifies the query to initiate a further search. The process of query modification based on user evaluation of the retrieved documents is
known as relevance feedback [Lancaster and Warner 1993]. Information retrieval is an inherently
interactive process, and the users can change direction by modifying the query surrogate, the
conceptual query or their understanding of their information need.
It is worth noting here the results, which have been obtained in studies investigating the
information-seeking process, that describe information retrieval in terms of the cognitive and
affective symptoms commonly experienced by a library user. The findings by Kuhlthau et al. (1990)
indicate that thoughts about the information need become clearer and more focused as users move
through the search process. Similarly, uncertainty, confusion, and frustration are nearly universal
experiences in the early stages of the search process, and they decrease as the search process
progresses and feelings of being confident, satisfied, sure and relieved increase. The studies also
indicate that cognitive attributes may affect the search process. Users' expectations of the
information system and the search process may influence the way they approach searching and
therefore affect the intellectual access to information.
Analytical search strategies require the formulation of specific, well-structured queries and a
systematic, iterative search for information, whereas browsing involves the generation of broad
query terms and a scanning of much larger sets of information in a relatively unstructured fashion.
Campagnoni et al. (1989) have found in information retrieval studies in hypertext systems that the
predominant search strategy is "browsing" rather than "analytical search". Many users, especially
novices, are unwilling or unable to precisely formulate their search objectives, and browsing places
less cognitive load on them. Furthermore, their research showed that search strategy is only one
dimension of effective information retrieval; individual differences in visual skill appear to play an
equally important role.
These two studies argue for information displays that provide a spatial overview of the data
elements and that simultaneously provide rich visual cues about the content of the individual data
elements. Such a representation is less likely to increase the anxiety that is a natural part of the early
stages of the search process and it caters for a browsing interaction style, which is appropriate
especially in the beginning, when many users are unable to precisely formulate their search
objectives.
The following major models have been developed to retrieve information: the Boolean model, the
Statistical model, which includes the vector space and the probabilistic retrieval model, and the
Linguistic and Knowledge-based models. The first model is often referred to as the "exact match"
model; the latter ones as the "best match" models [Belkin and Croft 1992]. The material presented
here is based on the textbooks by Lancaster and Warner (1993) as well as Frakes and Baeza-Yates
(1992), the review article by Belkin and Croft (1992), and discussions with Richard Marcus, my thesis
advisor and mentor in the field of information retrieval.
Queries generally are less than perfect in two respects: First, they retrieve some irrelevant
documents. Second, they do not retrieve all the relevant documents. The following two measures
are usually used to evaluate the effectiveness of a retrieval method. The first one, called the
precision rate, is equal to the proportion of the retrieved documents that are actually relevant. The
second one, called the recall rate, is equal to the proportion of all relevant documents that are
actually retrieved. If searchers want to raise precision, then they have to narrow their queries. If
searchers want to raise recall, then they have to broaden their queries. In general, there is an inverse
relationship between precision and recall. Users need help to become knowledgeable in how to
manage the precision and recall trade-off for their particular information need [Marcus 1991].
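As a concrete illustration, the following minimal Python sketch computes these two measures for a single query; the retrieved and relevant document-id sets are hypothetical examples, not taken from the text.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved -- set of document ids returned by the system
    relevant  -- set of document ids judged relevant for the query
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant (precision 0.6),
# and 3 of the 6 relevant documents were found (recall 0.5).
p, r = precision_recall({1, 2, 3, 4, 5}, {2, 3, 5, 8, 9, 10})
print(p, r)  # 0.6 0.5
```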
In Table 2.1 we summarize the defining characteristics of the standard Boolean approach and list its
key advantages and disadvantages. It has the following strengths: 1) It is easy to implement and it is
computationally efficient [Frakes and Baeza-Yates 1992]. Hence, it is the standard model for the
current large-scale, operational retrieval systems and many of the major on-line information services
use it. 2) It enables users to express structural and conceptual constraints to describe important
linguistic features [Marcus 1991]. Users find that synonym specifications (reflected by OR-clauses)
and phrases (represented by proximity relations) are useful in the formulation of queries [Cooper
1988, Marcus 1991]. 3) The Boolean approach possesses a great expressive power and clarity.
Boolean retrieval is very effective if a query requires an exhaustive and unambiguous selection. 4)
The Boolean method offers a multitude of techniques to broaden or narrow a query. 5) The Boolean
approach can be especially effective in the later stages of the search process, because of the clarity
and exactness with which relationships between concepts can be represented.
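To make the exact-match behaviour concrete, here is a minimal Python sketch that evaluates a Boolean query with set operations over a toy inverted index; the terms and document ids are invented for illustration only.

```python
# Each term maps to the set of documents that contain it (a tiny inverted index).
index = {
    "information": {1, 2, 4},
    "retrieval":   {1, 4, 5},
    "database":    {2, 3, 5},
}
all_docs = {1, 2, 3, 4, 5}

def AND(a, b): return a & b          # narrows the result set
def OR(a, b):  return a | b          # broadens the result set
def NOT(a):    return all_docs - a   # complements against the whole collection

# (information AND retrieval) AND NOT database -> exact-match result set
result = AND(AND(index["information"], index["retrieval"]), NOT(index["database"]))
print(sorted(result))  # [1, 4]
```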
The standard Boolean approach has the following shortcomings: 1) Users find it difficult to construct
effective Boolean queries for several reasons [Cooper 1988, Fox and Koll 1988, Belkin and Croft
1992]. Users are using the natural language terms AND, OR or NOT that have a different meaning
when used in a query. Thus, users will make errors when they form a Boolean query, because they
resort to their knowledge of English.
Table 2.1: summarizes the defining characteristics of the standard Boolean approach and lists its key advantages and disadvantages.
For example, in ordinary conversation a noun phrase of the form "A and B" usually refers to more
entities than would "A" alone, whereas when used in the context of information retrieval it refers to
fewer documents than would be retrieved by "A" alone. Hence, one of the common mistakes made
by users is to substitute the AND logical operator for the OR logical operator when translating an
English sentence to a Boolean query. Furthermore, to form complex queries, users must be familiar
with the rules of precedence and the use of parentheses. Novice users have difficulty using
parentheses, especially nested parentheses. Finally, users are overwhelmed by the multitude of
ways a query can be structured or modified, because of the combinatorial explosion of feasible
queries as the number of concepts increases. In particular, users have difficulty identifying and
applying the different strategies that are available for narrowing or broadening a Boolean query
[Marcus 1991, Lancaster and Warner 1993]. 2) Only documents that satisfy a query exactly are
retrieved. On the one hand, the AND operator is too severe because it does not distinguish between
the case when none of the concepts are satisfied and the case where all except one are satisfied.
Hence, no or very few documents are retrieved when more than three or four criteria are
combined with the Boolean operator AND (referred to as the Null Output problem). On the other
hand, the OR operator does not reflect how many concepts have been satisfied. Hence, often too
many documents are retrieved (the Output Overload problem). 3) It is difficult to control the
number of retrieved documents. Users are often faced with the null-output or the information overload problem, and they are at a loss as to how to modify the query to retrieve a reasonable number of documents. 4) The traditional Boolean approach does not provide a relevance ranking of the
retrieved documents, although modern Boolean approaches can make use of the degree of
coordination, field level and degree of stemming present to rank them [Marcus 1991]. 5) It does not
represent the degree of uncertainty or error due to the vocabulary problem [Belkin and Croft 1992].
2.3.1.2 Narrowing and Broadening Techniques
As mentioned earlier, a Boolean query can be described in terms of the following four operations:
degree and type of coordination, proximity constraints, field specifications and degree of stemming
as expressed in terms of word/string specifications. If users want to (re)formulate a Boolean query
then they need to make informed choices along these four dimensions to create a query that is
sufficiently broad or narrow depending on their information needs. Most narrowing techniques
lower recall as well as raise precision, and most broadening techniques lower precision as well as
raise recall. Any query can be reformulated to achieve the desired precision or recall characteristics,
but generally it is difficult to achieve both. Each of the four kinds of operations in the query
formulation has particular operators, some of which tend to have a narrowing or broadening effect.
For each operator with a narrowing effect, there is one or more inverse operators with a broadening
effect [Marcus 1991]. Hence, users require help to gain an understanding of how changes along
these four dimensions will affect the broadness or narrowness of a query.
Figure 2.2: captures how coordination, proximity, field level and stemming affect the broadness or
narrowness of a Boolean query. By moving in the direction in which the wedges are expanding the
query is broadened.
Figure 2.2 shows how the four dimensions affect the broadness or narrowness of a query: 1)
Coordination: the different Boolean operators AND, OR and NOT have the following effects when
used to add a further concept to a query: a) the AND operator narrows a query; b) the OR broadens
it; c) the effect of the NOT depends on whether it is combined with an AND or OR operator.
Typically, in searching textual databases, the NOT is connected to the AND, in which case it has a
narrowing effect like the AND operator. 2) Proximity: The closer together two terms have to appear
in a document, the narrower and more precise the query. The most stringent proximity constraint
requires the two terms to be adjacent. 3) Field level: current document records have fields
associated with them, such as the "Title", "Index", "Abstract" or "Full-text" field: a) the more fields
that are searched, the broader the query; b) the individual fields have varying degrees of precision
associated with them, where the "title" field is the most specific and the "full-text" field is the most
general. 4) Stemming: The shorter the prefix that is used in truncation-based searching, the broader
the query. By reducing a term to its morphological stem and using it as a prefix, users can retrieve
many terms that are conceptually related to the original term [Marcus 1991].
Using Figure 2.2, we can easily read off how to broaden a query. We just need to move in the direction
in which the wedges are expanding: we use the OR operator (rather than the AND), impose no
proximity constraints, search over all fields and apply a great deal of stemming. Similarly, we can
formulate a very narrow query by moving in the direction in which the wedges are contracting: we
use the AND operator (rather than the OR), impose proximity constraints, restrict the search to the
title field and perform exact rather than truncated word matches. In Chapter 4 we will show how
Figure 2.2 indicates how the broadness or narrowness of a Boolean query could be visualized.
There have been attempts to help users overcome some of the disadvantages of the traditional
Boolean approach discussed above. We will now describe such a method, called Smart Boolean, developed by
Marcus [1991, 1994] that tries to help users construct and modify a Boolean query as well as make
better choices along the four dimensions that characterize a Boolean query. We are not attempting
to provide an in-depth description of the Smart Boolean method, but to use it as a good example
that illustrates some of the possible ways to make Boolean retrieval more user-friendly and
effective. Table 2.2 provides a summary of the key features of the Smart Boolean approach.
Users start by specifying a natural language statement that is automatically translated into a Boolean
Topic representation that consists of a list of factors or concepts, which are automatically
coordinated using the AND operator. If the user at the initial stage can or wants to include
synonyms, then they are coordinated using the OR operator. Hence, the Boolean Topic
representation connects the different factors using the AND operator, where the factors can consist
of single terms or several synonyms connected by the OR operator. One of the goals of the Smart
Boolean approach is to make use of the structural knowledge contained in the text surrogates,
where the different fields represent contexts of useful information. Further, the Smart Boolean
approach wants to use the fact that related concepts can share a common stem. For example, the
concepts "computers" and "computing" have the common stem comput*.
Table 2.2: summarizes the defining characteristics of the Smart Boolean approach and lists its key advantages and disadvantages.
The initial strategy of the Smart Boolean approach is to start out with the broadest possible query
within the constraints of how the factors and their synonyms have been coordinated. Hence, it
modifies the Boolean Topic representation into the query surrogate by using only the stems of the
concepts and searches for them over all the fields. Once the query surrogate has been executed,
users are guided in the process of evaluating the retrieved document surrogates. They choose from
a list of reasons to indicate why they consider certain documents as relevant. Similarly, they can
indicate why other documents are not relevant by interacting with a list of possible reasons. This
user feedback is used by the Smart Boolean system to automatically modify the Boolean Topic
representation or the query surrogate, whatever is more appropriate. The Smart Boolean approach
offers a rich set of strategies for modifying a query based on the received relevance feedback or the
expressed need to narrow or broaden the query. The Smart Boolean retrieval paradigm has been
implemented in the form of a system called CONIT, which is one of the earliest expert retrieval
systems that was able to demonstrate that ordinary users, assisted by such a system, could perform
equally well as experienced search intermediaries [Marcus 1983]. However, users have to navigate
through a series of menus listing different choices, where it might be hard for them to appreciate
the implications of some of these choices. A key limitation of the previous versions of the CONIT
system has been that they lacked a visual interface. The most recent version has a graphical interface and
it uses the tiling metaphor suggested by Anick et al. (1991), and discussed in section 10.4, to visualize
Boolean coordination [Marcus 1994]. This visualization approach suffers from the limitation that it enables users to visualize only specific queries, whereas we will propose a visual interface that represents a whole range of related Boolean queries in a single display, making changes in Boolean coordination more user-friendly. Further, the different strategies of modifying a query in CONIT require a better visualization metaphor to enable users to make use of these search heuristics. In
Chapter 4 we show how some of these modification techniques can be visualized.
Several methods have been developed to extend the Boolean model to address the following issues:
1) The Boolean operators are too strict and ways need to be found to soften them. 2) The standard
Boolean approach has no provision for ranking. The Smart Boolean approach and the methods
described in this section provide users with relevance ranking [Fox and Koll 1988, Marcus 1991]. 3)
The Boolean model does not support the assignment of weights to the query or document terms.
We will briefly discuss the P-norm and the Fuzzy Logic approaches that extend the Boolean model to
address the above issues.
Table 2.3: summarizes the defining characteristics of the Extended Boolean approach and lists its key advantages and disadvantages.
The P-norm method developed by Fox (1983) allows query and document terms to have weights,
which have been computed by using term frequency statistics with the proper normalization
procedures. These normalized weights can be used to rank the documents in the order of decreasing
distance from the point (0, 0, ... , 0) for an OR query, and in order of increasing distance from the
point (1, 1, ... , 1) for an AND query. Further, the Boolean operators have a coefficient P associated
with them to indicate the degree of strictness of the operator (from 1 for least strict to infinity for
most strict, i.e., the Boolean case). The P-norm uses a distance-based measure and the coefficient P
determines the degree of exponentiation to be used. The exponentiation is an expensive
computation, especially for P-values greater than one.
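The following Python sketch illustrates the P-norm idea under the simplifying assumption of equal query-term weights; it shows only the distance-based scoring described above, not Fox's full formulation, and the document weights are invented.

```python
def pnorm_or(weights, p):
    """Extended Boolean OR: distance from the all-zero point (equal query-term weights assumed)."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p):
    """Extended Boolean AND: 1 minus the distance from the all-one point."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

doc = [0.8, 0.0]             # term weights of one document for two query terms
print(pnorm_and(doc, 1))     # 0.4   -> behaves like an average (least strict)
print(pnorm_and(doc, 10))    # ~0.07 -> approaches the strict Boolean AND as p grows
```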
In Fuzzy Set theory, an element has a varying degree of membership to a set instead of the
traditional binary membership choice. The weight of an index term for a given document reflects the
degree to which this term describes the content of a document. Hence, this weight reflects the
degree of membership of the document in the fuzzy set associated with the term in question. The
degree of membership for union and intersection of two fuzzy sets is equal to the maximum and
minimum, respectively, of the degrees of membership of the elements of the two sets. In the "Mixed
Min and Max" model developed by Fox and Sharat (1986) the Boolean operators are softened by
considering the query-document similarity to be a linear combination of the min and max weights of
the documents.
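A minimal sketch of the fuzzy operators described above, together with an illustrative "Mixed Min and Max" style combination; the document weights and the mixing parameter lam are hypothetical.

```python
def fuzzy_and(weights):           # intersection of fuzzy sets: minimum membership
    return min(weights)

def fuzzy_or(weights):            # union of fuzzy sets: maximum membership
    return max(weights)

def mixed_min_max(weights, lam):  # linear combination of min and max, 0 <= lam <= 1
    return lam * min(weights) + (1 - lam) * max(weights)

doc = [0.9, 0.3]                  # degrees to which two query terms describe a document
print(fuzzy_and(doc), fuzzy_or(doc), mixed_min_max(doc, 0.7))  # 0.3 0.9 0.48
```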
The vector space and probabilistic models are the two major examples of the statistical retrieval
approach. Both models use statistical information in the form of term frequencies to determine the
relevance of documents with respect to a query. Although they differ in the way they use the term
frequencies, both produce as their output a list of documents ranked by their estimated relevance.
The statistical retrieval models address some of the problems of Boolean retrieval methods, but they
have disadvantages of their own. Table 2.4 provides a summary of the key features of the vector space
and probabilistic approaches. We will also describe Latent Semantic Indexing and clustering
approaches that are based on statistical retrieval approaches, but their objective is to respond to
what the user's query did not say, could not say, but somehow made manifest [Furnas et al. 1983,
Cutting et al. 1991].
The vector space model represents the documents and queries as vectors in a multidimensional
space, whose dimensions are the terms used to build an index to represent the documents [Salton
1983]. The creation of an index involves lexical scanning to identify the significant terms, where
morphological analysis reduces different word forms to common "stems", and the occurrence of
those stems is computed. Query and document surrogates are compared by comparing their
vectors, using, for example, the cosine similarity measure. In this model, the terms of a query
surrogate can be weighted to take into account their importance, and they are computed by using
the statistical distributions of the terms in the collection and in the documents [Salton 1983]. The
vector space model can assign a high ranking score to a document that contains only a few of the
query terms if these terms occur infrequently in the collection but frequently in the document. The
vector space model makes the following assumptions: 1) The more similar a document vector is to a
query vector, the more likely it is that the document is relevant to that query. 2) The words used to
define the dimensions of the space are orthogonal or independent. While it is a reasonable first
approximation, the assumption that words are pairwise independent is not realistic.
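The following self-contained Python sketch illustrates the vector space model on a toy collection: documents and the query are turned into tf-idf vectors and compared with the cosine measure. The documents, the tokenizer and the exact weighting scheme are illustrative choices, not the only ones used in practice.

```python
import math
from collections import Counter

docs = {
    1: "information retrieval ranks documents",
    2: "database systems store data",
    3: "retrieval of information from large document collections",
}

def tokenize(text):
    return text.lower().split()

N = len(docs)
df = Counter()                       # document frequency of each term
for text in docs.values():
    df.update(set(tokenize(text)))

def tfidf_vector(text):
    tf = Counter(tokenize(text))
    # weight = tf * log(N / df); terms unseen in the collection are dropped
    return {t: f * math.log(N / df[t]) for t, f in tf.items() if t in df}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

query = tfidf_vector("information retrieval")
ranking = sorted(docs, key=lambda d: cosine(query, tfidf_vector(docs[d])), reverse=True)
print(ranking)   # documents ordered by decreasing cosine similarity to the query
```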
The probabilistic retrieval model is based on the Probability Ranking Principle, which states that an
information retrieval system is supposed to rank the documents based on their probability of
relevance to the query, given all the evidence available [Belkin and Croft 1992]. The principle takes
into account that there is uncertainty in the representation of the information need and the
documents. There can be a variety of sources of evidence that are used by the probabilistic retrieval
methods, and the most common one is the statistical distribution of the terms in both the relevant
and non-relevant documents.
We will now describe the state-of-the-art system developed by Turtle and Croft (1991) that uses Bayesian inference networks to rank documents, using multiple sources of evidence to compute the conditional probability that a user's information need is satisfied by a given document.
The statistical approaches have the following strengths: 1) They provide users with a relevance
ranking of the retrieved documents. Hence, they enable users to control the output by setting a
relevance threshold or by specifying a certain number of documents to display. 2) Queries can be
easier to formulate because users do not have to learn a query language and can use natural
language. 3) The uncertainty inherent in the choice of query concepts can be represented. However,
the statistical approaches have the following shortcomings: 1) They have a limited expressive power.
For example, the NOT operation can not be represented because only positive weights are used. It
can be proven that only 2^(N^2) of the 2^(2^N) possible Boolean queries can be generated by the
statistical approaches that use weighted linear sums to rank the documents. This result follows from
the analysis of Linear Threshold Networks or Boolean Perceptrons [Anthony and Biggs 1992]. For
example, the very common and important Boolean query ((A and B) or (C and D)) can not be
represented by a vector space query (see section 5.4 for a proof). Hence, the statistical approaches
do not have the expressive power of the Boolean approach. 2) The statistical approach lacks the structure to express important linguistic features such as phrases. Proximity constraints are also difficult to express, a feature that is of great use to experienced searchers. 3) The computation of the relevance scores can be computationally expensive. 4) A ranked linear list provides users with a limited view of the information space and it does not directly suggest how to modify a query if the need arises [Spoerri 1993, Hearst 1994]. 5) The queries have to contain a large number of words to
improve the retrieval performance. As is the case for the Boolean approach, users are faced with the
problem of having to choose the appropriate words that are also used in the relevant documents.
Table 2.4 summarizes the advantages and disadvantages that are specific to the vector space and
probabilistic model, respectively. This table also shows the formulas that are commonly used to
compute the term weights. The two central quantities used are the inverse document frequency of a term in the collection (idf), and the frequency of a term i in a document j (freq(i,j)). In the probabilistic model,
the weight computation also considers how often a term appears in the relevant and irrelevant
documents, but this presupposes that the relevant documents are known or that these frequencies
can be reliably estimated.
Table 2.4: summarizes the defining characteristics of the statistical retrieval approach, which includes the vector space and the probabilistic model, and lists their key advantages and disadvantages.
If users provide the retrieval system with relevance feedback, then this information is used by the
statistical approaches to recompute the weights as follows: the weights of the query terms in the
relevant documents are increased, whereas the weights of the query terms that do not appear in the
relevant documents are decreased [Salton and Buckley 1990]. There are multiple ways of computing
and updating the weights, where each has its advantages and disadvantages. We do not discuss
these formulas in more detail, because research on relevance feedback has shown that significant
effectiveness improvements can be gained by using quite simple feedback techniques [Salton and
Buckley 1990]. Furthermore, what is important to this thesis is that the statistical retrieval approach generates a ranked list; how this ranking has been computed in detail is immaterial for the purpose of this thesis.
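One common way to implement this kind of reweighting is a Rocchio-style update. The sketch below is illustrative only; the coefficients alpha, beta and gamma are conventional example values, not values prescribed by the text.

```python
def rocchio(query_vec, relevant_vecs, nonrelevant_vecs, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector (dict: term -> weight).

    Terms occurring in the relevant documents gain weight, terms occurring
    only in the non-relevant documents lose weight; negative weights are dropped.
    """
    terms = set(query_vec)
    for v in relevant_vecs + nonrelevant_vecs:
        terms |= set(v)
    new_query = {}
    for t in terms:
        w = alpha * query_vec.get(t, 0.0)
        if relevant_vecs:
            w += beta * sum(v.get(t, 0.0) for v in relevant_vecs) / len(relevant_vecs)
        if nonrelevant_vecs:
            w -= gamma * sum(v.get(t, 0.0) for v in nonrelevant_vecs) / len(nonrelevant_vecs)
        if w > 0:
            new_query[t] = w
    return new_query
```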
Several statistical and AI techniques have been used in association with domain semantics to extend
the vector space model to help overcome some of the retrieval problems described above, such as
the "dependence problem" or the "vocabulary problem". One such method is Latent Semantic
Indexing (LSI). In LSI the associations among terms and documents are calculated and exploited in
the retrieval process. The assumption is that there is some "latent" structure in the pattern of word
usage across documents and that statistical techniques can be used to estimate this latent structure.
An advantage of this approach is that queries can retrieve documents even if they have no words in
common. The LSI technique captures deeper associative structure than simple term-to-term
correlations and is completely automatic. The only difference between LSI and vector space methods
is that LSI represents terms and documents in a reduced dimensional space of the derived indexing
dimensions. As with the vector space method, differential term weighting and relevance feedback
can improve LSI performance substantially.
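The sketch below shows the core of LSI on a toy term-document matrix using a truncated singular value decomposition (numpy assumed available); the matrix, the query and the choice of k = 2 latent dimensions are purely illustrative.

```python
import numpy as np

# Rows are terms, columns are documents (a tiny term-document count matrix).
A = np.array([
    [2, 0, 1, 0],   # "retrieval"
    [1, 0, 0, 1],   # "information"
    [0, 3, 0, 1],   # "database"
    [0, 1, 2, 0],   # "index"
], dtype=float)

k = 2                                    # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Documents in the reduced k-dimensional space (one column per document).
doc_vectors = Sk @ Vtk

# A query is folded into the same space: q_k = q U_k S_k^{-1}
q = np.array([1, 1, 0, 0], dtype=float)  # query containing "retrieval information"
q_k = q @ Uk @ np.linalg.inv(Sk)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

scores = [cosine(q_k, doc_vectors[:, j]) for j in range(A.shape[1])]
print(scores)   # documents can score above zero even without sharing a literal query term
```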
Foltz and Dumais (1992) compared four retrieval methods that are based on the vector-space
model. The four methods were the result of crossing two factors, the first factor being whether the
retrieval method used Latent Semantic Indexing or keyword matching, and the second factor being
whether the profile was based on words or phrases provided by the user (Word profile), or
documents that the user had previously rated as relevant (Document profile). The LSI match-
document profile method proved to be the most successful of the four methods. This method
combines the advantages of both LSI and the document profile. The document profile provides a
simple, but effective, representation of the user's interests. Indicating just a few documents that are
of interest is as effective as generating a long list of words and phrases that describe one's interest.
Document profiles have an added advantage over word profiles: users can just indicate documents
they find relevant without having to generate a description of their interests.
In the simplest form of automatic text retrieval, users enter a string of keywords that are used to
search the inverted indexes of the document keywords. This approach retrieves documents based
solely on the presence or absence of exact single word strings as specified by the logical
representation of the query. Clearly this approach will miss many relevant documents because it
does not capture the complete or deep meaning of the user's query. The Smart Boolean approach
and the statistical retrieval approaches, each in their specific way, try to address this problem (see
Table 2.5). Linguistic and knowledge-based approaches have also been developed to address this
problem by performing a morphological, syntactic and semantic analysis to retrieve documents
more effectively [Lancaster and Warner 1993]. In a morphological analysis, roots and affixes are
analyzed to determine the part of speech (noun, verb, adjective etc.) of the words. Next complete
phrases have to be parsed using some form of syntactic analysis. Finally, the linguistic methods have
to resolve word ambiguities and/or generate relevant synonyms or quasi-synonyms based on the
semantic relationships between words. The development of a sophisticated linguistic retrieval
system is difficult and it requires complex knowledge bases of semantic information and retrieval
heuristics. Hence these systems often require techniques that are commonly referred to as artificial
intelligence or expert systems techniques.
We will now describe in some detail the DR-LINK system developed by Liddy et al., because it
represents an exemplary linguistic retrieval system. DR-LINK is based on the principle that retrieval
should take place at the conceptual level and not at the word level. Liddy et al. attempt to retrieve
documents on the basis of what people mean in their query and not just what they say in their
query. The DR-LINK system employs sophisticated linguistic text processing techniques to capture the
conceptual information in documents. Liddy et al. have developed a modular system that represents
and matches text at the lexical, syntactic, semantic, and the discourse levels of language. Some of
the modules that have been incorporated are: The Text Structurer is based on discourse linguistic
theory that suggests that texts of a particular type have a predictable structure which serves as an
indication where certain information can be found. The Subject Field Coder uses an established
semantic coding scheme from a machine-readable dictionary to tag each word with its
disambiguated subject code (e.g., computer science, economics) and to then produce a fixed-length,
subject-based vector representation of the document and the query. The Proper Noun Interpreter
uses a variety of processing heuristics and knowledge bases to produce: a canonical representation
of each proper noun; a classification of each proper noun into thirty-seven categories; and an
expansion of group nouns into their constituent proper noun members. The Complex Nominal
Phraser provides means for precise matching of complex semantic constructs when expressed as
either adjacent nouns or a non-predicating adjective and noun pair. Finally, The Natural Language
Query Constructor takes as input a natural language query and produces a formal query that reflects
the appropriate logical combination of text structure, proper noun, and complex nominal
requirements of the user's information need. This module interprets a query into pattern-action
rules that translate each sentence into a first-order logic assertion, reflecting the Boolean-like
requirements of queries.
Table 2.5: characterizes the major retrieval methods in terms of how they deal with lexical, morphological, syntactic and semantic issues.
To summarize, the DR-LINK retrieval system represents content at the conceptual level rather than
at the word level to reflect the multiple levels of human language comprehension. The text
representation combines the lexical, syntactic, semantic, and discourse levels of understanding to
predict the relevance of a document. DR-LINK accepts natural language statements, which it
translates into a precise Boolean representation of the user's relevance requirements. It also
produces summary-level, semantic vector representations of queries and documents to provide a
ranking of the documents.
2.4 Conclusion
There is a growing discrepancy between the retrieval approach used by existing commercial retrieval
systems and the approaches investigated and promoted by a large segment of the information
retrieval research community. The former is based on the Boolean or Exact Matching retrieval
model, whereas the latter ones subscribe to statistical and linguistic approaches, also referred to as
the Partial Matching approaches. First, the major criticism leveled against the Boolean approach is
that its queries are difficult to formulate. Second, the Boolean approach makes it possible to
represent structural and contextual information that would be very difficult to represent using the
statistical approaches. Third, the Partial Matching approaches provide users with a ranked output,
but these ranked lists obscure valuable information. Fourth, recent retrieval experiments have shown that the Exact and Partial matching approaches are complementary and should therefore be combined [Belkin et al. 1993].
Table 2.6: lists some of the key problems in the field of information retrieval and possible solutions.
In Table 2.6 we summarize some of the key problems in the field of information retrieval and
possible solutions to them. We will attempt to show in this thesis: 1) how visualization can offer
ways to address these problems; 2) how to formulate and modify a query; 3) how to deal with large
sets of retrieved documents, commonly referred to as the information overload problem. In
particular, this thesis overcomes one of the major "bottlenecks" of the Boolean approach by
showing how Boolean coordination and its diverse narrowing and broadening techniques can be
visualized, thereby making it more user-friendly without limiting its expressive power. Further, this
thesis shows how both the Exact and Partial Matching approaches can be visualized in the same
visual framework to enable users to make effective use of their respective strengths.
TEXT PREPROCESSING
Information retrieval is the task of obtaining relevant information from a large collection of documents. Preprocessing plays an important role in information retrieval because it makes the relevant information easier to extract. A text preprocessing approach works in two steps: first, a spell-check utility is used to enhance stemming, and second, synonyms of similar tokens are combined. The commonly used text preprocessing techniques are:
1. Stopword Removal
Stopwords are very commonly used words in a language that play a major role in the
formation of a sentence but which seldom contribute to the meaning of that sentence. Words that
are expected to occur in 80 percent or more of the documents in a collection are typically referred
to as stopwords, and they are rendered potentially useless. Because of the commonness and
function of these words, they do not contribute much to the relevance of a document for a query
search. Examples include words such as the, of, to, a, and, in, said, for, that, was, on, he, is, with, at,
by, and it. Removal of stopwords from a document must be performed before indexing. Articles,
prepositions, conjunctions, and some pronouns are generally classified as stopwords. Queries must
also be preprocessed for stopword removal before the actual retrieval process. Removal of
stopwords results in elimination of possible spurious indexes, thereby reducing the size of an index
structure by about 40 percent or more. However, doing so could impact the recall if the stopword is
an integral part of a query (for example, a search for the phrase ‘To be or not to be,’ where removal
of stopwords makes the query inappropriate, as all the words in the phrase are stopwords). Many
search engines do not employ query stopword removal for this reason.
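A minimal Python sketch of stopword removal; the stopword list here is a small illustrative subset of the examples given above, not a complete list.

```python
# A tiny illustrative stopword list; real systems use longer, language-specific lists.
STOPWORDS = {"the", "of", "to", "a", "and", "in", "for", "that", "was",
             "on", "he", "is", "with", "at", "by", "it"}

def remove_stopwords(text):
    """Return the tokens of the text with stopwords removed (lowercased, whitespace tokenized)."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("The removal of stopwords is performed before indexing"))
# ['removal', 'stopwords', 'performed', 'before', 'indexing']
```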
2. Stemming
A stem of a word is defined as the word obtained after trimming the suffix and prefix of an
original word. For example, ‘comput’ is the stem word for computer, computing, and computation.
These suffixes and prefixes are very common in the English language for supporting the notion of
verbs, tenses, and plural forms. Stemming reduces the different forms of the word formed by
inflection (due to plurals or tenses) and derivation to a common stem.A stemming algorithm can be
applied to reduce any word to its stem. In English, the most famous stemming algorithm is Martin
Porter’s stemming algorithm. The Porter stemmer is a simplified version of Lovin’s technique that
uses a reduced set of about 60 rules (from 260 suffix patterns in Lovin’s technique) and organizes
them into sets; conflicts within one subset of rules are resolved before going on to the next. Using
stemming for preprocessing data results in a decrease in the size of the indexing structure and an
increase in recall, possibly at the cost of precision.
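As an illustration, the Porter stemmer is available in the NLTK library (assumed to be installed, e.g. via pip install nltk); the sketch below reduces the word forms mentioned above to their common stem.

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()
for word in ["computer", "computing", "computation", "computers"]:
    print(word, "->", stemmer.stem(word))
# each of these word forms is reduced to the common stem "comput"
```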
3. Utilizing a Thesaurus
A thesaurus comprises a precompiled list of important concepts and the main word that
describes each concept for a particular domain of knowledge. For each concept in this list, a set of
synonyms and related words is also compiled. Thus, a synonym can be converted to its matching
concept during preprocessing. This preprocessing step assists in providing a standard vocabulary for
indexing and searching. Usage of a thesaurus, also known as a collection of synonyms, has a
substantial impact on the recall of information systems. This process can be complicated because
many words have different meanings in different contexts. UMLS is a large biomedical thesaurus of
millions of concepts (called the Metathesaurus) and a semantic network of meta concepts and
relationships that organize the Metathesaurus. The concepts are assigned labels from the semantic
network. This thesaurus of concepts contains synonyms of medical terms, hierarchies of broader and
narrower terms, and other relationships among words and concepts that make it a very extensive
resource for information retrieval of documents in the medical domain.
WordNet is a manually constructed thesaurus that groups words into strict synonym sets called
synsets. These synsets are divided into noun, verb, adjective, and adverb categories. Within each
category, these synsets are linked together by appropriate relationships such as class/subclass or “is-
a” relationships for nouns.
WordNet is based on the idea of using a controlled vocabulary for indexing, thereby eliminating
redundancies. It is also useful in providing assistance to users with locating terms for proper query
formulation.
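If NLTK and its WordNet corpus are installed (pip install nltk, then a one-time nltk.download('wordnet')), synsets can be looked up directly, as in the sketch below; the word "query" is just an example.

```python
from nltk.corpus import wordnet as wn  # assumes the WordNet corpus has been downloaded

for synset in wn.synsets("query", pos=wn.NOUN):
    # print each noun synset containing "query" together with its synonym lemmas
    print(synset.name(), synset.lemma_names())
```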
4. Other Preprocessing Steps: Digits, Hyphens, Punctuation Marks, and Cases
Digits, dates, phone numbers, e-mail addresses, URLs, and other standard types of text may or may not be removed during preprocessing. Web search engines, however, index them in order to use this type of information in the document metadata to improve precision and recall.
Hyphens and punctuation marks may be handled in different ways. Either the entire phrase with the
hyphens/punctuation marks may be used, or they may be eliminated. In some systems, the
character representing the hyphen/punctuation mark may be removed, or may be replaced with a
space. Different information retrieval systems follow different rules of processing. Handling hyphens
automatically can be complex: it can either be done as a classification problem, or more commonly
by some heuristic rules.
Most information retrieval systems perform case-insensitive search, converting all the letters of the
text to uppercase or lowercase. It is also worth noting that many of these text preprocessing steps
are language specific, such as involving accents and diacritics and the idiosyncrasies that are
associated with a particular language.
5. Information Extraction
Information extraction (IE) is a generic term used for extracting structured content from text. Text analytic tasks such as identifying noun phrases, facts, events, people, places, and relationships are examples of IE tasks. These tasks are also called named entity recognition tasks and use rule-based approaches with either a thesaurus, regular expressions and grammars, or
probabilistic approaches. For IR and search applications, IE technologies are mostly used to identify
contextually relevant features that involve text analysis, matching, and categorization for improving
the relevance of search systems. Language technologies using part-of-speech tagging are applied to
semantically annotate the documents with extracted features to aid search relevance.
Inverted Index
An inverted index is an index data structure storing a mapping from content, such as words or
numbers, to their locations in a document or a set of documents. In simple words, it is a hashmap-like
data structure that directs you from a word to a document or a web page.
A record-level inverted index contains a list of references to documents for each word.
A word-level inverted index additionally contains the positions of each word within a document.
The latter form offers more functionality, but needs more processing power and space to be
created.
Suppose we want to index the texts "hello everyone", "this article is based on inverted index", and "which is hashmap like data structure". If we index each word by (document number, word position within the document), the index with locations is:
hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3), (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)
The word "hello" is in document 1 ("hello everyone") at word position 1, so it has the entry (1, 1), and the word "is" is in documents 2 and 3 at the 3rd and 2nd positions respectively (positions are counted in words).
The index may have weights, frequencies, or other indicators.
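A minimal Python sketch that builds the word-level (positional) inverted index for the three example texts above; the output matches the (document, position) entries listed.

```python
from collections import defaultdict

texts = {
    1: "hello everyone",
    2: "this article is based on inverted index",
    3: "which is hashmap like data structure",
}

# word -> list of (document id, 1-based word position) pairs
index = defaultdict(list)
for doc_id, text in texts.items():
    for position, word in enumerate(text.split(), start=1):
        index[word].append((doc_id, position))

print(index["hello"])   # [(1, 1)]
print(index["is"])      # [(2, 3), (3, 2)]
```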
Removing Stop Words: Stop words are the most frequently occurring words in a document that carry little meaning on their own, such as "I", "the", "we", "is", "an"; they are removed before indexing.
Whenever I search for "cat", I want to see documents that contain information about cats. But the word in a document may be "cats" or "catty" instead of "cat". To relate both words, we chop off part of each word we read so that we obtain its "root word" (stem). There are standard tools for this, such as Porter's Stemmer.
While building the index, if a word is already present, add a reference to the current document to its entry; otherwise create a new entry. Additional information such as the frequency and location of the word can also be stored.
Example:
Words Document
ant doc1
demo doc2
Advantages of an inverted index:
● The purpose of an inverted index is to allow fast full-text searches, at the cost of increased processing when a document is added to the database.
● It is easy to develop.
● It is the most popular data structure used in document retrieval systems, used on a large scale, for example, in search engines.
The inverted index also has disadvantages:
● Large storage overhead and high maintenance costs on update, delete and insert.
Evaluative Measures:
An internet search, otherwise known as a search query, is an entry into a search engine that yields
both paid and organic results. The paid results are the ads that appear at the top and the bottom of
the page, and they are marked accordingly. The organic results are the unmarked results that appear
in between the ads.
At the core of an internet search is a keyword. In turn, keywords are at the heart of search engine
marketing (SEM) and search engine optimization (SEO).
Web analytics is the process of analyzing the behavior of visitors to a website. This involves tracking,
reviewing and reporting data to measure web activity, including the use of a website and its
components, such as webpages, images and videos.
Data collected through web analytics may include traffic sources, referring sites, page views, paths
taken and conversion rates. The compiled data often forms a part of customer relationship
management analytics (CRM analytics) to facilitate and streamline better business decisions.
Web analytics enables a business to retain customers, attract more visitors and increase the dollar volume each customer spends. Among other things, it can be used to:
● Determine the likelihood that a given customer will repurchase a product after purchasing it in the past.
● Monitor the amount of money individual customers or specific groups of customers spend.
● Observe the geographic regions from which the most and the fewest customers visit the site and purchase specific products.
● Predict which products customers are most and least likely to buy in the future.
The objective of web analytics is to serve as a business metric for promoting specific products to the
customers who are most likely to buy them and to determine which products a specific customer is
most likely to purchase. This can help improve the ratio of revenue to marketing costs.
In addition to these features, web analytics may track the clickthrough and drilldown behavior of
customers within a website, determine the sites from which customers most often arrive, and
communicate with browsers to track and analyze online behavior. The results of web analytics are
provided in the form of tables, charts and graphs.
Setting goals. The first step in the web analytics process is for businesses to determine goals and the
end results they are trying to achieve. These goals can include increased sales, customer satisfaction
and brand awareness. Business goals can be both quantitative and qualitative.
Collecting data. The second step in web analytics is the collection and storage of data. Businesses
can collect data directly from a website or web analytics tool, such as Google Analytics. The data
mainly comes from Hypertext Transfer Protocol requests -- including data at the network and
application levels -- and can be combined with external data to interpret web usage. For example, a
user's Internet Protocol address is typically associated with many factors, including geographic
location and clickthrough rates.
Processing data. The next stage of the web analytics funnel involves businesses processing the
collected data into actionable information.
Identifying key performance indicators (KPIs). In web analytics, a KPI is a quantifiable measure to
monitor and analyze user behavior on a website. Examples include bounce rates, unique users, user
sessions and on-site search queries.
Developing a strategy. This stage involves implementing insights to formulate strategies that align
with an organization's goals. For example, search queries conducted on-site can help an organization
develop a content strategy based on what users are searching for on its website.
Experimenting and testing. Businesses need to experiment with different strategies in order to find
the one that yields the best results. For example, A/B testing is a simple strategy to help learn how
an audience responds to different content. The process involves creating two or more versions of
content and then displaying it to different audience segments to reveal which version of the content
performs better.
The two main categories of web analytics are off-site web analytics and on-site web analytics.
The term off-site web analytics refers to the practice of monitoring visitor activity outside of an
organization's website to measure potential audience. Off-site web analytics provides an
industrywide analysis that gives insight into how a business is performing in comparison to
competitors. It refers to the type of analytics that focuses on data collected from across the web,
such as social media, search engines and forums.
On-site web analytics refers to a narrower focus that uses analytics to track the activity of visitors to
a specific site to see how the site is performing. The data gathered is usually more relevant to a site's
owner and can include details on site engagement, such as what content is most popular. Two
technological approaches to on-site web analytics include log file analysis and page tagging.
Log file analysis, also known as log management, is the process of analyzing data gathered from log
files to monitor, troubleshoot and report on the performance of a website. Log files hold records of
virtually every action taken on a network server, such as a web server, email server, database server
or file server.
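As a rough illustration, the Python sketch below parses one line in the widely used Common Log Format and counts successful GET page views; the sample line, file name and regular expression are illustrative, and real log formats vary.

```python
import re
from collections import Counter

# A Common Log Format line: host, identity, user, [timestamp], "method path protocol", status, size
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[(.*?)\] "(\S+) (\S+) \S+" (\d{3}) (\S+)')

sample = '192.168.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

page_views = Counter()
for line in [sample]:                        # in practice: for line in open("access.log")
    match = LOG_LINE.match(line)
    if match:
        host, timestamp, method, path, status, size = match.groups()
        if method == "GET" and status == "200":
            page_views[path] += 1

print(page_views.most_common(5))             # most requested pages
```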
Page tagging is the process of adding snippets of code into a website's HyperText Markup Language
code using a tag management system to track website visitors and their interactions across the
website. These snippets of code are called tags. When businesses add these tags to a website, they
can be used to track any number of metrics, such as the number of pages viewed, the number of
unique visitors and the number of specific products viewed.
Web analytics tools report important statistics on a website, such as where visitors came from, how
long they stayed, how they found the site and their online activity while on the site. In addition to
web analytics, these tools are commonly used for product analytics, social media analytics and
marketing analytics.
Google Analytics. Google Analytics is a web analytics platform that monitors website traffic,
behaviors and conversions. The platform tracks page views, unique visitors, bounce rates, referral
Uniform Resource Locators, average time on-site, page abandonment, new vs. returning visitors and
demographic data.
Optimizely. Optimizely is a customer experience and A/B testing platform that helps businesses test
and optimize their online experiences and marketing efforts, including conversion rate optimization.
Kissmetrics. Kissmetrics is a customer analytics platform that gathers website data and presents it in
an easy-to-read format. The platform also serves as a customer intelligence tool, as it enables
businesses to dive deeper into customer behavior and use this information to enhance their website
and marketing campaigns.
Crazy Egg. Crazy Egg is a tool that tracks where customers click on a page. This information can help
organizations understand how visitors interact with content and why they leave the site. The tool
tracks visitors, heatmaps and user session recordings.
2. Information Retrieval in Libraries: Libraries were among the first institutions to adopt IR systems for information retrieval.
The first generation consisted of the automation of earlier technologies, and search was based on author name and title.
The second generation added searching by subject heading, keywords, etc.
The third generation introduced graphical interfaces, electronic forms, hypertext features, etc.
The Web and Digital Libraries: The Web is cheaper than many other sources of information, it provides greater access thanks to digital communication networks, and it gives free access to publish on a larger medium.
1. WebRTC
WebRTC, short for Web Real-Time Communication, is an open framework for the web that is widely supported in browsers and platforms such as Google Chrome, Mozilla Firefox, Android and iOS.
Using this framework, users can hold video conferences, share files, share their desktop and interact in real time without external web plugins.
WebRTC can be used in the sector of online education & E-meetings.
The use of MOOCs (Massive Open Online Courses) has made WebRTC a very essential
framework.
The use of WebRTC would enable a better online learning experience and would break the
boundaries for education to be transmitted to everyone.
As of now, there are already many platforms that provide online education, which helps
thousands of students.
WebRTC is continuously being improved to deliver a better end-user experience, and many recent developments have made the framework usable on older devices and in offline mode, which allows even more users to benefit.
The concept of cloud-conferencing and E-meetings in the corporate sector is only possible
with the use of WebRTC.
Clients and employees save an ample amount of time by meeting online.
Internet of Things (IoT)
The Internet of Things is considered the backbone of the modern internet, as it is through IoT that consumers, governments and businesses are able to interact with the physical world.
This helps problems to be solved in a better and more engaging way.
The vision of an advanced and closely operated internet system cannot be visualized without
the use of these smart devices.
These smart devices need not necessarily be computing devices but can also be everyday appliances such as fans, fridges, air conditioners, etc.
These devices will be given the potential to create user-specific data that can be optimized
for better user experience and to increase human productivity.
The goal of IoT is to form a network of internet-connected devices, which can interact
internally for better usage.
Many developed countries have already started using IoT, and a common example is the
use of light sensors in public places.
Whenever a vehicle or object passes along the road, the first street light turns on and triggers all the other internally connected lights on that road, creating a smarter, energy-saving model.
Around 35 billion devices are connected to the internet in 2020, and the number of
connections to the internet is expected to go up to 50 billion by 2030.
Thus, IoT is poised to be one of the key emerging web technologies of the coming decades.
Progressive Web Apps
The smartphones we use today are loaded with apps, and users are free to download or remove any app as they like.
But what if we did not have to download or remove an app to use its services?
The idea behind progressive web apps is much similar to this.
Such apps would run full-screen on the smartphone and enable us to use or try any app we like without actually downloading it.
It is a combination of web and app technology to give the user a much smoother
experience.
The advantage of progressive web apps is that users do not have to download the app and update it from time to time, which also saves data.
Also, app companies would not need to release a new build for every updated version.
This would also eliminate the complexity of creating responsive apps, as progressive web apps can be used on any device and give the same experience regardless of screen size.
Further development of progressive web apps may also enable offline use, paving the way for those who are not connected to the internet.
The ease of use and availability will increase thus benefiting the user and making life much
simpler.
A common example is the ‘Try Now’ feature in the Google Play store for specific apps; it uses more or less the same technology as progressive web apps to run an app without actually downloading it.
Social Networking via Virtual Reality
The rise of virtual reality in the last few years is due to its ability to bridge the gap between the real and the virtual.
The same idea is now being considered for social networking: social networking, i.e. interacting with people over long distances, forms the base, and virtual reality is layered on top.
Social networking sites are devising ways so that users are not confined to communicating online, but also have access to the world of virtual reality.
Video calling and conferencing would no longer be a flat visual experience but a complete 360-degree one.
The user will be able to feel much more than just communication and can interact in a much
better way.
This idea of mixing social networking with virtual reality might be a challenging one, but the kind of user experience it could offer would be remarkable.
Facebook, the world's largest social networking company, started developing such a platform back in 2014 and successfully created a virtual environment where users could not only communicate but also perceive their surroundings; however, the platform has not yet been opened to the public.
These web trends will arrive in the coming years, and their availability will once again prove that the internet is not stagnant but is always improving to provide a better user experience.
Improving these technologies will make the internet an even more essential part of our lives than it already is.