IR Unit-1 - Updated
Mr. S. G. Shaikh
Asst. Professor
Dept. Of Computer Engg,
AIKTC, New Panvel
[email protected]
Cell. +91 9960726716
Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Subject: Information Retrieval CSDC7023
Prerequisite: Data structures and algorithms
Course Objectives:
The course aims to help students:
1 To learn the fundamentals of Information Retrieval
2 To analyze various Information retrieval modeling techniques
3 To understand query processing and its applications
4 To explore the various indexing and scoring techniques
5 To assess the various evaluation methods
6 To analyze various information retrieval techniques for real-world applications.
Course Outcomes:
The learner will be able to:
1 Describe and analyze the concepts and challenges of information retrieval systems.
2 Design the various modeling techniques for information retrieval systems.
3 Implement the query structure and various query operations.
4 Analyze the indexing and scoring operations in information retrieval systems.
5 Perform the evaluation of information retrieval systems.
6 Analyze various information retrieval techniques for real-world applications.
Unit No-01: Introduction to Information Retrieval
Syllabus:-
Introduction to Information Retrieval, Basic Concepts, Information Versus Data,
Trends and research issues in information retrieval. The retrieval process,
Information retrieval in the library, web and digital libraries.
Introduction to Information Retrieval
Since information retrieval and database systems each handle different
kinds of data, some database system problems are usually not present in
information retrieval systems, such as concurrency control, recovery,
transaction management, and update.
There are some common information retrieval problems that are usually not
encountered in traditional database systems, such as unstructured
documents, approximate search based on keywords, and the notion of
relevance.
Because of the abundance of text data, information retrieval has found
many applications.
There exist several information retrieval systems, including online library
catalog systems, online records management systems, and the more
recently developed Web search engines.
A typical information retrieval problem is to locate relevant documents in a
document collection based on a user’s query, which is often a set of keywords
describing an information need, although it can also be an example of a relevant
document.
This is most suitable when a user has some ad hoc (i.e., short-term) information
need, such as finding information to buy a used car. When a user has a long-term
information need (e.g., a researcher’s interests), a retrieval system can also take
the initiative to “push” any newly arrived information items to the user if an
item is judged relevant to the user’s information need.
There are two basic measures for assessing the quality of text retrieval, which
are as follows −
Precision − This is the percentage of retrieved documents that are actually relevant
to the query (i.e., “correct” responses). It is formally defined as
$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
Recall − This is the percentage of documents relevant to the query that
were actually retrieved. It is formally defined as
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
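The two measures can be computed directly from sets of document IDs. The following is a minimal sketch; the function name precision_recall and the document IDs are made up for illustration and are not part of the course material.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # {Relevant} ∩ {Retrieved}
    # Guard against empty sets so the division is always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents exist, the system returns 5.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d5", "d6", "d7"}
p, r = precision_recall(relevant_docs, retrieved_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.40 recall=0.50
```

Here only 2 of the 5 retrieved documents are relevant (precision 0.40), and only 2 of the 4 relevant documents were found (recall 0.50).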
An information retrieval system searches a collection of natural language
documents with the goal of retrieving exactly the set of documents that
match a user’s question.
Such systems have their origin in library systems.
They assist users in finding the information they require but do
not attempt to deduce or generate answers.
An IR system informs the user of the existence and location of documents that
might contain the required information.
The documents that satisfy the user’s requirement are called relevant
documents. A perfect IR system would retrieve only relevant
documents.
Basics of IR Systems
A user who needs information has to formulate a request in the form of a
query in natural language. The IR system then responds by retrieving the
relevant documents that contain the required information.
Indexing
It is the process of selecting terms to represent a text.
Indexing involves:
• Tokenization of the string
• Removing frequent words (stop words)
• Stemming
Two common models that work on the resulting index terms:
• Boolean model
• Vector space model
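The three indexing steps (tokenization, removing frequent words, stemming) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the stop-word list is deliberately tiny, and naive_stem is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Assumed, deliberately tiny stop-word list; real systems use longer ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(index_terms("The users are searching the indexed documents"))
# ['user', 'search', 'index', 'document']
```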
Matching
1. TF: It stands for Term Frequency which is simply the number of times a given
term appears in that document.
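Term frequency as defined above is a plain count per document. A minimal sketch using Python's standard collections.Counter follows; the sample sentence is made up for illustration.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term to the number of times it occurs in one document."""
    return Counter(tokens)


doc_tokens = "to be or not to be".split()
tf = term_frequencies(doc_tokens)
print(tf["to"], tf["be"], tf["question"])  # 2 2 0
```

A term that never occurs in the document simply has a frequency of 0.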
The effective retrieval of relevant information is directly affected both by
the user task and by the logical view of the documents adopted by the
retrieval system.
The user of a retrieval system has to translate his information need into a
query in the language provided by the system.
With an information retrieval system, this normally implies specifying a set
of words which convey the semantics of the information need.
With a data retrieval system, a query expression (such as, for instance,
a regular expression) is used to convey the constraints that must be
satisfied by objects in the answer set.
In both cases, we say that the user searches for useful information
executing a retrieval task.
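The contrast between the two kinds of query can be illustrated with a small sketch. The records, documents, and the word-overlap scoring below are assumptions chosen for illustration; they are not a prescribed retrieval method.

```python
import re

# Hypothetical structured records and free-text documents.
records = ["car-2015-red", "car-2019-blue", "bike-2019-red"]
docs = {
    "d1": "used car prices and buying tips",
    "d2": "formula 1 car racing results",
    "d3": "bike maintenance guide",
}

# Data retrieval: a record either satisfies the query expression or it does not.
pattern = re.compile(r"^car-\d{4}-red$")
exact = [rec for rec in records if pattern.match(rec)]

# Information retrieval: the query is a set of words, and documents are
# ordered by how many of those words they contain (a simple assumed score).
query = {"used", "car", "buying"}
overlap = {d: len(query & set(text.split())) for d, text in docs.items()}
ranked = sorted(overlap, key=overlap.get, reverse=True)

print(exact)   # ['car-2015-red']
print(ranked)  # ['d1', 'd2', 'd3']
```

The regular expression yields an exact answer set, while the keyword query yields an ordering in which partially matching documents (here d2) still appear.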
The User Task
Consider now a user who has an interest which is either poorly defined or
which is inherently broad.
For instance, the user might be interested in documents about car racing in
general. In this situation, the user might use an interactive interface to simply look
around in the collection for documents related to car racing.
For instance, he might find interesting documents about Formula 1 racing, about
car manufacturers, or about the '24 Hours of Le Mans'. Furthermore, while reading
about the '24 Hours of Le Mans', he might turn his attention to a document which
provides directions to Le Mans and, from there, to documents which cover tourism
in France. In this situation, we say that the user is browsing the documents in the
collection, not searching.
It is still a process of retrieving information, but one whose main objectives are not
clearly defined in the beginning and whose purpose might change during the
interaction with the system.
Figure: Interaction of the user with the retrieval system through distinct tasks.
Logical View of the Documents
Due to historical reasons, documents in a collection are frequently represented
through a set of index terms or keywords. Such keywords might be extracted directly from
the text of the document or might be specified by a human subject (as frequently done in the
information sciences arena).
No matter whether these representative keywords are derived automatically or generated by a
specialist, they provide a logical view of the document.
Modern computers are making it possible to represent a document by its full set of words. In
this case, we say that the retrieval system adopts a full text logical view (or representation) of
the documents.
With very large collections, however, even modern computers might have to reduce the set of
representative keywords.
This can be accomplished through the elimination of stopwords (such as articles and
connectives), the use of stemming (which reduces distinct words to their common
grammatical root), and the identification of noun groups (which eliminates adjectives,
adverbs, and verbs).
Figure: Logical view of a document: from full text to a set of index terms.
Difference between Information Retrieval and Data Retrieval
Issues in Information Retrieval
Indexing is the most vital part of any Information
Retrieval System.
It is a process in which the documents required by
the users are transformed into searchable data
structures.
Indexing can also be described as a process of
extraction rather than analysis of particular content.
It is a core function of the IR process since
it is the first step in IR and enables efficient
information retrieval.
In the process, first, document surrogates are
created to represent each document.
Second, it requires analysis of the original documents,
both simple (identifying meta-information,
e.g., author, title, subject) and complex (linguistic
analysis of content).
Indexes are the data structures that are used to
make the search faster.
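One widely used index structure is the inverted index, which maps each term to the documents that contain it, so a lookup does not have to scan every document. The slides do not name a specific structure, so the sketch below is an assumption chosen for illustration, with made-up documents.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "web information",
}
index = build_inverted_index(docs)
print(index["systems"])      # [1, 2]
print(index["information"])  # [1, 3]
```

Answering a one-word query then becomes a single dictionary lookup instead of a pass over the whole collection.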
1. Document and Query Indexing –
2. Query Evaluation –
In the retrieval model, how can a document be represented
with the selected keywords, and how are document and
query representations compared to calculate a score?
Information Retrieval (IR) deals with issues like uncertainty
and vagueness in information systems.
Uncertainty :
The available representation does not typically reflect true
semantics of objects such as images, videos etc.
Vagueness :
The information that the user requires lacks clarity, is only
vaguely expressed in a query, feedback or user action.
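One common way to compare a document representation with a query representation and obtain a score is cosine similarity over term-count vectors. The sketch below assumes this measure and uses made-up documents; it is one possible scoring scheme, not the only one.

```python
import math
from collections import Counter


def cosine_score(query_tokens, doc_tokens):
    """Cosine similarity between raw term-count vectors of a query and a document."""
    q, d = Counter(query_tokens), Counter(doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


query = "car racing".split()
docs = {
    "d1": "formula car racing news".split(),
    "d2": "car manufacturers".split(),
    "d3": "tourism in france".split(),
}
scores = {name: cosine_score(query, tokens) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```

The score is graded rather than yes/no, which is how an IR system copes with a vaguely expressed information need: d2 matches only one query word and is ranked lower instead of being excluded.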
Trends in Information Retrieval
In this section we review a few concepts that are being considered in more recent research
work in information retrieval.
1. Faceted Search
Faceted search is a technique that provides an integrated search and navigation experience
by allowing users to explore by filtering the available information. This search technique is
often used in e-commerce websites and applications, enabling users to navigate a
multidimensional information space. Facets are generally used for handling three or more
dimensions of classification. This allows the faceted classification scheme to classify an
object in various ways based on different taxonomical criteria. For example, a Web page
may be classified in various ways: by content (airlines, music, news, ...); by use (sales,
information, registration, ...); by location; by language used (HTML, XML, ...) and in
other ways or facets. Hence, the object can be classified in multiple ways based on
multiple taxonomies.
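Filtering along several facets can be sketched with plain dictionaries. The page catalog, the facet names, and the helper functions facet_filter and facet_counts below are hypothetical and only mirror the Web page example above.

```python
# Hypothetical catalog: each item is described along several independent facets.
PAGES = [
    {"title": "Fare finder", "content": "airlines", "use": "sales", "language": "HTML"},
    {"title": "Concert feed", "content": "music", "use": "information", "language": "XML"},
    {"title": "Ticket shop", "content": "music", "use": "sales", "language": "HTML"},
]


def facet_filter(items, **selected):
    """Keep the items whose facet values match every selected facet."""
    return [it for it in items if all(it.get(f) == v for f, v in selected.items())]


def facet_counts(items, facet):
    """Count how many items fall under each value of one facet."""
    counts = {}
    for it in items:
        counts[it[facet]] = counts.get(it[facet], 0) + 1
    return counts


sales_pages = facet_filter(PAGES, use="sales")
print([p["title"] for p in sales_pages])     # ['Fare finder', 'Ticket shop']
print(facet_counts(sales_pages, "content"))  # {'airlines': 1, 'music': 1}
```

The counts per facet value are what an e-commerce interface typically shows next to each filter, so the user can see how a further selection would narrow the result set.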
2. Social Search
The traditional view of Web navigation and browsing assumes that a single
user is searching for information. This view contrasts with previous research
by library scientists who studied users’ information seeking habits. This research
demonstrated that additional individuals may be valuable information resources
during information search by a single user. More recently, research indicates
that there is often direct user cooperation during Web-based information
search. Some studies report that significant segments of the user population
are engaged in explicit collaboration on joint search tasks on the Web.
Active collaboration by multiple parties also occurs in certain cases (for example,
enterprise settings); at other times, and perhaps for a majority of searches, users
often interact with others remotely, asynchronously, and even involuntarily and
implicitly.
3. Conversational Search
Conversational Search (CS) is an interactive and collaborative information finding
interaction. The participants engage in a conversation and perform a social search
activity that is aided by intelligent agents. The collaborative search activity helps the
agent learn about conversations with interactions and feedback from participants. It uses the
semantic retrieval model with natural language understanding to provide the users with
faster and more relevant search results. It moves search from being a solitary activity to being a
more participatory activity for the user. The search agent performs multiple tasks of
finding relevant information and connecting the users together; participants provide
feedback to the agent during the conversations that allows the agent to perform better.
New Trends in IR
Artificial Intelligence
AI focuses on finding a logical, mathematical way to
represent knowledge.
The computer can be programmed with this mathematical
model to assist in decision making, information retrieval,
and analysis.
Then, when a query is asked, the computer follows the rules
for a response.
AI has many facets, including robotics, expert systems, and
voice recognition and simulation. Search engines incorporate
some of the fascinating trends in AI.
Probabilistic Logic
Will it rain today? What is the probability that my car needs an oil change?
Or, what is the chance of getting an A on my history test?
There are many questions like these that cannot be answered with a
simple yes or no. Uncertainty reigns.
In an effort to make decisions that account for such doubt, in the
midst of chaos, a branch of logic was defined to study probability.
Since the 16th and 17th centuries, probability theory has been used to
explain chance. Such questions rely on factual information, such as history,
coupled with probability.
In information retrieval, the same applies. By setting up a formula, an
algorithm, that places values on words, their interrelationships, proximity,
and their frequency, the computer can be used to help locate relevant
sites. By computing these terms together, the search engine can produce a
relevancy ranking that is then displayed to the user.
Query by example
Query-by-example (QBE) is the concept of providing the search engine
with an example for which to search. Using this example, the system returns other
similar documents.
For example, I want a book about gorillas, published in 1984, that has a
green cover.
I have set up an example of what I am looking for using all my
qualifications. Search engines use the technique to set up queries to find
similar pages or files.
The search is reinitiated using the example as the new source for the
query. This interactive searching gives the user more control over the
search process.
Users can find more documents like the one selected. The results returned
are then more focused because of the qualified terms.
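Finding "more documents like the one selected" can be sketched by using the chosen document's terms as the new query and ranking the rest by term overlap. The Jaccard measure, the function more_like_this, and the sample documents are assumptions for illustration; real engines may use other similarity measures.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def more_like_this(example_id, docs, top_k=2):
    """Rank the other documents by term overlap with the chosen example."""
    example_terms = docs[example_id].split()
    others = [d for d in docs if d != example_id]
    others.sort(key=lambda d: jaccard(example_terms, docs[d].split()), reverse=True)
    return others[:top_k]


docs = {
    "d1": "gorillas in the mountain forest",
    "d2": "mountain gorillas and their forest habitat",
    "d3": "green book covers in print design",
    "d4": "stock market report",
}
print(more_like_this("d1", docs))  # ['d2', 'd3']
```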
Query Expansion
A library patron who comes to the desk asks one question, but usually
there is some other additional information need.
Newer search engines provide the user with more control over the query,
by adding a means to resubmit the search with any changes.
Natural Language Processing
Natural Language Processing is the act and science of getting computers to
understand natural language.
It is a part of artificial intelligence. (Case.) Computers process language
not only by exact match using keywords.
NLP involves using a set of concepts to sort out the interrelationships of
words. The computer breaks the sentence apart into its grammatical parts:
nouns, verbs, adjectives, etc., and then it creates links.
Since language can be ambiguous, vague, or metaphorical, NLP seeks to
compute the relationships between words, giving each a correlate to the
words around it.
Put into a formula, the computer then makes assumptions based on its
logic. Although similar to a keyword search, the search engine allows a
user to make the query as if asking a librarian.
• Concept-based searching
Using the idea of a thesaurus, a search engine can expand upon the
keyword that a user may input.
In this manner, users do not have to know the exact words to use to
retrieve relevant documents.
And, instead of reinstituting the search based on "confidence" or
"weighting," the search engine automatically includes the like terms.
• Search Engines
A survey of the Search Engines available from Netscape's Net Search will
help in explaining some of the techniques discussed.
By conducting a search for current trends in information retrieval,
differences can be seen in the structure and techniques of each engine.
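The thesaurus idea described under concept-based searching can be sketched as automatic query expansion. The thesaurus entries, function names, and documents below are hypothetical; a real system would rely on a much larger vocabulary.

```python
# Hypothetical hand-made thesaurus; a real system would load a much larger one.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "film": {"movie"},
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms plus any like terms listed in the thesaurus."""
    expanded = set(terms)
    for term in terms:
        expanded |= thesaurus.get(term, set())
    return expanded


def search(terms, docs):
    """Return IDs of documents sharing at least one term with the expanded query."""
    wanted = expand_query(terms)
    return sorted(d for d, text in docs.items() if wanted & set(text.split()))


docs = {"d1": "used automobile listings", "d2": "movie reviews", "d3": "train timetable"}
print(search(["car"], docs))  # ['d1']
```

The document d1 never contains the word "car", yet it is retrieved because the like term "automobile" was added to the query automatically.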
The Process of Information Retrieval
This is the key difference between database searching and information
retrieval.
The query is then sent to the core of the system, which has access to
the content management module, which is directly linked to the back end,
i.e. the large collection of data objects.
Once the results R are generated by the core system, they are returned to the user
through a graphical user interface.
The process repeats and the results are refined until the user finds what he
is actually looking for.
1. Document Parsing
Documents come from different sources and in different combinations of
languages, formats, and character sets; a single document may even contain
more than one language, e.g. a Spanish email with some parts in
French.
Document parsing thus deals with the overall document structure. In this phase, the
document is broken down into discrete components. In the preprocessing phase,
unit documents are created, for example one document representing an email and
another representing an additional specific part.
2. Lexical Analysis
In lexical analysis, tokenization is the process of breaking a stream of text into words,
phrases, symbols, or other meaningful elements called tokens. These
tokens are then passed on to part-of-speech tagging.
Typically, tokenization occurs at the word level.
Information Retrieval in the Library
Libraries were among the first institutions to adopt IR systems for retrieving
information.
Usually, systems to be used in libraries were initially developed by academic
institutions and later by commercial vendors.
In the first generation, such systems consisted basically of an automation of
previous technologies (such as card catalogs) and basically allowed searches
based on author name and title.
In the second generation, increased search functionality was added which
allowed searching by subject headings, by keywords, and some more complex
query facilities.
In the third generation, which is currently being deployed, the focus is on
improved graphical interfaces, electronic forms, hypertext features, and open
system architectures.
The Web and Digital Libraries
If we consider the search engines on the Web today, we conclude that they continue to use
indexes which are very similar to those used by librarians a century ago. What has changed
then? Three dramatic and fundamental changes have occurred due to the advances in
modern computer technology and the boom of the Web.
First, it became a lot cheaper to have access to various sources of information. This
allows reaching a wider audience than ever possible before.
Second, the advances in all kinds of digital communication provided greater access
to networks. This implies that the information source is available even if distantly
located and that the access can be done quickly (frequently, in a few seconds).
Third, the freedom to post whatever information someone judges useful has greatly
contributed to the popularity of the Web. For the first time in history, many people
have free access to a large publishing medium.
Fundamentally, low cost, greater access, and publishing freedom have allowed
people to use the Web (and modern digital libraries) as a highly interactive medium.
Such interactivity allows people to exchange software and videos, and to 'chat' in a
convenient and low-cost fashion. Further, people
can do it at the time of their preference (for instance, you can buy a book late at
night) which further improves the convenience of the service. Thus, high interactivity
In the future, three main questions need to be addressed. First, despite the
high interactivity, people still find it difficult (if not impossible) to retrieve
information relevant to their needs. Thus, in the setting of
the Web and of large digital libraries, which techniques will allow retrieval of higher
quality? Second, with the ever increasing demand for access, quick response
is becoming more and more a pressing factor. Thus, which techniques will yield
faster indexes and smaller query response times? Third, the quality of the
retrieval task is greatly affected by the user interaction with the system. Thus,
how will a better understanding of the user behavior affect the design and
End of Unit-1