1 IRIntro
Information Storage and Retrieval (ISR)
Introduction
What is Information Retrieval?
The process of actively seeking out
information relevant to a topic of interest
(van Rijsbergen)
Typically it refers to the automatic (rather than
manual) retrieval of documents
Information Retrieval System (IRS)
“Document” is the generic term for an information
holder (book, chapter, article, webpage, etc)
Information Retrieval
Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
These days we frequently think first of web search,
but there are many other cases:
E-mail search
Searching your laptop
Corporate knowledge bases
Legal information retrieval
Motivation (why)
The stages of IR: the big picture
[Figure: information creation -> indexing and organizing -> indexed and structured information -> retrieval (searching, browsing)]
What IR assumes?
Collection: A set of documents
Assume it is a static collection for the moment
Goal: Retrieve documents with information that
is relevant to the user’s information need and
helps the user complete a task
Information is stored (or available)
A user has an information need
An automated system exists from which
information can be retrieved
The system works!!
Sub Topics
Overview of IR and IR Systems
Information Retrieval Systems?
Document (Web page)
retrieval in response to a
query
Quite effective (at some
things)
Commercially successful
(some of them)
But what goes on behind
the scenes?
How do they work?
What happens beyond the Web?
Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
Web Search System
[Figure: Web -> spider -> document corpus -> IR system; a query string goes into the IR system, which returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...)]
Information Retrieval - Definition
A good formal definition of information retrieval
is given in Baeza-Yates & Ribeiro-Neto (2011, p. 1)
Storage
Organization
Access
Evaluation
Cont…
Information items: usually text, but possibly
also image, audio, video, etc.
Text items are often referred to as documents
Information Retrieval
Can be structured for ease of discussion as
Text IR
Entities in IRS
Thus the focus is on
Key Issues in IR
Organizing
How to describe information resources or
information-bearing objects in ways so that they
may be effectively used by those who need to use
them
Retrieving
How to find the appropriate information resources
or information-bearing objects for someone’s (or
your own) needs. Build a system that retrieves
documents that users are likely to find relevant to
their queries
This set of assumptions underlies the field of IR
IR is an Iterative Process
[Figure: cyclic diagram of the information life cycle. Labels: Creation (Authoring, Modifying); Organizing, Indexing; Storing, Retrieval; Accessing, Filtering; Distribution, Networking; Searching; Utilization (Using, Creating); Retention/Mining; Disposition (Discard); states Active, Semi-Active, Inactive]
IR-Representation/organizing
Cont…
Basic remarks on the user information need (in the
context of the World Wide Web):
Such a full description of the user information need is not
necessarily a good query to submit to the IR system
The user must first translate the information need into a query
that can be processed by the search engine (or IR
system)
In its most common form, the translation yields a set of
keywords (or index terms) that summarizes the user
information need
Given the user query, the key goal of the IR system is
to retrieve information that is useful or relevant to the
user
Emphasis is on the retrieval of information (not data)
IR is an Iterative Process
A sketch of a searcher moving through many actions towards a
general goal of satisfactory completion of research related to an
information need
[Figure: starting from the goals, the searcher issues successive queries Q0, Q1, Q2, Q3, Q4, Q5 against the repositories]
IR- Storing/Retrieving
Information storage
How and where is information stored?
Retrieving information
How is information recovered from storage?
How to find needed information?
Linked with accessing/filtering stage
IR - Accessing/Filtering
Implementation
More on IR
IR is concerned with retrieval of relevant
documents from a large collection of
documents
Relevant documents are identified
according to specific criteria (usually
called a query)
IR usually deals with Natural Language
text which is not always well structured
and could be semantically ambiguous
Data Vs Information Retrieval
Data Retrieval System
Ex: relational databases
Deals with data that has a well-defined structure
and semantics
Whereas an IR system deals with natural language text, which is
not well structured
A single erroneous object among a thousand
retrieved objects means total failure
Whereas in IR small errors are likely to go unnoticed
Data retrieval does not solve the problem of
retrieving information about a subject or topic
Information Vs Data Retrieval
High level structure of an IRS
An Information Retrieval System serves as a
bridge between the world of authors and the
world of readers/users
That is, writers present a set of ideas in a document
using a set of concepts. Users then query the IR
system for relevant documents that satisfy their
information need.
[Figure: the IR system as a black box between the user and the documents]
Structure of IR System
[Figure: information need, relevance, collection]
Structure of an IR System
[Figure: on the search line, interest profiles and queries enter the Information Storage and Retrieval System; on the storage line, documents and data enter it; ranking produces potentially relevant documents]
Architecture of an IR System
Indexing, retrieval and ranking
The Information Retrieval Process
The IR Process
[Figure: labels are user interface, user need, text, text operations, text database, logical view, ranking, ranked docs]
IR Processes
The user interface – think of it as the user interface
available with current IR systems including
Web search engines
It is necessary to define the text database before any
of the retrieval processes are initiated
This is usually done by the manager of the database
and includes specifying the following
the documents to be used
The operations to be performed on the text
Cont…
Once the logical view of the documents is defined,
an index of the text is built
An index is an optimized data structure that is
built on top of the information objects
allowing faster access for the search process over
large volume of data
The indexer:
tokenizes the text (tokenization), removes words with little
semantic value (stop-words), unifies word families
(stemming)
Different index structures might be used, but the
most popular one is the inverted file
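The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production indexer: the stop-word list, the function names and the three toy documents are assumptions made for the example, and stemming is left out to keep the sketch short.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def index_terms(text):
    """Tokenize on letters/digits, lowercase, and drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each index term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Toy collection (made up for the example).
docs = {
    1: "The retrieval of documents",
    2: "Indexing and retrieval of text",
    3: "Browsing the web",
}
index = build_inverted_index(docs)
print(index["retrieval"])  # [1, 2]
```

The inverted file answers "which documents contain this term?" with a single dictionary lookup instead of a scan over every document.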
Cont…
Once the document database is indexed,
the retrieval process can be initiated
Information retrieval is the process of matching the query
against the indexed information objects
The user first specifies an information need,
which is then parsed and transformed
using the same text operations applied to the text
Query operations might then be applied
before the actual query is generated; this query provides a
system representation of the user information
need
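A small sketch of this step, assuming a toy inverted index and a hypothetical stop-word list: the query goes through the same text operations as the documents, and the resulting terms are matched against the index. Conjunctive (all-terms) matching is only one possible choice and is used here for simplicity.

```python
import re

# Hypothetical stop-word list; it must match the one used at indexing time.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for", "on"}


def text_operations(text):
    """Apply the same operations to queries as to documents."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def match_all_terms(query, index):
    """Return ids of documents that contain every query term."""
    terms = text_operations(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Toy inverted index (term -> document ids), made up for the example.
toy_index = {
    "retrieval": [1, 2],
    "documents": [1],
    "indexing": [2],
    "text": [2],
}

print(text_operations("The Retrieval of TEXT"))             # ['retrieval', 'text']
print(match_all_terms("The Retrieval of TEXT", toy_index))  # [2]
```

If the query were not lowercased and stripped of stop-words in the same way as the documents, its terms would not line up with the keys of the index.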
IR Processes
The IR system responds by matching information
objects, which are relevant to a query
Information retrieval focuses on finding relevant
information rather than simple pattern matching
Relevance
is a subjective notion
depends on the task being solved and its context
can change with time (e.g. new information becomes available)
can change with location (e.g. the most important answer is
the closest one)
can change with the device (e.g. the best answer is a short
doc that is easier to download and visualize)
IR Processes
A retrieval strategy (model) is an algorithm
and related structures that takes a query and
a set of documents and assigns a similarity
measure between the query and each
document
similarity represents relevance to the user query
Documents are ranked on the basis of their
similarity to the query
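As one possible instance of such a retrieval strategy, the sketch below scores each document by the cosine similarity between raw term-count vectors and ranks by that score. The choice of cosine similarity, the function names and the toy documents are assumptions for illustration; the slides do not prescribe a particular model here.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Bag-of-words representation: term -> number of occurrences."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (doc_id, score) pairs, most similar document first."""
    q = term_counts(query)
    scored = [(cosine_similarity(q, term_counts(text)), doc_id)
              for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored]


# Toy collection (made up for the example).
docs = {
    "d1": "information retrieval systems",
    "d2": "retrieval of legal information and legal documents",
    "d3": "cooking recipes",
}
ranking = rank("information retrieval", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Here d1 ranks above d2 because the two query terms make up a larger share of d1, and d3 scores zero because it shares no term with the query.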
Cont…
Before the retrieved documents are sent to the user,
the retrieved documents are ranked according to the
likelihood of relevance
The user then examines the set of ranked
documents in the search for useful information
This process can be repeated and the query
can be modified
The user may need to reformulate the query
IR Processes
[Figure: information retrieval pipeline with a text analysis stage (tokenization, normalization, stop word removal, stemming)]
Types of IR Systems
Based on the nature of the information items:
Text based IR
Multimedia IR system
Audio retrieval system
Video retrieval system
Image retrieval
Multimodal IR system – can handle all types
Types of IR Systems
Domain/Fields of the documents:
General purpose IR
Specialized IR (Medical, Legal, Agricultural, etc.)
Based on languages supported:
Mono-lingual IR
Multilingual IR
Cross-lingual IR
Types of IR Systems
Based on the nature of the output
Document retrieval system
Question-Answering system
Automatic summarization systems
Intelligent IR
Recommender system
Information Extraction
Etc.
Factors Affecting Effective Retrieval
The User Task
[Figure: the user interacts with the database by searching or by browsing/surfing]
Users
The user: anyone who needs to find some
information
The user groups
group by their knowledge of the system
novice users vs. experienced users
end users vs. information specialists
group by their domain knowledge
Domain experts vs. general public
group by information needs
need to locate a particular item
need some information
need all information on a subject
User’s Information Needs
At all levels of our life we need information (e.g.
crossing the road, health, nutrition, travel,…)
Information need is the desire to know, the desire
to fill a gap of knowledge
Example- problem: one wants to cross a road in a
high traffic area: What is the information he needs?
He needs information
About the direction people drive (left or right)
About the meanings of the traffic light (green, yellow, and
red)
Sign posts, etc
Logical view of documents
Document Processing Steps
Cont..
Documents in a collection are frequently
represented through a set of index terms or
keywords
An index term is a key word (or group of related
words) which has some meaning of its own (which
usually has the semantics of a noun)
In its more general form, an index term is simply
any word which appears in the text of a document
collection
Cont...
Standard steps
Recognizing document structures (titles,
sections, paragraphs, etc.)
Break into tokens
Usually space and punctuation delimited
Special issues with some languages (Chinese,
compound words, etc.)
The elimination of stopwords (such as
articles and connectives)
Cont…
Conflation: The use of stemming/
morphological analysis
Purpose
Overcome the variants of word forms by reducing all
words with the same root to a common form (i.e., reducing distinct
words to their common grammatical root)
Most IR systems perform stemming on both text
and query
The identification of noun groups (which
eliminates adjectives, adverbs, and verbs)
Further operations can also be performed
Store in inverted index (to be discussed in later
chapters)
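The standard steps (tokenizing, eliminating stopwords, conflating word forms) can be sketched as a small pipeline. The stop-word list and the suffix-stripping rule below are deliberately crude assumptions made for illustration; a real system would use an established stemmer rather than this toy rule.

```python
import re

# Hypothetical stop-word list (articles and connectives only).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "are"}

# Toy suffix list, checked in this order; NOT a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(text):
    """Tokenize, remove stop-words, then conflate word variants."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in kept]


print(process("Users searched and are searching the indexed documents"))
# ['user', 'search', 'search', 'index', 'document']
```

After conflation, "searched" and "searching" map to the same index term, so a query containing either form can match documents containing the other.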
Cont…
Such text operations reduce the complexity
of the document representation and allow
moving the logical view from that of a full
text to that of index terms
Index - A list of important key words from
the documents
Cont...
The full text is the most complete logical
view of a document, but its usage usually
implies higher computational costs
A small set of categories/ index terms
(generated automatically or by a human
specialist) provides the most concise
logical view of a document, but its usage
might negatively affect the retrieval quality
Several intermediate logical views (of a
document) might be adopted by an
information retrieval system as shown in
the figure
Cont...
The index terms obtained are a description of
a document content and of its structure
Models may allow reference to the text
document
The models might also allow references to
the structure normally present in written text
(in this case we say a structured model)
Retrieval based on index terms or keywords
might be of fairly low quality
Cont…
Two major reasons for this
The user query might be composed of too few terms
which usually implies the query context is poorly
characterized
This problem is dealt with through transformations in the
query such as query expansion and user relevance
feedback
The set of keywords generated for a given document
might fail to summarize its semantic content properly
This problem is dealt with through transformations in the
text such as
Identification of noun groups to be used as keywords
Stemming
The use of a thesaurus
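One simple way to enrich a short query is to expand it with related terms from a thesaurus. The sketch below assumes a tiny hand-made thesaurus; the entries and the function name are hypothetical, and real systems may weight or filter the added terms.

```python
# Hypothetical hand-made thesaurus: term -> related terms.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}


def expand_query(terms, thesaurus):
    """Add thesaurus entries after each original term, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "rental"], THESAURUS))
# ['car', 'automobile', 'vehicle', 'rental']
```

A two-term query becomes a four-term query, so documents that say "automobile" instead of "car" can now be matched.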
Cont…
Examples of such properties
A word which appears in each of the one hundred
thousand documents is completely useless as an
index term because it does not tell us anything
about which documents the user might be
interested in
A word which appears in just five documents is
quite useful because it narrows down considerably
the space of documents which might be of interest
to the user
Thus, distinct index terms have varying relevance
when used to describe document contents
This effect is captured through the assignment of
numerical weights to each index term of a
document (term weighting)
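The two examples above can be made concrete with inverse document frequency (IDF), one common weighting scheme. The slides do not name a specific formula, so using IDF with a natural logarithm is an assumption of this sketch.

```python
import math


def idf(total_docs, docs_with_term):
    """Inverse document frequency: log(N / n_t), natural logarithm."""
    return math.log(total_docs / docs_with_term)


N = 100_000  # size of the collection in the example above

print(idf(N, N))            # 0.0 -> a word in every document carries no weight
print(round(idf(N, 5), 2))  # 9.9 -> a word in only five documents weighs heavily
```

The weight falls to zero for a term that appears everywhere and grows as the term becomes rarer, which matches the intuition in the two examples.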
Challenges in IR
Why is IR a Difficult Problem?
Cont…
Unstructured data: difficult to capture
semantics in documents. Compare:
“select * from Employee where Salary >
100,000”
“retrieve all news items about corporate
takeover”
Why is the second query more difficult to
answer? The following query is even more
difficult:
“retrieve all news items about corporate
takeover involving an internet company”
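The contrast can be illustrated with a short sketch. The table rows, the news headlines and the helper function are invented for the example. The structured query has an exact answer, while a naive keyword match on text misses a headline that is about a takeover but uses different words.

```python
import sqlite3

# Data retrieval: well-defined structure and semantics, exact answer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Abebe", 120000), ("Sara", 90000), ("Lina", 150000)],  # made-up rows
)
high_paid = [row[0] for row in conn.execute(
    "SELECT Name FROM Employee WHERE Salary > 100000 ORDER BY Name")]
conn.close()
print(high_paid)  # ['Abebe', 'Lina']

# Information retrieval: free text, where word matching only approximates meaning.
news = [  # made-up headlines
    "Tech giant completes corporate takeover of rival",
    "Startup acquired by larger firm in buyout deal",
    "Local team wins championship",
]


def naive_keyword_search(query, items):
    """Return the items that contain every query word as a whole word."""
    words = query.lower().split()
    return [item for item in items
            if all(w in item.lower().split() for w in words)]


hits = naive_keyword_search("corporate takeover", news)
print(hits)  # only the first headline; the second is relevant but missed
```

The second headline describes a takeover without using either query word, which is the kind of semantic gap that makes the text query harder to answer.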
Cont…
Documents have unrestricted domains
it is hard to predefine or pre-categorize
the subject domains of documents
a particular subject is related to several major
topics including linguistics, psychology,
Cybernetics, Communications, Information
System design, Engineering & Technology,
Networking, Computer Science, Mathematics,
Economics, Management Science,
education …
Cont…
Diversified user base: expert to casual users
The users of information retrieval systems include
Research scientists (that seek articles related to
particular experiments)
Engineers (who try to determine whether a patent
covering some new idea has previously been
obtained)
Attorneys (who search for legal precedents)
Buyers in general (who try to obtain new product
information)
Cont…
Information retrieval users
Have a wide variety of different information needs
(Interest), Exhibit many different backgrounds
May be led by many different reasons to use the retrieval
facilities
As a result, they require a variety of services and end
products
In other words, a system may be clumsy for an expert
user yet difficult to use for a casual user
A system may return information too general to be
useful for an expert in the subject but too narrow for a
general user
Other Central Concepts in IR
Documents
Queries
Collections
Evaluations
Relevance
Why is IR Important?
Historical overview
Organization and storage of knowledge for
ease of access is centuries old
That is, the history of recording knowledge
goes as far as thousands of years.
Important events
Development of writing, Books and printing
technology, News publishing, Journal publishing
(economic reasons- books are not economical in
terms of money and time), Libraries (to put
publications in one centre)
Cont…
Simple methods to facilitate access to
a single document:
Table of contents,
Keyword index
Classical methods to facilitate access to
collections of documents
Index (keywords, authors)
Hierarchical (Dewey Decimal Classification)
Creation of Disciplines
The mechanized era (Sparck Jones and Willett, 1997)
IR systems were mainly used by librarians
for carrying out bibliographic searches in place of
manual tools such as card catalogue and universal
classification systems
The advent of word processing technology (software
+ hardware)
a rapid, widespread growth in the usage of IR
Increased interest in Web-based distributed information
processing and in the application of IR techniques to
non-textual information
The growth of knowledge led to the creation of disciplines
Cont…
Discipline oriented era
e.g. Science from philosophy
physics from science
electricity from physics
electronics from electricity
Similarly, information retrieval from the wider
discipline of information science
Then came the Problem Oriented Era
Disciplines are merged to form a new subject.
E.g. Molecular Biology from physics and Biology
(Fosket, 1988)
Cont…
Such growth of knowledge gave birth to the
creation of disciplines (domain knowledge) which
then brought about the need for classification and
indexing
Putting related knowledge together
E.g. Science, Arts, and Humanities
Creation of subclasses within classes
Designing ways and means of accessing
information (which is the area of IR)