ch1 - Information Retrieval Systems

The document provides an overview of Information Retrieval (IR) systems, defining key concepts such as information, retrieval, and the structure of IR systems. It discusses the challenges of matching user queries with relevant documents, the importance of indexing, and the processes involved in retrieving information from large collections. Additionally, it highlights the distinction between information retrieval and data retrieval, emphasizing the need for effective representation and organization of information to meet user needs.

Uploaded by

misrak dagne
1

Information Retrieval Systems


 Information
 What is “information”?
 Retrieval
 What do we mean by “retrieval”?
 What are the different types of information needs?
 Systems
 How do computer systems fit into the human
information seeking process?

2
What is Information?
 What do you think?
 There is no “correct” definition
 Cookie Monster’s definition:
 “news or facts about something”
 Different approaches:
 Philosophy
 Psychology
 Linguistics
 Electrical engineering
 Physics
 Computer science
 Information science
3
Dictionary says…
 Oxford English Dictionary
 information: informing, telling; thing told, knowledge,
items of knowledge, news
 knowledge: knowing familiarity gained by experience;
person’s range of information; a theoretical or practical
understanding of; the sum of what is known
 Random House Dictionary
 information: knowledge communicated or received
concerning a particular fact or circumstance; news

4
Intuitive Notions
 Information must
 Be something, although the exact nature (substance,
energy, or abstract concept) is not clear;
 Be “new”: repetition of previously received messages is
not informative
 Be “true”: false or counterfactual information is “mis-
information”
 Be “about” something

Robert M. Losee. (1997). A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269.
5
Information Hierarchy
More refined and abstract

Wisdom

Knowledge

Information

Data

6
Information Hierarchy
 Data
 The raw material of information
 Information
 Data organized and presented in a particular manner
 Knowledge
 “Justified true belief”
 Information that can be acted upon
 Wisdom
 Distilled and integrated knowledge
 Demonstrative of high-level “understanding”
7
“Retrieval?”
 “Fetch something” that’s been stored
 Recover a stored state of knowledge
 Search through stored messages to find some
messages relevant to the task at hand

8
What types of information?
 Text (Documents and portions thereof)
 XML and structured documents
 Images
 Audio (sound effects, songs, etc.)
 Video
 Source code
 Applications/Web services

9
Information Retrieval Systems?
 Document (Web page) retrieval in response to a query
 Quite effective (at some things)
 Commercially successful (some of them)
 Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, HotBot, Baidu, …
 But what goes on behind the scenes?
 How do they work?
 What happens beyond the Web?
10
Examples of IR systems
 Conventional (library catalog): Search by keyword, title,
author, etc.
 Text-based (Lexis-Nexis, Google, FAST): Search by
keywords. Limited search using queries in natural language.
 Multimedia (IBM’s QBIC, WebSEEk, SaFe): Search by
visual appearance (shapes, colors, …).
 Question answering systems (AskJeeves,
Answerbus): Search in (restricted) natural language
 Other:
 Cross language information retrieval,
 Music retrieval
11
WebSEEk Search Engine

12
Information Retrieval
 Information retrieval (IR) is the process of finding
material (usually documents) of an unstructured
nature (usually text) that satisfies an information
need from within large collections (usually stored on
computers).
 Information is organized into (a large number of)
documents
 Large collections of documents from various sources: news
articles, research papers, books, digital libraries, Web pages,
etc.
 Example: Web search engines such as Google claim to index
trillions of pages

13
General Goal of Information Retrieval
 To help users find useful information based on
their information needs (with a minimum effort)
despite
 Increasing complexity of Information
 Changing needs of user
 Provide immediate random access to the document
collection.
 Retrieval systems such as Google and Yahoo are
developed with this aim.

14
Info Retrieval vs. Data Retrieval
 Emphasis of IR is on the retrieval of information, rather than on
the retrieval of data
 Data retrieval
 Consists mainly of determining which documents contain a set
of keywords in the user query
 Aims at retrieving all objects that satisfy well defined semantics
 a single erroneous object among a thousand retrieved objects
implies failure
 Mainly designed for structured databases
 Information retrieval
 Is concerned with retrieving information about a subject or
topic rather than retrieving data that satisfies a given query
 semantics is frequently loose: the retrieved objects might be
inaccurate
 small errors are tolerated
15
Info Retrieval vs. Data Retrieval
 Example of data retrieval system is a relational database

                      Data Retrieval                          Info Retrieval
Data organization     Structured                              Unstructured
Fields                Clear semantics (ID, Name, age, …)      No fields (other than text, images, etc.)
Query language        Artificial (defined, SQL)               Free text (“natural language”), Boolean
Matching              Exact (results are always “correct”)    Partial match, best match
Query specification   Complete                                Incomplete
Items wanted          Matching                                Relevant
Accuracy              100%                                    < 50%
Error response        Sensitive                               Insensitive
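The contrast between exact matching and best matching can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the documents, and the function names `data_retrieval` and `info_retrieval` are hypothetical, and word overlap is used as a stand-in for a real relevance score.

```python
# Hypothetical toy data, used only for illustration.
RECORDS = [
    {"id": 1, "name": "Abebe", "age": 30},
    {"id": 2, "name": "Sara", "age": 25},
]
DOCS = {
    "d1": "car racing results and car racing news",
    "d2": "car manufacturers in Addis",
    "d3": "tourism in Ethiopia",
}


def data_retrieval(records, field, value):
    """Exact match: a record either satisfies the condition or it does not."""
    return [r["id"] for r in records if r[field] == value]


def info_retrieval(docs, query):
    """Best match: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

With this data, `data_retrieval(RECORDS, "age", 25)` returns only the record that satisfies the condition, while `info_retrieval(DOCS, "car racing")` returns a ranked list in which a partially matching document still appears, just lower down.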
16
Why is IR so hard?
 Traditional information retrieval (IR) systems
attempt to find relevant documents to respond to a
user’s request.
 Information retrieval problem: locating relevant
documents based on user input, such as keywords or
example documents
 The real problem boils down to matching the language of
the query to the language of the document.
 Simply matching on words is a very brittle (inflexible)
approach. One word can have several different
meanings. Consider “take”:
 “take a place at the table”
 “take money to the bank”
 “take a picture”
17
More Problems with IR
 You can’t even tell what part of speech a word is:
 “I saw her duck”
 A query that searches for “pictures of a duck” will find
documents that contain:
 “I saw her duck away from the ball falling from the sky”
 Proper nouns are often built from regular nouns
 Consider a document with “a man named Abraham owned a
Lincoln”
 A word-matching query for “Abraham Lincoln” may well find
the above document.
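The two failure cases above can be reproduced with a naive word matcher. This is a minimal sketch; the helper names `words` and `shared_words` are hypothetical, and the matcher deliberately knows nothing about part of speech or proper nouns.

```python
import re


def words(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def shared_words(query, document):
    """Return the words that the query and the document have in common."""
    return words(query) & words(document)


duck_doc = "I saw her duck away from the ball falling from the sky"
lincoln_doc = "a man named Abraham owned a Lincoln"

# The verb "duck" matches the noun the user meant.
duck_hits = shared_words("pictures of a duck", duck_doc)

# Both words of the proper noun match, although the document is about a car owner.
lincoln_hits = shared_words("Abraham Lincoln", lincoln_doc)
```

In both cases the matcher reports a hit because the surface strings agree, even though the document does not address the user's need.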

18
Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of documents
The User Task:
Two user tasks: retrieval and browsing

[Figure: the USER interacts with the DB through two tasks, Retrieval and Browsing]
19
The User Task: Retrieval
• It is the process of retrieving information whereby the
main objective is clearly defined from the onset of
searching process.
• The user of a retrieval system has to translate his
information need into a query in the language provided
by the system.
• In this context (i.e. by specifying a set of words), the user
searches for useful information by executing a retrieval task
• English language statement:
“I want a book by J. K. Rowling titled The Chamber of
Secrets”

20
Browsing
• It is the process of retrieving information, whereby the
main objective is not clearly defined from the beginning
and whose purpose might change during the interaction
with the system.
• E.g. User might search for documents about ‘car racing’ .
Meanwhile he might find interesting documents about ‘car
manufacturers’. While reading about car manufacturers in
Addis, he might turn his attention to a document
providing ‘direction to Addis’, and from this to documents
which cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing the collection
rather than searching it, since the user may simply be interested
in glancing around
21
Logical View of Documents
Documents in a collection are frequently represented by a
set of index terms or keywords
Such keywords are mostly extracted directly from the text
of the document
These representative keywords provide a logical view of
the document

[Figure: Docs -> Tokenization -> stop words -> stemming -> Indexing; the logical view ranges from full text to index terms]

Document representation is viewed as a continuum, in which the
logical view of documents might shift from full text to index
terms
22
Logical view of documents
 If full text :
 Each word in the text is a keyword
 Most complex form
 Expensive
 If the full text is too large, the set of representative keywords
can be reduced through a transformation process called
text operations
 Text operations reduce the complexity of the document
representation and allow moving the logical view
from that of a full text to a set of index terms
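The text operations described above (tokenization, stop-word removal, stemming) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stop list is a tiny hypothetical one, and `crude_stem` is a deliberately simplistic suffix stripper, not the Porter stemmer or any other published algorithm.

```python
import re

# Hypothetical, very small stop list used only for this example.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "from"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Full text in, sorted set of index terms out."""
    tokens = remove_stop_words(tokenize(text))
    return sorted({crude_stem(t) for t in tokens})
```

For example, `index_terms("The indexing of documents")` reduces four words of full text to the two index terms `["document", "index"]`.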

23
Structure of an IR System
 An information retrieval system serves as a bridge between the
world of authors and the world of readers/users.
 That is, writers present a set of ideas in a document using a set
of concepts. Users then query the IR system for relevant
documents that satisfy their information need.

User Documents
Black box

The black box is the information retrieval system.

24
Structure of an IR System
 To be effective in its attempt to satisfy information
need of users, the IR system must ‘interpret’ the
contents of documents in a collection and rank
them according to their degree of relevance to the
user query.
 Thus the notion of relevance is at the center of IR
 The primary goal of an IR system is to retrieve all
the documents which are relevant to a user query
while retrieving as few non-relevant documents as
possible

25
Structure of an IR System
Typical IR Task
 Given:
 A corpus of textual natural-language documents.
 A user query in the form of a textual string.
 Find:
 A ranked set of documents that are relevant to the
query.

[Figure: the Document corpus and the Query String feed the IR System, which outputs Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, …)]
26
Web Search System
[Figure: a Web Spider feeds the Document corpus; the Query String and the corpus feed the IR System, which outputs Ranked Documents (1. Page1, 2. Page2, 3. Page3, …)]

27
What is Information Retrieval ?
 A good formal definition of information retrieval is
given in Baeza-Yates & Ribeiro-Neto (1999)
“Information retrieval deals with representation, storage, organization
of, and access to information items. The organization and access of
information items should provide the user with easy access to the
information in which he is interested”
 The definition incorporates all important features
of a good information retrieval system
 Representation
 Storage
 Organization
 Access
 The focus is mainly on the user information need

28
Overview of the Retrieval process

29
The Retrieval Process
 It is necessary to define the text database before
any of the retrieval processes are initiated
 This is usually done by the manager of the database
and includes specifying the following
 The documents to be used
 The operations to be performed on the text
 The text model to be used (the text structure and what
elements can be retrieved)
 The text operations transform the original
documents and the information needs and
generate a logical view of them
30
Retrieval Process ….
 Once the logical view of the documents is
defined, the database module builds an index of
the text
 An index is a critical data structure
 It allows fast searching over large volumes of
data
 Different index structures might be used, but the
most popular one is the inverted file
 Given that the document database is indexed, the
retrieval process can be initiated
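An inverted file maps each term to the documents that contain it, which is what makes lookups fast. The following is a minimal in-memory sketch; the function names `build_inverted_index` and `lookup` are hypothetical, and whitespace splitting stands in for the text operations a real system would apply.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, term):
    """Return the sorted document ids for a term (empty list if unseen)."""
    return sorted(index.get(term.lower(), set()))


docs = {1: "car racing", 2: "car manufacturers"}
index = build_inverted_index(docs)
```

Here `lookup(index, "car")` touches a single dictionary entry instead of scanning every document, which is the point of building the index before any query arrives.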

31
The Retrieval Process …
The user first specifies a user need, which is then
parsed and transformed by the same text operations
applied to the text
Next, query operations are applied before the actual
query, which provides a system representation of the
user need, is generated
The query is then processed to obtain the retrieved
documents
Before the retrieved documents are sent to the user,
they are ranked according to
the likelihood of relevance
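The steps above (transform the query, process it against the collection, rank the results) can be sketched as one function. The slides do not name a weighting scheme at this point, so TF-IDF scoring is an assumption made for this example; the names `terms` and `rank` are hypothetical.

```python
import math
from collections import Counter


def terms(text):
    """The same simple text operation is applied to documents and queries."""
    return text.lower().split()


def rank(query, docs):
    """Return ids of matching documents, most likely relevant first."""
    n_docs = len(docs)
    doc_terms = {doc_id: Counter(terms(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for counts in doc_terms.values():
        df.update(counts.keys())

    scores = {}
    for doc_id, counts in doc_terms.items():
        score = 0.0
        for term in terms(query):
            if term in counts:
                # Term frequency times inverse document frequency (assumed scheme).
                score += counts[term] * math.log(n_docs / df[term])
        if score > 0:
            scores[doc_id] = score

    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Note that under this scheme a term occurring in every document has weight zero, so a document that matches only such a term is not returned at all.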
32
The Retrieval Process …
The user then examines the set of ranked documents
in the search for useful information. Two choices for
the user:
 (i) reformulate query, run on entire collection or (ii)
reformulate query, run on result set
At this point, s/he might pinpoint a subset of the
documents seen as definitely of interest and initiate a
user feedback cycle
 In such a cycle, the system uses the documents
selected by the user to change the query formulation.
 Hopefully, this modified query is a better
representation of the real user need
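One simple way to picture the feedback cycle is query expansion: terms that are frequent in the documents the user selected are added to the query. This is a hedged sketch of that idea only, not the method used by any particular system, and the function name `expand_query` and its parameter `n_new_terms` are hypothetical.

```python
from collections import Counter


def expand_query(query, selected_docs, n_new_terms=2):
    """Add the most frequent new terms from user-selected documents."""
    query_terms = query.lower().split()
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Most frequent first; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return query_terms + ranked[:n_new_terms]


selected = ["car racing formula circuit", "formula racing drivers"]
new_query = expand_query("car racing", selected, n_new_terms=1)
```

In this example the term "formula" occurs in both selected documents, so it is the one added to the original query.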
33
Detail view of the Retrieval Process
[Figure: detailed view of the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flows between them: user need, text, logical view, user feedback, query, retrieved docs, ranked docs.]
34
Issues that arise in IR
 Text representation
 what makes a “good” representation?
 how is a representation generated from text?
 what are retrievable objects and how are they organized?
 Information need representation
 what is an appropriate query language?
 how can interactive query formulation and refinement be
supported?
 Comparing representations (to identify relevant
documents)
 what weighting scheme and similarity measure should be used?
 what is a “good” model of retrieval?
 Evaluating effectiveness of retrieval
 what are good metrics/measurements?

35
Focus in IR System Design
Our focus during IR system design is:
 Improving the effectiveness of the
system
 Effectiveness of the system is measured in terms of
precision, recall, …
 Techniques: stemming, stop words, weighting schemes, matching
algorithms
 Improving the efficiency of the system
 The concern here is storage space usage, access time,
searching time, data transfer time, …
 Concern regarding space-time tradeoffs
 Use Compression techniques, data/file structures, etc.
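Precision and recall, named above as effectiveness measures, have standard set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below computes both; the function name `precision_recall` and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).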

36
Subsystems of an IR system
 The two subsystems of an IR system:
 Searching: an online process of finding relevant
documents in the index according to the user’s query
 Indexing: is an offline process of organizing
documents using keywords extracted from the
collection
 Indexing and searching: are unavoidably connected
 you cannot search what was not first indexed
 indexing of documents or objects is done in order to be
searchable
 to index one needs an indexing language
 there are many indexing languages
 even taking every word in a document is an indexing language

37
Indexing Subsystem
[Figure: indexing pipeline: documents -> Assign document identifier -> text and document IDs -> Tokenize -> tokens -> Stop list -> non-stoplist tokens -> Stemming & Normalize -> stemmed terms -> Term weighting -> terms with weights -> Index]

38
Searching Subsystem
[Figure: searching pipeline: query -> parse query -> query tokens -> Stop list -> non-stoplist tokens -> Stemming & Normalize -> stemmed terms -> Term weighting -> query terms -> Similarity Measure (compared against index terms from the Index) -> relevant document set -> ranking -> ranked document set]

39
1. What are the two subsystems of an IR system? Describe them.
2. Define the logical view of a document and list the steps
that produce it.
3. Define the four terms used in the definition of information
retrieval.
4. Describe the steps in the overview of the retrieval process.

40
Interesting Examples
http://images.google.com/
 Google image search
http://video.google.com/
 Google video search
http://http.cs.berkeley.edu/~daf/people.html
 Finding naked people (seriously!)
http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html
 Query by humming

41
Tackling the IR Challenge
 Divide and conquer!
 Strategy: limit complexity
 Approach:
 Define interfaces (input and output) for each
component
 Define the functions performed by each component
 Study each component in isolation
 Repeat the process within components as needed
 Make sure that this decomposition makes sense
 Result: a hierarchical decomposition
42
A Tour of This Course
 Major themes:
 Learn about the IR black box
 Put the user back in the loop
 Extensions beyond standard document retrieval
 Along the way:
 Assignments
 Test, lab exam and final
 Project

43
Where do we make the cut?
 Study the IR black box in isolation
 Simple behavior: in goes a query, out come documents
 Optimize the quality of documents that come out
Query

Search Ranked List

 Study everything else around the black box


 Put the human back in the loop!

44
The IR Black Box
Query Documents

Hits

45
Inside The IR Black Box
Query Documents

Representation Representation
Function Function

Query Representation Document Representation

Comparison
Function Index

Hits

46
The Central Problem in IR
Information Seeker Authors

Concepts Concepts

Query Terms Document Terms

Do these represent the same concepts?


47
Building the IR Black Box
 Different models of information retrieval
 Boolean model
 Vector space model
 Language models
 Representing the meaning of documents
 How do we capture the meaning of documents?
 Is meaning just the sum of all terms?
 Indexing
 How do we actually store all those words?
 How do we access indexed terms quickly?
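Of the models listed above, the Boolean model is the simplest to illustrate: a query is a combination of AND, OR, and NOT over the sets of documents that contain each term. The sketch below assumes a tiny hand-built index; `INDEX`, `ALL_DOCS`, and the helper names are hypothetical.

```python
# Hypothetical postings: term -> set of ids of documents containing it.
INDEX = {
    "car": {1, 2, 4},
    "racing": {1, 4},
    "tourism": {3},
}
ALL_DOCS = {1, 2, 3, 4}


def postings(term):
    """Documents containing the term (empty set for unknown terms)."""
    return INDEX.get(term, set())


def and_(a, b):
    return a & b


def or_(a, b):
    return a | b


def not_(a):
    return ALL_DOCS - a


# Query: car AND NOT racing
result = and_(postings("car"), not_(postings("racing")))
```

The result is an unranked set: every document either satisfies the Boolean expression or does not, which is the main contrast with models that produce a ranking.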
48
Beyond the IR Black Box
 Studying the IR black box in isolation: Is this realistic?
 What are the assumptions of this methodology?

49
The User in the Loop
 Relevance Feedback
 How do humans (and machines) modify queries based
on retrieved results?
 User Interaction
 Information retrieval meets computer-human
interaction
 How do we present search results to users in an effective
manner?
 What tools can systems provide to aid the user in
information seeking?

50
Extensions
 Filtering and Categorization
 Traditional information retrieval: static collection,
dynamic queries
 What about static queries against dynamic collections?
 Multimedia Retrieval
 Thus far, we’ve been focused on text…
 What about images, sounds, video, etc.?
 Question Answering
 We want answers, not just documents!
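The filtering case listed above (static queries against a dynamic collection) reverses the usual roles: the queries are stored, and each newly arriving document is matched against them. The following is a minimal sketch under that reading; the class `StandingQueryFilter`, its methods, and the all-words matching rule are assumptions made for illustration.

```python
class StandingQueryFilter:
    """Route each incoming document to users whose stored query it matches."""

    def __init__(self):
        self.profiles = {}  # user -> set of query words

    def subscribe(self, user, query):
        """Store a long-lived query for a user."""
        self.profiles[user] = set(query.lower().split())

    def route(self, document):
        """Return the users whose query words all occur in the document."""
        doc_words = set(document.lower().split())
        return sorted(
            user for user, query in self.profiles.items() if query <= doc_words
        )


filt = StandingQueryFilter()
filt.subscribe("alice", "car racing")
filt.subscribe("bob", "tourism")
```

Calling `filt.route("New car racing season opens")` returns only the user whose standing query is fully covered by the new document.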
