0% found this document useful (0 votes)

46 views34 pages

01 Introduction To ISR

The document provides an introduction to Information Storage and Retrieval (ISR), detailing the various types of information retrieval systems, their goals, and the challenges faced in matching user queries to relevant documents. It distinguishes between information retrieval and data retrieval, emphasizing the importance of relevance and user needs in the retrieval process. The document also outlines the structure of an IR system, the retrieval process, and issues related to text representation and query formulation.

Uploaded by

yoseftirusew1194

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

46 views34 pages

01 Introduction To ISR

Uploaded by

yoseftirusew1194

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 34

Introduction to

Information Storage and Retrieval

Introduction to
Information Storage and Retrieval

Chapter One:
Introduction to ISR
Information Retrieval Systems?
Document (Web page)
retrieval in response to a
query
Quite effective (at some
things)
Commercially successful
(some of them)
But what goes on behind
the scenes?
How do they work? Web search
What happens beyond
systems
• Lycos, Excite,
the Web?
Yahoo, Google, Live,
Northern Light,
Teoma, HotBot,
Examples of IR systems
Conventional (library catalog): by keyword, title, author,
etc.
 E.g.: You are probably familiar with www.library.unt.edu
 AAU library catalog

Text-based (Google, Yahoo, Lexis-Nexis,, FAST): Search

by keywords. Limited search using queries in natural language.
Multimedia (QBIC, WebSeek, SaFe): Search by visual
appearance (shapes, colors,… ).
Question answering systems (AskJeeves, Answerbus):
Search in (restricted) natural language
Other:
 Cross language information retrieval (uses multiple languages),
 Music retrieval
Information Retrieval
Information retrieval (IR) is the process of
finding material (usually documents) of an
unstructured nature (usually text) that satisfies
an information need of the user from within
large collections (usually stored on computers).
Information is organized into (a large number
of) documents
Large collections of documents from various sources:
news articles,
research papers,
books,
digital libraries,
Web pages, etc.
Example: Web Search Engines like Google claim to index
over 1.5 billion pages
General Goal of Information
Retrieval
To help users find useful/relevant
information based on their information
needs (with a minimum effort) despite
The challenge is:
Increasing complexity of Information
Changing needs of user

Provide immediate random access to the

document collection
Retrieval systems, such as Google, Yahoo,
are developed with this aim
Information Retrieval vs. Data
Retrieval
 Emphasis of IR is on the retrieval of information, rather than on the
retrieval of data

Data retrieval
 Consists mainly of determining which documents contain a set of
keywords in the user query (which is not enough to satisfy the user
information need)
 Aims at retrieving all objects that satisfy well defined semantics
 a single erroneous object among a thousand retrieved objects implies
failure

Information retrieval
 Is concerned with retrieving information about a subject or topic than
retrieving data which satisfies a given query
 semantics is frequently loose: the retrieved objects might be inaccurate
 small errors are tolerated
Information Retrieval vs. Data
Retrieval
 Example of data retrieval system is a relational database

Data Retrieval Info Retrieval

Data organization Structured Unstructured
Fields Clear Semantics No fields (other
(ID, Name, age,…) than text)
Query Language Artificial (defined, Free text (“natural
SQL) language”), Boolean
Matching Exact (results are Partial match, best match
always “correct”)
Query specification Complete Incomplete
Items wanted Matching Relevant
Accuracy 100% >50%
Error response Sensitive Insensitive
Why is IR so hard?
Information retrieval problem: locating
relevant documents based on user input, such
as keywords or example documents
The real problem boils down to matching the
language of the query to the language of the
document
Simply matching on words is a very weak approach
 One word can have different semantic meanings. Consider:
Take
“take a place at the table”
“take money to the bank”
“take a picture”
More Problems with IR
You can’t even tell what part of speech a word has:
“I saw her duck”

A query that searches for “pictures of a duck” will find

documents that contains:
“I saw her duck away from the ball falling from the sky”

Proper Nouns often use regular old nouns

Consider a document with “a man named Abraham
owned a Lincoln”
A word matching query for “Abraham Lincoln” may well
find the above document
Basic Concepts in Information Retrieval:
(i) User Task and
(ii) Logical View of documents
The User Task:
user task – retrieval and browsing

Retrieval

DB
Browsing

USER
• Retrieval The User Task
• It is the process of retrieving information
whereby the main objective is clearly
defined from the onset of searching
process
• The user of a retrieval system has to
translate his information need into a
query in the language provided by the
system
• In this context (i.e. by specifying a set of
words), the user searches for useful
information executing a retrieval task
• English Language Statement :
I want a book by J. K Rowling titled The Chamber of Secrets
• Browsing The User Task
• It is the process of retrieving information,
whereby the main objective is not
clearly defined from the beginning and
whose purpose might change during the
interaction with the system

• E.g. User might search for documents about ‘car

racing’ . Meanwhile he might find interesting
documents about ‘car manufacturers’. While
reading about car manufacturers in Addis, he might
turn his attention to a document providing
‘direction to Addis’, and from this to documents
which cover ‘Tourism in Ethiopia’

• In this context, user is said to be browsing in

the collection and not searching, since a
user may has an interest of glancing around
Logical View of Documents

Documents in a collection are frequently represented by a
set of index terms or keywords

Such keywords are mostly extracted directly from the text of
the document

These representative keywords provide a logical view of the
document

Docs Tokenizatio stop words stemming Indexing

Full Index terms

text

Document representation viewed as a continuum, in which
logical view of documents might shift from full text to index
terms
Logical view of documents
If full text :
Each word in the text is a keyword
Most complex form….why?
Expensive….why?
If full text is too large, the set of
representative keywords can be reduced
through transformation process called text
operation
Itreduce the complexity of the document
representation and allow moving the logical
view from that of a full text to a set of index
terms
Structure of an IR System
 An Information Retrieval System serves as a bridge
between the world of authors and the world of
readers/users,
 That is, writers present a set of ideas in a document using
a set of concepts. Then Users seek the IR system for
relevant documents that satisfy their information need.

User Black box Documents

 The black box is the information retrieval system.

To be effective in its attempt to satisfy information need of
users, the IR system must some how ‘interpret’ the
contents of documents in a collection and rank them
according to their degree of relevance to the user
query.
 Thus the notion of relevance is at the centre of IR
 The primary goal of an IR system is to retrieve all the
documents which are relevant to a user query while
retrieving as few non-relevant documents as
possible
Typical IR Task
 Given:
 A collection of textual natural-language
documents
 A user query in the form of a textual string
 Process:
 A ranked set of documents that are assumed
to be relevant to the user query
 Measure of Effectiveness:
 Number of relevant docs from the retrieved
collection
 Number of relevant docs retrieved from the
whole collection
Measure of Accuracy
Typical IR System Architecture

Document
corpus

Query IR
String System

Ranked
Documents
1. Doc1
2. Doc2
3. Doc3
.
Web Search System (e.g.: Google)
Web crawler

Web Spider
Document
corpus

Query IR
String System

Ranked
Documents
1. Page1
2. Page2
3. Page3
.
What is Information Retrieval ?
Information retrieval a broad area of computer
science focusing on easy access of information, as
defined in Baeze-Yates & Riberio-Neto (2011, p1)
“Information retrieval deals with representation,
storage, organization of, and access to
information items, such as documents, Web pages,
online catalogs, structured and semi-structured
records, multimedia objects.The organization and
access of information items should provide the user
with easy access to the information in which he is
interested”
The definition incorporates all important features of
a good information retrieval system
 Representation
 Storage
 Organization
 Access
The focus is on the user information need
Overview of the Retrieval process
Overview of the Retrieval process (2)
The Retrieval Process
It is necessary to define the text collection
before any of the retrieval processes are
initiated
This is usually done by the manager of the
database and includes specifying the following
The documents to be used
The operations to be performed on the text
The text model to be used (the text structure and
what elements can be retrieved)

The text operations transform the original

documents and the information needs and
generate a logical view of each document
The Retrieval Process
Once the logical view of the documents is defined,
the database module builds an index of the text
An index is a critical data structure
It allows fast searching over large volumes of data
Reduces the vocabulary of the collection

Different index structures might be used , but the

most popular one is the inverted file (more on
this later)
Given the document database is indexed, the
retrieval process can be initiated
The Retrieval Process …
The user first specifies a user need which is then parsed and
transformed by the same text operation applied to the text
Next the query operations is applied before the actual query,
which provides a system representation for the user need, is
generated
The query is then processed (compared) to obtain the retrieved
documents
Before the retrieved documents are presented to the user, the retrieved
documents are ranked according to the likelihood of relevance
The user then examines the set of ranked documents in the search
for useful information. Two choices for the user:
(i) reformulate query, run on entire collection or (ii) reformulate query, run
on result set
At this point, he might pinpoint a subset of the documents seen as
definitely of interest and initiate a user feedback cycle
In such a cycle, the system uses the documents selected by the
user to change the query formulation.
Hopefully, this modified query is a better representation of the real
user need
Detail view of the Retrieval Process
User Text
Interface
User
need
Text

Text Operations
logical view Logical view
DB
User Query Language manager
feedback Indexing Module
& Operations

Query Inverted file

Searching Index

Retrieved docs Text

Database
Ranked docs
Ranking
Issues in IR
Text representation
what makes a “good” representation?
how is a representation generated from text?
what are retrievable objects and how are they organized?

Information needs representation

what is an appropriate query language?
how can interactive query formulation and refinement be
supported?
Comparing representations
to identify relevant documents
What weighting scheme and similarity measure to be
used?
what is a “good” model of retrieval?

Evaluating effectiveness of retrieval

what are good metrics?
what constitutes a good experimental test bed?
Focus in IR System Design
Our focus during IR system design is:
In improving performance effectiveness of the
system
Effectiveness of the system is measured in terms of:
precision,
recall, …
Stemming, stop words, weighting schemes, matching
algorithms
In improving performance efficiency
The concern here is storage space usage, access time,
searching time, data transfer time …
There is space – time tradeoffs !!
Use Compression techniques, data/file structures, etc.
Subsystems of an IR system
The two subsystems of an IR system:
Indexing: is an offline process of organizing
documents using keywords extracted from the collection
Searching: is an online process of finding relevant
documents in the index list as per users query
Indexing and searching: are unavoidably connected
you cannot search that was not first indexed in some manner
indexing of documents or objects is done in order to be
searchable
there are many ways to do indexing
to index one needs an indexing language
there are many indexing languages
even taking every word in a document is an indexing
language
Knowing searching is knowing indexing
Indexing Subsystem

documents
Documents Assign document identifier

text document
Tokenize
IDs
tokens
Stop list
non-stoplist Stemming & Normalize
tokens
stemmed Term weighting
terms
terms with
weights Index
Searching Subsystem
query parse query
query tokens
ranked
Stop list non-stoplist
document
tokens
set
ranking
Stemming & Normalize
relevant stemmed terms
document set
Similarity Query Term weighting
Measure terms
Index terms
Index
End of Chapter One

Price-Rexroth Hydraulics Division
78% (9)
Price-Rexroth Hydraulics Division
512 pages
LK Valves Product Catalogue Rev. 6
No ratings yet
LK Valves Product Catalogue Rev. 6
211 pages
CompletedUNIT 1 PPT 10.7.17
100% (6)
CompletedUNIT 1 PPT 10.7.17
87 pages
ch1 - Information Retrieval Systems
No ratings yet
ch1 - Information Retrieval Systems
52 pages
D&D 4th Edition - Adventurer's Vault 2
100% (6)
D&D 4th Edition - Adventurer's Vault 2
159 pages
The Art of Growing Irish Potatoes in Sacks
No ratings yet
The Art of Growing Irish Potatoes in Sacks
6 pages
Kurmanji Complete
100% (2)
Kurmanji Complete
217 pages
Information Storage and Retrieval: Chapter One - Introduction
No ratings yet
Information Storage and Retrieval: Chapter One - Introduction
50 pages
LIBRARY MANAGEMENT SYSTEM - Final
100% (1)
LIBRARY MANAGEMENT SYSTEM - Final
25 pages
Wollo University Kombolcha Institute of Technology College of Informatics Department of Information Technology
100% (1)
Wollo University Kombolcha Institute of Technology College of Informatics Department of Information Technology
35 pages
6 Accounting For Merchandising Businesses
100% (1)
6 Accounting For Merchandising Businesses
107 pages
Annexure-4 CertificatefromUniversity
No ratings yet
Annexure-4 CertificatefromUniversity
1 page
IR Introduction
100% (1)
IR Introduction
6 pages
ENGLISH 9 Q1 Week 1 2
No ratings yet
ENGLISH 9 Q1 Week 1 2
10 pages
Elliott Wave Theroist
100% (1)
Elliott Wave Theroist
5 pages
ACDCModule Users Guide
No ratings yet
ACDCModule Users Guide
474 pages
Cs8080 - Irt - Notes All
No ratings yet
Cs8080 - Irt - Notes All
281 pages
IR Chapter 1&2
No ratings yet
IR Chapter 1&2
88 pages
IR Chapter 1 & 2
No ratings yet
IR Chapter 1 & 2
114 pages
Introduction To IR 2021
No ratings yet
Introduction To IR 2021
40 pages
IRS B Tech CSE Part 1
No ratings yet
IRS B Tech CSE Part 1
161 pages
Part I IR VTU M Tech SSE
No ratings yet
Part I IR VTU M Tech SSE
72 pages
Chapter 1 Introduction To ISR
No ratings yet
Chapter 1 Introduction To ISR
39 pages
CNC Nots 802d or 840d DX150
No ratings yet
CNC Nots 802d or 840d DX150
75 pages
Introduction Information Retrieval
No ratings yet
Introduction Information Retrieval
73 pages
Unit 1: Introduction and Data Pre-Processing
No ratings yet
Unit 1: Introduction and Data Pre-Processing
71 pages
Hazid Record
No ratings yet
Hazid Record
21 pages
Monday - IR Fundamentals - Grace Yang - AFIRM19-IR
No ratings yet
Monday - IR Fundamentals - Grace Yang - AFIRM19-IR
77 pages
1 IR Chapter-One
No ratings yet
1 IR Chapter-One
47 pages
01 Introduction To ISR
No ratings yet
01 Introduction To ISR
48 pages
1 IRIntro
No ratings yet
1 IRIntro
95 pages
Unity TCP Open Block Library Users Manual
No ratings yet
Unity TCP Open Block Library Users Manual
124 pages
Unit 3 (Fund. of Se) - Part III
No ratings yet
Unit 3 (Fund. of Se) - Part III
94 pages
Information Retrieval: Dr. Bassel ALKHATIB
No ratings yet
Information Retrieval: Dr. Bassel ALKHATIB
55 pages
Unit - 1
No ratings yet
Unit - 1
51 pages
Introduction To IR Chapter 01
No ratings yet
Introduction To IR Chapter 01
29 pages
Unit1 Introduction
No ratings yet
Unit1 Introduction
31 pages
IR Chapter 1
No ratings yet
IR Chapter 1
32 pages
Ch1 IR
No ratings yet
Ch1 IR
39 pages
Target Appraisal: Case: Dr. Reddy Laboratories (A) & (B)
No ratings yet
Target Appraisal: Case: Dr. Reddy Laboratories (A) & (B)
45 pages
Chapter One ISR
No ratings yet
Chapter One ISR
25 pages
Adt Unit 5
No ratings yet
Adt Unit 5
31 pages
1 IR Intro
No ratings yet
1 IR Intro
30 pages
Chapter 1 Ir
No ratings yet
Chapter 1 Ir
37 pages
IR Module
No ratings yet
IR Module
80 pages
1 IR Introduction
No ratings yet
1 IR Introduction
23 pages
Information Retrieval Techniques
No ratings yet
Information Retrieval Techniques
59 pages
Information Retrieval Techniques
No ratings yet
Information Retrieval Techniques
63 pages
1 IR Introductionn
No ratings yet
1 IR Introductionn
30 pages
IR Chapter 1
No ratings yet
IR Chapter 1
29 pages
1 introIR
No ratings yet
1 introIR
22 pages
Ch2 - IR and LT
No ratings yet
Ch2 - IR and LT
45 pages
Chapter 1
No ratings yet
Chapter 1
52 pages
1.introduction Information Retrival
No ratings yet
1.introduction Information Retrival
31 pages
Briefing Dealer - Update Socialization Jan 2020
No ratings yet
Briefing Dealer - Update Socialization Jan 2020
17 pages
Chapter 1
No ratings yet
Chapter 1
69 pages
Unit 1 - Modern Information Retrieval - WWW - Rgpvnotes.in
No ratings yet
Unit 1 - Modern Information Retrieval - WWW - Rgpvnotes.in
8 pages
Documentation Ir
No ratings yet
Documentation Ir
58 pages
IRS Notes
No ratings yet
IRS Notes
10 pages
YLSTD30-40K01小功率直流充电桩用户手册User Manua V1 - (EN&CN) ) 已校对
No ratings yet
YLSTD30-40K01小功率直流充电桩用户手册User Manua V1 - (EN&CN) ) 已校对
17 pages
Information Retrieval Detailed Lecture Nov 2023
No ratings yet
Information Retrieval Detailed Lecture Nov 2023
39 pages
Information Storage and Retrieval
No ratings yet
Information Storage and Retrieval
45 pages
1-Overview of Information Retrieval - New
No ratings yet
1-Overview of Information Retrieval - New
47 pages
ISR Chap..1
No ratings yet
ISR Chap..1
27 pages
Chapter 1 Introduction To IR
No ratings yet
Chapter 1 Introduction To IR
18 pages
Chapter One IR
No ratings yet
Chapter One IR
18 pages
Chapter 1 Introduction To IR
No ratings yet
Chapter 1 Introduction To IR
18 pages
HTM - Lab - Materials
No ratings yet
HTM - Lab - Materials
85 pages
Chap 1
No ratings yet
Chap 1
22 pages
Ch1 Intro To Information Retrieval-Lina Nemri
No ratings yet
Ch1 Intro To Information Retrieval-Lina Nemri
23 pages
Information Retrieval Systems
No ratings yet
Information Retrieval Systems
46 pages
AI Chapter Three
No ratings yet
AI Chapter Three
61 pages
Mounting Procedure: Reference: C3131320010 A1
No ratings yet
Mounting Procedure: Reference: C3131320010 A1
16 pages
RISK MGMT Chap III 2020 Sem II
No ratings yet
RISK MGMT Chap III 2020 Sem II
18 pages
Objection Deadline: March 20, 2012 at 4:00 P.M. (ET) Hearing Date: April 5, 2012 at 10:00 A.M. (ET)
No ratings yet
Objection Deadline: March 20, 2012 at 4:00 P.M. (ET) Hearing Date: April 5, 2012 at 10:00 A.M. (ET)
32 pages
RetrivalChapter One
No ratings yet
RetrivalChapter One
30 pages
Number System Representation - Study Notes
No ratings yet
Number System Representation - Study Notes
12 pages
Threads
No ratings yet
Threads
52 pages
Current Research Topics in Optical Sensors and Laser Diagnostics
No ratings yet
Current Research Topics in Optical Sensors and Laser Diagnostics
17 pages
IR Notes
No ratings yet
IR Notes
14 pages
Introduction To TikTok Shop Affiliate Program
No ratings yet
Introduction To TikTok Shop Affiliate Program
10 pages
Presentation Database
No ratings yet
Presentation Database
22 pages
Module 1print
No ratings yet
Module 1print
5 pages
1 introIR
No ratings yet
1 introIR
15 pages
Information Retrieval
No ratings yet
Information Retrieval
5 pages
Anaesthetic Considerations Patients With Mitral Valve Prolapse
No ratings yet
Anaesthetic Considerations Patients With Mitral Valve Prolapse
7 pages
Worksheet II From Ch5 - 8
No ratings yet
Worksheet II From Ch5 - 8
5 pages
Shops & Estt
No ratings yet
Shops & Estt
4 pages
Anu Arora Report
No ratings yet
Anu Arora Report
8 pages
Features: Battery Earth Fault Relays
No ratings yet
Features: Battery Earth Fault Relays
5 pages
Design and Analysis of Algorithm Module
No ratings yet
Design and Analysis of Algorithm Module
11 pages
Happy Birthday
No ratings yet
Happy Birthday
2 pages
Information On The Format of The TOEFL
No ratings yet
Information On The Format of The TOEFL
2 pages
Os Exit Exam
No ratings yet
Os Exit Exam
9 pages
Centrifugation I.: Centrifuge Selection - Tubular Bowl Centrifuge
No ratings yet
Centrifugation I.: Centrifuge Selection - Tubular Bowl Centrifuge
3 pages
Assignment Linear Algebra
No ratings yet
Assignment Linear Algebra
1 page

01 Introduction To ISR

Uploaded by

01 Introduction To ISR

Uploaded by

Introduction to

Information Storage and Retrieval

Text-based (Google, Yahoo, Lexis-Nexis,, FAST): Search

Provide immediate random access to the

Data Retrieval Info Retrieval

A query that searches for “pictures of a duck” will find

Proper Nouns often use regular old nouns

• E.g. User might search for documents about ‘car

• In this context, user is said to be browsing in

Docs Tokenizatio stop words stemming Indexing

Full Index terms

User Black box Documents

 The black box is the information retrieval system.

The text operations transform the original

Different index structures might be used , but the

Query Inverted file

Retrieved docs Text

Information needs representation

Evaluating effectiveness of retrieval

You might also like