
Chapter 1 Introduction To ISR

This document provides an introduction to information storage and retrieval, including: - The course objectives are to familiarize students with information retrieval theories, modern concepts, indexing strategies, current research issues and trends. - The content will cover topics like text operations, term weighting, indexing structures, retrieval models, evaluation, and query languages over 7 chapters. - Instructional methods will include lectures, assignments, discussions and evaluations consisting of assessments, a final project, and quizzes.

Uploaded by

Aaron Melendez


Introduction to

Information Storage and Retrieval

e-mail: [email protected]
Office: Eshetu Chole Building, Office No.: 114
Course Objectives:
To familiarize students with the basic theories and
principles of information storage and retrieval
To introduce modern concepts of information retrieval
systems.
To acquaint students with the various indexing, matching,
organizing and evaluating strategies developed for
information retrieval (IR) systems
To enable students to understand current research issues and
trends in IR
Course Content
1) Introduction to ISR (1-2)
• Overview of an IR; IR vs. Database; IR vs. Data retrieval; Challenges in IR
• The Retrieval Process
• Designing an IR System
2) Text/Document Operations (3-5)
• Distribution of words in texts: Luhn’s idea, Zipf’s law
• Vocabulary size: Heaps’ law
• Index term selection: Lexical analysis, Stop word elimination, Stemming
3) Term weighting and similarity measures (5-6)
• Term weighting techniques: Term frequency (TF), Inverse document frequency (IDF), TF*IDF
• Algorithms for similarity measures: Euclidean distance, inner product, cosine similarity
4) Indexing Structures (7-9)
• What is indexing? Why indexing? Effectiveness vs. efficiency
• Indexing process
• Indexing structures: Sequential file, Inverted file, Suffix tree
5) IR Models (9-11)
• A Formal Characterization of IR Models
• Boolean model
• Vector space model
6) Retrieval Evaluation (11-13)
• Why evaluate IR systems?
• Challenges in IR evaluation
• Measuring performance (Recall, Precision, Single-valued measures)
7) Query Languages and Operations (13-14)
• Keyword-based queries (Boolean queries, weighted queries, etc.)
• Relevance feedback: user relevance feedback vs. pseudo relevance feedback
Instructional Methods:
Lecture, Assignments, Discussions.

Evaluations:
Assessment 1 15%
Assessment 2 15%

Final 40%
Project 20%
Quizzes 10%
Chapter One: Introduction to ISR
Information Retrieval Systems?
Document (Web page) retrieval in response to a query
Quite effective (at some things)
Commercially successful (some of them)
But what goes on behind the scenes?
How do they work?
What happens beyond the Web?
Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
Examples of IR systems
Conventional (library catalog): Search by keyword, title, author, etc.
• E.g.: You are probably familiar with www.library.unt.edu
• AAU library catalog
Text-based (Google, Yahoo, Lexis-Nexis, FAST): Search by keywords. Limited search using queries in natural language.
Multimedia (QBIC, WebSeek, SaFe): Search by visual appearance (shapes, colors, …).
Question answering systems (AskJeeves, Answerbus): Search in (restricted) natural language
Other:
• Cross language information retrieval (uses multiple languages)
• Music retrieval
Information Retrieval
Information retrieval (IR) is the process of finding material
(usually documents) of an unstructured nature (usually text)
that satisfies an information need of the user from within
large collections (usually stored on computers).
Information is organized into (a large number of)
documents
Large collections of documents from various sources:
news articles,
research papers,
books,
digital libraries,
Web pages, etc.
Example: Web Search Engines like Google claim to index over 1.5
billion pages
General Goal of Information Retrieval
To help users find useful/relevant information based on their information needs (with a minimum effort)
The challenge is to do so despite:
Increasing complexity of information
Changing needs of users
To provide immediate random access to the document collection.
Retrieval systems, such as Google and Yahoo, are developed with this aim.
Information Retrieval vs. Data Retrieval
• The emphasis of IR is on the retrieval of information, rather than on the retrieval of data
• Data retrieval
  • Consists mainly of determining which documents contain a set of keywords in the user query (which is not enough to satisfy the user information need)
  • Aims at retrieving all objects that satisfy well-defined semantics
  • A single erroneous object among a thousand retrieved objects implies failure
• Information retrieval
  • Is concerned with retrieving information about a subject or topic rather than retrieving data which satisfies a given query
  • Semantics is frequently loose: the retrieved objects might be inaccurate
  • Small errors are tolerated
Information Retrieval vs. Data Retrieval
• An example of a data retrieval system is a relational database

                       Data Retrieval                          Info Retrieval
Data organization      Structured                              Unstructured
Fields                 Clear semantics (ID, Name, age, …)      No fields (other than text)
Query language         Artificial (defined, SQL)               Free text (“natural language”), Boolean
Matching               Exact (results are always “correct”)    Partial match, best match
Query specification    Complete                                Incomplete
Items wanted           Matching                                Relevant
Accuracy               100%                                    < 50%
Error response         Sensitive                               Insensitive
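The "Matching" row of the comparison can be illustrated with a small sketch. The records, field names and the word-overlap score below are made up for illustration only; they are assumptions, not part of the course material, and real IR systems use more refined scoring.

```python
# Toy records: structured fields (data retrieval) plus free text (IR).
# All names and values here are hypothetical.
records = [
    {"id": 1, "name": "Abebe", "age": 30, "text": "history of information retrieval systems"},
    {"id": 2, "name": "Sara", "age": 25, "text": "database systems and SQL queries"},
    {"id": 3, "name": "Kebede", "age": 30, "text": "retrieval of images by colour and shape"},
]

def data_retrieval(records, field, value):
    """Exact match: a record either satisfies the condition or it does not."""
    return [r["id"] for r in records if r[field] == value]

def best_match(records, query):
    """Partial match: score each record by how many query words its text contains."""
    query_words = set(query.lower().split())
    scored = []
    for r in records:
        overlap = len(query_words & set(r["text"].lower().split()))
        if overlap > 0:
            scored.append((overlap, r["id"]))
    # Highest overlap first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

exact_ids = data_retrieval(records, "age", 30)
ranked_ids = best_match(records, "information retrieval systems")
```

The exact-match call returns only the records that satisfy the condition, while the best-match call returns every record sharing at least one word with the query, ordered by overlap.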
Why is IR so hard?
Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
The real problem boils down to matching the language of the query to the language of the document.
Simply matching on words is a very weak approach.
One word can have different semantic meanings. Consider “take”:
“take a place at the table”
“take money to the bank”
“take a picture”
More Problems with IR
You can’t even tell what part of speech a word has:
“I saw her duck”
A query that searches for “pictures of a duck” will find documents that contain:
“I saw her duck away from the ball falling from the sky”
Proper nouns often use regular old nouns
Consider a document with “a man named Abraham owned a Lincoln”
A word-matching query for “Abraham Lincoln” may well find the above document.
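The two false matches above can be reproduced with a minimal word-matching sketch. The function names `match_all` and `match_any` are hypothetical, chosen here only to illustrate the weakness of pure word matching.

```python
import re

def tokens(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def match_all(query, document):
    """True if every query word occurs somewhere in the document."""
    return tokens(query) <= tokens(document)

def match_any(query, document):
    """True if at least one query word occurs in the document."""
    return bool(tokens(query) & tokens(document))

doc_person = "a man named Abraham owned a Lincoln"
doc_duck = "I saw her duck away from the ball falling from the sky"

# Both documents match, although neither is about the topic of the query.
lincoln_hit = match_all("Abraham Lincoln", doc_person)
duck_hit = match_any("pictures of a duck", doc_duck)
```

Word matching has no notion of word order, part of speech or meaning, so it cannot tell the car from the president or the verb from the bird.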
Basic Concepts in Information Retrieval:
(i) User Task and
(ii) Logical View of documents
The User Task:
Two user tasks: retrieval and browsing
[Figure: the USER interacts with the DB through two tasks, Retrieval and Browsing]
The User Task: Retrieval
• It is the process of retrieving information whereby the main objective is clearly defined from the onset of the searching process.
• The user of a retrieval system has to translate his information need into a query in the language provided by the system.
• In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task
• English language statement: I want a book by J. K. Rowling titled The Chamber of Secrets
The User Task: Browsing
• It is the process of retrieving information whereby the main objective is not clearly defined from the beginning and whose purpose might change during the interaction with the system.
• E.g. a user might search for documents about ‘car racing’. Meanwhile he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, he might turn his attention to a document providing ‘directions to Addis’, and from this to documents which cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing in the collection and not searching, since the user may simply be interested in glancing around
Logical View of Documents
• Documents in a collection are frequently represented by a set of index terms or keywords
• Such keywords are mostly extracted directly from the text of the document
• These representative keywords provide a logical view of the document
[Figure: Docs -> Tokenization -> stop words -> stemming -> Indexing; the logical view moves from full text to index terms]
• Document representation is viewed as a continuum, in which the logical view of documents might shift from full text to index terms
Logical view of documents
If full text:
Each word in the text is a keyword
Most complex form….why?
Expensive….why?
If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations
• They reduce the complexity of the document representation and allow moving the logical view from that of a full text to a set of index terms
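The text operations named on the previous slide (tokenization, stop-word removal, stemming) can be sketched as follows. The stop list is a tiny illustrative sample, and `naive_stem` is a deliberately crude suffix stripper written for this example; it is not Porter's algorithm or any other real stemmer.

```python
import re

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Split lower-cased text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common suffixes, keeping at least three characters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Move from the full text to a sorted set of index terms."""
    return sorted({naive_stem(t) for t in remove_stop_words(tokenize(text))})

terms = index_terms("The users are searching the documents in collections")
```

Eight words of full text shrink to four index terms, which is the shift in logical view that the slide describes.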
Structure of an IR System
• An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.
[Figure: User <-> Black box <-> Documents]
• The black box is the information retrieval system.
• To be effective in its attempt to satisfy the information need of users, the IR system must somehow ‘interpret’ the contents of documents in a collection and rank them according to their degree of relevance to the user query.
• Thus the notion of relevance is at the centre of IR
• The primary goal of an IR system is to retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible
Typical IR Task
• Given:
  • A collection of textual natural-language documents
  • A user query in the form of a textual string
• Process:
  • Produce a ranked set of documents that are assumed to be relevant to the user query
• Measure of Effectiveness (accuracy):
  • Precision: the proportion of retrieved docs that are relevant
  • Recall: the proportion of relevant docs in the whole collection that are retrieved
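The two effectiveness measures can be computed as in the minimal sketch below. The document IDs are invented for illustration, and the function assumes the set of relevant documents is known, which in practice requires human relevance judgments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant docs retrieved / all docs retrieved
    recall    = relevant docs retrieved / all relevant docs in the collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).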
Typical IR System Architecture
[Figure: the Query String and the Document corpus are inputs to the IR System, which outputs Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, …)]
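Web Search System (e.g.: Google)
[Figure: a Web crawler (spider) collects pages from the Web into the Document corpus; the Query String and the Document corpus are inputs to the IR System, which outputs Ranked Documents (1. Page1, 2. Page2, 3. Page3, …)]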
What is Information Retrieval ?
A good formal definition of information retrieval is
given in Baeza-Yates & Ribeiro-Neto (1999, p. 1)
“Information retrieval deals with representation,
storage, organization of, and access to information
items. The organization and access of information items
should provide the user with easy access to the
information in which he is interested”
The definition incorporates all important features of a
good information retrieval system
Representation
Storage
Organization
Access
The focus is on the user information need
Overview of the Retrieval process
[Figures: two overview diagrams of the retrieval process, not reproduced]
The Retrieval Process
It is necessary to define the text collection before any
of the retrieval processes are initiated
This is usually done by the manager of the database and
includes specifying the following
The documents to be used
The operations to be performed on the text
The text model to be used (the text structure and what elements
can be retrieved)

The text operations transform the original documents
and the information needs and generate a logical view
of each document
The Retrieval Process
Once the logical view of the documents is defined, the
database module builds an index of the text
An index is a critical data structure
It allows fast searching over large volumes of data
Reduces the vocabulary of the collection

Different index structures might be used, but the most
popular one is the inverted file (more on this later)
Once the document database is indexed, the retrieval
process can be initiated
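As a preview of the inverted file, here is a minimal sketch that maps each term to the documents containing it. The three-document collection is invented for illustration and terms are split on whitespace only; a real indexer would first apply the text operations discussed earlier.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def boolean_and(index, terms):
    """Return the IDs of documents that contain every one of the given terms."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical collection.
docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "information storage and retrieval",
}
index = build_inverted_index(docs)
both = boolean_and(index, ["information", "retrieval"])
```

Looking up a term is a single dictionary access instead of a scan over every document, which is why the index allows fast searching over large volumes of data.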
The Retrieval Process …
The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text
Next, the query operations are applied before the actual query, which provides a system representation for the user need, is generated
The query is then processed (compared against the index) to obtain the retrieved documents
Before the retrieved documents are presented to the user, they are ranked according to their likelihood of relevance
The user then examines the set of ranked documents in the search for useful information. Two choices for the user:
(i) reformulate the query and run it on the entire collection, or (ii) reformulate the query and run it on the result set
At this point, he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle
In such a cycle, the system uses the documents selected by the user to change the query formulation.
Hopefully, this modified query is a better representation of the real user need
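One very simple way to picture the feedback cycle is query expansion: terms that occur often in the documents the user selected are added to the query. The sketch below is an illustrative assumption only. It is not the method of any particular system, and the function name `expand_query` and the sample texts are hypothetical.

```python
from collections import Counter

def expand_query(query_terms, selected_docs, extra_terms=2):
    """Add the most frequent terms of the user-selected documents to the query."""
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Sort by descending frequency, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]

# The user searched for "car" and marked these two documents as interesting.
selected = ["car racing formula racing", "racing car drivers"]
new_query = expand_query(["car"], selected)
```

The reformulated query now carries terms taken from documents the user judged useful, so the next search pass can favour similar documents.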
Detail view of the Retrieval Process
[Figure: detailed view of the retrieval process. Components: User Interface, Text Operations, Query Language & Operations, Indexing Module, DB manager, Searching, Ranking, Index (inverted file), Text Database. Labelled flows: user need, text, logical view, user feedback, query, retrieved docs, ranked docs.]
Issues in IR
Text representation
what makes a “good” representation?
how is a representation generated from text?
what are retrievable objects and how are they organized?

Information needs representation

what is an appropriate query language?
how can interactive query formulation and refinement be supported?

Comparing representations
to identify relevant documents
What weighting scheme and similarity measure to be used?
what is a “good” model of retrieval?

Evaluating effectiveness of retrieval

what are good metrics?
what constitutes a good experimental test bed?
Focus in IR System Design
Our focus during IR system design is:
Improving the effectiveness of the system
Effectiveness of the system is measured in terms of:
precision,
recall, …
Techniques: stemming, stop words, weighting schemes, matching algorithms
Improving the efficiency of the system
The concern here is storage space usage, access time, searching time, data transfer time …
There are space-time tradeoffs!
Techniques: compression, data/file structures, etc.
Subsystems of an IR system
The two subsystems of an IR system:
Indexing: an offline process of organizing documents
using keywords extracted from the collection
Searching: an online process of finding relevant
documents in the index list as per the user’s query
Indexing and searching are unavoidably connected
you cannot search what was not first indexed in some manner
documents or objects are indexed in order to be searchable
there are many ways to do indexing
to index, one needs an indexing language
there are many indexing languages
even taking every word in a document is an indexing language
Knowing searching is knowing indexing
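Indexing Subsystem
[Figure: indexing pipeline]
documents -> Assign document identifier -> text and document IDs
text -> Tokenize document -> tokens
tokens -> Stop list -> non-stoplist tokens
non-stoplist tokens -> Stemming & Normalize -> stemmed terms
stemmed terms -> Term weighting -> terms with weights
terms with weights -> Index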
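The last two stages of the indexing pipeline (term weighting, then storing terms with weights in the index) can be sketched as follows. The sketch assumes one common TF*IDF variant, raw term frequency multiplied by log(N / df); other variants exist. The input documents are assumed to be already tokenized, stop-listed and stemmed, and the sample terms are invented.

```python
import math
from collections import Counter

def build_weighted_index(docs):
    """Return {term: {doc_id: tf * idf}} for pre-processed documents (lists of terms)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    index = {}
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            idf = math.log(n_docs / doc_freq[term])
            index.setdefault(term, {})[doc_id] = tf * idf
    return index

# Hypothetical stemmed terms for two documents.
stemmed_docs = {
    1: ["inform", "retriev", "inform"],
    2: ["databas", "retriev"],
}
weighted_index = build_weighted_index(stemmed_docs)
```

A term that occurs in every document, such as "retriev" here, gets weight 0 under this variant because it does not help tell documents apart.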
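Searching Subsystem
[Figure: searching pipeline]
query -> Parse query -> query tokens
query tokens -> Stop list -> non-stoplist tokens
non-stoplist tokens -> Stemming & Normalize -> stemmed terms
stemmed terms -> Term weighting -> query terms
query terms + index terms (from the Index) -> Similarity Measure -> relevant document set
relevant document set -> Ranking -> ranked document set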
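The similarity and ranking stages of the searching pipeline can be sketched with cosine similarity, one of the similarity measures listed in the course content. The term weights below are made-up values for illustration. Each vector is a dictionary of term weights, as a weighted index would supply.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return IDs of documents with non-zero similarity, most similar first."""
    scores = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scores = [(s, d) for s, d in scores if s > 0]
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scores]

# Hypothetical weighted document vectors and query vector.
doc_vecs = {
    "d1": {"inform": 1.0, "retriev": 1.0},
    "d2": {"databas": 1.0, "retriev": 1.0},
    "d3": {"music": 1.0},
}
ranking = rank({"inform": 1.0, "retriev": 1.0}, doc_vecs)
```

Documents sharing no term with the query score 0 and are left out of the ranked set.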
End of Chapter One
