Ch1 IR

Uploaded by

Tafa Tulu


Introduction to Information Storage and Retrieval
Information Technology Department

Altaseb A.
Mekdela Amba UNIVERSITY
Introduction of the course
• Course Contents
• Chapters 1-7
• Assignments
• Readings and Discussion
• Hands-On use of IR systems
• Term paper
• Grading
• Readings

10/28/2024
Purposes of the Course
• To describe components and kinds of IR systems (IRS)
• To impart a basic theoretical understanding of IR models
  • Boolean
  • Vector Space
  • Probabilistic (including Language Models)
• To examine major application areas of IR, including:
  • Web Search
  • Text categorization and clustering
  • Cross-language retrieval
  • Text summarization
  • Digital Libraries
• To understand how IR performance is measured:
  • Recall/Precision
  • Statistical significance
• To gain hands-on experience with IR systems
Introduction cont’d: Information Retrieval
• Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
• Information is organized into (a large number of) documents
  • Large collections of documents from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
  • Example: Web search engines like Google claim to index trillions of pages

General Goal of Information Retrieval
• To help users find useful information based on their information needs (with minimum effort) despite
  • Increasing complexity of information
  • Changing needs of users
• To provide immediate random access to the document collection.
• Retrieval systems, such as Google and Yahoo, are developed with this aim.

Information Retrieval vs. Data Retrieval
⚫ The emphasis of IR is on the retrieval of information, rather than on the retrieval of data

Data retrieval
• Consists mainly of determining which documents contain a set of keywords in the user query (which is not enough to satisfy the user's information need)
• Aims at retrieving all objects that satisfy well-defined semantics
• A single erroneous object among a thousand retrieved objects implies failure
• Mainly designed for structured databases

Information retrieval
• Is concerned with retrieving information about a subject or topic rather than retrieving data that satisfies a given query
• Semantics is frequently loose: the retrieved objects might be inaccurate
• Small errors are tolerated

Information Retrieval vs. Data Retrieval
• An example of a data retrieval system is a relational database

                    Data Retrieval                          Info Retrieval
Data organization   Structured                              Unstructured
Fields              Clear semantics (ID, Name, age, …)      No fields (other than text, images, etc.)
Matching            Exact (results are always “correct”)    Partial match, best match
Items wanted        Matching                                Relevant
Accuracy            100%                                    < 50%
Error response      Sensitive                               Insensitive
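
The contrast between exact matching and best matching can be sketched in a few lines of Python. This is only an illustration: the records, the documents, and the function names `data_retrieval` and `information_retrieval` are invented for the example, and scoring by the number of shared words is an assumed stand-in for a real ranking function.

```python
# Toy data invented for this illustration.
records = [
    {"id": 1, "name": "Abebe", "age": 21},
    {"id": 2, "name": "Sara", "age": 34},
]
documents = {
    "d1": "information retrieval finds relevant documents",
    "d2": "relational databases store structured data",
    "d3": "web search engines rank documents by relevance",
}


def data_retrieval(rows, field, value):
    """Exact match: a row either satisfies the condition or it does not."""
    return [row["id"] for row in rows if row[field] == value]


def information_retrieval(docs, query):
    """Best match: score each document by how many query words it shares."""
    query_words = set(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap > 0:
            scores[doc_id] = overlap
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


exact = data_retrieval(records, "age", 34)
ranked = information_retrieval(documents, "search engines rank documents")
```

Here the data retrieval call returns only the rows that satisfy the condition, while the IR call returns every partially matching document, ordered from best to worst match.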

IR vs. Knowledge Retrieval

• Knowledge Retrieval (see current Information Extraction technology) answers specific questions by analysing an unstructured information source; e.g. a user could ask “What is the capital of France?” and the system would answer “Paris” by ‘reading’ a book about France

Why is IR so hard?
• Traditional information retrieval (IR) systems attempt to find relevant documents to respond to a user’s request.
• Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
  • The real problem boils down to matching the language of the query to the language of the document.
  • Simply matching on words is a very brittle (inelastic) approach. One word can have different meanings. Consider the word “take”:
    • “take a place at the table”
    • “take money to the bank”
    • “take a picture”

More Problems with IR
• You can’t even tell what part of speech a word has:
  • “I saw her duck”
  • A query that searches for “pictures of a duck” will find documents that contain:
    • “I saw her duck away from the ball falling from the sky”
• Proper nouns often use regular nouns
  • Consider a document with “a man named Abraham owned a Lincoln”
  • A word-matching query for “Abraham Lincoln” may well find the above document.
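
The “Abraham Lincoln” problem can be reproduced with a minimal sketch. The helper names `tokenize`, `word_match`, and `phrase_match` are hypothetical, and requiring the words to be adjacent is just one simple way to tighten the match, not a full solution to the ambiguity problems above.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_match(query, document):
    """True if every query word occurs somewhere in the document."""
    doc_words = set(tokenize(document))
    return all(word in doc_words for word in tokenize(query))


def phrase_match(query, document):
    """True only if the query words occur next to each other, in order."""
    q, d = tokenize(query), tokenize(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


doc = "a man named Abraham owned a Lincoln"
loose = word_match("Abraham Lincoln", doc)     # the car owner is matched
strict = phrase_match("Abraham Lincoln", doc)  # the words are not adjacent
```

Word matching accepts the document about the car owner, while the adjacency check rejects it.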

Basic Concepts in Information Retrieval: (i) User Task and (ii) Logical View of Documents

The User Task:
Two user tasks: retrieval and browsing

[Figure: the USER reaches the database (DB) through two tasks, Retrieval and Browsing]
The User Task: Retrieval
• It is the process of retrieving information whereby the main objective is clearly defined from the onset of the searching process.
• The user of a retrieval system has to translate his or her information need into a query in the language provided by the system.
• In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task
• English language statement:
  I want a book by J. K. Rowling titled The Chamber of Secrets

Browsing
• It is the process of retrieving information whereby the main objective is not clearly defined from the beginning and may change during the interaction with the system.
• E.g. a user might search for documents about ‘car racing’. Meanwhile, he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, he might turn his attention to a document providing ‘directions to Addis’, and from this to documents which cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing the collection rather than searching, since he may simply be interested in glancing around

Logical View of Documents

Documents in a collection are frequently represented by a set of index terms or keywords

Such keywords are mostly extracted directly from the text of the document

These representative keywords provide a logical view of the document

[Figure: Docs -> Tokenization -> stop words -> stemming -> Indexing; the representation moves from Full text to Index terms]

Document representation viewed as a continuum, in which the logical view of documents might shift from full text to index terms
Logical view of documents
• If full text:
  • Each word in the text is a keyword
  • Most complex form
  • Expensive
• If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations
• Text operations reduce the complexity of the document representation and allow moving the logical view from that of a full text to a set of index terms
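
As a rough sketch of such text operations, the Python below moves a sentence from full text to a small set of index terms. The stop list and the suffix rules are assumptions made for this example only; a real system would use a complete stop list and a proper stemmer (for instance Porter's algorithm).

```python
import re

# Tiny illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}
# Crude suffix rules, tried in this order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Full text -> tokens -> non-stop-word tokens -> stems -> unique terms."""
    return sorted({stem(t) for t in remove_stop_words(tokenize(text))})


terms = index_terms("The users are searching the indexed documents")
```

Seven words of full text shrink to four index terms, which is the shift in logical view described above.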

Information Retrieval Systems?
• Document (Web page) retrieval in response to a query
  • Quite effective (at some things)
  • Commercially successful (some of them)
• But what goes on behind the scenes?
  • How do they work?
  • What happens beyond the Web?
Examples of IR systems
⚫ Conventional (library catalog): Search by keyword, title, author, etc.
⚫ Text-based (Lexis-Nexis, Google, FAST): Search by keywords. Limited search using queries in natural language.
⚫ Multimedia (IBM's QBIC, WebSEEk, SaFe): Search by visual appearance (shapes, colors, …).
⚫ Question answering systems (AskJeeves, Answerbus): Search in (restricted) natural language
⚫ Other:
  ⚫ Cross-language information retrieval
  ⚫ Music retrieval

WebSEEk Search Engine

Structure of an IR System
• An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
• That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.

[Figure: User <-> Black box <-> Documents]

The black box is the information retrieval system.

Structure of an IR System
• To be effective in its attempt to satisfy the information needs of users, the IR system must ‘interpret’ the contents of documents in a collection and rank them according to their degree of relevance to the user query.
• Thus the notion of relevance is at the center of IR
• The primary goal of an IR system is to retrieve all the documents which are relevant to a user query while retrieving as few non-relevant documents as possible
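
The two halves of this goal correspond to recall (how many of the relevant documents were retrieved) and precision (how many of the retrieved documents are relevant). A minimal sketch, assuming the relevant set is known in advance; the document ids and the function name `precision_recall` are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned four documents; three documents are actually relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).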

Structure of an IR System
Typical IR Task
• Given:
  • A corpus of textual natural-language documents.
  • A user query in the form of a textual string.
• Find:
  • A ranked set of documents that are relevant to the query.

[Figure: the document corpus and the query string feed the IR system, which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]

Web Search System

[Figure: a Web spider gathers pages into the document corpus; the corpus and the query string feed the IR system, which returns ranked documents: 1. Page1, 2. Page2, 3. Page3, ...]

Information Retrieval and Retrieval process
• A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999):
  “Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which he is interested”
• The definition incorporates all important features of a good information retrieval system
  • Representation
  • Storage
  • Organization
  • Access
• The focus is mainly on the user information need
Overview of the Retrieval process

The Retrieval Process
• It is necessary to define the text database before any of the retrieval processes are initiated
• This is usually done by the manager of the database and includes specifying the following
  • The documents to be used
  • The operations to be performed on the text
  • The text model to be used (the text structure and what elements can be retrieved)
• The text operations transform the original documents and the information needs and generate a logical view of them

The Retrieval Process …
• Once the logical view of the documents is defined, the database module builds an index of the text
• An index is a critical data structure
  • It allows fast searching over large volumes of data
• Different index structures might be used, but the most popular one is the inverted file
• Given that the document database is indexed, the retrieval process can be initiated
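
An inverted file maps each term to the list of documents that contain it, so a query only touches the lists of its own terms instead of scanning every document. The sketch below is a simplified in-memory version; the three documents and the function names are illustrative assumptions, and whitespace splitting stands in for the text operations.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search_all_terms(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "car racing in ethiopia",
    2: "car manufacturers in addis",
    3: "tourism in ethiopia",
}
index = build_inverted_index(docs)
result = search_all_terms(index, "car ethiopia")
```

Answering the query needs only the two posting lists for "car" and "ethiopia", which is what makes searching fast on large collections.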

The Retrieval Process …
⚫ The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text
⚫ Next, the query operations are applied before the actual query, which provides a system representation for the user need, is generated
⚫ The query is then processed to obtain the retrieved documents
⚫ Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance

The Retrieval Process …
⚫ The user then examines the set of ranked documents in search of useful information. Two choices for the user:
  ⚫ (i) reformulate the query and run it on the entire collection, or
  ⚫ (ii) reformulate the query and run it on the result set
⚫ At this point, s/he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle
⚫ In such a cycle, the system uses the documents selected by the user to change the query formulation.
⚫ Hopefully, this modified query is a better representation of the real user need
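
The slides do not name a particular feedback method. One classical choice is a Rocchio-style update, sketched below under that assumption: the new query is the old query plus a fraction of the average vector of the documents the user selected. The weights `alpha` and `beta`, the raw term counts, and the function names are illustrative choices, not values taken from the course.

```python
from collections import Counter


def to_vector(text):
    """Represent a text as raw term counts."""
    return Counter(text.lower().split())


def feedback_query(query_vec, selected_docs, alpha=1.0, beta=0.75):
    """Rocchio-style update using only the documents the user selected."""
    new_query = {term: alpha * weight for term, weight in query_vec.items()}
    for doc_vec in selected_docs:
        for term, weight in doc_vec.items():
            # Add this document's share of the centroid of the selected set.
            share = beta * weight / len(selected_docs)
            new_query[term] = new_query.get(term, 0.0) + share
    return new_query


query = to_vector("car racing")
selected = [to_vector("formula car racing teams"), to_vector("racing circuits")]
new_query = feedback_query(query, selected)
```

The modified query keeps the original terms with higher weights and gains new terms such as "formula" and "circuits" from the selected documents.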

Detail view of the Retrieval Process
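[Figure: the user need enters through the User Interface; Text Operations produce the logical view of the user need and of the text; Query Operations turn it into a query, optionally refined by user feedback; the Indexing module, via the DB manager, builds the index (inverted file) over the Text Database; Searching uses the index to obtain retrieved docs; Ranking returns ranked docs to the user]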
Issues that arise in IR
• Text representation
  • What makes a “good” representation?
  • How is a representation generated from text?
  • What are retrievable objects and how are they organized?
• Information needs representation
  • What is an appropriate query language? E.g. weighting and ranking, relevance-orientation, or semantic relativism, etc.
  • How can interactive query formulation and refinement be supported?
• Comparing representations (to identify relevant documents)
  • What weighting scheme and similarity measure should be used?
  • What is a “good” model of retrieval?
• Evaluating effectiveness of retrieval
  • What are good metrics/measurements?
  • What constitutes a good experimental test bed?

Focus in IR System Design
Our focus during IR system design is:
• Improving the effectiveness of the system
  • Effectiveness of the system is measured in terms of precision, recall, …
  • Stemming, stop words, weighting schemes, matching algorithms
• Improving the efficiency of the system
  • The concern here is storage space usage, access time, searching time, data transfer time, …
  • Concern regarding space-time tradeoffs
  • Use compression techniques, data/file structures, etc.
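
As one possible illustration of the space-time tradeoff, a posting list of document ids can be stored as gaps between neighbouring ids and then packed with variable-byte encoding. The slides do not specify a compression scheme, so this particular technique, the function names, and the sample ids are assumptions chosen for the example.

```python
def to_gaps(doc_ids):
    """Keep the first id, then store only the difference to the previous id."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_gaps(gaps):
    """Rebuild the original ids by summing the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def vb_encode_number(n):
    """Encode n in 7-bit groups; the last byte has its high bit set."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128
    return bytes(chunks)


def vb_encode(numbers):
    """Concatenate the variable-byte codes of all numbers."""
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    """Decode a variable-byte stream back into a list of integers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


postings = [824, 829, 215406]
encoded = vb_encode(to_gaps(postings))
decoded = from_gaps(vb_decode(encoded))
```

The three ids fit in 6 bytes instead of 12 bytes as fixed 32-bit integers, at the cost of decoding work at query time.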

Subsystems of an IR system
• The two subsystems of an IR system:
  • Searching: an online process of finding relevant documents in the index list as per the user's query
  • Indexing: an offline process of organizing documents using keywords extracted from the collection
• Indexing and searching are unavoidably connected
  • You cannot search what was not first indexed
  • Documents or objects are indexed in order to be searchable
  • To index, one needs an indexing language
  • There are many indexing languages
  • Even taking every word in a document is an indexing language

Indexing Subsystem
[Figure: documents -> Assign document identifier -> text and document IDs -> Tokenize -> tokens -> Stop list -> non-stoplist tokens -> Stemming & Normalize -> stemmed terms -> Term weighting -> terms with weights -> Index]
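
The term weighting step at the end of the indexing pipeline can be sketched with tf-idf, a widely used scheme. The slides do not say which weighting scheme the course uses, so the formula `tf * log(N / df)`, the function name, and the already-stemmed sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf_index(docs):
    """Return {term: {doc_id: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    index = {}
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            weight = tf * math.log(n_docs / doc_freq[term])
            index.setdefault(term, {})[doc_id] = weight
    return index


# Documents after tokenizing, stop-word removal and stemming (toy data).
stemmed_docs = {
    1: ["car", "race", "car"],
    2: ["car", "maker"],
    3: ["tour", "ethiopia"],
}
weighted_index = tf_idf_index(stemmed_docs)
```

A term that is frequent in one document but rare in the collection gets a high weight; a term that occurs in every document would get weight zero.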
Searching Subsystem
[Figure: query -> parse query -> query tokens -> Stop list -> non-stoplist tokens -> Stemming & Normalize -> stemmed terms -> Term weighting -> query terms; the query terms and the index terms (from the Index) go into the Similarity Measure -> relevant document set -> ranking -> ranked document set]
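
The similarity measure and ranking steps can be sketched with cosine similarity, one common choice in the vector space model. This is a simplified illustration: it uses raw term counts as weights, skips stop-word removal and stemming, and the documents and function names are invented for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return (doc_id, score) pairs with non-zero similarity, best first."""
    q = Counter(query.lower().split())
    scored = [
        (doc_id, cosine(q, Counter(text.lower().split())))
        for doc_id, text in docs.items()
    ]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))


collection = {
    "d1": "car racing news",
    "d2": "car manufacturers in addis",
    "d3": "tourism in ethiopia",
}
ranking = rank("car racing", collection)
```

The document sharing both query terms is ranked first, the one sharing a single term second, and the unrelated document is not returned at all.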

Thank you
