Chapter One IR

The document provides an overview of Information Storage and Retrieval (ISR) aimed at IT 3rd year students, covering topics such as the retrieval process, indexing structures, IR models, and evaluation metrics. It highlights the challenges of processing large collections of documents and the importance of relevance in information retrieval. Additionally, it discusses various types of IR systems and their architecture, emphasizing the need for effective query formulation and user feedback in the retrieval process.

Uploaded by

bekeletamirat931
Copyright
© All Rights Reserved

Information Storage and Retrieval

Chapter One
Introduction to ISR
Target Group – IT 3rd year students

Injibara, Ethiopia
Course Outline
Topic(s) | Details
Overview of IR | Define IR; The retrieval process; Basic structure of an IR system
Text Document Operations | Basic Laws in IR; Tokenization; Stop word detection; Stemming; Normalization; Term weighting; Similarity measures
Indexing Structures | The need for indexing; Sequential file; Inverted files
IR Models | A Formal Characterization of IR Models; Boolean model, Vector space model & Probabilistic model
Retrieval Evaluation | Evaluation of IR systems; Relevance judgement; Retrieval effectiveness measures (Recall, Precision, F-measure, etc.)
Query Languages | Types of Query formulation; Keyword-based queries (Boolean queries); Pattern matching; Natural language queries
Current Issues in IR | IR in Local Languages; Information Extraction; Information Filtering; Text Summarization, Cross-language retrieval...
Text Collections and IR
• Information is organized into (a large number of) documents
– Large collections of documents are available from various sources: books, magazines, newspapers, journal articles, conference papers, digital libraries, Web pages, etc.

• Example: How Much Data?
– Google processed 20 petabytes a day (2008)
– The Google Web search engine claims to index over 30 trillion pages (1995-2014)
– It performed more than 40,000 search queries each second on average and over 5.2 billion searches per day in 2017, and 4 trillion per year worldwide
– The Wayback Machine had 50 PB of used storage (2014)
– Facebook had 100 PB of user data (2012)
– eBay had 6.5 PB of user data + 50 TB/day (2009)
Storage of Text
• Textual Documents
– Searchable as text
– Words are represented as ASCII/Unicode
• Image Documents
– Scanned image of a text document, which is not searchable as text: text (characters, words, etc.) is represented as patterns of pixels
– Retrieval from document images: two options
• Recognition-based retrieval: OCR is required to convert document images to ASCII (which may be error-prone); text IR systems are then applied to the recognized documents
• Recognition-free retrieval: retrieval from document images without explicit recognition, i.e. searching for relevant documents directly in the image collection
The Problem of IR
• Need
– The size and number of published documents keep increasing
– Traditional methods have difficulties processing documents
– Different disciplines (Biotechnology, Genetics, ...) produce huge amounts of different types of data

[Figure: the user's information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list]

• Goal
– Find documents relevant to an information need from a
large document set
What is Information Retrieval?

• Information retrieval is the process of searching a large unstructured corpus for relevant documents that satisfy the information need of users
– It is a tool that finds and selects from a collection of items a subset that serves the user’s purpose

• Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
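
The idea of selecting "a subset that serves the user's purpose" can be illustrated with a minimal Python sketch. The three documents, the query, and the helper names `words` and `select` are made up for illustration, and matching every query word exactly is only one simple assumption about what "serves the purpose" means; real IR systems use the richer models listed in the course outline.

```python
# Toy collection: the documents and the query are invented for illustration.
documents = {
    "d1": "Information retrieval finds relevant documents in large collections",
    "d2": "Database systems store structured records in tables",
    "d3": "Search engines apply information retrieval to Web pages",
}


def words(text):
    """Lowercase the text and split it on whitespace into a set of words."""
    return set(text.lower().split())


def select(query, collection):
    """Return the IDs of the documents that contain every query word."""
    query_words = words(query)
    return sorted(
        doc_id
        for doc_id, text in collection.items()
        if query_words <= words(text)  # subset test: all query words present
    )


matches = select("information retrieval", documents)
print(matches)
```

Here the subset is chosen by exact word containment, so a document that discusses the topic using different vocabulary would be missed. That gap is one reason relevance is harder than string matching.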
Examples of IR Systems
• Most IR systems focus specifically on text retrieval, but there are many other IR areas:
– Cross-language retrieval, text summarization, information filtering, question answering, content-based multimedia (audio, image and video) retrieval
• Text-based (Lexis-Nexis, Google, FAST):
– Search by keywords.
– Limited search using queries in natural language.
• Multimedia (WebSeek, SaFe):
– Search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus):
– Search in (restricted) natural language
• Cross-language vs. multilingual information retrieval
Information Retrieval Serves as a Bridge
• An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users
• That is, writers present a set of ideas in a document using a set of concepts
• Users then query the IR system for relevant documents that satisfy their information need

[Figure: the IR system shown as a black box between the User and the Documents]
Typical IR System Architecture

[Figure: a query string and a document corpus are the inputs to the IR system, which outputs a ranked list of relevant documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
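
The architecture above (query string plus document corpus in, ranked documents out) can be sketched in a few lines of Python. This is a minimal sketch, not the system described in the slides: the corpus, the query, and the function names are hypothetical, and scoring by cosine similarity over raw term counts is an assumed choice standing in for whichever IR model a real system uses.

```python
import math
from collections import Counter

# Hypothetical three-document corpus used only for this illustration.
corpus = {
    "Doc1": "retrieval of text documents from a large corpus",
    "Doc2": "ranking documents by relevance to a query",
    "Doc3": "cooking recipes for a quick dinner",
}


def term_counts(text):
    """Represent a text as a bag of lowercase words with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return the IDs of documents with a non-zero score, best first."""
    q = term_counts(query)
    scored = [(cosine(q, term_counts(text)), doc_id) for doc_id, text in docs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Sort by descending score; ties are broken by document ID.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


ranked = rank("ranking documents", corpus)
print(ranked)
```

Doc2 shares both query words and is ranked first, Doc1 shares one, and Doc3 shares none and is left out of the answer list.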
The Notion of Relevance
• Relevance is a subjective judgment and may include:
– Being timely (recent information)
– Being authoritative (from a trusted source)
– Satisfying the goals of the user and his/her intended use of the information (information need)
• Relevant information is information that suits your information need
• What is actually needed (relevant) depends on the user, space/time, group and context

• IR is very concerned with relevance


IR System vs. Web Search System

[Figure: the same architecture as the typical IR system, except that a Web Spider collects the document corpus and the ranked relevant documents are Web pages (1. Page1, 2. Page2, 3. Page3, ...)]
The Retrieval Process
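[Figure: the retrieval process. The user states a user need through the User Interface. Text Operations produce a logical view of both the user need and the text in the Text Database. Query Formulation, refined by user feedback, produces the Query. Indexing builds the index file (an inverted file). Searching uses the index to produce the retrieved docs, and Ranking turns them into the ranked docs shown to the user.]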
The Retrieval Process
• It is necessary to define the text database before any of
the retrieval processes are initiated
• The text operations transform the original documents & the
information needs and generate a logical view of them
• Once the logical view of the documents is defined, the
database module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
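
To make the indexing step concrete, here is a minimal sketch of building an inverted file in Python. The documents, the function names, and the very simplified `text_operations` (lowercasing and whitespace splitting only) are assumptions for illustration; tokenization, stop word removal, and stemming are treated properly later in the course.

```python
from collections import defaultdict

# Hypothetical text database: document ID -> text.
documents = {
    1: "Information retrieval finds relevant documents",
    2: "An index allows fast searching",
    3: "The inverted file is a popular index",
}


def text_operations(text):
    """Simplified logical view of a text: a list of lowercase words."""
    return text.lower().split()


def build_inverted_file(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


inverted_file = build_inverted_file(documents)
print(inverted_file["index"])
```

Looking up a term is now a single dictionary access instead of a scan over every document, which is why the index allows fast searching over large volumes of data.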
The Retrieval Process
• Different index structures might be used, but the most popular one is the inverted file (more on this later), as indicated in the retrieval process figure.

• Once the document database is indexed, the retrieval process can be initiated.

• The user first specifies a user need, which is then parsed & transformed by the same text operations applied to the text.

– Next, the query operations are applied before the actual query, which provides a system representation of the user need, is generated.
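
The point that the user need passes through the same text operations as the documents can be sketched as follows. Everything named here is an assumption for illustration: the tiny stop word list, the punctuation stripping, the sample documents, and the choice of a Boolean AND over the query terms as the "query operation".

```python
STOP_WORDS = {"the", "a", "an", "of", "for", "is"}  # tiny illustrative list


def text_operations(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    tokens = (tok.strip(".,;:!?") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


# Hypothetical text database.
documents = {
    1: "The retrieval of documents.",
    2: "Indexing a collection of documents for fast retrieval.",
    3: "The user need is the starting point.",
}

# Index the documents using the text operations.
index = {}
for doc_id, text in documents.items():
    for term in text_operations(text):
        index.setdefault(term, set()).add(doc_id)


def search(user_need):
    """Apply the same text operations to the user need, then AND the terms."""
    terms = text_operations(user_need)
    if not terms:
        return []  # the user need contained only stop words
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


result = search("The Retrieval of Documents!")
print(result)
```

Because the query is normalized in the same way as the text, differences in case, punctuation, and stop words do not prevent a match. If the query skipped these operations, "Retrieval" would never match the indexed term "retrieval".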
The Retrieval Process
• The query is then processed to retrieve documents.
– Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance.

• The user then examines the set of ranked documents in search of useful information

• At this point, the user might pinpoint a subset of the documents seen as definitely of interest & initiate a user feedback cycle
– In such a cycle, the system uses the documents selected by the user to change the query formulation
– Hopefully, this modified query is a better representation of the real user need
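
One very simple way a system could "change the query formulation" from the documents the user selected is to add their most frequent new terms to the query. The sketch below does only that; it is a naive term-expansion heuristic chosen for illustration, not a specific feedback algorithm from the slides, and the documents, the function names, and the `extra_terms` parameter are hypothetical.

```python
from collections import Counter

# Hypothetical collection in which the query "jaguar" is ambiguous.
documents = {
    "d1": "jaguar speed and habitat in the rainforest",
    "d2": "jaguar car engine and top speed",
    "d3": "rainforest habitat of big cats",
}


def tokens(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def expand_query(query, selected_ids, docs, extra_terms=2):
    """Append the most frequent new terms from the user-selected documents."""
    query_terms = tokens(query)
    counts = Counter()
    for doc_id in selected_ids:
        counts.update(t for t in tokens(docs[doc_id]) if t not in query_terms)
    # Most frequent first; ties are broken alphabetically to stay deterministic.
    best = sorted(counts, key=lambda t: (-counts[t], t))[:extra_terms]
    return query_terms + best


# The user marks d1 and d3 (the animal, not the car) as definitely of interest.
new_query = expand_query("jaguar", ["d1", "d3"], documents)
print(new_query)
```

The modified query now carries terms from the documents the user found useful, so it leans toward the animal sense of "jaguar". A practical system would also filter stop words and weight terms rather than count them.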
Issues that arise in IR
1. Text document representation
– What makes a “good” representation?
– How is a representation generated from text?
– What are the retrievable objects & how are they organized?

2. Information need representation
– What is an appropriate query language?
– How can interactive query formulation & refinement be supported?

3. Comparing representations
– What is a “good” similarity measure & retrieval model?
– How is uncertainty represented?

4. Evaluating effectiveness of retrieval
– What are good metrics?
– What constitutes a good experimental test bed?
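
As a preview of the metrics question in issue 4, the course outline names Recall, Precision, and F-measure. A minimal sketch of how they are computed for a single query is given below; the document IDs and the relevance judgments are invented for illustration, and the F-measure shown is the balanced harmonic mean of precision and recall.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical run: the system returned four documents and three are relevant overall.
precision, recall, f_measure = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(precision, recall, f_measure)
```

Two of the four retrieved documents are relevant (precision 2/4) and two of the three relevant documents were found (recall 2/3). The relevance judgments themselves are assumed to be given, which is exactly what a good experimental test bed has to supply.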
Students’ Reflection:
What are the main components of an Information Retrieval System?
a) ____________________________________
b) ____________________________________
c) ____________________________________

What are the main differences between an Information Retrieval System and a Database Management System?
a) ____________________________________
b) ____________________________________
c) ____________________________________