1 IRIntro

The document discusses information retrieval and information retrieval systems. It defines information retrieval, describes the stages of the IR process, and discusses key issues like representation and organization of information and how to retrieve relevant information to match a user's needs.

Uploaded by

beshahashenafe20
Copyright
© © All Rights Reserved
Information Storage and

Retrieval (ISR)

Introduction

1
What is Information Retrieval ?
 The process of actively seeking out
information relevant to a topic of interest
(van Rijsbergen)
 Typically it refers to the automatic (rather than
manual) retrieval of documents
 Information Retrieval System (IRS)
 “Document” is the generic term for an information
holder (book, chapter, article, webpage, etc)

2
Information Retrieval
 Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need
from within large collections (usually stored on
computers).
 These days we frequently think first of web search,
but there are many other cases:
 E-mail search
 Searching your laptop
 Corporate knowledge bases
 Legal information retrieval

3
Motivation (why)

 Huge amount of information


 Information overload
 Difficulty in information access

4
The stages of IR: the big picture

[Figure: information creation, then indexing and organizing, producing indexed and structured information, then retrieval (searching and browsing)]

5
What IR assumes?
 Collection: A set of documents
 Assume it is a static collection for the moment
 Goal: Retrieve documents with information that
is relevant to the user’s information need and
helps the user complete a task
 Information is stored (or available)
 A user has an information need
 An automated system exists from which
information can be retrieved
 The system works!!
6
Sub Topics

 Overview of IR and IR systems


 Database retrieval Vs. information retrieval
 The (information) retrieval process
 Basic Structure of an IR System
 Types of IR system

7
Overview of IR and IR
Systems

8
Information Retrieval Systems?
Document (Web page) retrieval in response to a query
 Quite effective (at some things)
 Commercially successful (some of them)
But what goes on behind the scenes?
 How do they work?
 What happens beyond the Web?
Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
9
Web Search System

[Figure: a Web spider collects pages into a document corpus; the IR system takes a query string and the corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...)]

10
Information Retrieval - Definition

 Is an important sub-discipline of Information
Science that is concerned with developing
theories and methods of access to
information
 Focus is on helping users find information
that matches their information need (User
Centered View)

11
Cont…

 Is a branch of applied Computer Science that
focuses on representation, storage,
organization of, and access to information
items (System Centered View).

12
Cont…
 A good formal definition of information retrieval
is given in Baeza-Yates & Ribeiro-Neto (2011, p. 1):
 “Information retrieval deals with representation,
storage, organization of, and access to
information items such as documents, web
pages, online catalogs, structured and
semi-structured records, multimedia objects. The
representation and organization of the
information items should be such as to provide
the user with easy access to information of their
interest.”
13
Cont…

 The definition incorporates all important features of a


good information retrieval system
 Representation

 Storage

 Organization

 Access

 Evaluation

 As a field, IR focuses on advanced application of


computers
 Is about finding relevant information in large
collection of data

14
Cont…
 Information items: usually text, but possibly
also image, audio, video, etc.
 Text items are often referred to as

documents, and may be of different scope


(books, articles, paragraphs, etc.)
 IR involves helping users find information that
matches their information needs
 Its techniques and applications have reached
many fields where processing large amount of
information is essential
15
IR from different perspectives
 Conceptually,
 IR is used to cover all related problems in finding
needed information
 Historically,
 IR is about document retrieval, emphasizing
documents as the basic units
 Technically,
 IR refers to (text) string manipulation, indexing,
matching, querying, etc.

16
Information Retrieval
 Can be structured for ease of discussion as
 Text IR

 Discusses the classic problem of searching a collection of


documents for useful information
 Focuses on document images that are predominantly text
(rather than pictures)
 These are called textual images and are amenable to
automatic extraction of key words
 Multimedia IR
 Discusses how to index document images and other
binary data by extracting features from their content and
how to search them efficiently

17
Cont…

 Human-computer interaction (HCI) for IR


 Discusses current trends in IR towards improved
user interface and better data visualization tools
 Application of IR
 Covers modern applications of IR (such as the Web,
bibliographic systems, and digital libraries)

18
Entities in IRS

 Two important entities


 Information need: to be represented by search
statements (query)
 Information items (documents): to be
represented by index terms or any form of
representation like summary
 Thus the process in IRS is matching these
abstractions

19
Thus the focus is on

 How to organize and represent information


items effectively and efficiently
 How to represent information needs
 How to match these two

20
Key Issues IR
 Organizing
 How to describe information resources or
information-bearing objects in ways so that they
may be effectively used by those who need to use
them
 Retrieving
 How to find the appropriate information resources
or information-bearing objects for someone’s (or
your own) needs. Build a system that retrieves
documents that users are likely to find relevant to
their queries
 This set of assumptions underlies the field of IR

21
IR is an Iterative Process

[Figure: cycle of information stages (creation, organizing/indexing, storing/retrieval, accessing/filtering, distribution/networking, searching, utilization, retention/mining, disposition/discard), grouped into active, semi-active, and inactive phases]
22
IR-Representation/organizing

 The representation and organization


should provide the user with easy access
to the information in which he is interested
 Thus, focus is on the user information
need
 Unfortunately, characterization of the user
information need is not a simple problem

23
Cont…

 An example of complex User


information need:
 Find all documents that address the role of
the Federal Government in financing the
operations of the National Railroad
Transportation Corporation

24
Cont…
 Basic remarks on user information need (in the
context of the World Wide Web):
 Such a full description of the user information need is not
necessarily a good query to be submitted to the IR system
 The user must first translate this information need into a query
which can be processed by the search engine (or IR
system).
 In its most common form, the translation yields a set of
keywords (or index terms) which summarizes the user
information need
 Given the user query, the key goal of the IR system is
to retrieve information that is useful or relevant to the
user.
 Emphasis is on the retrieval of information (not data)

25
IR is an Iterative Process

[Figure: a sketch of a searcher moving through many actions (queries Q0 to Q5 against repositories) towards a general goal of satisfactory completion of research related to an information need]

26
IR- Storing/Retrieving

 Information storage
 How and where is information stored?
 Retrieving information
 How is information recovered from storage
 How to find needed information
 Linked with accessing/filtering stage

27
IR - Accessing/Filtering

 Using the organization created in the


Organization/Indexing stage to:
 Select desired (or relevant) information
 Locate that information
 Retrieve the information from its storage
location (often via a network)

28
Implementation

 Thus in order to meet the above key issues


the implementation is developing an
Information System
 Retrieval system

29
More on IR
 IR is concerned with retrieval of relevant
documents from a large collection of
documents
 Relevant documents are identified
according to specific criteria (usually
called query)
 IR usually deals with Natural Language
text which is not always well structured
and could be semantically ambiguous

30
Cont…

• IR deals with very large sets of documents


 High amount of robustness, efficiency
 Domain-independent & multi-linguality
• IR considers Natural Language text mainly from a
lexical view
 Identifying possible word forms
 Elimination of stop words (e.g., the, of, ...)
 Stemming (e.g., supporting, supported → support)
 Selection of index terms
 Term weighting

31
Data Vs Information Retrieval

32
Data Retrieval System
 Ex: relational databases
 Deals with data that has a well defined structure
and semantics
 While IR system deals with natural language text which is
not well structured
 A single erroneous object among a thousand
retrieved objects means total failure
 While in IR small errors are likely to go unnoticed
 Data retrieval does not solve the problem of
retrieving information about a subject or topic

33
Information Vs Data Retrieval

Features             Data Retrieval   Information Retrieval
Matching             Exact match      Partial or best match
Query language       Artificial       Natural
Query specification  Complete         Incomplete
Items wanted         Matching         Relevant
Error response       Sensitive        Insensitive
Items                Structured       Not well structured
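
The difference between exact match and partial or best match in the table above can be illustrated with a small sketch. The records, the word-overlap score, and the function names are assumptions chosen for the example; real IR systems use more refined similarity measures.

```python
# Toy collection; the records and texts are invented for illustration.
records = [
    {"id": 1, "text": "corporate takeover of an internet company"},
    {"id": 2, "text": "takeover bid rejected by shareholders"},
    {"id": 3, "text": "weather forecast for the weekend"},
]


def exact_match(records, text):
    """Data retrieval: an item either satisfies the condition or it does not."""
    return [r["id"] for r in records if r["text"] == text]


def best_match(records, query):
    """Information retrieval: rank items by how many query words they share."""
    query_words = set(query.lower().split())
    scored = []
    for r in records:
        overlap = len(query_words & set(r["text"].lower().split()))
        if overlap > 0:
            scored.append((overlap, r["id"]))
    # Highest overlap first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


query = "corporate takeover"
print(exact_match(records, query))  # prints []
print(best_match(records, query))   # prints [1, 2]
```

No record equals the query string, so exact matching returns nothing, while best matching still returns the partially matching items in ranked order.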
The Basic Structure of IR

35
High level structure of an IRS
 An Information Retrieval System serves as a
bridge between the world of authors and the
world of readers/users
 That is, writers present a set of ideas in a document
using a set of concepts. Users then search the IR
system for relevant documents that satisfy their
information need.

[Figure: the IR system as a black box between the user and the documents]
Structure of IR System

[Figure: concept map linking information need, request, query (formulation), matching, relevance, index, information item, and collection]

37
Structure of an IR System
[Figure: an information storage and retrieval system. On the search line, interest profiles and queries are translated, using the indexing language (rules for subject indexing plus a thesaurus of descriptors and lead-in vocabulary), into Store 1 (profiles/search requests). On the storage line, documents and data are translated in the same way into Store 2 (document representations). The two stores are compared/matched, and ranking yields potentially relevant documents.]
38
Architecture of an IR System
Indexing, retrieval and ranking
The Information Retrieval
Process

What does the basic retrieval process
look like?

41
The IR Process
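[Figure: the IR process. The user need enters through the user interface; text operations produce a logical view of both the query and the documents in the text database; query formulation (with user feedback) yields the query; indexing builds the index (inverted file); searching the index returns retrieved docs, which ranking turns into ranked docs.]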
IR Processes
 The user interface: think of the interfaces
available with current IR systems, including
 Web search engines
 It is necessary to define the text database before any
of the retrieval processes are initiated
 This is usually done by the manager of the database
and includes specifying the following
 The documents to be used
 The operations to be performed on the text
 The text model to be used (the text structure and
what elements can be retrieved)
 The text operations transform the original documents
and generate a logical view of them

43
Cont…
 Once the logical view of the documents is defined,
an index of the text is built
 An index is an optimized data structure that is
built on top of the information objects
 allowing faster access for the search process over
large volume of data
 The indexer:
 tokenizes the text (tokenization), removes words with little
semantic value (stop-words), unifies word families
(stemming)
 Different index structures might be used, but the
most popular one is the inverted file
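
As a rough illustration of the inverted file mentioned above, the sketch below maps each term to the documents that contain it. The whitespace tokenization, the toy `docs` dictionary, and the function name are assumptions made for the example; stop-word removal and stemming are omitted to keep it short.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Toy collection invented for illustration.
docs = {
    1: "information retrieval systems",
    2: "database retrieval",
    3: "information need",
}
index = build_inverted_index(docs)
print(index["retrieval"])    # prints [1, 2]
print(index["information"])  # prints [1, 3]
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is the faster access the slide refers to.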

44
Cont…
 Given the document database is indexed,
the retrieval process can be initiated
 Information retrieval is the process of matching the query
against the indexed information objects
 The user first specifies an information need
which is then parsed and transformed
 Using the same text operation applied to the text
 Then the query operations might be applied
before the actual query, which provides a
system representation for the user information
need, is generated
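
The point that the query goes through the same text operations as the documents can be sketched as below. The `normalize` function stands in for whatever text operations a system uses, and the Boolean AND matching is an assumption chosen for simplicity, not the only way to match a query against an index.

```python
def normalize(text):
    """The shared text operation: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Build a term -> set of document ids mapping using normalize()."""
    index = {}
    for doc_id, text in docs.items():
        for term in normalize(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = normalize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Toy collection invented for illustration.
docs = {
    1: "Information Retrieval systems",
    2: "Database retrieval",
    3: "Information need",
}
index = build_index(docs)
print(search(index, "INFORMATION retrieval"))  # prints [1]
```

Because both sides pass through `normalize`, the upper-case query term still matches the mixed-case document text.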
45
IR Processes
 The IR system responds by matching information
objects, which are relevant to a query
 Information retrieval focuses on finding relevant
information rather than simple pattern matching
 Relevance
 is a subjective notion
 depends on the task being solved and its context
 can change with time (e.g., new information becomes available)
 can change with location (e.g., the most important answer is
the closest one)
 can change with the device (e.g., the best answer is a short
document that is easier to download and visualize)
IR Processes
 A retrieval strategy (model) is an algorithm
and related structures that takes a query and
a set of documents and assigns a similarity
measure between the query and each
document
 similarity represents relevance to the user query
 Documents are ranked on the basis of their
similarity to the query
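
One possible similarity measure is the cosine between term-frequency vectors, sketched below. The slides do not name a specific model at this point, so the choice of cosine similarity, the raw term counts, and the toy documents are all assumptions for illustration.

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine of the angle between two raw term-frequency vectors."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_d = math.sqrt(sum(c * c for c in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity."""
    scores = [(doc_id, cosine_similarity(query, text)) for doc_id, text in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy collection invented for illustration.
docs = {
    "d1": "information retrieval",
    "d2": "information retrieval systems and databases",
    "d3": "cooking recipes",
}
for doc_id, score in rank("information retrieval", docs):
    print(doc_id, round(score, 3))
```

The document that shares all of its terms with the query ranks first, the longer document that only partly overlaps ranks second, and the unrelated one scores zero.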
Cont…
 Before the retrieved documents are sent to the user,
the retrieved documents are ranked according to the
likelihood of relevance
 The user then examines the set of ranked
documents in the search for useful information
 This process can be repeated and the query
can be modified
 The user may need to reformulate the query

48
Cont…

 At this point, he might pinpoint a subset of the


documents seen as definitely of interest and
initiate a user feedback cycle
 In such a cycle, the system uses the
documents selected by the user to change
the query formulation
 Hopefully, this modified query is a better
representation of the real user need
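
A very simplified version of such a feedback cycle is sketched below: terms that occur most often in the documents the user selected are appended to the query. This is only an illustrative term-frequency expansion under assumed names (`expand_query`, `extra_terms`); established methods such as Rocchio's also weight terms, which this sketch does not do.

```python
from collections import Counter


def expand_query(query, selected_docs, extra_terms=2):
    """Append the most frequent new terms from user-selected documents."""
    query_terms = query.lower().split()
    counts = Counter()
    for text in selected_docs:
        for term in text.lower().split():
            if term not in query_terms:
                counts[term] += 1
    # Sort by decreasing count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    new_terms = [term for term, _ in ranked[:extra_terms]]
    return " ".join(query_terms + new_terms)


# Documents the user marked as interesting; invented for illustration.
selected = ["jaguar car engine", "jaguar car dealer"]
print(expand_query("jaguar", selected, extra_terms=1))  # prints jaguar car
```

The ambiguous one-word query is narrowed towards the sense the user actually selected.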

49
IR Processes

[Figure: components of information retrieval: indexing (for optimized access), matching (searching, clustering), ranking (with term boosting), and query modification (query expansion), all built on text analysis (tokenization, normalization, stop word removal, stemming)]
Types of IR Systems

51
Types of IR Systems
 Based on the nature of the information items:
 Text based IR
 Multimedia IR system
 Audio retrieval system
 Video retrieval system
 Image retrieval
 Multimodal IR system – can handle all types

52
Types of IR Systems
 Domain/Fields of the documents:
 General purpose IR
 Specialized IR(Medical, Legal, Agricultural, etc.)
 Based on languages supported:
 Mono-lingual IR
 Multilingual IR
 Cross lingual IR

53
Types of IR Systems
 Based on the nature of the output
 Document retrieval system
 Question-Answering system
 Automatic summarization systems
 Intelligent IR
 Recommender system
 Information Extraction
 Etc.

54
Factors Affecting Effective
Retrieval

What are the two factors affecting


retrieval?

55
Factors Affecting Effective Retrieval

• The effective retrieval of relevant information is
directly affected by two things
  • The user task
  • The logical view of the documents
    adopted by the retrieval system

56
The User Task

[Figure: the user reaches the database through searching or browsing/surfing]

57
Cont…

 The user task: The user task might be


one of Searching or Browsing
 Searching
 For information or data
 Information need (retrieval goal) is focused and
crystallized, purposeful; the user is often sophisticated
 Browsing/surfing
 Information need (retrieval goal) is vague and imprecise
 Glancing around; the user is often naive
 Both are initiated by the user

58
Users
 The user: anyone who needs to find some
information
 The user groups
 group by their knowledge of the system
 novice users vs. experienced users
 end users vs. information specialists
 group by their domain knowledge
 Domain experts vs. general public
 group by information needs
 need to locate a particular item
 need some information
 need all information on a subject

59
User’s Information Needs
 At all levels of our life we need information (e.g.
crossing the road, health, nutrition, travel,…)
 Information need is the desire to know, the desire
to fill a gap of knowledge
 Example- problem: one wants to cross a road in a
high traffic area: What is the information he needs?
He needs information
 About the direction people drive (left or right)
 About the meanings of the traffic light (green, yellow, and
red)
 Sign posts, etc

60
Cont…

 People depend on information to carry


out their activities of daily life.
 need to accomplish some goals
 need to solve some problems
 People realize a lack of information
 perceive a gap in their knowledge state
 desire to fill the gap

61
Logical view of documents

 The logical view of documents


 Full text
 Set of index terms
 Any point in between full text and index
terms

62
Document Processing Steps

Logical View of a document: from full text to a set of index terms

63
Cont..
 Documents in a collection are frequently
represented through a set of index terms or
keywords
 An index term is a key word (or group of related
words) which has some meaning of its own (which
usually has the semantics of a noun)
 In its more general form, an index term is simply
any word which appears in the text of a document
collection

64
Cont…

 It is simply a word whose semantics help in
remembering the document’s main theme
 Index terms are used to index and
summarize the document content
 How to generate index terms? (we will
learn in the coming weeks)

65
Cont…

 Key words might be extracted directly from the text


of the document or
 Keywords might be specified by a human expert
(this is frequently done in the information science
arena)
 No matter whether these representative keywords
are derived automatically or generated by a
specialist, they provide a logical view of a document
(concise logical view)

66
Cont...

 Modern computers make it possible to represent a
document by its full set of words
 In this case, we say that the retrieval system adopts a
full text logical view (or representation) of the
documents
 With very large collections, however, even modern
computers might have to reduce the set of
representative keywords
 This can be accomplished through the following
standard steps

67
Cont...

 Standard steps
 Recognizing document structures (titles,
sections, paragraphs, etc.)
 Break into tokens
 Usually space and punctuation delimited
 Special issues with some languages (Chinese,
compound words,)
 The elimination of stopwords (such as
articles and connectives)

68
Cont…
 Conflation: the use of stemming/
morphological analysis
 Purpose
 Overcome the variants of word forms by reducing all
words with the same root to a single form (i.e., reducing
distinct words to their common grammatical root)
 Most IR systems perform stemming on both text
and query
 The identification of noun groups (which
eliminates adjectives, adverbs, and verbs)
 Other operations can also be performed
 Store in inverted index (to be discussed in later
chapters)
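
Conflation by stemming can be sketched with a toy suffix stripper. The suffix list and the minimum stem length are assumptions for illustration only; real stemmers, such as Porter's algorithm, apply many more rules and conditions.

```python
# Hypothetical suffix list; real stemmers such as Porter's use many more rules.
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(text):
    """Apply the same stemming to every token of a text or a query."""
    return [stem(token) for token in text.lower().split()]


print(conflate("supporting supported supports support"))
# prints ['support', 'support', 'support', 'support']
```

Applying `conflate` to both the document text and the query makes "supported" in a query match "supporting" in a document.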
69
Cont…
 Such text operations reduce the complexity
of the document representation and allow
moving the logical view from that of a full
text to that of index terms
 Index - A list of important key words from
the documents

70
Cont...
 The full text is the most complete logical
view of a document, But its usage usually
implies higher computational costs
 A small set of categories/ index terms
(generated automatically or by a human
specialist) provides the most concise
logical view of a document, but its usage
might negatively affect the retrieval quality
 Several intermediate logical views (of a
document) might be adopted by an
information retrieval system as shown in
the figure
71
Cont…

 The issue of logically representing a


document should be viewed as a continuum
in which the logical view of the document
might shift (smoothly) from a full text
representation to a higher level
representation specified by a human subject

72
Cont...
 The index terms obtained are a description of
a document content and of its structure
 Models may allow reference to the text
document
 The models might also allow references to
the structure normally present in written text
(in this case we say a structured model)
 Retrieval based on index terms or keywords
might be of fairly low quality

73
Cont…
 Two major reasons for this
 The user query might be composed of too few terms
which usually implies the query context is poorly
characterized
 This problem is dealt with through transformations in the
query such as query expansion and user relevance
feedback
 The set of keywords generated for a given document
might fail to summarize its semantic content properly
 This problem is dealt with through transformations in the
text such as
 Identification of noun groups to be used as keywords
 Stemming
 The use of thesaurus

74
Cont...

 Given a set of index terms for a document, we


notice that not all the terms are equally useful for
describing the document contents
 There are index terms that are simply vaguer than
others
 Deciding on the importance of a term for
summarizing the contents of a document is not a
trivial issue
 Despite this difficulty, there are properties of an
index term that are useful for evaluating its importance
75
Cont…
 Examples of such properties
 A word which appears in each of the one hundred
thousand documents is completely useless as an
index term because it does not tell us anything
about which documents the user might be
interested in
 A word which appears in just five documents is
quite useful because it narrows down considerably
the space of documents which might be of interest
to the user
 Thus, distinct index terms have varying relevance
when used to describe document contents
 This effect is captured through the assignment of
numerical weights to each index term of a
document (term weighting)
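
One common way to turn this observation into a number is inverse document frequency (IDF), sketched below. The slides do not commit to a particular weighting scheme here, so the choice of IDF, the formula log(N / df), and the toy collection are assumptions for illustration.

```python
import math


def inverse_document_frequency(term, docs):
    """log(N / df): 0 for a term in every document, larger for rarer terms."""
    n_docs = len(docs)
    df = sum(1 for text in docs if term in text.lower().split())
    if df == 0:
        return 0.0  # term absent from the collection; zero weight by choice
    return math.log(n_docs / df)


# Toy collection invented for illustration.
docs = [
    "the retrieval of information",
    "the index of the collection",
    "the user query",
    "the inverted file",
]
print(inverse_document_frequency("the", docs))    # prints 0.0
print(inverse_document_frequency("query", docs))  # log(4), about 1.386
```

A word that appears in every document gets weight zero, matching the first property above, while a word confined to a few documents gets a higher weight.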
76
Challenges in IR

Why is IR a Difficult Problem?

77
Why is IR a Difficult Problem?

 The size of the web is doubling every year:


 50 million pages in November 1995, 320 million
pages in December 1997, 800 million pages in
February 1999, 1 billion pages in 2000, 4.6
billion pages in Oct. 2018, 5.59 Billion pages in
March 2023, and growing every day
 Huge amount of data (e.g., WWW) dictates
efficiency, effectiveness and user-friendliness
 Thus : Any IR system needs the capability of large
scale data processing. Use of indexes and various
representations are required

78
Cont…
 Unstructured data: difficult to capture
semantics in documents. Compare:
 “select * from Employee where Salary >
100000”
 “retrieve all news items about corporate
takeover”
 Why is the second query more difficult to
answer? The following query is even more
difficult:
 “retrieve all news items about corporate
takeover involving an internet company”

79
Cont…
 Documents have unrestricted domains
 it is hard to predefine or pre-categorize
the subject domains of documents
 a particular subject is related to several major
topics including linguistics, psychology,
Cybernetics, Communications, Information
System design, Engineering & Technology,
Networking, Computer Science, Mathematics,
Economics, Management Science,
education …

80
Cont…
 Diversified user base: expert to casual users
 The users of information retrieval systems include
 Research scientists (that seek articles related to
particular experiments)
 Engineers (who try to determine whether a patent
covering some new idea has previously been
obtained)
 Attorneys (who search for legal precedents)
 Buyers in general (who try to obtain new product
information)

81
Cont…
 Information retrieval users
 Have a wide variety of different information needs
(Interest), Exhibit many different backgrounds
 May be led by many different reasons to use the retrieval
facilities
 As a result, they require a variety of services and end
products
 In other words, a system may be clumsy for an expert
user but difficult to use for a casual user
 a system may return information too general to be
useful for an expert in the subject but too narrow for a
general user

82
Cont…

[Figure: the user translates information needs into queries; the system matches queries to stored information (search/select); query result evaluation asks whether the information found matches the user’s information needs]
83
Cont…

 Distributed and interlinked (e.g., hypertext
and WWW)
 Where to start a search? Unlike in a centralized
database, where you have only one (or a few) databases to
search.
 How is the information related?
 Efficiency vs. effectiveness
 With a limited amount of resources, one can only improve
efficiency and effectiveness to a certain degree. Moreover,
improving efficiency often means degrading effectiveness,
and vice versa.

84
Other Central Concepts in IR

What else to start the actual work?

85
Other Central Concepts in IR

 Documents
 Queries
 Collections
 Evaluations
 Relevance

86
Why is IR Important?

 Most information available is in textual form and has no
predefined format (e.g., emails and newsgroup articles).
 Integration of text retrieval capability in most relational
database systems. SQL already supports limited search
capability such as search based on pattern matching:
 select * from Employee where Name like '%Lee%'
 Increasing number of online documentation systems (no
more hardcopy!)
 And, of course, the rapid growth of the World Wide Web
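
The limited text search that SQL offers can be tried out with Python's built-in `sqlite3` module. The table contents below are invented for illustration, and the helper name `names_like` is an assumption; only the `LIKE` query itself comes from the slide.

```python
import sqlite3

# In-memory database with a toy Employee table (rows invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee (Name, Salary) VALUES (?, ?)",
    [("Bruce Lee", 120000), ("Ann Leeds", 90000), ("John Smith", 150000)],
)


def names_like(pattern):
    """Return names matching an SQL LIKE pattern, in alphabetical order."""
    rows = conn.execute(
        "SELECT Name FROM Employee WHERE Name LIKE ? ORDER BY Name", (pattern,)
    )
    return [name for (name,) in rows]


print(names_like("%Lee%"))  # prints ['Ann Leeds', 'Bruce Lee']
```

The pattern matches any name containing the substring, including "Leeds", which shows why this kind of matching is a limited substitute for relevance-based retrieval.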

87
Historical overview
 Organization and storage of knowledge for
ease of access is centuries old
 That is, the history of recording knowledge
goes back thousands of years.
 Important events
 Development of writing, Books and printing
technology, News publishing, Journal publishing
(economic reasons- books are not economical in
terms of money and time), Libraries (to put
publications in one centre)

88
Cont…

 Now- The World-Wide-web


 A gigantic distributed collection of
heterogeneous information items (web pages)
 New challenges

89
Cont…

 As the size of the collection grows, access to


documents becomes more difficult without
proper mechanism
 Therefore, in order to reach documents in
libraries or other collections, access
mechanisms were necessary

90
Cont…
 Simple methods to facilitate access to
single document:
 Table of contents,
 Keyword index
 Classical methods to facilitate access to
collections of documents
 Index (keywords, authors)
 Hierarchical (Dewey-Decimal classification)

91
Cont…

 Increasing demand for information access


created Information Science as a discipline
 The term IR was first coined in 1952 and gained
acceptance in 1961
 It became an important sub-discipline that is
concerned with developing theories and methods
of access to information

92
Creation of Disciplines
 The mechanized era (Sparck Jones and Willett, 1997)
 IR systems were mainly used by librarians
 for carrying out bibliographic searches in place of
manual tools such as card catalogue and universal
classification systems
 The advent of word processing technology (software
+ hardware)
 a rapid, wide spread growth in the usage of IR
 Increased interest in Web-based distributed information
processing and in the application of IR techniques to
non-textual information
− The growth of knowledge → creation of disciplines

93
Cont…
 Discipline oriented era
e.g. Science from philosophy
physics from science
electricity from physics
electronics from electricity
 Similarly, information retrieval from the wider
discipline of information science
 Then came the Problem Oriented Era
 Disciplines are merged to form a new subject.
 E.g. Molecular Biology from physics and Biology
(Fosket, 1988)

94
Cont…
 Such growth of knowledge gave birth to the
creation of disciplines (domain knowledge) which
then brought about the need for classification and
indexing
 Putting related knowledge together
 E.g. Science, Arts, and Humanities
 Creation of subclasses within classes
 Designing ways and means of accessing
information (which is the area of IR)

95
