Chapter 1

The document provides an overview of Information Retrieval (IR), including its definitions, processes, and challenges. It discusses the components and functions of Information Retrieval Systems (IRS) and emphasizes the importance of user-centered approaches in effectively retrieving information. Key issues such as organizing and retrieving information, as well as the iterative nature of the IR process, are also highlighted.

Uploaded by

bellhermon
Information Retrieval

Chapter 1
Information Storage and Retrieval (ISR): Basic Concepts

Sub Topics
• Definition, foundation, theories and principles
• The (information) retrieval process
• Factors affecting effective retrieval
• Challenges in IR
• Information retrieval system: components, structures and functions
• Database retrieval vs. information retrieval

Definition, Basic Foundation, Theories and Principles

• What are the key foundational concepts regarding IR?

Information Retrieval Systems?
• Document (Web page) retrieval in response to a query
  - Quite effective (at some things)
  - Commercially successful (some of them)
• But what goes on behind the scenes?
  - How do they work?
  - What happens beyond the Web?
• Web search systems
  - Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …

Web Search System
[Figure: a Web spider gathers the document corpus; the IR system takes a query string and the corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]

Information Retrieval - Definition
• An important sub-discipline of Information Science/Computer Science concerned with developing theories and methods of access to information
• Its focus is on helping users find information that matches their information need (user-centered view)
• A branch of applied Computer Science that focuses on the representation, storage, organization of, and access to information items (system-centered view)

Cont…
• A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999, p. 1):

“Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which he is interested”

Cont…
• The definition incorporates all important features of a good information retrieval system
  - Representation
  - Storage
  - Organization
  - Access
  - Evaluation
• Documents/information items: usually text, but possibly also image, audio, video, etc.

IR from different perspectives

• Conceptually,
  - IR covers all problems related to finding needed information
• Historically,
  - Information retrieval is about document retrieval, emphasizing the document as the basic unit
• Technically,
  - Information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.

Information Retrieval
• Can be structured for ease of discussion as
  - Text IR
    - Discusses the classic problem of searching a collection of documents for useful information
    - The focus is on document images that are predominantly text (rather than pictures)
    - These are called textual images and are amenable to automatic extraction of keywords

Cont…
  - Multimedia IR
    - Discusses how to index document images and other binary data by extracting features from their content, and how to search them efficiently
  - Human-computer interaction (HCI) for IR
    - Discusses current trends in IR towards improved user interfaces and better data visualization tools
  - Applications of IR
    - Covers modern applications of IR (such as the Web, bibliographic systems, and digital libraries)

Entities in IRS
• Two important entities
  - Information need: represented by search statements (queries)
  - Information items (documents): represented by index terms or another form of representation, such as a summary
• Thus the process in an IRS is matching these two abstractions

Key Issues in IR
• Organizing
  - How to describe information resources or information-bearing objects so that they may be effectively used by those who need them
• Retrieving
  - How to find the appropriate information resources or information-bearing objects for someone's (or your own) needs
  - Build a system that retrieves documents that users are likely to find relevant to their queries
• This set of assumptions underlies the field of IR

IR is an Iterative Process – Basic theory
[Figure: cycle diagram. Labels: Creation, Authoring, Modifying, Using, Creating, Organizing, Indexing, Storing, Retrieval, Accessing, Filtering, Mining, Distribution, Networking, Utilization, Searching, Retention/Disposition, Discard; activity states: Active, Semi-Active, Inactive.]
Implementation

• Thus, to address the key issues above, the implementation consists of developing an information system
  - A retrieval system
• IR deals with very large sets of documents
  - Requires a high degree of robustness and efficiency
  - Domain independence and multilinguality
• IR usually deals with natural language (NL) text, which is not always well structured and can be semantically ambiguous

Cont…
• IR considers NL text mainly from a lexical view
  - Identifying possible word forms
  - Elimination of stop words (e.g., the, of, zu, ...)
  - Stemming (e.g., supporting, supported → support)
  - Selection of index terms
  - Term weighting
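The lexical steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list and the suffix list are made up for the example, and the suffix stripper is a deliberately naive stand-in for a real stemmer, not the Porter algorithm.

```python
import re

# Assumed toy stop list; real systems use larger, language-specific lists.
STOP_WORDS = {"the", "of", "a", "an", "and", "is", "in", "to"}

# Assumed suffixes for a deliberately naive stemmer (not the Porter algorithm).
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word forms."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip one known suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(index_terms("Supporting the supported users of the system"))
# -> ['support', 'support', 'user', 'system']
```

Both "supporting" and "supported" collapse to the same index term, which is the point of conflation. Selecting and weighting the resulting terms are separate steps not shown here.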

The Retrieval Process

• What does the basic retrieval process look like?

Cont…
[Figure: the user sends queries through an interface; the spider of the search engine gathers Web pages into an index.]

The Retrieval Process
[Figure: the retrieval process (Web search engine, Web browser). Components: User Interface, Text Operations, Query Operations, Indexing, DB Manager Module, Searching, Index (inverted file), Text Database, Ranking. Flows: user need, text, logical view, user feedback, query, retrieved docs, ranked docs.]

Factors Affecting Effective Retrieval

• The effective retrieval of relevant information is directly affected by two things
  - The user task
  - The logical view of the documents adopted by the retrieval system

The User Task
[Figure: the user reaches the database through either retrieval or browsing/surfing.]

Cont…
• The user task might be one of retrieval or browsing
• Retrieval
  - Of information or data
  - The information need (retrieval goal) is focused and crystallized; the search is purposeful; the user is often sophisticated
• Browsing/surfing
  - The information need (retrieval goal) is vague and imprecise
  - Glancing around; the user is often naive
• Both are initiated by the user

Logical view of documents

• The logical view of documents
  - Full text
  - Any point in between full text and index terms
  - Set of index terms

Document Processing Steps
[Figure: document processing steps, from the “Modern IR” textbook.]

Cont…

• Documents in a collection are frequently represented through a set of index terms or keywords
  - An index term is a keyword (or group of related words) which has some meaning of its own (and usually has the semantics of a noun)
  - In its more general form, an index term is simply any word which appears in the text of a document collection
  - It is simply a word whose semantics help in remembering the document's main theme
  - How to generate index terms? (next chapter)

Cont…
• Keywords might be extracted directly from the text of the document, or
• Keywords might be specified by a human expert (this is frequently done in the information science arena)
• Whether these representative keywords are derived automatically or generated by a specialist, they provide a logical view of a document (a concise logical view)

Cont…
• Modern computers make it possible to represent a document by its full set of words
  - In this case, we say that the retrieval system adopts a full text logical view (or representation) of the documents
• With very large collections, however, even modern computers might have to reduce the set of representative keywords
  - This can be accomplished through the following standard steps

Cont…
• Standard steps
  - Recognizing document structures (titles, sections, paragraphs, etc.)
  - Breaking the text into tokens
    - Usually space and punctuation delimited
    - Special issues with some languages
  - Eliminating stopwords (such as articles and connectives)

Cont…
  - Conflation: the use of stemming/morphological analysis
    - Purpose: handle variant word forms by reducing distinct words that share a root to their common grammatical root
    - Most IR systems perform stemming on both text and query
  - Identifying noun groups (which eliminates adjectives, adverbs, and verbs)
  - Further operations can also be performed
  - Storing the result in an inverted index
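To make the last step concrete, here is a minimal sketch of an inverted index in Python. It assumes the earlier steps (tokenizing, stop word removal, stemming) have already run; the document IDs and terms are hypothetical, and a production index would also store positions or weights.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical documents, already tokenized, stop-listed and stemmed.
docs = {
    1: ["inform", "retriev", "system"],
    2: ["databas", "system"],
    3: ["inform", "need"],
}

index = build_inverted_index(docs)
print(index["system"])  # -> [1, 2]
print(index["inform"])  # -> [1, 3]
```

Looking up a term is now a single dictionary access rather than a scan over every document.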
Cont…
• Such text operations reduce the complexity of the document representation and allow moving the logical view from that of full text to that of index terms
• Index: a list of important keywords from the documents
• The full text is the most complete logical view of a document, but its use usually implies higher computational costs

Cont…
• Given a set of index terms for a document, we notice that not all the terms are equally useful for describing the document contents
• Some index terms are simply vaguer than others
• Deciding on the importance of a term for summarizing the contents of a document is not a trivial issue
• Despite this difficulty, there are properties of an index term that help estimate its usefulness

Cont…
• Examples of such properties
  - A word which appears in each of one hundred thousand documents is completely useless as an index term, because it tells us nothing about which documents the user might be interested in
  - A word which appears in just five documents is quite useful, because it considerably narrows down the space of documents which might be of interest to the user
• Thus, distinct index terms have varying relevance when used to describe document contents
• This effect is captured through the assignment of numerical weights to each index term of a document
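One common way to turn this observation into a number is inverse document frequency (IDF), log(N / df), where N is the collection size and df is the number of documents containing the term. The slides do not name a specific weighting scheme, so treating IDF as the weight is an assumption made here for illustration.

```python
import math


def idf(total_docs, docs_with_term):
    """Inverse document frequency: log(N / df). Zero when every document has the term."""
    return math.log(total_docs / docs_with_term)


N = 100_000  # collection size from the example above

everywhere = idf(N, N)  # term appears in every document
rare = idf(N, 5)        # term appears in only five documents

print(everywhere)        # -> 0.0
print(rare > everywhere)  # -> True
```

The term found in every document gets weight zero, while the term found in five documents gets a large positive weight, matching the intuition on the slide.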
Challenges in IR
Why is IR a Difficult Problem?

Why is IR a Difficult Problem?

• The size of the Web is doubling every year:
  - 50 million pages in November 1995, 320 million pages in December 1997, 800 million pages in February 1999, 1 billion pages in 2000, and growing every day
• A huge amount of data (e.g., the WWW) makes efficiency, effectiveness and user-friendliness essential
• Thus, any IR system needs the capability of large-scale data processing; indexes and various representations are required

Cont…
• Unstructured data: it is difficult to capture semantics in documents. Compare:
  - “select * from Employee where Salary > 100,000”
  - “retrieve all news items about corporate takeover”
• Why is the second query more difficult to answer? The following query is even more difficult:
  - “retrieve all news items about corporate takeover involving an internet company”

Cont…
• Documents have unrestricted domains
  - It is hard to predefine or pre-categorize the subject domains of documents
  - A particular subject may be related to several major topics, including Linguistics, Psychology, Cybernetics, Communications, Information System Design, Engineering & Technology, Networking, Computer Science, Mathematics, Economics, Management Science, Education, …

Cont…

• Diversified user base: expert to casual users
• The users of information retrieval systems include
  - Research scientists (who seek articles related to particular experiments)
  - Engineers (who try to determine whether a patent covering some new idea has previously been obtained)
  - Attorneys (who search for legal precedents)
  - Buyers in general (who try to obtain new product information)

Cont…
• Information retrieval users
  - Have a wide variety of information needs (interests) and exhibit many different backgrounds
  - May be led to use the retrieval facilities for many different reasons
• As a result, they require a variety of services and end products
• In other words, the same system may be clumsy for an expert user yet difficult to use for a casual user
• A system may return information too general to be useful for an expert in the subject yet too narrow for a general user

Cont…
• Distributed and interlinked information (e.g., hypertext and the WWW)
  - Where to start a search? In a centralized database you have only one (or a few) databases to search; a distributed collection offers no such obvious starting point.
  - How are the pieces of information related?
• Efficiency vs. effectiveness
  - With a limited amount of resources, one can only improve efficiency and effectiveness to a certain degree. Moreover, improving efficiency often means degrading effectiveness, and vice versa.

Information Retrieval System: components, structures and functions
How do we characterize an IRS?

What is a system?
• A set of interrelated components interacting to achieve an objective.
• Has basic characteristics such as:
  - Input, output, environment, boundary, objectives, components, interaction, interface
• Can be living or non-living
• What is “systems thinking”?
• Do you agree with this? “A system is bigger than the sum of its components”

Systems thinking
• A mindset, or way of thinking, that views the world (everything in the world) as a system.
• It emphasizes the interaction that keeps the system alive.
• Benefits
  - Identification of a system leads to abstraction
  - From abstraction you can think about the essential characteristics of a specific system
  - Abstraction allows the analyst to gain insights into a specific system, to question assumptions, to provide documentation, and to manipulate the system without disrupting the real situation

Cont…
• Different types of information systems
  - IRS
  - DBMS
  - MIS
  - DSS
  - ESS

IRS
• A system capable of storage, retrieval, and maintenance of information items
• The process of an IR system is to match two abstractions
  - Index terms/keywords abstracted from information items
  - Queries abstracted from users' information needs
[Figure: Need and Docs, with matching between the two sides.]

Cont…
• The purpose of an IRS is to capture wanted items (information) and to filter out unwanted information
• Present results in a format that helps the user determine the relevant items
  - Arbitrary (physical) order
  - Relevance order

Basic functions of an IRS
• Analysis of documents and organization of information (creation of a document database)
• Analysis of users' queries and preparation of a strategy to search the database
• Actual searching, or matching of users' queries against the database
• Retrieval of items that fully or partially match the search statement

A crawler: Basics of crawlers
• Definition:
  - A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner.
• Utilities:
  - Gather pages from the Web.
  - Support a search engine, perform data mining, and so on.
• Objects:
  - Text, video, images, and so on.
  - Link structure.

Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled

Many names
• Crawler
• Spider
• Robot (or bot)
• Web agent
• Wanderer, worm, …
• And famous instances: googlebot, scooter, slurp, msnbot, …

Crawler: basic idea
[Figure: crawler basic idea, showing the starting pages (seeds).]

Features of a crawler
• Must provide:
  - Robustness: spider traps
    - Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
    - Pages filled with a large number of characters.
  - Politeness: which pages can be crawled, and which cannot
    - Robots exclusion protocol: robots.txt
    - http://blog.sohu.com/robots.txt
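As a rough illustration of politeness, Python's standard library ships a robots.txt parser, `urllib.robotparser`. The sketch below parses an inline, hypothetical robots.txt rather than fetching a real one, and the agent name `example-bot` is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="example-bot"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("http://foo.com/index.html"))         # -> True
print(allowed("http://foo.com/private/data.html"))  # -> False
```

A polite crawler would run a check like this before every fetch and skip any URL for which it returns False.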

Motivation for crawlers
• Support
  - Universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
  - Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
• Business intelligence: keep track of potential competitors, partners
• Monitor Web sites of interest
• Evil: harvest emails for spamming, phishing…
• … Can you think of some others?…

A crawler based search engine
[Figure: a crawler-based search engine. Labels: Web, googlebot, Page repository, Text & link analysis, Text index, PageRank, Ranker, Query, hits.]

Graph traversal (BFS or DFS?)
The two most widely used search designs:
• Breadth-first search
  - Implemented with a QUEUE (FIFO)
  - Finds pages along shortest paths
  - If we start with “good” pages, this keeps us close; maybe other good stuff…
• Depth-first search
  - Implemented with a STACK (LIFO)
  - Wanders away (“lost in cyberspace”)
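The difference between the two designs comes down to which end of the frontier the next page is taken from. The sketch below uses a small, hypothetical link graph in place of real Web pages; the page names and the `crawl_order` helper are made up for illustration.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}


def crawl_order(start, breadth_first=True):
    """Return pages in visit order, using the frontier as a queue (BFS) or stack (DFS)."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # FIFO: take the oldest entry; LIFO: take the newest entry.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(crawl_order("seed", breadth_first=True))   # -> ['seed', 'a', 'b', 'c', 'd']
print(crawl_order("seed", breadth_first=False))  # -> ['seed', 'b', 'd', 'a', 'c']
```

Breadth-first visits both pages one link from the seed before going deeper, while depth-first follows one branch to its end before returning.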

Implementation issues
• Don't want to fetch the same page twice!
  - Keep a lookup table (hash) of visited pages
• The frontier grows very fast!
  - May need to prioritize for large crawls
• Fetcher must be robust!
  - Don't crash if a download fails
  - Timeout mechanism
• Determine file type to skip unwanted files
  - Can try using extensions, but not reliable
  - Can issue 'HEAD' HTTP commands to get Content-Type headers, but this adds the overhead of extra Internet requests
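The first two issues, avoiding repeat fetches and prioritizing a fast-growing frontier, can be sketched together. The `Frontier` class and the scores below are hypothetical; how a real crawler scores URLs is outside what the slides specify.

```python
import heapq


class Frontier:
    """URL frontier that never queues the same URL twice and serves high scores first."""

    def __init__(self):
        self._heap = []
        self._seen = set()  # hash-based lookup table of URLs already queued

    def add(self, url, score):
        """Queue `url` unless it was queued before; return True if it was added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        """Return the highest-scoring URL still waiting."""
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.add("http://foo.com/", 1.0)
frontier.add("http://foo.com/news", 5.0)
duplicate_added = frontier.add("http://foo.com/", 9.0)  # already seen, ignored

print(duplicate_added)  # -> False
print(frontier.pop())   # -> http://foo.com/news
```

The set gives constant-time duplicate checks, and the heap keeps the cost of picking the next URL logarithmic in the frontier size.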

More implementation issues

• Fetching
  - Get only the first 10-100 KB per page
  - Take care to detect and break redirection loops
  - Soft fail for timeout, server not responding, file not found, and other errors
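Redirection-loop detection can be illustrated without any network access. In this sketch a hypothetical dictionary stands in for HTTP redirect responses, and `MAX_HOPS` is an assumed limit; returning `None` plays the role of a soft fail.

```python
# Hypothetical redirect table standing in for HTTP 3xx responses: url -> target.
REDIRECTS = {
    "/a": "/b",
    "/b": "/c",
    "/loop1": "/loop2",
    "/loop2": "/loop1",
}

MAX_HOPS = 5  # assumed limit; pick one that suits the crawl


def resolve(url):
    """Follow redirects; return the final URL, or None on a loop or too many hops."""
    visited = set()
    while url in REDIRECTS:
        if url in visited or len(visited) >= MAX_HOPS:
            return None  # soft fail: skip this URL and keep crawling
        visited.add(url)
        url = REDIRECTS[url]
    return url


print(resolve("/a"))      # -> /c
print(resolve("/loop1"))  # -> None
```

The crawler treats a `None` result as one failed URL and moves on, rather than raising an error that stops the crawl.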

Two basic subsystems of an IR system
• Besides the crawler, an IR system has two subsystems:
  - Searching: an online process of finding relevant documents in the index list that match the user's query
  - Indexing: an offline process of organizing documents using keywords extracted from the collection
• Indexing is used to speed up access to the desired information in the document collection, according to the user's query

Cont…
• Indexing and searching are inextricably connected
  - You cannot search what was not first indexed in some manner or other
  - Documents are indexed in order to be searchable
    - There are many ways to do indexing
  - To index, one needs to select an indexing approach
    - There are many indexing languages, including inverted file, sequential file, suffix tree, signature file, etc.
    - Even taking every word in a document is an indexing language/approach
• Knowing searching is knowing indexing

Indexing Subsystem
[Figure: indexing pipeline. Documents → assign document identifier → text and document IDs → tokenize → tokens → stop list → non-stoplist tokens → stemming & normalize → stemmed terms → term weighting → terms with weights → index.]

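Searching Subsystem
[Figure: searching pipeline. Query → parse query → query tokens → stop list → non-stoplist tokens → stemming & normalize → stemmed terms → query term weighting → query terms. A similarity measure compares the query terms with the index terms from the index, producing the relevant document set; ranking turns it into the ranked document set.]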
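The similarity-and-ranking stage of the searching subsystem can be sketched as follows. Cosine similarity over raw term counts is an assumption chosen for illustration; the slides do not say which similarity measure or weighting the subsystem uses, and the documents are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_terms, doc_terms):
    """Return (doc_id, score) pairs with score > 0, best match first."""
    query_vec = Counter(query_terms)
    scores = {d: cosine(query_vec, Counter(terms)) for d, terms in doc_terms.items()}
    hits = [(d, s) for d, s in scores.items() if s > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents, already tokenized, stop-listed and stemmed.
DOCS = {
    "d1": ["corporat", "takeov", "news"],
    "d2": ["corporat", "tax"],
    "d3": ["weather", "report"],
}

print([d for d, _ in rank(["corporat", "takeov"], DOCS)])  # -> ['d1', 'd2']
```

The document sharing both query terms ranks first, the one sharing a single term ranks second, and the unrelated document is dropped from the result.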
Structure of an IR System
[Figure: structure of an information storage and retrieval system, adapted from Soergel, p. 19. The search line carries interest profiles and queries; the storage line carries documents and data. On both lines, a translation step applies the "rules of the game" (rules for subject indexing plus a thesaurus, which consists of a lead-in vocabulary and an indexing language): queries are formulated in terms of descriptors, and documents are indexed (descriptive and subject). Store 1 holds profiles/search requests; Store 2 holds document representations. Comparison/matching between the two stores, followed by ranking, yields potentially relevant documents.]

Database Systems vs. Information Retrieval Systems

• Are they the same or not? Is there any overlap?

DBMS vs. IRS

• An IRS is one of the different types of information systems
• But it has more similarities with a DBMS than differences
• Accordingly, it is logical to compare and contrast these two systems

Cont…
• On the information/data
  - DBMS: structured data (often homogeneous records), semantically unambiguous
  - IR systems: unstructured (free text), ambiguous
• On the answers/results
  - DBMS:
    - Records (tuples), perfect precision and recall, each item is relevant (no ranking), well-defined results
  - IR systems:
    - Documents, imperfect precision and recall, each item has a specific relevance (ranking), fuzzy results

Cont…

• On their relationship
  - The systems complement each other
• On their history
  - DB grew out of files and traditional business systems
  - IR grew out of library science and the need to categorize/group/access books/articles

Cont…

Aspect                Data retrieval      Information retrieval
Content               Data                Information
Data object           Table               Document
Matching              Exact match         Partial match, best match
Items wanted          Matching            Relevant
Query language        SQL (artificial)    Natural
Query specification   Complete            Incomplete
Organization          Highly structured   Less structured
Classification        Monothetic          Polythetic
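The "exact match" versus "best match" rows can be illustrated with a toy contrast. The records, news items, and the word-overlap score below are all hypothetical; word overlap is a deliberately simple stand-in for a real relevance measure.

```python
# Hypothetical structured records, as a DBMS would hold them.
EMPLOYEES = [
    {"name": "Abel", "salary": 120_000},
    {"name": "Sara", "salary": 90_000},
]

# Hypothetical free-text news items, as an IR system would hold them.
NEWS = {
    "n1": "corporate takeover of an internet company announced",
    "n2": "corporate earnings rise",
    "n3": "local weather update",
}


def data_retrieval(records, min_salary):
    """Exact match: a record either satisfies the condition or it does not."""
    return [r["name"] for r in records if r["salary"] > min_salary]


def information_retrieval(docs, query):
    """Best match: score by number of shared words, rank, keep partial matches."""
    wanted = set(query.lower().split())
    scored = [(len(wanted & set(text.split())), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]


print(data_retrieval(EMPLOYEES, 100_000))                 # -> ['Abel']
print(information_retrieval(NEWS, "corporate takeover"))  # -> ['n1', 'n2']
```

The data retrieval result is an unranked set in which every item qualifies, while the information retrieval result is a ranked list that includes a partial match.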

Cont…

• Data retrieval
  - Determines which records contain a set of keywords
  - Well-defined semantics
  - A single erroneous object implies failure!
• Information retrieval
  - Seeks information about a subject or topic
  - Semantics are frequently loose
  - Small errors are tolerated

Cont…
• IR system:
  - Interprets the contents of information items
  - Generates a ranking which reflects relevance
  - The notion of relevance is most important
• Information retrieval is much more difficult than data retrieval

Thank you
