Chapter 1
Information Storage and
Retrieval (ISR): Basic concepts
Sub Topics
Definition, Foundation, theories and principles
The (information) retrieval process
Factors affecting effective retrieval
Challenges in IR
Information retrieval system: components,
structures and functions
Database retrieval Vs. information retrieval
Definition, Basic Foundation,
theories and principles
Information Retrieval Systems?
Document (Web page)
retrieval in response to a
query
Quite effective (at some things)
Commercially successful (some
of them)
But what goes on behind the
scenes?
How do they work?
What happens beyond the Web?
Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
Web Search System
[Figure: a Web spider gathers pages into a document corpus; the IR system takes a query string together with the corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]
Information Retrieval - Definition
Information retrieval is an important sub-discipline of information
science and computer science that is concerned with developing
theories and methods of access to information
Cont…
The definition incorporates all important features
of a good information retrieval system
Representation
Storage
Organization
Access
Evaluation
Documents (information items): usually text, but
possibly also images, audio, video, etc.
IR from different perspectives
Conceptually,
IR is used to cover all related problems in finding
needed information
Historically,
information retrieval is about document retrieval,
emphasizing documents as the basic unit
Technically,
information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.
Information Retrieval
Can be structured for ease of discussion as
Text IR
Discusses the classic problem of searching a
collection of documents for useful information
The focus is on document images that are
predominantly text (rather than pictures)
These are called textual images and are
amenable to automatic extraction of keywords
Cont…
Multimedia IR
Discusses how to index document images and
other binary data by extracting features from
their content and how to search them efficiently
Human-computer interaction (HCI) for IR
Discusses current trends in IR towards improved
user interfaces and better data visualization tools
Applications of IR
Covers modern applications of IR (such as the
Web, bibliographic systems, and digital libraries)
Entities in IRS
Two important entities
Information need: to be represented by search
statements (query)
Information items (documents): to be represented
by index terms or any form of representation like
summary
Thus the process in an IRS is matching these abstractions
Key Issues in IR
Organizing
How to describe information resources or
information-bearing objects in ways so that they
may be effectively used by those who need to use
them
Retrieving
How to find the appropriate information resources
or information-bearing objects for someone's (or
your own) needs
Build a system that retrieves
documents that users are likely to find relevant to
their queries
This set of assumptions underlies the field of IR
IR is an Iterative Process – Basic theory
[Figure: the information life cycle. Labels: Creation, Authoring, Modifying, Organizing, Indexing, Storing, Retrieval, Distribution, Networking, Accessing, Filtering, Searching, Utilization, Using, Creating, Retention/Mining, Disposition, Discard. States: Active, Semi-Active, Inactive.]
Implementation
Cont…
• IR considers NL text mainly from a lexical view
Identifying possible word forms
Elimination of stop words (e.g., the, of, zu, ...)
Stemming (e.g., supporting, supported → support)
Selection of index terms
Term weighting
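The lexical steps listed above can be sketched in a few lines of Python. This is only an illustration: the stop list is a made-up four-or-five word sample, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's, not the method the slides prescribe.

```python
import re

# Hypothetical, tiny stop list for illustration; real systems use longer lists.
STOP_WORDS = {"the", "of", "a", "an", "and", "is", "in", "to"}

# Crude suffix stripper standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and keep runs of letters as word forms."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stop words, and stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(index_terms("Supporting the supported support of users"))
# ['support', 'support', 'support', 'user']
```

Selection of index terms and term weighting would then operate on the list this function returns.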
The Retrieval Process
[Figure: the user interface passes queries to the search engine, which answers them from an index that the spider builds from Web pages.]
The Retrieval Process
[Figure: the retrieval process in a Web search engine accessed through a Web browser. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index, Text Database. Flows: text, logical view, query, user feedback, retrieved docs, ranked docs.]
Factors Affecting Effective Retrieval
The User Task
Retrieval
Database
Browsing/ surfing
Cont…
The user task: The user task might be one of retrieval or
browsing
Retrieval of information or data
Information need (retrieval goal) is focused and
crystallized; purposeful; often the user is sophisticated
Browsing/surfing
Information need (retrieval goal) is vague and imprecise
Glancing around; often the user is naive
Both are initiated by the user
Logical view of documents
Document Processing Steps
Cont…
Key words might be extracted directly from the
text of the document or
Keywords might be specified by a human expert
(this is frequently done in the information
science arena)
No matter whether these representative
keywords are derived automatically or generated
by a specialist, they provide a logical view of a
document (concise logical view)
Cont...
Modern computers make it possible to represent a
document by its full set of words
In this case, we say that the retrieval system
adopts a full text logical view (or representation)
of the documents
With very large collections, however, modern
computers might have to reduce the set of
representative keywords
This can be accomplished through the following
standard steps
Cont...
Standard steps
Recognizing document structures (titles, sections,
paragraphs, etc.)
Break into tokens
Usually space and punctuation delimited
Special issues with some languages
The elimination of stopwords (such as articles
and connectives)
Cont…
Conflation: the use of stemming/morphological
analysis
Purpose: overcome variation in word forms by
reducing all words with the same root to a single
form (i.e., reducing distinct words to their common
grammatical root)
Most IR systems perform stemming on both text and
query
The identification of noun groups (which eliminates
adjectives, adverbs, and verbs)
Further operations can also be performed
Store in inverted index
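As a rough illustration of the final step, the sketch below builds an inverted index from a toy collection. The document IDs and stemmed terms are invented for the example, and a production index would also store positions or weights.

```python
from collections import defaultdict

# Hypothetical toy collection: document ID -> already-processed index terms.
docs = {
    1: ["inform", "retriev", "system"],
    2: ["databas", "system"],
    3: ["inform", "need", "queri"],
}


def build_inverted_index(collection):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in collection.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(docs)
print(index["system"])  # [1, 2]
print(index["inform"])  # [1, 3]
```

Looking up a query term is then a single dictionary access instead of a scan over every document.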
Cont…
Such text operations reduce the complexity of the
document representation and allow moving the
logical view from that of a full text to that of indexed
terms
Cont...
Given a set of index terms for a document, we
notice that not all the terms are equally useful for
describing the document contents
Some index terms are simply vaguer than
others
Deciding on the importance of a term for
summarizing the contents of a document is not a
trivial issue
Despite this difficulty, some properties of an
index term help in judging its usefulness
Cont…
Examples of such properties
A word which appears in each of the one hundred
thousand documents is completely useless as an
index term because it does not tell us anything
about which documents the user might be
interested in
A word which appears in just five documents is
quite useful because it narrows down considerably
the space of documents which might be of
interest to the user
Thus, distinct index terms have varying relevance
when used to describe document contents
This effect is captured through the assignment of
numerical weights to each index term of a
document
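One common way to turn this intuition into a number is inverse document frequency, idf = log(N / df), where N is the collection size and df is the number of documents containing the term. The slides do not name a specific weighting scheme, so treat the formula below as an assumed example rather than the scheme intended here.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency: log(N / df), using the natural logarithm."""
    return math.log(num_docs / doc_freq)


N = 100_000  # collection size from the example above

print(idf(N, 100_000))       # term in every document -> 0.0
print(round(idf(N, 5), 2))   # term in five documents -> 9.9
```

The term that occurs everywhere gets weight zero, while the rare term gets a high weight, matching the two cases described above.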
Challenges in IR
Why is IR a Difficult Problem?
Cont…
Unstructured data: difficult to capture
semantics in documents. Compare:
“select * from Employee where Salary > 100,000”
“retrieve all news items about corporate
takeover”
Why is the second query more difficult to answer?
The following query is even more difficult:
“retrieve all news items about corporate
takeover involving an internet company”
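The contrast can be made concrete with a small sketch. The table contents and news headlines are invented for illustration; the point is only that the SQL condition has one exact meaning, while a naive keyword test misses a relevant item that uses different words.

```python
import sqlite3

# Structured data: the condition "Salary > 100000" has exactly one meaning.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Abebe", 120000), ("Sara", 90000)],  # hypothetical rows
)
rows = conn.execute("SELECT Name FROM Employee WHERE Salary > 100000").fetchall()
print(rows)  # [('Abebe',)]

# Unstructured data: hypothetical headlines.
news = [
    "Acme announces corporate takeover of rival firm",
    "Search giant acquires a small startup",  # a takeover, but the word is absent
    "Weather forecast for the weekend",
]
# Naive keyword matching finds only the first headline.
hits = [item for item in news if "takeover" in item.lower()]
print(hits)
```

The second headline is about a takeover but never says so, which is why matching on meaning is harder than matching on field values.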
Cont…
Documents have unrestricted domains
It is hard to predefine or pre-categorize the subject
domains of documents
A particular subject may be related to several major
topics, including linguistics, psychology, cybernetics,
communications, information system design,
engineering and technology, networking, computer
science, mathematics, economics, management
science, education, …
Cont…
Information retrieval users
Have a wide variety of different information needs
(Interest), Exhibit many different backgrounds
May be led by many different reasons to use the retrieval
facilities
As a result, they require a variety of services and end
products
In other words, a system may be clumsy for an expert
user and yet difficult to use for a casual user
A system may return information too general to be
useful for an expert in the subject but too narrow for a
general user
Cont…
Distributed and interlinked (e.g., Hypertext and
WWW)
Where to start a search? Unlike a centralized
database, where you have only one (or a few)
database(s) to search
How are the information items related?
What is a system?
Is a set of interrelated components interacting together
to achieve an objective.
Has basic characteristics like:
Input, output, environment, boundary, objectives,
components, interaction, interface
Can be living or non-living
What is “systems thinking”?
Do you agree with this? “A system is bigger than the
sum of its components”
Systems thinking
Is a mindset or way of thinking that views the world
(everything in the world) as a system
It emphasizes the interactions that keep the system
alive
Benefits
Identification of a system leads to abstraction
From abstraction you can think about the essential
characteristics of a specific system
Abstraction allows the analyst to gain insights into
a specific system, to question assumptions, provide
documentation, and manipulate the system without
disrupting the real situation
Cont..
Different types of Information systems
IRS
DBMS
MIS
DSS
ESS
IRS
Is a system that is capable of storage, retrieval,
and maintenance of information items
The process of an IR system is to match two
abstractions
Index terms/keywords abstracted from
information items
Queries abstracted from the user's information needs
Need -> [matching the two sides] <- Docs
Cont…
The purpose of an IRS is to capture wanted
items (information) and to filter out unwanted
information
Basic functions of an IRS
Analysis of documents and organization of
information (creation of a document database)
Analysis of users' queries and preparation of a
strategy to search the database
Actual searching or matching of users' queries
with the database
Retrieval of items that fully or partially match
the search statement
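The last two functions, matching and retrieval of full or partial matches, can be sketched as follows. Scoring by the number of shared terms is an assumed, deliberately simple measure, and the documents and their index terms are invented for the example.

```python
def rank(query_terms, doc_terms):
    """Score each document by how many distinct query terms it contains."""
    query = set(query_terms)
    scores = {doc_id: len(query & set(terms)) for doc_id, terms in doc_terms.items()}
    # Keep full and partial matches, best first; drop documents with no match.
    matched = [(doc_id, score) for doc_id, score in scores.items() if score > 0]
    return sorted(matched, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical document representations (index terms per document).
docs = {
    "d1": ["corporate", "takeover", "news"],
    "d2": ["corporate", "finance"],
    "d3": ["weather", "report"],
}

print(rank(["corporate", "takeover"], docs))
# [('d1', 2), ('d2', 1)]
```

Here d1 matches the search statement fully, d2 partially, and d3 is filtered out.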
A crawler: Basics of crawlers
Definition:
A Web crawler is a computer program that browses the
World Wide Web in a methodical, automated manner.
Utilities:
Gather pages from the Web.
Support a search engine, perform data mining and so on.
Object:
Text, video, image and so on.
Link structure.
Q: How does a search
engine know that all
these pages contain
the query terms?
A: Because all of
those pages have
been crawled
Many names
Crawler
Spider
Robot (or bot)
Web agent
Wanderer, worm, …
And famous instances: googlebot, scooter, slurp,
msnbot, …
Crawler: basic idea
[Figure: the crawl begins from a set of starting pages (seeds) and follows links outward.]
Features of a crawler
Must provide:
Robustness: spider traps
Infinitely deep directory structures:
http://foo.com/bar/foo/bar/foo/...
Pages filled with a large number of characters
Politeness: which pages can be crawled, and which
cannot
Robots exclusion protocol: robots.txt
http://blog.sohu.com/robots.txt
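Both requirements can be sketched with the Python standard library. The robots.txt content, the user agent name "example-bot", the example.com URLs, and the depth cut-off are all assumptions made for illustration; a real crawler would download robots.txt from each site it visits.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

MAX_PATH_DEPTH = 10  # assumed cut-off for suspiciously deep paths

# Hypothetical robots.txt content, parsed from memory instead of the network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Politeness: ask the parsed rules before fetching a page.
allowed = parser.can_fetch("example-bot", "http://example.com/index.html")
blocked = parser.can_fetch("example-bot", "http://example.com/private/data.html")
print(allowed, blocked)  # True False


def too_deep(url):
    """Robustness: flag URLs whose path has more slashes than MAX_PATH_DEPTH."""
    return urlparse(url).path.count("/") > MAX_PATH_DEPTH


trap = "http://foo.com" + "/bar/foo" * 10  # imitates an endlessly nested path
print(too_deep(trap))  # True
```

A depth limit is only one heuristic against spider traps; it will not catch every trap and may skip some legitimate deep pages.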
Motivation for crawlers
Support universal search engines (Google, Yahoo,
MSN/Windows Live, Ask, etc.)
Vertical (specialized) search engines, e.g. news,
shopping, papers, recipes, reviews, etc.
Business intelligence: keep track of potential
competitors, partners
Monitor Web sites of interest
Evil: harvest emails for spamming, phishing…
… Can you think of some others?…
A crawler based search engine
[Figure: a crawler (googlebot) fills a Web page repository; a ranker produces the hits.]
Graph traversal (BFS or DFS?)
Two most widely used search designs
Breadth First Search
Implemented with QUEUE (FIFO)
Finds pages along shortest paths
If we start with “good” pages, this
keeps us close; maybe other good
stuff…
Depth First Search
Implemented with STACK (LIFO)
Wander away (“lost in cyberspace”)
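The difference between the two designs comes down to how the frontier is consumed. The sketch below runs both over a small invented link graph; the page names are placeholders, and a real crawler would extract links from fetched pages instead of a dictionary.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def crawl_order(start, breadth_first=True):
    """Return pages in visiting order; only the frontier discipline differs."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # FIFO queue for BFS, LIFO stack for DFS.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


print(crawl_order("seed", breadth_first=True))
# ['seed', 'a', 'b', 'c', 'd', 'e']
print(crawl_order("seed", breadth_first=False))
# ['seed', 'b', 'd', 'a', 'c', 'e']
```

BFS visits every page one link from the seed before going further, while DFS follows one chain of links to its end before returning.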
Implementation issues
Don't want to fetch the same page twice!
Keep lookup table (hash) of visited pages
The frontier grows very fast!
May need to prioritize for large crawls
Fetcher must be robust!
Don't crash if download fails
Timeout mechanism
Determine file type to skip unwanted files
Can try using extensions, but not reliable
Can issue 'HEAD' HTTP commands to get Content-Type
headers, but overhead of extra Internet requests
More implementation issues
Fetching
Get only the first 10-100 KB per page
Take care to detect and break redirection
loops
Soft fail for timeout, server not
responding, file not found, and other
errors
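A fetcher with these properties might look like the sketch below. To keep the example runnable without a network, the transport is a caller-supplied function `get(url)` returning `(status, location, body)`; that interface, the redirect limit, and the example.com URLs are all assumptions for illustration, not a real library API.

```python
MAX_BYTES = 100 * 1024   # keep at most about 100 KB per page
MAX_REDIRECTS = 5        # assumed limit on redirect hops


def fetch(url, get, max_redirects=MAX_REDIRECTS):
    """Fetch url through get(url) -> (status, location, body).

    Returns the (possibly truncated) body, or None on any failure.
    `get` is a hypothetical transport that may raise OSError.
    """
    visited = set()
    while url not in visited and len(visited) <= max_redirects:
        visited.add(url)
        try:
            status, location, body = get(url)
        except OSError:
            return None              # soft fail: timeout, server not responding
        if status in (301, 302, 303, 307, 308) and location:
            url = location           # follow the redirect
            continue
        if status != 200:
            return None              # soft fail: file not found and other errors
        return body[:MAX_BYTES]      # truncate long pages
    return None                      # redirection loop or too many hops


def fake_get(url):
    """Stand-in transport with canned responses, for demonstration only."""
    pages = {
        "http://example.com/a": (302, "http://example.com/b", b""),
        "http://example.com/b": (302, "http://example.com/a", b""),
        "http://example.com/ok": (200, None, b"x" * 200_000),
        "http://example.com/missing": (404, None, b""),
    }
    if url not in pages:
        raise TimeoutError("no response")  # TimeoutError is an OSError
    return pages[url]


print(fetch("http://example.com/a", fake_get))         # None (redirect loop)
print(len(fetch("http://example.com/ok", fake_get)))   # 102400
```

Returning None instead of raising lets the crawl loop skip the bad page and continue with the rest of the frontier.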
Two basic subsystems of an IR system
Next to crawlers
Indexing Subsystem
[Figure: indexing pipeline]
Documents -> Assign document identifier -> text with document IDs
-> Tokenize -> tokens
-> Stop list -> non-stoplist tokens
-> Stemming & Normalize -> stemmed terms
-> Term weighting -> terms with weights
-> Index
Searching Subsystem
[Figure: searching subsystem. On one side, a query is formulated in terms of descriptors and stored as a profile (Store1: Profiles/Search requests); on the other, documents are indexed (descriptive and subject indexing) and stored (Store2: Document representations). Both translations follow the rules of the game = rules for subject indexing + thesaurus (which consists of a lead-in vocabulary and an indexing language). Comparison/matching of the two stores, followed by ranking, yields potentially relevant documents. Adapted from Soergel, p. 19]
Database Systems Vs Information
Retrieval Systems
DBMS vs IRS
Cont…
On the Information/data
DBMS: structured data (often homogeneous records),
semantic unambiguity
IR systems: unstructured (free text), ambiguity
On the answers/results
DBMS:
Records (tuples), perfect precision and recall, each
item is relevant (no ranking), well-defined results
IR systems:
Documents, imperfect precision and recall, each
item has a specific relevance (ranking), fuzzy results
Cont…
On their relationship
Systems complement each other
On their history
DB grew out of files and traditional business
systems
IR grew out of library science and the need to
categorize/group/access books/articles
Cont…
Data retrieval
records contain a set of keywords
Well defined semantics
a single erroneous object implies failure!
Information retrieval
information about a subject or topic
semantics is frequently loose
small errors are tolerated
Cont…
IR system:
interpret contents of information items
generate a ranking which reflects relevance
notion of relevance is most important
Information retrieval is much more difficult than data
retrieval