Intro IR
– An Introduction –
Gheorghe Muresan
Oct 16, 2002
What is Information Retrieval ?
• What is information ?
• Relevance
Disclaimer
• Relevance and other key concepts in IR were
discussed in the previous class, so we won’t do it
again.
– We’ll take a simple view: a document is relevant if it
is about the searcher’s topic of interest
[Diagram: query and document representations are matched to produce results]
What do we want from an IRS ?
• Systemic approach
– Goal (for a known information need):
Return as many relevant documents as possible and as
few non-relevant documents as possible
• Cognitive approach
– Goal (in an interactive information-seeking
environment, with a given IRS):
Support the user’s exploration of the problem
domain and the task completion.
The role of an IR system
– a modern view –
• Support the user in
– exploring a problem domain, understanding its
terminology, concepts and structure
– clarifying, refining and formulating an information
need
– finding documents that match the info need
description
• As many relevant docs as possible
• As few non-relevant documents as possible
How does it do this ?
• User interfaces and visualization tools for
– exploring a collection of documents
– exploring search results
• Query expansion based on
– Thesauri
– Lexical/statistical analysis of text / context and concept
formation
– Relevance feedback
• Indexing and matching model
How well does it do this ?
• Evaluation
– Of the components
• Indexing / matching algorithms
– Of the exploratory process overall
• Usability issues
• Usefulness to task
• User satisfaction
Role of the user interface in IR
[Diagram: the information access process: INPUT (problem definition, source selection, problem articulation) => Engine => OUTPUT (examination of results, extraction of information)]
• Principle:
– Overview first
– Zoom
– Details on demand
• Usability issues
– Direct manipulation
– Dynamic, implicit queries
Information Visualization tools
• Repositories
– University of Maryland HCIL
• http://www.cs.umd.edu/projects/hcil
– InfoViz repository
• http://fabdo.fh-potsdam.de/infoviz/repository.html
• Hyperbolic trees
• Themescapes
• Workscapes
• Fisheye view
Faceted organization
• Each document is described by a set of attribute
(or facet) values
• Example:
– FilmFinder, HomeFinder
– Film
• Attributes (facets): Title, Year, Popularity, Director, Actors
Role of structure:
• support for exploration (browsing / searching)
• support for term disambiguation
• potential for efficient retrieval
• Document representative
– Select features to characterize document: terms,
phrases, citations
– Select weighting scheme for these features:
• Binary, raw/relative frequency, divergence measure
• Title / body / abstract, controlled vocabulary, selected topics,
taxonomy
• Similarity / association coefficient or
dissimilarity / distance metric
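As an illustration of the feature-weighting choices listed above, here is a minimal Python sketch (not from the slides; the sample text is invented) computing binary, raw-frequency and relative-frequency weights for one document:

```python
# Sketch of three term-weighting schemes for a document representative.
# The document text below is an illustrative assumption.
from collections import Counter

doc = "information retrieval systems index and retrieve information"
terms = doc.lower().split()        # naive feature extraction: single terms
counts = Counter(terms)

binary   = {t: 1 for t in counts}                         # present / absent
raw_freq = dict(counts)                                   # raw occurrence counts
rel_freq = {t: c / len(terms) for t, c in counts.items()} # relative to doc length

print(raw_freq["information"], round(rel_freq["information"], 2))  # 2 0.29
```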
Similarity coefficients
• Simple matching
$|X \cap Y| = \sum_i x_i y_i$
• Dice’s coefficient
$\frac{2\,|X \cap Y|}{|X| + |Y|} = \frac{2\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2}$
• Cosine coefficient
$\frac{|X \cap Y|}{|X|^{1/2}\,|Y|^{1/2}} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$
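A minimal Python sketch of the three coefficients over weighted term vectors; the vectors at the bottom are illustrative, and with binary (0/1) weights the expressions reduce to the set forms above:

```python
# Simple matching, Dice and cosine coefficients for term-weight vectors.
import math

def simple_matching(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))           # sum_i x_i * y_i

def dice(x, y):
    denom = sum(xi * xi for xi in x) + sum(yi * yi for yi in y)
    return 2 * simple_matching(x, y) / denom if denom else 0.0

def cosine(x, y):
    denom = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return simple_matching(x, y) / denom if denom else 0.0

doc_vec   = [1, 1, 0, 1]   # toy binary term weights
query_vec = [1, 0, 0, 1]
print(simple_matching(doc_vec, query_vec),     # 2
      round(dice(doc_vec, query_vec), 3),      # 0.8
      round(cosine(doc_vec, query_vec), 3))    # 0.816
```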
Clustering methods
• Non-hierarchic methods
=> partitions
– High efficiency, low effectiveness
• Hierarchic methods
=> hierarchic structures - small clusters of highly
similar documents nested within larger clusters of
less similar documents
– Divisive => monothetic classifications
– Agglomerative => polythetic classifications !!
Partitioning method
• Generic procedure:
– The first object becomes the first cluster
– Each subsequent object is matched against existing
clusters
• It is assigned to the most similar cluster if the similarity
measure is above a set threshold
• Otherwise it forms a new cluster
– Re-shuffling of documents into clusters can be done
iteratively to increase cluster similarity
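A sketch of this generic single-pass procedure in Python, assuming vectors of term weights, a similarity function such as the cosine sketch above, and an illustrative threshold:

```python
# Single-pass partitioning sketch: each object joins its most similar cluster
# (compared against the cluster centroid) if similarity clears the threshold,
# otherwise it starts a new cluster. The threshold value is an assumption.
def single_pass(vectors, similarity, threshold=0.5):
    clusters, centroids = [], []          # members and running centroid per cluster
    for v in vectors:
        best, best_sim = None, -1.0
        for i, centroid in enumerate(centroids):
            s = similarity(v, centroid)
            if s > best_sim:
                best, best_sim = i, s
        if best is not None and best_sim >= threshold:
            clusters[best].append(v)
            n = len(clusters[best])       # incremental centroid update
            centroids[best] = [(c * (n - 1) + x) / n
                               for c, x in zip(centroids[best], v)]
        else:
            clusters.append([v])
            centroids.append(list(v))
    return clusters
```

The re-shuffling mentioned on the slide would simply repeat the assignment loop over the finished centroids until assignments stabilize.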
HACMs (Hierarchic Agglomerative Clustering Methods)
• Generic procedure:
– Each doc to be clustered is a singleton cluster
– While there is more than one cluster, the clusters with
maximum similarity are merged and the similarities
recomputed
• A method is defined by the similarity measure
between non-singleton clusters
• Algorithms for each method differ in:
– Space (store similarity matrix ? all of it ?)
– Time (use all similarities ? use inverted files ?)
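A naive sketch of the generic agglomerative loop; single-link (maximum pairwise similarity) is used as the inter-cluster measure here, but that choice is exactly what distinguishes one HACM from another:

```python
# Naive hierarchic agglomerative clustering: O(n^3), recomputing similarities
# at every step rather than storing the full similarity matrix.
def single_link(c1, c2, sim):
    return max(sim(a, b) for a in c1 for b in c2)

def hacm(items, sim, cluster_sim=single_link):
    clusters = [[x] for x in items]       # every document starts as a singleton
    merges = []                           # records the hierarchy bottom-up
    while len(clusters) > 1:
        # find the pair of clusters with maximum similarity and merge them
        i, j = max(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]], sim))
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i] += clusters[j]
        del clusters[j]
    return merges
```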
Representation of clustered
hierarchies
Scatter/Gather
• How it works
– Cluster sets of documents into general “themes”, like a table of
contents
– Display the contents of the clusters by showing topical terms
and typical titles
– User chooses subsets of the clusters and re-clusters the
documents within
– Resulting new groups have different “themes”
Document Space
Kohonen Feature Maps on Text
Search strategies
• Analytical strategy (mostly querying)
– Analyze the attributes of the information need and of the problem
domain (mental model)
• Browsing
– Follow leads by association (not much planning)
• Similarity strategy
– “more like this”
Non-search activities
• Annotating or summarizing
• Analysis
– Finding trends
– Making comparisons
– Aggregating information
– Identifying a critical subset
IRS design trade-offs
(high-level)
• General
– Easy to learn (“walk up and use”)
• Intuitive
• Standardized look-and-feel and functionality
– Simple and easy to use
– Deterministic and restrictive
• Specialized
– Complex, require training (course, tutorial)
– Increased functionality
– Customizable, non-deterministic
Query specification
• Boolean vs. free text
• Phrases / proximity
• Example:
– How does each apply to Boolean queries ?
Form-Based Query Specification
(Altavista)
Form-based Query Specification
(Infoseek)
Direct Manipulation Spec.
VQUERY (Jones 98)
Menu-based Query Specification
(Young & Shneiderman 93)
Putting Results in Context
• Interfaces should
– give hints about the roles terms play in the collection
– give hints about what will happen if various terms are
combined
– show explicitly why documents are retrieved in
response to the query
– summarize compactly the subset of interest
KWIC (Keyword in Context)
• An old standard, ignored by internet search engines
– used in some intranet engines, e.g., Cha-Cha
TileBars
The formalized IR process
[Diagram: the real world and the user’s anomalous state of knowledge are represented as documents and queries, which are matched to produce results]
Indexing
• Automatic
– The system extracts “typical”/ “significant” terms
– The human may contribute by setting the parameters or
thresholds, or by choosing components or algorithms
• Semi-automatic
– The system’s contribution may be support in the form of word lists,
thesauri, reference systems, etc., whether or not this follows automatic
processing of the text
Manual vs. automatic indexing
• Manual
– Slow and expensive
– Is based on intellectual judgment and semantic
interpretation (concepts, themes)
– Low consistency
• Automatic
– Fast and inexpensive
– Mechanical execution of algorithms, with no intelligent
interpretation (aboutness / relevance)
– Consistent
Vocabulary
• Vocabulary (indexing language)
– The set of concepts (terms or phrases) that can be used to
index documents in a collection
• Controlled
– Specific for specialized domains
– Potential for increased consistency of indexing and precision
of retrieval
• Un-controlled (free)
– Potentially all the terms in the documents
– Potential for increased recall
Thesauri
• Capture relationships between indexing terms
– Hierarchical
– Synonymous
– Related
• Creation of thesauri
– Manual vs. automatic
• Use of thesauri
– In manual / semi-automatic / automatic fashion
– Syntagmatic co-ordination / thesaurus-based query
expansion during indexing / searching
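A small sketch of thesaurus-based query expansion at search time; the synonym table is a toy assumption standing in for a real thesaurus:

```python
# Thesaurus-based query expansion sketch (toy relationships, not a real thesaurus).
THESAURUS = {
    "car":  ["automobile", "vehicle"],      # synonymous / related terms
    "bank": ["financial institution"],
}

def expand_query(terms):
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t.lower(), []))
    return expanded

print(expand_query(["car", "loan"]))
# ['car', 'loan', 'automobile', 'vehicle']
```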
Query indexing
• Search systems
– Automatic indexing
– Synchronization with indexing of documents
(vocabulary, algorithms, etc)
• Empiricist approach
– Statistical Language Processing
– Estimate probabilities of linguistic events: words,
phrases, sentences (Shannon)
– Inexpensive, yet often as effective as deeper linguistic analysis
Automatic indexing
• There is no “best solution”
[Diagram: automatic indexing pipeline: lexical analysis => stopword removal => stemming => data structure representation]
Term significance
Word occurrence frequency is a measure of the significance of terms and of
their discriminatory power (see the Brown corpus).
[Figure: term frequency distribution, with the mid-frequency range marked as the significant terms]
Retrieval models
• Probabilistic
– Rank documents based on the estimated probability that they
are relevant to the query (derived from term counts)
• Language models
– Rank documents based on the estimated probability that the
query is a random sample of document words
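A minimal sketch of the language-model view: rank documents by the probability that the query terms were drawn from the document’s unigram word distribution. The add-one smoothing and toy documents are assumptions made for the example:

```python
# Query-likelihood ranking sketch with add-one (Laplace) smoothing.
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, vocab_size):
    counts, n = Counter(doc_terms), len(doc_terms)
    # log P(query | document) under a smoothed unigram model
    return sum(math.log((counts[t] + 1) / (n + vocab_size)) for t in query_terms)

docs = {
    "d1": "information retrieval ranks documents by estimated relevance".split(),
    "d2": "the weather today is sunny and warm".split(),
}
vocab = {t for terms in docs.values() for t in terms}
query = "information relevance".split()

ranking = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d], len(vocab)),
                 reverse=True)
print(ranking)   # ['d1', 'd2']: d1 contains the query terms, so it ranks first
```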
Ranked retrieval
• The documents are ranked based on their score
• Advantages
– Query easy to specify
– The output is ranked based on the estimated relevance
of the documents to the query
– A wide variety of theoretical models exist
• Disadvantages
– Query less precise (although weighting can be used)
Boolean retrieval
• Documents are retrieved based on whether or not they
contain the query terms
• Advantages
– Very precise queries can be specified
– Very easy to implement (in the simple form)
• Disadvantages
– Specifying the query may be difficult for casual users
– Lack of control over the size of the retrieved set
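A sketch of Boolean retrieval in its simple form: every term maps to the set of documents containing it, and AND / OR / NOT become set operations. The postings below are an illustrative assumption:

```python
# Boolean retrieval sketch over toy postings sets.
postings = {
    "information": {1, 2, 4},
    "retrieval":   {1, 4, 5},
    "weather":     {3, 5},
}
all_docs = {1, 2, 3, 4, 5}

# information AND retrieval AND NOT weather
result = postings["information"] & postings["retrieval"] & (all_docs - postings["weather"])
print(sorted(result))   # [1, 4]
```

Note that the result is an unranked set, which is the root of the lack of control over the size of the retrieved set mentioned above.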
IR Evaluation
• Why evaluate ?
– “Quality”
• What to evaluate ?
– Qualitative vs. quantitative measures
• How to evaluate ?
– Experimental design; result analysis
• User-centered approach
– User part of the system, interacting with other
components, trying to resolve an anomalous state of
knowledge
– Task-oriented evaluation
Aspects to evaluate
[Diagram: the information access process: INPUT (problem definition, source selection, problem articulation) => Engine => OUTPUT (examination of results, extraction of information)]
• Operational
– More or less “real” users
– Real or inferred information needs
– Realism
The traditional (lab) IR
experiment
• To start with you need:
– An IR system (or two)
– A collection of documents
– A collection of requests
– Relevance judgements
[Venn diagram: the set of retrieved documents and the set of relevant documents overlap within the collection of all documents]
Precision vs. Recall
$\text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|} \qquad \text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|}$
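A tiny worked example of the two measures (the document identifiers are invented):

```python
# Precision and recall on toy relevance judgements.
retrieved = {1, 2, 3, 4, 5}          # what the system returned
relevant  = {2, 4, 6, 8}             # judged relevant in the collection

rel_retrieved = retrieved & relevant
precision = len(rel_retrieved) / len(retrieved)   # 2 / 5 = 0.4
recall    = len(rel_retrieved) / len(relevant)    # 2 / 4 = 0.5
print(precision, recall)
```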
Interactive system’s evaluation
• Definition:
Evaluation = the process of systematically
collecting data that informs us about what it is
like for a particular user or group of users to
use a product/system for a particular task in
a certain type of environment.
Problems
• Attitudes:
– Designers assume that if they and their colleagues can
use the system and find it attractive, others will too
• Features vs. usability or security
– Executives want the product on the market yesterday
• Problems “can” be addressed in versions 1.x
– Consumers accept low levels of usability
• “I’m so silly”
Two main types of evaluation
• Formative evaluation is done at different stages
of development to check that the product meets
users’ needs.
– Part of the user-centered design approach
– Supports design decisions at various stages
– May test parts of the system or alternative designs
• usability testing
• field studies
• predictive evaluation
Quick and dirty
• ‘quick & dirty’ evaluation describes the
common practice in which designers informally
get feedback from users or consultants to confirm
that their ideas are in line with users’ needs and
are liked.
• Quick & dirty evaluations are done any time.
• The emphasis is on fast input to the design
process rather than carefully documented
findings.
Usability testing
• Usability testing involves recording typical users’
performance on typical tasks in controlled
settings. Field observations may also be used.
• As the users perform these tasks they are watched
& recorded on video & their key presses are
logged.
• This data is used to calculate performance times,
identify errors & help explain why the users did
what they did.
• User satisfaction questionnaires & interviews are
used to elicit users’ opinions.
Usability testing
• It is very time consuming to conduct and analyze
– Explain the system, do some training
– Explain the task, do a mock task
– Questionnaires before and after the test & after each
task
– Pilot test is usually needed
• Insufficient number of subjects for ‘proper’
statistical analysis
• In laboratory conditions, subjects do not behave
exactly like in a normal environment
Field studies
• Field studies are done in natural settings
• The aim is to understand what users do naturally
and how technology impacts them.
• In product design field studies can be used to:
- identify opportunities for new technology
- determine design requirements
- decide how best to introduce new technology
- evaluate technology in use
Predictive evaluation
• Experts apply their knowledge of typical users,
often guided by heuristics, to predict usability
problems.
• Another approach involves theoretically based
models.
• A key feature of predictive evaluation is that users
need not be present
• Relatively quick & inexpensive
Overview of techniques
• Observing users
– Don’t interfere with the subjects !
• Asking users’ opinions
– Interviews, questionnaires
• Asking experts’ opinions
– Heuristics, role-playing; suggestions for solutions
Overview of techniques
• Testing users’ performance
– Time taken to complete a task, errors made, navigation path
– Satisfaction
• Modeling users’ task performance
– Appropriate for systems with limited functionality
– Make assumptions about the user’s typical, optimal, or poor behaviour
– Simulate the user and measure performance
Web Information Retrieval
Challenges
Approaches
Challenges
• Scale, distribution of documents
• Controversy over the unit of indexing
– What is a document ? (hypertext)
– What does the user expect to be retrieved ?
• High heterogeneity
– Document structure, size, quality, level of abstraction /
specialization
– User search or domain expertise, expectations
• Retrieval strategies
– What do people want ?
• Evaluation
Web documents / data
• No traditional collection
– Huge
• Time and space to crawl and index
• IRSs cannot store copies of documents
– Dynamic, volatile, anarchic, un-controlled
– Homogeneous sub-collections
• Structure
– In documents (un-/semi-/fully-structured)
– Between docs: network of inter-connected nodes
– Hyper-links - conceptual vs. physical documents
Web documents / data
• Mark-up
– HTML – look & feel
– XML – structure, semantics
– Dublin Core Metadata
– Can webpage authors be trusted to correctly mark up /
index their pages ?
• Multi-lingual documents
• Multi-media
Theoretical models for
indexing / searching
• Content-based weighting
– As in traditional IRS, but trying to incorporate
• hyperlinks
• the dynamic nature of the Web (page validity, page caching)
• Link-based weighting
– Quality of webpages
• Hubs & authorities
• Bookmarked pages
• Iterative estimation of quality
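A minimal sketch of iterative link-based quality estimation in the PageRank style, one common instance of the idea; the damping factor, iteration count and toy link graph are assumptions:

```python
# PageRank-style iterative estimation of page quality on a toy link graph.
def link_quality(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        rank = {
            p: (1 - damping) / len(pages)
               + damping * sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return rank

# a links to b and c, b links to c, c links back to a
links = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(link_quality(links))   # c and a accumulate the highest scores
```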
Architecture
• Centralized
– Main server contains the index, built by an indexer,
searched by a query engine
• Advantage: control, easy update
• Disadvantage: system requirements (memory, disk,
safety/recovery)
• Distributed
– Brokers & gatherers
• Advantage: flexibility, load balancing, redundancy
• Disadvantage: software complexity, update
User variability
• Power and flexibility for expert users vs.
intuitiveness and ease of use for novice users
• Multi-modal user interface
– Distinguish between experts and beginners, offer
distinct interfaces (functionality)
– Advantage: can make assumptions on users
– Disadvantage: habit formation, cognitive shift
• Uni-modal interface
– Make essential functionality obvious
– Make advanced functionality accessible
Search strategies
• Web directories
• Query-based searching
• Link-based browsing (provided by the browser,
not the IRS)
• “More like this”
• Known site (bookmarking)
• Use
– Thesauri
– Query expansion
User modelling
• Build a model / profile of the user by recording
– the ‘context’
– topics of interest
– preferences
based on interpreting the user’s actions:
– Implicit or explicit relevance feedback
– Recommendations from ‘peers’
– Customization of the environment
Personalised systems
• Information filtering
– Ex: in a TV guide only show programs of interest
• Information
– Data organized or presented in some context
• Knowledge
– Information read, heard or seen and understood
• Wisdom
– Distilled and integrated knowledge and understanding
Meaning vs. Form
• Meaning
– Indicates what the document is about, or the topic of the
document
– Requires intelligent interpretation by a human or artificial
intelligence techniques
• Form
– Refers to the content per se, i.e. the words that make up
the document
Data vs. Information Retrieval
• Problem cases
– Numbers: “M16”, “2001”
– Hyphenation: “MS-DOS”, “OS/2”
– Punctuation:“John’s”, “command.com”
– Case: “us”, “US”
– Phrases: “venetian blind”
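A naive lexical-analysis sketch in Python showing one set of choices for the problem cases above (the regular expression and the decision to lower-case are illustrative, not prescribed by the slides):

```python
# Naive tokenizer sketch for the "problem cases" listed above.
import re

# Keep alphanumeric runs, allowing internal hyphens, slashes, dots and apostrophes,
# so "MS-DOS", "OS/2", "command.com" and "John's" survive as single tokens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-/.'][A-Za-z0-9]+)*")

def tokenize(text, lowercase=True):
    # note: lower-casing collapses "US" and "us", one of the cases above
    tokens = TOKEN.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens

print(tokenize("John's PC runs MS-DOS and OS/2; see command.com (2001)."))
# ["john's", 'pc', 'runs', 'ms-dos', 'and', 'os/2', 'see', 'command.com', '2001']
```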
Stopwords
• Very frequent words, with no power of
discrimination
Stemming (conflation)
• Use in an IR system
– Replace each term by the class representative (root or
most common variant)
– Replace each word by all the variants in its class
Stemming errors
• Too aggressive
– organization / organ
– police / policy
– arm / army
– execute / executive
• Too timid
– european / europe
– create / creation
– search / searcher
– cylinder / cylindrical
Inverted files
[Diagram: inverted file structure: a B-tree search index over the dictionary of terms (term 1 … term N); each term points to its postings list of document identifiers, e.g. term 2 => 1, 2, 116]
Inverted files
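A minimal in-memory sketch of the structure: a dictionary of terms, each mapped to a sorted postings list of document identifiers. A production system would put a disk-based search structure such as the B-tree in the figure over the dictionary; the sample documents are invented:

```python
# In-memory inverted file sketch: dictionary of terms -> sorted postings lists.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1:   "inverted files map terms to documents",
    2:   "terms are stored in a dictionary",
    116: "postings lists hold document identifiers",
}
index = build_index(docs)
print(index["terms"])      # [1, 2]
print(index["postings"])   # [116]
```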