Intro IR
– An Introduction –
Gheorghe Muresan
Oct 16, 2002
What is Information Retrieval ?
• What is information ?
• Relevance
Disclaimer
• Relevance and other key concepts in IR were
discussed in the previous class, so we won’t do it
again.
– We’ll take a simple view: a document is relevant if it
is about the searcher’s topic of interest
[Diagram: query and document representations are matched to produce results]
What do we want from an IRS ?
• Systemic approach
– Goal (for a known information need):
Return as many relevant documents as possible and as
few non-relevant documents as possible
• Cognitive approach
– Goal (in an interactive information-seeking
environment, with a given IRS):
Support the user’s exploration of the problem
domain and the task completion.
The role of an IR system
– a modern view –
• Support the user in
– exploring a problem domain, understanding its
terminology, concepts and structure
– clarifying, refining and formulating an information
need
– finding documents that match the info need
description
• As many relevant docs as possible
• As few non-relevant documents as possible
How does it do this ?
• User interfaces and visualization tools for
– exploring a collection of documents
– exploring search results
• Query expansion based on
– Thesauri
– Lexical/statistical analysis of text / context and concept
formation
– Relevance feedback
• Indexing and matching model
How well does it do this ?
• Evaluation
– Of the components
• Indexing / matching algorithms
– Of the exploratory process overall
• Usability issues
• Usefulness to task
• User satisfaction
Role of the user interface in IR
[Diagram: the information access process: INPUT (problem definition, source selection, problem articulation) => Engine => OUTPUT (examination of results, extraction of information)]
• Principle:
– Overview first
– Zoom
– Details on demand
• Usability issues
– Direct manipulation
– Dynamic, implicit queries
Information Visualization tools
• Repositories
– University of Maryland HCIL
• http://www.cs.umd.edu/projects/hcil
– InfoViz repository
• http://fabdo.fh-potsdam.de/infoviz/repository.html
• Hyperbolic trees
• Themescapes
• Workscapes
• Fisheye view
Faceted organization
• Each document is described by a set of attribute
(or facet) values
• Example:
– FilmFinder, HomeFinder
– Film
• Attributes (facets): Title, Year, Popularity, Director, Actors
Role of structure:
• support for exploration (browsing / searching)
• support for term disambiguation
• potential for efficient retrieval
• Document representative
– Select features to characterize document: terms,
phrases, citations
– Select weighting scheme for these features:
• Binary, raw/relative frequency, divergence measure
• Title / body / abstract, controlled vocabulary, selected topics,
taxonomy
• Similarity / association coefficient or
dissimilarity / distance metric
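As an illustration of the feature-weighting choices listed above, here is a minimal Python sketch (not from the slides; the sample text is invented) computing binary, raw-frequency and relative-frequency weights for one document:

```python
# Sketch of three term-weighting schemes for a document representative.
# The document text below is an illustrative assumption.
from collections import Counter

doc = "information retrieval systems index and retrieve information"
terms = doc.lower().split()        # naive feature extraction: single terms
counts = Counter(terms)

binary   = {t: 1 for t in counts}                         # present / absent
raw_freq = dict(counts)                                   # raw occurrence counts
rel_freq = {t: c / len(terms) for t, c in counts.items()} # relative to doc length

print(raw_freq["information"], round(rel_freq["information"], 2))  # 2 0.29
```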
Similarity coefficients
• Simple matching
$|X \cap Y| = \sum_i x_i y_i$
• Dice’s coefficient
$\frac{2\,|X \cap Y|}{|X| + |Y|} = \frac{2\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2}$
• Cosine coefficient
$\frac{|X \cap Y|}{|X|^{1/2}\,|Y|^{1/2}} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$
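A minimal Python sketch of the three coefficients over weighted term vectors; the vectors at the bottom are illustrative, and with binary (0/1) weights the expressions reduce to the set forms above:

```python
# Simple matching, Dice and cosine coefficients for term-weight vectors.
import math

def simple_matching(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))           # sum_i x_i * y_i

def dice(x, y):
    denom = sum(xi * xi for xi in x) + sum(yi * yi for yi in y)
    return 2 * simple_matching(x, y) / denom if denom else 0.0

def cosine(x, y):
    denom = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return simple_matching(x, y) / denom if denom else 0.0

doc_vec   = [1, 1, 0, 1]   # toy binary term weights
query_vec = [1, 0, 0, 1]
print(simple_matching(doc_vec, query_vec),     # 2
      round(dice(doc_vec, query_vec), 3),      # 0.8
      round(cosine(doc_vec, query_vec), 3))    # 0.816
```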
Clustering methods
• Non-hierarchic methods
=> partitions
– High efficiency, low effectiveness
• Hierarchic methods
=> hierarchic structures - small clusters of highly
similar documents nested within larger clusters of
less similar documents
– Divisive => monothetic classifications
– Agglomerative => polythetic classifications !!
Partitioning method
• Generic procedure:
– The first object becomes the first cluster
– Each subsequent object is matched against existing
clusters
• It is assigned to the most similar cluster if the similarity
measure is above a set threshold
• Otherwise it forms a new cluster
– Re-shuffling of documents into clusters can be done
iteratively to increase cluster similarity
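A sketch of this generic single-pass procedure in Python, assuming vectors of term weights, a similarity function such as the cosine sketch above, and an illustrative threshold:

```python
# Single-pass partitioning sketch: each object joins its most similar cluster
# (compared against the cluster centroid) if similarity clears the threshold,
# otherwise it starts a new cluster. The threshold value is an assumption.
def single_pass(vectors, similarity, threshold=0.5):
    clusters, centroids = [], []          # members and running centroid per cluster
    for v in vectors:
        best, best_sim = None, -1.0
        for i, centroid in enumerate(centroids):
            s = similarity(v, centroid)
            if s > best_sim:
                best, best_sim = i, s
        if best is not None and best_sim >= threshold:
            clusters[best].append(v)
            n = len(clusters[best])       # incremental centroid update
            centroids[best] = [(c * (n - 1) + x) / n
                               for c, x in zip(centroids[best], v)]
        else:
            clusters.append([v])
            centroids.append(list(v))
    return clusters
```

The re-shuffling mentioned on the slide would simply repeat the assignment loop over the finished centroids until assignments stabilize.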
HACMs (Hierarchic Agglomerative Clustering Methods)
• Generic procedure:
– Each doc to be clustered is a singleton cluster
– While there is more than one cluster, the clusters with
maximum similarity are merged and the similarities
recomputed
• A method is defined by the similarity measure
between non-singleton clusters
• Algorithms for each method differ in:
– Space (store similarity matrix ? all of it ?)
– Time (use all similarities ? use inverted files ?)
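A naive sketch of the generic agglomerative loop; single-link (maximum pairwise similarity) is used as the inter-cluster measure here, but that choice is exactly what distinguishes one HACM from another:

```python
# Naive hierarchic agglomerative clustering: O(n^3), recomputing similarities
# at every step rather than storing the full similarity matrix.
def single_link(c1, c2, sim):
    return max(sim(a, b) for a in c1 for b in c2)

def hacm(items, sim, cluster_sim=single_link):
    clusters = [[x] for x in items]       # every document starts as a singleton
    merges = []                           # records the hierarchy bottom-up
    while len(clusters) > 1:
        # find the pair of clusters with maximum similarity and merge them
        i, j = max(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]], sim))
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i] += clusters[j]
        del clusters[j]
    return merges
```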
Representation of clustered
hierarchies
Scatter/Gather
• How it works
– Cluster sets of documents into general “themes”, like a table of
contents
– Display the contents of the clusters by showing topical terms
and typical titles
– User chooses subsets of the clusters and re-clusters the
documents within
– Resulting new groups have different “themes”
Document Space
Kohonen Feature Maps on Text
Search strategies
• Analytical strategy (mostly querying)
– Analyze the attributes of the information need and of the problem
domain (mental model)
• Browsing
– Follow leads by association (not much planning)
• Similarity strategy
– “more like this”
Non-search activities
• Annotating or summarizing
• Analysis
– Finding trends
– Making comparisons
– Aggregating information
– Identifying a critical subset
IRS design trade-offs
(high-level)
• General
– Easy to learn (“walk up and use”)
• Intuitive
• Standardized look-and-feel and functionality
– Simple and easy to use
– Deterministic and restrictive
• Specialized
– Complex, require training (course, tutorial)
– Increased functionality
– Customizable, non-deterministic
Query specification
• Boolean vs. free text
• Phrases / proximity
• Example:
– How does each apply to Boolean queries ?
Form-Based Query Specification
(Altavista)
Form-based Query Specification
(Infoseek)
Direct Manipulation Spec.
VQUERY (Jones 98)
Menu-based Query Specification
(Young & Shneiderman 93)
Putting Results in Context
• Interfaces should
– give hints about the roles terms play in the collection
– give hints about what will happen if various terms are
combined
– show explicitly why documents are retrieved in
response to the query
– summarize compactly the subset of interest
KWIC (Keyword in Context)
• An old standard, ignored by internet search engines
– used in some intranet engines, e.g., Cha-Cha
TileBars
The formalized IR process
[Diagram: the real world and the user’s anomalous state of knowledge are represented as documents and queries, which are matched to produce results]
Indexing
• Automatic
– The system extracts “typical”/ “significant” terms
– The human may contribute by setting the parameters or
thresholds, or by choosing components or algorithms
• Semi-automatic
– The system’s contribution may be support in the form of word lists,
thesauri, reference systems, etc., whether or not this follows automatic
processing of the text
Manual vs. automatic indexing
• Manual
– Slow and expensive
– Is based on intellectual judgment and semantic
interpretation (concepts, themes)
– Low consistency
• Automatic
– Fast and inexpensive
– Mechanical execution of algorithms, with no intelligent
interpretation (aboutness / relevance)
– Consistent
Vocabulary
• Vocabulary (indexing language)
– The set of concepts (terms or phrases) that can be used to
index documents in a collection
• Controlled
– Specific for specialized domains
– Potential for increased consistency of indexing and precision
of retrieval
• Un-controlled (free)
– Potentially all the terms in the documents
– Potential for increased recall
Thesauri
• Capture relationships between indexing terms
– Hierarchical
– Synonymous
– Related
• Creation of thesauri
– Manual vs. automatic
• Use of thesauri
– In manual / semi-automatic / automatic fashion
– Syntagmatic co-ordination / thesaurus-based query
expansion during indexing / searching
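A small sketch of thesaurus-based query expansion at search time; the synonym table is a toy assumption standing in for a real thesaurus:

```python
# Thesaurus-based query expansion sketch (toy relationships, not a real thesaurus).
THESAURUS = {
    "car":  ["automobile", "vehicle"],      # synonymous / related terms
    "bank": ["financial institution"],
}

def expand_query(terms):
    expanded = list(terms)
    for t in terms:
        expanded.extend(THESAURUS.get(t.lower(), []))
    return expanded

print(expand_query(["car", "loan"]))
# ['car', 'loan', 'automobile', 'vehicle']
```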
Query indexing
• Search systems
– Automatic indexing
– Synchronization with indexing of documents
(vocabulary, algorithms, etc)
• Empiricist approach
– Statistical Language Processing
– Estimate probabilities of linguistic events: words,
phrases, sentences (Shannon)
– Inexpensive, yet often as effective as deeper linguistic analysis
Automatic indexing
• There is no “best solution”
[Diagram: automatic indexing pipeline: lexical analysis => stopword removal => stemming => data structure representation]
Term significance
Word occurrence frequency is a measure of the significance of terms and of
their discriminatory power (see the Brown corpus).
[Figure: term frequency distribution, with the mid-frequency range marked as the significant terms]
Retrieval models
• Probabilistic
– Rank documents based on the estimated probability that they
are relevant to the query (derived from term counts)
• Language models
– Rank documents based on the estimated probability that the
query is a random sample of document words
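A minimal sketch of the language-model view: rank documents by the probability that the query terms were drawn from the document’s unigram word distribution. The add-one smoothing and toy documents are assumptions made for the example:

```python
# Query-likelihood ranking sketch with add-one (Laplace) smoothing.
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, vocab_size):
    counts, n = Counter(doc_terms), len(doc_terms)
    # log P(query | document) under a smoothed unigram model
    return sum(math.log((counts[t] + 1) / (n + vocab_size)) for t in query_terms)

docs = {
    "d1": "information retrieval ranks documents by estimated relevance".split(),
    "d2": "the weather today is sunny and warm".split(),
}
vocab = {t for terms in docs.values() for t in terms}
query = "information relevance".split()

ranking = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d], len(vocab)),
                 reverse=True)
print(ranking)   # ['d1', 'd2']: d1 contains the query terms, so it ranks first
```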
Ranked retrieval
• The documents are ranked based on their score
• Advantages
– Query easy to specify
– The output is ranked based on the estimated relevance
of the documents to the query
– A wide variety of theoretical models exist
• Disadvantages
– Query less precise (although weighting can be used)
Boolean retrieval
• Documents are retrieved based on whether or not they
contain the query terms
• Advantages
– Very precise queries can be specified
– Very easy to implement (in the simple form)
• Disadvantages
– Specifying the query may be difficult for casual users
– Lack of control over the size of the retrieved set
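A sketch of Boolean retrieval in its simple form: every term maps to the set of documents containing it, and AND / OR / NOT become set operations. The postings below are an illustrative assumption:

```python
# Boolean retrieval sketch over toy postings sets.
postings = {
    "information": {1, 2, 4},
    "retrieval":   {1, 4, 5},
    "weather":     {3, 5},
}
all_docs = {1, 2, 3, 4, 5}

# information AND retrieval AND NOT weather
result = postings["information"] & postings["retrieval"] & (all_docs - postings["weather"])
print(sorted(result))   # [1, 4]
```

Note that the result is an unranked set, which is the root of the lack of control over the size of the retrieved set mentioned above.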
IR Evaluation
• Why evaluate ?
– “Quality”
• What to evaluate ?
– Qualitative vs. quantitative measures
• How to evaluate ?
– Experimental design; result analysis
• User-centered approach
– User part of the system, interacting with other
components, trying to resolve an anomalous state of
knowledge
– Task-oriented evaluation
Aspects to evaluate
[Diagram: the information access process: INPUT (problem definition, source selection, problem articulation) => Engine => OUTPUT (examination of results, extraction of information)]
• Operational
– More or less “real” users
– Real or inferred information needs
– Realism
The traditional (lab) IR
experiment
• To start with you need:
– An IR system (or two)
– A collection of documents
– A collection of requests
– Relevance judgements
[Venn diagram: the set of retrieved documents and the set of relevant documents overlap within the collection of all documents]
Precision vs. Recall
$\text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|} \qquad \text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|}$
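A tiny worked example of the two measures (the document identifiers are invented):

```python
# Precision and recall on toy relevance judgements.
retrieved = {1, 2, 3, 4, 5}          # what the system returned
relevant  = {2, 4, 6, 8}             # judged relevant in the collection

rel_retrieved = retrieved & relevant
precision = len(rel_retrieved) / len(retrieved)   # 2 / 5 = 0.4
recall    = len(rel_retrieved) / len(relevant)    # 2 / 4 = 0.5
print(precision, recall)
```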
Interactive system’s evaluation
• Definition:
Evaluation = the process of systematically
collecting data that informs us about what it is
like for a particular user or group of users to
use a product/system for a particular task in
a certain type of environment.
Problems
• Attitudes:
– Designers assume that if they and their colleagues can
use the system and find it attractive, others will too
• Features vs. usability or security
– Executives want the product on the market yesterday
• Problems “can” be addressed in versions 1.x
– Consumers accept low levels of usability
• “I’m so silly”
Two main types of evaluation
• Formative evaluation is done at different stages
of development to check that the product meets
users’ needs.
– Part of the user-centered design approach
– Supports design decisions at various stages
– May test parts of the system or alternative designs
• usability testing
• field studies
• predictive evaluation
Quick and dirty
• ‘quick & dirty’ evaluation describes the
common practice in which designers informally
get feedback from users or consultants to confirm
that their ideas are in line with users’ needs and
are liked.
• Quick & dirty evaluations are done any time.
• The emphasis is on fast input to the design
process rather than carefully documented
findings.
Usability testing
• Usability testing involves recording typical users’
performance on typical tasks in controlled
settings. Field observations may also be used.
• As the users perform these tasks they are watched
& recorded on video & their key presses are
logged.
• This data is used to calculate performance times,
identify errors & help explain why the users did
what they did.
• User satisfaction questionnaires & interviews are
used to elicit users’ opinions.
Usability testing
• It is very time consuming to conduct and analyze
– Explain the system, do some training
– Explain the task, do a mock task
– Questionnaires before and after the test & after each
task
– Pilot test is usually needed
• Insufficient number of subjects for ‘proper’
statistical analysis
• In laboratory conditions, subjects do not behave
exactly like in a normal environment
Field studies
• Field studies are done in natural settings
• The aim is to understand what users do naturally
and how technology impacts them.
• In product design field studies can be used to:
- identify opportunities for new technology
- determine design requirements
- decide how best to introduce new technology
- evaluate technology in use
Predictive evaluation
• Experts apply their knowledge of typical users,
often guided by heuristics, to predict usability
problems.
• Another approach involves theoretically based
models.
• A key feature of predictive evaluation is that users
need not be present
• Relatively quick & inexpensive
Overview of techniques
• Observing users
– Don’t interfere with the subjects !
• Asking users’ opinions
– Interviews, questionnaires
• Asking experts’ opinions
– Heuristics, role-playing; suggestions for solutions
Overview of techniques
• Testing users’ performance
– Time taken to complete a task, errors made, navigation path
– Satisfaction
• Modeling users’ task performance
– Appropriate for systems with limited functionality
– Make assumptions about the user’s typical, optimal, or poor behaviour
– Simulate the user and measure performance
Web Information Retrieval
Challenges
Approaches
Challenges
• Scale, distribution of documents
• Controversy over the unit of indexing
– What is a document ? (hypertext)
– What does the user expect to be retrieved ?
• High heterogeneity
– Document structure, size, quality, level of abstraction /
specialization
– User search or domain expertise, expectations
• Retrieval strategies
– What do people want ?
• Evaluation
Web documents / data
• No traditional collection
– Huge
• Time and space to crawl and index
• IRSs cannot store copies of documents
– Dynamic, volatile, anarchic, un-controlled
– Homogeneous sub-collections
• Structure
– In documents (un-/semi-/fully-structured)
– Between docs: network of inter-connected nodes
– Hyper-links - conceptual vs. physical documents
Web documents / data
• Mark-up
– HTML – look & feel
– XML – structure, semantics
– Dublin Core Metadata
– Can webpage authors be trusted to correctly mark up /
index their pages ?
• Multi-lingual documents
• Multi-media
Theoretical models for
indexing / searching
• Content-based weighting
– As in traditional IRS, but trying to incorporate
• hyperlinks
• the dynamic nature of the Web (page validity, page caching)
• Link-based weighting
– Quality of webpages
• Hubs & authorities
• Bookmarked pages
• Iterative estimation of quality
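A minimal sketch of iterative link-based quality estimation in the PageRank style, one common instance of the idea; the damping factor, iteration count and toy link graph are assumptions:

```python
# PageRank-style iterative estimation of page quality on a toy link graph.
def link_quality(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        rank = {
            p: (1 - damping) / len(pages)
               + damping * sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return rank

# a links to b and c, b links to c, c links back to a
links = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
print(link_quality(links))   # c and a accumulate the highest scores
```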
Architecture
• Centralized
– Main server contains the index, built by an indexer,
searched by a query engine
• Advantage: control, easy update
• Disadvantage: system requirements (memory, disk,
safety/recovery)
• Distributed
– Brokers & gatherers
• Advantage: flexibility, load balancing, redundancy
• Disadvantage: software complexity, update
User variability
• Power and flexibility for expert users vs.
intuitiveness and ease of use for novice users
• Multi-modal user interface
– Distinguish between experts and beginners, offer
distinct interfaces (functionality)
– Advantage: can make assumptions on users
– Disadvantage: habit formation, cognitive shift
• Uni-modal interface
– Make essential functionality obvious
– Make advanced functionality accessible
Search strategies
• Web directories
• Query-based searching
• Link-based browsing (provided by the browser,
not the IRS)
• “More like this”
• Known site (bookmarking)
• Use
– Thesauri
– Query expansion
User modelling
• Build a model / profile of the user by recording
– the ‘context’
– topics of interest
– preferences
based on interpreting the user’s actions:
– Implicit or explicit relevance feedback
– Recommendations from ‘peers’
– Customization of the environment
Personalised systems
• Information filtering
– Ex: in a TV guide only show programs of interest
• Information
– Data organized or presented in some context
• Knowledge
– Information read, heard or seen and understood
• Wisdom
– Distilled and integrated knowledge and understanding
Meaning vs. Form
• Meaning
– Indicates what the document is about, or the topic of the
document
– Requires intelligent interpretation by a human or artificial
intelligence techniques
• Form
– Refers to the content per se, i.e. the words that make up
the document
Data vs. Information Retrieval
• Problem cases
– Numbers: “M16”, “2001”
– Hyphenation: “MS-DOS”, “OS/2”
– Punctuation:“John’s”, “command.com”
– Case: “us”, “US”
– Phrases: “venetian blind”
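A naive lexical-analysis sketch in Python showing one set of choices for the problem cases above (the regular expression and the decision to lower-case are illustrative, not prescribed by the slides):

```python
# Naive tokenizer sketch for the "problem cases" listed above.
import re

# Keep alphanumeric runs, allowing internal hyphens, slashes, dots and apostrophes,
# so "MS-DOS", "OS/2", "command.com" and "John's" survive as single tokens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-/.'][A-Za-z0-9]+)*")

def tokenize(text, lowercase=True):
    # note: lower-casing collapses "US" and "us", one of the cases above
    tokens = TOKEN.findall(text)
    return [t.lower() for t in tokens] if lowercase else tokens

print(tokenize("John's PC runs MS-DOS and OS/2; see command.com (2001)."))
# ["john's", 'pc', 'runs', 'ms-dos', 'and', 'os/2', 'see', 'command.com', '2001']
```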
Stopwords
• Very frequent words, with no power of
discrimination
Stemming (conflation)
• Use in an IR system
– Replace each term by the class representative (root or
most common variant)
– Replace each word by all the variants in its class
Stemming errors
• Too aggressive
– organization / organ
– police / policy
– arm / army
– execute / executive
• Too timid
– european / europe
– create / creation
– search / searcher
– cylinder / cylindrical
Inverted files
[Diagram: inverted file structure: a B-tree search index over the dictionary of terms (term 1 … term N); each term points to its postings list of document identifiers, e.g. term 2 => 1, 2, 116]
Inverted files
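A minimal in-memory sketch of the structure: a dictionary of terms, each mapped to a sorted postings list of document identifiers. A production system would put a disk-based search structure such as the B-tree in the figure over the dictionary; the sample documents are invented:

```python
# In-memory inverted file sketch: dictionary of terms -> sorted postings lists.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1:   "inverted files map terms to documents",
    2:   "terms are stored in a dictionary",
    116: "postings lists hold document identifiers",
}
index = build_index(docs)
print(index["terms"])      # [1, 2]
print(index["postings"])   # [116]
```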