Evaluation of Information Retrieval Systems: Thanks To Marti Hearst, Ray Larson, Chris Manning
Retrieval Systems
• Performance evaluations
• Retrieval evaluation
• Quality of evaluation - Relevance
• Measures of Evaluation
– Precision vs. recall
– F measure
– others
• Test Collections/TREC
Performance of the IR or Search Engine
• Relevance
• Coverage
• Recency
• Functionality (e.g. query syntax)
• Speed
• Availability
• Usability
• Time/ability to satisfy user requests
• Basically “happiness”
A Typical Web Search Engine
(diagram: Web → Crawler → Indexer → Index → Query Engine → Interface → Users, with Evaluation attached to the components)
Evaluation Workflow
(diagram: the user's information need becomes text input and a query, which is pre-processed and ranked/matched against the collections to produce a SERP; evaluation of the SERP drives query reformulation)
(diagram: an information need (IN) is expressed as a query, the IR system retrieves docs, the retrieved docs are evaluated, and the loop either improves the query or ends when the IN is satisfied)
What does the user want?
Restaurant case
• The user wants to find a restaurant serving
sashimi. The user uses 2 IR systems. How can
we say which one is better?
Evaluation
• Why Evaluate?
• What to Evaluate?
• How to Evaluate?
Why Evaluate?
• Determine if the system is useful
• Make comparative assessments with other
methods/systems
– Who’s the best?
• Test and improve systems
• Marketing
• Others?
What to Evaluate?
– Recall
• proportion of relevant material actually retrieved
– Precision
• proportion of retrieved material actually relevant
Effectiveness!
How do we measure relevance?
• Measures of relevance:
– Binary measure
• 1 relevant
• 0 not relevant
– N-ary measure
• 3 very relevant
• 2 relevant
• 1 barely relevant
• 0 not relevant
– Negative values?
• N=? consistency vs. expressiveness tradeoff
Given: we have a relevance ranking
of documents
• Have some known relevance evaluation
– Query independent – based on information need
– Experts (or you)
• Apply binary measure of relevance
– 1 - relevant
– 0 - not relevant
• Put in a query
– Evaluate relevance of what is returned
• What comes back?
– Example: lion
Relevant vs. Retrieved Documents
(Venn diagram: the retrieved set and the relevant set within all docs available)
• Query to search
engine retrieves: D2,
D4, D5, D6, D8, D9
Contingency table of relevant and retrieved documents
                 Rel                NotRel
Retrieved        RelRetrieved       NotRelRetrieved
Not retrieved    RelNotRetrieved    NotRelNotRetrieved
(All docs = Rel + NotRel = Retrieved + Not retrieved)
Precision vs. Recall
Precision = | RelRetrieved | / | Retrieved |
Recall = | RelRetrieved | / | Rel in Collection |
(Venn diagram: the retrieved and relevant sets within all docs)
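As a quick illustration, here is a minimal Python sketch of the two formulas, reusing the retrieved set D2, D4, D5, D6, D8, D9 from the earlier example query; the relevant set is not given in the slides, so the one below is hypothetical:

```python
# Precision = |RelRetrieved| / |Retrieved|
# Recall    = |RelRetrieved| / |Rel in Collection|

retrieved = {"D2", "D4", "D5", "D6", "D8", "D9"}   # from the slide's example query
relevant = {"D1", "D4", "D5", "D9", "D10"}         # hypothetical relevance judgments

rel_retrieved = retrieved & relevant               # relevant documents that were retrieved

precision = len(rel_retrieved) / len(retrieved)    # 3/6 = 0.5
recall = len(rel_retrieved) / len(relevant)        # 3/5 = 0.6

print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")
```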
Retrieved vs. Relevant Documents
Very high precision, very low recall
(Venn diagram: a small retrieved set lying entirely inside the much larger relevant set)
Retrieved vs. Relevant Documents
High recall, but low precision
(Venn diagram: the retrieved set covers nearly all of the relevant set but also many non-relevant docs)
Retrieved vs. Relevant Documents
Very low precision, very low recall (0 for both)
(Venn diagram: the retrieved set and the relevant set do not overlap)
Retrieved vs. Relevant Documents
High precision, high recall (at last!)
(Venn diagram: the retrieved set and the relevant set almost coincide)
Why Precision and Recall?
Get as much of what we want while at the same time
getting as little junk as possible.
Recall is the percentage of all available relevant documents that are actually returned!
Precision is the percentage of returned documents that are relevant!
Retrieve D1 → have D1
Retrieve D2 → have D1, D2
Retrieve D3 → now have D1, D2, D3
Recall Plot
• Recall as more and more documents are retrieved.
• Why this shape?
Precision Plot
• Precision as more and more documents are retrieved.
• Note shape!
Precision/recall plot
• Sequences of points (p, r)
• Similar to y = 1 / x:
– Inversely proportional!
– Sawtooth shape - use smoothed graphs
• How can we compare systems?
Recall/Precision Curves
• There is a tradeoff between Precision and Recall
– So measure Precision at different levels of Recall
• Note: this is usually an AVERAGE over MANY queries
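One common way to realize this (a sketch of one convention, not necessarily the exact procedure behind the original figures) is interpolated precision at standard recall levels, where the interpolated precision at recall level r is the highest precision observed at any recall ≥ r:

```python
def interpolated_precision(ranked_rel, num_relevant, levels=None):
    """ranked_rel: 0/1 relevance flags in rank order for one query.
    num_relevant: total number of relevant docs in the collection for that query.
    Returns interpolated precision at each standard recall level."""
    if levels is None:
        levels = [i / 10 for i in range(11)]               # 0.0, 0.1, ..., 1.0
    hits, points = 0, []
    for rank, rel in enumerate(ranked_rel, start=1):
        hits += rel
        points.append((hits / num_relevant, hits / rank))  # (recall, precision) at this rank
    # Interpolated precision at level r = max precision at any recall >= r.
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

# Hypothetical run: 5 relevant docs in the collection, retrieved in the order Y N Y Y N Y N Y.
print(interpolated_precision([1, 0, 1, 1, 0, 1, 0, 1], num_relevant=5))
```

Averaging these values level-by-level over many queries gives the smoothed, averaged curve the slide refers to.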
(figure: a recall/precision graph; note that two separate entities are plotted on the x axis, recall and the number of documents retrieved, marked n1, n2, n3, n4, where ni is the number of documents retrieved and ni < ni+1)
Actual recall/precision curve for one query
Best versus worst retrieval
Precision/Recall Curves
• Sometimes difficult to determine which of these two
hypothetical results is better:
(figure: two hypothetical sets of precision/recall points)
Precision/Recall Curve Comparison
(figure: the best and the worst precision/recall curves)
Document Cutoff Levels
• Another way to evaluate:
– Fix the number of documents retrieved at several levels:
• top 5
• top 10
• top 20
• top 50
• top 100
• top 500
– Measure precision at each of these levels
– Take (weighted) average over results
• This is a way to focus on how well the system ranks the
first k documents.
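A sketch of this cutoff-based evaluation, assuming a run is represented as 0/1 relevance flags in rank order (the run and the unweighted averaging below are illustrative choices, not from the slides):

```python
def precision_at_k(ranked_rel, k):
    """Precision over the top k results; ranks beyond the run count as non-relevant."""
    top_k = ranked_rel[:k] + [0] * max(0, k - len(ranked_rel))
    return sum(top_k) / k

ranked_rel = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]        # hypothetical 0/1 relevance in rank order
cutoffs = [5, 10, 20, 50, 100, 500]
p_at_k = {k: precision_at_k(ranked_rel, k) for k in cutoffs}
average = sum(p_at_k.values()) / len(p_at_k)       # unweighted average over the cutoff levels
```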
Problems with Precision/Recall
• Can’t know true recall value (recall for the web?)
– except in small collections
• Precision/Recall are related
– A combined measure sometimes more appropriate
• Assumes batch mode
– Interactive IR is important and has different criteria for
successful searches
• Assumes a strict rank ordering matters.
Buckland & Gey, JASIS: Jan 1994
                       Relevant    Not relevant
Doc is retrieved          a             b
Doc is NOT retrieved      c             d
• Accuracy: (a+d) / (a+b+c+d)
• Precision: a/(a+b)
• Recall: a/(a+c)
• Why don’t we use Accuracy for IR?
– (Assuming a large collection)
• Most docs aren’t relevant
• Most docs aren’t retrieved
• Inflates the accuracy value
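A small sketch of the three measures computed from the table's cells, with made-up counts chosen so that the d cell dominates, as it does in a large collection:

```python
# Hypothetical contingency counts for a collection of one million documents:
# a = relevant & retrieved, b = not relevant & retrieved,
# c = relevant & not retrieved, d = not relevant & not retrieved.
a, b, c, d = 30, 70, 20, 999_880          # d dominates, as it does on the web

accuracy = (a + d) / (a + b + c + d)      # ~0.9999: inflated by the huge d cell
precision = a / (a + b)                   # 0.30
recall = a / (a + c)                      # 0.60
```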
The F-Measure
Combine Precision and Recall into one number
F = 2 / (1/R + 1/P) = 2RP / (R + P), where P = precision and R = recall
F is in [0,1]
F = 1 when every retrieved document is relevant and every relevant document has been retrieved (P = R = 1)
F = 0 when no relevant documents have been retrieved
Harmonic mean – an average of rates. AKA the F1 measure or F-score.
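A minimal sketch of the harmonic-mean formula above:

```python
def f_measure(precision, recall):
    """F = 2 / (1/R + 1/P) = 2RP / (R + P): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f_measure(0.30, 0.60)    # 0.40
```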
The E-Measure
Other ways to combine Precision and Recall into one
number (van Rijsbergen 79)
E = 1 - (1 + b^2) / (b^2/R + 1/P)

Equivalently, E = 1 - 1 / (a(1/P) + (1 - a)(1/R)), where a = 1/(b^2 + 1)

P = precision
R = recall
b = measure of relative importance of P or R
For example,
b = 0.5 means user is twice as interested in
precision as recall
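A sketch of the E-measure exactly as written above; the function name and the guard for zero precision or recall are my own choices:

```python
def e_measure(precision, recall, b=1.0):
    """E = 1 - (1 + b^2) / (b^2/R + 1/P); with b = 1 this equals 1 - F."""
    if precision == 0 or recall == 0:
        return 1.0                       # assumption: treat either being zero as the worst case
    return 1 - (1 + b ** 2) / (b ** 2 / recall + 1 / precision)

e_measure(0.30, 0.60, b=0.5)   # ~0.667: b = 0.5 weights precision more heavily than recall
```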
Interpret precision and recall
• Precision can be seen as a measure of exactness or fidelity
• Recall is a measure of completeness
• Inverse relationship between Precision and Recall, where it
is possible to increase one at the cost of reducing the other.
– For example, an information retrieval system (such as a search
engine) can often increase its Recall by retrieving more
documents, at the cost of an increasing number of irrelevant
documents retrieved (decreasing Precision).
– Similarly, a classification system for deciding whether or not, say,
a fruit is an orange, can achieve high Precision by only classifying
fruits with the exact right shape and color as oranges, but at the
cost of low Recall due to the number of false negatives from
oranges that did not quite match the specification.
Measures for Large-Scale Eval
• Typical user behavior in web search
systems has shown a preference for high
precision
• Also graded scales of relevance seem more
useful than just “yes/no”
• Measures have been devised to help
evaluate situations taking these into account
Rank-Based Measures
• Binary relevance
– Precision@K (P@K)
– Mean Average Precision (MAP)
– Mean Reciprocal Rank (MRR)
• Ex:
– Prec@3 of 2/3
– Prec@4 of 2/4
– Prec@5 of 3/5
• In similar fashion we have Recall@K
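The example numbers above are consistent with a ranking whose top five relevance flags are 1, 1, 0, 0, 1 (a hypothetical reconstruction, since the example figure is not reproduced here); a sketch of Prec@K and Recall@K over such a list:

```python
def prec_at_k(rels, k):
    """Fraction of the top k results that are relevant."""
    return sum(rels[:k]) / k

def recall_at_k(rels, k, num_relevant):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(rels[:k]) / num_relevant

rels = [1, 1, 0, 0, 1]   # hypothetical flags matching Prec@3 = 2/3, Prec@4 = 2/4, Prec@5 = 3/5
print(prec_at_k(rels, 3), prec_at_k(rels, 4), prec_at_k(rels, 5))
```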
Sec. 8.4
A precision-recall curve
(figure: precision on the y axis vs. recall on the x axis, both running from 0.0 to 1.0)
Mean Average Precision (MAP)
• Consider the rank position of each relevant doc: K1, K2, … KR
• Compute Precision@K at each of those positions
• Average Precision = average of those P@K values
• MAP is Average Precision averaged over many queries
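A sketch of Average Precision and MAP under those definitions; the two query runs below are made up for illustration:

```python
def average_precision(ranked_rel, num_relevant):
    """Average of Precision@K over the ranks K1..KR where relevant docs appear.
    Relevant documents never retrieved contribute zero (they are counted in num_relevant)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / num_relevant if num_relevant else 0.0

def mean_average_precision(runs):
    """MAP = mean of Average Precision over queries; runs = [(ranked_rel, num_relevant), ...]."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

runs = [([1, 0, 1, 0, 1], 3), ([0, 1, 0, 0, 1], 4)]   # two hypothetical queries
print(mean_average_precision(runs))
```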
Discounted Cumulative Gain (DCG)
• Two assumptions:
– Highly relevant documents are more useful than
marginally relevant documents
– The lower the ranked position of a relevant
document, the less useful it is for the user, since
it is less likely to be examined
Discounted Cumulative Gain
• Uses graded relevance as a measure of
usefulness, or gain, from examining a
document
• Gain is accumulated starting at the top of the
ranking and may be reduced, or discounted,
at lower ranks
• Typical discount is 1/log (rank)
– With base 2, the discount at rank 4 is 1/2, and at
rank 8 it is 1/3
Summarize a Ranking: DCG
• What if relevance judgments are in a scale of
[0,r]? r>2
• Cumulative Gain (CG) at rank n
– Let the ratings of the n documents be r1, r2, …rn (in
ranked order)
– CG = r1 + r2 + … + rn
• Discounted Cumulative Gain (DCG) at rank n
– DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
• We may use any base for the logarithm
Discounted Cumulative Gain
• DCG is the total gain accumulated at a particular rank p:
– DCGp = rel1 + Σ (i = 2..p) reli / log2(i)
• Alternative formulation:
– DCGp = Σ (i = 1..p) (2^reli − 1) / log2(1 + i)
DCG_GT  = 2 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 4.6309
DCG_RF1 = 2 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 4.6309
DCG_RF2 = 2 + 1/log2(2) + 2/log2(3) + 0/log2(4) = 4.2619
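A small sketch that reproduces the numbers above, using the slide's DCG form (the gain at rank 1 is taken as-is and later gains are discounted by 1/log2(rank)):

```python
import math

def dcg(gains):
    """DCG = g1 + sum over ranks i >= 2 of gi / log2(i), as in the formula above."""
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

print(round(dcg([2, 2, 1, 0]), 4))   # 4.6309  (the GT and RF1 rankings)
print(round(dcg([2, 1, 2, 0]), 4))   # 4.2619  (the RF2 ranking)
```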
Mean Reciprocal Rank
• Consider rank position, K, of first relevant doc
– Could be – only clicked doc
• Reciprocal Rank score = 1/K
• MRR is the Reciprocal Rank averaged over many queries
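A sketch of MRR over several queries, each represented by the rank K of its first relevant (or first clicked) result; treating a query with no relevant result as contributing 0 is an assumption, though a common convention:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR = mean of 1/K over queries, where K is the rank of the first relevant doc.
    None means no relevant document was retrieved for that query."""
    rr = [1 / k if k is not None else 0.0 for k in first_relevant_ranks]
    return sum(rr) / len(rr)

print(mean_reciprocal_rank([1, 3, None, 2]))   # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
```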
TREC
• Tracks change from year to year
TREC (cont.)
• Queries + Relevance Judgments
– Queries devised and judged by “Information Specialists”
– Relevance judgments done only for those documents
retrieved -- not entire collection!
• Competition
– Various research and commercial groups compete (TREC
6 had 51, TREC 7 had 56, TREC 8 had 66)
– Results judged on precision and recall, going up to a
recall level of 1000 documents
Sample TREC queries (topics)
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
A document will address the role of the Federal Government in
financing the operation of the National Railroad Transportation
Corporation (AMTRAK)
Matching methods:
Precision and recall are single numbers.
Ranking methods:
Precision and recall are functions of the rank order.
Graph of Precision with Ranking: P(r)
as we retrieve the 8 documents.
Relevant? Y N Y Y N Y N Y
(figure: precision P(r), from 0 to 1, plotted against rank r = 1 … 8)
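A small sketch computing P(r) for the Y/N sequence above:

```python
relevant_flags = [True, False, True, True, False, True, False, True]   # Y N Y Y N Y N Y

hits = 0
for rank, rel in enumerate(relevant_flags, start=1):
    hits += rel                                    # True counts as 1
    print(f"rank {rank}: P(r) = {hits}/{rank} = {hits / rank:.2f}")
```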
What does the user want?
Restaurant case
• The user wants to find a restaurant serving
sashimi. The user uses 2 IR systems. How can
we say which one is better?
Human judgments are
• Expensive
• Inconsistent
– Between raters
– Over time
• Decay in value as documents/query mix evolves
• Not always representative of “real users”
– Rating vis-à-vis query, vs underlying need
• So – what alternatives do we have?
USING USER CLICKS
What do clicks tell us?
(figure: ranked results for a query such as "SVM", including SVM-light and SVMs, annotated with the number of clicks each result received and the user's click sequence)
Interleave the two rankings
This interleaving starts with B
(figure: results from rankings A and B, e.g. Kernel machines, SVMs, SVM-light, Intro to SVMs, Archives of SVM, merged alternately)
Remove duplicate results
(figure: the same interleaved list with duplicate results removed)
Count user clicks
(figure: the interleaved list with user clicks marked; Kernel machines is credited to both A and B, SVM-light to A)
Clicks: Ranking A: 3, Ranking B: 1
Interleaved ranking
• Present interleaved ranking to users
– Start randomly with ranking A or ranking B to even
out presentation bias
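A sketch in the spirit of team-draft interleaving (a simplified version, not necessarily the exact procedure behind the slides; the function names and the click-credit convention are my own):

```python
import random

def interleave(ranking_a, ranking_b):
    """Team-draft-style interleaving (simplified): the ranking with fewer picks so far
    (ties broken at random, which evens out presentation bias) contributes its
    highest-ranked document not already shown. Returns the merged list and a map
    from each shown document to the ranking credited for a click on it."""
    rankings = {"A": list(ranking_a), "B": list(ranking_b)}
    merged, credit, picks = [], {}, {"A": 0, "B": 0}
    while any(doc not in credit for r in rankings.values() for doc in r):
        side = min(("A", "B"), key=lambda s: (picks[s], random.random()))
        remaining = [doc for doc in rankings[side] if doc not in credit]
        if not remaining:                            # this ranking is exhausted
            side = "B" if side == "A" else "A"
            remaining = [doc for doc in rankings[side] if doc not in credit]
        doc = remaining[0]                           # its best not-yet-shown result
        merged.append(doc)
        credit[doc] = side
        picks[side] += 1
    return merged, credit

def count_clicks(clicked_docs, credit):
    """Tally clicks per contributing ranking, e.g. Ranking A: 3, Ranking B: 1."""
    counts = {"A": 0, "B": 0}
    for doc in clicked_docs:
        counts[credit[doc]] += 1
    return counts

# Hypothetical orderings built from the slide's result titles:
a = ["Kernel machines", "SVMs", "Intro to SVMs", "Archives of SVM"]
b = ["Kernel machines", "SVM-light", "SVMs"]
merged, credit = interleave(a, b)
print(count_clicks(["Kernel machines", "SVMs", "SVM-light"], credit))
```

Presenting the merged list to users and feeding their clicked documents to count_clicks yields tallies like the slide's "Ranking A: 3, Ranking B: 1".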
Sec. 8.6.3
A/B testing at web search
engines
• Purpose: Test a single innovation
Facts/entities (what happens to
clicks?)
RECAP
For ad hoc IR evaluation, need:
1. A document collection
2. A test suite of information needs,
expressible as queries
3. A set of relevance judgments, standardly a
binary assessment of either relevant or
nonrelevant for each query-document pair.
Precision/Recall
• You can get high recall (but low precision)
by retrieving all docs for all queries!
• Recall is a non-decreasing function of the
number of docs retrieved