
Evaluation of Information Retrieval Systems
Thanks to Marti Hearst, Ray Larson, Chris Manning


Previously
• Big O - growth of work
• Systems need to scale with the amount of work n
• For large systems even O(n²) can be deadly
Previously
• Size of information
– Continues to grow
• IR is an old field, going back to the ’40s
• IR is an iterative process
• The search engine is the most popular information retrieval model
• New ones are still being built
Previously
• Concept of a document
• Collections of documents
• Static vs dynamic
– Corpus
– Repository
– Web
• Focus on text
Evaluation of IR Systems

• Performance evaluations
• Retrieval evaluation
• Quality of evaluation - Relevance
• Measurements of Evaluation
– Precision vs recall
– F measure
– others
• Test Collections/TREC
Performance of the IR or Search Engine
• Relevance
• Coverage
• Recency
• Functionality (e.g. query syntax)
• Speed
• Availability
• Usability
• Time/ability to satisfy user requests

• Basically “happiness”
A Typical Web Search Engine
[Diagram] Users interact through an Interface with the Query Engine, which searches an Index built by the Indexer from pages gathered by the Crawler from the Web; Evaluation spans the whole pipeline.

[Diagram] IR pipeline: the user's Information Need becomes text input; the query is parsed and matched/ranked against an Index built from pre-processed Collections; results appear on the SERP; Evaluation drives Query Reformulation.
Evaluation Workflow
[Diagram] An Information Need (IN) is expressed as a Query; the IR system retrieves Docs; Evaluation follows; either the IN is satisfied or the Query is improved and the loop repeats.
What does the user want?
Restaurant case
• The user wants to find a restaurant serving sashimi. The user uses 2 IR systems. How can we say which one is better?
Evaluation
• Why Evaluate?
• What to Evaluate?
• How to Evaluate?
Why Evaluate?
• Determine if the system is useful
• Make comparative assessments with other
methods/systems
– Who’s the best?
• Test and improve systems
• Marketing
• Others?
What to Evaluate?

• How much of the information need is satisfied.


• How much was learned about a topic.
• Incidental learning:
– How much was learned about the collection.
– How much was learned about other topics.
• How easy the system is to use.
• Usually based on what documents we retrieve
Relevance as a Measure
Relevance is everything!
• How relevant is the document retrieved
– for the user’s information need.
• Subjective, but one assumes it’s measurable
• Measurable to some extent
– How often do people agree a document is relevant to a query
• More often than expected
• How well does it answer the question?
– Complete answer? Partial?
– Background Information?
– Hints for further exploration?
What to Evaluate?
What can be measured that reflects users’ ability to use a system? (Cleverdon 66)
– Coverage of Information
– Form of Presentation
– Effort required/Ease of Use
– Time and Space Efficiency
– Effectiveness

– Recall
• proportion of relevant material actually retrieved
– Precision
• proportion of retrieved material actually relevant

Effectiveness!
How do we measure relevance?
• Measures of relevance:
– Binary measure
• 1 relevant
• 0 not relevant
– N-ary measure
• 3 very relevant
• 2 relevant
• 1 barely relevant
• 0 not relevant
– Negative values?
• N=? consistency vs. expressiveness tradeoff
Given: we have a relevance ranking
of documents
• Have some known relevance evaluation
– Query independent – based on information need
– Experts (or you)
• Apply binary measure of relevance
– 1 - relevant
– 0 - not relevant
• Put in a query
– Evaluate relevance of what is returned
• What comes back?
– Example: lion
Relevant vs. Retrieved Documents
[Diagram] Set theoretic approach: the Retrieved set and the Relevant set are overlapping subsets of all docs available.
Contingency table of relevant and retrieved documents

                          relevant
                    Rel            NotRel
retrieved  Ret      RetRel         RetNotRel       Ret = RetRel + RetNotRel
           NotRet   NotRetRel      NotRetNotRel    NotRet = NotRetRel + NotRetNotRel

Relevant = RetRel + NotRetRel        Not Relevant = RetNotRel + NotRetNotRel
Total # of documents available N = RetRel + NotRetRel + RetNotRel + NotRetNotRel

• Precision: P = RetRel / Ret          P ∈ [0,1]
• Recall:    R = RetRel / Relevant     R ∈ [0,1]
Contingency table of classification of documents

                           Actual Condition
                       Present               Absent
Test      Positive     tp                    fp (type 1 error)
result    Negative     fn (type 2 error)     tn

present = tp + fn          absent = fp + tn
positives = tp + fp        negatives = fn + tn
Total # of cases N = tp + fp + fn + tn

• Precision: P = tp / (tp + fp)
• Recall:    R = tp / (tp + fn)
• False positive rate = fp / (fp + tn)
• False negative rate = fn / (tp + fn)
Example
• Documents available: D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
• Relevant: D1, D4, D5, D8, D10
• Query to search engine retrieves: D2, D4, D5, D6, D8, D9

                  relevant       not relevant
retrieved         D4, D5, D8     D2, D6, D9
not retrieved     D1, D10        D3, D7
Contingency table of relevant and retrieved documents

                          relevant
                    Rel               NotRel
retrieved  Ret      RetRel = 3        RetNotRel = 3       Ret = 3 + 3 = 6
           NotRet   NotRetRel = 2     NotRetNotRel = 2    NotRet = 2 + 2 = 4

Relevant = RetRel + NotRetRel = 3 + 2 = 5
Not Relevant = RetNotRel + NotRetNotRel = 3 + 2 = 5
Total # of docs N = RetRel + NotRetRel + RetNotRel + NotRetNotRel = 10

• Precision: P = RetRel / Ret = 3/6 = 0.5          P ∈ [0,1]
• Recall:    R = RetRel / Relevant = 3/5 = 0.6     R ∈ [0,1]
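
A minimal Python sketch of the set-based computation above, using the document IDs from the example:

```python
# Set-based precision and recall for the worked example above.
relevant = {"D1", "D4", "D5", "D8", "D10"}
retrieved = {"D2", "D4", "D5", "D6", "D8", "D9"}

ret_rel = relevant & retrieved               # RetRel = {D4, D5, D8}

precision = len(ret_rel) / len(retrieved)    # 3/6 = 0.5
recall = len(ret_rel) / len(relevant)        # 3/5 = 0.6
print(precision, recall)
```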
What do we really want
• Find everything relevant – high recall
• Only retrieve those – high precision
Relevant vs. Retrieved
[Diagram] The Retrieved set and the Relevant set overlap within the set of all docs.

Precision vs. Recall
Precision = |RelRetrieved| / |Retrieved|
Recall = |RelRetrieved| / |Rel in Collection|
Retrieved vs. Relevant Documents
[Diagram] Very high precision, very low recall: a small retrieved set sitting entirely inside the relevant set.

Retrieved vs. Relevant Documents
[Diagram] High recall, but low precision: a large retrieved set that covers the relevant set along with many non-relevant docs.

Retrieved vs. Relevant Documents
[Diagram] Very low precision, very low recall (0 for both): the retrieved set does not intersect the relevant set.

Retrieved vs. Relevant Documents
[Diagram] High precision, high recall (at last!): the retrieved set nearly coincides with the relevant set.
Why Precision and Recall?
Get as much of what we want while at the same time
getting as little junk as possible.
Recall is the percentage of relevant documents returned
compared to everything that is available!
Precision is the percentage of relevant documents
compared to what is returned!

What different situations of recall and precision can we have?
Experimental Results
• Much of IR is experimental!
• Formal methods are lacking
– The hope is that artificial intelligence will change this
• Derive much insight from these results
Retrieve one document at a time, without replacement and in order.
(Rec = recall, Prec = precision, NRel = # relevant)

Given: only 25 documents, of which 5 are relevant (D1, D2, D4, D15, D25)

Calculate precision and recall after each document retrieved:
• Retrieve D1 – have D1 (Prec = 1/1, Rec = 1/5)
• Retrieve D2 – have D1, D2 (Prec = 2/2, Rec = 2/5)
• Retrieve D3 – now have D1, D2, D3 (Prec = 2/3, Rec = 2/5)
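
A minimal sketch of this step-by-step calculation, assuming the system returns D1 through D25 in order:

```python
# Precision and recall after each retrieved document (sketch).
# Assumes the system returns D1, D2, ..., D25 in that order.
relevant = {"D1", "D2", "D4", "D15", "D25"}
ranking = [f"D{i}" for i in range(1, 26)]

hits = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
    prec = hits / rank                  # precision after this document
    rec = hits / len(relevant)          # recall after this document
    print(f"rank {rank:2d} ({doc:>3}): Prec = {prec:.2f}  Rec = {rec:.2f}")
```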
Recall Plot
• Recall when more and more documents are retrieved.
• Why this shape?
Precision Plot
• Precision when more and more documents are retrieved.
• Note shape!
Precision/recall plot
• Sequences of points (p, r)
• Similar to y = 1 / x:
– Inversely proportional!
– Sawtooth shape - use smoothed graphs
• How can we compare systems?
Recall/Precision Curves
• There is a tradeoff between Precision and Recall
– So measure Precision at different levels of Recall
• Note: this is usually an AVERAGE over MANY queries

[Plot] Precision (y axis) vs. recall (x axis) at cutoffs n1, n2, n3, n4, where ni is the number of documents retrieved and ni < ni+1. Note that two separate entities are plotted on the x axis: recall and number of documents.
Actual recall/precision curve for one query
Best versus worst retrieval
Precision/Recall Curves
• Sometimes difficult to determine which of these two
hypothetical results is better:

[Plot] Two precision/recall curves whose points cross, so neither clearly dominates the other.

Precision/Recall Curve Comparison
[Plot] A best curve compared with a worst curve.
Document Cutoff Levels
• Another way to evaluate:
– Fix the number of documents retrieved at several levels:
• top 5
• top 10
• top 20
• top 50
• top 100
• top 500
– Measure precision at each of these levels
– Take (weighted) average over results
• This is a way to focus on how well the system ranks the
first k documents.
Problems with Precision/Recall
• Can’t know true recall value (recall for the web?)
– except in small collections
• Precision/Recall are related
– A combined measure sometimes more appropriate
• Assumes batch mode
– Interactive IR is important and has different criteria for
successful searches
• Assumes a strict rank ordering matters.
Buckland & Gey, JASIS: Jan 1994

Recall under various retrieval assumptions
[Plot] Recall vs. proportion of documents retrieved for Perfect, Tangent Parabolic Recall, Parabolic Recall, Random, and Perverse retrieval (1000 documents, 100 relevant).

Precision under various assumptions
[Plot] Precision vs. proportion of documents retrieved for the same five retrieval assumptions.

Recall vs Precision
[Plot] Precision vs. recall for the same five retrieval assumptions.



Relation to Contingency Table
                         Doc is Relevant    Doc is NOT relevant
Doc is retrieved         a                  b
Doc is NOT retrieved     c                  d
• Accuracy: (a+d) / (a+b+c+d)
• Precision: a/(a+b)
• Recall: a/(a+c)
• Why don’t we use Accuracy for IR?
– (Assuming a large collection)
• Most docs aren’t relevant
• Most docs aren’t retrieved
• Inflates the accuracy value
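
A small illustration of this point, with made-up counts for a large collection:

```python
# Why accuracy is misleading in IR: in a huge collection where almost nothing
# is relevant, a system that retrieves nothing at all still looks "accurate".
a, b, c, d = 0, 0, 50, 999_950        # hypothetical: 50 relevant docs, none retrieved

accuracy = (a + d) / (a + b + c + d)              # 0.99995 -- looks excellent
precision = a / (a + b) if (a + b) else 0.0       # 0.0
recall = a / (a + c) if (a + c) else 0.0          # 0.0
print(accuracy, precision, recall)
```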
The F-Measure
Combine Precision and Recall into one number:

F = 2 / (1/R + 1/P) = 2RP / (R + P)        P = precision, R = recall

F ∈ [0,1]
F = 1 when every retrieved document is relevant and every relevant document is retrieved (P = R = 1)
F = 0 when no relevant documents have been retrieved
Harmonic mean – an average of rates. AKA the F1 measure or F-score.
The E-Measure
Other ways to combine Precision and Recall into one number (van Rijsbergen 79):

E = 1 - (1 + β²) / (β²/R + 1/P)

or equivalently

E = 1 - 1 / (α(1/P) + (1 - α)(1/R)),   where α = 1/(β² + 1)

P = precision
R = recall
β = measure of the relative importance of P or R

For example, β = 0.5 means the user is twice as interested in precision as recall.
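
A minimal sketch of the F- and E-measures; the precision/recall values from the earlier worked example are reused for illustration:

```python
# F-measure and E-measure from precision and recall (sketch).
def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def e_measure(p: float, r: float, beta: float = 1.0) -> float:
    """van Rijsbergen's E; note that F_beta = 1 - E."""
    b2 = beta ** 2
    return 1 - (1 + b2) * p * r / (b2 * p + r)

p, r = 0.5, 0.6                       # from the earlier worked example
print(f_measure(p, r))                # ~0.545
print(e_measure(p, r, beta=0.5))      # beta < 1 emphasizes precision
```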
Interpret precision and recall
• Precision can be seen as a measure of exactness or fidelity
• Recall is a measure of completeness
• Inverse relationship between Precision and Recall, where it
is possible to increase one at the cost of reducing the other.
– For example, an information retrieval system (such as a search
engine) can often increase its Recall by retrieving more
documents, at the cost of increasing number of irrelevant
documents retrieved (decreasing Precision).
– Similarly, a classification system for deciding whether or not, say,
a fruit is an orange, can achieve high Precision by only classifying
fruits with the exact right shape and color as oranges, but at the
cost of low Recall due to the number of false negatives from
oranges that did not quite match the specification.
Measures for Large-Scale Eval
• Typical user behavior in web search
systems has shown a preference for high
precision
• Also graded scales of relevance seem more
useful than just “yes/no”
• Measures have been devised to help
evaluate situations taking these into account
Rank-Based Measures
• Binary relevance
– Precision@K (P@K)
– Mean Average Precision (MAP)
– Mean Reciprocal Rank (MRR)

• Multiple levels of relevance


– Normalized Discounted Cumulative Gain
(NDCG)
Precision@K
• Set a rank threshold K

• Compute % relevant in top K

• Ignores documents ranked lower than K

• Ex:
– Prec@3 of 2/3
– Prec@4 of 2/4
– Prec@5 of 3/5
• In similar fashion we have Recall@K
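
A minimal sketch, assuming a ranked list given as binary relevance labels (1 = relevant) whose pattern matches the example numbers above:

```python
# Precision@K over a ranked list of binary relevance labels (sketch).
def precision_at_k(rels: list[int], k: int) -> float:
    return sum(rels[:k]) / k

ranking = [1, 0, 1, 0, 1]          # assumed pattern consistent with the example
for k in (3, 4, 5):
    print(k, precision_at_k(ranking, k))    # 2/3, 2/4, 3/5
```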
A precision-recall curve
[Plot] Precision (0.0–1.0) on the y axis against recall (0.0–1.0) on the x axis for one query.
Mean Average Precision
(MAP)
• Consider rank position of each relevant doc
– K1, K2, … KR

• Compute Precision@K for each K1, K2, … KR

• Average precision = average of P@K

• Ex: relevant docs at ranks 1, 3, 5 → AvgPrec = (1/3)·(1/1 + 2/3 + 3/5) ≈ 0.76
• MAP is Average Precision across multiple
queries/rankings
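
A minimal sketch of average precision and MAP over binary relevance labels; the first ranking reproduces the 0.76 example:

```python
# Average precision for one ranking, and MAP across queries (sketch).
def average_precision(rels: list[int], num_relevant: int | None = None) -> float:
    """Mean of Precision@K over the ranks K where a relevant doc appears.
    Relevant docs that are never retrieved contribute zero (pass num_relevant)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    denom = num_relevant if num_relevant is not None else hits
    return total / denom if denom else 0.0

def mean_average_precision(rankings: list[list[int]]) -> float:
    return sum(average_precision(r) for r in rankings) / len(rankings)

print(average_precision([1, 0, 1, 0, 1]))        # (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
print(mean_average_precision([[1, 0, 1, 0, 1], [0, 1, 0, 0]]))
```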
Average Precision
MAP
Mean average precision
– If a relevant document never gets retrieved, we
assume the precision corresponding to that
relevant doc to be zero
– MAP is macro-averaging: each query counts
equally
– Now perhaps most commonly used measure in
research papers
– Good for web search?
– MAP assumes user is interested in finding many
relevant documents for each query
– MAP requires many relevance judgments in text
collection
BEYOND BINARY RELEVANCE
[Figure] Example search results labeled with graded relevance judgments (e.g. fair, good).
Discounted Cumulative Gain (DCG)

• Popular measure for evaluating web search


and related tasks

• Two assumptions:
– Highly relevant documents are more useful than
marginally relevant documents
– The lower the ranked position of a relevant
document, the less useful it is for the user, since
it is less likely to be examined
Discounted Cumulative Gain
• Uses graded relevance as a measure of
usefulness, or gain, from examining a
document
• Gain is accumulated starting at the top of the
ranking and may be reduced, or discounted,
at lower ranks
• Typical discount is 1/log (rank)
– With base 2, the discount at rank 4 is 1/2, and at
rank 8 it is 1/3
Summarize a Ranking: DCG
• What if relevance judgments are in a scale of
[0,r]? r>2
• Cumulative Gain (CG) at rank n
– Let the ratings of the n documents be r1, r2, …rn (in
ranked order)
– CG = r1 + r2 + … + rn
• Discounted Cumulative Gain (DCG) at rank n
– DCG = r1 + r2/log₂2 + r3/log₂3 + … + rn/log₂n
• We may use any base for the logarithm
Discounted Cumulative Gain
• DCG is the total gain accumulated at a particular rank p:
  DCG_p = rel_1 + Σ_{i=2..p} rel_i / log₂(i)
• Alternative formulation:
  DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log₂(i + 1)
  – used by some web search companies
  – emphasis on retrieving highly relevant documents
DCG Example
• 10 ranked documents judged on 0-3
relevance scale:
3, 2, 3, 0, 0, 1, 2, 2, 3, 0
• discounted gain:
3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
• DCG:
3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61,
9.61
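
A minimal sketch that reproduces the cumulative DCG values above:

```python
import math

# Cumulative DCG at each rank, using rel_1 + sum_{i>=2} rel_i / log2(i).
def dcg_by_rank(gains: list[float]) -> list[float]:
    out, total = [], 0.0
    for i, g in enumerate(gains, start=1):
        total += g if i == 1 else g / math.log2(i)
        out.append(round(total, 2))
    return out

print(dcg_by_rank([3, 2, 3, 0, 0, 1, 2, 2, 3, 0]))
# [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
```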
Summarize a Ranking: NDCG
• Normalized Discounted Cumulative Gain
(NDCG) at rank n
– Normalize DCG at rank n by the DCG value at
rank n of the ideal ranking
– The ideal ranking would first return the
documents with the highest relevance level, then
the next highest relevance level, etc
• Normalization useful for contrasting queries
with varying numbers of relevant results

• NDCG is now quite popular in evaluating


Web search
NDCG - Example
4 documents: d1, d2, d3, d4

       Ground Truth        Ranking Function1     Ranking Function2
  i    Document    ri      Document    ri        Document    ri
  1    d4          2       d3          2         d3          2
  2    d3          2       d4          2         d2          1
  3    d2          1       d2          1         d4          2
  4    d1          0       d1          0         d1          0

DCG_GT  = 2 + 2/log₂2 + 1/log₂3 + 0/log₂4 = 4.6309
DCG_RF1 = 2 + 2/log₂2 + 1/log₂3 + 0/log₂4 = 4.6309
DCG_RF2 = 2 + 1/log₂2 + 2/log₂3 + 0/log₂4 = 4.2619
MaxDCG = DCG_GT = 4.6309

NDCG_GT = 1.00     NDCG_RF1 = 1.00     NDCG_RF2 = 4.2619 / 4.6309 = 0.9203
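
A minimal sketch of NDCG that reproduces the values above from the graded relevance lists:

```python
import math

# DCG with the formulation rel_1 + sum_{i>=2} rel_i / log2(i).
def dcg(gains: list[float]) -> float:
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains: list[float]) -> float:
    ideal = sorted(gains, reverse=True)    # ideal ranking: highest gains first
    return dcg(gains) / dcg(ideal)

print(ndcg([2, 2, 1, 0]))   # Ranking Function1: 1.0
print(ndcg([2, 1, 2, 0]))   # Ranking Function2: ~0.9203
```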


What if the results are not in a list?

• Suppose there’s only one Relevant Document


• Scenarios:
– known-item search
– navigational queries
– looking for a fact
• Search duration ~ Rank of the answer
– measures a user’s effort

Mean Reciprocal Rank
• Consider rank position, K, of first relevant doc
– Could be – only clicked doc

• Reciprocal Rank score = 1/K

• MRR is the mean RR across multiple queries
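
A minimal sketch of MRR, assuming each query's results are given as binary relevance labels in rank order:

```python
# Mean Reciprocal Rank across queries (sketch).
def reciprocal_rank(rels: list[int]) -> float:
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return 1 / rank
    return 0.0                                    # no relevant doc retrieved

queries = [[0, 1, 0], [1, 0, 0], [0, 0, 0, 1]]    # hypothetical judgments
print(sum(reciprocal_rank(q) for q in queries) / len(queries))   # (1/2 + 1 + 1/4) / 3
```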


How to Evaluate IR Systems?
Test Collections
Test Collections
Old Test Collections
• Cranfield 2 –
– 1400 Documents, 221 Queries
– 200 Documents, 42 Queries
• INSPEC – 542 Documents, 97 Queries
• UKCIS – > 10,000 Documents, multiple sets, 193 Queries
• ADI – 82 Documents, 35 Queries
• CACM – 3204 Documents, 50 Queries
• CISI – 1460 Documents, 35 Queries
• MEDLARS (Salton) 273 Documents, 18 Queries
• Somewhat simple
Modern Well Used Test Collections
• Text Retrieval Conference (TREC) .
– The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation
series since 1992. In more recent years, NIST has done evaluations on larger document collections,
including the 25 million page GOV2 web page collection. From the beginning, the NIST test document
collections were orders of magnitude larger than anything available to researchers previously and GOV2
is now the largest Web collection easily available for research purposes. Nevertheless, the size of GOV2
is still more than 2 orders of magnitude smaller than the current size of the document collections indexed
by the large web search companies.
• NII Test Collections for IR Systems (NTCIR).
– The NTCIR project has built various test collections of similar sizes to the TREC collections, focusing on East Asian language and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages.
• Cross Language Evaluation Forum (CLEF).
– Concentrated on European languages and cross-language information retrieval.
• Reuters-RCV1.
– For text classification, the most used test collection has been the Reuters-21578 collection of 21,578 newswire articles; see Chapter 13, page 13.6. More recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of 806,791 documents. Its scale and rich annotation make it a better basis for future research.
• 20 Newsgroups .
– This is another widely used text classification collection, collected by Ken Lang. It consists of 1000
articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After
the removal of duplicate articles, as it is usually used, it contains 18941 articles.
TREC
• Text REtrieval
Conference/Competition
– http://trec.nist.gov
– Run by NIST (National
Institute of Standards &
Technology)
• Collections: > Terabytes,
• Millions of entities
– Newswire & full text news
(AP, WSJ, Ziff, FT)
– Government documents
(federal register,
Congressional Record)
– Radio Transcripts (FBIS)
– Web “subsets”
Tracks change from year to year
[Figure] 2021 Tracks
TREC (cont.)
• Queries + Relevance Judgments
– Queries devised and judged by “Information Specialists”
– Relevance judgments done only for those documents
retrieved -- not entire collection!
• Competition
– Various research and commercial groups compete (TREC
6 had 51, TREC 7 had 56, TREC 8 had 66)
– Results judged on precision and recall, going up to a
recall level of 1000 documents
Sample TREC queries (topics)
<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description:
A document will address the role of the Federal Government in
financing the operation of the National Railroad Transportation
Corporation (AMTRAK)

<narr> Narrative: A relevant document must provide


information on the government’s responsibility to make
AMTRAK an economically viable entity. It could also discuss
the privatization of AMTRAK as an alternative to continuing
government subsidies. Documents comparing government
subsidies given to air and bus transportation with those
provided to AMTRAK would also be relevant.
TREC
• Benefits:
– made research systems scale to large collections (pre-WWW)
– allows for somewhat controlled comparisons
• Drawbacks:
– emphasis on high recall, which may be unrealistic for what
most users want
– very long queries, also unrealistic
– comparisons still difficult to make, because systems are quite
different on many dimensions
– focus on batch ranking rather than interaction
– no focus on the WWW until recently
TREC evolution
• Emphasis on specialized “tracks”
– Interactive track
– Natural Language Processing (NLP) track
– Multilingual tracks (Chinese, Spanish)
– Filtering track
– High-Precision
– High-Performance
– Topics
• http://trec.nist.gov/
TREC Results
• Differ each year
• For the main (ad hoc) track:
– Best systems not statistically significantly different
– Small differences sometimes have big effects
• how good was the hyphenation model
• how was document length taken into account
– Systems were optimized for longer queries and all
performed worse for shorter, more realistic queries
Evaluating search engine
retrieval performance
• Recall?
• Precision?
• Order of ranking?
Evaluation Issues

To place information retrieval on a systematic basis, we need repeatable criteria to evaluate how effective a system is in meeting the information needs of the user of the system.
This proves to be very difficult with a human in the loop.
It proves hard to define:
• the task that the human is attempting
• the criteria to measure success
Evaluation of Matching:
Recall and Precision
If information retrieval were perfect ...
Every hit would be relevant to the original query, and every
relevant item in the body of information would be found.

Precision: percentage (or fraction) of the hits that are relevant, i.e., the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.

Recall: percentage (or fraction) of the relevant items that are found by the query, i.e., the extent to which the query found all the items that satisfy the requirement.
Recall and Precision with Exact
Matching: Example
• Collection of 10,000 documents, 50 on a specific topic
• Ideal search finds these 50 documents and rejects all others
• Actual search identifies 25 documents; 20 are relevant but 5
were on other topics
• Precision: 20/ 25 = 0.8 (80% of hits were relevant)
• Recall: 20/50 = 0.4 (40% of relevant were found)
Measuring Precision and Recall
Precision is easy to measure:
• A knowledgeable person looks at each document that is
identified and decides whether it is relevant.
• In the example, only the 25 documents that are found need
to be examined.
Recall is difficult to measure:
• To know all relevant items, a knowledgeable person must
go through the entire collection, looking at every object to
decide if it fits the criteria.
• In the example, all 10,000 documents must be examined.
Evaluation: Precision and Recall
Precision and recall measure the results of a single query
using a specific search system applied to a specific set of
documents.

Matching methods:
Precision and recall are single numbers.
Ranking methods:
Precision and recall are functions of the rank order.
Graph of Precision with Ranking: P(r) as we retrieve the 8 documents.

 Rank r       1    2    3    4    5    6    7    8
 Relevant?    Y    N    Y    Y    N    Y    N    Y
 P(r)        1/1  1/2  2/3  3/4  3/5  4/6  4/7  5/8
What does the user want?
Restaurant case
• The user wants to find a restaurant serving sashimi. The user uses 2 IR systems. How can we say which one is better?
Human judgments are
• Expensive
• Inconsistent
– Between raters
– Over time
• Decay in value as documents/query mix evolves
• Not always representative of “real users”
– Rating vis-à-vis query, vs underlying need
• So – what alternatives do we have?

USING USER CLICKS

What do clicks tell us?
[Figure] # of clicks received by result position.
• Strong position bias, so absolute click rates are unreliable.
Relative vs absolute ratings
[Figure] A result list with the user's click sequence marked.
• Hard to conclude Result1 > Result3
• Probably can conclude Result3 > Result2
Pairwise relative ratings
• Pairs of the form: DocA better than DocB for a
query
– Doesn’t mean that DocA relevant to query
• Now, rather than assess a rank-ordering wrt
per-doc relevance assessments
• Assess in terms of conformance with historical
pairwise preferences recorded from user clicks
• BUT!
• Don’t learn and test on the same ranking
algorithm
– I.e., if you learn historical clicks from nozama and
compare Sergey vs nozama on this history …
Comparing two rankings via clicks (Joachims 2002)
Query: [support vector machines]

 Ranking A           Ranking B
 Kernel machines     Kernel machines
 SVM-light           SVMs
 Lucent SVM demo     Intro to SVMs
 Royal Holl. SVM     Archives of SVM
 SVM software        SVM-light
 SVM tutorial        SVM software
Interleave the two rankings (this interleaving starts with B):
 Kernel machines
 Kernel machines
 SVMs
 SVM-light
 Intro to SVMs
 Lucent SVM demo
 Archives of SVM
 Royal Holl. SVM
 SVM-light
 …
Remove duplicate results:
 Kernel machines
 (Kernel machines – duplicate, removed)
 SVMs
 SVM-light
 Intro to SVMs
 Lucent SVM demo
 Archives of SVM
 Royal Holl. SVM
 (SVM-light – duplicate, removed)
 …
Count user clicks:
 Kernel machines      ← clicked (credits A and B)
 Kernel machines
 SVMs
 SVM-light            ← clicked (credits A)
 Intro to SVMs
 Lucent SVM demo      ← clicked (credits A)
 Archives of SVM
 Royal Holl. SVM
 SVM-light
 …
Clicks – Ranking A: 3, Ranking B: 1
Interleaved ranking
• Present interleaved ranking to users
– Start randomly with ranking A or ranking B to even out presentation bias

• Count clicks on results from A versus results


from B

• Better ranking will (on average) get more clicks

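A minimal sketch in the spirit of the click-based comparison above; the crediting rule is simplified (not Joachims' exact algorithm) and the rankings are taken from the example:

```python
# Sketch: interleave two rankings and credit user clicks to the ranking that
# placed the clicked result higher (ties credit both). Simplified illustration.
import itertools

ranking_a = ["Kernel machines", "SVM-light", "Lucent SVM demo",
             "Royal Holl. SVM", "SVM software", "SVM tutorial"]
ranking_b = ["Kernel machines", "SVMs", "Intro to SVMs",
             "Archives of SVM", "SVM-light", "SVM software"]

def interleave(first: list[str], second: list[str]) -> list[str]:
    """Alternately take the next not-yet-shown result from each ranking."""
    merged: list[str] = []
    for pair in itertools.zip_longest(first, second):
        for doc in pair:
            if doc is not None and doc not in merged:
                merged.append(doc)
    return merged

def credit(clicked: list[str], a: list[str], b: list[str]) -> tuple[int, int]:
    score_a = score_b = 0
    for doc in clicked:
        rank_a = a.index(doc) if doc in a else len(a)
        rank_b = b.index(doc) if doc in b else len(b)
        score_a += 1 if rank_a <= rank_b else 0
        score_b += 1 if rank_b <= rank_a else 0
    return score_a, score_b

shown = interleave(ranking_b, ranking_a)      # this presentation starts with B
clicks = ["Kernel machines", "SVM-light", "Lucent SVM demo"]
print(credit(clicks, ranking_a, ranking_b))   # (3, 1), as in the example
```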
A/B testing at web search engines
• Purpose: Test a single innovation
• Prerequisite: You have a large search engine up and running.
• Have most users use the old system
• Divert a small proportion of traffic (e.g., 1%) to an experiment to evaluate an innovation
– Interleaved experiment
– Full page experiment
Facts/entities (what happens to clicks?)
RECAP
For ad hoc IR evaluation, need:
1. A document collection
2. A test suite of information needs,
expressible as queries
3. A set of relevance judgments, standardly a
binary assessment of either relevant or
nonrelevant for each query-document pair.
Precision/Recall
• You can get high recall (but low precision)
by retrieving all docs for all queries!
• Recall is a non-decreasing function of the
number of docs retrieved

• In a good system, precision decreases as either the number of docs retrieved or recall increases
– A fact with strong empirical confirmation
Difficulties in using
precision/recall
• Should average over large corpus/query ensembles
• Need human relevance assessments
– People aren’t reliable assessors
• Assessments have to be binary
– Nuanced assessments?
• Heavily skewed by corpus/authorship
– Results may not translate from one domain to another
What to Evaluate?
• Want an effective system
• But what is effectiveness?
– Difficult to measure
– Recall and Precision are standard measures
• F measure frequently used
• Google stressed precision!
Evaluation of IR Systems
• Performance evaluations
• Retrieval evaluation
• Quality of evaluation - Relevance
• Measurements of Evaluation
– Precision vs recall
• Test Collections/TREC
