Chapter 3: MA212 Evaluation
University
First Semester, 2024-2025
IR as an Experimental Science!
Formulate a research question: the hypothesis
Design an experiment to answer the question
Perform the experiment
● Compare with a baseline “control”
Does the experiment answer the question?
● Are the results significant? Or is it just luck?
Report the results!
Repeat…
Questions About the Black Box
Example “question”: Does expanding the query with synonyms
improve retrieval performance?
Corresponding experiment?
● Expand queries with synonyms and compare against baseline
unexpanded queries
Questions That Involve Users
Example “question”: Is letting users weight search terms better?
Corresponding experiment?
● Build two different interfaces, one with term weighting functionality,
and one without; run a user study
Types of Evaluation Strategies
System-centered studies
● Given documents, queries, and relevance judgments
● Try several variations of the system
● Measure which system returns the “best” hit list
● Laboratory experiment
User-centered studies
● Given several users, and at least two retrieval systems
● Have each user try the same task on both systems
● Measure which system works the “best”
Why is Evaluation Important?
The ability to measure differences underlies experimental
science
● How well do our systems work?
● Is A better than B?
● Is it really?
● Under what conditions?
Evaluation drives what to research
● Identify techniques that work and don’t work
Evaluation Criteria
Effectiveness
● How “good” are the documents that are returned?
● Can be judged for the system alone, or for the human + system combination
Efficiency
● Retrieval time (query latency/response time), indexing time, index size
Usability
● Learnability, flexibility
● Novice vs. expert users
Cranfield Paradigm (Lab setting)
[Diagram: a set of queries and a document collection are fed to the IR system; for each query the system produces a search-results list R_i; the evaluation module scores each R_i against the relevance judgments (qrels) to give a per-query measure m_i; the overall score is M = average(m_i).]
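A minimal sketch of this evaluation loop, assuming hypothetical retrieve and evaluate_query callables and dict-based queries/qrels (none of these names come from the slides):

```python
# Sketch of the Cranfield-style evaluation loop: run every query through the system,
# score each result list against the qrels, and average the per-query measures.
def evaluate_system(queries, qrels, retrieve, evaluate_query):
    per_query_scores = []
    for query_id, query_text in queries.items():
        results = retrieve(query_text)                    # ranked result list R_i
        m_i = evaluate_query(results, qrels[query_id])    # per-query measure m_i
        per_query_scores.append(m_i)
    return sum(per_query_scores) / len(per_query_scores)  # M = average(m_i)
```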
Reusable Test Collections
1. Collection of documents
● Should be “representative”
● Things to consider: size, sources, genre, topics, …
2. Sample of information needs
● Should be “randomized” and “representative”
● Usually formalized topic statements
3. Known relevance judgments
● Assessed by humans, for each topic-document pair (topic, not query!)
● Binary judgments make evaluation easier, but could be graded
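One way to picture such a test collection in code, as an illustrative sketch (the ids and field layout are assumptions, not from the slides):

```python
# Illustrative reusable test collection held in plain dicts.
documents = {"d1": "text of document 1", "d2": "text of document 2"}  # 1. document collection
topics = {"q1": "effects of expanding queries with synonyms"}          # 2. formalized topic statements
# 3. human relevance judgments per (topic, document) pair; binary here, but could be graded
qrels = {"q1": {"d1": 1, "d2": 0}}
```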
Good Effectiveness Measures
Should capture some aspect of what the user wants
● That is, the measure should be meaningful
Set-based measures vs. rank-based measures
Set-Based Measures
Assume the IR system returns a set of retrieved results, without ranking.
There is no fixed number of results per query.

                 relevant   irrelevant
  retrieved         TP          FP
  not retrieved     FN          TN

Recall: what fraction of the relevant docs were retrieved?  R = TP / (TP + FN)
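A minimal sketch of these set-based measures, assuming retrieved and relevant are sets of document ids (the example ids and numbers are illustrative):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Set-based precision and recall.
    TP = retrieved & relevant, FP = retrieved - relevant, FN = relevant - retrieved."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0   # TP / (TP + FP)
    recall = tp / len(relevant) if relevant else 0.0        # TP / (TP + FN)
    return precision, recall

# Example: 9 relevant docs in the collection, 5 docs retrieved, 3 of them relevant.
retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d1", "d3", "d5", "d6", "d7", "d8", "d9", "d10", "d11"}
print(precision_recall(retrieved, relevant))   # (0.6, 0.333...)
```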
Precision & Recall?
For query Q, the collection has 9 relevant documents, and systems A-G retrieved the following results:
[Table: the result sets returned by systems A-G, with relevant documents marked R.]
Trade-off between P & R
Precision: The ability to retrieve top-ranked documents that are
mostly relevant.
Recall: The ability to retrieve all of the relevant items in the corpus.
Trade-off between P & R
[Plot: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1); a system that returns relevant documents but misses many useful ones sits at high precision and low recall; "the ideal" sits at the top-right corner, where both precision and recall are 1.]
One Measure? F-Measure
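For reference, the F-measure combines precision P and recall R via their weighted harmonic mean; the formulas below are the textbook definitions, not taken from the slide body:

```latex
% Weighted F-measure; \beta = 1 gives the balanced F1 (harmonic mean of P and R).
F_\beta = \frac{(1 + \beta^2)\, P \cdot R}{\beta^2 P + R}
\qquad
F_1 = \frac{2\, P \cdot R}{P + R}
```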
Rank-based IR measures
Consider systems A & B:
● Both retrieved 10 docs, and only 5 of them are relevant
● P, R & F are the same for both systems
● Should their performance be considered equal?

  Rank   A   B
   1     R
   2     R
   3     R   R
   4     R
   5     R   R
   6
   7         R
   8
   9         R
  10         R

Ranked IR requires taking “ranks” into consideration!
How to do that?
Which is the Best Ranked List?
For query Q, the collection has 9 relevant documents, and systems A-G produced the following results:
[Table: ranked lists (up to rank 12) for systems A-G, with relevant documents marked R.]
Precision @ k
k is a fixed number of documents.
Cut off the ranked list at rank k, then calculate precision over those top-k results.
Perhaps appropriate for most web search: all people want is good matches on the first one or two results pages.
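A minimal sketch of P@k, assuming ranked is the ranked list of doc ids and relevant the set of relevant ids (names and numbers are illustrative):

```python
def precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Precision @ k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

# Relevant documents at ranks 1, 3 and 5 give P@5 = 3/5 = 0.6.
print(precision_at_k(["d1", "d2", "d3", "d4", "d5", "d6"], {"d1", "d3", "d5"}, 5))
```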
Precision @ 5?
For query Q, the collection has 9 relevant documents, and systems A-G produced the following results:
[Table: the same ranked lists of systems A-G as on the previous slide, with relevant documents marked R.]
When can a user stop?
P@k assumes every user will stop inspecting results at rank k.
Is that realistic?
IR objective: “satisfy user’s information need”
Assumption: a user will stop once his/her information need is
satisfied.
Where?
When can a user stop?
A user will keep looking for relevant docs in the ranked list, read them, and stop once he/she feels satisfied → the user will stop at a relevant document.

Average Precision (AP): compute the precision at each rank where a relevant document appears, then average over all of the query's relevant documents (a relevant document that is never retrieved contributes 0).

  Q1 (5 relevant docs): relevant at ranks 1, 2, 5, 9
    1/1 = 1.00, 2/2 = 1.00, 3/5 = 0.60, 4/9 = 0.44; the fifth relevant doc is never retrieved (rank ∞) → 0
    AP = 3.04 / 5 = 0.608
  Q2 (3 relevant docs): relevant at ranks 3, 7
    1/3 = 0.33, 2/7 = 0.29; the third relevant doc is never retrieved (rank ∞) → 0
    AP = 0.62 / 3 = 0.207
  Q3 (7 relevant docs): relevant at ranks 2, 5, 8
    1/2 = 0.50, 2/5 = 0.40, 3/8 = 0.375; the remaining four relevant docs are never retrieved → 0
    AP = 1.275 / 7 = 0.182
Mean Average Precision (MAP)
For the three queries above (Q1 with 5 relevant docs, Q2 with 3, Q3 with 7):
  AP(Q1) = 0.608   AP(Q2) = 0.207   AP(Q3) = 0.182
MAP is the mean of the per-query APs: (0.608 + 0.207 + 0.182) / 3 ≈ 0.332
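A minimal sketch of AP and MAP, assuming each query is given as a ranked list of doc ids plus the set of relevant ids (names are illustrative):

```python
def average_precision(ranked: list, relevant: set) -> float:
    """AP: average of the precision values at the ranks where relevant docs appear,
    taken over ALL relevant docs (relevant docs never retrieved contribute 0)."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs: list) -> float:
    """MAP: mean of AP over all queries; runs is a list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Q1 from the slides: relevant docs at ranks 1, 2, 5 and 9, with 5 relevant docs in total
# (the fifth, "d99", is never retrieved).
ranked = [f"d{i}" for i in range(1, 11)]          # d1 ... d10
relevant = {"d1", "d2", "d5", "d9", "d99"}
print(round(average_precision(ranked, relevant), 3))  # 0.609 (the slide rounds 4/9 to 0.44, giving 0.608)
```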
nDCG Example
[Worked example: the DCG of the ranked list is divided by the DCG of the ideal ranking (IDCG) to obtain nDCG.]
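A minimal sketch of DCG and nDCG using the commonly used log2 rank discount, assuming graded relevance judgments (the gain values are illustrative, not the slide's numbers):

```python
import math

def dcg(gains: list) -> float:
    """Discounted Cumulative Gain: each gain is discounted by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains: list) -> float:
    """nDCG: DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Illustrative graded judgments for a ranked list of 5 documents.
print(round(ndcg([3, 2, 3, 0, 1]), 3))  # ≈ 0.972
```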
Reusable Test Collections
Document Collection
Topics (sample of
information needs)
Relevance judgments (qrels)
How can we get it?
For web search, companies run their own studies to assess the performance of their search engines.
Web-search performance is monitored by:
● Traffic
● User clicks and session logs
● Labelling results for selected users’ queries
Evaluation at Large Search Engines
Recall is difficult to measure on the web – why?
Search engines often use
● precision at top k, e.g., k = 10
● measures that reward you more for getting rank 1 right than for getting
rank 10 right (nDCG).
● non-relevance-based measures.
• Clickthrough on first result: not very reliable if you look at a single clickthrough …
but pretty reliable in the aggregate.
• Studies of user behavior in the lab.
• A/B testing
A/B testing (online testing)
Purpose: Test a single innovation.
Prerequisite: You have a large search engine up and running.
Have most users use the old system.
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an “automatic” measure like clickthrough on the first result.
Now we can directly see whether the innovation improves user happiness.
Probably the evaluation methodology that large search engines
trust most.
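A minimal sketch of the traffic split, assuming users are assigned to buckets by hashing a user id; the 1% figure comes from the slide, everything else (function names, bucket scheme) is illustrative:

```python
import hashlib

def in_treatment_bucket(user_id: str, percent: float = 1.0) -> bool:
    """Deterministically divert roughly `percent`% of users to the new system."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10000
    return bucket < percent * 100   # e.g. 1.0% -> buckets 0..99 out of 10000

def first_result_ctr(sessions: list) -> float:
    """The "automatic" measure from the slide: clickthrough rate on the first result.
    sessions is a list of booleans, True if the first result was clicked."""
    return sum(sessions) / len(sessions) if sessions else 0.0
```

Comparing first_result_ctr for the control and treatment buckets, aggregated over many sessions, is what makes single noisy clickthroughs reliable in the aggregate.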