0% found this document useful (0 votes)

14 views63 pages

Chapter3 MA212 Evaluation

Uploaded by

alaaabdo347890

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views63 pages

Chapter3 MA212 Evaluation

Uploaded by

alaaabdo347890

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 63

Faculty of Artificial Intelligence – KFS

University
Fist Semester- 2024-2025

MA212 : Information Retrieval and Web Search

Grade: General - Second Year
Dr. Marwa Elseddik
Lecturer, Department of Robotics and Intelligent Machines
2. Evaluating Search Effectiveness
Today’s Roadmap
 Evaluation in IR: why and how?
 Set-based measures
 Rank-based measures
 Building test collections

 Evaluation at Large Search Engines

3
IR as an Experimental Science!
 Formulate a research question: the hypothesis
 Design an experiment to answer the question
 Perform the experiment
● Compare with a baseline “control”
 Does the experiment answer the question?
● Are the results significant? Or is it just luck?
 Report the results!
 Repeat…

4
Questions About the Black Box
 Example “question”: Does expanding the query with synonyms
improve retrieval performance?
 Corresponding experiment?
● Expand queries with synonyms and compare against baseline
unexpanded queries
Questions That Involve Users
 Example “question”: Is letting users weight search terms better?
 Corresponding experiment?
● Build two different interfaces, one with term weighting functionality,
and one without; run a user study
5
Types of Evaluation Strategies
 System-centered studies
● Given documents, queries, and relevance judgments
● Try several variations of the system
● Measure which system returns the “best” hit list
● Laboratory experiment

 User-centered studies
● Given several users, and at least two retrieval systems
● Have each user try the same task on both systems
● Measure which system works the “best”

6
Why is Evaluation Important?
 The ability to measure differences underlies experimental
science
● How well do our systems work?
● Is A better than B?
● Is it really?
● Under what conditions?
 Evaluation drives what to research
● Identify techniques that work and don’t work

7
Evaluation Criteria
 Effectiveness
● How “good” are the documents that are returned?
● System only, human + system
 Efficiency
● Retrieval time (query latency/response time), indexing time, index size
 Usability
● Learnability, flexibility
● Novice vs. expert users

8
Cranfield Paradigm (Lab setting)
Query
Document Query
Query
Collection Query
Query
IR System

Search Results

Evaluation
Relevance Judgments
Module

Measure of These are

Effectiveness the 4 things we need!
9
Cranfield Paradigm (Lab setting)
Q1 Q2 Q3 Q4

Document
Collection IR System IR System IR System IR System

R1 R2 R3 R4

Qrels
Evaluation Evaluation Evaluation Evaluation
Module Module Module Module

m1 m2 m3 m4

M = average (mi) 10
Reusable Test Collections
1. Collection of documents
● Should be “representative”
● Things to consider: size, sources, genre, topics, …
2. Sample of information needs
● Should be “randomized” and “representative”
● Usually formalized topic statements
3. Known relevance judgments
● Assessed by humans, for each topic-document pair (topic, not query!)
● Binary judgments make evaluation easier, but could be graded

11
Good Effectiveness Measures
 Should capture some aspect of what the user wants
● That is, the measure should be meaningful

 Should be easily replicated by other researchers

 Should be easily comparable

● Optimally, expressed as a single number

12
10

13
10

A test collection has 3 components (choose 3):

➢ Set of information needs (topics)
➢ Collection of documents
➢ Set of queries for one information need
➢ Set of evaluation measures
➢ Relevance judgments
➢ Set of IR systems
Relevance judgments indicate ...
➢ how good the IR system is.
➢ which documents are relevant to which topics.
➢ which topics are good for evaluating the IR systems. 14
Effectiveness Evaluation Measures

Set-based Rank-based
measures measures

16
17
Set-Based Measures
 Assuming IR system returns set of retrieved results without
ranking.
 No certain number of results per query

 Suitable for Boolean Search

Which is the Best Set?
For query Q, collection has 9 relevant documents and systems A-G retrieved following results:
A B C D E F G
R R R R
R R R
R R R R R
R R
R R R R R
R R
R R R R
R
R R R
R
R R
R
Precision and Recall
 Precision:
Relevant
What fraction of these retrieved documents
docs are relevant?

 Recall:
What fraction of the relevant

relevant irrelevant
docs were retrieved? FP TN

TP FN

retrieved not retrieved

Precision & Recall?
For query Q, collection has 9 relevant documents and systems A-G retrieved following results:
A B C D E F G
R R R R
R R R
R R R R R
R R
R R R R R
R R
R R R R
R
R R R
R
R R
R
11

22
11 Precision & Recall?
For query Q, collection has 9 relevant documents and systems A-G retrieved following results:
E G
R
R
R R
R
R

R
Trade-off between P & R
 Precision: The ability to retrieve top-ranked documents that are
mostly relevant.
 Recall: The ability to retrieve all of the relevant items in the corpus.

 Retrieve more docs:

● Higher chance to find all relevant docs → R ↑↑
● Higher chance to find more irrelevant docs → P ↓↓

24
Trade-off between P & R
Returns relevant documents but
misses many useful ones too The ideal

Precision

Returns most relevant

documents but includes
lots of junk

0 1
Recall
One Measure? F-Measure

 Harmonic mean of recall and precision

● emphasizes the importance of small values, whereas the arithmetic mean is
affected more by outliers that are unusually large.

 Beta controls relative importance of precision and recall

● Beta = 1, precision and recall equally important ➔ F1
● Beta = 5, recall is 5 times more important than precision
26
F1?
For query Q, collection has 9 relevant documents and systems A-G retrieved following results:
A B C D E F G
R R R R
R R R
R R R R R
R R
R R R R R
R R
R R R R
R
R R R
R
R R
R
Today’s Roadmap
 Evaluation in IR: why and how?
 Set-based measures
 Rank-based measures
 Building test collections

 Evaluation at Large Search Engines

28
29
Rank-based IR measures
A B
 Consider systems A & B 1 R
● Both retrieved 10 docs, only 5 are relevant 2 R
3 R R
● P, R & F are the same for both systems
4 R
● Should their performance considered equal? 5 R R
6
7 R
 Ranked IR requires taking “ranks” into 8
consideration! 9 R
10 R

How to do that?
Which is the Best Ranked List?
For query Q, collection has 9 relevant documents and systems A-G produced following results:
A B C D E F G
1 R 1 1 R 1 R 1 R 1 1
2 2 2 R 2 R 2 2 2 R
3 R 3 3 3 R 3 R 3 R 3 R
4 4 4 4 4 4 R 4 R
5 R 5 5 R 5 R 5 5 R 5 R
6 6 R 6 6 6 6 R
7 R 7 R 7 7 7 R 7 R
8 8 R 8 8 8 8
9 R 9 R 9 R 9 9
10 10 R 10 10 10
11 R 11 11 11 R
12 12 R 12 12
Precision @ K
 k (afixed number of documents)
 Have a cut-off on the ranked list at rank k, then calculate
precision!

 Perhaps appropriate for most of web search: all people want are
good matches on the first one or two results pages.

32
Precision @ 5?
For query Q, collection has 9 relevant documents and systems A-G produced following results:
A B C D E F G
1 R 1 1 R 1 R 1 R 1 1
2 2 2 R 2 R 2 2 2 R
3 R 3 3 3 R 3 R 3 R 3 R
4 4 4 4 4 4 R 4 R
5 R 5 5 R 5 R 5 5 R 5 R
6 6 R 6 6 6 6 R
7 R 7 R 7 7 7 R 7 R
8 8 R 8 8 8 8
9 R 9 R 9 R 9 9
10 10 R 10 10 10
11 R 11 11 11 R
12 12 R 12 12
When a user can stop?
 P@k assumes every user will stop inspecting results at rank k.

Is that realistic?
 IR objective: “satisfy user’s information need”
 Assumption: a user will stop once his/her information need is
satisfied.
Where?

34
When a user can stop?
 User will keep looking for relevant docsin the ranked list, read
them, then stop once he/she feels satisfied → user will stop at a
relevant document

Which relevant document

the user will stop at?

 What about calculating the averages over all potential stops?

● Every time you find relevant doc at rank x, calculate P@x, then take the
average at the end.
35
Average Precision (AP)
Q1 Q2 Q3
(has 5 rel. docs) (has 3 rel. docs) (has 7 rel. docs)

1 R 1/1=1.00 1 1
2 R 2/2=1.00 2 2 R 1/2=0.50
3 3 R 1/3=0.33 3
4 4 4
5 R 3/5=0.60 5 5 R 2/5=0.40
6 6 6
7 7 R 2/7=0.29 7
8 8 8 R 3/8=0.375
9 R 4/9=0.44 3 9
=0
10 ∞
AP = 3.04 / 5 AP = 0.62 / 3 AP = 1.275 / 7
= 0.608 = 0.207 = 0.182
36
Mean Average Precision (MAP)
Q1 Q2 Q3
(has 5 rel. docs) (has 3 rel. docs) (has 7 rel. docs)

1 R 1/1=1.00 1 1
2 R 2/2=1.00 2 2 R 1/2=0.50
3 3 R 1/3=0.33 3
4 4 4
5 R 3/5=0.60 5 5 R 2/5=0.40
6 6 6
7 7 R 2/7=0.29 7
8 8 8 R 3/8=0.375
9 R 4/9=0.44 9
10
AP = 0.608 AP = 0.207 AP = 0.182

MAP = (0.608+0.207+0.182)/3 = 0.332 37

AP/MAP
A mix between precision and recall.
 Highly focus on finding relevant documents as early as possible.
 MAP is the most commonly-used evaluation metric for most IR
search tasks.

 When we have only 1

relevant doc for every topic
➔ MAP = MRR (mean reciprocal rank)

 Uses binary relevance: 𝑟𝑟𝑟𝑟𝑟𝑟= 0 or 1

Binary vs. Graded Relevance
 Some docs are more relevant to a topic than other relevant ones!
● We need non-binary relevance.
 Two assumptions:
● Highly relevant documents are more useful than marginally relevant.
● The lower the ranked position of a relevant document, the less useful it
is for the user, since it is less likely to be examined.
 Discounted Cumulative Gain (DCG)
● Uses graded relevance as a measure of the usefulness
● The most popular for evaluating web search
Binary vs. Graded Relevance
 Some docs are more relevant to a topic than other relevant ones!
● We need non-binary relevance.

1. Higher relevance docs should have higher value (gain) in

evaluation.

2. But this value will decay (be discounted) if it appears lower in

the ranked list, since it is less likely to be examined.
Discounted Cumulative Gain (DCG)
 Gain is accumulated starting at the top of the ranking and may
be reduced, or discounted, at lower ranks
 Users care more about high-ranked documents, so we discount
results by 1/log2(rank)
● the discount at rank 4 is 1/2, and at rank 8 is 1/3
 DCGk is the total gain accumulated at a particular rank k (sum of
DG up to rank k):
𝑘 𝑟𝑟𝑟𝑟𝑟𝑟𝑖
𝐷𝐷𝐷𝐷𝐷𝐷𝑘𝑘= 𝑟𝑟𝑟𝑟𝑟𝑟1+ �
𝑖=2 𝑟𝑙𝑙𝑙𝑙 2 (𝑟𝑟)
41
DCG Example

k G DG DCG@k iG iDG iDCG@k NDCG@k

1 3 3 3 3 3.00 3 1.00
2 2 2 5 3 3.00 6 0.83
3 3 1.89 6.89 3 1.89 7.89 0.87
4 0 0 6.89 2 1.00 8.89 0.78
5 0 0 6.89 2 0.86 9.75 0.71
6 1 0.39 7.28 2 0.77 10.52 0.69
7 2 0.71 7.99 1 0.36 10.88 0.73
8 2 0.67 8.66 0 0.00 10.88 0.80
9 3 0.95 9.61 0 0.00 10.88 0.88
10 0 0 9.61 0 0.00 10.88 0.88
DCG Example

k G DG DCG@k iG iDG iDCG@k NDCG@k

k G DG DCG@10 iG iDG iDCG@k NDCG@k

1 3 3 3 DCG
3.00 can
3 be any1.00
2 2 2 3 3.00 6 0.83
3 3 1.89
positive
3
real number!
1.89 7.89 0.87
4 0 0 2 1.00 8.89 0.78
5 0 0 2 Why
9.75 is that
0.86 a problem?
0.71
6 1 0.39 2 0.77 10.52 0.69
7 2 0.71 1 0.36 10.88 0.73
8 2 0.67 0 0.00 10.88 0.80
9 3 0.95 0 0.00 10.88 0.88
10 0 0 9.61 0 0.00 10.88 0.88
Normalized DCG (nDCG)
 DCG values are often normalized by comparing the DCG at each
rank with the DCG value for the perfect ranking
● makes averaging easier for queries with different numbers of relevant
documents

 NDCG@k = DCG@k / iDCG@k (divide actual by ideal)

 nDCG  1 at any rank position
 ideal ranking has nDCG of 1.0

45
nDCG Example

k G DG DCG@10 iG iDG iDCG@k nDCG@k

1 3 3 3 3.00 3 1.00
2 2 2 3 3.00 6 0.83
3 3 1.89 3 1.89 7.89 0.87
4 0 0 2 1.00 8.89 0.78
5 0 0 2 0.86 9.75 0.71
6 1 0.39 2 0.77 10.52 0.69
7 2 0.71 1 0.36 10.88 0.73
8 2 0.67 0 0.00 10.88 0.80
9 3 0.95 0 0.00 10.88 0.88
10 0 0 9.61 0 0.00 10.88 0.88
nDCG Example

k G DG DCG@10 iG iDG iDCG@k nDCG@k

1 3 3 3 3 3 1.00
2 2 2 3 3 6 0.83
3 3 1.89 3 1.89 7.89 0.87
4 0 0 2 1.00 8.89 0.78
5 0 0 2 0.86 9.75 0.71
6 1 0.39 2 0.77 10.52 0.69
7 2 0.71 1 0.36 10.88 0.73
8 2 0.67 0 0.00 10.88 0.80
9 3 0.95 0 0.00 10.88 0.88
10 0 0 9.61 0 0.00 10.88 0.88
nDCG Example

k G DG DCG@10 iG iDG iDCG@10 nDCG@k

1 3 3 3 3 1.00
2 2 2 3 3 0.83
3 3 1.89 3 1.89 0.87
4 0 0 2 1.00 0.78
5 0 0 2 0.86 0.71
6 1 0.39 2 0.77 0.69
7 2 0.71 1 0.36 0.73
8 2 0.67 0 0.00 0.80
9 3 0.95 0 0.00 0.88
10 0 0 9.61 0 0.00 10.88 0.88

÷
nDCG Example

k G DG DCG@10 iG iDG iDCG@10 nDCG@k

1 3 3 3 3
2 2 2 3 3
3 3 1.89 3 1.89
4 0 0 2 1.00
5 0 0 2 0.86
6 1 0.39 2 0.77
7 2 0.71 1 0.36
8 2 0.67 0 0.00
9 3 0.95 0 0.00
10 0 0 9.61 0 0.00 10.88 0.88

÷
50
Reusable Test Collections
 Document Collection
 Topics (sample of
information needs)
 Relevance judgments (qrels)

52
How can we get it?
 For web search, companies apply their own studies to assess the
performance of their search engine.
 Web-search performance is monitored by:
● Traffic
● User clicks and session logs
● Labelling results for selected users’ queries

 Academia (or lab settings):

● Someone goes out and builds them (expensive)
● As a byproduct of large scale evaluations (collaborative effort)
 IR Evaluation Campaigns are created for this reason
IR Evaluation Campaigns
 IR test collections are provided for scientific communities to develop
better IR methods.
 Collections and queries are provided, relevance judgements are built
during the campaign.
 TREC = Text REtrieval Conference http://trec.nist.gov/
● Main IR evaluation campaign, sponsored by NIST (US gov).
● Series of annual evaluations, started in 1992.
 Other evaluation campaigns
● CLEF: European version (since 2000)
● NTCIR: Asian version (since 1999)
● FIRE: Indian version (since 2008)
TREC Tracks and Tasks
 TREC (and other campaigns) are formed of a set of tracks, each track
is about (one or more) search tasks.
● Each track/task is about searching a set of documents of given genre and
domain.
 Examples
● TREC Web track
● TREC Medical track
● TREC Legal track → CLEF-IP track → NTCIR patent mining track
● TREC Microblog track
• Adhoc search task
• Filtering task
TREC Collection
A set of hundreds of thousands or millions of docs
● 1B in case of web search (TREC ClueWeb09)
 The typical format of a document:
<DOC>
<DOCNO> 1234 </DOCNO>
<TEXT>
This is the document.
Multilines of plain text.
</TEXT>
</DOC>
TREC Topic
 Topic: a statement of information need
 Multiple topics (~50) developed (mostly) at NIST for a collection.
 Developed by experts and associated with additional details.
● Title: the query text
● Description: description of what is meant by the query.
● Narrative: what should be considered relevant.
<num>189</num>
<title>Health and Computer Terminals</title>
<desc>Is it hazardous to the health of individuals to work with computer terminals on a
daily basis?</desc>
<narr>Relevant documents would contain any information that expands on any
physical disorder/problems that may be associated with the daily working with
computer terminals. Such things as carpel tunnel, cataracts, and fatigue have been said
to be associated, but how widespread are these or other problems and what is being
done to alleviate any health problems</narr>
Relevance Judgments
 For each topic, set of relevant docs is required to be known for an
effective evaluation!
 Exhaustive assessment is usually impractical
● TREC usually has 50 topics
● Collection usually has >1 million documents
 Random sampling won’t work
● If relevant docs are rare, none may be found!
 IR systems can help focus the sample (Pooling)
● Each system finds some relevant documents
● Different systems find different relevant documents
● Together, enough systems will find most of them
58
Pooled Assessment Methodology
1. Systems submit top 1000 documents per topic
2. Top 100 documents from each are manually judged
• Single pool, duplicates removed, arbitrary order
• Judged by the person who developed the topic
3. Treat unevaluated documents as not relevant
4. Compute MAP (or others) down to 1000 documents

 To make pooling work:

● Good number of participating systems
● Systems must do reasonably well
● Systems must be different (not all “do the same thing”)
59
60
How can we “rank” search results?

61
Evaluation at Large Search Engines
 Recall is difficult
to measure on the web – why?
 Search engines often use
● precision at top k, e.g., k = 10
● measures that reward you more for getting rank 1 right than for getting
rank 10 right (nDCG).
● non-relevance-based measures.
• Clickthrough on first result: not very reliable if you look at a single clickthrough …
but pretty reliable in the aggregate.
• Studies of user behavior in the lab.
• A/B testing

63
A/B testing (online testing)
 Purpose: Test a single innovation.
 Prerequisite: You have a large search engine up and running.
 Have most users use old system.
 Divert a small proportion of traffic (e.g., 1%) to the new system
that includes the innovation
 Evaluate with an “automatic” measure like clickthrough on first
result.
 Now we can directly see if the innovation does improve user
happiness.
 Probably the evaluation methodology that large search engines
trust most.
64

UE20CS332 Unit3 Slides
No ratings yet
UE20CS332 Unit3 Slides
461 pages
Information Storage and Retrival
No ratings yet
Information Storage and Retrival
31 pages
Lecture 7 - Evaluation in IR, Relevance Feedback, Query Expansion
No ratings yet
Lecture 7 - Evaluation in IR, Relevance Feedback, Query Expansion
79 pages
Lecture 6
No ratings yet
Lecture 6
58 pages
Chapter 6-8IR Revised
No ratings yet
Chapter 6-8IR Revised
76 pages
Evaluation 1
No ratings yet
Evaluation 1
63 pages
IR Unit 5
No ratings yet
IR Unit 5
5 pages
Performance Evaluation of Information Retrieval Systems
No ratings yet
Performance Evaluation of Information Retrieval Systems
46 pages
Unit-V
No ratings yet
Unit-V
54 pages
Slides Chap04 PDF
No ratings yet
Slides Chap04 PDF
144 pages
Topic 6 W7 W8 - IREvaluation - Uodated
No ratings yet
Topic 6 W7 W8 - IREvaluation - Uodated
37 pages
4 IRModels
No ratings yet
4 IRModels
46 pages
Lecture5 6
No ratings yet
Lecture5 6
30 pages
Information Retrieval System and The Pagerank Algorithm
No ratings yet
Information Retrieval System and The Pagerank Algorithm
37 pages
Ip 8
No ratings yet
Ip 8
51 pages
Retrieval Performance Evaluation
No ratings yet
Retrieval Performance Evaluation
31 pages
Performance Evaluation of Information Retrieval Systems
No ratings yet
Performance Evaluation of Information Retrieval Systems
28 pages
Evaluation
No ratings yet
Evaluation
41 pages
Evaluation of Information Retrieval Systems: Thanks To Marti Hearst, Ray Larson, Chris Manning
No ratings yet
Evaluation of Information Retrieval Systems: Thanks To Marti Hearst, Ray Larson, Chris Manning
108 pages
10 Evaluation FSS20
No ratings yet
10 Evaluation FSS20
24 pages
L15 IRSW Evaluation
No ratings yet
L15 IRSW Evaluation
49 pages
4 IRinArabic2021 Ranked Retrieval I
No ratings yet
4 IRinArabic2021 Ranked Retrieval I
49 pages
R Object-oriented Programming
From Everand
R Object-oriented Programming
Kelly Black
3/5 (1)
CS336 MIR w5 Evaluation
No ratings yet
CS336 MIR w5 Evaluation
38 pages
IR Lecture 5b
No ratings yet
IR Lecture 5b
36 pages
6 Lec 2025
No ratings yet
6 Lec 2025
20 pages
1727759531-6 Evaluation in Information Retrieval
No ratings yet
1727759531-6 Evaluation in Information Retrieval
24 pages
5-Retrieval Effectiveness
No ratings yet
5-Retrieval Effectiveness
20 pages
09 Evaluation
No ratings yet
09 Evaluation
22 pages
3 Retrieval Evaluation
No ratings yet
3 Retrieval Evaluation
31 pages
5 Retrieval Evaluation
No ratings yet
5 Retrieval Evaluation
20 pages
IR Chapt 5
No ratings yet
IR Chapt 5
55 pages
6 Retrieval Effectiveness
No ratings yet
6 Retrieval Effectiveness
18 pages
Information Retrieval - 1
No ratings yet
Information Retrieval - 1
47 pages
Chapter 5 Retrieval Efective
No ratings yet
Chapter 5 Retrieval Efective
24 pages
SIT772 Lecture 10
No ratings yet
SIT772 Lecture 10
34 pages
Click To Edit Master Title Style: Evaluation Techniques For
No ratings yet
Click To Edit Master Title Style: Evaluation Techniques For
15 pages
Minimize The Overhead of A User Locating Needed Information Precision and Recall
No ratings yet
Minimize The Overhead of A User Locating Needed Information Precision and Recall
14 pages
IR Lecture 5b
No ratings yet
IR Lecture 5b
36 pages
ISR Chap... 6
No ratings yet
ISR Chap... 6
14 pages
5 Retrievalefective
No ratings yet
5 Retrievalefective
13 pages
Sec 5
No ratings yet
Sec 5
14 pages
5 Retrieval Effectiveness
No ratings yet
5 Retrieval Effectiveness
20 pages
Introduction To Telecom Technologies (Telecom) : Getachew Mamo
No ratings yet
Introduction To Telecom Technologies (Telecom) : Getachew Mamo
65 pages
Unit3 ISR
No ratings yet
Unit3 ISR
15 pages
5 Retrievalefective
No ratings yet
5 Retrievalefective
22 pages
Information Retrieval: IR Evaluation
No ratings yet
Information Retrieval: IR Evaluation
36 pages
Evaluation and Result Summaries
No ratings yet
Evaluation and Result Summaries
60 pages
Lecture8-Evaluation 2013
No ratings yet
Lecture8-Evaluation 2013
44 pages
IR Chapter 4
No ratings yet
IR Chapter 4
15 pages
IR - Chapter 5
No ratings yet
IR - Chapter 5
28 pages
Performance Evaluation of Information Retrieval Systems
No ratings yet
Performance Evaluation of Information Retrieval Systems
45 pages
6 Retrieval Evaluation
No ratings yet
6 Retrieval Evaluation
28 pages
Title: Perform Evaluation of Any Popular Search Engine Based On Relevancy. (E.g Google) Theory
No ratings yet
Title: Perform Evaluation of Any Popular Search Engine Based On Relevancy. (E.g Google) Theory
9 pages
Introduction To: Information Retrieval
No ratings yet
Introduction To: Information Retrieval
50 pages
Ch5 Retrieval Evaluation 2021
No ratings yet
Ch5 Retrieval Evaluation 2021
26 pages
IR Evaluation Tugas Kampus
No ratings yet
IR Evaluation Tugas Kampus
25 pages
Information Retrieval CMSC 476/676: Evaluation and Result Summaries
No ratings yet
Information Retrieval CMSC 476/676: Evaluation and Result Summaries
45 pages
Evaluation of Information Retrieval Systems
No ratings yet
Evaluation of Information Retrieval Systems
9 pages
Module 5 - Information Retrieval and Lexical Resources
0% (1)
Module 5 - Information Retrieval and Lexical Resources
80 pages
TREC Evalution Measures
No ratings yet
TREC Evalution Measures
10 pages
Information Retrieval System
No ratings yet
Information Retrieval System
4 pages
Sem 6
No ratings yet
Sem 6
12 pages
Chapter 1: Boolean Retrieval
No ratings yet
Chapter 1: Boolean Retrieval
9 pages
IR Practical
No ratings yet
IR Practical
24 pages
Introduction To Object Detection
No ratings yet
Introduction To Object Detection
24 pages
Lecture 9 - Evaluations
No ratings yet
Lecture 9 - Evaluations
68 pages
Efficient Content-Based Image Retrieval Using Integrated Dual Deep Convolutional Neural Network
No ratings yet
Efficient Content-Based Image Retrieval Using Integrated Dual Deep Convolutional Neural Network
8 pages
93512information Retrieval LecturesNotes2024
No ratings yet
93512information Retrieval LecturesNotes2024
153 pages
IR - 754 All Practical
No ratings yet
IR - 754 All Practical
21 pages
Multilingual Information Retrieval
No ratings yet
Multilingual Information Retrieval
18 pages
Introduction IR
No ratings yet
Introduction IR
61 pages
Revised2 (Unmasked) ENHANCED RETRIEVAL FOR QA SYSTEM TAILORED FOR NEPALI LEGAL DOCUMENTS FOCUSING ON PSC EXAMS USING GPT-4 AND RAG FRAMEWORK PDF
No ratings yet
Revised2 (Unmasked) ENHANCED RETRIEVAL FOR QA SYSTEM TAILORED FOR NEPALI LEGAL DOCUMENTS FOCUSING ON PSC EXAMS USING GPT-4 AND RAG FRAMEWORK PDF
7 pages
Unit Iv - Irt
No ratings yet
Unit Iv - Irt
62 pages
Wildfire - and - Smoke - Detection - Using - YOLO - NAS - ICMI - 2024
No ratings yet
Wildfire - and - Smoke - Detection - Using - YOLO - NAS - ICMI - 2024
6 pages
Evaluating Recommender Sytems
No ratings yet
Evaluating Recommender Sytems
39 pages
An Efficient Transformer-Based System For Text-Based Video Segment Retrieval Using FAISS
No ratings yet
An Efficient Transformer-Based System For Text-Based Video Segment Retrieval Using FAISS
4 pages
Web Intelligence: What Is Webintelligence?
No ratings yet
Web Intelligence: What Is Webintelligence?
25 pages
1 s2.0 S2666521225000031 Main - DessyNovita
No ratings yet
1 s2.0 S2666521225000031 Main - DessyNovita
9 pages
CS 3308 Learning Journal 6
No ratings yet
CS 3308 Learning Journal 6
8 pages
Chapter 5 IR Evaluation
No ratings yet
Chapter 5 IR Evaluation
45 pages
Chapter-5: Retrieval Effectiveness
No ratings yet
Chapter-5: Retrieval Effectiveness
25 pages
CS 3308 Learning Journal Unit 6
No ratings yet
CS 3308 Learning Journal Unit 6
7 pages
2023 SpliTech GasparovicB - mausaG.rukavinaJ - Lergaj. EvaluatingYOLOV5YOLOV6YOLOV7andYOLOV8inUnderwaterEnvironmentIsThereRealImprovement
No ratings yet
2023 SpliTech GasparovicB - mausaG.rukavinaJ - Lergaj. EvaluatingYOLOV5YOLOV6YOLOV7andYOLOV8inUnderwaterEnvironmentIsThereRealImprovement
5 pages
1 s2.0 S2666651021000176 Main
No ratings yet
1 s2.0 S2666651021000176 Main
6 pages
Hindi To English and Marathi To English Cross Lang
No ratings yet
Hindi To English and Marathi To English Cross Lang
9 pages

Chapter3 MA212 Evaluation

Uploaded by

Chapter3 MA212 Evaluation

Uploaded by

Faculty of Artificial Intelligence – KFS

MA212 : Information Retrieval and Web Search

 Evaluation at Large Search Engines

Measure of These are

 Should be easily replicated by other researchers

 Should be easily comparable

A test collection has 3 components (choose 3):

 Suitable for Boolean Search

retrieved not retrieved

 Retrieve more docs:

Returns most relevant

 Harmonic mean of recall and precision

 Beta controls relative importance of precision and recall

 Evaluation at Large Search Engines

Which relevant document

 What about calculating the averages over all potential stops?

MAP = (0.608+0.207+0.182)/3 = 0.332 37

 When we have only 1

 Uses binary relevance: 𝑟𝑟𝑟𝑟𝑟𝑟= 0 or 1

1. Higher relevance docs should have higher value (gain) in

2. But this value will decay (be discounted) if it appears lower in

k G DG DCG@k iG iDG iDCG@k NDCG@k

k G DG DCG@k iG iDG iDCG@k NDCG@k

k G DG DCG@10 iG iDG iDCG@k NDCG@k

 NDCG@k = DCG@k / iDCG@k (divide actual by ideal)

k G DG DCG@10 iG iDG iDCG@k nDCG@k

k G DG DCG@10 iG iDG iDCG@k nDCG@k

k G DG DCG@10 iG iDG iDCG@10 nDCG@k

k G DG DCG@10 iG iDG iDCG@10 nDCG@k

 Academia (or lab settings):

 To make pooling work:

You might also like