

CAIM: Cerca i Anàlisi d’Informació Massiva (Search and Analysis of Massive Information)

FIB, Grau en Enginyeria Informàtica (Degree in Informatics Engineering)

Slides by Marta Arias, José Luis Balcázar,
Ramon Ferrer-i-Cancho, Ricard Gavaldà
Department of Computer Science, UPC

Fall 2018
http://www.cs.upc.edu/~caim

4. Evaluation and Relevance Feedback
Evaluation of Information Retrieval Usage, I
What exactly are we to do?

In the Boolean model, the specification is unambiguous.
We know what we are to do: retrieve and provide to the user
all those documents that satisfy the query.

But is this what the user really wants?

Sorry, but usually… no.

Evaluation of Information Retrieval Usage, II
Then, what exactly are we to optimize?

Notation:
- D: the set of all our documents, on which the user asks one query;
- A: the answer set: the documents that the system retrieves as answer;
- R: the relevant documents: those that the user actually wishes to see as answer.
  (But no one knows this set, not even the user!)

Unreachable goal: A = R, that is:

- Pr(d ∈ A | d ∈ R) = 1, and
- Pr(d ∈ R | d ∈ A) = 1.

The Recall and Precision measures

Let’s settle for:

- high recall, |R ∩ A| / |R|:
  Pr(d ∈ A | d ∈ R) not too much below 1,
- high precision, |R ∩ A| / |A|:
  Pr(d ∈ R | d ∈ A) not too much below 1.

Difficult balance. More later.
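
To make the two measures concrete, here is a minimal Python sketch; the document IDs and the two sets are invented, purely for illustration:

def recall(relevant, answered):
    # fraction of the relevant documents that were actually retrieved
    return len(relevant & answered) / len(relevant) if relevant else 0.0

def precision(relevant, answered):
    # fraction of the retrieved documents that are actually relevant
    return len(relevant & answered) / len(answered) if answered else 0.0

R = {"d1", "d2", "d3", "d4"}   # documents the user actually wants
A = {"d2", "d3", "d5"}         # documents the system returned

print(recall(R, A))            # 2/4 = 0.5
print(precision(R, A))         # 2/3 ≈ 0.67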

Recall and Precision, II
Example: test for tuberculosis (TB)

- 1000 people, out of which 50 have TB
- the test is positive on 40 people, of which 35 really have TB

Recall:
% of true TB cases that test positive = 35 / 50 = 70 %

Precision:
% of positives that really have TB = 35 / 40 = 87.5 %

- Large recall: few sick people go away undetected
- Large precision: few people are scared unnecessarily (few false alarms)
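
The same figures as a quick arithmetic check in Python (all numbers taken from the example above):

sick = 50                  # people who really have TB
test_positive = 40         # people the test flags
true_positive = 35         # flagged people who really have TB

recall = true_positive / sick              # 35 / 50 = 0.70
precision = true_positive / test_positive  # 35 / 40 = 0.875
print(f"recall = {recall:.1%}, precision = {precision:.1%}")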

Recall and Precision, III. Confusion matrix
Equivalent definition

Confusion matrix:

                           Answered
                      relevant    not relevant
Reality  relevant        tp            fn
         not relevant    fp            tn

- |R| = tp + fn
- |A| = tp + fp
- |R ∩ A| = tp
- Recall = |R ∩ A| / |R| = tp / (tp + fn)
- Precision = |R ∩ A| / |A| = tp / (tp + fp)
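
A small Python sketch of these identities, computed from toy sets D (all documents), R and A; the document IDs are invented for illustration:

D = {f"d{i}" for i in range(1, 11)}   # all documents
R = {"d1", "d2", "d3", "d4"}          # relevant
A = {"d2", "d3", "d5"}                # answered

tp = len(R & A)        # relevant and retrieved
fp = len(A - R)        # retrieved but not relevant
fn = len(R - A)        # relevant but missed
tn = len(D - R - A)    # correctly left out

assert len(R) == tp + fn and len(A) == tp + fp
print("recall =", tp / (tp + fn), " precision =", tp / (tp + fp))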

How many documents to show?

We rank all documents according to some measure.

How many should we show?
- Users won't read through overly long answer lists.
- Long answers are likely to exhibit low precision.
- Short answers are likely to exhibit low recall.

We analyze precision and recall as functions of the number of documents k provided as answer.
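
For illustration, a short Python sketch that computes precision and recall at every cutoff k of a made-up ranked answer list:

def precision_at_k(ranking, relevant, k):
    return sum(d in relevant for d in ranking[:k]) / k

def recall_at_k(ranking, relevant, k):
    return sum(d in relevant for d in ranking[:k]) / len(relevant)

ranking = ["d7", "d2", "d9", "d1", "d5", "d3"]   # system output, best first
relevant = {"d1", "d2", "d3", "d4"}

for k in range(1, len(ranking) + 1):
    print(k, round(precision_at_k(ranking, relevant, k), 2),
             round(recall_at_k(ranking, relevant, k), 2))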

Rank-recall and rank-precision plots

[Figure: rank-recall and rank-precision plots. Source: Prof. J. J. Paijmans, Tilburg]

A single “precision and recall” curve
x-axis for recall, and y-axis for precision.
(Similar to, and related to, the ROC curve in predictive models.)

[Figure: precision-recall curve. Source: Stanford NLP group]


Often: plot 11 points of interpolated precision, at 0 %, 10 %, 20 %, …, 100 % recall.
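
A possible Python sketch of the 11-point interpolation, where interpolated precision at recall level r is the maximum precision observed at any recall ≥ r (the ranking and relevance judgements are invented):

def eleven_point_interpolated(ranking, relevant):
    points = []                       # (recall, precision) after each rank
    hits = 0
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    levels = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

ranking = ["d7", "d2", "d9", "d1", "d5", "d3"]
relevant = {"d1", "d2", "d3"}
print(eleven_point_interpolated(ranking, relevant))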
Other measures of effectiveness

- AUC: area under the curve of the plots above, relative to the best possible.
- F-measure: 2 / (1/recall + 1/precision).
  The harmonic mean: closer to the min of the two than the arithmetic mean.
- α-F-measure: 1 / (α/recall + (1−α)/precision), a weighted harmonic mean; α = 1/2 recovers the F-measure.
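
As a worked example, both measures in Python, evaluated on the recall and precision of the TB example above:

def f_measure(recall, precision):
    return 2 / (1 / recall + 1 / precision)                 # harmonic mean

def alpha_f_measure(recall, precision, alpha):
    return 1 / (alpha / recall + (1 - alpha) / precision)   # weighted harmonic mean

print(f_measure(0.70, 0.875))             # ≈ 0.778
print(alpha_f_measure(0.70, 0.875, 0.5))  # alpha = 1/2 gives the F-measure back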

Other measures of effectiveness, II

Take into account the documents previously known to the user.

- Coverage: |relevant & known & retrieved| / |relevant & known|
- Novelty: |relevant & retrieved & UNknown| / |relevant & retrieved|
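
A toy Python sketch of both measures, over invented sets of relevant, known, and retrieved documents:

relevant  = {"d1", "d2", "d3", "d4", "d5"}
known     = {"d1", "d2", "d6"}           # documents the user already knew
retrieved = {"d2", "d3", "d5", "d7"}

coverage = len(relevant & known & retrieved) / len(relevant & known)
novelty  = len((relevant & retrieved) - known) / len(relevant & retrieved)
print(coverage, novelty)                 # 1/2 and 2/3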

Relevance Feedback, I
Going beyond what the user asked for

The user relevance cycle:

1. Get a query q
2. Retrieve relevant documents for q
3. Show the top k to the user
4. Ask the user to mark them as relevant / irrelevant
5. Use the answers to refine q
6. If desired, go to 2

Relevance Feedback, II
How to create the new query?

Vector model: queries and documents are vectors.

Given a query q and a set of documents, split into a relevant set R
and a nonrelevant set NR, build a new query q′.

Rocchio's Rule:

    q′ = α·q + β·(1/|R|)·Σ_{d ∈ R} d − γ·(1/|NR|)·Σ_{d ∈ NR} d

- All vectors q and d must be normalized (e.g., to unit length).
- The weights α, β, γ are scalars, with α > β > γ ≥ 0; often γ = 0.
  - α: degree of trust in the original user's query,
  - β: weight of the positive information (terms that do not appear in the query but do appear in relevant documents),
  - γ: weight of the negative information.
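
A minimal Python sketch of Rocchio's rule, assuming queries and documents are term-weight vectors stored as numpy arrays; the vocabulary, the example vectors, and the default α, β, γ values are illustrative choices, not taken from these slides:

import numpy as np

def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # q, rel_docs, nonrel_docs: vectors over the same vocabulary,
    # assumed already normalized as required above
    q_new = alpha * q
    if rel_docs:
        q_new += beta * np.mean(rel_docs, axis=0)
    if nonrel_docs:
        q_new -= gamma * np.mean(nonrel_docs, axis=0)
    return np.clip(q_new, 0.0, None)   # negative weights are commonly dropped

# vocabulary: [cat, dog, fish]
q   = np.array([1.0, 0.0, 0.0])
rel = [np.array([0.8, 0.6, 0.0]), np.array([0.6, 0.8, 0.0])]
non = [np.array([0.0, 0.0, 1.0])]
print(rocchio(q, rel, non))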

Relevance Feedback, III

In practice, often:
- good improvement of recall in the first round,
- marginal improvement in the second round,
- almost none beyond.

In web search, precision matters much more than recall, so the extra
computation time and user patience may not be productive.

Relevance Feedback, IV
…as Query Expansion

It is a form of Query Expansion:
the new query has non-zero weights on words
that were not in the original query.

Pseudorelevance feedback

Do not ask the user for anything!

- User patience is a precious resource: they'll just walk away.
- Assume you did great in answering the query!
  That is, treat the top-k documents in the answer as all relevant.
- No interaction with the user,
  but don't forget that the search will feel slower.
- Stop, at the latest, when you get the same top-k documents as in the previous round.
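
A self-contained toy sketch of the idea in Python: pretend the top-k answers are relevant, expand the query with their most frequent new terms, and stop when the top-k no longer changes. The tiny corpus and the overlap-based scoring are invented and stand in for a real retrieval engine:

from collections import Counter

corpus = {
    "d1": "black cat sat on the mat",
    "d2": "a dog chased the black cat",
    "d3": "fish swim in the sea",
    "d4": "the cat and the dog sleep",
}

def top_k(query_terms, k=2):
    # rank documents by how many query-term occurrences they contain
    ranked = sorted(corpus, key=lambda d: sum(corpus[d].split().count(t)
                                              for t in query_terms), reverse=True)
    return ranked[:k]

def pseudo_relevance(query, k=2, expand=2, max_rounds=3):
    terms = set(query.split())
    best = top_k(terms, k)
    for _ in range(max_rounds):
        # pretend the current top-k documents are all relevant
        counts = Counter(w for d in best for w in corpus[d].split() if w not in terms)
        terms |= {w for w, _ in counts.most_common(expand)}
        new_best = top_k(terms, k)
        if new_best == best:      # same top-k as the previous round: stop
            break
        best = new_best
    return terms, best

print(pseudo_relevance("cat"))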

Pseudorelevance feedback, II

Alternative sources of feedback / query refinement:

- Links clicked / not clicked on.
- Think time / time spent looking at an item.
- The user's previous history.
- Other users' preferences!
- Co-occurring words: add words that often occur together with the query words, for query expansion (a sketch follows below).
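
For the last point, a toy Python sketch that ranks expansion candidates by how often they co-occur with a query word; the small corpus is invented for illustration:

from collections import Counter
from itertools import combinations

texts = [
    "cheap flight tickets to rome",
    "cheap hotel and flight deals",
    "rome hotel booking",
    "last minute flight deals",
]

# count how often each unordered pair of words appears in the same text
cooc = Counter()
for t in texts:
    for a, b in combinations(sorted(set(t.split())), 2):
        cooc[(a, b)] += 1

def expansion_candidates(query_word, n=3):
    scores = Counter()
    for (a, b), c in cooc.items():
        if query_word in (a, b):
            other = b if a == query_word else a
            scores[other] += c
    return [w for w, _ in scores.most_common(n)]

print(expansion_candidates("flight"))   # 'cheap' and 'deals' co-occur most here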

