Lect 13-Text Ranking
Text-based Ranking
(1st generation)
Doc is a binary vector
Binary vectors X, Y ∈ {0,1}^D
Score = overlap |X ∩ Y|. What’s wrong?
            Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony               1                  1             0          0       0        1
Brutus               1                  1             0          1       0        0
Caesar               1                  1             0          1       1        1
Calpurnia            0                  1             0          0       0        0
Cleopatra            1                  0             0          0       0        0
mercy                1                  0             1          1       1        1
worser               1                  0             1          1       1        0
Normalization
Dice coefficient (wrt avg #terms): Dice(X,Y) = 2|X ∩ Y| / (|X| + |Y|). NOT triangular (it does not satisfy the triangle inequality).
Jaccard coefficient (wrt union of terms): Jaccard(X,Y) = |X ∩ Y| / |X ∪ Y|. OK, triangular (1 − Jaccard is a metric).
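A minimal sketch (not from the slides) of these two overlap measures on the set-of-terms view of documents; the toy sets are illustrative.

```python
def dice(x: set, y: set) -> float:
    """Dice(X,Y) = 2|X ∩ Y| / (|X| + |Y|): overlap normalized by the average size."""
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0

def jaccard(x: set, y: set) -> float:
    """Jaccard(X,Y) = |X ∩ Y| / |X ∪ Y|: overlap normalized by the union size."""
    return len(x & y) / len(x | y) if (x or y) else 0.0

query = {"brutus", "caesar"}
doc   = {"antony", "brutus", "caesar", "mercy", "worser"}
print(dice(query, doc), jaccard(query, doc))   # 0.571..., 0.4
```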
What’s wrong with binary vectors?
Overlap matching doesn’t consider:
Term frequency in a document: if a document talks more about t, then t should be weighted more.
Length of documents: the score should be normalized.
A famous “weight”: tf-idf
w_{t,d} = tf_{t,d} × log(n / n_t)
tf_{t,d} = number of occurrences of term t in doc d; n = number of docs; n_t = number of docs containing t
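A minimal sketch (not from the slides) of how these weights could be computed on a toy corpus; the corpus and names are illustrative.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of terms (illustrative only).
docs = {
    "d1": ["antony", "brutus", "caesar", "caesar"],
    "d2": ["brutus", "caesar", "calpurnia"],
    "d3": ["mercy", "worser"],
}

n = len(docs)                                              # total number of documents
n_t = Counter(t for d in docs.values() for t in set(d))    # docs containing each term

def tf_idf(term: str, doc_id: str) -> float:
    """w_{t,d} = tf_{t,d} * log(n / n_t), as in the formula above."""
    tf = docs[doc_id].count(term)
    return tf * math.log(n / n_t[term]) if tf else 0.0

print(tf_idf("caesar", "d1"))   # 2 * log(3/2)
```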
(Slide figure: the same term-document matrix over the six plays, with weighted entries instead of 0/1; the values are not recoverable here.)
cosine(query, document)
Dot product:
cos(q, d) = (q · d) / (|q| · |d|) = ( Σ_{i=1..D} q_i d_i ) / ( √(Σ_{i=1..D} q_i²) · √(Σ_{i=1..D} d_i²) )
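A minimal sketch (not from the slides) of cosine scoring between two sparse tf-idf vectors, represented as term → weight dicts; the names and values are illustrative.

```python
import math

def cosine(q: dict[str, float], d: dict[str, float]) -> float:
    """cos(q,d) = (q·d) / (|q| |d|), iterating only over the query's terms."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q = {"brutus": 1.2, "caesar": 0.4}
d = {"antony": 0.7, "brutus": 0.9, "caesar": 0.8}
print(cosine(q, d))
```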
Storage
w_{t,d} = tf_{t,d} × log(n / n_t)
For every term t, we have in memory the length n_t of its posting list, so the IDF is implicitly available.
Approximate retrieval
Sec. 7.1.1
Example posting lists (docIDs per term):
Antony    → 3, 4, 8, 16, 32, 64, 128
Brutus    → 2, 4, 8, 16, 32, 64, 128
Caesar    → 1, 2, 3, 5, 8, 13, 21, 34
Calpurnia → 13, 16, 32
Search:
If |Q| = q terms, merge their preferred lists (≤ m·q candidate answers).
Compute the cosine between Q and these docs, and choose the top k.
Need to pick m > k for this to work well empirically.
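A minimal sketch (not from the slides) of this preferred-list strategy, assuming a precomputed preferred (champion) list of m docs per term and a per-document scoring function such as the cosine above; all names are illustrative.

```python
import heapq

def approximate_top_k(query_terms, preferred, score, k):
    """Merge the preferred lists of the query terms (<= m*q candidates),
    score only those docs, and return the top-k by score."""
    candidates = set()
    for t in query_terms:
        candidates.update(preferred.get(t, []))       # union of preferred lists
    return heapq.nlargest(k, candidates, key=score)   # rank only the few candidates

# Illustrative usage: preferred maps term -> docIDs, score maps docID -> float.
preferred = {"antony": [3, 4, 8], "brutus": [2, 4, 8]}
print(approximate_top_k(["antony", "brutus"], preferred, score=lambda d: -d, k=2))
```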
Approach #4: Fancy-hits heuristic
Preprocess:
Assign docIDs by decreasing PR (PageRank) weight
Sorting by docID = ordering by decreasing PR weight
Define FH(t) = the m docs of t’s posting list with highest tf-idf weight
Define IL(t) = the rest of the posting list
Idea: a document that scores high should be in FH or near the front of IL
Search for a t-term query:
First FH: compute the score of all docs in the terms’ FH lists, as with champion lists, and keep the top-k docs.
Then IL: scan the ILs and check the common docs;
compute their scores and possibly insert them into the top-k.
Stop when M docs have been checked or the PR score becomes smaller than some threshold.
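A minimal sketch (an assumption, not the slides' implementation) of the two-phase FH/IL search just described; FH, IL, score, pr, M and pr_min are illustrative names, and for simplicity the sketch scores every doc met in the ILs rather than only the "common" ones.

```python
import heapq

def fancy_hits_search(query_terms, FH, IL, score, pr, k, M, pr_min):
    """Two-phase scan: first the FH lists (high tf-idf docs), then the ILs in
    docID order (= decreasing PR), stopping early.
    FH/IL: term -> list of docIDs; score(d), pr(d): tf-idf score and PR of doc d."""
    # Phase 1 (FH): score all docs in the terms' FH lists, keep the top-k.
    candidates = {d for t in query_terms for d in FH.get(t, [])}
    top_k = heapq.nlargest(k, candidates, key=score)

    # Phase 2 (IL): scan docIDs in increasing order (= decreasing PR).
    checked = 0
    for d in sorted({d for t in query_terms for d in IL.get(t, [])}):
        if checked >= M or pr(d) < pr_min:            # the two stopping conditions
            break
        checked += 1
        top_k = heapq.nlargest(k, set(top_k) | {d}, key=score)
    return top_k
```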
(Slide figure: the posting lists of the terms “Pisa” and “Torre”. For “Pisa”, FH holds the top-m docs by tf-idf (tf-idf ≥ 10) and IL holds the docs with tf-idf < 10; for “Torre” the split is at tf-idf ≥ 20. Within each part, docs appear by decreasing PR.)
Modeling authority
Assign to each document d a query-independent quality score in [0,1].
Denote this by g(d).
Sec. 7.1.6
(Slide figure: cluster pruning. The query is compared against the leaders; the docs attached to a leader are its followers.)
General variants
Have each follower still attached to its nearest leader.
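The fragments above refer to the leader/follower scheme of Sec. 7.1.6 (cluster pruning). A minimal sketch under the usual assumptions (leaders picked at random, cosine similarity); every name here is illustrative, not from the slides.

```python
import math, random

def cosine(q, d):
    """Cosine between two sparse term -> weight dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def build_clusters(docs, num_leaders):
    """Pick random leaders; attach every doc to its nearest leader (its 'followers')."""
    leaders = random.sample(list(docs), num_leaders)      # e.g. num_leaders ~ sqrt(N)
    followers = {l: [] for l in leaders}
    for d in docs:
        nearest = max(leaders, key=lambda l: cosine(docs[l], docs[d]))
        followers[nearest].append(d)
    return leaders, followers

def cluster_pruning_search(q, docs, leaders, followers, k):
    """Find the leader nearest to q, then rank only that leader's followers."""
    best = max(leaders, key=lambda l: cosine(q, docs[l]))
    pool = set(followers[best]) | {best}
    return sorted(pool, key=lambda d: cosine(q, docs[d]), reverse=True)[:k]
```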
Exact retrieval
Goal
Given a query Q, find the exact top K docs
for Q, using some ranking function r
Simplest Strategy:
1) Find all documents in the intersection of the query terms’ posting lists
2) Compute score r(d) for all these documents d
3) Sort results by score
4) Return top K results
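A minimal sketch (not from the slides) of this four-step strategy, assuming postings maps each term to its list of docIDs and r scores a single document; the usage data is illustrative.

```python
import heapq

def exact_top_k(query_terms, postings, r, k):
    """1) intersect the posting lists, 2) score each surviving doc with r,
    3)-4) sort by score and return the top K."""
    if not query_terms or any(t not in postings for t in query_terms):
        return []                                    # a missing term empties the AND
    candidates = set.intersection(*(set(postings[t]) for t in query_terms))
    return heapq.nlargest(k, candidates, key=r)      # heap avoids a full sort

postings = {"catcher": [3, 27, 273], "rye": [27, 273, 304]}
print(exact_top_k(["catcher", "rye"], postings, r=lambda d: 1.0 / d, k=2))   # [27, 273]
```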
Background
Score computation is a large fraction of the CPU
work on a query
Generally, we have a tight budget on latency (say,
100ms)
We can’t exhaustively score every document!
The WAND technique, on a running example (query terms: catcher, in, the, rye).
Current docID pointers into the posting lists:
rye     → 304
catcher → 273
the     → 762
in      → 589
Sort pointers
Sort the pointers to the inverted lists by increasing (current) document id:
catcher → 273
rye     → 304
in      → 589
the     → 762
Find the pivot
Find the “pivot”: the first pointer in this order for which the sum of the upper bounds of the terms up to and including it is at least equal to the threshold.
Example: threshold = 6.8 and UB_catcher = 2.3; accumulating the upper bounds down the sorted list, the pivot falls on “in”, i.e. on docID 589.
Prune docs that have no hope
With threshold = 6.8, the docIDs before the pivot (589) cannot accumulate enough upper-bound mass to reach the threshold, so they are skipped.
Compute the pivot’s score
If 589 is present in enough postings (soft AND), compute its full score; otherwise advance the preceding pointers up to docID 589.
If 589 is inserted in the current top-K, update the threshold!
Advance and pivot again …
Pointers after advancing:
catcher → 589
rye     → 589
in      → 589
the     → 762
WAND summary
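A minimal sketch of the pivoting loop walked through above (an illustration, not a production WAND implementation); postings, ub and score are assumed inputs, and in the usage below only the docIDs come from the walkthrough, the bounds and scores are made up.

```python
import heapq

def wand_top_k(postings, ub, score, k):
    """postings: term -> docID-sorted posting list; ub: term -> upper bound on
    that term's score contribution; score(d): full score of doc d.
    Returns the top-k docIDs by score."""
    ptr = {t: 0 for t in postings}                 # current position in each list
    top = []                                       # min-heap of (score, docID)
    threshold = 0.0
    while True:
        # Terms whose list is not exhausted, sorted by their current docID.
        live = sorted((t for t in postings if ptr[t] < len(postings[t])),
                      key=lambda t: postings[t][ptr[t]])
        if not live:
            break
        # Find the pivot: first term at which the accumulated UBs reach the threshold.
        acc, pivot = 0.0, None
        for t in live:
            acc += ub[t]
            if acc >= threshold:
                pivot = postings[t][ptr[t]]
                break
        if pivot is None:                          # no doc can beat the threshold
            break
        if postings[live[0]][ptr[live[0]]] == pivot:
            # The preceding pointers all sit on the pivot doc: fully score it.
            s = score(pivot)
            if len(top) < k:
                heapq.heappush(top, (s, pivot))
            elif s > top[0][0]:
                heapq.heapreplace(top, (s, pivot))
            if len(top) == k:
                threshold = top[0][0]              # raise the pruning threshold
            for t in live:                         # advance every pointer on the pivot
                if postings[t][ptr[t]] == pivot:
                    ptr[t] += 1
        else:
            # Advance one preceding pointer up to the pivot docID (no full scoring).
            t = live[0]
            while ptr[t] < len(postings[t]) and postings[t][ptr[t]] < pivot:
                ptr[t] += 1
    return [d for s, d in sorted(top, reverse=True)]

# Illustrative run (docIDs from the walkthrough; bounds and scores are made up).
postings = {"catcher": [273, 589], "rye": [304, 589], "in": [589, 762], "the": [589, 762]}
ub = {"catcher": 2.3, "rye": 1.8, "in": 3.0, "the": 4.0}
print(wand_top_k(postings, ub, score=lambda d: 1000.0 / d, k=2))
```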
Relevance feedback
Sec. 9.1
Relevance Feedback
Relevance feedback: user feedback on
relevance of docs in initial set of results
User issues a (short, simple) query
The user marks some results as relevant or
non-relevant.
The system computes a better representation
of the information need based on feedback.
Relevance feedback can go through one or
more iterations.
Sec. 9.1.1
Rocchio (SMART)
Used in practice:
q_m = α·q_0 + β·(1/|D_r|)·Σ_{d_j ∈ D_r} d_j − γ·(1/|D_nr|)·Σ_{d_j ∈ D_nr} d_j
where D_r is the set of known relevant docs, D_nr the set of known non-relevant docs, and α, β, γ weight the original query, the positive and the negative feedback.
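A minimal sketch (not from the slides) of this update on sparse term → weight vectors; the default alpha, beta, gamma values and the toy data are illustrative.

```python
from collections import defaultdict

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(relevant) - gamma*centroid(nonrelevant),
    on sparse term -> weight dicts. Non-positive weights are dropped."""
    qm = defaultdict(float)
    for t, w in q0.items():
        qm[t] += alpha * w
    for d in relevant:
        for t, w in d.items():
            qm[t] += beta * w / len(relevant)
    for d in nonrelevant:
        for t, w in d.items():
            qm[t] -= gamma * w / len(nonrelevant)
    return {t: w for t, w in qm.items() if w > 0}

q0 = {"jaguar": 1.0}
rel = [{"jaguar": 0.9, "car": 0.8}]
nonrel = [{"jaguar": 0.7, "cat": 0.9}]
print(rocchio(q0, rel, nonrel))
```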
Query Expansion
In relevance feedback, users give additional input (relevant/non-relevant) on documents, and the terms of those documents are used to reweight (and expand) the query.
In query expansion, users instead give additional input directly on words or phrases to add to the query.
Is it good?
How fast does it index?
Number of documents/hour (for a given average document size)
Precision vs. Recall
Precision: % of retrieved docs that are relevant [issue: how much “junk” is found]
Recall: % of relevant docs that are retrieved [issue: how much of the “info” is found]
(Venn diagram: within the collection, the set of Retrieved docs overlaps the set of Relevant docs; precision and recall measure this overlap from the two sides.)
How to compute them
Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved
               Relevant               Not Relevant
Retrieved      tp (true positive)     fp (false positive)
Not Retrieved  fn (false negative)    tn (true negative)

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)
A common picture
(Plot: precision on the y-axis against recall on the x-axis; the measured points typically show precision falling as recall grows.)
F measure
Combined measure (weighted harmonic mean):
F = 1 / ( α·(1/P) + (1 − α)·(1/R) )
People usually use the balanced F1 measure,
i.e., with α = ½, thus 1/F = ½·(1/P + 1/R), i.e. F1 = 2PR / (P + R).
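A minimal sketch (not from the slides) computing these measures from the counts of the contingency table above; the example counts are made up.

```python
def precision_recall_f(tp: int, fp: int, fn: int, alpha: float = 0.5):
    """Precision, recall, and the weighted-harmonic-mean F measure
    (alpha = 0.5 gives the balanced F1 = 2PR / (P + R))."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 1.0 / (alpha / p + (1 - alpha) / r) if p and r else 0.0
    return p, r, f

print(precision_recall_f(tp=40, fp=10, fn=60))   # P=0.8, R=0.4, F1≈0.533
```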