Lecture 7b: Efficient scoring
Today’s focus
Retrieval: get docs matching the query from the inverted index
Scoring + ranking:
Assign a score to each doc
Pick the K highest-scoring docs
Our emphasis today will be on doing this efficiently, rather than on the quality of the ranking
Background
Score computation is a large (10s of %) fraction of the CPU work on a query
Generally, we have a tight budget on latency (say, 250 ms)
CPU provisioning doesn’t permit exhaustively scoring every document on every query
Today we’ll look at ways of cutting CPU usage for scoring, without compromising the quality of results (much)
Basic idea: avoid scoring docs that won’t make it into the top K
Non-safe ranking
Non-safe ranking may be okay
Ranking function is only a proxy for user happiness
Documents close to the top K may be just fine
Index elimination (see the sketch below)
Only consider high-idf query terms
Only consider docs containing many query terms
Champion lists
High/low lists, tiered indexes
Order postings by g(d) (a query-independent quality score)
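For concreteness, a minimal sketch of index elimination, assuming postings maps each term to a sorted list of doc IDs and idf maps each term to its idf value; the helper name and cutoff parameters are illustrative, not part of the lecture:

    from collections import Counter

    def candidate_docs(query_terms, postings, idf, idf_cutoff=2.0, min_terms=2):
        """Index elimination (sketch): traverse postings only for high-idf
        query terms, and keep only docs containing at least min_terms of them."""
        # Low-idf terms (e.g. stop words) contribute little to the score.
        selective = [t for t in query_terms if idf.get(t, 0.0) >= idf_cutoff]

        # Count, per doc, how many of the selective terms it contains.
        counts = Counter()
        for t in selective:
            for doc_id in postings.get(t, []):
                counts[doc_id] += 1

        # Only these docs are handed to the full (e.g. cosine) scorer.
        return {d for d, c in counts.items() if c >= min_terms}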
Safe ranking
When we output the top K docs, we have a proof that these are indeed the top K
Does this imply we always have to compute all N cosines?
We’ll look at pruning methods
So we only fully score some J documents
Do we have to sort the J cosine scores?
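No: a min-heap bounded at size K returns the top K of the J scored docs in O(J log K), without sorting all J scores. A small sketch, assuming scored_docs is an iterable of (doc_id, score) pairs:

    import heapq

    def top_k(scored_docs, k):
        """Select the K highest-scoring docs without sorting all J scores."""
        heap = []  # min-heap of (score, doc_id); the weakest kept score sits at heap[0]
        for doc_id, score in scored_docs:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))  # evict the current weakest
        return sorted(heap, reverse=True)  # only the final K results get sorted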
WAND scoring
An instance of DAAT (document-at-a-time) scoring
Basic idea reminiscent of branch and bound
We maintain a running threshold score, e.g., the Kth highest score computed so far
We prune away all docs whose cosine scores are guaranteed to be below the threshold
We compute exact cosine scores for only the un-pruned docs
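A sketch of that outer loop, assuming helper functions (find_pivot, smallest_finger_doc, score, advance_fingers) that stand in for the postings machinery described on the next slides; the running threshold is read off a size-K min-heap of the scores computed so far:

    import heapq

    def wand_top_k(k, find_pivot, smallest_finger_doc, score, advance_fingers):
        """Outer WAND loop (sketch). find_pivot(threshold) returns the smallest
        doc ID whose upper-bounded score could still exceed the threshold, or
        None if no remaining doc can; advance_fingers(d) moves every finger
        that is before doc d up to its first posting >= d."""
        heap = []                          # min-heap of (score, doc_id), size <= k
        threshold = float("-inf")          # no pruning until we have K scores
        while True:
            pivot = find_pivot(threshold)
            if pivot is None:
                break                      # every remaining doc is pruned
            if smallest_finger_doc() == pivot:
                s = score(pivot)           # exact (e.g. cosine) score, un-pruned doc only
                if len(heap) < k:
                    heapq.heappush(heap, (s, pivot))
                elif s > heap[0][0]:
                    heapq.heapreplace(heap, (s, pivot))
                if len(heap) == k:
                    threshold = heap[0][0]     # Kth highest score computed so far
                advance_fingers(pivot + 1)     # move past the scored doc
            else:
                advance_fingers(pivot)         # move lagging fingers up to the pivot
        return sorted(heap, reverse=True)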
Upper bounds
At all times, for each query term t, we maintain an upper bound UBt on the score contribution of any doc to the right of the finger (the current position of our pointer into t’s postings)
UBt = max, over docs remaining in t’s postings, of wt(doc)
[Figure: postings of t: 3 7 11 17 29 38 57 79, with the finger position marked; here UBt = wt(38)]
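In code this is just a max over the remaining weights; a short sketch, where term_weights is an assumed list of wt(doc) values aligned with t’s postings and finger_idx is the finger’s index into it:

    def term_upper_bound(term_weights, finger_idx):
        """UBt: largest per-doc weight among docs at or to the right of the finger.
        In practice UBt is often just precomputed once as the max over the whole list."""
        return max(term_weights[finger_idx:])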
Pivoting
Query: catcher in the rye
Suppose the threshold (the Kth highest score so far) is 6.8, and each query term has a current finger position and upper bound UBt
Sort the terms by finger position and accumulate their UBs in that order: the pivot is the first doc ID at which the accumulated upper bound exceeds the threshold
No doc to the left of the pivot can score above the threshold, so the lagging fingers can be advanced to the pivot without scoring those docs
[Figure: finger positions and upper bounds for the query terms, with the pivot marked]
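A sketch of this pivot-selection step, assuming fingers maps each query term to the doc ID its finger currently points at and ub maps each term to UBt:

    def find_pivot(fingers, ub, threshold):
        """Sort terms by finger position and accumulate upper bounds; the pivot
        is the first doc ID at which the accumulated UB exceeds the threshold.
        Returns None when even the sum of all UBs cannot beat the threshold,
        i.e. the traversal can stop."""
        acc = 0.0
        for term in sorted(fingers, key=fingers.get):
            acc += ub[term]
            if acc > threshold:
                return fingers[term]      # pivot doc ID
        return None

If the smallest finger already sits at the pivot, that doc is scored exactly; otherwise the lagging fingers jump to the pivot and pivoting repeats.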
Update UBs
After the fingers are advanced (and the pivot doc is scored or skipped), recompute each moved term’s UBt over its remaining postings and pivot again; the process repeats until no remaining doc can beat the threshold
[Example finger positions: catcher 589, rye 589, in 589, the 762]
WAND summary
In tests, WAND leads to a 90+% reduction in score computation
Better gains on longer queries
Nothing we did was specific to cosine ranking
We need scoring to be additive by term (illustrated below)
WAND and variants give us safe ranking
Possible to devise “careless” variants that are a bit faster but not safe (see the summary in Ding+Suel 2011)
These ideas combine some of the non-safe scoring techniques we considered earlier
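For instance, any scorer of the following shape qualifies (a sketch; w stands in for whatever per-term weighting is in use, e.g. tf-idf or BM25):

    def additive_score(doc_id, query_terms, w):
        """The total score is a sum of per-term contributions w(t, doc_id),
        so UBt = max over docs of w(t, doc) bounds term t's share, which is
        exactly what the pruning argument relies on."""
        return sum(w(t, doc_id) for t in query_terms)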