
Introduction to Information Retrieval

Information Retrieval and Data Mining
Dr. Abdul Majid, DCIS

Lecture 7b: Efficient scoring


Introduction to Information Retrieval

Today’s focus
 Retrieval – get docs matching the query from the inverted index
 Scoring + ranking
 Assign a score to each doc
 Pick the K highest-scoring docs
 Our emphasis today is on doing this efficiently, rather than on the quality of the ranking

2
Introduction to Information Retrieval

Background
 Score computation is a large fraction (tens of percent) of the CPU work on a query
 Generally, we have a tight budget on latency (say, 250 ms)
 CPU provisioning doesn’t permit exhaustively scoring every document on every query
 Today we’ll look at ways of cutting CPU usage for scoring, without compromising the quality of results (much)
 Basic idea: avoid scoring docs that won’t make it into the top K
3
Introduction to Information Retrieval Ch. 6

Recap: Queries as vectors


 Vector space scoring
 We have a weight for each term in each doc
 Represent queries as vectors in the same space
 Rank documents according to their cosine similarity to the query in this space
 Or something more complex: BM25, proximity, …
 Vector space scoring is
 Entirely query dependent
 Additive on term contributions – no conditionals etc. (see the sketch below)
 Context insensitive (no interactions between query terms)
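To make the additivity concrete, here is a minimal sketch (the toy index, weights, and function names are mine, and length normalization is omitted): each query term independently adds w(t,q) · w(t,d) to a document's score, with no interaction between terms.

```python
from collections import defaultdict

# Hypothetical toy index: term -> list of (docID, w_td) postings.
index = {
    "catcher": [(3, 1.2), (7, 0.4)],
    "rye":     [(3, 0.9), (11, 1.1)],
}

def score_all(query_weights):
    """Accumulate per-term contributions; purely additive, no term interactions."""
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, w_td in index.get(term, []):
            scores[doc_id] += w_tq * w_td   # each term adds independently
    return dict(scores)

print(score_all({"catcher": 1.0, "rye": 0.5}))   # {3: 1.65, 7: 0.4, 11: 0.55}
```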
Introduction to Information Retrieval

TAAT vs DAAT techniques


 TAAT = “Term At A Time”
 Scores for all docs are computed concurrently, one query term at a time
 DAAT = “Document At A Time”
 The total score for each doc (including all query terms) is computed before proceeding to the next
 Each has implications for how the retrieval index is structured and stored (both loops are sketched below)

5
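As a rough illustration (toy postings, with query weights assumed already folded into the stored doc weights), the two strategies differ mainly in loop order: TAAT keeps a live accumulator for every doc, while DAAT merges the docID-sorted postings and finalizes each doc's score before moving on.

```python
from collections import defaultdict
import heapq

# postings: term -> list of (docID, w_td), sorted by docID within each term.

def taat(postings, query_terms):
    """TAAT: one term at a time; every doc's partial score stays live."""
    acc = defaultdict(float)                  # accumulator per doc
    for t in query_terms:
        for doc_id, w in postings[t]:
            acc[doc_id] += w
    return dict(acc)

def daat(postings, query_terms):
    """DAAT: finish each doc's total score before moving to the next docID."""
    merged = heapq.merge(*(postings[t] for t in query_terms))  # multiway merge
    scores, cur_doc, cur_score = {}, None, 0.0
    for doc_id, w in merged:
        if doc_id != cur_doc:
            if cur_doc is not None:
                scores[cur_doc] = cur_score   # cur_doc's score is now final
            cur_doc, cur_score = doc_id, 0.0
        cur_score += w
    if cur_doc is not None:
        scores[cur_doc] = cur_score
    return scores
```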
Introduction to Information Retrieval Sec. 7.1

Efficient cosine ranking


 Find the K docs in the collection “nearest” to the query ⇒ the K largest query-doc cosines
 Efficient ranking:
 Choosing the K largest cosine values efficiently
 Can we do this without computing all N cosines?
Introduction to Information Retrieval

Safe vs non-safe ranking


 The terminology “safe ranking” is used for methods that guarantee that the K docs returned are the K absolute highest-scoring documents
 (Not necessarily just under cosine similarity)
 Is it OK to be non-safe?
 If it is – how do we ensure we don’t stray too far from the safe solution?
 How do we measure how far we are?

7
Introduction to Information Retrieval

Non-safe ranking
 Non-safe ranking may be okay
 The ranking function is only a proxy for user happiness
 Documents close to the top K may be just fine
 Index elimination
 Only consider high-idf query terms
 Only consider docs containing many query terms
 Champion lists (a minimal sketch follows this slide)
 High/low lists, tiered indexes
 Order postings by g(d), a query-independent quality score

8
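A minimal sketch of champion lists, assuming the index maps each term to (docID, weight) postings; the cutoff r is illustrative. This is non-safe: a true top-K doc may appear in none of the query terms' champion lists.

```python
def build_champion_lists(index, r=50):
    """index: term -> list of (docID, w_td). Keep each term's r highest-weight docs."""
    return {
        t: sorted(postings, key=lambda p: p[1], reverse=True)[:r]
        for t, postings in index.items()
    }

# At query time, score only docs drawn from the query terms' champion lists.
```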
Introduction to Information Retrieval

Safe ranking
 When we output the top K docs, we have a proof that these are indeed the top K
 Does this imply we always have to compute all N cosines?
 We’ll look at pruning methods
 So that we only fully score some J documents
 Do we have to sort those J cosine scores?

9
Introduction to Information Retrieval Sec. 7.1

Computing the K largest cosines: selection vs. sorting
 Typically we want to retrieve the top K docs (in the cosine ranking for the query)
 not to totally order all docs in the collection
 Can we pick off the docs with the K highest cosines?
 Let J = number of docs with nonzero cosines
 We seek the K best of these J
Introduction to Information Retrieval Sec. 7.1

Use heap for selecting top K


 A binary tree in which each node’s value > its children’s values
 Takes 2J operations to construct; then each of the K “winners” is read off in O(log J) steps (sketched below)
 For J = 1M and K = 100, this is about 10% of the cost of sorting
10
[Figure: example binary max-heap of cosine scores, with .9 at the root and .3, .8, .1, .2, .1 below]
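A minimal sketch of this selection step. Python's heapq is a min-heap, so we negate the scores to simulate the max-heap on the slide; heapify is the linear-time build behind the slide's "2J operations", and each pop costs O(log J).

```python
import heapq

def top_k(doc_scores, k):
    """doc_scores: list of (cosine, docID) pairs for the J candidate docs."""
    heap = [(-score, doc_id) for score, doc_id in doc_scores]
    heapq.heapify(heap)        # linear-time construction of the (negated) max-heap
    return [(-s, d) for s, d in (heapq.heappop(heap) for _ in range(min(k, len(heap))))]

scores = [(0.9, 1), (0.3, 4), (0.3, 6), (0.8, 2), (0.1, 9), (0.2, 5), (0.1, 7)]
print(top_k(scores, 2))        # [(0.9, 1), (0.8, 2)]
```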
Introduction to Information Retrieval

WAND scoring
 An instance of DAAT scoring
 The basic idea is reminiscent of branch and bound
 We maintain a running threshold score – e.g., the Kth highest score computed so far
 We prune away all docs whose cosine scores are guaranteed to be below the threshold
 We compute exact cosine scores only for the un-pruned docs

Broder et al., “Efficient Query Evaluation using a Two-Level Retrieval Process”
12
Introduction to Information Retrieval

Index structure for WAND


 Postings are ordered by docID
 Assume a special iterator on the postings of the form “go to the first docID greater than or equal to X” (a toy version is sketched below)
 Typical state: we have a “finger” at some docID in the postings of each query term
 Each finger moves only to the right, to larger docIDs
 Invariant – all docIDs lower than any finger have already been processed, meaning
 these docIDs have either been pruned away, or
 their cosine scores have been computed
13
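A toy version of such an iterator (the class and method names are mine; a real index would use skip pointers in compressed postings rather than binary search over a list):

```python
import bisect

class PostingsIterator:
    """A finger over one term's docID-sorted postings."""

    def __init__(self, doc_ids):
        self.doc_ids = doc_ids   # sorted list of docIDs for this term
        self.pos = 0             # the finger; it only ever moves right

    def current(self):
        """docID under the finger, or None if the postings are exhausted."""
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def seek(self, x):
        """Go to the first docID >= x; never moves the finger left."""
        self.pos = bisect.bisect_left(self.doc_ids, x, self.pos)
        return self.current()

it = PostingsIterator([3, 7, 11, 17, 29, 38, 57, 79])
print(it.seek(30))   # 38 – everything to its left counts as processed
```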
Introduction to Information Retrieval

Upper bounds
 At all times, for each query term t, we maintain an upper bound UBt on the score contribution of any doc to the right of the finger
 UBt = max (over docs remaining in t’s postings) of wt(doc) – a toy computation is sketched below

postings of t:  3  7  11  17  29  [38]  57  79      finger at 38 ⇒ UBt = wt(38)
As the finger moves right, UBt drops


14
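A minimal sketch of the bound. In practice, UBt is usually precomputed once as the maximum weight over the whole postings list and stored with the term (a finger-dependent bound can only be tighter); here it is recomputed from the finger purely for illustration, and the weights are made up.

```python
def upper_bound(term_weights, finger_pos):
    """Max wt(doc) over docs at or to the right of the finger."""
    return max(term_weights[finger_pos:], default=0.0)

# Illustrative weights aligned with t's postings [3, 7, 11, 17, 29, 38, 57, 79]:
w_t = [0.6, 0.4, 0.9, 0.3, 0.5, 0.5, 0.2, 0.1]
print(upper_bound(w_t, 5))   # 0.5 – and it can only drop as the finger moves right
```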
Introduction to Information Retrieval

Pivoting
 Query: catcher in the rye
 Let’s say the current finger positions and upper bounds are as below (threshold = 6.8)

term      finger    UB
catcher   273       UBcatcher = 2.3
rye       304       UBrye     = 1.8
in        589       UBin      = 3.3   ← pivot
the       762       UBthe     = 4.3

 With terms sorted by finger position, the running UB sums are 2.3, 4.1, 7.4, … – the pivot is the first finger (here “in”, at 589) where the running sum reaches the threshold: no doc before it can possibly score 6.8 (sketched below)
15
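A minimal sketch of pivot selection, using the slide's numbers (the tuple layout and names are mine): walk the terms in finger order, summing UBs until the running total could beat the threshold.

```python
def find_pivot(fingers, threshold):
    """fingers: list of (term, docID, ub), sorted by docID. Returns the pivot."""
    running_ub = 0.0
    for term, doc_id, ub in fingers:
        running_ub += ub
        if running_ub >= threshold:
            return term, doc_id   # no doc before this finger can reach the threshold
    return None                   # nothing left can make the top K

fingers = [("catcher", 273, 2.3), ("rye", 304, 1.8), ("in", 589, 3.3), ("the", 762, 4.3)]
print(find_pivot(fingers, 6.8))   # ('in', 589): 2.3 + 1.8 + 3.3 = 7.4 >= 6.8
```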
Introduction to Information Retrieval

Prune docs that have no hope


 Terms are sorted in order of finger position
 Move the catcher and rye fingers to 589 or to its right – their docs before 589 are hopeless: even with every query term contributing its UB, such docs cannot reach the threshold
Threshold = 6.8

term      finger        UB
catcher   273 → 589+    UBcatcher = 2.3   (docs before 589: hopeless)
rye       304 → 589+    UBrye     = 1.8   (docs before 589: hopeless)
in        589           UBin      = 3.3   ← pivot
the       762           UBthe     = 4.3

Update the UBs as the fingers move
16
Introduction to Information Retrieval

Compute 589’s score if need be


 If 589 is present in enough postings (here: catcher, rye, and in), compute its full cosine score – else move some fingers to the right of 589
 Pivot again … (one full step is sketched below)

catcher   589
rye       589
in        589
the       762
17
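Tying the pieces together, a simplified sketch of one WAND round. It reuses the PostingsIterator above; score_doc stands for the full cosine computation, and for simplicity every finger left of the pivot is advanced, whereas the original algorithm picks just one term to move.

```python
def wand_step(iters, ubs, threshold, score_doc):
    """One pivot round. iters: {term: PostingsIterator}; ubs: {term: UBt}."""
    fingers = sorted(
        (it.current(), t) for t, it in iters.items() if it.current() is not None
    )
    running_ub, pivot = 0.0, None
    for doc_id, term in fingers:
        running_ub += ubs[term]
        if running_ub >= threshold:
            pivot = doc_id
            break
    if pivot is None:
        return "done"                  # no remaining doc can beat the threshold
    if fingers[0][0] == pivot:
        return score_doc(pivot)        # all preceding fingers sit on the pivot doc
    for doc_id, term in fingers:
        if doc_id >= pivot:
            break
        iters[term].seek(pivot)        # skip the hopeless docs
    return "advanced"
```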
Introduction to Information Retrieval

WAND summary
 In tests, WAND leads to a 90+% reduction in score computation
 Better gains on longer queries
 Nothing we did was specific to cosine ranking
 We only need scoring to be additive by term
 WAND and its variants give us safe ranking
 It is possible to devise “careless” variants that are a bit faster but not safe (see the summary in Ding & Suel 2011)
 These ideas combine some of the non-safe scoring techniques we considered earlier

18
