Lecture 6 - Scoring, Term Weighting, Vector Space Model - Part 2
Vasily Sidorov
Agenda
▪ Speeding up vector space ranking
▪ Putting together a complete search
system
▪ Will require learning about a number of
miscellaneous topics and heuristics
Computing cosine scores
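The worked algorithm that accompanied this slide is not preserved in this copy. Below is a minimal term-at-a-time sketch in Python of how cosine scores are typically accumulated; the data layout (postings as (docID, wf_t,d) pairs, precomputed document lengths) and all names are assumptions for illustration, not the slide's exact code.

    import heapq
    from collections import defaultdict

    def cosine_scores(query_weights, postings, doc_lengths, K):
        # query_weights[t] : weight of term t in the query (e.g., its idf) -- assumed
        # postings[t]      : list of (docID, wf_td) pairs for term t       -- assumed
        # doc_lengths[d]   : Euclidean length of d's weight vector         -- assumed
        scores = defaultdict(float)                  # one accumulator per candidate doc
        for t, w_tq in query_weights.items():
            for doc_id, wf_td in postings.get(t, []):
                scores[doc_id] += w_tq * wf_td       # accumulate the dot product
        for doc_id in scores:
            scores[doc_id] /= doc_lengths[doc_id]    # length-normalize the doc side
        return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])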
Efficient cosine ranking
▪ Find the K docs in the collection “nearest” to the
query, i.e., the K largest query-doc cosines
▪ Efficient ranking:
▪ Computing a single cosine efficiently
▪ Choosing the K largest cosine values efficiently
▪ Can we do this without computing all N cosines?
Use heap for selecting top K
▪ Binary tree in which each node’s value > the values
of children
▪ Takes 2J operations to construct, then each of K
“winners” read off in 2log J steps
▪ For J=1M, K=100, this is about 10% of the cost of
sorting
[Figure: a max-heap of cosine scores, root 1 with children .9 and .3, and leaves .3, .8, .1, .1]
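As a concrete illustration, the sketch below uses Python's heapq to keep a bounded min-heap of size K rather than heapifying all J scores; this is a common variant of the selection step described above, not the slide's exact construction, and the (docID, score) layout is an assumption.

    import heapq

    def top_k(doc_scores, K):
        # doc_scores: any iterable of (docID, score) pairs -- an assumed layout
        heap = []                                    # min-heap holding the best K so far
        for doc_id, score in doc_scores:
            if len(heap) < K:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:                 # beats the current K-th best
                heapq.heapreplace(heap, (score, doc_id))
        return sorted(heap, reverse=True)            # the K winners, best first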
Sec. 7.1.1
Bottlenecks
▪ Primary computational bottleneck in scoring: cosine
computation
▪ Can we avoid all this computation?
▪ Yes, but may sometimes get it wrong
▪ a doc not in the top K may creep into the list of K
output docs
▪ Is this such a bad thing?
Sec. 7.1.1
Generic approach
▪ Find a set A of contenders, with K < |A| << N
▪ A does not necessarily contain the top K, but has
many docs from among the top K
▪ Return the top K docs in A
▪ Think of A as pruning non-contenders
▪ The same approach is also used for other (non-cosine) scoring functions
▪ Will look at several schemes following this approach
Sec. 7.1.2
Index Elimination
▪ The basic cosine computation algorithm only
considers docs containing at least one query term
▪ Take this further:
▪ Only consider high-idf query terms
▪ Only consider docs containing many query terms
High-idf query terms only
▪ For a query such as “the catcher in the rye”
▪ Only accumulate scores from catcher and rye
▪ Intuition: in and the contribute little to the scores
and so don’t alter rank-ordering much
▪ Benefit:
▪ Postings of low-idf terms have many docs → these (many)
docs get eliminated from set A of contenders
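A minimal sketch of this pruning step, assuming per-term idf values are available; the threshold and names are illustrative, not prescribed by the slide.

    def prune_low_idf_terms(query_terms, idf, min_idf=2.0):
        # Keep only query terms whose idf clears the (assumed) threshold, so that
        # only their comparatively short postings lists are traversed.
        return [t for t in query_terms if idf.get(t, 0.0) >= min_idf]

    # e.g., for "the catcher in the rye" this would typically keep only
    # ["catcher", "rye"], dropping the low-idf terms "the" and "in".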
Docs containing many query terms
▪ Any doc with at least one query term is a candidate
for the top K output list
▪ For multi-term queries, only compute scores for docs
containing several of the query terms
▪ Say, at least 3 out of 4
▪ Imposes a “soft conjunction” on queries seen on web
search engines (early Google)
▪ Easy to implement in postings traversal
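One way to impose the "at least 3 out of 4" soft conjunction during postings traversal (a sketch; the threshold and the postings layout are assumptions):

    from collections import defaultdict

    def candidate_docs(query_terms, postings, min_terms=3):
        # Return only docs containing at least min_terms distinct query terms;
        # full cosine scores are then computed for this reduced candidate set.
        counts = defaultdict(int)
        for t in set(query_terms):                   # count each distinct term once
            for doc_id, _wf in postings.get(t, []):
                counts[doc_id] += 1
        return {d for d, c in counts.items() if c >= min_terms}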
Sec. 7.1.2
3 of 4 query terms
Antony:    3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus:    2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:    1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Calpurnia: 13 → 16 → 32
▪ Scores are computed only for docs 8, 16, and 32 (the only docs containing at least 3 of the 4 query terms)
Champion Lists
▪ Precompute for each dictionary term t, the r docs of
highest weight in t’s postings
▪ Call this the champion list for t
▪ (aka fancy list or top docs for t)
▪ Note that r has to be chosen at index build time
▪ Thus, it’s possible that r < K
▪ At query time, only compute scores for docs in the
champion list of some query term
▪ Pick the K top-scoring docs from amongst these
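A sketch of both halves, index-time construction and query-time candidate generation; r, the postings layout, and the function names are assumptions.

    import heapq

    def build_champion_lists(postings, r):
        # At index build time: for each term, keep the r docs of highest weight.
        return {t: [d for d, _w in heapq.nlargest(r, plist, key=lambda p: p[1])]
                for t, plist in postings.items()}

    def champion_candidates(query_terms, champions):
        # At query time: only docs appearing in some query term's champion list
        # are scored; the K best of these are returned.
        candidates = set()
        for t in query_terms:
            candidates.update(champions.get(t, []))
        return candidates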
Sec. 7.1.3
Exercises
▪ How do Champion Lists relate to Index Elimination?
Can they be used together?
▪ How can Champion Lists be implemented in an
inverted index?
▪ Note that the champion list has nothing to do with small
docIDs
Sec. 7.1.4
Modeling authority
▪ Assign to each document d a query-independent
quality score in [0,1]
▪ Denote this by g(d)
▪ Thus, a quantity like the number of citations is scaled
into [0,1]
▪ Exercise: suggest a formula for this.
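One possible answer to the exercise (an illustrative suggestion, not given on the slide): damp the raw citation count c(d) with a logarithm and normalize by the collection maximum, so the result always lands in [0,1]:

    g(d) = log(1 + c(d)) / log(1 + max over all d' of c(d'))

Any bounded, monotone transform would serve the same purpose; c(d) / (c(d) + k) for some constant k > 0 is another option.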
Net score
▪ Consider a simple total score combining cosine
relevance and authority
▪ net-score(q,d) = g(d) + cosine(q,d)
▪ Can use some other linear combination
▪ Indeed, any function of the two “signals” of user happiness
– more later
▪ Now we seek the top K docs by net score
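For example, a weighted linear combination with a tunable parameter λ (the parameter and function name are illustrative, not from the slide):

    def net_score(g_d, cos_qd, lam=0.5):
        # Blend static quality g(d) with query-dependent cosine(q, d);
        # lam trades authority off against relevance (an assumed knob).
        return lam * g_d + (1.0 - lam) * cos_qd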
Top K by net score – fast methods
▪ First idea: Order all postings by g(d)
▪ Key: this is a common ordering for all postings
▪ Thus, can concurrently traverse query terms’
postings for
▪ Postings intersection
▪ Cosine score computation
▪ Exercise: write pseudocode for cosine score
computation if postings are ordered by g(d)
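A possible answer to the exercise (a sketch under assumptions, not the intended solution): because every postings list shares the same g(d) ordering, the lists can be merged document-at-a-time, and a doc's net score is complete the moment it has been consumed from every list that contains it. Here postings[t] is assumed to hold (docID, wf_t,d) pairs sorted by decreasing g(d) with docID breaking ties, and doc_lengths and g are assumed lookups.

    import heapq

    def net_scores_gd_ordered(query_weights, postings, doc_lengths, g, K):
        order = lambda d: (-g[d], d)                 # the common postings order (assumed)
        pos = {t: 0 for t in query_weights if postings.get(t)}
        top = []                                     # min-heap of the best K (score, doc)
        while pos:
            # The next doc in the global order is the "smallest" current head.
            doc = min((postings[t][i][0] for t, i in pos.items()), key=order)
            cos = 0.0
            for t in list(pos):
                d, wf = postings[t][pos[t]]
                if d == doc:                         # consume this posting
                    cos += query_weights[t] * wf
                    if pos[t] + 1 < len(postings[t]):
                        pos[t] += 1
                    else:
                        del pos[t]
            score = g[doc] + cos / doc_lengths[doc]  # net-score(q,d) = g(d) + cosine(q,d)
            if len(top) < K:
                heapq.heappush(top, (score, doc))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, doc))
        return sorted(top, reverse=True)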
Impact-ordered postings
▪ We only want to compute scores for docs for which
wf_{t,d} (weighted, non-normalized tf_{t,d}) is high enough
▪ We sort each postings list by wf_{t,d}
▪ Now: not all postings in a common order!
▪ How do we compute scores in order to pick off top K?
▪ Two ideas follow
Sec. 7.1.5
1. Early termination
▪ When traversing t’s postings, stop early after either
▪ a fixed number of docs, say r
▪ wf_{t,d} drops below some threshold
▪ Take the union of the resulting sets of docs
▪ One from the postings of each query term
▪ Compute only the scores for docs in this union
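A sketch of idea 1, assuming each postings[t] is already sorted by decreasing wf_{t,d}; the cut-offs r and wf_min are illustrative values.

    def early_termination_candidates(query_terms, postings, r=100, wf_min=0.1):
        # Traverse each impact-ordered list, stop after r postings or once the
        # weight drops below wf_min, then take the union of the surviving docs.
        union = set()
        for t in query_terms:
            for i, (doc_id, wf) in enumerate(postings.get(t, [])):
                if i >= r or wf < wf_min:
                    break
                union.add(doc_id)
        return union                                 # full scores computed only for these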
Sec. 7.1.5
2. idf-ordered terms
▪ When considering the postings of query terms
▪ Look at them in order of decreasing idf
▪ High idf terms likely to contribute most to score
▪ As we update score contribution from each query
term
▪ Stop if doc scores relatively unchanged
▪ Can apply to cosine or some other net scores
Cluster pruning: preprocessing
▪ Pick √N docs at random: call these leaders
▪ For every other doc, pre-compute its nearest leader
▪ Docs attached to a leader: its followers
▪ Likely: each leader has ~ √N followers
Cluster pruning: query processing
▪ Process a query as follows:
▪ Given query Q, find its nearest leader L
▪ Seek K nearest docs from among L’s
followers
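A sketch of both phases under assumptions: documents are available as vectors, cosine(u, v) is a provided similarity function, and √N leaders are drawn uniformly at random as on the preprocessing slide.

    import math, random
    from collections import defaultdict

    def build_clusters(doc_vectors, cosine):
        # Preprocessing: pick ~sqrt(N) random leaders and attach every other
        # doc to its nearest leader (its "followers").
        doc_ids = list(doc_vectors)
        leaders = random.sample(doc_ids, max(1, int(math.sqrt(len(doc_ids)))))
        followers = defaultdict(list)
        for d in doc_ids:
            if d in leaders:
                continue
            nearest = max(leaders, key=lambda L: cosine(doc_vectors[d], doc_vectors[L]))
            followers[nearest].append(d)
        return leaders, followers

    def cluster_pruned_topk(q_vec, doc_vectors, leaders, followers, cosine, K):
        # Query time: route the query to its nearest leader, then rank only that
        # leader and its followers instead of the whole collection.
        L = max(leaders, key=lambda l: cosine(q_vec, doc_vectors[l]))
        candidates = [L] + followers.get(L, [])
        ranked = sorted(candidates, key=lambda d: cosine(q_vec, doc_vectors[d]), reverse=True)
        return ranked[:K]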
Sec. 7.1.6
Visualization
[Figure: leaders and their followers, with a query point routed to its nearest leader]
Sec. 7.1.6
General variants
▪ Have each follower attached to b1=3 (say) nearest
leaders
▪ From query, find b2=4 (say) nearest leaders and their
followers
Sec. 7.1.6
Exercises
▪ To find the nearest leader in step 1, how many cosine
computations do we do?
▪ Why did we have √N in the first place?
▪ What is the effect of the constants b1, b2 on the
previous slide?
▪ Devise an example where this is likely to fail – i.e., we
miss one of the K nearest docs
▪ Likely under random sampling
Sec. 6.1
Fields
▪ We sometimes wish to search by document metadata
▪ E.g., find docs authored by William Shakespeare in the
year 1601, containing alas poor Yorick
▪ Year = 1601 is an example of a field
▪ Also, author last name = shakespeare, etc.
▪ Field or parametric index: postings for each field
value
▪ Sometimes build range trees (e.g., for dates)
▪ Field query typically treated as conjunction
▪ (doc must be authored by shakespeare)
Sec. 6.1
Zone
▪ A zone is a region of the doc that can contain an
arbitrary amount of text, e.g.,
▪ Title
▪ Abstract
▪ References …
▪ Build inverted indexes on zones as well to permit
querying
▪ E.g., “find docs with merchant in the title zone and
matching the query gentle rain”
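A minimal sketch of one possible encoding, keying postings by (term, zone) pairs; the layout, the zone names, and the simplification of the free-text part "gentle rain" to a conjunctive body match are all assumptions.

    def zone_postings(index, term, zone):
        # index maps (term, zone) -> sorted list of docIDs (assumed layout)
        return set(index.get((term, zone), []))

    def merchant_in_title_query(index):
        # Docs with "merchant" in the title zone that also contain both
        # "gentle" and "rain" in the body zone (the free-text query is
        # reduced to a conjunction here for simplicity).
        result = zone_postings(index, "merchant", "title")
        for t in ("gentle", "rain"):
            result &= zone_postings(index, t, "body")
        return result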
Sec. 7.2.1
Tiered indexes
▪ Break postings up into a hierarchy of lists
▪ Most important
▪ …
▪ Least important
▪ Can be done by g(d) or another measure
▪ Inverted index thus broken up into tiers of decreasing
importance
▪ At query time use top tier unless it fails to yield K
docs
▪ If so drop to lower tiers
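A sketch of the query-time fallback; tiers is assumed to be a list of per-tier indexes ordered from most to least important, and score_docs an assumed helper that ranks docs within one tier.

    def tiered_topk(query, tiers, score_docs, K):
        results, seen = [], set()
        for tier_index in tiers:                     # most important tier first
            for doc_id, score in score_docs(query, tier_index):
                if doc_id not in seen:
                    seen.add(doc_id)
                    results.append((doc_id, score))
            if len(results) >= K:
                break                                # this tier already yields K docs
        return sorted(results, key=lambda p: p[1], reverse=True)[:K]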
Example tiered index
[Figure: each term's postings broken into tiers of decreasing importance]
Query term proximity
▪ Free text queries: just a set of terms typed into the
query box – common on the web
▪ Users prefer docs in which query terms occur within
close proximity of each other
▪ Let w be the smallest window in a doc containing all
query terms, e.g.,
▪ For the query strained mercy the smallest window in
the doc The quality of mercy is not strained is 4
(words)
▪ Would like scoring function to take this into account
– how?
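A sketch of computing w with a single sliding-window pass over the document's tokens (the tokenization and case handling are assumptions):

    from collections import defaultdict

    def smallest_window(doc_tokens, query_terms):
        # Length (in words) of the smallest window of doc_tokens containing
        # every query term, or None if some term does not occur at all.
        needed = set(query_terms)
        if not needed.issubset(set(doc_tokens)):
            return None
        counts, have = defaultdict(int), 0
        best, left = len(doc_tokens), 0
        for right, tok in enumerate(doc_tokens):
            if tok in needed:
                counts[tok] += 1
                if counts[tok] == 1:
                    have += 1
            while have == len(needed):               # all terms inside: shrink from the left
                best = min(best, right - left + 1)
                t = doc_tokens[left]
                if t in needed:
                    counts[t] -= 1
                    if counts[t] == 0:
                        have -= 1
                left += 1
        return best

    # smallest_window("The quality of mercy is not strained".split(),
    #                 ["strained", "mercy"])  ->  4, matching the example above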
Sec. 7.2.3
Query parsers
▪ Free text query from user may in fact spawn one or
more queries to the indexes, e.g., query rising
interest rates
▪ Run the query as a phrase query
▪ If <K docs contain the phrase rising interest rates, run the
two phrase queries rising interest and interest rates
▪ If we still have <K docs, run the vector space query rising
interest rates
▪ Rank matching docs by vector space scoring
▪ This sequence is issued by a query parser
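A sketch of the cascade just described; run_phrase_query and run_vector_query are assumed helpers returning sets of matching docIDs, and the final ranking by vector space scoring is left out.

    def parse_and_run(query, K, run_phrase_query, run_vector_query):
        terms = query.split()
        results = run_phrase_query(terms)            # e.g., "rising interest rates"
        if len(results) < K and len(terms) >= 2:
            for i in range(len(terms) - 1):          # the overlapping two-word phrases
                results |= run_phrase_query(terms[i:i + 2])
        if len(results) < K:
            results |= run_vector_query(terms)       # free-text vector space fallback
        return results                               # rank these by vector space scoring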
Sec. 7.2.3
Aggregate scores
▪ We’ve seen that score functions can combine cosine,
static quality, proximity, etc.
▪ How do we know the best combination?
▪ Some applications – expert-tuned
▪ Increasingly common: machine-learned
Sec. 7.2.4
A complete search system
[Figure: tiered inverted positional index with k-gram indexes, metadata in zone and field indexes, inexact top-K retrieval, a document cache, scoring parameters, and machine-learned ranking trained on a training set]
Ch. 6
▪ http://www.miislita.com/information-retrieval-
tutorial/cosine-similarity-tutorial.html
▪ Term weighting and cosine similarity tutorial for SEO folk!
44