Lect 13 - Text Ranking

The document discusses various methods for document ranking in information retrieval, focusing on binary vector representations and the limitations of overlap measures. It introduces the tf-idf weighting scheme and explores techniques for efficient top-k document retrieval, including champion lists and clustering. Additionally, it highlights the WAND technique for pruning document scores to optimize the retrieval process while ensuring accuracy in the results.

Document ranking

Text-based Ranking
(1st generation)

Doc as a binary vector
 Documents and queries are binary vectors X, Y in {0,1}^D (one component per dictionary term)

 Score: the overlap measure |X ∩ Y|, i.e. the number of shared terms

 What's wrong with this?

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony              1               1              0          0       0        1
Brutus              1               1              0          1       0        0
Caesar              1               1              0          1       1        1
Calpurnia           0               1              0          0       0        0
Cleopatra           1               0              0          0       0        0
mercy               1               0              1          1       1        1
worser              1               0              1          1       1        0
Normalization

 Dice coefficient (overlap wrt the average number of terms):
   2 |X ∩ Y| / (|X| + |Y|)
   NOT a metric: the associated distance does not satisfy the triangle inequality

 Jaccard coefficient (overlap wrt the possible terms):
   |X ∩ Y| / |X ∪ Y|
   OK: the associated distance does satisfy the triangle inequality
What's wrong with binary vectors?
Overlap matching doesn't consider:
 Term frequency within a document
   A doc that talks more about t should weight t more.

 Term scarcity in the collection
   "of" is far commoner than "baby bed", so matching it says little.

 Length of documents
   Scores should be normalized by document length.
A famous "weight": tf-idf

  w_{t,d} = tf_{t,d} * log(n / n_t)

where
  tf_{t,d} = number of occurrences of term t in doc d
  idf_t    = log(n / n_t)
  n_t      = number of docs containing term t
  n        = number of docs in the indexed collection
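As a minimal Python sketch of this weighting (illustrative names and toy corpus, not code from the slides):

```python
import math
from collections import Counter

def compute_tf_idf(docs):
    """docs: list of tokenized documents (lists of terms).
    Returns one {term: weight} dict per document,
    with weight w_{t,d} = tf_{t,d} * log(n / n_t)."""
    n = len(docs)
    # n_t = number of documents containing term t
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)                      # tf_{t,d}
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["antony", "brutus", "caesar"],
        ["caesar", "calpurnia", "caesar"],
        ["mercy", "worser", "antony"]]
print(compute_tf_idf(docs))
```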

tf-idf weights in the term-document matrix:

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            13.1             11.4           0.0        0.0     0.0      0.0
Brutus             3.0              8.3           0.0        1.0     0.0      0.0
Caesar             2.3              2.3           0.0        0.5     0.3      0.3
Calpurnia          0.0             11.2           0.0        0.0     0.0      0.0
Cleopatra         17.7              0.0           0.0        0.0     0.0      0.0
mercy              0.5              0.0           0.7        0.9     0.9      0.3
worser             1.2              0.0           0.6        0.6     0.6      0.0

Vector Space model
Sec. 6.3

Why distance is a bad idea
 Euclidean distance over-penalizes vectors of different lengths (a doc concatenated with itself ends up "far" from the original), and raw counts are easy to spam by repeating terms; so we rank by the angle between vectors instead.

An example
 [Figure: two document vectors v and w in the term space t1, t2, t3]

 cos(θ) = v · w / (||v|| * ||w||)

 Computational problem:
   #pages in .it ≈ a few billions
   #terms ≈ some millions
   #ops ≈ 10^15
   at 1 op/ns this is 10^15 ns ≈ 10^6 s, i.e. more than a week !!!!

            doc v   doc w
 term 1       2       4
 term 2       0       0
 term 3       3       1

 cos(θ) = (2*4 + 0*0 + 3*1) / (sqrt(2^2 + 3^2) * sqrt(4^2 + 1^2)) ≈ 0.74, i.e. an angle of about 42°
Sec. 6.3

cosine(query, document)

  cos(q, d) = (q · d) / (||q|| ||d||) = ( Σ_{i=1..D} q_i d_i ) / ( sqrt(Σ_{i=1..D} q_i^2) * sqrt(Σ_{i=1..D} d_i^2) )

 q · d is the dot product of the two vectors
 q_i is the tf-idf weight of term i in the query
 d_i is the tf-idf weight of term i in the document

 cos(q,d) is the cosine similarity of q and d ... or, equivalently, the cosine of the angle between q and d.
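As a quick sketch, the same cosine over sparse {term: weight} vectors in Python (illustrative names, not the slides' code):

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse tf-idf vectors,
    represented as {term: weight} dicts."""
    # dot product over the (small) intersection of the two term sets
    dot = sum(w * d[t] for t, w in q.items() if t in d)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

v = {"term1": 2, "term3": 3}
w = {"term1": 4, "term3": 1}
print(round(cosine(v, w), 2))   # ~0.74, matching the worked example above
```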
Sec. 7.1.2

Storage

  w_{t,d} = tf_{t,d} * log(n / n_t)

 For every term t, we keep in memory the length n_t of its posting list, so the IDF is implicitly available.

 For every docID d in the posting list of term t, we store its frequency tf_{t,d}, which is typically small and thus can be stored with unary or gamma codes.
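For concreteness, a minimal sketch of the Elias gamma code mentioned above (unary code of the offset length, followed by the offset, per the usual textbook definition; function names are illustrative):

```python
def unary(k):
    """Unary code of k: k one-bits followed by a zero."""
    return "1" * k + "0"

def gamma(x):
    """Elias gamma code of integer x >= 1:
    unary code of the offset length, followed by the offset
    (binary representation of x without its leading 1-bit)."""
    assert x >= 1
    offset = bin(x)[3:]            # binary of x, leading '1' dropped
    return unary(len(offset)) + offset

for x in [1, 2, 5, 13]:
    print(x, gamma(x))   # 1 -> '0', 2 -> '100', 5 -> '11001', 13 -> '1110101'
```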
Computing cosine score

 We could restrict the score computation to the docs in the intersection of the query terms' posting lists.
Vector spaces and other
operators
 Vector space OK for bag-of-words queries
 Clean metaphor for similar-document
queries
 Not a good combination with operators:
Boolean, wild-card, positional, proximity

 First generation of search engines


 Invented before “spamming” web search
Top-K documents

Approximate retrieval
Sec. 7.1.1

Speed-up top-k retrieval

 The costly step is the computation of the cos() scores
 Idea: find a set A of contenders, with k < |A| << N
   A does not necessarily contain all of the top-k docs, but it contains many of them
   Return the top-k docs in A, according to the score

 The same approach is also used for other (non-cosine) scoring functions
 We will look at several schemes following this approach
Sec. 7.1.2

How to select A’s docs


 Consider docs containing at least one query
term (obvious… as done before!).

 Take this further:


1. Only consider docs containing most query
terms
2. Only consider high-idf query terms
3. Champion lists: top scores
4. Fancy hits: for complex ranking functions
5. Clustering
Approach #1: Docs containing many query terms

 For multi-term queries, compute scores for


docs containing most query terms

 Say, at least q-1 out of q terms of the query


 Imposes a “soft AND” on queries seen on
web search engines (early Google)

 Easy to implement in postings traversal


Many query terms

Query: Antony Brutus Caesar Calpurnia (posting lists of docIDs):

 Antony    -> 3 4 8 16 32 64 128
 Brutus    -> 2 4 8 16 32 64 128
 Caesar    -> 1 2 3 5 8 13 21 34
 Calpurnia -> 13 16 32

With the "at least q-1 = 3 out of q = 4 terms" rule, scores are computed only for docs 8, 16 and 32.
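A minimal sketch of this candidate-selection step, assuming the posting lists are plain sorted lists of docIDs in memory (the names postings and min_terms are illustrative):

```python
from collections import Counter

def soft_and_candidates(postings, min_terms):
    """postings: dict term -> sorted list of docIDs.
    Return docIDs that appear in at least min_terms of the lists."""
    counts = Counter(doc for plist in postings.values() for doc in plist)
    return sorted(doc for doc, c in counts.items() if c >= min_terms)

postings = {
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}
print(soft_and_candidates(postings, len(postings) - 1))   # [8, 16, 32]
```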


Sec. 7.1.2

Approach #2: High-idf query terms only

 High IDF means a short posting list, i.e. a rare term

 Intuition: terms like "in" and "the" contribute little to the scores and so don't alter the rank-ordering much

 Only accumulate scores for the documents appearing in the posting lists of those (high-idf) terms
Approach #3: Champion Lists
 Preprocess: assign to each term its m best documents (those with the highest weight)

             Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            13.1             11.4           0.0        0.0     0.0      0.0
Brutus             3.0              8.3           0.0        1.0     0.0      0.0
Caesar             2.3              2.3           0.0        0.5     0.3      0.3
Calpurnia          0.0             11.2           0.0        0.0     0.0      0.0
Cleopatra         17.7              0.0           0.0        0.0     0.0      0.0
mercy              0.5              0.0           0.7        0.9     0.9      0.3
worser             1.2              0.0           0.6        0.6     0.6      0.0

 Search:
   If |Q| = q terms, merge their champion (preferred) lists: ≤ mq candidate answers.
   Compute cos between Q and these docs, and choose the top k.
 Need to pick m > k for this to work well empirically.
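A hedged sketch of both phases, reusing the cosine() helper sketched earlier; weights, doc_vectors and query_vector are assumed to be {term: weight} dicts (illustrative names, not the slides' code):

```python
import heapq

def build_champion_lists(weights, m):
    """weights: dict term -> dict docID -> tf-idf weight.
    Keep, for each term, the m docs with the highest weight."""
    return {t: set(doc for doc, _ in
                   heapq.nlargest(m, w.items(), key=lambda kv: kv[1]))
            for t, w in weights.items()}

def query_champions(query_terms, champions, doc_vectors, query_vector, k):
    """Merge the champion lists of the query terms (<= m*q candidates),
    score the candidates with cosine, return the top k (score, docID) pairs."""
    candidates = set()
    for t in query_terms:
        candidates |= champions.get(t, set())
    scored = [(cosine(query_vector, doc_vectors[d]), d) for d in candidates]
    return heapq.nlargest(k, scored)
```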
Approach #4: Fancy-hits heuristic
 Preprocess:
   Assign docIDs by decreasing PR (PageRank) weight, so that sorting by docID = ordering by decreasing PR weight
   Define FH(t) = the m docs for t with highest tf-idf weight
   Define IL(t) = the rest of t's posting list
   Idea: a document that scores high should be in FH or in the front of IL
 Search for a t-term query:
   First FH: compute the score of all docs in their FH, as with champion lists, and keep the top-k docs.
   Then IL: scan the ILs and check the common docs; compute their score and possibly insert them into the top-k.
   Stop when M docs have been checked or the PR score becomes smaller than some threshold.

 [Figure: for the query terms "Pisa" and "Torre", each posting list is split into FH = the top-m docs by tf-idf (tf-idf ≥ 10 and ≥ 20, respectively) and IL = the remaining docs (tf-idf < 10 and < 20), both by decreasing PR; the scan of the ILs has reached PR = x]

 If the score is the sum of the PR and tf-idf values, then:
   any next match has PR < x and tf-idf < 30 (= 10 + 20, the sum of the two IL tf-idf caps)
   so if x + 30 < the minimum score in the heap of the current top-k, the scan can stop.
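Below is a hedged Python sketch of this search, not the slides' own code: FH, IL, caps, pr and tfidf are assumed, illustrative inputs (caps[t] being the tf-idf cutoff of IL[t]), and the heap keeps the current top-k (score, docID) pairs.

```python
import heapq

def fancy_hits_search(query_terms, FH, IL, tfidf, pr, caps, k, M=10_000):
    """Hypothetical sketch of the fancy-hits search, assuming:
    - docIDs were assigned by decreasing PageRank, so pr(d) decreases as d grows;
    - FH[t] is the set of the m docs of t with highest tf-idf, IL[t] the rest;
    - caps[t] is the tf-idf cutoff of IL[t] (no doc in IL[t] exceeds it);
    - score(d) = pr(d) + sum of tf-idf contributions of the query terms."""
    def score(d):
        return pr(d) + sum(tfidf(t, d) for t in query_terms)

    # Phase 1: score all docs in the fancy hits, keep a size-k min-heap.
    fh_docs = set().union(*(FH[t] for t in query_terms))
    heap = heapq.nlargest(k, ((score(d), d) for d in fh_docs))
    heapq.heapify(heap)                 # heap[0][0] = current k-th best score

    # Phase 2: scan the ILs by increasing docID (= decreasing PageRank).
    il_cap = sum(caps[t] for t in query_terms)   # max tf-idf an IL doc can add
    checked = 0
    for d in sorted(set().union(*(IL[t] for t in query_terms))):
        if d in fh_docs:
            continue                    # already scored in phase 1
        if checked >= M:
            break                       # inspected enough docs
        if len(heap) == k and pr(d) + il_cap < heap[0][0]:
            break                       # no later doc can enter the top-k
        item = (score(d), d)
        if len(heap) < k:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)
        checked += 1
    return sorted(heap, reverse=True)
```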
Sec. 7.1.4

Modeling authority
 Assign a query-independent quality score in [0,1] to each document d
 Denote this by g(d)

 Thus, a quantity like the number of citations is scaled into [0,1]
Sec. 7.1.4
Champion lists in g(d)-ordering
 Can combine champion lists with g(d)-ordering

 Or, maintain for each term a champion list of the r > k docs with highest g(d) + tf-idf_{t,d}

 g(d) may be the PageRank

 Seek the top-k results from only the docs in these champion lists
Sec. 7.1.6

Approach #5: Clustering

 [Figure: docs grouped into clusters around leaders; the query is compared with the leaders, then with the followers of the closest leader]
Sec. 7.1.6

Cluster pruning: preprocessing

 Pick √N docs at random: call these leaders
 For every other doc, pre-compute its nearest leader
 Docs attached to a leader: its followers
 Likely: each leader has ~ √N followers
Sec. 7.1.6

Cluster pruning: query processing


 Process a query as follows:

 Given query Q, find its nearest


leader L.

 Seek K nearest docs from among


L’s followers.
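A hedged sketch of both phases under the assumptions above (√N random leaders, cosine() from the earlier sketch as the similarity; names are illustrative):

```python
import math
import random

def nearest_leader(doc_vec, leaders, vectors):
    return max(leaders, key=lambda L: cosine(doc_vec, vectors[L]))

def cluster_preprocess(vectors, seed=0):
    """vectors: dict docID -> {term: weight}. Pick ~sqrt(N) random leaders
    and attach every other doc to its nearest leader."""
    random.seed(seed)
    doc_ids = list(vectors)
    leaders = random.sample(doc_ids, max(1, int(math.sqrt(len(doc_ids)))))
    followers = {L: [] for L in leaders}
    for d in doc_ids:
        if d not in followers:          # d is not itself a leader
            followers[nearest_leader(vectors[d], leaders, vectors)].append(d)
    return leaders, followers

def cluster_query(q_vec, leaders, followers, vectors, k):
    """Find the nearest leader L, then the k nearest docs among L's followers."""
    L = nearest_leader(q_vec, leaders, vectors)
    candidates = [L] + followers[L]
    return sorted(candidates, key=lambda d: cosine(q_vec, vectors[d]),
                  reverse=True)[:k]
```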
Sec. 7.1.6

Why use random sampling


 Fast
 Leaders reflect data distribution
Sec. 7.1.6

General variants
 Each follower is still attached to its nearest leader.

 But now, given the query, find the b = 4 (say) nearest leaders and their followers; compute the scores for these docs and then take the top-k ones.

 Can recur on leader/follower construction.


Exact Top-K documents

Exact retrieval
Goal
 Given a query Q, find the exact top K docs
for Q, using some ranking function r

 Simplest Strategy:
1) Find all documents in the intersection
2) Compute score r(d) for all these documents d
3) Sort results by score
4) Return top K results
Background
 Score computation is a large fraction of the CPU
work on a query

Generally, we have a tight budget on latency (say,
100ms)

We can’t exhaustively score every document!

 Goal is to cut CPU usage for scoring, without


compromising on the quality of results

 Basic idea: avoid scoring docs that won’t make it


into the top K
The WAND technique
 It is a pruning method which maintains a heap over the scores of the current top-K documents
 There is a proof that the docIDs in the heap at the end of the process are the exact top-K
 Basic idea reminiscent of branch and bound
 We maintain a running threshold score = the K-th highest score computed so far
 We prune away all docs whose scores are guaranteed to be below the threshold
 We compute exact scores only for the un-pruned docs
Index structure for WAND
 Postings ordered by docID

 Assume a special iterator on the postings that can "go to the first docID > X"
   using skip pointers
   or using Elias-Fano compressed lists

 The "iterator" moves only to the right, towards larger docIDs
Score Functions
 We assume that:
 r(t,d) = score of t in d

 The score of the document d is the sum of the


scores of query terms: r(d) = r(t1,d) + … + r(tn,d)

 Also, for each query term t, there is some


upper-bound UB(t) such that, for all d,
 r(t,d) ≤ UB(t)
 These values are pre-computed and stored
Threshold
 We keep inductively a threshold θ such that for every document d within the top-K, it holds that r(d) ≥ θ
 θ can be initialized to 0
 θ is raised whenever the "worst" of the currently found top-K has a score above the threshold
The Algorithm
 Example query: catcher in the rye
 Consider a generic step in which each iterator is positioned somewhere in its posting list (current docIDs shown):

 rye     -> 304
 catcher -> 273
 the     -> 762
 in      -> 589
Sort Pointers
 Sort the pointers to the inverted lists by increasing current docID:

 catcher -> 273
 rye     -> 304
 in      -> 589
 the     -> 762
Find Pivot
 Find the "pivot": the first pointer in this order for which the running sum of the upper-bounds of the terms is at least equal to the threshold

 Threshold θ = 6.8

 catcher -> 273   UB_catcher = 2.3   (running sum 2.3)
 rye     -> 304   UB_rye     = 1.8   (running sum 4.1)
 in      -> 589   UB_in      = 3.3   (running sum 7.4 ≥ 6.8  →  pivot)
 the     -> 762   UB_the     = 4.3

 Pivot = "in", at docID 589
Prune docs that have no hope

 Threshold θ = 6.8

 catcher -> 273   UB_catcher = 2.3
 rye     -> 304   UB_rye     = 1.8
 in      -> 589   UB_in      = 3.3   (pivot)
 the     -> 762   UB_the     = 4.3

 Any not-yet-scored docID smaller than the pivot 589 can occur only in the catcher and rye lists, so its score is at most UB_catcher + UB_rye = 4.1 < 6.8: these docs are hopeless, and the catcher and rye iterators can be advanced to the first docID ≥ 589.
Compute pivot's score
 If docID 589 is present in enough posting lists (soft AND), compute its full score; else move the pointers to the right of 589
 If 589 is inserted in the current top-K, update the threshold!
 Advance and pivot again ...

 catcher -> 589
 rye     -> 589
 in      -> 589
 the     -> 762
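Putting the last few slides together, here is a hedged Python sketch of the WAND loop (illustrative names, not the slides' code; score(d) is assumed to return the full additive score of doc d, and each TermIterator carries the pre-computed UB(t) of its term):

```python
import bisect
import heapq

class TermIterator:
    """Posting-list iterator supporting 'go to the first docID >= X'."""
    def __init__(self, term, docids, ub):
        self.term, self.docids, self.ub, self.pos = term, docids, ub, 0

    def current(self):
        return self.docids[self.pos] if self.pos < len(self.docids) else float("inf")

    def next_geq(self, x):
        self.pos = bisect.bisect_left(self.docids, x, self.pos)

def wand_top_k(iterators, score, k):
    """iterators: list of TermIterator (one per query term);
    score(d): full additive score of doc d; k: number of results wanted."""
    heap = []                                            # min-heap of (score, docID)
    while True:
        iterators.sort(key=lambda it: it.current())      # sort by current docID
        theta = heap[0][0] if len(heap) == k else 0.0    # current threshold
        # find the pivot: first position where the UB prefix-sum reaches theta
        acc, pivot = 0.0, None
        for i, it in enumerate(iterators):
            acc += it.ub
            if acc >= theta:
                pivot = i
                break
        if pivot is None or iterators[pivot].current() == float("inf"):
            break                                        # no remaining doc can beat theta
        pivot_doc = iterators[pivot].current()
        if iterators[0].current() == pivot_doc:
            # all lists before the pivot are aligned on it: compute the full score
            s = score(pivot_doc)
            if len(heap) < k:
                heapq.heappush(heap, (s, pivot_doc))
            elif s > heap[0][0]:
                heapq.heappushpop(heap, (s, pivot_doc))
            for it in iterators:
                if it.current() == pivot_doc:
                    it.next_geq(pivot_doc + 1)
        else:
            # some list before the pivot lags behind: advance it up to the pivot
            iterators[0].next_geq(pivot_doc)
    return sorted(heap, reverse=True)
```

With the UBs of the example (2.3, 1.8, 3.3, 4.3) and θ = 6.8, the prefix sums 2.3, 4.1, 7.4 make the third list the pivot, as on the slides.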
WAND summary

 In tests, WAND leads to a 90+% reduction in score computations
 Better gains on longer queries

 WAND gives us safe ranking


Blocked WAND
 UB(t) was computed over the full posting list of t
 To improve this, we add the following:
   Partition each list into blocks
   Store for each block b the maximum score UB_b(t) among the docIDs stored in it
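A minimal sketch of this per-block bookkeeping, under the assumption that the per-doc term scores are available at build time (block size and names are illustrative):

```python
def build_block_maxima(postings, scores, block_size=64):
    """postings: sorted list of docIDs for a term t;
    scores[d]: the score contribution r(t, d) of that term for doc d.
    Returns one record per block with its docID range and local bound UB_b(t)."""
    blocks = []
    for start in range(0, len(postings), block_size):
        chunk = postings[start:start + block_size]
        blocks.append({
            "first_doc": chunk[0],
            "last_doc": chunk[-1],
            "ub": max(scores[d] for d in chunk),   # UB_b(t) for this block
        })
    return blocks
```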
The new algorithm: Block-Max WAND

Algorithm (2-level check)

 As in the previous WAND:
   p = the pivoting list, found via the threshold θ taken from the heap; let d be the pivoting docID in list(p)

 Move block-by-block in the lists 0..p-1 so as to reach the blocks that may contain d (their docID-ranges overlap with d)
   Sum the block upper-bounds UB_b of those blocks
   If the sum ≤ θ, then skip past the block whose right-end is the leftmost one; repeat from the beginning
   Otherwise compute score(d); if it is ≤ θ, then move the iterators to the first docIDs > d and repeat from the beginning
   Else insert d into the min-heap and re-evaluate θ
Document RE-ranking

Relevance feedback
Sec. 9.1

Relevance Feedback
 Relevance feedback: user feedback on
relevance of docs in initial set of results
 User issues a (short, simple) query
 The user marks some results as relevant or
non-relevant.
 The system computes a better representation
of the information need based on feedback.
 Relevance feedback can go through one or
more iterations.
Sec. 9.1.1

Rocchio (SMART)
 Used in practice:

   q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j

 D_r = set of known relevant doc vectors
 D_nr = set of known irrelevant doc vectors
 q_m = modified query vector; q_0 = original query vector; α, β, γ: weights (hand-chosen or set empirically)
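A minimal numpy sketch of the update above; the weights α=1.0, β=0.75, γ=0.15 and the clipping of negative components at 0 are common textbook choices, not prescribed by the slide:

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q0: original query vector; relevant / nonrelevant: lists of doc vectors.
    Returns the modified query q_m, with negative weights clipped to 0."""
    qm = alpha * q0
    if relevant:
        qm = qm + beta * np.mean(relevant, axis=0)
    if nonrelevant:
        qm = qm - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(qm, 0.0)

q0 = np.array([1.0, 0.0, 0.5])
rel = [np.array([0.9, 0.1, 0.4]), np.array([0.8, 0.0, 0.6])]
nrel = [np.array([0.0, 1.0, 0.0])]
print(rocchio(q0, rel, nrel))
```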

 New query moves toward relevant documents


Relevance Feedback:
Problems

 Users are often reluctant to provide explicit


feedback

 It’s often harder to understand why a


particular document was retrieved after
applying relevance feedback

 There is no clear evidence that relevance


feedback is the “best use” of the user’s time.
Sec. 9.1.6

Pseudo relevance feedback


 Pseudo-relevance feedback automates the
“manual” part of true relevance feedback.
 Retrieve a list of hits for the user’s query
 Assume that the top k are relevant.
 Do relevance feedback (e.g., Rocchio)

 Works very well on average


 But can go horribly wrong for some
queries.
 Several iterations can cause query drift.
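A hedged sketch of this loop, reusing the hypothetical rocchio() helper above; retrieve() is an assumed function returning docIDs ranked by score for a query vector:

```python
def pseudo_relevance_feedback(q0, retrieve, doc_vectors, k=10, iterations=1):
    """Retrieve, assume the top-k hits are relevant, re-run Rocchio, repeat.
    More iterations increase the risk of query drift."""
    q = q0
    for _ in range(iterations):
        hits = retrieve(q)                        # ranked list of docIDs
        relevant = [doc_vectors[d] for d in hits[:k]]
        q = rocchio(q, relevant, nonrelevant=[])  # no explicit negative feedback
    return q
```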
Sec. 9.2.2

Query Expansion
 In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight the terms of the query

 In query expansion, users give additional input (good / bad search term) on words or phrases
Sec. 9.2.2

How to augment the user query?


 Manual thesaurus (costly to generate)
 E.g. MedLine: physician, syn: doc, doctor, MD

 Global Analysis (static; all docs in collection)


 Automatically derived thesaurus

(co-occurrence statistics)
 Refinements based on query-log mining

Common on the web

 Local Analysis (dynamic)


 Analysis of documents in result set
Quality of a search engine

Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Is it good ?
 How fast does it index
 Number of documents/hour
 (Average document size)

 How fast does it search


 Latency as a function of index size

 Expressiveness of the query language


Measures for a search engine
 All of the preceding criteria are measurable

 The key measure: user happiness


…useless answers won’t make a user happy

 User groups for testing !!


General scenario

 [Figure: Venn diagram over the whole collection, showing the set of Retrieved docs and the set of Relevant docs]
Precision vs. Recall
 Precision: % of retrieved docs that are relevant [the "junk found" issue]
 Recall: % of relevant docs that are retrieved [the "info found" issue]

 [Figure: the same Venn diagram of collection, Retrieved and Relevant]
How to compute them
 Precision: fraction of retrieved docs that are relevant
 Recall: fraction of relevant docs that are retrieved

                 Relevant              Not Relevant
 Retrieved       tp (true positive)    fp (false positive)
 Not Retrieved   fn (false negative)   tn (true negative)

 Precision P = tp / (tp + fp)
 Recall    R = tp / (tp + fn)
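The two measures in code, as a small sketch over sets of docIDs (illustrative names):

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of docIDs. Returns (precision, recall)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)          # true positives
    fp = len(retrieved - relevant)          # false positives
    fn = len(relevant - retrieved)          # false negatives
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10}))   # (0.5, 0.4)
```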
Precision-Recall curve
 Measure precision at various levels of recall

 [Figure: precision (y-axis) plotted against recall (x-axis) at a few measured points]

A common picture

 [Figure: a typical precision-recall curve, with precision decreasing as recall increases]
F measure
 Combined measure (weighted harmonic mean):

   F = 1 / ( α (1/P) + (1 − α) (1/R) )

 People usually use the balanced F1 measure
   i.e., with α = 1/2, so that 1/F = 1/2 (1/P + 1/R), i.e. F1 = 2PR / (P + R)
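As a small sketch, the general F and the balanced F1 (illustrative function name):

```python
def f_measure(P, R, alpha=0.5):
    """Weighted harmonic mean of precision and recall; alpha = 0.5 gives F1."""
    if P == 0 or R == 0:
        return 0.0
    return 1.0 / (alpha / P + (1.0 - alpha) / R)

P, R = 0.5, 0.4
print(f_measure(P, R))            # F1 = 2*P*R/(P+R) = 0.444...
```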
