Chap5 Query Processing
[Figure labels: List<Posting>(), It.append(Posting(n)), Write to a file]
Architecture
[Figure: search engine architecture diagram. Labels: Text Acquisition, Preprocessing Steps, Index Creation, Index, Document Store, Web Pages, Local, UI, Ranking, Log, Querying Process]
Query Processing
• Document-at-a-time
– Calculates complete scores for documents by
processing all term lists, one document at a time
• Term-at-a-time
– Accumulates scores for documents by processing
term lists one at a time
• Both approaches have optimization
techniques that significantly reduce the time
required to generate scores
Document-At-A-Time
Pseudocode Function Descriptions
• getCurrentDocument()
– Returns the document number of the current posting of the inverted
list.
• skipForwardToDocument(d)
– Moves forward in the inverted list until getCurrentDocument() >= d.
This function may read to the end of the list.
• movePastDocument(d)
– Moves forward in the inverted list until getCurrentDocument() > d.
• moveToNextDocument()
– Moves to the next document in the list. Equivalent to
movePastDocument(getCurrentDocument()).
• getNextAccumulator(d)
– Returns the first document number d' >= d that already has an
accumulator.
• removeAccumulatorsBetween(a, b)
– Removes all accumulators for document numbers between a and b.
The accumulator A_d will be removed iff a < d < b.
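The function descriptions above can be sketched in Python. This is a minimal illustration, not the slides' own pseudocode: the class names InvertedList and AccumulatorTable, the in-memory (doc_number, value) postings, and the use of binary search in place of on-disk skipping are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


class InvertedList:
    """Cursor over postings sorted by document number (hypothetical sketch)."""

    def __init__(self, postings):
        # postings: list of (doc_number, value) pairs, sorted by doc_number
        self.docs = [d for d, _ in postings]
        self.values = [v for _, v in postings]
        self.pos = 0

    def finished(self):
        return self.pos >= len(self.docs)

    def get_current_document(self):
        # Document number of the current posting, or None at the end of the list
        return None if self.finished() else self.docs[self.pos]

    def skip_forward_to_document(self, d):
        # Advance until the current document number is >= d (may reach the end)
        self.pos = bisect_left(self.docs, d, lo=self.pos)

    def move_past_document(self, d):
        # Advance until the current document number is > d
        self.pos = bisect_right(self.docs, d, lo=self.pos)

    def move_to_next_document(self):
        if not self.finished():
            self.move_past_document(self.get_current_document())


class AccumulatorTable:
    """Partial scores keyed by document number (hypothetical sketch)."""

    def __init__(self):
        self.scores = {}

    def get_next_accumulator(self, d):
        # Smallest document number d' >= d that already has an accumulator
        candidates = [doc for doc in self.scores if doc >= d]
        return min(candidates) if candidates else None

    def remove_accumulators_between(self, a, b):
        # Remove A_d for every d with a < d < b (both bounds exclusive)
        for doc in [doc for doc in self.scores if a < doc < b]:
            del self.scores[doc]
```

Neither cursor method ever moves backward, because the search starts at the current position.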
Document-At-A-Time
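A minimal Python sketch of document-at-a-time scoring. It assumes a hypothetical in-memory index (a dict mapping each term to a list of (doc, term_frequency) pairs sorted by doc) and uses raw term frequency as a placeholder feature function; a real system would read compressed lists from disk and use a proper ranking function.

```python
import heapq


def document_at_a_time(query_terms, index, k):
    """Return the top-k (doc, score) pairs, scoring one document at a time."""
    lists = [index[t] for t in query_terms if t in index]
    positions = [0] * len(lists)   # one cursor per inverted list
    top_k = []                     # min-heap of (score, doc)
    while True:
        # The next document to score is the smallest unprocessed doc number
        current = [lst[p][0] for lst, p in zip(lists, positions) if p < len(lst)]
        if not current:
            break
        d = min(current)
        score = 0
        for i, lst in enumerate(lists):
            p = positions[i]
            if p < len(lst) and lst[p][0] == d:
                score += lst[p][1]      # placeholder feature: term frequency
                positions[i] = p + 1    # this list is done with document d
        # The score for d is complete here, so only k entries need to be kept
        heapq.heappush(top_k, (score, d))
        if len(top_k) > k:
            heapq.heappop(top_k)
    ranked = sorted(top_k, key=lambda sd: (-sd[0], sd[1]))
    return [(d, s) for s, d in ranked]
```

Because each document's score is final before the next document is touched, memory use is bounded by k rather than by the number of matching documents.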
Term-At-A-Time
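A minimal Python sketch of term-at-a-time scoring, under the same assumptions as a toy in-memory setting: the index is a hypothetical dict mapping each term to a list of (doc, term_frequency) pairs, and raw term frequency stands in for the feature function.

```python
import heapq
from collections import defaultdict


def term_at_a_time(query_terms, index, k):
    """Return the top-k (doc, score) pairs, processing one term list at a time."""
    accumulators = defaultdict(float)   # partial score per document
    for term in query_terms:
        # Read this term's whole inverted list before moving to the next term
        for doc, tf in index.get(term, []):
            accumulators[doc] += tf     # placeholder feature: term frequency
    # Scores are only complete after the last list has been processed
    return heapq.nlargest(
        k, accumulators.items(), key=lambda item: (item[1], -item[0])
    )
```

Each list is read sequentially from start to end, but the accumulator table can grow to one entry per matching document.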
Optimization Techniques
• Term-at-a-time uses more memory for
accumulators, but accesses disk more
efficiently
• Two classes of optimization
– Read less data from inverted lists
• e.g., skip lists
• better for simple feature functions
– Calculate scores for fewer documents
• e.g., conjunctive processing
• better for complex feature functions
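To illustrate the "read less data" class, here is a simplified sketch of skip pointers over a sorted list of document numbers. The function names and the (doc_number, position) pointer layout are assumptions for the example; real skip lists are stored alongside compressed postings.

```python
def build_skip_pointers(docs, stride):
    """Record (doc_number, position) for every stride-th posting."""
    return [(docs[i], i) for i in range(0, len(docs), stride)]


def skip_forward_to_document(docs, skips, d):
    """Return (index of first posting with doc >= d, postings scanned linearly)."""
    start = 0
    # Follow skip pointers for as long as they do not overshoot d
    for doc, pos in skips:
        if doc > d:
            break
        start = pos
    # Finish with a short linear scan from the last usable pointer
    i = start
    while i < len(docs) and docs[i] < d:
        i += 1
    return i, i - start
```

With 50 postings and a stride of 8, finding document 71 scans 4 postings linearly instead of 36.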
Conjunctive Term-at-a-Time
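A minimal sketch of conjunctive term-at-a-time processing, assuming a hypothetical in-memory index of (doc, term_frequency) lists and term frequency as a placeholder feature. Only the first term creates accumulators; every later term can only update or remove them, and a binary search stands in for skipForwardToDocument.

```python
from bisect import bisect_left


def conjunctive_term_at_a_time(query_terms, index, k):
    """Top-k (doc, score) pairs among documents containing every query term."""
    accumulators = {}
    for i, term in enumerate(query_terms):
        postings = index.get(term, [])
        if i == 0:
            # Only the first term is allowed to create accumulators
            accumulators = {doc: tf for doc, tf in postings}
            continue
        docs = [doc for doc, _ in postings]
        surviving = {}
        pos = 0
        for d in sorted(accumulators):
            # Jump straight to the next accumulator's document number
            pos = bisect_left(docs, d, lo=pos)
            if pos < len(docs) and docs[pos] == d:
                surviving[d] = accumulators[d] + postings[pos][1]
        # Accumulators for documents missing this term are dropped
        accumulators = surviving
    ranked = sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]
```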
Conjunctive Document-at-a-Time
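A minimal sketch of conjunctive document-at-a-time processing, again assuming a hypothetical in-memory index of (doc, term_frequency) lists and term frequency as a placeholder feature. Every list is skipped forward to the largest current document number, and a document is scored only when all lists land on it.

```python
from bisect import bisect_left


def conjunctive_document_at_a_time(query_terms, index, k):
    """Top-k (doc, score) pairs among documents containing every query term."""
    lists = [index.get(t, []) for t in query_terms]
    if not lists:
        return []
    docs = [[d for d, _ in lst] for lst in lists]
    positions = [0] * len(lists)
    results = []
    while all(p < len(ds) for p, ds in zip(positions, docs)):
        # No document smaller than the largest current one can match all terms
        target = max(ds[p] for p, ds in zip(positions, docs))
        positions = [bisect_left(ds, target, lo=p) for p, ds in zip(positions, docs)]
        if any(p >= len(ds) for p, ds in zip(positions, docs)):
            break
        if all(ds[p] == target for p, ds in zip(positions, docs)):
            # Every list contains target, so it is scored and all cursors advance
            score = sum(lst[p][1] for p, lst in zip(positions, lists))
            results.append((target, score))
            positions = [p + 1 for p in positions]
    results.sort(key=lambda item: (-item[1], item[0]))
    return results[:k]
```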
Threshold Methods
• Threshold methods use the number of top-ranked
documents needed (k) to optimize query
processing
– for most applications, k is small
• For any query, there is a minimum score that each
document needs to reach before it can be shown
to the user
– this is the score of the kth-highest scoring document
– gives threshold τ
– optimization methods use an estimate τ′ ≤ τ to ignore
documents that cannot score above it
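A small sketch of how a running estimate τ′ can be maintained, assuming τ′ is taken to be the kth-highest score seen so far (and negative infinity until k scores have been seen). The function name running_threshold is hypothetical.

```python
import heapq


def running_threshold(scores, k):
    """Return tau' after each score: the kth-highest score seen so far."""
    heap = []        # min-heap holding the k largest scores seen so far
    estimates = []
    for s in scores:
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)
        # Until k scores exist, nothing can be ruled out
        estimates.append(heap[0] if len(heap) == k else float("-inf"))
    return estimates
```

For scores 2, 5, 1, 4, 3 and k = 2 the estimates are -inf, 2, 2, 4, 4: τ′ never decreases and ends at the true threshold τ = 4.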
Threshold Methods
• For document-at-a-time processing, use the score
of the lowest-ranked document in the top k so far for τ′
– for term-at-a-time, have to use the kth-largest score in
the accumulator table
• MaxScore method compares the maximum
score that remaining documents could have to
τ′
– safe optimization: the ranking is the same as it
would be without the optimization
MaxScore Example
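A simplified sketch of the MaxScore idea, not the slides' own example. It assumes a hypothetical index mapping each term to a dict of {doc: partial_score}, and it checks an upper bound per document rather than skipping inside on-disk lists. A document is fully scored only if the sum of the maximum scores of the terms it contains could exceed τ′.

```python
import heapq


def max_score_top_k(query_terms, index, k):
    """Return (top-k (doc, score) pairs, number of documents fully scored)."""
    terms = [t for t in query_terms if t in index]
    # mu[t]: largest partial score any document receives from term t
    mu = {t: max(index[t].values()) for t in terms}
    candidates = sorted(set().union(*(index[t] for t in terms)))
    top_k = []          # min-heap of (score, doc); top_k[0][0] is tau' once full
    fully_scored = 0
    for d in candidates:
        tau_prime = top_k[0][0] if len(top_k) == k else float("-inf")
        # Best score d could possibly reach, from the lists that contain it
        upper_bound = sum(mu[t] for t in terms if d in index[t])
        if upper_bound <= tau_prime:
            continue    # d cannot enter the top k, so it is never scored
        score = sum(index[t][d] for t in terms if d in index[t])
        fully_scored += 1
        if len(top_k) < k:
            heapq.heappush(top_k, (score, d))
        elif score > tau_prime:
            heapq.heapreplace(top_k, (score, d))
    ranked = sorted(top_k, key=lambda sd: (-sd[0], sd[1]))
    return [(d, s) for s, d in ranked], fully_scored
```

With a rare, high-scoring term and a common, low-scoring term, documents that contain only the common term are skipped once τ′ reaches that term's maximum score, and the top-k result is unchanged.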