0% found this document useful (0 votes)
4 views17 pages

Chap5 Query Processing

The document discusses the architecture and processes involved in information retrieval, particularly focusing on search engine indexing and query processing techniques. It outlines two main approaches for scoring documents: Document-at-a-Time and Term-at-a-Time, along with various optimization techniques to enhance performance. Additionally, it covers threshold methods for optimizing query processing by determining the minimum score required for documents to be displayed to users.

Uploaded by

hihifi1326
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views17 pages

Chap5 Query Processing

The document discusses the architecture and processes involved in information retrieval, particularly focusing on search engine indexing and query processing techniques. It outlines two main approaches for scoring documents: Document-at-a-Time and Term-at-a-Time, along with various optimization techniques to enhance performance. Additionally, it covers threshold methods for optimizing query processing by determining the minimum score required for documents to be displayed to users.

Uploaded by

hihifi1326
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008


With changes by Crista Lopes
Simple Inverted
Index
Index Construction
• Simple in-memory indexer

List<Posting>()

It.append(Posting(n))

Write to a file
Architecture
Index Creation Index Ranking
Log

Querying Process
Preprocessing Steps

Text Transformation Evaluation

Local
Text Acquisition Document UI
Store

Web Pages
Query Processing
• Document-at-a-time
– Calculates complete scores for documents by
processing all term lists, one document at a time
• Term-at-a-time
– Accumulates scores for documents by processing
term lists one at a time
• Both approaches have optimization
techniques that significantly reduce time
required to generate scores
Document-At-A-Time
Pseudocode Function Descriptions
• getCurrentDocument()
– Returns the document number of the current posting of the inverted
list.
• skipForwardToDocument(d)
– Moves forward in the inverted list until getCurrentDocument() <= d.
This function may read to the end of the list.
• movePastDocument(d)
– Moves forward in the inverted list until getCurrentDocument() < d.
• moveToNextDocument()
– Moves to the next document in the list. Equivalent to
movePastDocument(getCurrentDocument()).
• getNextAccumulator(d)
– returns the first document number d' >= d that has already has an
accumulator.
• removeAccumulatorsBetween(a, b)
– Removes all accumulators for documents numbers between a and b.
Ad will be removed iff a < d < b.
Document-At-A-Time
Term-At-A-Time
Term-At-A-Time
Optimization Techniques
• Term-at-a-time uses more memory for
accumulators, but accesses disk more
efficiently
• Two classes of optimization
– Read less data from inverted lists
• e.g., skip lists
• better for simple feature functions
– Calculate scores for fewer documents
• e.g., conjunctive processing
• better for complex feature functions
Conjunctive
Term-at-a-Time
Conjunctive
Document-at-a-Time
Threshold Methods
• Threshold methods use number of top-ranked
documents needed (k) to optimize query
processing
– for most applications, k is small
• For any query, there is a minimum score that each
document needs to reach before it can be shown
to the user
– score of the kth-highest scoring document
– gives threshold τ
– optimization methods estimate τ′ to ignore
documents
Threshold Methods
• For document-at-a-time processing, use score
of lowest-ranked document so far for τ′
– for term-at-a-time, have to use kth-largest score in
the accumulator table
• MaxScore method compares the maximum
score that remaining documents could have to
τ′
– safe optimization in that ranking will be the same
without optimization
MaxScore Example

• Indexer computes μtree


– maximum score for any document containing just “tree”
• Assume k =3, τ′ is lowest score after first three docs
• Likely that τ ′ > μtree
– τ ′ is the score of a document that contains both query
terms
• Can safely skip over all gray postings
Other Approaches
• Early termination of query processing
– ignore high-frequency word lists in term-at-a-time
– ignore documents at end of lists in doc-at-a-time
– unsafe optimization
• List ordering
– order inverted lists by quality metric (e.g.,
PageRank) or by partial score
– makes unsafe (and fast) optimizations more likely
to produce good documents

You might also like