Chapter 2
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar and
Calpurnia (complemented) and bitwise AND them.
– 110100 AND
– 110111 AND
– 101111 =
– 100100
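The bitwise AND above can be reproduced in a few lines. This is a minimal sketch that assumes each vector is read left to right, with the leftmost bit standing for the first document; the variable names are illustrative.

```python
# Term incidence vectors over six documents, copied from the slide.
# Assumption: the leftmost bit corresponds to document 1.
N_DOCS = 6
brutus = int("110100", 2)
caesar = int("110111", 2)
calpurnia_complemented = int("101111", 2)  # already complemented (NOT Calpurnia)

# Bitwise AND of the three vectors answers the Boolean query.
result = brutus & caesar & calpurnia_complemented
bits = format(result, f"0{N_DOCS}b")
print(bits)  # 100100

# Positions of the 1 bits identify the matching documents (1-based).
matches = [i + 1 for i, b in enumerate(bits) if b == "1"]
print(matches)  # [1, 4]
```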
Sec. 1.2
Inverted index
• For each term t, we must store a list of all documents that
contain t.
– Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?
Brutus 1 2 4 11 31 45 173 174
Caesar 1 2 4 5 6 16 57 132
Calpurnia 2 31 54 101
Inverted index
• We need variable-size postings lists
– On disk, a continuous run of postings is normal and best
– In memory, can use linked lists or variable length arrays
• Some tradeoffs in size/ease of insertion
[Figure: dictionary of terms, each pointing to its postings list; a single entry in a list is a posting.]
Sorted by docID (more later on why).
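One commonly cited benefit of keeping postings sorted by docID is that two lists can be intersected in a single linear pass. The sketch below assumes a plain Python dict as the dictionary and Python lists as the variable-size postings lists; the function name `intersect` is illustrative.

```python
from typing import Dict, List

# Dictionary: term -> postings list of docIDs in increasing order
# (the three lists shown on the earlier slide).
index: Dict[str, List[int]] = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def intersect(p1: List[int], p2: List[int]) -> List[int]:
    """Merge two docID-sorted lists in O(len(p1) + len(p2)) steps."""
    i = j = 0
    out: List[int] = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return out

brutus_and_caesar = intersect(index["Brutus"], index["Caesar"])
brutus_and_calpurnia = intersect(index["Brutus"], index["Calpurnia"])
print(brutus_and_caesar)     # [1, 2, 4]
print(brutus_and_calpurnia)  # [2, 31]
```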
Sec. 1.2
Tokenizer
Token stream: Friends Romans Countrymen
Linguistic modules
Indexer
Inverted index:
friend 2 4
roman 1 2
countryman 13 16
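The pipeline in the figure (tokenizer, linguistic modules, indexer) might be sketched as follows. The lookup table standing in for the linguistic modules and the document texts are hypothetical, chosen only so that the output reproduces the postings shown in the figure; a real system would use a stemmer or lemmatizer.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the linguistic modules: lowercase, then a tiny
# lookup table instead of a real stemmer or lemmatizer.
NORMALIZE = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Split text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Map a token to its normalized dictionary term."""
    lowered = token.lower()
    return NORMALIZE.get(lowered, lowered)

def build_index(docs):
    """Indexer: map each term to the sorted list of docIDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Made-up document texts keyed by docID.
docs = {1: "Romans", 2: "Friends, Romans", 4: "Friends",
        13: "Countrymen", 16: "countrymen"}
inverted = build_index(docs)
print(inverted["friend"], inverted["roman"], inverted["countryman"])
```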
User Interaction
• Query input
– Provides interface and parser for query language
– Most web queries are very simple; other applications may use forms
– Query language used to describe more complex queries and results
of query transformation
• e.g., Boolean queries, Indri and Galago query languages
• similar to SQL language used in database applications
• IR query languages also allow content and structure specifications, but focus
on content
User Interaction
• Query transformation
– Improves initial query, both before and after initial search
– Includes text transformation techniques used for documents
– Spell checking and query suggestion provide alternatives to original
query
– Query expansion and relevance feedback modify the original query
with additional terms
User Interaction
• Results output
– Constructs the display of ranked documents for a query
– Generates snippets to show how queries match documents
– Highlights important words and passages
– Retrieves appropriate advertising in many applications
– May provide clustering and other visualization tools
Sec. 19.2.2
Simplest forms
• First generation engines relied heavily on tf/idf
– The top-ranked pages for the query maui resort were the ones containing the most
occurrences of maui and resort
• SEOs responded with dense repetitions of chosen terms
– e.g., maui resort maui resort maui resort
– Often, the repetitions would be in the same color as the background of the web page
• Repeated terms got indexed by crawlers
• But not visible to humans on browsers
Meta-Tags =
“… London hotels, hotel, holiday inn, hilton,
discount, booking, reservation, sex, mp3, britney
spears, viagra, …”
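Why repetition worked can be seen with a deliberately crude scorer. The sketch below assumes a score equal to the raw count of query terms in the page, which is a simplification of tf/idf (no idf, no length normalization); the page texts are made up.

```python
from collections import Counter

def tf_score(query, text):
    """Sum of raw term frequencies of the query terms in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())

query = "maui resort"
honest = "our maui resort offers ocean views and a quiet beach"  # made-up page
stuffed = "maui resort maui resort maui resort"                  # keyword stuffing

honest_score = tf_score(query, honest)
stuffed_score = tf_score(query, stuffed)
print(honest_score, stuffed_score)  # 2 6
```

Under this scoring, the stuffed page outranks the honest one even though it says nothing useful to a human reader.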
Ranking
• Scoring
– Calculates scores for documents using a ranking algorithm
– Core component of search engine
– Basic form of score is $\sum_i q_i d_i$
• $q_i$ and $d_i$ are query and document term weights for term i
– Many variations of ranking algorithms and retrieval models
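The basic score, the sum over terms i of q_i times d_i, is a dot product over term weights. A minimal sketch, assuming weights are stored in dicts keyed by term and that a missing term has weight 0; all weights below are made up.

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over the query terms; absent terms contribute 0."""
    return sum(q * doc_weights.get(term, 0.0)
               for term, q in query_weights.items())

# Made-up weights for illustration.
query = {"maui": 1.0, "resort": 1.0}
doc_a = {"maui": 0.5, "resort": 0.25, "beach": 0.75}
doc_b = {"resort": 0.5, "hotel": 0.5}

score_a = score(query, doc_a)
score_b = score(query, doc_b)
print(score_a, score_b)  # 0.75 0.5
```

Different retrieval models differ mainly in how the weights q_i and d_i are computed, not in this final summation.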
Ranking
• Performance optimization
– Designing ranking algorithms for efficient processing
• Term-at-a-time vs. document-at-a-time processing
• Safe vs. unsafe optimizations
• Distribution
– Processing queries in a distributed environment
– Query broker distributes queries and assembles results
– Caching is a form of distributed searching
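The two processing orders named above can be contrasted on a toy index. This is a sketch under stated assumptions: postings carry a precomputed document weight, every query term has weight 1, and the index contents are made up. Both functions compute the same scores and differ only in traversal order.

```python
from collections import defaultdict

# Hypothetical weighted postings: term -> [(docID, weight)], sorted by docID.
postings = {
    "maui":   [(1, 0.5), (3, 0.25)],
    "resort": [(1, 0.25), (2, 0.5), (3, 0.5)],
}

def term_at_a_time(terms):
    """Process one whole postings list at a time, using score accumulators."""
    acc = defaultdict(float)
    for term in terms:
        for doc_id, weight in postings.get(term, []):
            acc[doc_id] += weight
    return dict(acc)

def document_at_a_time(terms):
    """Walk all lists in parallel and finish one document before the next."""
    lists = [postings.get(t, []) for t in terms]
    pos = [0] * len(lists)
    scores = {}
    while True:
        heads = [lst[p][0] for lst, p in zip(lists, pos) if p < len(lst)]
        if not heads:
            break
        doc_id = min(heads)  # next document in docID order
        total = 0.0
        for k, lst in enumerate(lists):
            if pos[k] < len(lst) and lst[pos[k]][0] == doc_id:
                total += lst[pos[k]][1]
                pos[k] += 1
        scores[doc_id] = total
    return scores

taat = term_at_a_time(["maui", "resort"])
daat = document_at_a_time(["maui", "resort"])
print(taat == daat, daat)
```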
Sec. 19.5
New definition?
– The statically indexable web is whatever search engines index.
• Different engines have different preferences
– max url depth, max count/host, anti-spam rules, priority rules, etc.
• Different engines index different things under the same URL:
– frames, meta-keywords, document restrictions, document
extensions, ...
Evaluation
• Logging
– Logging user queries and interaction is crucial for improving search
effectiveness and efficiency
– Query logs and clickthrough data used for query suggestion, spell
checking, query caching, ranking, advertising search, and other
components
• Ranking analysis
– Measuring and tuning ranking effectiveness
• Performance analysis
– Measuring and tuning system efficiency
Sec. 19.6
Duplicate documents
• The web is full of duplicated content
• Strict duplicate detection = exact match
– Not as common
• But many, many cases of near duplicates
– E.g., the last-modified date is the only difference between two copies of a
page
Sec. 19.6
Duplicate/Near-Duplicate Detection
Computing Similarity
• Features:
– Segments of a document (natural or artificial breakpoints)
– Shingles (Word N-Grams)
– a rose is a rose is a rose →
a_rose_is_a
rose_is_a_rose
is_a_rose_is
a_rose_is_a
• Similarity Measure between two docs (= sets of shingles)
– Set intersection
– Specifically (Size_of_Intersection / Size_of_Union)
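Shingling and the similarity measure can be sketched directly from the definitions above. The sketch assumes whitespace tokenization and 4-word shingles joined with underscores, as in the rose example; the second sentence is made up for comparison.

```python
def shingles(text, n=4):
    """Set of word n-grams (shingles) of the text, joined with underscores."""
    words = text.split()
    return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Size of intersection divided by size of union."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

s1 = shingles("a rose is a rose is a rose")
s2 = shingles("a rose is a rose is a flower")  # made-up near duplicate

print(sorted(s1))       # ['a_rose_is_a', 'is_a_rose_is', 'rose_is_a_rose']
print(jaccard(s1, s2))  # 0.75
```

Because a document is treated as a set, the repeated shingle a_rose_is_a is counted only once.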
Sec. 19.6
Doc A → Shingle set A → Sketch A
Doc B → Shingle set B → Sketch B
Jaccard (A vs. B)
Sec. 19.6
C1 C2
0 1
1 0
1 1
0 0
1 1
Jaccard(C1,C2) = 2/5 = 0.4
Sec. 19.6
Key Observation
• For columns Ci, Cj, four types of rows
Ci Cj
A 1 1
B 1 0
C 0 1
D 0 0
• Overload notation: A = # of rows of type A
• Claim: $\mathrm{Jaccard}(C_i, C_j) = \dfrac{A}{A + B + C}$
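The claim Jaccard(Ci, Cj) = A / (A + B + C) can be checked by counting row types and comparing against the set-based definition. The two 0/1 columns below are made up for illustration; type D rows (0, 0) do not affect the result.

```python
def row_type_counts(ci, cj):
    """Count rows of type A (1,1), B (1,0), C (0,1), D (0,0)."""
    pairs = list(zip(ci, cj))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return a, b, c, d

# Made-up columns, one entry per row.
ci = [1, 1, 0, 0, 1, 0, 1]
cj = [1, 0, 1, 0, 1, 0, 0]

a, b, c, d = row_type_counts(ci, cj)
from_counts = a / (a + b + c)

# Set-based definition: rows where each column has a 1.
rows_i = {r for r, x in enumerate(ci) if x == 1}
rows_j = {r for r, x in enumerate(cj) if x == 1}
from_sets = len(rows_i & rows_j) / len(rows_i | rows_j)

print((a, b, c, d), from_counts, from_sets)  # (2, 2, 1, 2) 0.4 0.4
```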
How Does It Really Work?
• This course explains these components of a search engine in more
detail
• Often many possible approaches and techniques for a given
component
– Focus is on the most important alternatives
– i.e., explain a small number of approaches in detail rather than many
approaches
– “Importance” based on research results and use in actual search engines
– Alternatives described in references