Chapter 2
● Text transformation
○ transforms documents into index terms or features
● Index creation
○ takes index terms and creates data structures (indexes) to support fast searching
● Query Suggest
● Query refinements
● Spell correction
● User clicks
● Mouse tracking
● Ranking
○ uses query and indexes to generate ranked list of documents
● Evaluation
○ monitors and measures effectiveness and efficiency (primarily offline)
■ A web crawler must efficiently find huge numbers of web pages (coverage) and keep them up to date (freshness)
● Feeds
○ Real-time streams of documents
■ e.g., web feeds for news, blogs, video, radio, TV
○ RSS is common standard (Really Simple Syndication)
■ RSS “reader” can provide new XML documents to search engine
■ e.g., Feedly is an RSS reader for users
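A minimal sketch of how the items in a feed could be pulled out for indexing, using Python's standard XML parser. The sample feed, its URLs, and the function name `parse_rss_items` are hypothetical and only illustrate the idea; a real feed carries more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; a real feed would be fetched over HTTP.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item>, holding its title and link."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


new_documents = parse_rss_items(SAMPLE_FEED)
```

Each dict in `new_documents` stands for one new document that could be handed to the rest of the indexing pipeline.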
● Conversion
○ Convert variety of documents into a consistent text plus metadata format
■ e.g. HTML, XML, Word, PDF, etc. → XML
○ Convert text encoding for different languages
■ Using a Unicode standard like UTF-8
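As an illustration of the encoding step, the sketch below re-encodes bytes from a known source encoding into UTF-8. It assumes the source encoding is already known (Latin-1 here); detecting the encoding is a separate problem that this sketch does not address. The function name `to_utf8` is hypothetical.

```python
def to_utf8(raw_bytes, source_encoding):
    """Decode bytes from a known encoding and re-encode them as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    return text.encode("utf-8")


# The same word stored in Latin-1, then converted to UTF-8.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
```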
■ A relational database is not designed for document storage (it is designed for structured data, e.g., numbers, dates, etc.)
■ More typically, a simpler, more efficient storage system is used due to huge numbers of documents
■ Document parser uses syntax of markup language (or other formatting) to identify structure
● Stemming
○ Group words derived from a common stem
○ Usually effective, but not for all queries, e.g., stemming changes [transformers] to [transformer]
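The grouping idea can be sketched with a deliberately naive suffix stripper. This is not the Porter stemmer or any other production algorithm; the suffix list, the minimum stem length of 3, and the function names are assumptions chosen only to show how words end up sharing a stem.

```python
def naive_stem(word):
    """Strip one common English suffix; a toy rule set, not a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of original words that reduce to it."""
    groups = {}
    for w in words:
        groups.setdefault(naive_stem(w), []).append(w)
    return groups


groups = group_by_stem(["fish", "fishes", "fishing", "transformers"])
```

With this rule set, a query for "transformers" is indexed under the same stem as "transformer", which shows why stemming can hurt when the plural form carries a distinct meaning.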
○ Anchor text can significantly enhance the representation of pages pointed to by links
● Information Extraction
○ Identify classes of index terms that are important for some applications
○ e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
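A toy sketch of one such recognizer, restricted to dates written as YYYY-MM-DD. Real named entity recognizers typically rely on much richer rules or statistical models; the pattern, the `DATE` label, and the function name `tag_dates` are assumptions made for illustration.

```python
import re

# Matches only ISO-style dates such as 2009-02-16 (an illustrative assumption).
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def tag_dates(text):
    """Return (matched_text, class_label) pairs for every date-like span."""
    return [(m.group(), "DATE") for m in DATE_PATTERN.finditer(text)]


entities = tag_dates("Released 2009-02-16 and updated 2015-04-01.")
```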
● Classifier
○ Identifies class-related metadata for documents
● Document Statistics
○ Gathers counts and positions of words and other features
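A minimal sketch of gathering word counts and positions for one document, assuming whitespace tokenization and lowercasing (both simplifications). The function name `term_positions` is hypothetical.

```python
from collections import defaultdict


def term_positions(text):
    """Map each token to the list of positions where it occurs."""
    positions = defaultdict(list)
    for i, token in enumerate(text.lower().split()):
        positions[token].append(i)
    return dict(positions)


positions = term_positions("to be or not to be")
# Counts follow directly from the position lists.
counts = {term: len(pos) for term, pos in positions.items()}
```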
● Weighting
○ Computes weights for index terms (how salient is a term for a document?)
○ e.g., tf.idf weight, where tf is the frequency of a term in a document and idf is based on 1 / the number of documents the word occurs in (document frequency, or df)
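One common way to combine the two quantities is sketched below. It assumes raw term counts for tf and a log-scaled idf of log(N / df); many other variants exist, and the slides do not say which one is intended. The function name and the tiny collection are hypothetical.

```python
import math


def tf_idf(term, doc_tokens, all_docs):
    """Weight = term frequency in the document * log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in all_docs if term in d)
    if df == 0:
        return 0.0  # term never occurs in the collection
    idf = math.log(len(all_docs) / df)
    return tf * idf


docs = [
    ["tropical", "fish", "tank"],
    ["fish", "food"],
    ["tropical", "weather"],
]
weight_fish = tf_idf("fish", docs[0], docs)  # occurs in 2 of 3 documents
weight_tank = tf_idf("tank", docs[0], docs)  # occurs in 1 of 3 documents
```

The rarer term ("tank") receives the higher weight, which is the intended effect of the idf component.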
● Inversion
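Inversion turns per-document term lists into per-term document lists. The sketch below assumes documents are already tokenized and identified by integer IDs; the function name `invert` and the sample data are hypothetical, and real indexes also store counts and positions.

```python
from collections import defaultdict


def invert(doc_terms):
    """Convert {doc_id: [terms]} into {term: [sorted doc_ids]}."""
    index = defaultdict(list)
    for doc_id in sorted(doc_terms):
        # set() ensures each document appears once per term.
        for term in set(doc_terms[doc_id]):
            index[term].append(doc_id)
    return dict(index)


inverted = invert({1: ["tropical", "fish"], 2: ["fish", "food", "fish"]})
```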
● Index Distribution
○ Distributes indexes across multiple computers and/or multiple sites
○ Many variations
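One of the many possible variations is to partition documents across machines by hashing the document identifier. The sketch below assumes string document IDs and a fixed number of shards; `shard_for` and `distribute` are hypothetical names, and CRC32 is used only because it gives the same result on every run.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard from a stable hash of the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distribute(doc_ids, num_shards):
    """Assign every document ID to exactly one shard."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


doc_ids = ["doc-%d" % i for i in range(10)]
shards = distribute(doc_ids, 3)
```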
● Query input
○ Provides interface and parser for query language
○ Most web queries are very simple; other applications may use forms
○ Query language used to describe more complex queries and results of query transformation
■ IR query languages also allow content and structure specifications, but focus on content
○ Query expansion and relevance feedback modify the original query with additional terms
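A minimal sketch of query expansion, assuming a hand-built table of related terms. In practice the additional terms would come from sources such as a thesaurus, query logs, or relevance feedback; the table and the function name `expand_query` are hypothetical.

```python
def expand_query(query_terms, related_terms):
    """Append related terms to the original query, without duplicates."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in related_terms.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


# Hypothetical related-term table.
RELATED = {"car": ["automobile", "auto"]}
expanded = expand_query(["car", "rental"], RELATED)
```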
● Results output
○ Constructs the display of ranked documents for a query
● Performance optimization
○ Designing ranking algorithms for efficient processing
● Distribution
○ Processing queries in a distributed environment
● Logging
○ Logging user queries and interaction is crucial for improving search effectiveness and efficiency
○ Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components
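As one illustration of how a query log can feed another component, the sketch below suggests completions for a prefix by counting how often each logged query was issued. The log contents and the function name `suggest` are hypothetical; a production system would also use clickthrough data and handle far larger logs.

```python
from collections import Counter


def suggest(prefix, query_log, k=3):
    """Return up to k logged queries starting with prefix, most frequent first."""
    counts = Counter(q for q in query_log if q.startswith(prefix))
    return [query for query, _ in counts.most_common(k)]


# Hypothetical query log.
QUERY_LOG = [
    "tropical fish",
    "tropical storm",
    "tropical fish",
    "trophy",
    "tropical fish tank",
]
suggestions = suggest("trop", QUERY_LOG, k=2)
```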
● Ranking analysis
○ Measuring and tuning ranking effectiveness
● Performance analysis
○ Measuring and tuning system efficiency
○ i.e., explain a small number of approaches in detail rather than many approaches