Chap - Week8 - Queries and Information Needs
– distance 2 (substitution × 2)
Edit Distance
• A number of techniques are used to speed up the
calculation of edit distances (when checking against a
dictionary)
– restrict to words starting with same character
– restrict to words of same or similar length
– restrict to words that sound the same
e.g., catre (supposed to be cater): don’t check against
categorization, just against cater/caters/catering
• Last option uses a phonetic code to group words
– e.g. Soundex
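The techniques above can be sketched in a few lines: a standard dynamic-programming edit distance, candidate pruning by first character and length, and the classic Soundex phonetic code (a minimal sketch; the dictionary and words are illustrative):

```python
def edit_distance(a, b):
    # classic dynamic-programming (Levenshtein) edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def candidates(word, dictionary):
    # prune the dictionary: same first letter, length within 1
    return [w for w in dictionary
            if w[0] == word[0] and abs(len(w) - len(word)) <= 1]

def soundex(word):
    # standard Soundex: keep the first letter, map the rest to
    # digit classes, skip repeats, vowels reset the previous code
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        d = codes.get(ch, "")
        if d and d != prev:
            out += d
        if ch not in "hw":      # h/w do not reset the previous code
            prev = d
    return (out + "000")[:4]    # pad/truncate to 4 characters
```

Note that "catre" and "cater" receive the same Soundex code, so the misspelling from the example above would be grouped with its correction.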
Spelling Correction Issues
• Ranking corrections
– “Did you mean...” feature requires accurate ranking of
possible corrections
• Context
– Choosing right suggestion depends on context (other
words)
– e.g., lawers → lowers, lawyers, layers, lasers, lagers
but trial lawers → trial lawyers
• Run-on errors
– e.g., “mainscourcebank”
– missing spaces can be treated as another single-character
error in the right framework
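The context issue above (lawers → lawyers only when preceded by trial) can be sketched by ranking candidate corrections with bigram counts from a query log; all counts here are hypothetical, for illustration only:

```python
# Hypothetical query-log counts (illustrative numbers only)
BIGRAM = {("trial", "lawyers"): 400, ("trial", "lowers"): 2,
          ("trial", "layers"): 5, ("trial", "lasers"): 1}
UNIGRAM = {"lowers": 900, "lawyers": 800, "layers": 700,
           "lasers": 600, "lagers": 100}

def rank_corrections(prev_word, cands):
    # prefer corrections that co-occur with the previous query
    # word; fall back to unigram frequency without context
    def score(w):
        if prev_word:
            return BIGRAM.get((prev_word, w), 0)
        return UNIGRAM.get(w, 0)
    return sorted(cands, key=score, reverse=True)
```

With no context the most frequent word wins; with "trial" as context, "lawyers" is ranked first.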
The Thesaurus
• Used in early search engines as a tool for
indexing and query formulation
– specified preferred terms and relationships
between them
– also called controlled vocabulary
• Particularly useful for query expansion
– adding synonyms or more specific terms using
query operators based on thesaurus
– improves search effectiveness
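Thesaurus-based expansion can be sketched as a lookup from preferred terms to related terms; the entries below are toy examples, not real thesaurus (e.g. MeSH) data:

```python
# Toy controlled vocabulary: preferred term -> related terms
# (hypothetical entries for illustration)
THESAURUS = {"fish": ["aquarium fish", "ichthyology"],
             "heart attack": ["myocardial infarction"]}

def expand(query_terms):
    # append thesaurus entries for each query term
    expanded = list(query_terms)
    for t in query_terms:
        expanded.extend(THESAURUS.get(t, []))
    return expanded
```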
MeSH Thesaurus
(MEDICAL SUBJECT HEADING)
Query Expansion
• A variety of automatic or semi-automatic
query expansion techniques have been
developed
– goal is to improve effectiveness by matching
related terms
– semi-automatic techniques require user
interaction to select the best expansion terms (e.g.,
relevance feedback)
• Query suggestion is a related technique
– alternative queries, not necessarily more terms
Query Expansion
• Approaches usually based on an analysis of
term co-occurrence
– either in the entire document collection, a large
collection of queries, or the top-ranked
documents in a result list
– query-based stemming also an expansion
technique
• Automatic expansion based on a general
thesaurus is not effective
– does not take context into account
Association Measures
[Figure: top 10 documents for “tropical fish”]
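Commonly used association measures can be computed directly from term and co-occurrence counts; a minimal sketch of two of them, Dice’s coefficient and the expected mutual information measure (formulas as in standard IR treatments):

```python
import math

def dice(n_a, n_b, n_ab):
    # Dice's coefficient: 2 * n_ab / (n_a + n_b), where n_a and
    # n_b are the occurrence counts of the two terms and n_ab is
    # their co-occurrence count
    return 2 * n_ab / (n_a + n_b)

def emim(n_a, n_b, n_ab, N):
    # expected mutual information measure over a collection of
    # N documents (0 when the terms never co-occur)
    if n_ab == 0:
        return 0.0
    return n_ab * math.log(N * n_ab / (n_a * n_b))
```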
Relevance Feedback Example
• If we assume the top 10 are relevant, the most
frequent terms are (with frequencies):
a (926), td (535), href (495), http (357), width (345),
com (343), nbsp (316), www (260), tr (239), htm (233),
class (225), jpg (221)
• too many stopwords and HTML expressions
• Use only snippets and remove stopwords
tropical (26), fish (28), aquarium (8), freshwater (5),
breeding (4), information (3), species (3), tank (2),
Badman’s (2), page (2), hobby (2), forums (2)
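The snippet-based term selection above can be sketched as a frequency count over snippet text with stopwords removed (the stopword list here is a tiny illustrative subset):

```python
from collections import Counter
import re

# tiny illustrative stopword list (real lists are much longer)
STOPWORDS = {"a", "the", "of", "and", "for", "to", "in", "is", "on"}

def expansion_terms(snippets, k=5):
    # count terms across result snippets, drop stopwords, and
    # return the k most frequent terms as expansion candidates
    counts = Counter()
    for snip in snippets:
        for term in re.findall(r"[a-z']+", snip.lower()):
            if term not in STOPWORDS:
                counts[term] += 1
    return counts.most_common(k)
```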
Relevance Feedback Example
• If document 7 (“Breeding tropical fish”) is
explicitly indicated to be relevant, the most
frequent terms are:
breeding (4), fish (4), tropical (4), marine (2), pond (2),
coldwater (2), keeping (1), interested (1)
• Specific weights and scoring methods used for
relevance feedback depend on retrieval model
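For the vector space retrieval model, one classic weighting scheme (not named on the slide, shown here as one concrete instance) is the Rocchio algorithm, which moves the query vector toward relevant documents and away from non-relevant ones; a minimal sketch with typical parameter values:

```python
def rocchio(query, rel_docs, nonrel_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    # Rocchio update: vectors are dicts mapping term -> weight;
    # alpha/beta/gamma are conventional illustrative values
    new_q = {t: alpha * w for t, w in query.items()}
    for docs, coef in ((rel_docs, beta), (nonrel_docs, -gamma)):
        if not docs:
            continue
        for d in docs:
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + coef * w / len(docs)
    # negative weights are commonly dropped in practice
    return {t: w for t, w in new_q.items() if w > 0}
```

With document 7 (“Breeding tropical fish”) marked relevant, terms such as breeding would be added to the query with positive weight.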
Relevance Feedback
• Both relevance feedback and pseudo-relevance
feedback are effective, but not used in many
applications
– pseudo-relevance feedback has reliability issues,
especially with queries that don’t retrieve many
relevant documents
• Some applications use relevance feedback
– filtering, “more like this”
• Query suggestion is more popular
– may be less accurate, but can work if the initial query fails
Context and Personalization
• If a query has the same words as another
query, results will be the same regardless of
– who submitted the query
– why the query was submitted
– where the query was submitted
– what other queries were submitted in the same
session
• These other factors (the context) could have a
significant impact on relevance
– What’s relevant to you may not be for me…
User Models
• Generate user profiles based on documents
that the person looks at
– such as web pages visited, email messages, or
word processing documents on the desktop
• Modify queries using words from profile
• Generally not effective
– imprecise profiles, information needs can change
significantly
Query Logs
• Query logs provide important contextual
information that can be used effectively
• Context in this case is
– previous queries that are the same
– previous queries that are similar
• Query history for individuals could be used for
caching
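Caching on query history can be sketched as a small LRU cache keyed on a normalized form of the query, so repeated or trivially reformatted queries hit the cache (a minimal sketch, not a production design):

```python
from collections import OrderedDict

class QueryCache:
    # tiny LRU cache keyed on a normalized query string
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.store = OrderedDict()

    @staticmethod
    def normalize(q):
        # lowercase and collapse whitespace
        return " ".join(q.lower().split())

    def get(self, q):
        key = self.normalize(q)
        if key in self.store:
            self.store.move_to_end(key)   # mark as recently used
            return self.store[key]
        return None

    def put(self, q, results):
        key = self.normalize(q)
        self.store[key] = results
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
```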
Local Search
• Location is context
• Local search uses geographic information to
modify the ranking of search results (e.g., in a
Malaysian context: election-related queries)
– location derived from the query text
– location of the device where the query originated
• e.g.,
– “underworld 3 cape cod”
– “underworld 3” from mobile device in Hyannis
Local Search
• Identify the geographic region associated with
web pages
– use location metadata that has been manually added
to the document,
– or identify locations such as place names, city names,
or country names in text
• Identify the geographic region associated with
the query
– 10-15% of queries contain some location reference
• Rank web pages using location information in
addition to text and link-based features
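The third step can be sketched as re-ranking by a text score combined with a location match; the boost factor and the (text_score, page_location) representation are assumptions for illustration:

```python
def rerank(results, query_location):
    # results are (text_score, page_location) pairs (illustrative);
    # pages whose location matches the query's location get a boost
    def score(r):
        text_score, page_loc = r
        boost = 1.5 if page_loc == query_location else 1.0  # assumed
        return text_score * boost
    return sorted(results, key=score, reverse=True)
```

A page in Hyannis with a lower text score can then outrank a better-matching page elsewhere when the query originates from Hyannis.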
Snippet Generation
[Figure: top 10 documents for “tropical fish”]
Clustering Results
• Efficiency
– clusters must be specific to each query and are
based on the top-ranked documents for that query
– typically based on snippets
• Easy to understand
– Can be difficult to assign good labels to groups
– Monothetic vs. polythetic classification
Types of Classification
• Monothetic
– every member of a class has the property that
defines the class
– typical assumption made by users
– easy to understand
• Polythetic
– members of classes share many properties but
there is no single defining property
– most clustering algorithms (e.g. K-means) produce
this type of output
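The monothetic case can be sketched directly: a snippet belongs to a cluster if and only if it contains that cluster’s defining term, so membership is easy to explain to users (a minimal sketch with illustrative labels):

```python
def monothetic_clusters(snippets, labels):
    # monothetic grouping: a snippet joins a cluster iff it
    # contains that cluster's defining term, so every member
    # has the property that defines the class
    clusters = {lab: [] for lab in labels}
    for s in snippets:
        for lab in labels:
            if lab in s.lower():
                clusters[lab].append(s)
    return clusters
```

A polythetic algorithm such as K-means would instead group snippets by overall term-vector similarity, with no single term guaranteed to appear in every member.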
Classification Example