Chap - Week8 - Queries and Information Needs

Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008


Information Needs
• An information need is the underlying cause of
the query that a person submits to a search
engine
– sometimes called information problem to emphasize
that information need is generally related to a task
• Categorized using a variety of dimensions
– e.g., number of relevant documents being sought
– type of information that is needed
– type of task that led to the requirement for
information
Queries and Information Needs
• A query can represent very different information
needs
– May require different search techniques and ranking
algorithms to produce the best rankings
• A query can be a poor representation of the
information need
– User may find it difficult to express the information
need
– User is encouraged to enter short queries, both by the
search engine interface and by the fact that long
queries often don't work well
Interaction
• Interaction with the system occurs
– during query formulation and reformulation
– while browsing the results (going through the
ranked list)
• Key aspect of effective retrieval
– users can’t change ranking algorithm but can
change results through interaction
– helps refine description of information need
• e.g., same initial query, different information needs
• how does a user describe what they don't know?
ASK Hypothesis
• Belkin et al. (1982) proposed a model called
Anomalous State of Knowledge
• ASK hypothesis:
– difficult for people to define exactly what their
information need is, because that information is a
gap in their knowledge
– Search engine should look for information that fills
those gaps
Keyword Queries
• Query languages in the past were designed for
professional searchers (intermediaries)
Keyword Queries
• Simple, natural language queries were
designed to enable everyone to search
• Current search engines do not perform well
(in general) with natural language queries
• Keyword selection is not always easy
– query refinement techniques can help
Query-Based Stemming
• Make the stemming decision at query time
rather than during indexing
– improves flexibility and effectiveness
• Query is expanded using word variants
– documents are not stemmed
– e.g., “rock climbing” expanded with “climb”, not
stemmed to “climb”
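A minimal sketch of this idea in Python, assuming a precomputed table of
word variants (the stem_classes entries below are made-up illustrative
data, not output of a real stemmer):

    # Query-time stemming: the index stores unstemmed words, so the
    # query is expanded with variants instead of being stemmed.
    stem_classes = {
        "climbing": ["climb", "climbs", "climbed", "climber"],
        "rock": ["rocks"],
    }

    def expand_query(terms):
        """Return the query terms plus their stem-class variants."""
        expanded = []
        for term in terms:
            expanded.append(term)
            expanded.extend(stem_classes.get(term, []))
        return expanded

    print(expand_query(["rock", "climbing"]))
    # ['rock', 'rocks', 'climbing', 'climb', 'climbs', 'climbed', 'climber']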
Stem Classes
• A stem class is the group of words that will be
transformed into the same stem by the
stemming algorithm
– generated by running stemmer on large corpus
– e.g., Porter stemmer on TREC News
Stem Classes
• Stem classes are often too big and inaccurate
• Modify using analysis of word co-occurrence
• Assumption:
– Word variants that could substitute for each other
should co-occur often in documents
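A sketch of this refinement in Python, using Dice's coefficient over
document counts. The frequencies are made up, but policy/police is the
classic example of a stemmer conflating words that rarely co-occur:

    # Prune a stem class: keep only variants that co-occur with the
    # query word often enough (Dice's coefficient over doc counts).
    def dice(n_a, n_b, n_ab):
        """Dice's coefficient: 2*n_ab / (n_a + n_b)."""
        return 2 * n_ab / (n_a + n_b) if n_a + n_b else 0.0

    def refine_stem_class(word, variants, doc_freq, co_freq, threshold=0.01):
        """Keep variants whose association with `word` passes the threshold."""
        return [v for v in variants
                if dice(doc_freq[word], doc_freq[v],
                        co_freq.get((word, v), 0)) >= threshold]

    doc_freq = {"policy": 5000, "police": 9000, "policies": 3000}
    co_freq = {("policy", "policies"): 800, ("policy", "police"): 10}
    print(refine_stem_class("policy", ["police", "policies"],
                            doc_freq, co_freq))
    # ['policies'] -- "police" shares a stem but rarely co-occurs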
Spell Checking
• Important part of query processing
– 10-15% of all web queries have spelling errors
• Errors include typical word-processing errors
but also many other types [example list not reproduced]
Spell Checking
• Basic approach: suggest corrections for words
not found in spelling dictionary
• Suggestions found by comparing word to
words in dictionary using similarity measure
• Most common similarity measure is edit
distance
– number of operations required to transform one
word into the other (e.g., how many letters must
be added or changed)
Edit Distance
• Damerau-Levenshtein distance
– counts the minimum number of insertions,
deletions, substitutions, or transpositions of single
characters required
– e.g., Damerau-Levenshtein distance 1: one insertion,
deletion, substitution, or transposition away from
the intended word
– distance 2: two such operations apart (e.g., two
substitutions)
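A sketch of the restricted ("optimal string alignment") variant of
Damerau-Levenshtein distance in Python; the first test pair (catre/cater)
comes from this section, the second is a two-substitution pair:

    # Restricted Damerau-Levenshtein distance: insertions, deletions,
    # substitutions, and transpositions of adjacent characters cost 1.
    def edit_distance(s, t):
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                    # delete all of s[:i]
        for j in range(n + 1):
            d[0][j] = j                    # insert all of t[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                        and s[i - 2] == t[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[m][n]

    print(edit_distance("catre", "cater"))            # 1 (one transposition)
    print(edit_distance("doceration", "decoration"))  # 2 (two substitutions)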
Edit Distance
• A number of techniques are used to speed up the
calculation of edit distances (when checking
against the dictionary)
– restrict to words starting with the same character
– restrict to words of the same or similar length
– restrict to words that sound the same
e.g., for "catre" (supposed to be "cater"), don't
compare against "categorization", just against
"cater", "caters", "catering"
• Last option uses a phonetic code to group words
– e.g. Soundex
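A sketch of the classic Soundex code in Python (one common variant of
the rules; real implementations differ in small details such as the
handling of h and w):

    # Soundex: same-sounding words map to the same 4-character code,
    # so the dictionary can be bucketed by code before computing
    # edit distances.
    CODES = {c: d for d, letters in enumerate(
        ["", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]) for c in letters}

    def soundex(word):
        word = word.lower()
        head = word[0].upper()
        digits = []
        prev = CODES.get(word[0], 0)
        for c in word[1:]:
            code = CODES.get(c)
            if code is None:        # vowels, h, w, y get no code
                if c not in "hw":
                    prev = 0        # vowels break a run of equal codes
                continue
            if code != prev:
                digits.append(str(code))
            prev = code
        return (head + "".join(digits) + "000")[:4]

    print(soundex("extenssions"))  # E235
    print(soundex("extensions"))   # E235 -- same bucket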
Spelling Correction Issues
• Ranking corrections
– “Did you mean...” feature requires accurate ranking of
possible corrections
• Context
– Choosing right suggestion depends on context (other
words)
– e.g., lawers → lowers, lawyers, layers, lasers, lagers
but trial lawers → trial lawyers
• Run-on errors
– e.g., “mainscourcebank”
– missing spaces can be treated as just another
single-character error in the right framework
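A sketch of context-sensitive ranking of corrections, using the "trial
lawers" example from this slide; the bigram counts are invented
stand-ins for query-log or n-gram statistics:

    # Pick among candidate corrections using the surrounding word.
    bigram_counts = {
        ("trial", "lawyers"): 9500,
        ("trial", "lowers"): 12,
        ("trial", "layers"): 40,
        ("trial", "lasers"): 3,
    }

    def best_correction(prev_word, candidates):
        """Choose the candidate most frequently seen after prev_word."""
        return max(candidates, key=lambda c: bigram_counts.get((prev_word, c), 0))

    print(best_correction("trial", ["lowers", "lawyers", "layers", "lasers"]))
    # lawyers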
The Thesaurus
• Used in early search engines as a tool for
indexing and query formulation
– specified preferred terms and relationships
between them
– also called controlled vocabulary
• Particularly useful for query expansion
– adding synonyms or more specific terms using
query operators based on thesaurus
– improves search effectiveness
MeSH Thesaurus
(Medical Subject Headings)
Query Expansion
• A variety of automatic or semi-automatic
query expansion techniques have been
developed
– goal is to improve effectiveness by matching
related terms
– semi-automatic techniques require user
interaction to select the best expansion terms
(e.g., relevance feedback)
• Query suggestion is a related technique
– alternative queries, not necessarily more terms
Query Expansion
• Approaches usually based on an analysis of
term co-occurrence
– either in the entire document collection, a large
collection of queries, or the top-ranked
documents in a result list
– query-based stemming also an expansion
technique
• Automatic expansion based on a general
thesaurus is not effective
– does not take context into account
Association Measures
[table of term association measures (Dice's coefficient,
mutual information, EMIM, Pearson's chi-squared);
standard forms are given after this slide]
• Associated words are of little use for
expanding the query "tropical fish"
• Expansion based on whole query takes
context into account
– e.g., using Dice's coefficient with the phrase "tropical
fish" gives the following highly associated words:
goldfish, reptile, aquarium, coral, frog, exotic, stripe,
regent, pet, wet
• Impractical to precompute for all possible queries;
other approaches are used to achieve this effect
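For reference, the association measures behind this discussion, in
their standard rank-equivalent forms (n_a and n_b are the numbers of
documents, or text windows, containing words a and b; n_ab the number
containing both; N the total):

    \begin{align*}
    \text{Dice's coefficient:} \quad & \frac{n_{ab}}{n_a + n_b} \\
    \text{Mutual information:} \quad & \frac{n_{ab}}{n_a \cdot n_b} \\
    \text{EMIM:} \quad & n_{ab} \cdot \log\Big(N \cdot \frac{n_{ab}}{n_a \cdot n_b}\Big) \\
    \text{Pearson's } \chi^2\text{:} \quad & \frac{\big(n_{ab} - \tfrac{1}{N}\, n_a\, n_b\big)^2}{n_a \cdot n_b}
    \end{align*}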
Other Approaches
• Pseudo-relevance feedback
– expansion terms based on top retrieved documents
for the initial query (e.g., by their co-occurrence
with the query terms)
• Context vectors
– Represent words by the words that co-occur with
them
– e.g., top 35 most strongly associated words for "aquarium"
(using Dice's coefficient) [word list not reproduced]
– Rank words for a query by ranking context vectors
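A sketch of ranking candidate expansion words by context-vector
similarity; the vectors below are tiny made-up stand-ins for corpus
co-occurrence statistics:

    # Words as context vectors (co-occurring word -> association
    # weight); candidates are ranked by cosine similarity.
    import math

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    query_vec = {"fish": 0.9, "tank": 0.6, "water": 0.4}   # e.g. "aquarium"
    candidates = {
        "goldfish": {"fish": 0.8, "water": 0.5, "bowl": 0.3},
        "laptop":   {"screen": 0.7, "battery": 0.6},
    }
    ranked = sorted(candidates,
                    key=lambda w: cosine(query_vec, candidates[w]),
                    reverse=True)
    print(ranked)  # ['goldfish', 'laptop']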
Other Approaches
• Query logs
– Best source of information about queries and
related terms
• short pieces of text and click data
– e.g., most frequent words in queries containing
“tropical fish” from MSN log:
stores, pictures, live, sale, types, clipart, blue,
freshwater, aquarium, supplies
– query suggestion based on finding similar queries
• group based on click data
Relevance Feedback
• User identifies relevant (and possibly non-
relevant) documents in the initial result list,
i.e., provides relevance judgments
• System modifies query using terms from those
judged documents and re-ranks documents
• Pseudo-relevance feedback just assumes top-
ranked documents are relevant – no user
input
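A minimal sketch of pseudo-relevance feedback along these lines,
echoing the snippet-plus-stopword-removal step in the example that
follows; the stopword list and snippets are illustrative:

    # Assume the top-k snippets are relevant, count their non-stopword
    # terms, and append the most frequent new terms to the query.
    from collections import Counter

    STOPWORDS = {"a", "the", "for", "and", "of", "to", "in", "is"}

    def prf_expand(query_terms, top_snippets, k=10, n_terms=3):
        counts = Counter(
            word for snippet in top_snippets[:k]
            for word in snippet.lower().split()
            if word not in STOPWORDS and word not in query_terms)
        return list(query_terms) + [w for w, _ in counts.most_common(n_terms)]

    snippets = ["Tropical fish aquarium supplies",
                "Breeding tropical fish in a freshwater aquarium",
                "Aquarium fish species and freshwater tank tips"]
    print(prf_expand(["tropical", "fish"], snippets))
    # ['tropical', 'fish', 'aquarium', 'freshwater', 'supplies']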
Relevance Feedback Example

Top 10 documents for "tropical fish"
Relevance Feedback Example
• If we assume top 10 are relevant, most
frequent terms are (with frequency):
a (926), td (535), href (495), http (357), width (345),
com (343), nbsp (316), www (260), tr (239), htm (233),
class (225), jpg (221)
• too many stopwords and HTML expressions
• Use only snippets and remove stopwords
tropical (26), fish (28), aquarium (8), freshwater (5),
breeding (4), information (3), species (3), tank (2),
Badman’s (2), page (2), hobby (2), forums (2)
Relevance Feedback Example
• If document 7 (“Breeding tropical fish”) is
explicitly indicated to be relevant, the most
frequent terms are:
breeding (4), fish (4), tropical (4), marine (2), pond (2),
coldwater (2), keeping (1), interested (1)
• Specific weights and scoring methods used for
relevance feedback depend on retrieval model
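The slides leave the weighting open; for the vector space model the
classic instance is Rocchio's formula, sketched here (R and NR are the
judged relevant and non-relevant document sets; alpha, beta, gamma are
tuning weights):

    \[
    \vec{q}\,' = \alpha\,\vec{q}
      + \frac{\beta}{|R|} \sum_{\vec{d} \in R} \vec{d}
      - \frac{\gamma}{|NR|} \sum_{\vec{d} \in NR} \vec{d}
    \]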
Relevance Feedback
• Both relevance feedback and pseudo-relevance
feedback are effective, but not used in many
applications
– pseudo-relevance feedback has reliability issues,
especially with queries that don’t retrieve many
relevant documents
• Some applications use relevance feedback
– filtering, “more like this”
• Query suggestion more popular
– may be less accurate, but can work if initial query fails
Context and Personalization
• If a query has the same words as another
query, results will be the same regardless of
– who submitted the query
– why the query was submitted
– where the query was submitted
– what other queries were submitted in the same
session
• These other factors (the context) could have a
significant impact on relevance
– What’s relevant to you may not be for me…
User Models
• Generate user profiles based on documents
that the person looks at
– such as web pages visited, email messages, or
word processing documents on the desktop
• Modify queries using words from profile
• Generally not effective
– imprecise profiles, information needs can change
significantly
Query Logs
• Query logs provide important contextual
information that can be used effectively
• Context in this case is
– previous queries that are the same
– previous queries that are similar
• Query history for individuals could be used for
caching
Local Search
• Location is context
• Local search uses geographic information to
modify the ranking of search results (e.g., in a
Malaysian context, queries about elections)
– location derived from the query text
– location of the device where the query originated
• e.g.,
– “underworld 3 cape cod”
– “underworld 3” from mobile device in Hyannis
Local Search
• Identify the geographic region associated with
web pages
– use location metadata that has been manually added
to the document,
– or identify locations such as place names, city names,
or country names in text
• Identify the geographic region associated with
the query
– 10-15% of queries contain some location reference
• Rank web pages using location information in
addition to text and link-based features
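A sketch of the last step, treating location match as one extra ranking
feature; the scores, file names, and fixed boost below are illustrative
placeholders, not a real ranking function:

    # Fold location evidence into ranking as one extra feature.
    def local_score(text_score, page_region, query_region, boost=0.5):
        """Add a fixed boost when the page's region matches the query's."""
        return text_score + (boost if page_region == query_region else 0.0)

    pages = [("theater-hyannis.html", 2.1, "cape cod"),
             ("theater-boston.html", 2.4, "boston")]
    query_region = "cape cod"  # derived from query text or device location
    ranked = sorted(pages,
                    key=lambda p: local_score(p[1], p[2], query_region),
                    reverse=True)
    print([p[0] for p in ranked])
    # ['theater-hyannis.html', 'theater-boston.html']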
Snippet Generation
• Query-dependent document summary
[example result snippet not reproduced]
• Simple summarization approach
– rank each sentence in a document using a
significance factor
– select the top sentences for the summary
– first proposed by Luhn in the 1950s
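A sketch of Luhn's significance factor (simplified): find runs of
significant words with at most four non-significant words between
neighbours, and score each run by (significant words)^2 / run length.
For query-dependent snippets, the query terms would typically count as
significant:

    def significance_factor(sentence_words, significant, max_gap=4):
        """Luhn-style score: best bracket of nearby significant words."""
        positions = [i for i, w in enumerate(sentence_words)
                     if w in significant]
        if not positions:
            return 0.0
        best, start = 0.0, 0
        for i in range(1, len(positions) + 1):
            # close the bracket at the end, or when the gap is too large
            if i == len(positions) or positions[i] - positions[i - 1] > max_gap + 1:
                span = positions[start:i]
                length = span[-1] - span[0] + 1
                best = max(best, len(span) ** 2 / length)
                start = i
        return best

    words = "tropical fish are a popular type of aquarium fish".split()
    print(significance_factor(words, {"tropical", "fish", "aquarium"}))
    # 2.0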
Snippet Guidelines
• All query terms should appear in the
summary, showing their relationship to the
retrieved page
• When query terms are present in the title,
they need not be repeated
– allows snippets that do not contain query terms
• Highlight query terms in URLs
• Snippets should be readable text, not lists of
keywords
Advertising
• Sponsored search – advertising presented with
search results
• Contextual advertising – advertising presented
when browsing web pages
• Both involve finding the most relevant
advertisements in a database
– An advertisement usually consists of a short text
description and a link to a web page describing
the product or service in more detail
Example Advertisements
Advertisements retrieved for query "fish tank" [figure not reproduced]
Clustering Results
• Result lists often contain documents related to
different aspects of the query topic
• Clustering is used to group related documents
to simplify browsing

Example clusters for query "tropical fish" [figure not reproduced]
Result List Example

Top 10 documents for "tropical fish"
Clustering Results
• Efficiency
– clusters must be specific to each query, based on
the top-ranked documents for that query
– typically based on snippets
• Easy to understand
– but it can be difficult to assign good labels to groups
– Monothetic vs. polythetic classification
Types of Classification
• Monothetic
– every member of a class has the property that
defines the class
– typical assumption made by users
– easy to understand
• Polythetic
– members of classes share many properties but
there is no single defining property
– most clustering algorithms (e.g. K-means) produce
this type of output
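A sketch of the distinction, with a helper that returns a defining
property if one exists. The property sets are a hypothetical
reconstruction consistent with the classification example on the next
slide, not the original figure's data:

    # A cluster is monothetic if some single property is shared by
    # every member (and can serve as the label).
    def monothetic_label(cluster, properties):
        """Return a property common to all docs in the cluster, if any."""
        common = set.intersection(*(properties[d] for d in cluster))
        return next(iter(common), None)

    properties = {"D1": {"a", "b", "c"}, "D2": {"a", "d", "e"},
                  "D3": {"d", "e", "f", "g"}, "D4": {"f", "g"}}
    print(monothetic_label(["D1", "D2"], properties))        # a
    print(monothetic_label(["D2", "D3", "D4"], properties))  # None -> polythetic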
Classification Example
[figure: documents D1-D4, each with a small set of
properties; the sketch above uses a consistent reconstruction]
• Possible monothetic classification
– {D1,D2} (labeled using a) and {D2,D3} (labeled e)
• Possible polythetic classification
– {D2,D3,D4}, D1
– labels?
Cross-Language Search
• Query in one language, retrieve documents in
multiple other languages
• Involves query translation, and probably
document translation (e.g., the Malay query
"Sejarah Malaysia", i.e. "History of Malaysia")
• Query translation can be done using bilingual
dictionaries
• Document translation requires more
sophisticated statistical translation models
– similar to some retrieval models
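A minimal sketch of dictionary-based query translation, using the
"Sejarah Malaysia" example; the bilingual table is a toy stand-in for a
real dictionary:

    # Each source term maps to one or more target-language terms;
    # unknown terms are kept as-is.
    bilingual = {
        "sejarah": ["history"],
        "malaysia": ["malaysia"],
        "ikan": ["fish"],
    }

    def translate_query(terms):
        """Replace each term by its dictionary translations."""
        out = []
        for t in terms:
            out.extend(bilingual.get(t.lower(), [t]))
        return out

    print(translate_query(["Sejarah", "Malaysia"]))
    # ['history', 'malaysia']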
Cross-Language Search
END
