Unit IV
CATALOGING AND INDEXING
User Search Techniques: Search Statements and Binding, Similarity Measures and Ranking,
Relevance Feedback, Selective Dissemination of Information Search, Weighted Searches of
Boolean Systems, Searching the INTERNET and Hypertext
Search Statements
“Find me information on the impact of oil spills in Alaska on the price of oil.”
Step 1 – User Binding:
o Extracted terms: impact, oil, spills, Alaska, price, etc.
Step 2 – System Binding:
o Mapped to synonyms and given weights:
o oil (.606), petroleum (.65), price (.16), cost (.25), value (.10), etc.
Step 3 – Database Binding:
o Weights adjusted again based on document statistics and indexing semantics of that
particular database.
In general, searching is concerned with calculating the “similarity” between a user’s search
statement and the items in the database.
The similarity can be applied to total items or to logical passages within an item.
For example, every paragraph, or every 100 words, may be defined as a passage.
Different similarity measures can be used to calculate the similarity between an item and the
search statement; the same measures can also be used to calculate the similarity between
documents for clustering purposes.
Simple Sum of Products
o Calculates similarity by summing the products of the corresponding term weights
of two items.
o This is a basic approach but lacks normalization, which can lead to issues
when items vary in length.
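The sum-of-products measure can be sketched as follows, representing each item as a dictionary mapping terms to weights (the example query and document weights are illustrative, not from the text):

```python
def sum_of_products(weights_a, weights_b):
    """Similarity as the sum of products of matching term weights.

    Terms missing from either item contribute zero. No length
    normalization is applied, so longer items tend to score higher.
    """
    return sum(w * weights_b.get(term, 0.0) for term, w in weights_a.items())

# Hypothetical term-weight vectors:
query = {"oil": 0.6, "spill": 0.5, "price": 0.2}
doc = {"oil": 1.0, "price": 0.5, "alaska": 0.8}
# similarity = 0.6*1.0 + 0.2*0.5 = 0.7 (only shared terms contribute)
```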
Croft’s Similarity Formula
o where:
o C is a tuning constant.
o K is a tuning constant (typically 0.3 to 0.5).
o IDFi is the inverse document frequency, which gives higher weight to rarer
terms.
o TFi,j is the frequency of term i in item j.
o maxfreqj is the maximum frequency of any term in item j.
o This formula adjusts term weights based on their frequency and rarity, improving
relevance.
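The formula itself is not reproduced above; the sketch below assumes a commonly cited form, (C + IDFi) multiplied by the normalized frequency K + (1 − K) · TFi,j / maxfreqj:

```python
import math

def croft_weight(freq_ij, maxfreq_j, n_docs, doc_freq_i, C=1.0, K=0.3):
    """Croft-style term weight (an assumed form, per the definitions
    above): (C + IDF_i) * normalized term frequency.

    K damps raw frequency toward a baseline; IDF_i = log(N / n_i)
    boosts terms that appear in few documents.
    """
    idf = math.log(n_docs / doc_freq_i)
    tf = K + (1 - K) * freq_ij / maxfreq_j
    return (C + idf) * tf
```

With this form, a term at maximum frequency in a document that appears in every document gets weight C; a rare term at the same frequency scores much higher.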
Cosine Similarity
o Measures the cosine of the angle between two vectors (document and
query). A value of 1 means the vectors are identical (same direction), and 0
means they are orthogonal (unrelated).
o The denominator normalizes for vector length, ensuring the result is
between 0 and 1.
o A variant simplifies the denominator but still normalizes
the score.
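A minimal sketch of cosine similarity over term-weight dictionaries:

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors (dicts).

    The denominator normalizes both vectors to unit length, so the
    score is independent of item length and lies in [0, 1] for
    non-negative weights.
    """
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)
```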
Jaccard and Dice Measures
o Jaccard: The denominator depends on the number of common terms,
producing scores between -1 and 1. It penalizes dissimilarities more heavily.
o Dice: The Dice measure simplifies the denominator and adds a factor of 2 in the
numerator, also ranging from -1 to 1, but it is less sensitive to the number of common
terms.
The simple “sum of the products” similarity formula is used to calculate the similarity
between the query and each document. If no threshold is specified, all three documents
are considered hits; if a threshold of 4 is selected, then only DOC1 is returned.
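The weighted forms of Jaccard and Dice can be sketched as below (assuming the common formulation where the denominators use sums of squared weights; for non-negative weights these score between 0 and 1):

```python
def jaccard(a, b):
    """Weighted Jaccard: dot / (sum of squares - dot).

    The denominator shrinks as the items share more terms, so common
    terms are rewarded and dissimilarity is penalized heavily.
    """
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    sq = sum(w * w for w in a.values()) + sum(w * w for w in b.values())
    return dot / (sq - dot) if (sq - dot) else 0.0

def dice(a, b):
    """Weighted Dice: 2 * dot / (sum of squares).

    The denominator is fixed regardless of overlap, making the score
    less sensitive to the number of common terms than Jaccard.
    """
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    sq = sum(w * w for w in a.values()) + sum(w * w for w in b.values())
    return 2 * dot / sq if sq else 0.0
```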
Ranking:
Once items are identified as possibly relevant to the user’s query, the best practice is
to present the most likely relevant items first.
This process is called “Ranking”.
Relevance Feedback
The relevance feedback concept is that a new query should be based on the old query:
the old query is modified to increase the weight of terms found in relevant items and decrease
the weight of terms found in non-relevant items.
The first major work on relevance feedback was published in 1965 by Rocchio.
This technique not only modified the terms in the original query but also allowed expansion of
new terms from the relevant items.
The revised Rocchio formula for query modification:

Q_new = Q_old + (1/r) * Σ(i=1..r) DR_i − (1/nr) * Σ(i=1..nr) DN_i

where r is the number of relevant items, nr is the number of non-relevant items, DR_i are the
vectors of the relevant items, and DN_i are the vectors of the non-relevant items.
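A sketch of Rocchio-style query modification over term-weight dictionaries, assuming the standard form (old query plus the average relevant-item vector, minus the average non-relevant-item vector):

```python
def rocchio(query, relevant, non_relevant):
    """Modify a query vector: add the average relevant-item vector,
    subtract the average non-relevant-item vector.

    Terms from relevant items that were not in the original query are
    added (query expansion). Negative weights are clipped to zero, a
    common practical choice not dictated by the formula itself.
    """
    terms = set(query)
    for d in relevant + non_relevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = query.get(t, 0.0)
        if relevant:
            w += sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if non_relevant:
            w -= sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant)
        if w > 0:
            new_q[t] = w
    return new_q
```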
The “impact of relevance feedback” figure shows how positive and negative feedback
shift the query’s position in the document space:
Circles: Represent documents (filled circles are non-relevant, open circles are relevant).
Oval: The set of items retrieved by the query.
Solid Box: The original query’s position.
Hollow Box: The query’s position after feedback.
Recent experiments with relevance feedback during TREC sessions have shown conclusively the
advantages of relevance feedback.
Queries using relevance feedback produce significantly better results than manually
enhanced queries. Because users tend to enter queries with very few terms, automatic
relevance feedback based on the rank values of the retrieved items is used instead.
This concept is called pseudo-relevance feedback, blind feedback, or local text analysis.
It does not require human relevance judgments: the highest-ranked items returned by the
query are automatically assumed to be relevant.
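A minimal sketch of blind (pseudo-relevance) feedback; the `score` parameter and `top_k` default are hypothetical names, standing in for any similarity function and cutoff:

```python
def dot(a, b):
    """Simple sum-of-products similarity over term-weight dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def blind_feedback(query, docs, score, top_k=2):
    """Pseudo-relevance feedback: rank the documents, assume the
    top_k are relevant, and fold their terms into the query with
    averaged weights. No human judgment is involved.
    """
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    new_q = dict(query)
    for d in ranked[:top_k]:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + w / top_k
    return new_q
```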
Boolean queries (e.g., "A AND B", "A OR B", "A NOT B") are traditionally strict: they return only
items that exactly match the conditions (e.g., both A and B for AND). Weighted searches of
Boolean systems relax this by assigning weights to query terms and ranking items by how well
they satisfy the Boolean expression.
This method requires more computation due to sorting but provides a more
comprehensive use of weights.
4. P-norm Model
For OR queries, the origin (all weights = 0) is the worst case; the best documents
are farthest from the origin.
For AND queries, the ideal point is the unit vector (all weights = 1); the best
documents are closest to this point.
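The geometric description above matches the standard p-norm scoring functions; a sketch under that assumption (term weights in [0, 1], p a tuning parameter):

```python
def pnorm_or(weights, p=2.0):
    """P-norm OR: scaled distance from the origin, the worst case
    for OR. p=1 behaves like an average; large p approaches strict
    Boolean OR, where the maximum weight dominates.
    """
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p=2.0):
    """P-norm AND: one minus the scaled distance from the all-ones
    ideal point; documents closest to (1, 1, ..., 1) score highest.
    """
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)
```

For a document matching one of two AND terms perfectly and the other not at all, the AND score is 1 − sqrt(1/2) ≈ 0.29: partial credit, rather than the strict Boolean score of 0.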
Boolean Operations
Weighted Interpretation
Completeness: If the query has 5 terms and the item contains 3 of them (or their synonyms),
the completeness is 3/5 = 0.6. This sets an upper limit on the item’s rank. If query terms are
weighted (e.g., some terms are marked as more important), those weights are factored into
the score.

Semantic relationships: Synonyms (e.g., “buy” for “purchase”) increase the score, while
antonyms (e.g., “sell” for “buy”) decrease it. The closer the semantic relationship, the more
weight is added to the ranking.

Context: If the query term is “charge” with the context of “paying for an object,” finding
words like “buy,” “purchase,” or “debt” in the item suggests that “charge” is used in the
desired sense, increasing the item’s score. This helps disambiguate terms with multiple
meanings (e.g., “charge” as in payment vs. “charge” as in an electrical charge).
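The completeness upper bound described above can be sketched as:

```python
def completeness(query_terms, item_terms):
    """Fraction of query terms (or their synonyms, if item_terms
    already includes synonym matches) present in the item; this
    caps the item's achievable rank.
    """
    matched = sum(1 for t in query_terms if t in item_terms)
    return matched / len(query_terms)

# 3 of 5 query terms found in the item -> completeness 0.6
```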
The Internet in the 1990s relied on search engines like Yahoo, AltaVista, and Lycos to help users find
information. It also gave rise to intelligent agents that searched on a user’s behalf, characterized by:
1. Autonomy
Agents operated independently without constant human input, navigating websites
based on predefined criteria to collect relevant information.
2. Communication Ability
Agents used standard protocols (e.g., Z39.50, a library search protocol) to interact
with websites and retrieve data.
5. Adaptive Behavior
Agents assessed their environment and adjusted their actions to better meet user
needs, combining autonomy and reasoning.
6. Trustworthiness
Users needed to trust that agents would act in their best interests, retrieving
relevant and accessible information.
Doyle (1962): Proposed "semantic road maps" to give users a visual overview of a database’s
content, allowing them to focus queries on specific themes.
The shift in user-machine interfaces from basic typewriter-like interactions to more complex
systems like WIMP (windows, icons, menus, pointer) interfaces, which handle multiple tasks
simultaneously.
As computer displays became common, the focus turned to representing information visually in
ways that align with human cognitive processes.
The goal is to reduce the mental effort (cognitive overhead) users spend finding and
understanding information by leveraging human perception—particularly vision, but also other
senses like audio and touch.
Background on Vision and Cognition
o Gestalt Psychology: The mind organizes visual input into meaningful wholes using rules:
o Proximity: Nearby objects are grouped together.
o Similarity: Similar objects are grouped together.
o Continuity: Smooth, continuous patterns are preferred (e.g., a circle with a line through
it is seen as a circle and a line, not two half-circles).
o Closure: Gaps are mentally filled to form a whole (e.g., a dashed square is still perceived
as a square).
o Connectedness: Linked elements are seen as a single unit.
o Spatial Frequency
The visual system constructs images from multiple channels (spatial frequency,
orientation, contrast).
Spatial frequency measures light-dark cycles per degree of visual field.
Distinct images are easier to process for motion/changes than blurred ones, so
certain spatial frequencies can help highlight patterns in dynamic displays.
o Natural Visual Processing
The visual system is tuned to real-world patterns like horizontal/vertical
references, subdued colors, and terrain/depth.
Bright colors in displays mimic natural attention cues (e.g., noticing bright
flowers), and depth-based graphics align with everyday depth processing.
Implication: Visualizations should mimic real-world sensory experiences to
reduce cognitive effort.
Cone Tree
Description:
Visualization of Results