IRS Unit 4 by Krishna
UNIT-4 PART-1
1. User Input: "Find information on the impact of oil spills in Alaska on the price
of oil."
2. Binding Vocabulary: Keywords like "impact," "oil," "spills," "Alaska," and
"price" are identified.
3. Statistical Binding: Weights are assigned to terms (e.g., "oil" = 0.606, "spills"
= 0.12) based on their frequency in the database.
4. Database Binding: Terms are matched to the specific database's indexing and
content, optimizing search results.
This multi-level binding ensures that search statements evolve from a user’s broad
input to a form that retrieves the most relevant information efficiently.
• After identifying potentially relevant items, they are ranked so that the most
relevant items appear first.
• Relevance is determined using a scalar number that represents how similar
each item or passage is to the query.
In essence, similarity measures ensure accurate matching between queries and
database items, while ranking prioritizes the results for user convenience.
Similarity Measures
Summary: Various similarity measures are used in information retrieval to calculate
the similarity between items and search statements. These measures, such as the
Cosine and Jaccard formulas, aim to quantify the similarity between items and
queries, with higher values indicating greater similarity. Thresholds are often applied
to these similarity measures to filter and rank search results, ensuring only relevant
items are presented to users.
1. Similarity Increases with Resemblance: The more similar two items are, the
higher the value produced by the formula.
2. Normalization: Many measures require normalization to adjust for differences
in item lengths and ensure values fall within a specific range (e.g., 0 to 1 or -1
to +1).
3. Cosine Similarity:
◦ Treats documents and queries as vectors in n-dimensional space.
◦ Computes the cosine of the angle between the two vectors.
◦ Values range from 0 (orthogonal, no similarity) to 1 (identical vectors).
◦ Cosine Similarity = Σ(vₖ × wₖ) / (√Σ(vₖ²) × √Σ(wₖ²))
4. Jaccard Similarity:
◦ Focuses on the commonality of terms between the two sets.
◦ Jaccard Similarity = |A ∩ B| / |A ∪ B|
◦ More sensitive to the number of shared terms.
5. Dice Similarity:
◦ A variant of Jaccard with a simplified denominator
◦ Dice Similarity = 2 × |A ∩ B| / (|A| + |B|)
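The three measures above can be sketched in Python; this is a minimal illustration, and the example term sets and vectors are hypothetical:

```python
import math

def cosine_similarity(v, w):
    """Cosine of the angle between two term-weight vectors
    (0 = orthogonal / no similarity, 1 = identical direction)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over two term sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice_similarity(a, b):
    """2 × |A ∩ B| / (|A| + |B|) — like Jaccard, but with a simpler denominator."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

For example, for the query terms {"oil", "spills"} and a document containing {"oil", "price"}, Jaccard gives 1/3 while Dice gives 1/2, showing Dice's milder denominator.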
Advanced Concepts:
1. Thresholds in Search:
◦ Used to filter results based on similarity values.
◦ Only items exceeding a specified threshold are considered hits.
2. Hierarchical Clustering:
◦ Groups items into clusters using centroids (average vectors of items in a
cluster).
◦ Risks include missing relevant items if centroids fail to capture
individual item relevance.
Figures like vector examples and clustering hierarchies illustrate how similarity
measures are applied to real-world data, showing relationships between query results
and document vectors.
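The threshold and centroid ideas above can be sketched as follows; the function names and the pluggable similarity callback are illustrative, not taken from the text:

```python
def centroid(vectors):
    """Average vector of the items in a cluster (the cluster's centroid)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def hits_above_threshold(query, items, similarity, threshold):
    """Return (index, score) pairs for items whose similarity to the
    query exceeds the threshold — only these count as hits."""
    scored = ((i, similarity(query, v)) for i, v in enumerate(items))
    return [(i, s) for i, s in scored if s > threshold]
```

Note the risk mentioned above: if the query is compared only to a cluster centroid, an individually relevant item inside a low-scoring cluster can be missed.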
Key Concepts
1. HMM Approach to Retrieval:
◦ HMMs use a "noisy channel" analogy: the query is the observed output,
and the relevant documents are the unknown keys. The noisy channel
represents the mismatch between the way the document's author
expresses ideas and the way the user formulates the query.
◦ This model suggests that given a query, we can estimate the probability
that a specific document is relevant by computing P(D is R∣Q), i.e., the
probability that document D is relevant to query Q.
The HMM process works by moving through the states of the document (the words
or terms in the document), generating query terms as output at each state
transition. In an ideal scenario:
1. Define States: The words or stems in the document are treated as states.
2. Estimate Transition Probabilities: Calculate how likely a word or term
transitions to the next word in the document.
3. Estimate Output Distributions: Determine the probability of a query term
being generated from each state (word or stem).
4. Compute Relevance: Given a query Q, calculate the probability P(D is R∣Q),
the likelihood that a document D is relevant to the query Q.
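The four steps above can be approximated with a much-simplified sketch: the HMM is collapsed to a single-state unigram model whose output distribution comes from the document's term counts, with add-one smoothing. The smoothing constant and vocabulary size are assumptions for illustration, not values from the text:

```python
from collections import Counter

def relevance_score(query_terms, doc_terms, vocab_size, smoothing=1.0):
    """P(Q | D) under a unigram model of the document.
    Each query term is treated as an 'output' whose probability is
    estimated from the document's term counts, with Laplace smoothing
    so unseen query terms do not zero out the score."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    score = 1.0
    for t in query_terms:
        score *= (counts[t] + smoothing) / (total + smoothing * vocab_size)
    return score
```

Documents can then be ranked for a query Q by sorting on this score, which mirrors step 4 (computing the likelihood that each document is relevant).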
Ranking Algorithms
Introduction to Ranking Algorithms:
• Ranking algorithms use similarity measures to order search results, placing the
most relevant items at the top and the least relevant ones at the bottom.
• Traditional Systems: In early Boolean systems, items were ordered by their
entry date, not relevance to the user's query.
• Modern Systems: Ranking has become a common feature with the
introduction of statistical similarity techniques, especially with the growing
size and diversity of data sources like the internet.
Ranking in Commercial Systems:
• Commercial systems take into account the physical proximity of query terms
and related words within the document.
• Proximity Factor: If query terms and related terms appear in close proximity
(same sentence or paragraph), the item is ranked higher.
• The ranking score decreases as the physical distance between query terms
increases.
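One way to realize the proximity factor is a boost that decays with the token distance between two query terms. This is a hypothetical scoring rule for illustration, not the exact formula of any commercial system:

```python
def proximity_boost(positions_a, positions_b, max_gap=10):
    """Boost inversely proportional to the smallest token distance between
    occurrences of two query terms; 0 if they never appear within max_gap
    tokens of each other (roughly, not in the same sentence/paragraph)."""
    best = min((abs(i - j) for i in positions_a for j in positions_b),
               default=None)
    if best is None or best > max_gap:
        return 0.0
    return 1.0 / max(best, 1)  # closer terms -> larger boost
```

The boost would typically be combined with the base similarity score, so items where the query terms co-occur closely rank higher.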
User Interface Considerations:
• Although ranking produces a score for each item, displaying scores to the user
can be misleading, as differences may be either very small or very large.
• It’s better to show the general relevance of items instead of focusing on
specific scores to avoid confusion.
Relevance Feedback
Relevance Feedback Definition: Relevance feedback is a technique where the
system improves future search queries by using relevant items that were found. It
adjusts the original query based on the relevance of the retrieved items.
Relevance feedback addresses the mismatch between a user's initial query and
their actual information need: the user refines the query based on relevant
items they find, or the system automatically expands the query using a
thesaurus. The key idea is to adjust the original query to give more weight to
terms from relevant items and reduce the weight of terms from non-relevant
items, improving the chances of returning more relevant results in future
searches.
Rocchio's work in 1965 introduced the concept of relevance feedback, where query
terms are reweighted based on their occurrence in relevant and non-relevant items.
The formula used for this process increases the weight of relevant terms (positive
feedback) and decreases the weight of irrelevant terms (negative feedback). However,
most systems emphasize positive feedback, as it has shown better results in refining
queries.
• Positive Feedback: Terms from relevant items are given higher weight to
increase the likelihood of retrieving similar relevant items.
• Negative Feedback: Terms from non-relevant items are given lower weight to
avoid retrieving irrelevant items in the future.
• Impact of Positive Feedback: Positive feedback helps move the query closer
to the user’s information needs.
• Impact of Negative Feedback: While negative feedback can reduce the
relevance of non-relevant items, it does not always help bring the query closer
to relevant items.
One challenge is handling terms in the original query that don't appear in relevant
items, which might lead to reducing their importance even if they are still significant
to the user. This issue has been addressed in various systems to maintain the original
query's integrity.
This technique has shown better performance than manual query enhancement and is
particularly useful when users enter queries with few terms.
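Rocchio's reweighting can be sketched as below. The α, β, γ values are conventional illustrative defaults (not prescribed by the text), and clipping negative weights to zero reflects the emphasis on positive feedback noted above:

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio reweighting: move the query vector toward the centroid of
    relevant items (positive feedback) and away from the centroid of
    non-relevant items (negative feedback)."""
    dims = len(query)

    def centroid(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[k] for v in vectors) / len(vectors) for k in range(dims)]

    r = centroid(relevant)       # average of relevant item vectors
    nr = centroid(non_relevant)  # average of non-relevant item vectors
    # Clip negatives to zero: most systems emphasize positive feedback.
    return [max(0.0, alpha * query[k] + beta * r[k] - gamma * nr[k])
            for k in range(dims)]
```

With β > γ, a term that appears only in relevant items gains weight even if it was absent from the original query, which is how the query "moves closer" to the user's information need.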
Selective Dissemination of Information Search
1. Profile Creation: The user defines a profile, which is a static search statement
or a set of preferences regarding the type of information they are interested in.
This profile is similar to a stored query but differs because it reflects broader,
ongoing information needs rather than a specific search.
2. Continuous Comparison: New information that enters the system is
automatically compared with the user’s profile. If the incoming information
matches the profile, it is delivered to the user’s inbox, often asynchronously.
6. Example Systems:
◦ Logicon Message Dissemination System (LMDS): This system treats
profiles as static databases and uses algorithms to match incoming items
to profiles. It employs a "trigraph" algorithm to quickly identify profiles
that do not match incoming items.
◦ Personal Library Software (PLS): This system accumulates
information and periodically runs user profiles against a database, losing
near real-time delivery but enhancing the retrospective search.
◦ Retrievalware & InRoute: These systems use statistical algorithms and
techniques like inverse document frequency to match items to user
profiles, even when no historical data is available.
7. Dimensionality Reduction and Classification: In more advanced systems,
methods like Latent Semantic Indexing (LSI) and statistical classification
techniques (e.g., linear discriminant analysis, logistic regression) are used to
reduce the complexity of the system and improve the accuracy of profile-item
matching.
8. Neural Networks for SDI: Neural networks are being explored to enhance
SDI systems by allowing the system to "learn" patterns in the data. These
networks can adjust weights in response to incoming items, improving
relevance detection and profile matching over time.
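Step 2 above (continuous comparison) can be sketched as matching each incoming item against every stored profile. The dissemination threshold and the sparse dictionary vectors are assumptions for illustration:

```python
import math

def cosine(v, w):
    """Cosine similarity over sparse term->weight dictionaries."""
    dot = sum(v.get(t, 0.0) * x for t, x in w.items())
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

def disseminate(item_vector, profiles, threshold=0.3):
    """Compare one incoming item against every stored user profile and
    return the users whose profile similarity exceeds the threshold,
    so the item can be delivered to their inboxes."""
    return [user for user, profile in profiles.items()
            if cosine(item_vector, profile) > threshold]
```

Unlike a retrospective search, the "database" here is the set of profiles, and each new item is the query — which is why systems such as LMDS treat profiles as a static database.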
Summary: Two main approaches to generating queries are Boolean and natural
language. Integrating Boolean and weighted systems models presents challenges,
particularly in interpreting logic operators and associating weights with query terms.
Approaches like fuzzy sets, P-norm models, and Salton’s refinement method aim to
address these issues and improve retrieval accuracy.
In the context of weighted searches with Boolean systems, the key challenge arises
when integrating Boolean operators (AND, OR, NOT) with weighted index systems.
Boolean systems, by definition, retrieve results based on strict inclusion or exclusion
of query terms, but when weights are introduced to the terms, they complicate the
process.
Issues:
• Boolean operators and weights: When using the traditional Boolean
operators, AND and OR, in a weighted environment, the result may be too
restrictive or too general. For instance, an AND operator in its strict form
might retrieve only those items that strictly satisfy the condition, while OR
would retrieve too many, making the results less relevant. Salton, Fox, and Wu
highlighted that using the strict Boolean definitions could lead to suboptimal
retrieval results.
• Lack of ranking: A pure Boolean system doesn't account for the relevance of
retrieved items; all matches are treated equally, whereas weighted systems
prioritize certain terms over others based on their assigned importance. This
absence of ranking is a significant issue when Boolean queries are combined
with weights.
1. Fuzzy Set Approach: Fox and Sharat proposed a fuzzy set approach that
introduces the concept of "degree of membership" to a set, which helps in
interpreting AND and OR operations more flexibly. The degree of membership
for these operators can be adjusted, providing a more nuanced result than the
strict Boolean interpretation. This approach uses the Mixed Min and Max
(MMM) model, which calculates similarity based on linear combinations of the
minimum and maximum weights of the terms involved in the query.
2. P-norm Model: Another approach involves using the P-norm model, which
assigns weights to the terms in both the query and the items being searched.
This model represents terms as coordinates in an n-dimensional space, similar
to the Cosine similarity technique. For an OR query, the "worst" case is when
all terms have a weight of zero, and for an AND query, the "ideal" case is when
all terms have a weight of one. The best-ranked documents will either have the
maximum distance from the origin (for OR queries) or the minimal distance
from the ideal unit vector (for AND queries).
Example:
In a weighted Boolean query, if the term "Computer" has a high weight, the retrieval
will prioritize documents containing this term. However, when combined with other
terms (like "sale"), the system adjusts the results based on the term weights. As the
weight of a term (say "sale") changes from 0.0 to 1.0, the result gradually shifts from
items containing only "Computer" to a broader set including items related to "sale."
This allows for a more flexible and relevant set of search results compared to
traditional Boolean methods. The goal is to balance strict Boolean logic with
weighted significance to provide results that better match the user's expectations.
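The Mixed Min and Max (MMM) model described above can be sketched as linear blends of the minimum and maximum term weights. The 0.7/0.3 coefficients are illustrative choices, not prescribed values:

```python
def mmm_or(weights, c1=0.7, c2=0.3):
    """MMM OR: a linear blend dominated by the maximum term weight,
    softer than strict Boolean OR."""
    return c1 * max(weights) + c2 * min(weights)

def mmm_and(weights, c1=0.7, c2=0.3):
    """MMM AND: a linear blend dominated by the minimum term weight,
    softer than strict Boolean AND."""
    return c1 * min(weights) + c2 * max(weights)
```

This reproduces the weight-shift behavior in the example: for "Computer" AND "sale", raising the weight of "sale" from 0.0 toward 1.0 smoothly raises the AND score, instead of the all-or-nothing result of strict Boolean logic.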
2. Indexing Techniques:
◦ Lycos: Focuses on retrieving home pages for indexing.
◦ AltaVista: Indexes all text on a site for detailed results.
Both systems provide users with URLs linked to indexed content.
3. Ranking Algorithms:
Retrieved items are ranked using statistical word occurrence patterns to help
users focus on relevant results.
4. Intelligent Agents:
These are automated tools designed to enhance search capabilities:
◦ Operate autonomously.
◦ Communicate with sites to collect relevant data.
◦ Adapt, reason, and learn based on user needs and patterns.
◦ Examples of reasoning include rule-based, knowledge-based, and
evolution-based approaches.
◦ Intelligent agents optimize searches by learning user preferences and
improving their methods over time.
Suppose you are researching "electric vehicles" and want updated information
continuously. You can use an Intelligent Agent:
The agent autonomously visits websites like tesla.com or evnews.com to
collect data.
It adapts to your preferences (e.g., focusing on cost-effective EVs) and learns
which articles you prefer to refine future searches.
Searching Hypertext:
1. What is Hypertext?
Hypertext consists of interconnected items, often accessed via hyperlinks. A
hyperlink is an embedded link to another item that can be instantiated by
clicking on the item reference; frequently hidden from the user is a URL
associated with the text being displayed.
2. Types of Hyperlinks:
◦ Links to essential objects (e.g., embedded images).
◦ Links to supporting or related topics.
3. Static Networks:
Hypertext creates a static network of linked items, allowing users to navigate
through related content manually by following links.
4. Search in Hypertext:
◦ Users explore linked items starting from a given node.
◦ The result is a network diagram representing interrelated items.
5. Automated Hyperlink Traversal:
Automated systems can follow hyperlinks to gather additional information,
which can refine search queries and results.
6. Advanced Applications:
◦ Systems like Pointcast and FishWrap deliver tailored information
directly to users.
◦ Collaborative tools like Firefly and Empirical Media learn user
preferences through interaction, leveraging insights from other users to
enhance recommendations.
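The automated hyperlink traversal in item 5 can be sketched as a breadth-first walk over a link graph. The depth limit and the dictionary representation of links are assumptions for illustration:

```python
from collections import deque

def traverse(start, links, max_depth=2):
    """Breadth-first traversal of a hypertext network: follow hyperlinks
    out to max_depth links from the start node, visiting each item once.
    `links` maps each item to the items its hyperlinks point to."""
    seen = {start}
    frontier = deque([(start, 0)])
    order = []
    while frontier:
        node, depth = frontier.popleft()
        order.append(node)
        if depth < max_depth:
            for nxt in links.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return order
```

The visited set corresponds to the network diagram of interrelated items described in item 4, and the gathered pages could then be used to refine the query.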
INFORMATION VISUALIZATION
Information retrieval systems have historically focused on indexing, searching, and
clustering, neglecting information display due to technological limitations and
academic interests. However, the maturation of visualization technologies and the
growing demand for sophisticated information presentation necessitate a shift
towards visual computing. Information visualization, drawing from cognitive
engineering and perception theories, can optimize search results display, reducing
user overhead and enhancing understanding.
1. Philosophical Foundation:
◦ Plato’s observation laid the groundwork for understanding how the mind
perceives and interprets the real world.
◦ The mind processes inputs from the physical world (e.g., sensory data)
and transforms them into meaningful signals.
2. Need for Visualization:
◦ Text-only interfaces are limited in helping the brain utilize its advanced
processing capabilities.
◦ Visualization bridges this gap by leveraging the brain's ability to process
images and relationships between data points.
3. Early Contributions:
• Gestalt psychologists believe the mind organizes sensory input into unified
mental representations, guided by rules such as:
◦ Proximity: Objects close to each other are perceived as a group.
◦ Similarity: Similar objects are grouped together.
◦ Continuity: The mind interprets figures as continuous patterns rather
than fragmented shapes.
◦ Closure: The mind fills in gaps to perceive a whole (e.g., dashed lines
forming a square are still seen as a square).
◦ Connectedness: Linked or uniform elements are perceived as a single
unit.
Scenario:
1. Without Visualization:
The search engine provides a textual list of 1,000 articles, sorted by relevance.
5. Implications for Human-Computer Interaction
Key Challenges
Overview
Information visualization technologies help improve how data and search results are
presented to users. These technologies are used across various fields, from weather
forecasting to architectural design. Specifically, in Information Retrieval Systems,
they aim to enhance two main aspects:
1. Document Clustering
• Modern search systems use complex algorithms that can make it hard to
understand how queries relate to results.
• Visualization tools show terms (including synonyms or related words) used in
the search and their impact on the retrieved results.
3. Structured Databases & Link Analysis
Visualization Techniques
Hierarchical Representation
• Useful for data that follows a tree structure, like genealogies or organizational
charts.
• Examples:
◦ Cone-Tree: A 3D representation where child nodes form a cone under a
parent node. Users can rotate and navigate the tree.
◦ Perspective Wall: Displays information in three sections—focused in
the center and out-of-focus on the sides—helping users keep context
while zooming in.
◦ Tree Maps: Utilize screen space by subdividing rectangles based on
parent-child relationships. Box sizes and locations indicate relationships
and relevance.
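The Tree Map subdivision can be sketched with the classic slice-and-dice layout: each child receives a strip of the parent rectangle proportional to its weight. This is a one-level sketch; alternating the split direction at each nesting level is left to the caller:

```python
def treemap(weights, x, y, w, h, horizontal=True):
    """Slice-and-dice tree map layout: subdivide the rectangle (x, y, w, h)
    into strips whose sizes are proportional to each child's weight, so box
    size conveys relative importance."""
    total = sum(weights)
    rects = []
    offset = 0.0
    for wt in weights:
        frac = wt / total
        if horizontal:  # slice left-to-right
            rects.append((x + offset, y, w * frac, h))
            offset += w * frac
        else:           # slice top-to-bottom
            rects.append((x, y + offset, w, h * frac))
            offset += h * frac
    return rects
```

Recursing into each child rectangle with `horizontal` flipped yields the nested parent-child boxes described above.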
User-Centric Interfaces
1. Envision System:
◦ Combines scatterplots, query editing, and bibliographic summaries.
◦ Provides an interactive, user-friendly environment to explore search
results.
2. Veerasamy and Belkin’s Bar Visualization:
Goals of Visualization