
Irs Unit-4 Modified

The document discusses search statements and binding in information retrieval, emphasizing their role in refining user queries to match system capabilities. It also covers similarity measures, ranking algorithms, and the importance of cognition and perception in information visualization, highlighting how these concepts enhance data understanding. Additionally, it outlines technologies and techniques used in information visualization to improve user interaction with complex data.

Uploaded by

Balle Manasa
Copyright
© All Rights Reserved

Search Statements and Binding

Search Statements
 Definition: Search statements are expressions of an information need created by users to
locate specific concepts or items.
 Characteristics:
 They can use Boolean logic (e.g., AND, OR, NOT) or natural language.
 Users may assign different weights to concepts in the search statement to emphasize
their importance.
 Purpose: The goal is to logically narrow down the total set of items to a smaller, relevant
cluster that matches the user’s information needs.

Binding
 Definition: Binding is the process of refining and adapting the search statement into more
specific forms for processing by a search system. It connects the user’s vocabulary and
experiences with the system’s capabilities.
1. First Level of Binding:
 The user creates a search statement that logically subsets the total item space to
relevant clusters.
2. Second Level of Binding:
 The search statement is parsed and translated into the search system’s metalanguage
for processing.
 Examples of Binding in Systems:
 Statistical Systems:
 Identify processing tokens (e.g., words or phrases) and assign weights
based on their frequency in the search statement.
 Natural Language Systems:
 Use algorithms to analyze syntax and semantics, similar to indexing
techniques.
 Concept Systems:
 Map the search statement to pre-defined concepts used for indexing
items.
3. Final Level of Binding:
 The refined search is applied to a specific database.
 The binding process at this level considers:
 Statistics: Such as the frequency of terms in the database.
 Semantics: The meaning and relationships of terms within the database.

Examples of Statistics Used in Binding


 Document Frequency: The number of documents containing a specific term.
 Total Frequency: The total occurrences of a term across all documents.
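The two statistics above can be sketched in a few lines of Python. This is only an illustration: the function name `term_statistics` and the whitespace tokenizer are assumptions, not part of any particular retrieval system.

```python
from collections import Counter


def term_statistics(documents):
    """Return (document_frequency, total_frequency) for a list of texts.

    Tokenization is a plain lowercase whitespace split (an assumption
    made to keep the sketch short).
    """
    document_frequency = Counter()
    total_frequency = Counter()
    for text in documents:
        tokens = text.lower().split()
        total_frequency.update(tokens)          # every occurrence counts
        document_frequency.update(set(tokens))  # each document counts once per term
    return document_frequency, total_frequency


docs = ["oil prices rise", "oil and gas oil", "gas prices"]
df, tf = term_statistics(docs)
print(df["oil"], tf["oil"])  # document frequency and total frequency of "oil"
```

Here "oil" appears in two documents but three times overall, which is exactly the difference between the two statistics.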
Indexing Techniques in Binding
1. Concept Indexing Systems:
 Use statistical algorithms on a representative sample of the database to define
concepts.
2. Natural Language Indexing:
 Apply algorithms that are independent of any specific database or corpus.

Length of Search Statements


 The length of search statements impacts the retrieval system's ability to find relevant items:
 Longer Search Statements:
 Provide more context and improve the system’s ability to locate relevant
items.
 Example: Profiles used in Selective Dissemination of Information (SDI)
systems often contain 75–100 terms.
 Shorter Search Statements:
 Common on the Internet, where typical queries are only 1–2 words long.
 These reduce the effectiveness of advanced retrieval techniques.

4.2 Similarity Measures

1. Searching and Similarity


 Searching involves comparing a user’s search query to items in a database.
 Similarity can be measured for an entire item or specific parts of it (e.g., paragraphs or word
chunks).
 The most similar part of an item is used to determine its overall similarity.

2. Characteristics of Similarity Measures


 A similarity formula shows how closely a query matches an item:
 Higher similarity means a better match.
 Zero similarity indicates no match.
 Various formulas exist to calculate similarity, depending on the method.

3. Popular Similarity Formulas


a. Sum of Products Similarity Measure
 Compares two items by multiplying the weights of corresponding terms and summing the results.
 When used with a query, it calculates the similarity of every item to the query.
 Requires normalization to account for item length differences and ensure results are between
0 and 1.
b. Croft Similarity Formula
 Considers the frequency of terms in an item.
 Formula includes constants and factors like:
 TFij: Frequency of term i in item j.
 IDFi: Inverse document frequency of term i (importance of the term across all items).
 maxfreqj: Maximum frequency of any term in item j.
 Constants like C and K help fine-tune results, with K often ranging from 0.3 to 0.5.
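The formula itself is not reproduced in these notes. A form commonly given in the textbook literature is SIM = sum over query terms of (C + IDFi) * fij, with fij = K + (1 - K) * TFij / maxfreqj. The sketch below assumes that form. The function name, the dictionary inputs, and the choice to skip query terms that do not occur in the item are all assumptions made for illustration.

```python
def croft_similarity(query_terms, item_tf, idf, C=1.0, K=0.3):
    """Croft-style similarity of one item to a query (assumed form).

    query_terms: iterable of query terms
    item_tf:     dict term -> frequency of the term in this item (TFij)
    idf:         dict term -> inverse document frequency (IDFi)
    C, K:        tuning constants
    """
    maxfreq = max(item_tf.values(), default=0)
    if maxfreq == 0:
        return 0.0
    score = 0.0
    for term in query_terms:
        tf = item_tf.get(term, 0)
        if tf == 0:
            continue  # assumption: absent terms contribute nothing
        f = K + (1 - K) * tf / maxfreq  # normalized term frequency
        score += (C + idf.get(term, 0.0)) * f
    return score


score = croft_similarity(
    ["oil", "gas"],
    item_tf={"oil": 4, "gas": 2},
    idf={"oil": 1.0, "gas": 2.0},
    C=1.0,
    K=0.5,
)
print(score)
```

Dividing by maxfreqj keeps long items, which naturally repeat terms more often, from dominating purely because of their length.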

c. Salton Similarity Formula


 Used in the SMART system, treating queries and items as vectors in n-dimensional space.
 Calculates similarity using the Cosine formula:
 A Cosine value near 1 means the query and item are very similar.
 A value near 0 indicates no relation.
 Modified by Salton and Buckley, using factors like:
 Term frequency in queries (TF).
 Maximum term frequency in queries (maxfreq).
 Inverse document frequency (IDF).
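The Cosine formula divides the sum of products of the two weight vectors by the product of their lengths. A minimal sketch follows, assuming the query and the item are each represented as a dictionary from term to weight (a representation chosen here for brevity, not one prescribed by SMART).

```python
import math


def cosine_similarity(query, item):
    """Cosine of the angle between two sparse term-weight vectors."""
    shared = set(query) & set(item)
    dot = sum(query[t] * item[t] for t in shared)
    query_norm = math.sqrt(sum(w * w for w in query.values()))
    item_norm = math.sqrt(sum(w * w for w in item.values()))
    if query_norm == 0 or item_norm == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (query_norm * item_norm)


print(cosine_similarity({"oil": 1.0, "gas": 1.0}, {"oil": 2.0, "gas": 2.0}))
print(cosine_similarity({"oil": 1.0}, {"coal": 3.0}))
```

The first call compares vectors that point in the same direction, so the value is essentially 1 even though the weights differ in size. The second call shares no terms and yields 0.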

d. Jaccard and Dice Similarity Measures


 Adjust the normalization factor (the denominator) to account for common terms:
 Jaccard: Measures overlap between query and item terms relative to their combined
terms. With non-negative term weights the result lies between 0 and 1.
 Dice: Simplifies the Jaccard denominator and doubles the numerator, making
calculations easier.
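For the simplest case, where the query and the item are plain sets of terms (binary weights), the two measures reduce to the set formulas below. This is a sketch of that special case only; weighted versions replace the set sizes with sums of weights.

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| for two sets of terms."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def dice(a, b):
    """2 * |A intersect B| / (|A| + |B|) for two sets of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


query = {"oil", "gas", "prices"}
item = {"gas", "prices", "opec"}
print(jaccard(query, item), dice(query, item))
```

With two of four distinct terms shared, Jaccard gives 2/4 and Dice gives 4/6, so Dice rewards the same overlap slightly more.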

Hidden Markov Models (HMM) Techniques


 A Hidden Markov Model (HMM) treats each document as a statistical process that passes
through a sequence of states and emits terms; documents that are likely to have produced the
user’s query are judged relevant.
 The query the user enters is the output, and the relevant documents are the unknown keys
we're trying to find.
 The "noisy channel" represents the gap between how an author writes and how a user forms
their query.
How it works:
 We apply Bayes’ rule to estimate the probability that a document is relevant given the query.
 Instead of estimating document relevance directly (which is hard), we estimate the
probability that the document would generate the query and let Bayes’ rule turn that into a
relevance score.
Key Parts of HMM:
 States: These represent things like words or key parts in the document.
 Transition Matrix: This shows the probability of moving from one word or state to another
in a document.
 Output Symbols: These are the possible queries that can be seen.
 Probability of Output: This is the likelihood of seeing a query term for a given word in the
document.
In simple terms, HMM looks at words in documents, moves between them, and helps figure out
which document best matches a query.
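One simple way to make this concrete is a two-state model: each query term is emitted either from a "document" state or from a "general language" state. The sketch below assumes that formulation, a fixed mixing weight `a_doc`, and a uniform prior over documents, so ranking by P(query | document) gives the same order as ranking by P(document | query). All names and the value 0.7 are illustrative assumptions.

```python
from collections import Counter


def query_likelihood(query, document, collection, a_doc=0.7):
    """P(query | document) under a two-state mixture (assumed model).

    query, document, collection: lists of tokens; collection holds the
    tokens of all documents and stands in for general language.
    """
    doc_counts, doc_len = Counter(document), len(document)
    coll_counts, coll_len = Counter(collection), len(collection)
    likelihood = 1.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_general = coll_counts[term] / coll_len if coll_len else 0.0
        # Emit the term from the document state or the general state.
        likelihood *= a_doc * p_doc + (1 - a_doc) * p_general
    return likelihood


docs = [["oil", "prices", "rise"], ["gas", "prices", "fall"]]
collection = [tok for doc in docs for tok in doc]
query = ["oil", "prices"]
scores = [query_likelihood(query, doc, collection) for doc in docs]
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

The general-language state keeps a document from scoring zero just because one query term is missing from it.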

Ranking Algorithms
Ranking is the process of ordering search results so that the most relevant items appear first, which
lets users find what they are looking for without scanning the whole result list.

How Ranking Works


1. Purpose of Ranking:
 After identifying items relevant to the query, ranking arranges them in order of
relevance.
 It reduces the user's effort by prioritizing the most likely useful items.
2. Use of Similarity Measures:
 The similarity value calculated for each item during the search process is used to
rank the results.
 Items are sorted from the most relevant to the least relevant based on this value.
3. Modern Systems:
 Most modern systems use statistical similarity techniques and ranking to handle large
numbers of search results effectively.
 Heuristic rules (rules of thumb) are often used to improve the ranking.

Ranking in RetrievalWare
RetrievalWare is an example of a system that uses a two-step ranking process:
1. Coarse Grain Ranking:
 Focuses on the presence of query terms in items.
 Uses a weighted formula based on:
 Completeness: How many query terms are found in the item.
 Contextual Evidence: Relevance of the item's content to the query.
 Variety: Diversity of related terms.
 Semantic Distance: Meaning relationships between terms.
 This step provides an initial rank without considering the physical proximity of query
terms.
2. Fine Grain Ranking:
 Considers the exact location and proximity of query terms within the item.
 Items where related terms appear closer together (e.g., in the same sentence or
paragraph) are judged more relevant.
 Refines the initial rank from the coarse grain process for better accuracy.

Key Factors in Ranking


1. Completeness:
 Measures the proportion of query terms found in the item compared to the total
number of terms in the query.
 If query terms have weights assigned, these weights are factored into the ranking
value.
2. Proximity:
 Items with query terms appearing close together (e.g., in the same sentence) are
considered more relevant.
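The two factors can be sketched as follows. This is not RetrievalWare's actual formula, which is not given here. The function names and the "closest pair" definition of proximity are assumptions chosen for illustration.

```python
def completeness(query_weights, item_tokens):
    """Weighted share of query terms that occur in the item (0 to 1)."""
    total = sum(query_weights.values())
    if total == 0:
        return 0.0
    present = set(item_tokens)
    found = sum(w for term, w in query_weights.items() if term in present)
    return found / total


def closest_pair_distance(query_terms, item_tokens):
    """Smallest token distance between two different query terms.

    Returns None when fewer than two distinct query terms occur.
    """
    hits = [(i, tok) for i, tok in enumerate(item_tokens) if tok in query_terms]
    best = None
    # The closest pair of different terms is always adjacent in this list.
    for (i, a), (j, b) in zip(hits, hits[1:]):
        if a != b and (best is None or j - i < best):
            best = j - i
    return best


tokens = "the oil market saw gas prices and oil prices rise".split()
weights = {"oil": 2.0, "prices": 1.0, "opec": 1.0}
print(completeness(weights, tokens))
print(closest_pair_distance(set(weights), tokens))
```

A coarse-grain step could rank on `completeness` alone, and a fine-grain step could then favor items with a smaller `closest_pair_distance`.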

## Cognition and Perception in Information Visualization

When we talk about **information visualization**, we mean using visual tools like charts, graphs,
and maps to help people understand data. To make these tools effective, we need to consider how
our brains work—this involves two main concepts: **cognition** and **perception**.

### What is Perception?

**Perception** is how we interpret what we see. Here are some key points:

- **Quick Recognition**: Our brains can quickly notice things like color and shape without much
thought. For example, if a red dot is placed on a chart, our eyes will immediately spot it.

- **Grouping Information**: We tend to group similar items together. For example, if several dots
are close together, we see them as a cluster. This helps us make sense of data quickly.
- **Focusing Attention**: We can only pay attention to a limited amount of information at once.
Good visualizations highlight the most important data so we can focus on what matters.

### What is Cognition?

**Cognition** involves the mental processes we use to think and understand. Here’s how it relates
to information visualization:

- **Processing Information**: When we look at a visualization, our brains work to understand the
data. We compare numbers, look for patterns, and draw conclusions.

- **Using Memory**: Visuals help us remember information better. For instance, a well-designed
graph can show trends over time, making it easier to recall key points later.

- **Different Thinking Styles**: People think differently. Some may prefer visuals that are simple
and clear, while others might like more detailed information. Good visualizations can cater to these
different preferences.

### How Cognition and Perception Work Together

Cognition and perception work hand in hand in information visualization:

- **Making Understanding Easier**: Good visuals take advantage of how we perceive things to
make complex information easier to understand. For example, a pie chart shows parts of a whole
clearly, allowing us to grasp proportions quickly.

- **Creating Insights**: Well-designed visuals can lead to sudden realizations or insights about the
data. When the design aligns with how our brains naturally work, it becomes easier to see important
trends or relationships.

Technologies Used in Information Visualization for Information Retrieval Systems


Information visualization is essential in information retrieval systems as it helps users understand
complex data through visual representations. Various technologies and tools facilitate this process,
making it easier to analyze, interpret, and communicate data effectively.
Key Technologies in Information Visualization
1. Data Visualization Software: These tools allow users to create visual representations of
data easily. Popular examples include:
 Tableau: Known for its user-friendly interface, Tableau enables users to create
interactive dashboards and visualizations from various data sources.
 Power BI: A Microsoft product that integrates with other Microsoft services, Power
BI is widely used for business analytics and reporting.
 D3.js: A JavaScript library that helps developers create dynamic and interactive
visualizations for the web.
2. Graphical User Interfaces (GUIs): Many visualization tools come with GUIs that allow
users to interact with data visually. Users can drag and drop elements, filter data, and
customize visualizations without needing extensive programming skills.
3. Interactive Visualizations: These allow users to manipulate the data displayed. For
example, users can zoom in on specific areas of a chart or click on elements to reveal more
information. This interactivity enhances user engagement and understanding.
4. Web-Based Visualization Tools: Tools like Google Charts and Chart.js enable users to
create visualizations directly in web applications. These tools often support real-time data
updates, making them suitable for dynamic environments.
5. Database Management Systems (DBMS): Technologies like SQL databases store large
amounts of data that can be visualized. Data retrieval from these databases is crucial for
creating accurate visual representations.
6. Machine Learning Algorithms: Some advanced visualization tools use machine learning to
analyze patterns in data automatically. This can help identify trends and insights that might
not be immediately apparent.

Common Visualization Techniques


1. Charts and Graphs:
 Bar Charts: Useful for comparing different categories.
 Line Graphs: Ideal for showing trends over time.
 Pie Charts: Good for displaying proportions of a whole.
2. Heatmaps: These use color to represent data values across a matrix, making it easy to
identify patterns or areas of interest.
3. Tree Maps: A space-efficient way to visualize hierarchical data using nested rectangles.
4. Scatter Plots: Useful for showing relationships between two variables, helping users
identify correlations.
5. Dashboards: Combine multiple visualizations into one interface, providing an overview of
key metrics and trends at a glance.

Searching the Internet and Hypertext: Simplified


The Internet provides several mechanisms to search and retrieve information. These mechanisms
are based on servers that create indexes of items and allow users to search for them.
Search Mechanisms and Nodes
 Popular Search Nodes:
 Yahoo, AltaVista, and Lycos are examples of systems that index and search the
Internet.
 These systems actively collect textual data from various sites and create searchable
indexes.
 How They Work:
 Lycos: Collects and indexes the home pages of websites.
 AltaVista: Indexes all the text on a website, providing a more comprehensive search.
 The indexed data is linked to the website's URL, enabling users to retrieve the
content.
 Ranking Algorithms:
 Simple ranking methods are used based on statistical data, like word occurrences, to
display the most relevant results to users.

Intelligent Agents in Internet Search


Intelligent Agents are tools that help users by automatically searching and retrieving relevant
information from the Internet.
Key Features of Intelligent Agents:
1. Autonomy:
 Agents work independently without human interaction.
 They make decisions and traverse websites based on pre-set criteria to gather
relevant information.
2. Communication Ability:
 They communicate with websites using universally accepted protocols, like Z39.50,
to retrieve data.
3. Cooperation:
 Agents can work together to achieve shared goals, making the search more effective.
4. Reasoning Capability:
 Rule-Based: Agents follow predefined rules for actions.
 Knowledge-Based: Agents use past actions and outcomes to make decisions.
 Artificial Evolution: Agents evolve and spawn smarter versions to perform tasks
better.
5. Adaptive Behavior:
 Agents evaluate their current situation and adjust their actions accordingly.
6. Trustworthiness:
 Users must trust that agents will retrieve relevant and accessible information without
errors.
Weighted Searches of Boolean Systems (Simplified)
Overview
 There are two main ways to create queries: Boolean (uses operators like AND, OR, NOT)
and natural language.
 Boolean queries can cause issues when combined with weighted systems (where terms have
importance scores).
 Challenges arise due to:
1. How logical operators (AND, OR, NOT) work with weights.
2. The lack of ranking in pure Boolean systems.

Issues with Boolean Systems


 AND operator: Can be too strict, requiring all conditions to match.
 OR operator: Can be too general, accepting almost anything.
 Result: Strict definitions of these operators may not give the results users expect.

Fuzzy Set Approach


 Introduced to deal with the "fuzziness" in Boolean systems.
 What are Fuzzy Sets?
 Instead of items either "belonging" or "not belonging" to a set, fuzzy sets assign a
degree of membership (a value between 0 and 1).
 Helps mix Boolean logic with weighted systems more effectively.
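In the standard fuzzy set definitions, AND takes the minimum membership, OR takes the maximum, and NOT takes one minus the membership. A minimal sketch, with the membership values below made up for illustration:

```python
def fuzzy_and(*memberships):
    """Degree of membership in the intersection: the minimum."""
    return min(memberships)


def fuzzy_or(*memberships):
    """Degree of membership in the union: the maximum."""
    return max(memberships)


def fuzzy_not(membership):
    """Degree of membership in the complement."""
    return 1.0 - membership


# Hypothetical degrees to which one item belongs to each term's set.
item = {"oil": 0.8, "gas": 0.3}

print(fuzzy_and(item["oil"], item["gas"]))             # oil AND gas
print(fuzzy_or(item["oil"], item["gas"]))              # oil OR gas
print(fuzzy_and(item["oil"], fuzzy_not(item["gas"])))  # oil AND NOT gas
```

Each query now yields a degree between 0 and 1 rather than a yes/no answer, so the results can be ranked.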

MMM Technique and Improvements


 Early approaches like the Mixed Min and Max (MMM) model computed similarity from only
the extreme (minimum and maximum) term weights of an item.
 Paice's Improvement:
 Calculated similarity by considering all item weights, not just the maximum or
minimum.
 This provided more accurate and balanced results.
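The difference between the two techniques is easiest to see side by side. In the sketch below, MMM blends only the largest and smallest weights, while the Paice-style function sorts all the weights and gives each successive one a smaller share through a factor `r`. The coefficient values are illustrative assumptions, and the weight list must not be empty.

```python
def mmm_or(weights, c_or=0.7):
    """MMM for OR: mostly the maximum weight, softened by the minimum."""
    return c_or * max(weights) + (1 - c_or) * min(weights)


def mmm_and(weights, c_and=0.7):
    """MMM for AND: mostly the minimum weight, softened by the maximum."""
    return c_and * min(weights) + (1 - c_and) * max(weights)


def paice(weights, r, operator):
    """Paice-style similarity using every weight.

    Weights are sorted descending for "or" and ascending for "and",
    then combined with geometrically decreasing coefficients r**i.
    """
    ordered = sorted(weights, reverse=(operator == "or"))
    numerator = sum((r ** i) * w for i, w in enumerate(ordered))
    denominator = sum(r ** i for i in range(len(ordered)))
    return numerator / denominator


weights = [0.2, 0.5, 0.9]  # hypothetical weights of three query terms in one item
print(mmm_or(weights), mmm_and(weights))
print(paice(weights, 0.5, "or"), paice(weights, 0.5, "and"))
```

Changing the middle weight from 0.5 to another value leaves both MMM results unchanged but shifts both Paice results, which is the improvement described above.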
Introduction to Information Visualization (Simplified)
Key Idea
Information visualization focuses on displaying search results and data in ways that are easy for
users to understand and interact with. It goes beyond treating monitors like paper, using modern
capabilities of electronic displays.

Unique Features of Visualization


1. Modify Data Representation:
 Change how data looks, like using different colors or styles to enhance
understanding.
2. Track Changes:
 Show changes in data while keeping the same visual format (e.g., highlighting new
connections in clusters).
3. Animate Changes:
 Use animations to display data changes over time or across space.
4. Interactive User Input:
 Allow users to navigate between information spaces and adjust visualizations based
on personal preferences.
5. Hyperlinks:
 Enable clickable links to connect related pieces of information.

Purpose of Information Visualization


 Optimize Search Results:
Display search results in ways that make them easier for users to process and analyze.
 Enhance Cognitive Processing:
Use visuals to help users understand the meaning and relationships in data.

Types of Visualization
1. Link Visualization:
 Shows how items are connected or related (e.g., networks of linked documents).
2. Attribute (Concept) Visualization:
 Focuses on the content and relationships between large amounts of data, revealing
patterns or clusters.

Relevance Feedback
 Concept:
 Introduced by Rocchio in 1965, relevance feedback improves search queries by
modifying them based on user judgments about relevant and non-relevant items.
How It Works:
 Positive Feedback:
 Increases the weight of terms in relevant items, making them more important in the
next search.
 Negative Feedback:
 Decreases the weight of terms in non-relevant items, reducing their importance in
future searches.
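Positive and negative feedback are usually combined in one update: the new query is the old query, plus a share of the average relevant item, minus a share of the average non-relevant item. The sketch below follows that general Rocchio-style form. The coefficient defaults and the decision to drop terms whose weight falls to zero or below are assumptions, and a real system might instead keep the original query terms regardless.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Return a revised query vector (dict term -> weight).

    query:        dict term -> weight
    relevant:     list of item vectors judged relevant
    non_relevant: list of item vectors judged non-relevant
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for items, coefficient in ((relevant, beta), (non_relevant, -gamma)):
        if not items:
            continue  # no judgments of this kind were given
        for item in items:
            for term, weight in item.items():
                new_query[term] += coefficient * weight / len(items)
    # Assumption: terms pushed to zero or below are dropped.
    return {term: w for term, w in new_query.items() if w > 0}


revised = rocchio(
    {"oil": 1.0},
    relevant=[{"oil": 1.0, "opec": 2.0}],
    non_relevant=[{"gas": 1.0}],
    alpha=1.0,
    beta=0.5,
    gamma=0.5,
)
print(revised)
```

In this example "oil" gains weight, "opec" is added to the query from the relevant item, and "gas" never enters because its weight would be negative.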

Benefits of Relevance Feedback:


1. Enhances query performance by improving term weights.
2. Expands the query with new terms from relevant items.
3. Has been proven effective in creating better search queries.

Challenges and Solutions:


 Issue: Query terms not found in relevant items might lose weight unfairly.
 Solution: Systems ensure original query terms are retained, even if negative
feedback suggests removing them.
 Improvement:
 Systems present the modified query to users for review, allowing them to adjust
weights and verify added terms.

Impact of Relevance Feedback:


 Early systems like SMART demonstrated its value in refining queries.
 Despite early concerns due to small datasets, relevance feedback has become a vital feature
in modern information systems.
 Examples include techniques like EMIM weighting, which address challenges like retaining
significant query terms.
Jaccard and Dice Similarity Measures
 Both Jaccard and Dice similarity measures change the denominator to better match the data's
characteristics.
 In Cosine similarity, the denominator doesn't consider how many terms are common, which
can lead to small values when the vectors are large, and only a few terms are shared.
 Jaccard Similarity:
 In Jaccard, the denominator depends on the number of terms that are common
between two sets.
 As more terms are shared, the denominator shrinks and the similarity increases;
with non-negative term weights it always stays between 0 and 1.
The Jaccard formula calculates how much two sets overlap relative to their total size.

 Dice Similarity:
 Dice's formula simplifies the denominator compared to Jaccard and adds a factor of 2
in the numerator.
 The Dice measure also normalizes without considering the number of common terms
directly.
In simpler terms, both measures help compare sets, with Jaccard focusing on shared terms
compared to total terms, and Dice adjusting the formula for a bit more emphasis on shared
elements.

Aspects of the Visualization Process


 Pre-attentive Processing
This is the brain's ability to quickly notice basic patterns, like borders or changes in
direction, without much effort. For example, it's easier to spot groups of objects with the
same orientation than different shapes or rotated objects.

 Optical Illusion
Optical illusions can make objects look bigger or smaller depending on their background. To
make small items stand out, bright colors are helpful.
 Colors
Color helps organize and highlight information.
 Hue: The color itself.
 Saturation: How vivid or dull (grayish) the color is.
 Lightness: How light or dark the color is.
 Complementary Colors: Colors like red/green or blue/yellow that, when combined,
make white or gray.
 Depth
Depth is used to show how far away or close things are, using techniques like shading or
perspective. It helps our brain understand 3D space, something we learn early in life.
 Configural Aspects of a Display
This refers to how objects are arranged so we can quickly recognize patterns or changes, like
spotting issues in a system by looking for unusual shapes.
 Spatial Frequency
Our brain detects light and dark changes in images. We see some patterns better than others,
like simple ones with fewer changes, while complex patterns are harder to process.
 Human Sensory Systems
Our brains are good at recognizing horizontal and vertical lines but find diagonal lines harder
to process. Bright colors and movement help us focus on what's important.
In short, our brains use visual tricks like color, depth, and pattern recognition to quickly understand
what’s in front of us. These methods help us focus on the important details and ignore the rest.
