UNIT-4 Information Retrieval Notes
Result: The query "AI in robotics" is most similar to Document 1 and is classified into Category A (Technology).
Diagram
Vector Space Representation:
A simplified 3D diagram with three terms (artificial, intelligence, robotics) as axes:
Document 1 vector points towards all three axes.
Document 2's vector has little weight along these axes, since its main terms ("machine", "learning") lie on other dimensions.
The query vector aligns closely with the Document 1 vector.
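The same similarity computation can be sketched in a few lines of Python. This is a minimal illustration assuming scikit-learn is available; the two documents and the query mirror the example above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "artificial intelligence and robotics",  # Document 1 (Category A: Technology)
    "machine learning advances",             # Document 2
]
query = ["AI in robotics"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # documents as TF-IDF vectors
query_vector = vectorizer.transform(query)     # query mapped into the same space

# Cosine similarity between the query and each document vector
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()
print(f"Most similar: Document {best + 1} (score = {scores[best]:.2f})")

Note that only the shared term "robotics" contributes here; "AI" does not literally match "artificial intelligence", which previews the semantic limitation discussed below.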
Advantages of Vector Space Classification
1. Simple and Intuitive:
o Easy to understand and implement.
2. Flexible:
o Supports various weighting schemes (TF-IDF, binary).
3. Effective:
o Handles large text datasets well.
Limitations
1. High Dimensionality:
o For large vocabularies, document vectors become very high-dimensional and sparse.
2. No Semantic Understanding:
o Ignores the context and relationships between words.
3. Sensitivity to Weighting:
o Performance depends on how terms are weighted.
Applications
Email spam detection.
Sentiment analysis.
Topic categorization.
Document retrieval systems.
Conclusion
Vector space classification is a powerful technique in text analysis and IR. It represents documents and queries in a
geometric space and classifies them based on similarity metrics like cosine similarity. While simple and effective, its
limitations in handling semantics can be mitigated by combining it with advanced approaches like word embeddings
or neural networks.
SVM: Performs well in high-dimensional, sparse feature spaces such as TF-IDF vectors.
Deep Learning: Learns feature representations automatically and suits more nuanced text analysis tasks, but typically needs more data and compute.
Conclusion
Support Vector Machines are a robust and effective method for text and document classification, particularly when
combined with well-prepared feature representations like TF-IDF. While SVM excels in high-dimensional spaces,
modern alternatives like deep learning are also gaining popularity for more nuanced text analysis tasks.
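As a concrete illustration of the TF-IDF-plus-SVM pipeline described above, here is a minimal sketch using scikit-learn; the tiny training set and its labels are hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled training documents
texts = [
    "artificial intelligence and robotics",
    "machine learning advances",
    "football leagues and tournaments",
    "olympic sports events",
]
labels = ["tech", "tech", "sports", "sports"]

# TF-IDF features feeding a linear SVM classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["AI in robotics"]))  # expected: ['tech']

A linear kernel is the usual choice for text because TF-IDF vectors are already high-dimensional and often close to linearly separable.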
Flat clustering is a method in information retrieval where documents are grouped into clusters based on their
similarities, without imposing any hierarchical structure. All clusters are treated as equals, and the clustering process
focuses on dividing the dataset into a predetermined number of groups.
Key Characteristics:
1. No Hierarchy: All clusters sit at the same level; none is nested inside another.
2. Predefined Number of Clusters: The number of clusters (k) is usually specified in advance.
3. Similarity-Based Grouping: Documents are grouped based on their similarity in the feature space.
4. Iterative Refinement: Many flat clustering algorithms iteratively adjust the cluster assignments to improve
the quality.
1. K-Means Clustering
K-means is the most popular flat clustering algorithm. It divides n documents into k clusters based on minimizing
the variance within clusters.
Steps of K-Means:
1. Initialize Cluster Centers: Randomly select k initial centroids (one for each cluster).
2. Assign Documents to Clusters: Assign each document to the nearest cluster based on a similarity measure
(e.g., cosine similarity or Euclidean distance).
3. Recalculate Centroids: Compute the centroid of each cluster based on the assigned documents.
4. Iterate: Repeat steps 2 and 3 until cluster assignments stabilize or a stopping criterion is met.
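A minimal K-means sketch with scikit-learn follows; the four documents are the ones used in the example below, and on such a tiny corpus the resulting clusters can vary with the random initialization:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Artificial intelligence and robotics.",
    "Machine learning advances.",
    "Football leagues and tournaments.",
    "Olympic sports events.",
]

X = TfidfVectorizer().fit_transform(docs)  # documents as TF-IDF vectors
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # steps 1-4 above

for label, doc in zip(km.labels_, docs):
    print(label, doc)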
2. Expectation-Maximization (EM) Clustering
In EM-based clustering, each document is assigned a probability of belonging to each cluster instead of a hard
assignment.
The algorithm iteratively refines these probabilities and estimates the cluster parameters.
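A minimal sketch of this soft assignment, assuming scikit-learn's EM-fitted GaussianMixture; the TF-IDF vectors are first reduced to two dense dimensions because the mixture model needs dense, low-dimensional input:

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

docs = [
    "Artificial intelligence and robotics.",
    "Machine learning advances.",
    "Football leagues and tournaments.",
    "Olympic sports events.",
]

X = TfidfVectorizer().fit_transform(docs)
X_dense = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

gm = GaussianMixture(n_components=2, random_state=0).fit(X_dense)
probs = gm.predict_proba(X_dense)  # soft assignments: P(cluster | document)
print(np.round(probs, 2))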
Example: K-Means with k = 2
Dataset (the four documents listed under Final Clusters below):
1. "Artificial intelligence and robotics."
2. "Machine learning advances."
3. "Football leagues and tournaments."
4. "Olympic sports events."
1. Initialization:
Select two initial centroids randomly (e.g., C1 and C2).
2. Assign to Clusters:
Compute the similarity of each document to C1 and C2, assigning each document to the closest
centroid.
3. Update Centroids:
Recalculate the centroids based on the mean position of documents in each cluster.
4. Repeat:
Reassign documents and update centroids until convergence.
Final Clusters:
Cluster 1 (Technology):
"Artificial intelligence and robotics."
"Machine learning advances."
Cluster 2 (Sports):
"Football leagues and tournaments."
"Olympic sports events.
Applications in Information Retrieval
1. Topic Clustering: Grouping documents or search results by topic.
2. Recommender Systems: Grouping users or items with similar preferences to support recommendations.
Limitations
1. Choice of k: The number of clusters must be specified in advance, which is often unknown.
2. Sensitivity to Initialization: Results of algorithms like K-means depend on the initial centroids.
3. Difficulty with Overlapping Clusters: Flat clustering assumes distinct, non-overlapping clusters, which may
not always align with real-world data.
Conclusion
Flat clustering, especially with algorithms like K-means, is a fundamental approach in information retrieval for
grouping documents into similar categories. While simple and effective, it may require careful tuning and
preprocessing for optimal results.
Hierarchical clustering is a method of grouping data (like documents or items) into a tree-like structure, called a
dendrogram, which shows how similar or different the data points are. Instead of just dividing data into a fixed
number of groups like flat clustering, hierarchical clustering creates clusters at multiple levels.
Key Features
1. Tree Structure (Dendrogram): The results are represented as a tree, where each branch represents a cluster.
o Agglomerative (Bottom-Up): Start with each item as its own cluster and merge them step by step.
o Divisive (Top-Down): Start with all items in one cluster and split them step by step.
Agglomerative (Bottom-Up) Steps:
Step 1: Treat each data point as its own cluster.
Step 2: Find the two most similar clusters and merge them.
Step 3: Repeat until all data points are merged into one large cluster.
Divisive (Top-Down) Steps:
Step 1: Start with all data points in a single cluster.
Step 2: Split the cluster into smaller clusters based on their dissimilarity.
Step 3: Repeat until each data point becomes its own cluster.
Similarity Measurement
The algorithm calculates how similar or different clusters are using distance measures like:
1. Euclidean Distance: The straight-line distance between points in the feature space.
2. Cosine Similarity: Commonly used for text data to measure the angle between vectors.
Dataset: four documents (Doc 1-Doc 4). Step 1 places each document in its own cluster:
o Cluster 1: Doc 1
o Cluster 2: Doc 2
o Cluster 3: Doc 3
o Cluster 4: Doc 4
Result:
At the next level, similar documents merge into topic clusters (e.g., Technology and Sports).
Diagram: Dendrogram
             All Documents
             /           \
        Sports        Technology
        /    \          /    \
    Doc 3  Doc 4    Doc 1  Doc 2
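A minimal agglomerative clustering sketch with SciPy; the four documents are hypothetical stand-ins for Doc 1-Doc 4:

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "artificial intelligence and robotics",  # Doc 1
    "machine learning advances",             # Doc 2
    "football leagues and tournaments",      # Doc 3
    "olympic sports events",                 # Doc 4
]

X = TfidfVectorizer().fit_transform(docs).toarray()

# Bottom-up merging with average linkage and cosine distance
Z = linkage(X, method="average", metric="cosine")

# Cut the dendrogram into two flat clusters for inspection
print(fcluster(Z, t=2, criterion="maxclust"))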
Advantages
1. No Need to Specify Clusters in Advance: Unlike flat clustering, you don’t need to decide the number of
clusters beforehand.
2. Hierarchy Provides Insights: You can see relationships at different levels (e.g., documents about sports versus
technology).
3. Flexible with Different Data Types: Works well with numerical or text data.
Disadvantages
1. Computational Cost: Comparing all pairs of clusters is expensive for large datasets.
2. Irreversible Steps: Once clusters are merged or split, the decision cannot be undone.
Applications
Organizing search results into topic hierarchies.
Grouping similar documents, products, or users at multiple levels of granularity.
Matrix decomposition is a mathematical technique for breaking down a matrix into simpler components, making it
easier to process and analyze data. Latent Semantic Indexing (LSI) is an application of matrix decomposition in
information retrieval to uncover hidden (latent) relationships between terms and documents.
Matrix Decomposition
Matrix decomposition refers to breaking a complex matrix into simpler, interpretable components. In information
retrieval, we often use the term-document matrix A, where:
Rows correspond to terms and columns correspond to documents.
Each cell contains the frequency or weight of a term in a document (e.g., TF-IDF score).
Simplifies large, sparse matrices (many zeros) into smaller, dense ones.
Singular Value Decomposition (SVD) factors the matrix as A = U Σ V^T, where U holds the term vectors, V holds
the document vectors, and Σ is a diagonal matrix of singular values.
Result: The original A can be approximated with fewer dimensions by keeping only the top k singular values in Σ,
reducing noise and redundancy.
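A small worked example of this truncation in NumPy; the term-document matrix values below are made up for illustration:

import numpy as np

# Illustrative 4-term x 3-document matrix A (rows = terms, columns = documents)
A = np.array([
    [2, 0, 1],   # "intelligence"
    [1, 0, 2],   # "robotics"
    [0, 3, 0],   # "football"
    [0, 2, 1],   # "olympics"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(s) @ Vt

k = 2  # keep only the top-k singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))  # rank-k approximation of A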
LSI uses SVD to enhance information retrieval. By reducing the dimensionality of the term-document matrix, it
identifies hidden (latent) semantic relationships between terms and documents. This is especially helpful for dealing
with synonymy (different words with similar meanings).
Steps in LSI:
1. Build the term-document matrix A (e.g., with TF-IDF weights).
2. Apply SVD to decompose A into U, Σ, and V^T.
3. Keep only the top k singular values (truncated SVD) to obtain a reduced concept space.
4. Map documents and queries into this concept space and compare them there.
Dataset: a small collection of technology and sports documents.
After applying SVD, we keep the top 2 singular values (k=2) for simplicity.
The matrix now represents the key semantic concepts:
Concept 1: Technology (Machine Learning, AI, etc.).
Concept 2: Sports.
Query Example:
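A minimal illustrative sketch of such a query, using scikit-learn's TruncatedSVD as the LSI step; the three documents are hypothetical:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "machine learning and artificial intelligence",
    "AI systems and robotics",
    "football leagues and tournaments",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)                  # term-document weights

lsi = TruncatedSVD(n_components=2, random_state=0)  # keep k = 2 concepts
X_lsi = lsi.fit_transform(X)                        # documents in concept space

query = vectorizer.transform(["artificial intelligence"])
q_lsi = lsi.transform(query)                        # fold the query into the same space

print(cosine_similarity(q_lsi, X_lsi)[0])           # similarity to each document

In the reduced space, a document can score well even when it shares no exact term with the query, which is how LSI mitigates synonymy.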
Advantages of LSI
1. Handles Synonymy:
Captures relationships between similar terms (e.g., "AI" and "Artificial Intelligence").
2. Improves Retrieval:
Finds semantically similar documents even if exact words don’t match.
3. Reduces Noise:
Focuses on key patterns and removes irrelevant details.
Limitations of LSI
1. Computational Complexity:
SVD can be slow for large datasets.
2. Fixed Dimensionality:
Requires retraining if new documents are added.
3. Ambiguity:
May struggle with polysemy (words with multiple meanings).
Web search is the process of retrieving relevant information from the vast collection of resources on the internet in
response to a user’s query. Search engines like Google, Bing, and Yahoo are examples of systems that use web
search principles.
1. Crawling
Crawling is the process where search engines discover new and updated content on the web by following links using
automated programs called web crawlers (or bots).
Example:
A crawler starts at a webpage (e.g., www.example.com) and follows all the links on that page to discover
other pages, and so on.
2. Indexing
Once pages are discovered, the information is processed and stored in a database called an index. This involves
analyzing the content, identifying keywords, and storing metadata.
Example:
A page about "machine learning" is indexed under terms like "machine," "learning," "AI," and "technology."
3. Query Processing
When a user submits a search query, the search engine processes it to understand the intent and find the most
relevant documents in the index.
Example:
Query: "Best laptops under $1000"
The search engine understands the keywords "best," "laptops," and "under $1000" to find relevant results.
4. Ranking
Search engines rank the results based on their relevance to the query and other factors, such as page quality,
authority, and user engagement.
Example:
Results are ranked so that the most relevant and credible pages appear at the top.
5. Retrieval and Display
The ranked results are returned to the user on a results page.
Example:
A search result page includes titles, descriptions (snippets), and URLs.
1. Relevance
Example:
For the query "How to bake a cake," a page containing a cake recipe is more relevant than one about the
history of cakes.
3. Query Types
Queries are commonly informational (e.g., "what is machine learning"), navigational (e.g., "Facebook login"), or
transactional (e.g., "buy a laptop online").
Worked Example: Query "Top 10 programming languages in 2025"
1. Crawling:
The search engine’s crawler has already visited and indexed pages about programming languages.
2. Indexing:
Pages with titles and content about programming languages are stored in the search engine’s index.
3. Query Processing:
The search engine identifies keywords like "top 10," "programming languages," and "2025." It understands
the user wants a ranked list of popular programming languages.
4. Ranking:
Pages are ranked based on factors like freshness (e.g., recently updated pages), popularity, and
relevance to the query.
A blog post from January 2025 is likely to rank higher than a 2020 article.
5. Retrieval and Display:
The top results include titles like:
"Top 10 Programming Languages to Learn in 2025"
"The Most Popular Coding Languages in 2025"
Web search engines rely on three key processes to provide efficient and accurate results: web crawling, indexing,
and link analysis. These components work together to discover, organize, and rank web content.
1. Web Crawling
Web crawling is the process of systematically browsing the web to discover and collect data from websites. The tool
used for this task is called a web crawler, spider, or bot.
How It Works:
1. Seed URLs:
The crawler starts with a list of initial URLs (seed URLs).
Example: Start with "https://example.com".
2. Follow Links:
The crawler visits the seed URLs, extracts hyperlinks, and follows them to discover more pages.
Example: A page on "example.com" links to "https://example.com/page1" and
"https://anotherexample.com".
3. Content Extraction:
The crawler retrieves the content of each page (text, metadata, links, etc.).
4. Repeat:
The process continues recursively until a stopping condition (e.g., time, number of pages) is met.
Challenges of Web Crawling:
1. Scale:
Billions of pages exist on the web.
2. Dynamic Content:
Some pages are generated dynamically and may not be accessible.
3. Politeness:
Crawlers must respect the website’s resources and the robots.txt file (which specifies what parts of the site
can be crawled); see the robots.txt sketch after this list.
4. Duplicate Content:
Crawlers need to detect and avoid indexing duplicate pages.
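A minimal politeness check using Python's standard-library robotparser; the URLs reuse the hypothetical example.com site and the user-agent name is made up:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch("MyCrawler", "https://example.com/page1"):
    print("Allowed to crawl")
else:
    print("Disallowed by robots.txt")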
Example of Web Crawling Workflow:
1. Seed URL: "https://example.com".
2. The page contains links to:
o "https://example.com/about"
o "https://example.com/contact"
o "https://anotherexample.com".
3. The crawler fetches and processes each page, following links on each page to discover more content.
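The workflow above can be sketched as a small breadth-first crawler. This assumes the third-party requests and beautifulsoup4 packages and, for brevity, omits the robots.txt check shown earlier:

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed, max_pages=10):
    """Breadth-first crawl from a seed URL; stops after max_pages pages."""
    queue, seen, crawled = deque([seed]), {seed}, 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue  # skip unreachable pages
        crawled += 1
        print("crawled:", url)
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute = urljoin(url, link["href"])  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

crawl("https://example.com")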
2. Indexes
What is an Index?
An index is a structured database that stores information about web pages to facilitate fast and accurate retrieval.
1. Content Processing:
The crawler sends the retrieved pages to the indexing system. The system processes the text, metadata, and
structure of the pages.
2. Term Extraction:
Important words (terms) and their frequencies are extracted.
3. Storage:
The extracted information is stored in an inverted index, where:
Terms are the keys.
Document IDs (pages where the terms appear) are the values.
Example:
Input Documents: suppose Doc 1 and Doc 2 both contain the term "Crawling", while Doc 3 does not.
Inverted Index: the entry for "crawling" maps to Doc 1 and Doc 2.
When a query is made (e.g., "Crawling"), the index quickly retrieves the relevant documents (Doc 1 and Doc 2).
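A minimal inverted-index sketch; the three documents are hypothetical, chosen so the query behaves as described above:

from collections import defaultdict

docs = {
    "Doc 1": "Web crawling discovers new pages",
    "Doc 2": "Crawling and indexing work together",
    "Doc 3": "Link analysis ranks pages",
}

# Build the inverted index: term -> list of document IDs containing it
index = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.lower().split()):
        index[term].append(doc_id)

print(index["crawling"])  # -> ['Doc 1', 'Doc 2']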
3. Link Analysis
Link analysis evaluates the relationships between web pages using hyperlinks to determine their importance and
relevance.
The web is modeled as a graph whose nodes are pages and whose edges are hyperlinks. Two classic link-analysis
algorithms operate on this graph:
PageRank: scores a page by the importance of the pages linking to it, distributing rank along outgoing links.
HITS: assigns each page a hub score (how well it points to good sources) and an authority score (how well good
hubs point to it).
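A minimal PageRank power-iteration sketch on a made-up three-page web graph; the damping factor 0.85 is the conventional choice:

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # page -> pages it links to

damping, n = 0.85, len(graph)
ranks = {page: 1 / n for page in graph}  # start with uniform rank

for _ in range(50):  # iterate until (approximately) converged
    new_ranks = {}
    for page in graph:
        # Rank flowing in from every page that links to this one
        incoming = sum(
            ranks[src] / len(outlinks)
            for src, outlinks in graph.items()
            if page in outlinks
        )
        new_ranks[page] = (1 - damping) / n + damping * incoming
    ranks = new_ranks

print({p: round(r, 3) for p, r in ranks.items()})  # C accumulates the most rank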
Conclusion
Together, these processes make modern web search engines efficient and powerful tools for finding relevant
information.