IR Unit V Notes Remaining
A search engine is a complex system designed to retrieve relevant information from a large corpus of
data based on user queries. Its architecture consists of multiple interconnected components optimized
for efficiency and accuracy. Below is an overview of the search engine architecture in the context of
information retrieval (IR):
B. Indexer
Text Processing: Cleans and normalizes data (e.g., tokenization, stemming, stop-word removal); a sketch of this step follows this list.
Metadata Extraction: Extracts important features (title, URL, page rank, etc.).
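A minimal sketch of the text-processing step above; the stop-word list and the suffix-stripping "stemmer" are toy stand-ins for real resources such as a Porter stemmer:

```python
import re

# Toy stop-word list and suffix rules; real indexers use fuller resources
# (e.g., a Porter stemmer). These are stand-ins for illustration.
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "are"}
SUFFIXES = ("ing", "ed", "s")

def stem(token):
    """Crude suffix-stripping stemmer (illustrative only)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def process(text):
    """Tokenize, lowercase, drop stop words, and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenization + normalization
    return [stem(t) for t in tokens if t not in STOP_WORDS]

print(process("The crawlers are indexing new pages"))
# -> ['crawler', 'index', 'new', 'page']
```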
C. Query Processor
Applies ranking signals such as query relevance, authority, and user behavior.
E. Result Presentation
1. Crawling → Parsing & Indexing → Query Processing → Ranking & Retrieval → Presentation of
Results
Vector Space Model (TF-IDF): Represents documents and queries in a vector space.
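A small sketch of how documents become TF-IDF vectors over a toy corpus (the three documents and the whitespace tokenization are simplified for illustration):

```python
import math
from collections import Counter

# Toy corpus; in a real engine these vectors come from the indexed collection.
docs = [
    "information retrieval and search engines",
    "search engines rank documents",
    "neural models rank search results",
]

def tf_idf_vectors(corpus):
    """Represent each document as a {term: TF-IDF weight} vector."""
    n = len(corpus)
    tokenized = [doc.split() for doc in corpus]
    df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append(
            {term: (count / len(doc)) * math.log(n / df[term])
             for term, count in tf.items()}
        )
    return vectors

for vector in tf_idf_vectors(docs):
    print(vector)  # terms in every document (e.g., "search") get weight 0
```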
4. Modern Enhancements
Neural Search & NLP: Uses deep learning models (e.g., BERT, GPT) for better understanding.
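A hedged sketch of neural re-ranking, assuming the third-party sentence-transformers package and the public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint are available; production rankers differ in scale and detail:

```python
# Assumes: pip install sentence-transformers (third-party package).
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do search engines rank pages"
candidates = [
    "PageRank scores pages by their link structure.",
    "Recipe for tomato soup.",
]

# The cross-encoder reads query and document together and outputs a
# relevance score per pair; higher means more relevant.
scores = model.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc}")
```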
Categories of Search Engine Architecture
2. Cluster-Based Architecture
Each node performs specific tasks like crawling, indexing, and query processing.
3. Distributed Architectures
4. Multi-Site Architecture
Each site has local search capabilities, but they share resources.
Search Engine Ranking in Information Retrieval (IR) refers to the process of ordering search results based
on their relevance to a user's query. The ranking process is crucial for delivering high-quality search
results, and it relies on various algorithms and ranking factors.
1. Query Processing
Stopword Removal: Eliminating common words like "the," "is," and "and."
2. Document Representation
Inverse Document Frequency (IDF): Weighs terms based on how rare they are across all
documents.
3. Ranking Models
Traditional Models
Vector Space Model (VSM): Computes similarity scores using cosine similarity.
Learning to Rank (LTR): Uses machine learning to rank documents. Common approaches
include pointwise, pairwise, and listwise methods (a pointwise sketch follows this list).
Neural Ranking Models: Deep learning models like BERT are used to enhance ranking.
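A toy pointwise LTR sketch: fit a linear scorer on labeled examples, then sort documents by predicted score. The feature vectors are assumptions for illustration (e.g., a BM25 score, PageRank, click rate); real systems use richer features and models:

```python
# Hand-made training data: (feature vector, relevance label). Feature meanings
# are assumptions for illustration, e.g., [bm25_score, pagerank, click_rate].
train = [
    ([0.9, 0.7, 0.8], 1.0),  # relevant document
    ([0.8, 0.2, 0.6], 1.0),
    ([0.2, 0.1, 0.3], 0.0),  # non-relevant document
    ([0.1, 0.4, 0.1], 0.0),
]

w = [0.0, 0.0, 0.0]
lr = 0.1
for _ in range(500):  # plain gradient descent on squared error
    for x, y in train:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]

# Rank unseen documents by predicted relevance score.
docs = {"doc_a": [0.7, 0.9, 0.5], "doc_b": [0.3, 0.2, 0.2]}
ranked = sorted(docs, key=lambda d: sum(wi * xi for wi, xi in zip(w, docs[d])),
                reverse=True)
print(ranked)  # doc_a should outrank doc_b
```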
5. Evaluation Metrics
Mean Average Precision (MAP): Computes the mean of precision scores at different recall levels.
Normalized Discounted Cumulative Gain (NDCG): Weighs relevance based on ranking position.
Mean Reciprocal Rank (MRR): Evaluates the rank of the first relevant result.
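These metrics can be computed directly from a ranked list of relevance judgments. A minimal sketch for a single query with binary relevance (MAP is the mean of the AP values over many queries):

```python
import math

def reciprocal_rank(relevances):
    """RR: 1 / rank of the first relevant result (0 if none)."""
    for i, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1 / i
    return 0.0

def dcg(relevances):
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """NDCG: DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

def average_precision(relevances):
    """AP: mean of precision@k taken at each relevant rank k."""
    hits, precisions = 0, []
    for i, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0

ranking = [0, 1, 1, 0, 1]  # relevance of results, in rank order
print(reciprocal_rank(ranking))            # 0.5 (first hit at rank 2)
print(round(ndcg(ranking), 3))
print(round(average_precision(ranking), 3))
```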
Link-based ranking
Link-based ranking is a core concept in Information Retrieval (IR), particularly in web search
engines. It involves ranking documents (typically web pages) based on their hyperlink structures,
assuming that links serve as endorsements of relevance and quality. The most well-known
link-based ranking algorithms include:
1. PageRank (Google)
4. TrustRank
5. BrowseRank
Introduced by Microsoft.
Uses user browsing behavior (clickstream data) instead of just hyperlinks.
These ranking algorithms are often combined with content-based ranking techniques to
improve search results.
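As an illustration of the first algorithm above, here is a minimal PageRank sketch using power iteration on a toy link graph (it ignores dangling pages and other practical details; the damping factor 0.85 is the commonly cited value):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a {page: [outgoing links]} graph."""
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # A page receives a share of the rank of every page linking to it.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new_rank[p] = (1 - damping) / len(pages) + damping * incoming
        rank = new_rank
    return rank

# Toy web: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, score in sorted(pagerank(links).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))
```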
Retrieval Models
1. Boolean Model
A document is either relevant (1) or not relevant (0) based on an exact match.
2. Term Frequency (TF)
Formula: TF(t,d) = (number of times term t appears in document d) / (total number of terms in d)
3. Inverse Document Frequency (IDF)
Formula: IDF(t) = log(N / df(t))
where:
N = total number of documents in the collection
df(t) = number of documents containing term t
4. TF-IDF
Formula: TF-IDF(t,d) = TF(t,d) × IDF(t)
5. Cosine Similarity
Formula: cos(q,d) = (q · d) / (|q| × |d|)
where:
q · d = dot product of the query vector q and the document vector d
|q| × |d| = the denominator, where we are multiplying the magnitude (length) of the query vector and the magnitude (length) of the document vector
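A short worked example of the cosine computation above on toy term-weight vectors (the TF-IDF weights are made up for illustration):

```python
import math

q = {"search": 0.6, "engine": 0.8}               # query vector
d = {"search": 0.5, "engine": 0.5, "rank": 0.7}  # document vector

dot = sum(q[t] * d.get(t, 0.0) for t in q)         # q · d = 0.7
q_len = math.sqrt(sum(w * w for w in q.values()))  # |q| = 1.0
d_len = math.sqrt(sum(w * w for w in d.values()))  # |d| ≈ 0.995

print(round(dot / (q_len * d_len), 3))  # ≈ 0.704
```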
Browsing
1. Introduction to Browsing
Browsing is an information retrieval technique where users explore and navigate content without a
specific query. It is a discovery-oriented process that allows users to find relevant information through
structured navigation.
Unlike searching, which involves direct queries, browsing helps users explore data interactively by
following links, categories, or recommendations.
2. Types of Browsing
Example: E-commerce sites (Amazon, Flipkart) where users filter by price, brand, category, etc.
Browsing is useful when users are unsure about their exact query or when they are looking for
inspiration. Searching is more effective when users know what they need.
4. Browsing Strategies
Filtering: Applying conditions to narrow down content (e.g., sorting by date, relevance).
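A small sketch of filtering as a browsing strategy, with a made-up product list (the field names and values are illustrative):

```python
products = [
    {"name": "Phone A", "brand": "Acme", "price": 299},
    {"name": "Phone B", "brand": "Acme", "price": 499},
    {"name": "Phone C", "brand": "Zen",  "price": 199},
]

def browse(items, brand=None, max_price=None, sort_key="price"):
    """Narrow the list by facet conditions, then sort, as e-commerce UIs do."""
    if brand is not None:
        items = [p for p in items if p["brand"] == brand]
    if max_price is not None:
        items = [p for p in items if p["price"] <= max_price]
    return sorted(items, key=lambda p: p[sort_key])

print(browse(products, brand="Acme", max_price=400))
# [{'name': 'Phone A', 'brand': 'Acme', 'price': 299}]
```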
Supports Uncertain Queries: Helps users who don’t know exactly what they are looking for.
Aids Decision Making: Useful in online shopping, research, and knowledge discovery.
7. Challenges in Browsing
Efficient UI Design: Using breadcrumbs, menus, and filters for easy navigation.
Web crawlers play a crucial role in information retrieval (IR) by systematically browsing the web to
collect, index, and organize data. Here are some key applications of web crawlers in IR:
Crawlers scan and index billions of web pages to help users find relevant information quickly.
They enable keyword-based searching by retrieving documents that match a user's query.
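A minimal breadth-first crawler sketch using only Python's standard library; the seed URL is hypothetical, and real crawlers add robots.txt handling, rate limits, and deduplication:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier, seen, index = [seed], {seed}, {}
    while frontier and len(index) < max_pages:
        url = frontier.pop(0)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except OSError:
            continue                       # skip unreachable pages
        index[url] = html                  # hand the page to the indexer here
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return index

# e.g., crawl("https://example.com")  # hypothetical seed URL
```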
2. Content Aggregation
News aggregators (e.g., Google News) use crawlers to gather articles from various sources.
Price comparison websites (e.g., Skyscanner) use crawlers to retrieve product prices and
availability.
Crawlers collect data from social media platforms, blogs, and forums for opinion mining and
sentiment analysis.
Companies use this for brand reputation management and market analysis.
Crawlers gather information from research papers, journals, and patents for academic databases
like Google Scholar or Semantic Scholar.
Crawlers monitor suspicious websites for phishing scams, malware distribution, and fraud
detection.
Online businesses use crawlers to track competitors' product prices, customer reviews, and stock
levels.
E-commerce platforms like Amazon use crawlers to identify fake reviews and unauthorized
resellers.
Governments and organizations use crawlers to ensure compliance with regulations (e.g., GDPR).
They track copyright violations and detect plagiarism (e.g., Turnitin, Copyscape).
Web crawlers help build knowledge graphs by extracting structured data from multiple sources.
Used in AI and Natural Language Processing (NLP) for better understanding of relationships
between entities.
9. Personalized Recommendations
Crawlers help streaming services (e.g., Netflix, Spotify) and e-commerce sites (e.g., Amazon)
recommend content by gathering user behavior data.
Medical research organizations use crawlers to extract data from health-related websites,
forums, and journals.