IR Unit V Notes Remaining

The document outlines the architecture of search engines, detailing key components such as crawlers, indexers, query processors, and ranking algorithms. It discusses various architectures including centralized, cluster-based, distributed, and multi-site systems, as well as modern enhancements like neural search and personalization. Additionally, it covers the role of browsing in information retrieval, applications of web crawlers, and challenges faced in the browsing process.


Search Engine Architecture

A search engine is a complex system designed to retrieve relevant information from a large corpus of
data based on user queries. Its architecture consists of multiple interconnected components optimized
for efficiency and accuracy. Below is an overview of the search engine architecture in the context of
information retrieval (IR):

1. Key Components of a Search Engine

A. Crawler (Web Scraper)

 Collects raw data from the web by following links.

 Downloads web pages and stores them in a repository.

 Uses a scheduler to determine which pages to crawl and revisit.
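The crawler's core loop can be sketched as a breadth-first traversal over a URL frontier. In the minimal sketch below, the `toy_web` dict and `get_links` callback stand in for the real HTTP fetching and HTML link extraction a production crawler would perform; the names are illustrative, not a real library API.

```python
from collections import deque

def crawl(seed, get_links, max_pages=100):
    """Breadth-first crawl: fetch a page, record it, and enqueue its
    unseen outgoing links (a very simple scheduler policy)."""
    frontier = deque([seed])   # URLs waiting to be fetched
    visited = []               # crawl order (the "repository")
    seen = {seed}
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):   # in practice: download + parse HTML
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Toy web: a dict standing in for real fetching/parsing (hypothetical URLs).
toy_web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
}
order = crawl("a.com", lambda u: toy_web.get(u, []))
```

The `seen` set is what prevents the crawler from revisiting pages inside one crawl; a real scheduler would also track revisit times for freshness.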

B. Indexer

 Text Processing: Cleans and normalizes data (e.g., tokenization, stemming, stop-word removal).

 Inverted Index: Creates a mapping of words to document locations.

 Metadata Extraction: Extracts important features (title, URL, page rank, etc.).
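The inverted index at the heart of the indexer can be sketched in a few lines: each term maps to a sorted postings list of document IDs. This is a minimal in-memory version; real indexers also store term positions, frequencies, and use compression.

```python
def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):   # set(): one posting per doc
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the cat sat", 2: "the dog sat", 3: "a cat ran"}
index = build_inverted_index(docs)
# index["sat"] -> [1, 2], index["cat"] -> [1, 3]
```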

C. Query Processor

 Query Parsing: Analyzes user input (handles typos, synonyms, etc.).

 Tokenization & Normalization: Matches query terms with indexed terms.

 Ranking Model: Ranks results using relevance scores.

D. Ranking & Retrieval

 Uses relevance ranking algorithms (TF-IDF, BM25, PageRank, Machine Learning).

 Applies ranking signals such as query relevance, authority, and user behavior.

E. Result Presentation

 Formats search results with snippets, titles, and links.

 Reranks results dynamically based on user feedback.

2. Workflow of a Search Engine

1. Crawling → Parsing & Indexing → Query Processing → Ranking & Retrieval → Presentation of
Results

2. Continuous learning from user interactions improves ranking models.


3. Search Engine Ranking Algorithms

 Boolean Retrieval: Uses AND/OR/NOT operations.

 Vector Space Model (TF-IDF): Represents documents and queries in a vector space.

 BM25: A probabilistic ranking model.

 Machine Learning-Based Ranking (LTR): Uses AI models to rank results dynamically.

4. Modern Enhancements

 Neural Search & NLP: Uses deep learning models (e.g., BERT, GPT) for better understanding.

 Personalization & Context Awareness: Customizes results based on user history.

 Real-time Indexing: Ensures fresh content is available quickly.


The main categories of search engine architecture are:

1. Basic Architecture – Centralized Crawler

 Single-node system where a centralized crawler fetches web pages.

 Uses a single database and indexing system.

 Simpler but lacks scalability and can become a bottleneck.

 Suitable for small-scale search engines or private databases.

2. Cluster-Based Architecture

 Uses multiple servers (nodes) grouped into clusters.

 Each node performs specific tasks like crawling, indexing, and query processing.

 Improves performance and scalability compared to a centralized approach.

 Used in mid-sized search engines and enterprise-level systems.

3. Distributed Architectures

 Fully distributed system where different nodes handle different functions.


 Uses distributed crawlers and parallel indexing.

 Provides fault tolerance, load balancing, and high scalability.

 Common in large-scale search engines like Google and Bing.

4. Multi-Site Architecture

 Designed for organizations with multiple data centers.

 Each site has local search capabilities, but they share resources.

 Reduces latency and improves regional search efficiency.

 Used by global enterprises and large-scale search engines.

Search Engine Ranking

Search Engine Ranking in Information Retrieval (IR) refers to the process of ordering search results based
on their relevance to a user's query. The ranking process is crucial for delivering high-quality search
results, and it relies on various algorithms and ranking factors.

Key Components of Search Engine Ranking

1. Query Processing

 Tokenization: Breaking the query into meaningful words or phrases.

 Stopword Removal: Eliminating common words like "the," "is," and "and."

 Stemming/Lemmatization: Reducing words to their root form (e.g., "running" → "run").

 Query Expansion: Adding synonyms or related terms to improve recall.
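The four steps above can be sketched in a few lines. The suffix-stripping stemmer and the synonym table below are deliberately crude stand-ins: a real system would use a Porter stemmer and a thesaurus such as WordNet.

```python
STOPWORDS = {"the", "is", "and", "a", "of"}
SYNONYMS = {"run": ["jog"], "fast": ["quick"]}   # hypothetical expansion table

def stem(word):
    """Crude suffix stripping (Porter stemming would be used in practice)."""
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def process_query(query):
    tokens = query.lower().split()                        # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    stems = [stem(t) for t in tokens]                     # stemming
    expanded = list(stems)
    for t in stems:                                       # query expansion
        expanded += SYNONYMS.get(t, [])
    return expanded

terms = process_query("the running is fast")
# -> ["run", "fast", "jog", "quick"]
```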

2. Document Representation

 Term Frequency (TF): Measures how often a term appears in a document.

 Inverse Document Frequency (IDF): Weighs terms based on how rare they are across all
documents.

 Vector Space Model: Represents documents and queries as mathematical vectors.

3. Ranking Models

Traditional Models

 Boolean Retrieval: Uses exact match queries but lacks ranking.

 Vector Space Model (VSM): Computes similarity scores using cosine similarity.

 Probabilistic Models: Estimates the probability that a document is relevant.


 BM25 (Best Matching 25): A popular probabilistic ranking function.

Machine Learning-Based Models

 Learning to Rank (LTR): Uses machine learning to rank documents. Common approaches
include:

o Pointwise: Predicts relevance scores independently.

o Pairwise: Compares document pairs to determine ranking.

o Listwise: Optimizes the ranking of an entire list.

 Neural Ranking Models: Deep learning models like BERT are used to enhance ranking.
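Of the three LTR approaches, pointwise is the simplest to illustrate: score each document independently from its feature vector and sort by score. The linear model and the two features below are hypothetical; in practice the weights are learned from labeled relevance judgments.

```python
def pointwise_rank(docs, weights):
    """Pointwise LTR sketch: score each document independently with a
    linear model over its features, then sort by descending score."""
    def score(features):
        return sum(w * f for w, f in zip(weights, features))
    return sorted(docs, key=lambda d: score(d["features"]), reverse=True)

docs = [
    {"id": "d1", "features": [0.2, 0.9]},   # e.g. [text-match score, link score]
    {"id": "d2", "features": [0.8, 0.3]},
]
ranked = pointwise_rank(docs, weights=[1.0, 0.5])
# d2 scores 0.8 + 0.15 = 0.95; d1 scores 0.2 + 0.45 = 0.65
```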

4. Ranking Factors in Search Engines

 Relevance: How well a document matches the query.

 PageRank: Measures the importance of a webpage based on backlinks.

 Click-Through Rate (CTR): Tracks user engagement.

 User Behavior: Dwell time and bounce rate influence rankings.

 Content Quality: Originality, readability, and depth of content.

 Freshness: Newer content may rank higher for trending topics.

5. Evaluation Metrics

 Precision & Recall: Measure accuracy and completeness.

 Mean Average Precision (MAP): Computes the mean of precision scores at different recall levels.

 Normalized Discounted Cumulative Gain (NDCG): Weighs relevance based on ranking position.

 Mean Reciprocal Rank (MRR): Evaluates the rank of the first relevant result.
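Precision, recall, and MRR are simple enough to compute directly. The sketch below operates on plain document-ID lists; the example data is made up.

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / all retrieved;
    Recall = relevant retrieved / all relevant."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank: average over queries of 1/rank
    of the first relevant result."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
# 2 of 4 retrieved are relevant (p = 0.5); 2 of 3 relevant were found
m = mrr([["d2", "d1"], ["d5", "d6", "d3"]], [{"d1"}, {"d3"}])
# first relevant hits at ranks 2 and 3: m = (1/2 + 1/3) / 2
```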

Link-based ranking

Link-based ranking is a core concept in Information Retrieval (IR), particularly in web search
engines. It involves ranking documents (typically web pages) based on their hyperlink structures,
assuming that links serve as endorsements of relevance and quality. The most well-known link-
based ranking algorithms include:

1. PageRank (Google)

 Developed by Larry Page and Sergey Brin.


 Assigns a score to web pages based on the number and quality of links pointing to them.
 Formula:

PR(A) = (1 − d) + d · Σ_{i=1}^{n} PR(L_i) / C(L_i)

where:

o PR(A) = PageRank of page A

o d = damping factor (typically 0.85)

o L_i = the i-th page linking to A

o C(L_i) = number of outbound links on page L_i
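The formula is applied iteratively until the scores converge. The sketch below implements the un-normalized form shown above on a hypothetical three-page link graph.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply PR(A) = (1 - d) + d * sum(PR(L)/C(L))
    over the pages L that link to A."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = [q for q in pages if p in links[q]]
            new[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # toy link graph
scores = pagerank(links)
# C receives links from both A and B, so it ends up with the highest score
```

This O(n²) in-link scan is fine for a toy graph; large-scale implementations store the link graph as adjacency lists and distribute the computation.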

2. HITS (Hyperlink-Induced Topic Search)

 Developed by Jon Kleinberg.


 Identifies two types of pages:
o Hubs: Pages that link to many high-quality pages.
o Authorities: Pages that are linked to by many high-quality hubs.
 Uses an iterative approach to compute hub and authority scores.
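The iteration alternates the two updates: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to, with L2 normalization each round. The toy graph below (two hub pages pointing at two content pages) is hypothetical.

```python
def hits(links, iterations=20):
    """Iterative HITS: authority(p) = sum of hubs linking to p;
    hub(p) = sum of authorities p links to; L2-normalize each round."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {"hub1": ["p1", "p2"], "hub2": ["p1"], "p1": [], "p2": []}
hub, auth = hits(links)
# p1 is linked by both hubs, so it gets the top authority score;
# hub1 links to both authorities, so it gets the top hub score
```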

3. SALSA (Stochastic Approach for Link-Structure Analysis)

 A hybrid of HITS and PageRank.


 Computes hub and authority scores using Markov chains.

4. TrustRank

 Designed to combat web spam, i.e., to prevent spammy or low-quality content from appearing in search engine results.

 Uses a small set of trusted seed pages to propagate trust through links.

5. BrowseRank

 Introduced by Microsoft.
 Uses user browsing behavior (clickstream data) instead of just hyperlinks.

These ranking algorithms are often combined with content-based ranking techniques to improve search results.

Simple Ranking Functions


In Information Retrieval (IR), ranking functions determine the order of documents retrieved in response
to a query. Here are some simple and commonly used ranking functions:

1. Binary Relevance (Boolean Retrieval)

 A document is either relevant (1) or not relevant (0) based on an exact match.

 No ranking; documents are returned in no particular order.

2. Term Frequency (TF)


 Ranks documents based on how often a query term appears in the document.

Formula:

TF(t,d)=count of term t in document d

3. Inverse Document Frequency (IDF)

 Gives higher importance to rare terms across documents.

 Formula:

IDF(t) = log(N / df_t)

 where:

o N = total number of documents

o df_t = number of documents containing term t

4. TF-IDF (Term Frequency - Inverse Document Frequency)

 A combination of TF and IDF to balance term importance.

 Formula: TF-IDF(t,d)=TF(t,d)×IDF(t)

 Higher values indicate more relevant documents.
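A minimal implementation of the formula, using raw term counts for TF and the plain log(N/df) form of IDF (other weighting variants exist):

```python
import math

def tf(term, doc):
    """Raw term frequency: count of the term in the document."""
    return doc.split().count(term)

def idf(term, docs):
    """log(N / df_t); assumes the term occurs in at least one document."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = ["cat sat on mat", "dog sat on log", "cat ate fish"]
score = tf_idf("cat", docs[0], docs)
# "cat" appears once here and in 2 of 3 docs: 1 * ln(3/2)
```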

5. BM25 (Okapi BM25)

 A more advanced ranking function that normalizes term frequency.

 Formula:

score(d, q) = Σ_t IDF(t) · [ TF(t,d) · (k1 + 1) ] / [ TF(t,d) + k1 · (1 − b + b · |d| / avgdl) ]

 where:

o k1 and b are tuning parameters (typically k1 ≈ 1.2–2.0, b ≈ 0.75)

o |d| is the document length

o avgdl is the average document length
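A direct translation of the scoring function, using BM25's standard smoothed IDF variant (one of several common choices):

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query."""
    words = doc.split()
    avgdl = sum(len(d.split()) for d in docs) / len(docs)
    N = len(docs)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d.split())
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        tf = words.count(t)
        # length normalization + term-frequency saturation
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(words) / avgdl))
    return score

docs = ["cat sat on the mat", "dog sat", "cat cat cat"]
scores = [bm25_score(["cat"], d, docs) for d in docs]
# saturation: docs[2] has 3x the term count of docs[0]
# but scores less than 3x as high
```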

6. Cosine Similarity (Vector Space Model)

 Measures the angle between the query and document vectors.


 Formula:

cos(q, d) = ( Σ_i q_i · d_i ) / ( √(Σ_i q_i²) · √(Σ_i d_i²) )

 where:

o q_i = the weight of term i in the query vector

o d_i = the weight of term i in the document vector

o Σ_i q_i · d_i = the dot product of the query and document vectors

o the denominator is the product of the magnitudes (lengths) of the query and document vectors
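The computation in a few lines, over equal-length weight vectors (the TF-IDF-style weights below are made-up values):

```python
import math

def cosine_similarity(q, d):
    """cos(q, d) = (q . d) / (|q| * |d|) for equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

query = [1.0, 0.0, 2.0]
doc = [2.0, 0.0, 4.0]
sim = cosine_similarity(query, doc)   # parallel vectors -> 1.0
```

Because the score depends only on the angle between the vectors, a long document does not outrank a short one merely by repeating terms.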


Browsing

1. Introduction to Browsing

Browsing is an information retrieval technique where users explore and navigate content without a
specific query. It is a discovery-oriented process that allows users to find relevant information through
structured navigation.

Unlike searching, which involves direct queries, browsing helps users explore data interactively by
following links, categories, or recommendations.

2. Types of Browsing

2.1 Hierarchical Browsing

 Users navigate through predefined categories and subcategories.

 Example: Yahoo Directory (earlier), Online libraries, File systems.

2.2 Faceted Browsing

 Users filter information using multiple attributes (facets).

 Example: E-commerce sites (Amazon, Flipkart) where users filter by price, brand, category, etc.
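Faceted filtering reduces to keeping only the items that match every selected facet value. The product list and exact-match filter below are a simplified sketch; real systems also support range facets (e.g., price bands) and facet counts.

```python
products = [
    {"name": "laptop A", "brand": "X", "price": 900},
    {"name": "laptop B", "brand": "Y", "price": 600},
    {"name": "phone C", "brand": "X", "price": 400},
]

def facet_filter(items, **facets):
    """Keep items matching every selected facet value."""
    return [i for i in items if all(i.get(k) == v for k, v in facets.items())]

matches = facet_filter(products, brand="X")
# -> "laptop A" and "phone C"
```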

2.3 Serendipitous Browsing

 Users discover unexpected but relevant content.

 Example: Social media feeds (Facebook, Twitter), YouTube recommendations.


2.4 Link-Based Browsing

 Users follow hyperlinks to explore related content.

 Example: Wikipedia browsing through internal links.

2.5 Graph-Based Browsing

 Users explore content based on relationships in a network or graph structure.

 Example: LinkedIn connections, Google Scholar citation networks.

3. Browsing vs. Searching

Feature     | Browsing                       | Searching
------------|--------------------------------|--------------------------------
User Goal   | Exploration & discovery        | Finding specific information
Interaction | Passive & interactive          | Active input
Navigation  | Through links & categories     | Through direct queries
Example     | Wikipedia, social media feeds  | Google search, database queries

Browsing is useful when users are unsure about their exact query or when they are looking for
inspiration. Searching is more effective when users know what they need.

4. Browsing Strategies

 Scanning: Quickly looking over content to find relevant information.

 Drilling Down: Navigating deeper into a category or topic.

 Filtering: Applying conditions to narrow down content (e.g., sorting by date, relevance).

 Exploratory Browsing: Clicking on related items or suggestions to discover new information.

5. Role of Browsing in Information Retrieval

 Enhances User Experience: Allows smooth exploration of content.

 Supports Uncertain Queries: Helps users who don’t know exactly what they are looking for.

 Improves Discoverability: Helps in finding related information beyond direct searches.

 Aids Decision Making: Useful in online shopping, research, and knowledge discovery.

6. Applications of Browsing in Information Retrieval


 Digital Libraries: Browsing through books, journals, and articles.

 E-commerce: Category-based navigation and recommendations.

 News Portals: Exploring trending and related news articles.

 Social Media: Discovering content through feeds, hashtags, and recommendations.

7. Challenges in Browsing

 Information Overload: Too much content can make browsing inefficient.

 Navigation Complexity: Poorly designed interfaces make browsing difficult.

 User Fatigue: Long browsing sessions can reduce engagement.

 Lack of Personalization: Without recommendations, browsing can be ineffective.

8. Enhancing Browsing in Information Retrieval

 Personalization: Using AI to recommend relevant content.

 Efficient UI Design: Using breadcrumbs, menus, and filters for easy navigation.

 Machine Learning & AI: Suggesting related content dynamically.

 User Feedback Mechanisms: Allowing users to refine recommendations.

Applications of a web crawler

Web crawlers play a crucial role in information retrieval (IR) by systematically browsing the web to
collect, index, and organize data. Here are some key applications of web crawlers in IR:

1. Search Engines (Google, Bing, etc.)

 Crawlers scan and index billions of web pages to help users find relevant information quickly.

 They enable keyword-based searching by retrieving documents that match a user's query.

2. Content Aggregation

 News aggregators (e.g., Google News) use crawlers to gather articles from various sources.

 Price comparison websites (e.g., Skyscanner) use crawlers to retrieve product prices and
availability.

3. Sentiment Analysis & Social Media Monitoring

 Crawlers collect data from social media platforms, blogs, and forums for opinion mining and
sentiment analysis.
 Companies use this for brand reputation management and market analysis.

4. Academic & Research Data Collection

 Crawlers gather information from research papers, journals, and patents for academic databases
like Google Scholar or Semantic Scholar.

 They help in bibliometric analysis and trend discovery in research fields.

5. Fraud Detection & Cybersecurity

 Crawlers monitor suspicious websites for phishing scams, malware distribution, and fraud
detection.

 Cybersecurity firms use them to track vulnerabilities and new threats.

6. E-commerce & Competitive Analysis

 Online businesses use crawlers to track competitors' product prices, customer reviews, and stock
levels.

 E-commerce platforms like Amazon use crawlers to identify fake reviews and unauthorized
resellers.

7. Legal & Compliance Monitoring

 Governments and organizations use crawlers to ensure compliance with regulations (e.g., GDPR).

 They track copyright violations and plagiarism detection (e.g., Turnitin, Copyscape).

8. Knowledge Graph & Structured Data Extraction

 Web crawlers help build knowledge graphs by extracting structured data from multiple sources.

 Used in AI and Natural Language Processing (NLP) for better understanding of relationships
between entities.

9. Personalized Recommendations

 Crawlers help streaming services (e.g., Netflix, Spotify) and e-commerce sites (e.g., Amazon)
recommend content by gathering user behavior data.

10. Healthcare & Medical Information Retrieval

 Medical research organizations use crawlers to extract data from health-related websites,
forums, and journals.

 Helps in disease surveillance, drug discovery, and clinical trial analysis.

