
INFORMATION RETRIEVAL WORKBOOK

1) What are the basic characteristics of the web?


The web, being a vast repository of information, exhibits several defining characteristics:
 Dynamic and ever-changing: Web content is constantly updated or replaced, making
it difficult to maintain static indexes. Examples include news websites, social media,
and e-commerce platforms.
 Interlinked structure: Hyperlinks connect web pages, enabling navigation and the
formation of a graph-like structure. This interconnectivity facilitates discovery
through algorithms like PageRank.
 Heterogeneous content: The web hosts diverse types of information, including text,
images, videos, and structured data (e.g., metadata or tables).
 Scalable architecture: Millions of users and billions of resources make scalability a
necessity in terms of infrastructure and algorithms.
 Unstructured nature: Unlike databases, web data lacks a standardized schema,
which complicates indexing and retrieval.
 Global accessibility: The web is available worldwide, enabling collaboration and
information sharing across borders.

2) Distinguish between Information Retrieval and Web Search.


Information Retrieval (IR) and Web Search serve overlapping but distinct purposes, differing
in scope, complexity, and techniques:

Aspect | Information Retrieval | Web Search
Data Collection | Controlled and curated corpus (e.g., digital libraries). | Open, unstructured web pages.
Relevance Criteria | Textual content similarity and metadata. | Text + hyperlink structure (e.g., PageRank) + user data.
Scale | Smaller, predefined collections. | Billions of pages indexed dynamically.
Output | Specific documents or snippets. | Ranked list of URLs with summaries and rich results.

 Example of IR: Retrieving a list of research papers about AI from a database.


 Example of Web Search: Searching for "best smartphones of 2024" on Google and
getting a mix of blogs, reviews, and shopping links.
3) Steps involved in TF-IDF weighting:
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to
evaluate the importance of a term in a document relative to a collection.
1. Compute Term Frequency (TF):
Measures how often a term appears in a document.
TF(t,d) = (frequency of term t in document d) / (total number of terms in document d)
2. Calculate Document Frequency (DF):
Counts the number of documents containing the term t.
3. Compute Inverse Document Frequency (IDF):
Assesses the rarity of a term across documents.

IDF(t) = log( (total number of documents) / (number of documents containing t) )

4. Combine TF and IDF:
The final weight is the product of TF and IDF.
TF-IDF(t,d) = TF(t,d) × IDF(t)
Example: In a collection of 1000 documents, if the term "web" appears in 10 documents, and
its frequency in a specific document is 5, then:
 TF = 5 / (total number of terms in that document)
 IDF = log10(1000/10) = 2
 TF-IDF = TF × 2
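As a rough illustration, here is a minimal Python sketch of these steps over a tiny made-up corpus (the three documents and the term "web" are assumptions for demonstration, not from the workbook):

```python
import math
from collections import Counter

# Toy corpus (illustrative assumption)
docs = [
    "web search engines index the web",
    "information retrieval from a curated corpus",
    "the web is dynamic and interlinked",
]

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term / total terms in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, tokenized_docs):
    # Inverse document frequency: log10(N / df)
    df = sum(1 for d in tokenized_docs if term in d)
    return math.log10(len(tokenized_docs) / df) if df else 0.0

tokenized = [d.lower().split() for d in docs]
term = "web"
for i, d in enumerate(tokenized):
    print(f"doc{i}: tf-idf({term}) = {tf(term, d) * idf(term, tokenized):.4f}")
```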

4) Is it necessary to do query expansion always? Why?


Query expansion enhances search by adding related terms to the user's original query. While
beneficial in many cases, it is not always necessary:
 When it's helpful:
o To handle synonyms (e.g., "car" expanded to include "automobile").
o For vague queries (e.g., "AI" expanded to "artificial intelligence").
o In domain-specific searches where specialized terms are used.

 When it's unnecessary:


o If the query is already precise (e.g., "Python 3.11 tutorial").
o It may introduce noise, retrieving irrelevant results.
Example: Querying "jaguar" might expand to "jaguar animal" or "Jaguar car," but if the user
only wants animal-related results, expansion could dilute relevance.
5) List the parameters to identify spam.
Spam detection involves recognizing unsolicited, irrelevant, or malicious content using
specific parameters:
1. Content-based parameters:
o Repeated keywords (keyword stuffing).
o Misleading or unrelated metadata.

o Low content quality (e.g., gibberish or scraped text).


2. Behavioural indicators:
o High bounce rates.
o Excessive linking to low-quality sites.
o Automated patterns like bot-generated clicks or views.

3. Link analysis:
o Abnormal backlink patterns (e.g., link farms).
o Excessive outbound links to unrelated domains.
4. User signals:
o Negative feedback (e.g., frequent reports as spam).

o Low engagement (short session durations).


5. Technical indicators:
o Hidden text or links (e.g., white text on white background).
o Cloaking (different content for users and search engines).
Example: A webpage stuffed with phrases like "buy cheap watches free shipping" multiple
times with links leading to suspicious domains could be flagged as spam.

6) State Heaps' Law.


Heaps' Law models the growth of the vocabulary in a collection of documents. It states:

V(n) = k · n^β
Where:
 V(n): vocabulary size after processing n words.
 k: a constant depending on the corpus.
 β: typically between 0.4 and 0.6, indicating sublinear growth.

Implications:
 As more text is added, the vocabulary size grows, but at a diminishing rate.
 Used in information retrieval to estimate dictionary size for indexing.
Example: In a collection of news articles, adding more articles will introduce new terms, but
most will be repeats of common words.
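A small Python sketch of Heaps' Law used as a vocabulary-size estimator; the constants k = 44 and β = 0.49 are assumed, ballpark values for an English-like corpus:

```python
def heaps_vocabulary(n_tokens, k=44.0, beta=0.49):
    """Estimate vocabulary size V(n) = k * n^beta (k and beta are assumed values)."""
    return k * n_tokens ** beta

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} tokens -> ~{heaps_vocabulary(n):,.0f} distinct terms")
```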

7) How does relevance scoring work in web search?


Relevance scoring ranks web pages based on their utility for a given query:
1. Textual relevance:
o Measures similarity between the query and document content using
techniques like TF-IDF or BM25.
2. Link analysis:
o Uses algorithms like PageRank or HITS to score based on hyperlink structure.
3. User Behaviour:

o Factors like click-through rate (CTR), dwell time, and past searches.
4. Semantic matching:
o Expands queries to include synonyms or related terms (e.g., "movie" matched
with "film").
5. Content freshness:
o Prioritizes recent content for time-sensitive queries (e.g., news).
Example: For the query "best smartphone 2024," relevance scoring considers text in product
reviews, their backlinks, and recent updates to ensure top results reflect the latest models.

8) What is Collaborative Filtering? Give an example.


Collaborative Filtering (CF) is a recommendation method based on user-item interactions. It
identifies similar users or items to make predictions.

 Types:
o User-based CF: Finds users with similar preferences.
o Item-based CF: Finds items often interacted with together.
 Process:
o Build a user-item interaction matrix.
o Compute similarity (e.g., cosine similarity).
o Recommend based on neighbours’ preferences.
Example:
In Netflix, if User A likes movies "X" and "Y," and User B likes "Y" and "Z," the system might
recommend "Z" to User A, assuming similar tastes.

9) Define Text Mining. Give a suitable example.


Text Mining is the process of extracting meaningful patterns and insights from unstructured
text data using techniques like NLP, machine learning, and statistics.
 Applications:
o Sentiment analysis for customer reviews.
o Spam email classification.

o Extracting topics from news articles.


Example:
Analysing tweets during a product launch to understand customer sentiment. Using text
mining, positive, negative, and neutral feedback can be identified to gauge public reaction.

10) Write the pros and cons of using classification algorithms over clustering approaches in
text mining.

Aspect | Classification | Clustering
Pros | Predicts specific classes or categories. Leverages labelled data for accuracy. Scalable for large datasets. | Unsupervised; discovers hidden patterns. Can identify unknown groupings. No need for labelled data.
Cons | Requires labelled training data, which may not be available. Limited to pre-defined categories. | Results are harder to interpret. Clusters may not align with meaningful categories.

Example:
 Classification: Categorizing emails into "spam" or "not spam."

 Clustering: Grouping news articles based on similarity to identify topics.


11) Explain the process of web search and the components of a search engine.
Web search involves retrieving and ranking web pages relevant to a user's query. A search
engine performs this task using various components and processes.
Components of a Search Engine:
1. Web Crawlers (Spiders):
o Crawl the web, fetching web pages and discovering links.

o Example: Google's web crawlers, such as Googlebot.


2. Indexing:
o Stores crawled content in an organized structure for efficient retrieval.
o Techniques: Inverted indices for fast keyword searches.
3. Query Processor:

o Analyses user queries to interpret intent and match indexed content.


o Techniques: Tokenization, query expansion, and relevance weighting.
4. Ranking Engine:
o Scores and ranks pages using algorithms like TF-IDF, BM25, and PageRank.
5. Search Interface:

o Presents results in a user-friendly format, such as snippets and rich results.


Steps in Web Search:
1. User query submission: User enters a search query.
2. Query processing: Search engine parses and expands the query.
3. Index lookup: Retrieves relevant documents from the index.

4. Relevance scoring and ranking: Applies scoring models to rank documents.


5. Results display: Returns ranked results to the user.
Example: Searching "best laptops under $1000" retrieves ranked product pages, reviews,
and shopping links.

12) Describe Boolean retrieval model with an example.


The Boolean retrieval model is a fundamental IR model that uses Boolean logic to match
documents with queries.
Core Features:
1. Queries are expressed using operators:
o AND: Matches documents containing all terms.
o OR: Matches documents containing any term.
o NOT: Excludes documents containing the term.

2. The result is a binary decision (relevant or not).


Example:
Query: "machine AND learning NOT deep"
 Matches documents containing both "machine" and "learning" but excludes those
containing "deep."
Process:
 Documents are represented as binary vectors based on term presence.

 Logical operations identify relevant documents.


Limitation:
 Does not rank results by relevance, only binary inclusion.
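A minimal Python sketch of Boolean retrieval for the query above, using set operations over posting sets; the three toy documents are assumptions:

```python
# Toy collection (illustrative)
docs = {
    1: "machine learning for text",
    2: "deep learning with neural networks",
    3: "machine learning and deep models",
}

# Build term -> set of document IDs
index = {}
for doc_id, text in docs.items():
    for term in set(text.lower().split()):
        index.setdefault(term, set()).add(doc_id)

# Query: machine AND learning NOT deep  ->  intersection then difference
result = (index.get("machine", set()) & index.get("learning", set())) \
         - index.get("deep", set())
print(sorted(result))  # -> [1]
```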

13) Write short notes on Latent Semantic Indexing (LSI) with a clear example.
LSI is a technique that reduces the dimensionality of a term-document matrix to uncover
hidden relationships between terms and documents.
Process:

1. Construct Term-Document Matrix (TDM): Rows represent terms; columns represent


documents.

2. Apply Singular Value Decomposition (SVD):


o Decomposes TDM into three matrices: U, S, V.
o S captures the most significant concepts.
3. Reduce Dimensionality: Retain only the top k singular values in S.
4. Query Mapping: Map user queries into the same reduced space for matching.

Example:
For documents about "artificial intelligence" and "machine learning":
 Original terms: "AI," "artificial," "learning," "machine."
 LSI maps these terms into latent topics (e.g., "technology").
Advantage: Handles synonymy (different words with the same meaning) and polysemy
(same word with different meanings).
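A small LSI sketch with NumPy's singular value decomposition; the term-document counts and the choice k = 2 are illustrative assumptions:

```python
import numpy as np

# Rows = terms, columns = documents (counts are invented)
terms = ["ai", "machine", "learning", "recipe"]
A = np.array([
    [2, 1, 0],   # ai
    [1, 2, 0],   # machine
    [1, 2, 0],   # learning
    [0, 0, 3],   # recipe
], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # keep the top-k latent concepts
docs_k = np.diag(S[:k]) @ Vt[:k]     # documents in the reduced concept space

# Map a query ("machine ai") into the same space: q_k = U_k^T q
q = np.array([1, 1, 0, 0], dtype=float)
q_k = U[:, :k].T @ q
sims = docs_k.T @ q_k / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print(np.round(sims, 3))             # cosine similarity of each document in latent space
```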
14) Describe in detail about Web Crawling with a neat architecture diagram.
Web crawling is the process of systematically browsing the web to collect and store data.
Process:
1. Seed URLs: Start with an initial list of URLs.

2. Fetching: Use HTTP requests to fetch pages.


3. Parsing: Extract content and links.
4. Storing: Save content in repositories for indexing.
5. Prioritizing: Determine which URLs to crawl next (e.g., breadth-first, depth-first).
Architecture Components:

1. URL Frontier: Maintains the list of URLs to be crawled.


2. Downloader: Fetches web pages using HTTP.
3. Parser: Extracts data and links.
4. Storage: Saves crawled pages in databases.
5. Scheduler: Prioritizes URLs for crawling.

Example: Search engines like Google use web crawlers to build their indexes.

15) Explain the vector space model in XML retrieval. Write in detail about the evaluation
of the data retrieved.
Vector Space Model (VSM):
 Represents documents and queries as vectors in a high-dimensional space.
 Documents are scored based on their cosine similarity to the query vector.
Application to XML Retrieval:
 XML documents have hierarchical structures.
 The VSM is adapted to index and retrieve specific XML elements or attributes based
on user queries.
Steps:

1. Parse XML documents to extract relevant elements.


2. Construct vectors based on terms in these elements.
3. Match query vectors to document vectors using cosine similarity.
Evaluation of Data Retrieval:
1. Precision: Proportion of retrieved documents that are relevant.

2. Recall: Proportion of relevant documents that are retrieved.


3. F-Measure: Harmonic mean of precision and recall.
4. Mean Average Precision (MAP): Average precision across multiple queries.
Example: Querying an XML dataset of books for "author: Tolkien" retrieves elements
containing relevant author data.

16) Explain the working of Hyperlink algorithm with an example.


Hyperlink algorithms, such as PageRank and HITS (Hyperlink-Induced Topic Search), leverage
the structure of hyperlinks to rank web pages.
1. PageRank Algorithm:

 Idea: A page is important if many important pages link to it.


 Process:
1. Assign an initial rank to each page.
2. Distribute rank equally among outgoing links.
3. Update ranks iteratively until convergence.

PR(A) = (1 − d) + d · Σ_L [ PR(L) / outlinks(L) ]

Where PR(A) is the PageRank of page A, d is the damping factor (e.g., 0.85), and L ranges
over the pages that link to A.
Example: For pages A, B, and C:
 A links to B and C; B links to C; C links to A.
 The algorithm calculates and distributes ranks iteratively.
2. HITS Algorithm:
 Idea: Pages are classified as hubs (linking to many authorities) or authorities (linked
by many hubs).
 Process:
1. Compute hub and authority scores for each page.

2. Update iteratively based on the adjacency matrix of the graph.


Example:
 In a topic like "machine learning," research papers linking to many high-quality
papers act as hubs, while heavily cited papers are authorities.

17) What is meant by Inverted Indices? Explain.


An inverted index is a data structure used to map terms to their occurrences in a document
collection.
Structure:
1. Dictionary: Stores unique terms.

2. Posting Lists: For each term, stores document IDs and possibly additional information
(e.g., term frequency, positions).
Steps to Create an Inverted Index:
1. Tokenization: Break documents into terms.
2. Normalization: Lowercase terms, remove punctuation, etc.
3. Indexing: Create mappings of terms to document lists.

Example:
For documents:

 Doc1: "apple banana apple."


 Doc2: "banana orange."
Inverted Index:
 apple → Doc1 (2 occurrences)
 banana → Doc1, Doc2

 orange → Doc2
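A minimal Python sketch that builds the inverted index above, keeping term frequencies in the postings:

```python
from collections import defaultdict, Counter

docs = {
    "Doc1": "apple banana apple",
    "Doc2": "banana orange",
}

# term -> {document ID: term frequency}
inverted = defaultdict(dict)
for doc_id, text in docs.items():
    for term, freq in Counter(text.lower().split()).items():
        inverted[term][doc_id] = freq

for term in sorted(inverted):
    print(term, "->", inverted[term])
# apple  -> {'Doc1': 2}
# banana -> {'Doc1': 1, 'Doc2': 1}
# orange -> {'Doc2': 1}
```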

18) How to implement a MapReduce function through Hadoop?


MapReduce Overview:
A distributed computing framework that processes large datasets by splitting tasks into Map
and Reduce phases.
Steps to Implement:
1. Define the Mapper:

o Processes input data and generates key-value pairs.


Example: In a word count task, the mapper outputs (word, 1) for each word.
2. Define the Reducer:
o Aggregates values for the same key.
Example: The reducer sums count for each word, producing (word, total count).

3. Set Up Hadoop Cluster:


o Deploy Hadoop Distributed File System (HDFS) for storage.
o Submit the job to the Hadoop Job Tracker.
4. Run the Job:
o Process data in distributed nodes.

Example:
For the input "apple banana apple":
 Mapper Output: (apple, 1), (banana, 1), (apple, 1).
 Reducer Output: (apple, 2), (banana, 1).
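One common way to run such a job without writing Java is Hadoop Streaming, which pipes HDFS data through any executable via stdin/stdout. A word-count sketch in Python follows; the file names are illustrative and the streaming jar path depends on the installation:

```python
# mapper.py -- emits (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop sorts mapper output by key, so counts for a word arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A typical (installation-dependent) submission then looks roughly like: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /input -output /output.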

19) Explain the use of snippet generation, summarization, and question answering in web
search and retrieval.
1. Snippet Generation:

o Briefly summarizes the most relevant parts of a web page.


o Purpose: Provides context to help users decide whether to click a result.
o Example: For the query "best smartphone 2024," a snippet might display a
line like "Check out our list of the best smartphones under $1000."
2. Summarization:
o Condenses long documents into shorter versions.
o Types:

 Extractive: Selects key sentences from the text.


 Abstractive: Generates new sentences summarizing the content.
3. Question Answering (QA):
o Directly provides answers to user questions using NLP.

o Example: Query: "Who is the CEO of Google?" Answer: "Sundar Pichai."


Applications:
 Enhances user experience by reducing effort and time spent sifting through content.

20) Explain in detail the steps involved in finding the root node and its leaf nodes through
information gain value using a decision tree.
Steps:

1. Compute Entropy:
o Measure the disorder or impurity of the dataset.
H(S) = −∑ p(i) · log2 p(i)
Where p(i) is the probability of class i.
2. Calculate Information Gain:
o Determines the reduction in entropy after splitting on an attribute A.
IG(S, A) = H(S) − ∑_v ( |S_v| / |S| ) · H(S_v)
3. Select the Attribute with Maximum Information Gain:
o This attribute becomes the root node.
4. Repeat for Subsets:
o Apply the process recursively to each subset (child nodes).
5. Stop Condition:
o When entropy is zero or all attributes are exhausted, leaf nodes are assigned.
Example:
For a dataset of loan approvals:
 Root node: "Income Level" (highest IG).

 Child nodes: Split into "High," "Medium," and "Low."


 Further splits create the decision tree.
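A Python sketch of the entropy and information-gain computation on a tiny, invented loan-approval dataset:

```python
import math
from collections import Counter

# (income_level, approved) -- hypothetical records
records = [
    ("High", "Yes"), ("High", "Yes"), ("Medium", "Yes"),
    ("Medium", "No"), ("Low", "No"), ("Low", "No"),
]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index=0, label_index=1):
    base = entropy([r[label_index] for r in rows])
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

print(f"H(S) = {entropy([r[1] for r in records]):.3f}")       # 1.000
print(f"IG(S, income) = {information_gain(records):.3f}")     # ~0.667
```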

21) Explain the application of Expectation-Maximization (EM) algorithm in text mining.


The Expectation-Maximization (EM) algorithm is a statistical technique for finding maximum
likelihood estimates of parameters in probabilistic models, especially when the data involves
latent variables.
Application in Text Mining:
EM is widely used in clustering, topic modelling, and handling incomplete or noisy text
datasets.
Steps of the EM Algorithm:
1. Initialization:

o Assign initial values to the parameters of the model (e.g., probabilities in a


mixture of Gaussians).

2. Expectation Step (E-step):


o Estimate the probabilities of latent variables based on current parameters.
3. Maximization Step (M-step):
o Update the model parameters to maximize the likelihood of the observed
data given the latent variables.
4. Repeat until convergence.

Example in Topic Modelling:


 Consider a document collection. Each document belongs to a latent topic.
 Latent Variables: Topics.
 Observed Variables: Words in the documents.
Process:

 The E-step estimates the probability that a word belongs to a topic.


 The M-step adjusts the topic-word and document-topic distributions.
Output: Topic-word distributions for discovering document themes.
Advantages:
 Handles missing data and probabilistic relationships.

22) Design a simple sentiment analysis system considering a database of your choice with
minimal records. Sketch the architecture and highlight the functions of the modules
identified.
Architecture:
1. Data Collection Module:
o Collects user reviews or social media posts.
2. Preprocessing Module:
o Cleans the data:

 Remove stopwords, special characters, and URLs.


 Tokenize text.
3. Feature Extraction Module:
o Extracts relevant features:
 Bag-of-Words, TF-IDF, or Word Embeddings.

4. Sentiment Classification Module:


o Uses a machine learning model (e.g., Naive Bayes, SVM, or deep learning).
5. Output Module:
o Displays results: Positive, Negative, or Neutral sentiments.
Minimal Database Example:

Data:
 "I love this product." (Positive)
 "This is the worst experience ever." (Negative)
 "The service was okay, not great." (Neutral)
System Flow:

1. Preprocess the text:


o Remove punctuation, lowercase text.
2. Extract features using TF-IDF.
3. Train a Naive Bayes classifier with the labeled data.
4. Classify new inputs, such as "The food was delicious."
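A minimal end-to-end sketch of this flow using scikit-learn (assuming it is installed); the three labelled records come from the example above, and with so little data the prediction is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Minimal labelled database (from the example above)
texts = [
    "I love this product.",
    "This is the worst experience ever.",
    "The service was okay, not great.",
]
labels = ["Positive", "Negative", "Neutral"]

# Preprocessing + feature extraction + classification in one pipeline
model = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"),
                      MultinomialNB())
model.fit(texts, labels)

# With so few records the output only demonstrates the mechanics of the pipeline
print(model.predict(["The food was delicious."]))
```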

23) Demonstrate the use of K-means clustering algorithm in grouping different documents.
Assume a simple database with 3 types of documents. State the assumptions made.
Steps for K-Means Clustering:
1. Initialization: Choose K=3 clusters. Randomly initialize centroids.
2. Assignment: Assign each document to the nearest cluster based on cosine similarity
or Euclidean distance.
3. Update: Recalculate centroids by averaging feature vectors of assigned documents.
4. Repeat: Iterate until clusters stabilize.
Database Example:

Documents:
1. "AI and machine learning techniques."
2. "Deep learning and neural networks."
3. "Cooking recipes for beginners."
4. "Best travel destinations in 2024."

5. "Machine learning applications in healthcare."


Clusters:
 Cluster 1: AI/ML topics.
 Cluster 2: Travel.
 Cluster 3: Cooking.

Assumptions:
 Text is represented as numerical vectors (e.g., TF-IDF).
 The number of clusters (K) is known beforehand.
Result: Documents grouped into categories based on content similarity.
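A sketch of this clustering with scikit-learn on the five example documents, assuming K = 3 as stated above:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "AI and machine learning techniques.",
    "Deep learning and neural networks.",
    "Cooking recipes for beginners.",
    "Best travel destinations in 2024.",
    "Machine learning applications in healthcare.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # documents -> TF-IDF vectors
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)    # K = 3, fixed seed

for label, doc in zip(km.labels_, docs):
    print(label, doc)
```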

24) What is peer-to-peer search?


Peer-to-peer (P2P) search is a decentralized search mechanism where nodes (peers) act
both as clients and servers, sharing resources without relying on a centralized server.
Features:
1. Decentralized: No central control; data is distributed across peers.
2. Scalability: Handles large-scale networks efficiently.
3. Robustness: Resistant to single points of failure.

How It Works:
1. A peer issues a query.
2. The query propagates through the network using techniques like flooding or
distributed hash tables (DHT).
3. Peers with matching content respond directly.
Example:
BitTorrent protocol for file sharing uses P2P to search and retrieve files.
Advantages:

 Cost-effective and scalable.


Limitations:
 Latency and difficulty in ensuring consistent results.

25) What are the performance measures for search engines?

1. Precision:
o Proportion of retrieved documents that are relevant.
Precision=Relevant Retrieved Documents/Total Retrieved Documents
2. Recall:
o Proportion of relevant documents retrieved out of all relevant documents.

Recall=Relevant Retrieved Documents/Total Relevant Documents


3. F-Measure:
o Harmonic mean of precision and recall.
F=2×((Precision×Recall)/(Precision+Recall))
4. Mean Average Precision (MAP):

o Averages precision values at different recall levels across queries.


5. Click-Through Rate (CTR):
o Proportion of clicked results to total impressions.
6. Response Time:
o Time taken to return results.

7. Index Coverage:
o Percentage of web pages indexed by the search engine.
8. Diversity:
o Variety in the types of results returned.
9. User Satisfaction:
o Measured through surveys or implicit feedback.
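A tiny Python sketch computing the first three measures for a single query; the retrieved and relevant document sets are invented for illustration:

```python
retrieved = {"d1", "d2", "d3", "d4"}   # documents the engine returned (hypothetical)
relevant  = {"d1", "d3", "d5"}         # ground-truth relevant documents (hypothetical)

tp = len(retrieved & relevant)         # relevant documents that were retrieved
precision = tp / len(retrieved)        # 2/4 = 0.50
recall    = tp / len(relevant)         # 2/3 ≈ 0.67
f_measure = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2f}  R={recall:.2f}  F={f_measure:.2f}")
```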
26) Define the term Stemming.
Stemming is the process of reducing words to their root or base form by removing suffixes
or prefixes. It is used in text preprocessing to standardize words and reduce redundancy in
information retrieval systems.
Example:

 Words: running, runner, runs


 Stem: run
Algorithms Used:
1. Porter Stemmer: Commonly used, rule-based algorithm.
2. Lancaster Stemmer: More aggressive than the Porter Stemmer.

3. Snowball Stemmer: An improvement of Porter with support for multiple languages.


Applications:
 Text mining.
 Search engines to improve matching between query terms and documents.
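A quick sketch using NLTK's Porter and Snowball stemmers (assuming the nltk package is installed):

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["running", "runner", "runs"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))
```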

27) Differentiate between relevance feedback and pseudo relevance feedback.

Aspect | Relevance Feedback | Pseudo Relevance Feedback
Definition | Uses user-provided relevance judgments to refine search results. | Assumes the top-k retrieved documents are relevant and uses them for refinement.
User Involvement | Requires user input to mark relevant documents. | Fully automated; no user involvement.
Accuracy | High if user input is accurate. | Lower, due to noise from assumed relevance.
Efficiency | Slower due to user interaction. | Faster, as it bypasses user input.
Example | The user marks documents as relevant; the system refines the results. | The top-5 documents are used to expand the query terms.

28) What are the applications of web crawlers?


Web crawlers, also known as spiders or bots, traverse the web to collect data for various
applications.
Applications:
1. Search Engines:
o Index web pages for faster and more accurate search results (e.g., Google,
Bing).
2. Market Research:

o Monitor competitors' pricing, product details, and reviews.


3. Content Aggregation:
o Gather news, job postings, or travel deals from multiple sources.
4. Sentiment Analysis:
o Collect social media posts or product reviews for sentiment analysis.

5. Data Archiving:
o Preserve snapshots of web pages (e.g., Wayback Machine).
6. Monitoring:
o Detect changes in websites, such as updates to regulations or policies.
7. E-commerce:

o Automate inventory management by fetching product availability.

29) Define Search Engine Optimization (SEO).


Search Engine Optimization (SEO) is the process of improving a website's visibility on search
engine results pages (SERPs) to attract organic (non-paid) traffic.
Key Components of SEO:

1. On-Page SEO:
o Optimizing content, meta tags, and keywords.
o Ensuring mobile-friendliness and fast loading times.
2. Off-Page SEO:
o Building backlinks from reputable websites.

3. Technical SEO:
o Improving crawlability, URL structure, and site architecture.
4. Content Quality:
o Producing relevant, high-quality content to engage users.
Benefits:
 Increased traffic.
 Improved user experience.

 Higher brand credibility.

30) What do you mean by item-based collaborative filtering?


Item-based collaborative filtering recommends items to users based on the similarity
between items rather than user behaviour.
Process:
1. Compute item-item similarity using methods like cosine similarity or Pearson
correlation.

2. Predict ratings for unseen items based on a user's ratings of similar items.
Example:
In an e-commerce setting:
 A user who purchased Phone A is recommended Phone B because many users who
bought Phone A also bought Phone B.
Advantages:
 Effective when user behaviour data is sparse.

 Scales better for large datasets compared to user-based filtering.

31) What is invisible web?


The Invisible Web, also known as the Deep Web, refers to content on the internet that is not
indexed by traditional search engines.
Characteristics:
1. Dynamic Content: Pages generated in response to specific queries (e.g., flight search
results).

2. Restricted Access: Password-protected or subscription-based content (e.g., academic


journals).

3. Unlinked Pages: Pages with no inbound links are not crawled by search engines.
4. Technical Barriers: Content requiring special formats (e.g., Flash) or located in
databases.
Examples:
 Online banking portals.
 Medical records.
 Government databases.

The Invisible Web contains valuable information, especially for research and specialized
queries.

32) What is good clustering?


Good clustering is achieved when a clustering algorithm produces well-separated and
coherent groups (clusters) that reflect the underlying data structure.
Characteristics of Good Clustering:
1. High Intra-Cluster Similarity: Points within a cluster are very similar.

2. Low Inter-Cluster Similarity: Points in different clusters are distinctly dissimilar.


3. Scalability: The algorithm performs well on large datasets.
4. Interpretability: Results are meaningful and understandable.
Evaluation Metrics:
1. Silhouette Coefficient: Measures the separation between clusters.

2. Dunn Index: Ratio of the smallest inter-cluster distance to the largest intra-cluster
distance.

3. Elbow Method: Determines the optimal number of clusters.

33) Differentiate information filtering and information retrieval.

Aspect | Information Filtering | Information Retrieval
Purpose | Filters relevant information from a continuous stream. | Retrieves information in response to a specific query.
Input | A predefined profile or user preferences. | A user-initiated query.
Output | Relevant information matched to the user's profile. | Documents or data relevant to the search query.
Examples | Email spam filtering, news feed personalization. | Search engines, library catalogs.
Usage Scenario | Ongoing relevance (e.g., recommendation systems). | Ad-hoc, one-time search tasks.

34) Explain about components of Information Retrieval and Search Engine.


Components of Information Retrieval (IR):
1. Document Collection:
o A repository of structured or unstructured data.

2. Indexing Module:
o Creates an index for efficient document retrieval.
3. Query Processor:
o Processes user queries to match them with indexed documents.
4. Retrieval Module:

o Fetches documents based on relevance scoring techniques.


5. Ranking Module:
o Ranks retrieved documents according to their relevance to the query.
Components of a Search Engine:
1. Web Crawler:

o Collects and updates data from the web.


2. Indexer:
o Structures the data for quick lookup.
3. Query Interface:
o Allows users to input queries.

4. Ranking Algorithm:
o Uses factors like relevance, popularity, and authority to rank results.
5. Database/Storage:
o Stores the web pages, metadata, and indexes.

35) Explain the impact of the web on Information Retrieval system.


The web has significantly influenced Information Retrieval (IR) systems in the following ways:
1. Scale and Diversity:
o IR systems must handle massive, diverse, and dynamic web data.

2. Hyperlinked Structure:
o Links between documents provide new ways to evaluate relevance (e.g.,
PageRank).
3. User Interaction:
o Personalization and query expansion based on user behaviour.
4. Challenges:
o Spam, duplicated content, and ensuring real-time updates.

5. Multimedia Retrieval:
o Necessity to process images, videos, and other non-textual content.
6. Global Access:
o IR systems cater to multilingual and geographically distributed users.

36) Brief about Open-Source Search Engine Framework.


Open-source search engine frameworks provide tools for building custom search engines.
Popular Frameworks:
1. Apache Lucene:
o A high-performance, full-text search library.

o Features: Ranking, term proximity, and scalability.


2. Elasticsearch:
o Built on Lucene for distributed search.
o Features: Real-time indexing and analytics.
3. Solr:

o A Lucene-based search server optimized for enterprise use.


o Features: Faceted search, scalability, and integration with Hadoop.
4. Whoosh:
o A lightweight Python-based framework.
o Ideal for small-scale applications.
Advantages:
 Cost-effective customization.
 Community support and frequent updates.

37) Explain in detail about binary independence model for Probability Ranking Principle
(PRP).
The Binary Independence Model (BIM) is a probabilistic IR model based on the Probability
Ranking Principle (PRP), which ranks documents by their likelihood of relevance to a query.
Assumptions:
1. Binary relevance: A document is either relevant or not.
2. Independence: Terms in a document occur independently.

Key Formula:
The probability that a document D is relevant is:
P(R|D) = ( P(R) · P(D|R) ) / P(D)
Steps:
1. Compute probabilities for relevant and non-relevant terms.

2. Score documents based on the ratio of these probabilities.


3. Rank documents by their scores.
Advantages:
 Simplifies relevance estimation.
 Provides a principled basis for ranking.

Limitations:
 Assumes independence, which may not hold in real-world text.

38) Write short notes on Latent Semantic Indexing (LSI).


Latent Semantic Indexing (LSI) is a technique to improve information retrieval by reducing
the dimensionality of text data.
Process:

1. Term-Document Matrix:
o Represents the frequency of terms in documents.
2. Singular Value Decomposition (SVD):
o Decomposes the matrix into three smaller matrices: A = U Σ V^T
3. Dimensionality Reduction:

o Reduces noise and captures latent relationships between terms and


documents.

Advantages:
 Handles synonymy and polysemy.
 Improves retrieval accuracy.
Example:
For terms like car and automobile, LSI identifies their similarity, improving search results for
related queries.

39) How do we process a query using an inverted index and the basic Boolean Retrieval
model?
Inverted Index:
An inverted index maps terms to the list of documents containing those terms.
Example:
For documents:
 Doc 1: "apple banana"
 Doc 2: "banana cherry"

 Index:
o apple: [1]
o banana: [1, 2]
o cherry: [2]

Query Processing:
1. Query: apple AND banana
2. Retrieve Posting Lists:
o apple: [1]
o banana: [1, 2]

3. Intersect Lists:
o Result: [1]
Advantages:
 Efficient for Boolean queries.

40) Describe about how to estimate the query generation probability for query likelihood
model.
In the Query Likelihood Model, the probability P(Q∣D) estimates how likely a document D is
to generate the query Q.
Steps:
1. Unigram Language Model:
o Each document is modelled as a probability distribution over terms.
2. Estimate Term Probabilities:
P(Q|D) = ∏_{t ∈ Q} P(t|D)
3. Smoothing:
o Adjust probabilities to handle unseen terms using methods like Laplace or
Dirichlet smoothing.
4. Rank Documents:
o Based on P(Q|D).

Advantages:
 The model incorporates term frequency and handles probabilistic ranking.
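A Python sketch of query-likelihood scoring with Dirichlet smoothing; the two toy documents and the prior μ = 2000 are assumptions:

```python
import math
from collections import Counter

docs = {
    "d1": "information retrieval and web search".split(),
    "d2": "web crawling for search engines".split(),
}
collection = [t for d in docs.values() for t in d]
cf = Counter(collection)                      # collection term frequencies
mu = 2000                                     # Dirichlet prior (assumed setting)

def log_p_query_given_doc(query, doc_tokens):
    tf = Counter(doc_tokens)
    score = 0.0
    for t in query:
        p_c = cf[t] / len(collection)         # collection language model P(t|C)
        p_td = (tf[t] + mu * p_c) / (len(doc_tokens) + mu)   # smoothed P(t|D)
        score += math.log(p_td)               # log P(Q|D) = sum of log P(t|D)
    return score

query = "web search".split()
ranked = sorted(docs, key=lambda d: log_p_query_given_doc(query, docs[d]), reverse=True)
print(ranked)
```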

41) Explain in detail about fingerprint algorithm for near-duplication detection.


The fingerprint algorithm is a technique used to identify near-duplicate documents
efficiently by generating unique "fingerprints" for documents.
Steps:

1. Tokenization:
o Split the document into smaller tokens (e.g., words or phrases).
2. Hashing:
o Apply a hashing function to convert tokens into numerical representations.
3. Fingerprint Generation:
o Use a method like MinHash or SimHash to create a compact fingerprint of the
document.
4. Comparison:
o Compare fingerprints using similarity metrics such as Jaccard similarity or
Hamming distance.
Applications:

 Detecting plagiarism.
 Eliminating duplicate content in search engine indexes.
 Identifying similar documents in large datasets.
Advantages:
 Fast and scalable for large datasets.

 Reduces storage requirements by identifying duplicates.
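A small Python sketch of shingle-based fingerprinting with Jaccard comparison; the shingle size, 32-bit hash truncation, and sample texts are assumptions:

```python
import hashlib

def fingerprints(text, k=3):
    """Hash every k-word shingle to a 32-bit value; the set is the document's fingerprint."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    return {int(hashlib.md5(s.encode()).hexdigest()[:8], 16) for s in shingles}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
print(f"similarity = {jaccard(fingerprints(doc_a), fingerprints(doc_b)):.2f}")  # ~0.40
```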

42) Explain the features and architecture of web crawlers.


Web Crawlers are automated programs that traverse the web to gather information
from websites.
Features:
1. Scalability:

o Can handle large-scale web data.


2. Politeness:
o Adheres to the rules defined in a site's robots.txt file.
3. Efficiency:
o Minimizes bandwidth usage by avoiding redundant downloads.

4. Extensibility:
o Can be customized for specific tasks like multimedia crawling.

Architecture:
1. URL Frontier:

o Manages the queue of URLs to be crawled.


2. Downloader:
o Fetches web pages from the internet.
3. Parser:
o Extracts links and relevant content from fetched pages.

4. Indexer:
o Stores extracted data for future retrieval.
5. Scheduler:
o Determines the order in which URLs are fetched.

43) Explain about on-line selection in web crawling.


On-line selection in web crawling refers to the process of prioritizing and filtering URLs
dynamically during the crawling process.
Steps:
1. URL Prioritization:
o Assign scores to URLs based on relevance, popularity, or recency.
2. Dynamic Filtering:

o Skip duplicate or irrelevant pages based on predefined criteria.


3. Adaptive Crawling:
o Adjust crawling behaviour based on feedback (e.g., changing priorities if a
page is updated frequently).
Advantages:
 Reduces bandwidth consumption.

 Ensures timely and relevant data collection.


Challenges:
 Requires real-time analysis of content.
 Balancing coverage and depth is complex.

44) Explain in detail about Vector Space Model for XML Retrieval.
The Vector Space Model (VSM) is adapted for XML retrieval to account for structured
data.
Key Concepts:
1. Document Representation:
o Represent XML documents as vectors of terms, considering tags and
attributes.
2. Query Representation:
o Represent queries as vectors, with terms and structure matching XML tags.

3. Similarity Measurement:
o Use cosine similarity to compute the relevance of documents to a query.
Process:
1. Parse XML documents and create a term-vector matrix.
2. Parse user queries, mapping terms to corresponding XML structures.

3. Compute similarity scores and rank results.

45) Explain in detail about HITS link analysis algorithm.


HITS (Hyperlink-Induced Topic Search) is a link analysis algorithm that evaluates the
relevance and authority of web pages.
Key Concepts:

1. Authority:
o Pages that are highly referenced by others.
2. Hub:
o Pages that reference many other relevant pages.
Steps:

1. Construct a link graph of web pages based on hyperlinks.


2. Assign initial scores to all pages.
3. Update scores iteratively:
o Hub score: Sum of the authority scores of pages it links to.
o Authority score: Sum of the hub scores of pages linking to it.

4. Converge to stable scores for hubs and authorities.


Applications:
 Identifying influential pages in search engines.
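A compact Python sketch of the HITS iteration on a small, hypothetical four-page link graph:

```python
import numpy as np

# Adjacency matrix of a hypothetical graph: A[i, j] = 1 if page i links to page j
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

hubs = np.ones(4)
auths = np.ones(4)
for _ in range(50):                      # iterate until scores (practically) converge
    auths = A.T @ hubs                   # authority = sum of hub scores of pages linking in
    hubs = A @ auths                     # hub = sum of authority scores of pages linked to
    auths /= np.linalg.norm(auths)       # normalise to keep the scores bounded
    hubs /= np.linalg.norm(hubs)

print("authorities:", np.round(auths, 3))
print("hubs:       ", np.round(hubs, 3))
```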
46) Explain in detail about community-based question answering system.
A community-based question answering (CQA) system is a platform where users ask
and answer questions, leveraging collective knowledge.
Components:
1. User Interface:

o For posting questions and viewing answers.


2. Answer Ranking:
o Ranks answers based on quality, user votes, and relevance.
3. Reputation System:
o Assigns scores to users based on contributions.

4. Search Module:
o Allows users to search for existing questions.
Examples:
 Quora, Stack Overflow.
Advantages:

 Promotes knowledge sharing.


 Provides detailed answers based on diverse expertise.

47) Explain in detail about Collaborative Filtering and Content-Based Recommendation


Systems.
Collaborative Filtering (CF):

CF relies on user behaviour and preferences to recommend items.


Types of CF:
1. User-Based CF:
o Identifies similar users and recommends items liked by them.
o Example: User A and User B have similar tastes; recommend items liked by B
to A.
2. Item-Based CF:

o Identifies items frequently co-rated by users.


o Example: Users who bought item X also bought item Y.
Advantages:
 Requires no item-specific features.
 Works well for diverse user groups.

Disadvantages:
 Cold-start problem (new users or items).
 Scalability issues with large datasets.

Content-Based Recommendation Systems (CBRS):

CBRS uses item attributes and user preferences for recommendations.


Process:
1. Analyse item features (e.g., genre, description).
2. Match these features with user profiles.
Example:

 A user who likes mystery novels is recommended books with similar genres.
Advantages:
 Handles new users if sufficient item data is available.
 Transparent recommendations based on item features.
Disadvantages:

 Limited to known user preferences.


 Cannot recommend diverse items outside user preferences.

48) Brief on Personalized Search.


Personalized Search tailors search results to an individual’s preferences and behaviour.

Key Features:
1. User History:
o Leverages past queries and interactions.
2. Location-Based Results:
o Recommends results based on geographic location.
3. Demographics:
o Incorporates age, gender, and preferences.
4. Device-Specific Optimization:
o Adapts results to the user’s device type.

Example:
A user searching for "restaurants" in New York gets location-specific results based on
prior preferences.
Challenges:
 Privacy concerns.
 Balancing personalization with result diversity.

49) Explain in detail about Naïve Bayes Classification.


Naïve Bayes is a probabilistic classifier based on Bayes' theorem with the assumption of
feature independence.
Bayes' Theorem:
P(C∣X)=P(X∣C)⋅P(C) / P(X)
 P(C∣X): Probability of class C given data X.

 P(X∣C): Probability of data X given class C.


 P(C): Prior probability of class C.
 P(X): Prior probability of data X.
Steps:
1. Calculate prior probabilities for each class.

2. Calculate likelihoods of features for each class.


3. Multiply priors and likelihoods to compute P(C∣X)
4. Assign the class with the highest probability.
Advantages:
 Simple and fast for large datasets.

 Effective for text classification tasks.


Example:
Classifying emails as spam or not spam based on word frequencies.
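A tiny hand-rolled Naive Bayes sketch for the spam example; the training messages are invented and add-one (Laplace) smoothing is assumed:

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled messages
train = [
    ("buy cheap watches now", "spam"),
    ("cheap meds free shipping", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]

class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].extend(text.split())

vocab = {w for words in class_docs.values() for w in words}
priors = {c: sum(1 for _, l in train if l == c) / len(train) for c in class_docs}
counts = {c: Counter(words) for c, words in class_docs.items()}

def classify(text):
    scores = {}
    for c in class_docs:
        total = sum(counts[c].values())
        # log P(C) + sum of log P(w|C), with add-one smoothing over the vocabulary
        scores[c] = math.log(priors[c]) + sum(
            math.log((counts[c][w] + 1) / (total + len(vocab)))
            for w in text.split() if w in vocab)
    return max(scores, key=scores.get)

print(classify("free cheap watches"))   # -> spam
```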
50) Explain in detail about Multiple-Bernoulli Model.
The Multiple-Bernoulli Model is a probabilistic model used in text classification, treating
documents as binary feature vectors.
Key Concepts:
 Each term is either present (1) or absent (0) in a document.

 Assumes binary independence of terms.


Steps:
1. Represent documents as binary vectors.
2. Estimate probabilities of term occurrences for each class.
3. Use Bayes' theorem to compute class probabilities for new documents.

Advantages:
 Simplifies computations with binary data.
 Suitable for scenarios with sparse data.
Limitations:
 Loses term frequency information.

 Assumes equal importance for all terms.

51) Explain in detail about K-Means Algorithm.


K-Means is a clustering algorithm that partitions data into K clusters based on
similarity.
Steps:

1. Initialize K cluster centroids randomly.


2. Assign each data point to the nearest centroid.
3. Recalculate centroids as the mean of assigned points.
4. Repeat steps 2 and 3 until centroids stabilize.
Applications:

 Document clustering.
 Image compression.
Example:
Clustering articles into categories like sports, politics, and technology.
Advantages:
 Simple and scalable.
 Works well with well-separated clusters.

Limitations:
 Sensitive to initial centroid placement.
 Struggles with non-spherical clusters.

52) Brief about Expectation-Maximization (EM) Algorithm.

Expectation-Maximization (EM) is an iterative algorithm used for finding maximum


likelihood estimates of parameters in statistical models.

Steps:
1. Expectation Step (E-Step):
o Estimate the missing data using current parameter values.
2. Maximization Step (M-Step):
o Update parameter estimates to maximize the likelihood function.

3. Repeat until convergence.


Applications:
 Text clustering.
 Latent variable models like Gaussian Mixture Models (GMM).

53) Consider a web graph with three nodes 1, 2, and 3. The links are as follows: 1 → 2,
3 → 2, 2 → 1, 2 → 3. Write down the transition probability matrices for the surfer's
walk with teleporting, for the teleport probability a = 0.5, and compute the
PageRank.
Transition Probability Matrix with Teleporting:

For a = 0.5:
M = 0.5 · A + 0.5 · T
 A: Link-based transition matrix.
 T: Teleportation matrix (uniform probability for all nodes).
Steps to Compute PageRank:

1. Formulate M.
2. Start with an initial PageRank vector.
3. Iteratively update: PR=M⋅ PR
4. Normalize the vector after convergence.

Result:
 PR: Final rank scores for each node.
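A Python sketch of this computation using power iteration with the row-vector convention PR ← PR · M; with a = 0.5 the ranks converge to roughly (0.28, 0.44, 0.28):

```python
import numpy as np

# Row-stochastic link matrix A for the links 1->2, 3->2, 2->1, 2->3
A = np.array([
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
])
T = np.full((3, 3), 1 / 3)        # teleportation: jump to any node uniformly
a = 0.5                           # teleport probability
M = (1 - a) * A + a * T           # combined transition matrix (= 0.5*A + 0.5*T here)

pr = np.full(3, 1 / 3)            # start from a uniform rank vector
for _ in range(100):
    pr = pr @ M                   # one step of the random surfer's walk
pr /= pr.sum()                    # normalise (M is stochastic, so this is a no-op)

print(np.round(M, 3))
print(np.round(pr, 3))            # -> approximately [0.278, 0.444, 0.278]
```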

54) How do the various nodes of a distributed crawler communicate and share URLs?
Distributed crawlers handle large-scale web crawling tasks by splitting the workload
across multiple nodes. Communication and URL sharing among nodes are critical for
efficient crawling.
Mechanisms for Communication:

1. Message Queues:
o Nodes communicate using message queues to share URLs and status updates.
2. Central Coordination:
o A central server assigns URLs to crawler nodes and maintains a global URL
queue.
3. Decentralized Communication:
o Peer-to-peer communication ensures that nodes exchange URLs directly
without central coordination.
URL Sharing Strategies:

1. Partitioning by Domain:
o Each node is assigned specific domains or subdomains to avoid overlap.
2. Hash-Based Partitioning:
o URLs are hashed, and the hash value determines the node responsible for
crawling.
3. URL Deduplication:

o Nodes check against a shared database to avoid redundant crawling.


4. Load Balancing:
o Distribute URLs dynamically to ensure even workload across nodes.

55) When does relevance feedback work?


Relevance feedback improves search results by leveraging user feedback about the
relevance of retrieved documents.
When it Works:
1. User Interaction:
o Works well when users explicitly mark documents as relevant or non-relevant.

2. High Query Ambiguity:


o Helps refine results for vague or ambiguous queries.
3. Rich Document Set:
o Effective when the retrieved documents cover diverse aspects of the query.
Challenges:

 Requires user effort.


 May fail if feedback is inconsistent or biased.

56) Explain the process of Information Retrieval and the components involved in it
with a neat architecture.
Information Retrieval (IR) is the process of obtaining relevant information from a large
repository based on user queries.
Process:

1. Crawling:
o Collect data from sources like the web or databases.
2. Indexing:
o Organize data into searchable structures, such as inverted indices.
3. Query Processing:

o Parse and interpret user queries.


4. Matching:
o Compare queries with indexed data using models like vector space or
probabilistic models.
5. Ranking:
o Rank results based on relevance.
6. Retrieval:

o Return the most relevant results to the user.


Components:
1. Document Collection:
o Source of data to be retrieved.

2. Indexing Engine:
o Creates indices for efficient search.
3. Query Processor:
o Translates user input into a machine-readable format.
4. Ranking Algorithm:

o Determines the relevance of documents to the query.


5. Search Interface:
o Provides a user-friendly interface for queries and results.
