IR Workbook Answers
3. Link analysis:
o Abnormal backlink patterns (e.g., link farms).
o Excessive outbound links to unrelated domains.
4. User signals:
o Negative feedback (e.g., frequent reports as spam).
V(n) = k ⋅ n^β
Where:
V(n): Vocabulary size after processing n words.
k: Constant depending on the corpus.
β: Typically between 0.4 and 0.6, indicating sublinear growth.
Implications:
As more text is added, the vocabulary size grows, but at a diminishing rate.
Used in information retrieval to estimate dictionary size for indexing.
Example: In a collection of news articles, adding more articles will introduce new terms, but
most will be repeats of common words.
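A minimal numeric sketch of Heaps' law; k = 44 and β = 0.49 are illustrative values of the kind reported for news corpora, not constants of the law itself:

def heaps_vocabulary(n_tokens, k=44.0, beta=0.49):
    """Estimate V(n) = k * n**beta, the vocabulary size after n tokens."""
    return round(k * n_tokens ** beta)

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens -> ~{heaps_vocabulary(n):>6} distinct terms")
# Each 10x growth in tokens only roughly triples the vocabulary
# (10**0.49 ≈ 3.1), which is the diminishing growth described above.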
3. Personalization:
o Factors like click-through rate (CTR), dwell time, and past searches.
4. Semantic matching:
o Expands queries to include synonyms or related terms (e.g., "movie" matched
with "film").
5. Content freshness:
o Prioritizes recent content for time-sensitive queries (e.g., news).
Example: For the query "best smartphone 2024," relevance scoring considers text in product
reviews, their backlinks, and recent updates to ensure top results reflect the latest models.
Types:
o User-based CF: Finds users with similar preferences.
o Item-based CF: Finds items often interacted with together.
Process:
o Build a user-item interaction matrix.
o Compute similarity (e.g., cosine similarity).
o Recommend based on neighbours’ preferences.
Example:
In Netflix, if User A likes movies "X" and "Y," and User B likes "Y" and "Z," the system might
recommend "Z" to User A, assuming similar tastes.
10) Write the pros and cons of using classification algorithms over clustering approaches in
text mining.
Example:
Classification: Categorizing emails into "spam" or "not spam."
13) Write short notes on Latent Semantic Indexing (LSI) with a clear example.
LSI is a technique that reduces the dimensionality of a term-document matrix to uncover
hidden relationships between terms and documents.
Process:
1. Build a term-document matrix.
2. Apply Singular Value Decomposition (SVD) to the matrix.
3. Keep only the top singular values to obtain a reduced latent space.
Example:
For documents about "artificial intelligence" and "machine learning":
Original terms: "AI," "artificial," "learning," "machine."
LSI maps these terms into latent topics (e.g., "technology").
Advantage: Handles synonymy (different words with the same meaning) and polysemy
(same word with different meanings).
14) Describe in detail about Web Crawling with a neat architecture diagram.
Web crawling is the process of systematically browsing the web to collect and store data.
Process:
1. Seed URLs: Start with an initial list of URLs.
2. Fetching: Download the pages at those URLs.
3. Parsing: Extract content and new links from each fetched page.
4. Repeat: Add newly discovered URLs to the queue and continue until a stopping condition is met.
Example: Search engines like Google use web crawlers to build their indexes.
Diagram:
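A minimal Python sketch of the crawl loop described above (seed queue, fetch, parse, enqueue new links); the naive href regex and the absence of robots.txt checks and politeness delays are simplifications that a real crawler would not make:

import re
import urllib.request
from collections import deque

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)           # URL frontier, seeded with start URLs
    seen = set(seeds)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                  # skip unreachable pages
        pages[url] = html             # "store" step (index/archive in practice)
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:      # de-duplicate before enqueueing
                seen.add(link)
                frontier.append(link)
    return pages

# Example (hypothetical seed): crawl(["https://example.com"])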
15) Explain the vector space model in XML retrieval. Write in detail about the evaluation
of the data retrieved.
Vector Space Model (VSM):
Represents documents and queries as vectors in a high-dimensional space.
Documents are scored based on their cosine similarity to the query vector.
Application to XML Retrieval:
XML documents have hierarchical structures.
The VSM is adapted to index and retrieve specific XML elements or attributes based
on user queries.
Steps:
1. Parse XML documents and index each term together with the element in which it occurs.
2. Represent queries and elements as term vectors and rank elements by cosine similarity.
Components of an Inverted Index:
1. Dictionary: Stores every unique term in the collection.
2. Posting Lists: For each term, stores document IDs and possibly additional information (e.g., term frequency, positions).
Steps to Create an Inverted Index:
1. Tokenization: Break documents into terms.
2. Normalization: Lowercase terms, remove punctuation, etc.
3. Indexing: Create mappings of terms to document lists.
Example:
For a collection where "orange" appears only in Doc 2, the index contains the entry:
orange → Doc2
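A minimal Python sketch of the three steps above (tokenize, normalize, index):

import re
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Tokenize, normalize, and map each term to a sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z]+", text.lower()):  # tokenize + lowercase
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Apple banana.", 2: "Banana, orange!"}
print(build_inverted_index(docs))
# {'apple': [1], 'banana': [1, 2], 'orange': [2]}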
Example:
For the input "apple banana apple":
Mapper Output: (apple, 1), (banana, 1), (apple, 1).
Reducer Output: (apple, 2), (banana, 1).
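A single-process Python sketch simulating the map and reduce phases for the input above (a real MapReduce framework distributes these steps across machines):

from itertools import groupby
from operator import itemgetter

def mapper(text):
    return [(word, 1) for word in text.split()]   # (apple,1),(banana,1),(apple,1)

def reducer(pairs):
    pairs.sort(key=itemgetter(0))                  # stands in for shuffle/sort
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

print(reducer(mapper("apple banana apple")))
# {'apple': 2, 'banana': 1}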
19) Explain the use of snippet generation, summarization, and question answering in web
search and retrieval.
1. Snippet Generation:
20) Explain in detail the steps involved in finding the root node and its leaf nodes through
information gain value using a decision tree.
Steps:
1. Compute Entropy:
o Measure the disorder or impurity of the dataset.
H(S) = −∑ p(i) log₂ p(i)
Where p(i) is the probability of each class.
2. Calculate Information Gain:
o IG(S, A) = H(S) − ∑ (|Sᵥ|/|S|) ⋅ H(Sᵥ), the reduction in entropy obtained by splitting on attribute A.
3. Select the Root: Choose the attribute with the highest information gain as the root node.
4. Recurse: Repeat the process on each branch; nodes that become pure (a single class) form the leaf nodes.
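A minimal Python sketch computing entropy and information gain on an illustrative toy split (the attribute and labels are made up for the example):

import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p(i) * log2 p(i) over the class probabilities."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    """IG = H(S) minus the weighted entropy of the subsets after splitting."""
    n = len(labels)
    split = {}
    for value, label in zip((r[attr] for r in rows), labels):
        split.setdefault(value, []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in split.values())
    return entropy(labels) - remainder

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, "outlook", labels))  # 1.0: a perfect split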
22) Design a simple sentiment analysis system considering a database of your choice with
minimal records. Sketch the architecture and highlight the functions of the modules
identified.
Architecture:
1. Data Collection Module:
o Collects user reviews or social media posts.
2. Preprocessing Module:
o Cleans the data: removes punctuation and stop words, lowercases text, and tokenizes it.
3. Sentiment Classification Module:
o Assigns each record a polarity label (positive, negative, or neutral) using a lexicon or a trained classifier.
4. Output Module:
o Aggregates the labels and reports overall sentiment.
Data:
"I love this product." (Positive)
"This is the worst experience ever." (Negative)
"The service was okay, not great." (Neutral)
System Flow:
Raw text → Preprocessing → Classification → Sentiment label (see the sketch below).
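A minimal lexicon-based Python sketch over the three records above; the word lists are illustrative stand-ins for a real sentiment lexicon:

POSITIVE = {"love", "excellent", "good"}
NEGATIVE = {"worst", "bad", "terrible"}

def preprocess(text: str) -> list[str]:
    # cleaning step: strip punctuation and lowercase
    return [w.strip(".,!?").lower() for w in text.split()]

def classify(text: str) -> str:
    tokens = preprocess(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

reviews = ["I love this product.",
           "This is the worst experience ever.",
           "The service was okay, not great."]
for r in reviews:
    print(classify(r), "-", r)
# A toy lexicon like this cannot handle negation ("not great");
# a trained classifier would be used in practice.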
23) Demonstrate the use of K-means clustering algorithm in grouping different documents.
Assume a simple database with 3 types of documents. State the assumptions made.
Steps for K-Means Clustering:
1. Initialization: Choose K=3 clusters. Randomly initialize centroids.
2. Assignment: Assign each document to the nearest cluster based on cosine similarity
or Euclidean distance.
3. Update: Recalculate centroids by averaging feature vectors of assigned documents.
4. Repeat: Iterate until clusters stabilize.
Database Example:
Documents:
1. "AI and machine learning techniques."
2. "Deep learning and neural networks."
3. "Cooking recipes for beginners."
4. "Best travel destinations in 2024."
Assumptions:
Text is represented as numerical vectors (e.g., TF-IDF).
The number of clusters (K) is known beforehand.
Result: Documents grouped into categories based on content similarity.
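A minimal sketch using scikit-learn (assumed available) on the four documents above; exact cluster labels vary with initialization:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["AI and machine learning techniques.",
        "Deep learning and neural networks.",
        "Cooking recipes for beginners.",
        "Best travel destinations in 2024."]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # docs -> vectors
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # K fixed at 3

for doc, label in zip(docs, kmeans.labels_):
    print(label, doc)
# Expect the two machine-learning documents to share a cluster, with the
# cooking and travel documents landing in the remaining clusters.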
How It Works:
1. A peer issues a query.
2. The query propagates through the network using techniques like flooding or
distributed hash tables (DHT).
3. Peers with matching content respond directly.
Example:
BitTorrent protocol for file sharing uses P2P to search and retrieve files.
Advantages:
o No central point of failure.
o Scales naturally as peers join the network.
Evaluation Metrics:
1. Precision:
o Proportion of retrieved documents that are relevant.
Precision = Relevant Retrieved Documents / Total Retrieved Documents
2. Recall:
o Proportion of relevant documents retrieved out of all relevant documents.
Recall = Relevant Retrieved Documents / Total Relevant Documents
7. Index Coverage:
o Percentage of web pages indexed by the search engine.
8. Diversity:
o Variety in the types of results returned.
9. User Satisfaction:
o Measured through surveys or implicit feedback.
26) Define the term Stemming.
Stemming is the process of reducing words to their root or base form by removing suffixes
or prefixes. It is used in text preprocessing to standardize words and reduce redundancy in
information retrieval systems.
Example:
The words "connected," "connecting," and "connection" all reduce to the stem "connect."
5. Data Archiving:
o Preserve snapshots of web pages (e.g., Wayback Machine).
6. Monitoring:
o Detect changes in websites, such as updates to regulations or policies.
7. E-commerce:
o Track competitor prices and product availability.
1. On-Page SEO:
o Optimizing content, meta tags, and keywords.
o Ensuring mobile-friendliness and fast loading times.
2. Off-Page SEO:
o Building backlinks from reputable websites.
3. Technical SEO:
o Improving crawlability, URL structure, and site architecture.
4. Content Quality:
o Producing relevant, high-quality content to engage users.
Benefits:
Increased traffic.
Improved user experience.
Item-Based Collaborative Filtering:
1. Compute similarities between items from user rating patterns.
2. Predict ratings for unseen items based on a user's ratings of similar items.
Example:
In an e-commerce setting:
A user who purchased Phone A is recommended Phone B because many users who
bought Phone A also bought Phone B.
Advantages:
Effective when user behaviour data is sparse.
3. Unlinked Pages: Pages with no inbound links are not crawled by search engines.
4. Technical Barriers: Content requiring special formats (e.g., Flash) or located in
databases.
Examples:
Online banking portals.
Medical records.
Government databases.
The Invisible Web contains valuable information, especially for research and specialized
queries.
2. Dunn Index: Ratio of the smallest inter-cluster distance to the largest intra-cluster
distance.
1. Crawler Module:
o Gathers pages from the web for indexing.
2. Indexing Module:
o Creates an index for efficient document retrieval.
3. Query Processor:
o Processes user queries to match them with indexed documents.
4. Retrieval Module:
o Fetches the documents that match the processed query.
5. Ranking Algorithm:
o Uses factors like relevance, popularity, and authority to rank results.
6. Database/Storage:
o Stores the web pages, metadata, and indexes.
1. Scale and Dynamism:
o The web is vastly larger, more heterogeneous, and more dynamic than traditional document collections.
2. Hyperlinked Structure:
o Links between documents provide new ways to evaluate relevance (e.g.,
PageRank).
3. User Interaction:
o Personalization and query expansion based on user behaviour.
4. Challenges:
o Spam, duplicated content, and ensuring real-time updates.
5. Multimedia Retrieval:
o Necessity to process images, videos, and other non-textual content.
6. Global Access:
o IR systems cater to multilingual and geographically distributed users.
37) Explain in detail about binary independence model for Probability Ranking Principle
(PRP).
The Binary Independence Model (BIM) is a probabilistic IR model based on the Probability
Ranking Principle (PRP), which ranks documents by their likelihood of relevance to a query.
Assumptions:
1. Binary relevance: A document is either relevant or not.
2. Independence: Terms in a document occur independently.
Key Formula:
The probability of a document D being relevant is:
P(R|D) = P(R) ⋅ P(D|R) / P(D)
Steps:
1. Compute probabilities for relevant and non-relevant terms.
2. Rank documents by their log-odds of relevance, computed as a sum of per-term weights (see the sketch after the limitations).
Limitations:
Assumes independence, which may not hold in real-world text.
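A minimal Python sketch of this scoring; the term probabilities p_t = P(term | relevant) and u_t = P(term | non-relevant) below are illustrative values that would normally be estimated from relevance judgements:

import math

# BIM retrieval status value: RSV(D) = sum, over query terms present in D,
# of log[ p_t (1 - u_t) / ( u_t (1 - p_t) ) ].
term_stats = {"smartphone": (0.8, 0.3), "review": (0.6, 0.4)}

def rsv(doc_terms: set[str], query_terms: list[str]) -> float:
    score = 0.0
    for t in query_terms:
        if t in doc_terms and t in term_stats:
            p, u = term_stats[t]
            score += math.log(p * (1 - u) / (u * (1 - p)))
    return score

print(rsv({"smartphone", "review", "price"}, ["smartphone", "review"]))
# ≈ 3.04: documents containing more high-weight query terms rank higher.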
1. Term-Document Matrix:
o Represents the frequency of terms in documents.
2. Singular Value Decomposition (SVD):
o Decomposes the matrix into three smaller matrices: A = U Σ Vᵀ
3. Dimensionality Reduction:
o Keeps only the top k singular values, projecting terms and documents into a lower-dimensional latent topic space.
Advantages:
Handles synonymy and polysemy.
Improves retrieval accuracy.
Example:
For terms like car and automobile, LSI identifies their similarity, improving search results for
related queries.
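A minimal NumPy sketch of the SVD step: in the toy counts below, "car" and "automobile" never co-occur, yet both co-occur with "engine", so LSI aligns them in the reduced space (the matrix entries are illustrative):

import numpy as np

# Term-document matrix: rows = terms, columns = 3 tiny documents.
terms = ["car", "automobile", "engine", "banana"]
A = np.array([[1.0, 0.0, 0.0],    # car: doc 1 only
              [0.0, 1.0, 0.0],    # automobile: doc 2 only
              [1.0, 1.0, 0.0],    # engine: shared by docs 1 and 2
              [0.0, 0.0, 2.0]])   # banana: appears twice in doc 3

U, S, Vt = np.linalg.svd(A, full_matrices=False)  # A = U Σ V^T
k = 2                                             # keep top-k latent topics
terms_k = U[:, :k] * S[:k]                        # terms in latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(A[0], A[1]))                    # 0.0: no shared document in raw space
print(cos(terms_k[0], terms_k[1]))        # 1.0: synonyms aligned after SVD
print(cos(terms_k[0], terms_k[3]))        # 0.0: "banana" stays unrelated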
39) How do we process a query using an inverted index and the basic Boolean Retrieval
model?
Inverted Index:
An inverted index maps terms to the list of documents containing those terms.
Example:
For documents:
Doc 1: "apple banana"
Doc 2: "banana cherry"
Index:
o apple: [1]
o banana: [1, 2]
o cherry: [2]
Query Processing:
1. Query: apple AND banana
2. Retrieve Posting Lists:
o apple: [1]
o banana: [1, 2]
3. Intersect Lists:
o Result: [1]
Advantages:
Efficient for Boolean queries.
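A minimal Python sketch of the sorted-list intersection used for the AND query above:

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge two sorted posting lists for an AND query in O(len(p1)+len(p2))."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

index = {"apple": [1], "banana": [1, 2], "cherry": [2]}
print(intersect(index["apple"], index["banana"]))  # [1] -> "apple AND banana"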
40) Describe about how to estimate the query generation probability for query likelihood
model.
In the Query Likelihood Model, the probability P(Q|D) estimates how likely a document D is
to generate the query Q.
Steps:
1. Unigram Language Model:
o Each document is modelled as a probability distribution over terms.
2. Query Probability:
o P(Q|D) = ∏ P(q|D) over the query terms q, assuming term independence.
3. Smoothing:
o Mix document statistics with collection statistics so that unseen query terms do not zero out the score.
Advantages:
The model incorporates term frequency and handles probabilistic ranking.
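A minimal Python sketch with linear-interpolation (Jelinek-Mercer) smoothing; the smoothing weight and the toy texts are illustrative choices, not part of the model itself:

from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    d, c = Counter(doc.split()), Counter(collection.split())
    d_len, c_len = sum(d.values()), sum(c.values())
    p = 1.0
    for term in query.split():
        p_doc = d[term] / d_len          # P(term | document model)
        p_col = c[term] / c_len          # collection back-off for unseen terms
        p *= lam * p_doc + (1 - lam) * p_col
    return p

doc = "the best smartphone review"
collection = "the best smartphone review the camera phone news"
print(query_likelihood("smartphone review", doc, collection))  # ≈ 0.035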
1. Tokenization:
o Split the document into smaller tokens (e.g., words or phrases).
2. Hashing:
o Apply a hashing function to convert tokens into numerical representations.
3. Fingerprint Generation:
o Use a method like MinHash or SimHash to create a compact fingerprint of the
document.
4. Comparison:
o Compare fingerprints using similarity metrics such as Jaccard similarity or
Hamming distance.
Applications:
Detecting plagiarism.
Eliminating duplicate content in search engine indexes.
Identifying similar documents in large datasets.
Advantages:
Fast and scalable for large datasets.
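A minimal MinHash sketch in Python; whitespace tokenization, 64 salted hash functions, and the use of MD5 as the hash are all illustrative choices:

import hashlib

def _h(token: str, seed: int) -> int:
    # salted hash: a different seed simulates a different hash function
    return int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)

def minhash(tokens: set[str], k: int = 64) -> list[int]:
    # signature: the minimum hash of the token set under each hash function
    return [min(_h(t, seed) for t in tokens) for seed in range(k)]

def estimate_jaccard(sig1, sig2) -> float:
    # fraction of matching signature positions estimates Jaccard similarity
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

d1 = set("the quick brown fox jumps over the lazy dog".split())
d2 = set("the quick brown fox leaps over a lazy dog".split())
print(estimate_jaccard(minhash(d1), minhash(d2)))
# prints an estimate near the true Jaccard similarity of 0.7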
4. Extensibility:
o Can be customized for specific tasks like multimedia crawling.
Architecture:
1. URL Frontier:
o Maintains the queue of URLs waiting to be fetched.
2. Fetcher:
o Downloads pages from the web.
3. Parser:
o Extracts text and outgoing links from fetched pages.
4. Indexer:
o Stores extracted data for future retrieval.
5. Scheduler:
o Determines the order in which URLs are fetched.
44) Explain in detail about Vector Space Model for XML Retrieval.
The Vector Space Model (VSM) is adapted for XML retrieval to account for structured
data.
Key Concepts:
1. Document Representation:
o Represent XML documents as vectors of terms, considering tags and
attributes.
2. Query Representation:
o Represent queries as vectors, with terms and structure matching XML tags.
3. Similarity Measurement:
o Use cosine similarity to compute the relevance of documents to a query.
Process:
1. Parse XML documents and create a term-vector matrix.
2. Parse user queries, mapping terms to corresponding XML structures.
3. Compute cosine similarity between query and element vectors, and rank the results.
1. Authority:
o Pages that are highly referenced by others.
2. Hub:
o Pages that reference many other relevant pages.
Steps:
1. Initialize every page's hub and authority scores to 1.
2. Update each authority score as the sum of the hub scores of the pages linking to it, and each hub score as the sum of the authority scores of the pages it links to.
3. Normalize the scores and repeat until they converge (see the sketch below).
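A minimal Python sketch of these updates on an illustrative four-page link graph:

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
auth = {p: 1.0 for p in pages}
hub = {p: 1.0 for p in pages}

for _ in range(20):                        # iterate until scores stabilize
    for p in pages:                        # authority = sum of in-link hub scores
        auth[p] = sum(hub[q] for q in pages if p in links[q])
    for p in pages:                        # hub = sum of out-link authority scores
        hub[p] = sum(auth[q] for q in links[p])
    norm_a = sum(v * v for v in auth.values()) ** 0.5
    norm_h = sum(v * v for v in hub.values()) ** 0.5
    auth = {p: v / norm_a for p, v in auth.items()}
    hub = {p: v / norm_h for p, v in hub.items()}

print(max(auth, key=auth.get), "is the top authority")  # C: most referenced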
4. Search Module:
o Allows users to search for existing questions.
Examples:
Quora, Stack Overflow.
Advantages:
o Requires no knowledge of item content.
o Can surface serendipitous recommendations from like-minded users.
Disadvantages:
Cold-start problem (new users or items).
Scalability issues with large datasets.
Example:
A user who likes mystery novels is recommended books with similar genres.
Advantages:
Handles new users if sufficient item data is available.
Transparent recommendations based on item features.
Disadvantages:
o Over-specialization: recommendations stay close to what the user already likes.
o Requires rich item metadata.
Key Features:
1. User History:
o Leverages past queries and interactions.
2. Location-Based Results:
o Recommends results based on geographic location.
3. Demographics:
o Incorporates age, gender, and preferences.
4. Device-Specific Optimization:
o Adapts results to the user’s device type.
Example:
A user searching for "restaurants" in New York gets location-specific results based on
prior preferences.
Challenges:
Privacy concerns.
Balancing personalization with result diversity.
Advantages:
Simplifies computations with binary data.
Suitable for scenarios with sparse data.
Limitations:
Loses term frequency information.
Applications:
Document clustering.
Image compression.
Example:
Clustering articles into categories like sports, politics, and technology.
Advantages:
Simple and scalable.
Works well with well-separated clusters.
Limitations:
Sensitive to initial centroid placement.
Struggles with non-spherical clusters.
Steps:
1. Expectation Step (E-Step):
o Estimate the missing data using current parameter values.
2. Maximization Step (M-Step):
o Update parameter estimates to maximize the likelihood function.
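A minimal Python sketch of EM fitting the means of two 1-D Gaussians with hidden component labels; fixing unit variances and equal mixing weights is a simplifying assumption for the example:

import math
import random

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(5, 1) for _ in range(200)]

def pdf(x, mu):
    # unnormalized Gaussian density with unit variance
    return math.exp(-0.5 * (x - mu) ** 2)

mu1, mu2 = -1.0, 1.0                      # rough initial guesses
for _ in range(50):
    # E-step: responsibility of component 1 for each point
    r = [pdf(x, mu1) / (pdf(x, mu1) + pdf(x, mu2)) for x in data]
    # M-step: re-estimate each mean as a responsibility-weighted average
    mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / (len(data) - sum(r))

print(round(mu1, 2), round(mu2, 2))       # converges near the true means 0 and 5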
53) Consider a web graph with three nodes 1, 2, and 3. The links are as follows: 1 → 2,
3 → 2, 2 → 1, 2 → 3. Write down the transition probability matrices for the surfer's
walk with teleporting, for the teleport probability a = 0.5, and compute the
PageRank.
Transition Probability Matrix with Teleporting:
For a = 0.5:
M = 0.5 ⋅ A + 0.5 ⋅ T
A: Link-based transition matrix.
T: Teleportation matrix (uniform probability for all nodes).
Steps to Compute PageRank:
1. Formulate M.
2. Start with an initial PageRank vector.
3. Iteratively update: PR = M ⋅ PR
4. Normalize the vector after convergence.
Result:
PR: Final rank scores for each node.
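A minimal NumPy sketch for this exact graph, using the row-vector convention PR ← PR ⋅ M with M row-stochastic; A and T follow directly from the link structure stated in the question:

import numpy as np

# Graph: 1 -> 2, 3 -> 2, 2 -> 1, 2 -> 3; teleport probability a = 0.5.
A = np.array([[0.0, 1.0, 0.0],     # node 1 links only to node 2
              [0.5, 0.0, 0.5],     # node 2 splits its weight between 1 and 3
              [0.0, 1.0, 0.0]])    # node 3 links only to node 2
T = np.full((3, 3), 1 / 3)         # teleportation: uniform jump to any node
M = 0.5 * A + 0.5 * T              # combined transition matrix

pr = np.full(3, 1 / 3)             # start from the uniform distribution
for _ in range(100):
    pr = pr @ M                    # one step of the random surfer

print(M.round(4))
print(pr.round(4))                 # [0.2778 0.4444 0.2778] = (5/18, 4/9, 5/18)
# Node 2 ranks highest, as expected: every other node links to it.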
54) How do the various nodes of a distributed crawler communicate and share URLs?
Distributed crawlers handle large-scale web crawling tasks by splitting the workload
across multiple nodes. Communication and URL sharing among nodes are critical for
efficient crawling.
Mechanisms for Communication:
1. Message Queues:
o Nodes communicate using message queues to share URLs and status updates.
2. Central Coordination:
o A central server assigns URLs to crawler nodes and maintains a global URL
queue.
3. Decentralized Communication:
o Peer-to-peer communication ensures that nodes exchange URLs directly
without central coordination.
URL Sharing Strategies:
1. Partitioning by Domain:
o Each node is assigned specific domains or subdomains to avoid overlap.
2. Hash-Based Partitioning:
o URLs are hashed, and the hash value determines the node responsible for
crawling.
3. URL Deduplication:
o A shared record of already-seen URLs (e.g., a Bloom filter) prevents the same page from being fetched by multiple nodes (see the sketch below).
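A minimal Python sketch of hash-based partitioning; the cluster size is illustrative, and hashing the host rather than the full URL keeps each site, and hence its politeness and de-duplication state, on a single node:

import hashlib
from urllib.parse import urlparse

NUM_NODES = 4   # illustrative cluster size

def node_for(url: str) -> int:
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES   # hash value picks the responsible node

for url in ["https://example.com/a", "https://example.com/b",
            "https://example.org/x"]:
    print(node_for(url), url)
# Both example.com URLs map to the same node, so no central
# coordinator is needed to decide who crawls what.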
56) Explain the process of Information Retrieval and the components involved in it
with a neat architecture.
Information Retrieval (IR) is the process of obtaining relevant information from a large
repository based on user queries.
Process:
1. Crawling:
o Collect data from sources like the web or databases.
2. Indexing:
o Organize data into searchable structures, such as inverted indices.
3. Query Processing:
o Interpret the user query and match it against the index.
4. Ranking:
o Order matching documents by estimated relevance and return the top results.
Components:
1. Crawler:
o Gathers documents from the web or other sources.
2. Indexing Engine:
o Creates indices for efficient search.
3. Query Processor:
o Translates user input into a machine-readable format.
4. Ranking Algorithm: