IR qb1
Unit 1
1. Define k-gram indexing and explain its significance in Information Retrieval systems.
K-Gram Indexing is a technique used in Information Retrieval (IR) to enhance text searching by
breaking words into overlapping sequences of k contiguous characters, called k-grams.
For example, if k = 3 (3-gram), the word "search" (written with boundary markers as ^search$) is broken into the following 3-grams: ^se, sea, ear, arc, rch, ch$.
A k-gram index is a data structure that maps these k-grams to words that contain them.
2. Describe the process of constructing a k-gram index. Highlight the key steps involved and
the data structures used.
Constructing a k-gram index involves breaking words into k-grams and storing them efficiently for
fast retrieval. The process includes the following key steps:
1. Tokenization of Text Corpus
Extract words from the text collection (documents, database, or query inputs).
Convert words to lowercase for case-insensitive matching.
Remove punctuation and special characters if needed.
2. Generating K-Grams
Each word is split into overlapping k-grams, using ^ and $ as start and end markers, and every k-gram is mapped to the set of words that contain it. For the words "search", "seeking", and "seam" with k = 3, the resulting index is:
{
"^se" : {"search", "seeking", "seam"},
"sea" : {"search", "seam"},
"ear" : {"search"},
"arc" : {"search"},
"rch" : {"search"},
"ch$" : {"search"},
"see" : {"seeking"},
"eek" : {"seeking"},
"eki" : {"seeking"},
"kin" : {"seeking"},
"ing$" : {"seeking"},
"eam" : {"seam"},
"am$" : {"seam"}
}
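A minimal Python sketch of this construction (the ^ and $ boundary markers follow the example above; the function names are illustrative):

def kgrams(word, k=3):
    """Generate the k-grams of a word with ^ and $ boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(vocabulary, k=3):
    """Map each k-gram to the set of vocabulary words that contain it."""
    index = {}
    for word in vocabulary:
        for gram in kgrams(word, k):
            index.setdefault(gram, set()).add(word)
    return index

index = build_kgram_index(["search", "seeking", "seam"])
print(index["^se"])   # {'search', 'seeking', 'seam'}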
3. Explain how wildcard queries are handled in k-gram indexing. Discuss the challenges
associated with wildcard queries and potential solutions.
A wildcard query such as "se*ch" is answered by extracting the k-grams of the query that do not span the wildcard (here ^se and ch$), intersecting their posting lists in the k-gram index to obtain candidate words, and post-filtering the candidates against the original pattern. The main challenges are:
1. Prefix/Suffix Wildcards
o Queries such as "*tion" or "se*" constrain only one end of the word, so few useful k-grams are available for the lookup.
Solution:
o Use optimized indexing techniques like suffix tries or suffix arrays for prefix/suffix-
based lookups.
2. Excessive Candidate Words
o Queries like "a*" may return thousands of results.
Solution:
o Post-filter candidate words against the original wildcard pattern and limit or rank the candidates before presenting them.
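A minimal sketch of this lookup, reusing the kgrams/build_kgram_index helpers from the earlier sketch; fnmatch from the standard library is used only for the final post-filtering step, and the function name is illustrative:

import fnmatch

def wildcard_lookup(query, index, k=3):
    """Answer a wildcard query (e.g. 'se*ch') against a k-gram index."""
    padded = "^" + query.lower() + "$"
    grams = []
    for piece in padded.split("*"):            # k-grams that do not span the wildcard
        grams += [piece[i:i + k] for i in range(len(piece) - k + 1)]
    candidates = None
    for gram in grams:                         # intersect posting sets
        postings = index.get(gram, set())
        candidates = postings if candidates is None else candidates & postings
    candidates = candidates or set()
    # Post-filter to drop words that merely share the k-grams
    return {w for w in candidates if fnmatch.fnmatch(w, query.lower())}

# Using the index built in the previous sketch:
# wildcard_lookup("se*ch", index)  ->  {'search'}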
Average Precision (AP) is a metric used to evaluate the effectiveness of an Information Retrieval
(IR) system by measuring how well relevant documents are ranked within search results. It
calculates the mean precision at different recall levels for a single query.
Example Calculation
Suppose a search engine retrieves 5 documents for a query. The relevant documents are marked
with ✅, and non-relevant ones are marked ❌.
Step 1: Compute precision at each rank where a relevant document appears
Rank 1: ✅ relevant → Precision = 1/1 = 1.00
Rank 2: ❌ not relevant
Rank 3: ✅ relevant → Precision = 2/3 ≈ 0.67
Rank 4: ✅ relevant → Precision = 3/4 = 0.75
Rank 5: ❌ not relevant
Step 2: Compute AP
AP = (1.00 + 0.67 + 0.75) / 3 = 2.42 / 3 ≈ 0.807
This means that, on average, the search system retrieves relevant documents with 80.7% precision
across different ranks.
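A small sketch of the same calculation (the tiny difference from 0.807 comes from using unrounded precisions):

def average_precision(relevance):
    """AP for one ranked list; `relevance` is a list of booleans in rank order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant documents at ranks 1, 3 and 4, as in the example above
print(round(average_precision([True, False, True, True, False]), 3))   # 0.806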
Significance of Average Precision
AP rewards systems that place relevant documents near the top of the ranking, and averaging it over a set of queries gives Mean Average Precision (MAP), one of the standard measures for comparing IR systems.
Relevance judgments are the process of assessing whether a retrieved document is relevant to a
given query in an Information Retrieval (IR) system. These judgments are made by human
assessors or automated techniques and serve as the ground truth for evaluating search
performance.
A posting list (or inverted index) maps each term to a list of documents where the term appears.
Based on the given document-term matrix:
Document Terms
Doc1 cat, dog, fish
Doc2 cat, bird, fish
Doc3 dog, bird, elephant
Doc4 cat, dog, elephant
The resulting posting lists are:
cat → Doc1, Doc2, Doc4
dog → Doc1, Doc3, Doc4
fish → Doc1, Doc2
bird → Doc2, Doc3
elephant → Doc3, Doc4
A posting list (or inverted index) associates each term with a list of documents that contain it.
Document Terms
Doc1 apple, banana, grape
Doc2 apple, grape, orange
Doc3 banana, orange, pear
Doc4 apple, grape, pear
The resulting posting lists are:
apple → Doc1, Doc2, Doc4
banana → Doc1, Doc3
grape → Doc1, Doc2, Doc4
orange → Doc2, Doc3
pear → Doc3, Doc4
Using the inverted index for the cat/dog/fish collection, we create a Term-Document Matrix (binary representation, where 1 indicates the presence of a term in a document, and 0 indicates absence):
Term Doc1 Doc2 Doc3 Doc4
cat 1 1 0 1
dog 1 0 1 1
fish 1 1 0 0
bird 0 1 1 0
elephant 0 0 1 1
Using the Boolean Retrieval Model, we perform an AND operation between the rows for cat (1, 1, 0, 1) and fish (1, 1, 0, 0), which gives (1, 1, 0, 0).
Final Answer
Using the Boolean Retrieval Model, the documents that contain both "cat" and "fish" are:
Doc1, Doc2
This allows efficient retrieval of relevant documents in Information Retrieval (IR) systems using
Boolean queries. 🎯
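A small Python illustration of the same Boolean AND over posting lists (the document sets are taken from the matrix above):

docs = {
    "Doc1": {"cat", "dog", "fish"},
    "Doc2": {"cat", "bird", "fish"},
    "Doc3": {"dog", "bird", "elephant"},
    "Doc4": {"cat", "dog", "elephant"},
}

# Inverted index: term -> set of documents containing it
inverted = {}
for doc, terms in docs.items():
    for term in terms:
        inverted.setdefault(term, set()).add(doc)

def boolean_and(term1, term2):
    """Intersect the posting lists of two terms (Boolean AND)."""
    return sorted(inverted.get(term1, set()) & inverted.get(term2, set()))

print(boolean_and("cat", "fish"))   # ['Doc1', 'Doc2']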
We have a term-document matrix and total term counts for each document:
Term Doc1 Doc2 Doc3 Doc4
cat 15 28 0 0
dog 18 0 32 25
fish 11 19 13 0
Doc1 = 48
Doc2 = 85
Doc3 = 74
Doc4 = 30
TF-IDF = TF × IDF, where TF = (term count in the document) / (total number of terms in the document) and IDF = log(N / DF), with N the number of documents and DF the number of documents containing the term.
These TF-IDF scores represent the importance of each term in each document, helping in ranking
and retrieval in Information Retrieval (IR) systems. 🎯
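The per-cell scores are not listed above, so as an illustration, here is how they could be computed from the counts and document lengths given; a base-10 logarithm (IDF = log10(N/DF)) is assumed, since the convention is not stated:

import math

counts = {               # raw term counts per document
    "cat":  {"Doc1": 15, "Doc2": 28, "Doc3": 0,  "Doc4": 0},
    "dog":  {"Doc1": 18, "Doc2": 0,  "Doc3": 32, "Doc4": 25},
    "fish": {"Doc1": 11, "Doc2": 19, "Doc3": 13, "Doc4": 0},
}
totals = {"Doc1": 48, "Doc2": 85, "Doc3": 74, "Doc4": 30}
N = len(totals)

tfidf = {}
for term, per_doc in counts.items():
    df = sum(1 for c in per_doc.values() if c > 0)   # document frequency
    idf = math.log10(N / df)                          # base-10 log assumed
    for doc, c in per_doc.items():
        tfidf[(term, doc)] = round((c / totals[doc]) * idf, 4)

print(tfidf[("cat", "Doc1")])   # 0.0941 with this choice of log base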
10. Given the term-document matrix and the TF-IDF scores calculated from Problem 4,
calculate the cosine similarity between each pair of documents (Doc1, Doc2), (Doc1,
Doc3), (Doc1, Doc4), (Doc2, Doc3), (Doc2,Doc4), and (Doc3, Doc4).
The cosine similarity between two vectors A and B is:
cos(A, B) = (A · B) / (||A|| × ||B||)
Where:
A · B is the dot product of vectors A and B.
||A|| is the magnitude (Euclidean norm) of vector A, calculated as the square root of the sum of its squared components.
Interpretation
Doc1 and Doc2 (0.803) → High similarity, meaning they contain overlapping terms.
Doc1 and Doc3 (0.690) → Moderate similarity.
Doc1 and Doc4 (0.597) → Some similarity.
Doc2 and Doc3 (0.160) → Low similarity.
Doc2 and Doc4 (0.000) → No similarity (no shared terms).
Doc3 and Doc4 (0.926) → Very high similarity, meaning they are very close in content.
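For reference, a small cosine-similarity sketch; it uses the raw (cat, dog, fish) counts of Doc1 and Doc2 purely for illustration, whereas the figures above come from the TF-IDF weighted vectors of Problem 4:

import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

doc1, doc2 = [15, 18, 11], [28, 0, 19]   # raw counts, for illustration only
print(round(cosine(doc1, doc2), 3))      # ≈ 0.72 (≈ 0.80 with TF-IDF weights)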
11. Consider the following queries expressed in terms of TF-IDF weighted vectors:
Query1: cat: 0.5, dog: 0.5, fish: 0
Query2: cat: 0, dog: 0.5, fish: 0.5
Calculate the cosine similarity between each query and each document from the term-
document matrix in Problem 4.
The cosine similarity formula cos(Q, D) = (Q · D) / (||Q|| × ||D||) is applied to each query-document pair.
Query Vectors: Q1 = (0.5, 0.5, 0), Q2 = (0, 0.5, 0.5) over the terms (cat, dog, fish).
Document Vectors: the TF-IDF weighted document vectors from Problem 4.
Cosine similarity of each query with each document:
     Doc1    Doc2    Doc3
Q1   0.8513  0.7365  0.4719
Q2   0.2554  0.9206  0.9439
The total numbers of terms in Doc1, Doc2, Doc3 and Doc4 are 65, 48, 36 and 92 respectively. Calculate
the TF-IDF score for each term-document pair.
Term Doc1 Doc2 Doc3 Doc4
Apple 22 9 0 40
Banana 14 0 12 0
Orange 0 23 14 0
Total number of terms in each document: Doc1 = 65, Doc2 = 48, Doc3 = 36, Doc4 = 92
Term DF
Apple 3 (appears in Doc1, Doc2, Doc4)
Banana 2 (appears in Doc1, Doc3)
Orange 2 (appears in Doc2, Doc3)
Final TF-IDF Table
Term Doc1 Doc2 Doc3 Doc4
Given Data: total relevant documents = 50, documents retrieved = 30, relevant documents retrieved = 20.
Recall = 20/50 = 0.4
Precision = 20/30 ≈ 0.6667
F-score = 2 × (Precision × Recall) / (Precision + Recall)
= 2 × (0.6667 × 0.4) / (0.6667 + 0.4)
= 2 × 0.2667 / 1.0667
= 2 × 0.25 = 0.5
Final Results: Recall = 0.4, Precision = 0.6667, F-score = 0.5
14. You have a test collection containing 100 relevant documents for a query. Your retrieval
system retrieves 80 documents, out of which 60 are relevant. Calculate the Recall,
Precision, and F-score for this retrieval.
Given: total relevant documents = 100, documents retrieved = 80, relevant documents retrieved = 60.
Recall = 60/100 = 0.6
Precision = 60/80 = 0.75
F-score = 2 × (Precision × Recall) / (Precision + Recall)
= 2 × (0.75 × 0.6) / (0.75 + 0.6)
= 2 × 0.45 / 1.35
= 0.6667
Final Results: Recall = 0.6, Precision = 0.75, F-score = 0.6667
15. In a test collection, there are a total of 50 relevant documents for a query. Your retrieval
system retrieves 60 documents, out of which 40 are relevant. Calculate the Recall,
Precision, and F-score for this retrieval
Given Data: total relevant documents = 50, documents retrieved = 60, relevant documents retrieved = 40.
Recall = 40/50 = 0.8
Precision = 40/60 ≈ 0.6667
F-score = 2 × (Precision × Recall) / (Precision + Recall)
= 2 × (0.6667 × 0.8) / (0.6667 + 0.8)
= 2 × 0.5333 / 1.4667
= 2 × 0.3636 = 0.7273
Final Results: Recall = 0.8, Precision = 0.6667, F-score = 0.7273
16. You have a test collection with 200 relevant documents for a query. Your retrieval
system retrieves 150 documents, out of which 120 are relevant. Calculate the Recall,
Precision, and F-score for this retrieval.
Given Data: total relevant documents = 200, documents retrieved = 150, relevant documents retrieved = 120.
Recall = 120/200 = 0.6
Precision = 120/150 = 0.8
F-score = 2 × (Precision × Recall) / (Precision + Recall)
= 2 × (0.8 × 0.6) / (0.8 + 0.6)
= 2 × 0.48 / 1.4
= 2 × 0.3429 = 0.6857
Final Results: Recall = 0.6, Precision = 0.8, F-score = 0.6857
17. In a test collection, there are a total of 80 relevant documents for a query. Your retrieval system retrieves 90 documents, out of which 70 are relevant. Calculate the Recall, Precision, and F-score for this retrieval.
Given Data: total relevant documents = 80, documents retrieved = 90, relevant documents retrieved = 70.
Recall = 70/80 = 0.875
Precision = 70/90 ≈ 0.7778
F-score = 2 × (Precision × Recall) / (Precision + Recall)
= 2 × (0.7778 × 0.875) / (0.7778 + 0.875)
= 2 × 0.6806 / 1.6528
= 2 × 0.4118 = 0.8235
Final Results: Recall = 0.875, Precision = 0.7778, F-score = 0.8235
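All of these calculations use the same three formulas; a small helper makes that explicit (Problem 14's numbers are used as the example):

def evaluate(relevant_retrieved, total_relevant, total_retrieved):
    """Recall, Precision and F-score for one query."""
    recall = relevant_retrieved / total_relevant
    precision = relevant_retrieved / total_retrieved
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score

# Problem 14: 100 relevant documents in total, 80 retrieved, 60 of them relevant
print(evaluate(60, 100, 80))   # (0.6, 0.75, 0.666...)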
18. Construct the 2-gram, 3-gram and 4-gram indexes for the words pineapple, computer, programming, elephant and database.
1. 2-gram (Bigram) Index
pineapple $pineapple$ $p, pi, in, ne, ea, ap, pp, pl, le, e$
computer $computer$ $c, co, om, mp, pu, ut, te, er, r$
programming $programming$ $p, pr, ro, og, gr, ra, am, mm, mi, in, ng, g$
elephant $elephant$ $e, el, le, ep, ph, ha, an, nt, t$
database $database$ $d, da, at, ta, ab, ba, as, se, e$
2. 3-gram (Trigram) Index
pineapple $pineapple$ $pi, pin, ine, nea, eap, app, ppl, ple, le$
computer $computer$ $co, com, omp, mpu, put, ute, ter, er$
programming $programming$ $pr, pro, rog, ogr, gra, ram, amm, mmi, min, ing, ng$
elephant $elephant$ $el, ele, lep, eph, pha, han, ant, nt$
database $database$ $da, dat, ata, tab, aba, bas, ase, se$
3. 4-gram (Four-gram) Index
pineapple $pineapple$ $pin, pine, inea, neap, eapp, appl, pple, ple$
computer $computer$ $com, comp, ompu, mput, pute, uter, ter$
programming $programming$ $pro, prog, rogr, ogra, gram, ramm, ammi, mmin, ming, ing$
elephant $elephant$ $ele, elep, leph, epha, phan, hant, ant$
database $database$ $dat, data, atab, taba, abas, base, ase$
19. Calculate the Levenshtein distance between the following pair of words:
a. kitten and sitting b. intention and execution c. robot and orbit d. power and flower
The Levenshtein distance (Edit Distance) between two words is the minimum number of single-
character edits (insertions, deletions, or substitutions) required to change one word into the other.
(a) kitten → sitting: substitute k → s, substitute e → i, insert g.
Levenshtein Distance = 3
(b) intention → execution: delete i, substitute n → e, substitute t → x, insert c, substitute n → u.
Levenshtein Distance = 5
(c) robot → orbit: substitute r → o, substitute o → r, substitute o → i.
Levenshtein Distance = 3
(d) power → flower: substitute p → f, insert l.
Levenshtein Distance = 2
Final Results:
Word Pair Levenshtein Distance
kitten → sitting 3
intention → execution 5
robot → orbit 3
power → flower 2
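A standard dynamic-programming sketch reproduces these distances:

def levenshtein(s, t):
    """Minimum insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))               # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

for a, b in [("kitten", "sitting"), ("intention", "execution"),
             ("robot", "orbit"), ("power", "flower")]:
    print(a, "->", b, "=", levenshtein(a, b))   # 3, 5, 3, 2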
The Soundex algorithm encodes words (mainly names) into a four-character code consisting of a
letter followed by three digits, which represent similar sounding consonants. The general steps are:
(a) Williams
1. First letter: W
2. Convert remaining letters: I(ignored), L(4), L(4), I(ignored), A(ignored), M(5), S(2)
3. Remove duplicates: W-4-5-2
4. Final Soundex: W452
(b) Gonzalez
1. First letter: G
2. Convert remaining letters: O(ignored), N(5), Z(2), A(ignored), L(4), E(ignored), Z(2)
3. Remove duplicates: G-5-2-4
4. Final Soundex: G524
(c) Harrison
1. First letter: H
2. Convert remaining letters: A(ignored), R(6), R(6), I(ignored), S(2), O(ignored), N(5)
3. Remove duplicates: H-6-2-5
4. Final Soundex: H625
(d) Parker
1. First letter: P
2. Convert remaining letters: A(ignored), R(6), K(2), E(ignored), R(6)
3. Remove duplicates: P-6-2-6
4. Final Soundex: P626
(e) Jackson
1. First letter: J
2. Convert remaining letters: A(ignored), C(2), K(2), S(2), O(ignored), N(5)
3. Remove duplicates: J-2-5
4. Final Soundex (padded with 0): J250
(f) Thompson
1. First letter: T
2. Convert remaining letters: H(ignored), O(ignored), M(5), P(1), S(2), O(ignored), N(5)
3. Remove duplicates: T-5-1-2
4. Final Soundex: T512
Final Soundex Codes:
Williams → W452, Gonzalez → G524, Harrison → H625, Parker → P626, Jackson → J250, Thompson → T512
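A compact sketch of the classic (American) Soundex rules used above; the digit mapping and zero padding follow the standard description:

CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4", **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(name):
    """First letter + three digits; vowels, h, w, y dropped and repeated codes collapsed."""
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:      # skip uncoded letters and adjacent duplicates
            digits.append(code)
        if ch not in "hw":             # h and w do not separate equal codes
            prev = code
    return (first + "".join(digits) + "000")[:4]

for n in ["Williams", "Gonzalez", "Harrison", "Parker", "Jackson", "Thompson"]:
    print(n, soundex(n))   # W452, G524, H625, P626, J250, T512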
Unit II
1. Compare and contrast the Naive Bayes and Support Vector Machines (SVM) algorithms
for text classification. Highlight their strengths and weaknesses.
Both Naïve Bayes (NB) and Support Vector Machines (SVM) are widely used in text classification
tasks such as spam detection, sentiment analysis, and topic classification. However, they differ
significantly in their approach, performance, and suitability for different types of text datasets.
Performance with Small Data: Naïve Bayes works well even with small datasets, while SVM requires more data to learn robust decision boundaries.
Interpretability: Naïve Bayes is easily interpretable (probabilistic outputs), while SVM is less interpretable (black-box nature).
Handling of Imbalanced Data: Naïve Bayes works well if probabilities are adjusted, while SVM can be biased and requires techniques like class weighting.
2. Strengths and Weaknesses
Naïve Bayes
✅Strengths: fast to train and apply, works well on small and high-dimensional text datasets, and is easy to interpret.
❌Weaknesses: the conditional-independence assumption rarely holds for text, and its probability estimates can be poorly calibrated.
SVM
✅Strengths: effective in high-dimensional, sparse feature spaces and usually more accurate when enough training data is available.
❌Weaknesses: slower to train on large corpora, needs careful kernel and parameter tuning, and is harder to interpret. A minimal comparison on a toy dataset is sketched below.
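For illustration, a small side-by-side run, assuming scikit-learn is available (the texts and labels are invented for the example):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["win a free prize now", "meeting agenda attached",
         "free lottery winner", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)   # TF-IDF features + classifier
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["free prize meeting"]))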
2. Compare and contrast the effectiveness of K-means and hierarchical clustering in text
data analysis. Discuss their suitability for different types of text corpora and retrieval tasks.
1. Comparison of Effectiveness
Algorithm Type: K-means is partition-based (iterative), while hierarchical clustering is agglomerative (bottom-up) or divisive (top-down).
Memory Usage: K-means is low (stores only cluster centroids), while hierarchical clustering is high (stores a distance matrix).
K-means Clustering
Best for:
o Large-scale text datasets (e.g., news articles, social media posts).
o Applications where predefined clusters are needed (e.g., topic modeling, document
categorization).
o High-dimensional text representations (TF-IDF, word embeddings) where efficiency
is critical.
Hierarchical Clustering
Best for:
o Small to medium-sized corpora (e.g., research papers, legal documents).
o Tasks requiring hierarchical structures (e.g., taxonomy generation, document
organization).
o Exploratory analysis where the number of clusters is unknown.
Retrieval Task – Best Choice – Reason
Document Classification – K-means – Assigns predefined labels, scalable for large corpora
Hierarchical Document Organization – Hierarchical – Generates a tree-like structure for nested categories
Keyword-Based Search Optimization – K-means – Clusters similar terms efficiently
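A minimal sketch of both algorithms on a toy corpus, assuming scikit-learn is available (the documents are invented for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering, KMeans

docs = ["stock markets fall", "shares drop on stock news",
        "team wins football final", "football league results"]
X = TfidfVectorizer().fit_transform(docs)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("K-means labels:      ", kmeans.labels_)

# Hierarchical clustering needs a dense array and pairwise distances,
# which is why it scales worse than K-means on large corpora.
agg = AgglomerativeClustering(n_clusters=2).fit(X.toarray())
print("Agglomerative labels:", agg.labels_)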
3. Discuss challenges and issues in applying clustering techniques to large-scale text data.
4. Explain link analysis and the PageRank algorithm. How does PageRank work to
determine the importance of web pages?
5. Describe the PageRank algorithm and how it calculates the importance of web pages
based on their incoming links. Discuss its role in web search ranking.
One of the most famous applications of link analysis is the PageRank algorithm, developed by Larry
Page and Sergey Brin at Google, which revolutionized web search by ranking web pages based on
their link structure rather than just keyword matching.
Introduction to PageRank
The PageRank algorithm was developed by Larry Page and Sergey Brin at Google to measure the
importance of web pages based on their link structure. It is a graph-based ranking algorithm that
assigns a numerical score to each web page, representing its importance within a network of linked
pages.
The fundamental idea behind PageRank is that a web page is more important if many other
important pages link to it. This concept is based on the assumption that hyperlinks act as votes of
confidence, where links from authoritative pages pass more value than those from less credible
sources.
The process starts with equal PageRank values for all pages and iteratively updates them based on
incoming links until the values converge.
1. Authority-Based Ranking – Pages with more incoming links from authoritative sources have
higher ranks.
2. Link Weighting – A link from a highly-ranked page (e.g., Wikipedia) is more valuable than a
link from a low-ranked page.
3. Damping Factor – Accounts for random user behavior, ensuring that pages without links
still get some rank.
4. Iterative Computation – The algorithm runs multiple iterations until PageRank values
stabilize.
1. More Links = Higher Rank – Pages that receive links from multiple sources are considered
more important.
2. Quality Matters – Links from high PageRank pages contribute more to a page’s ranking than
links from low-ranked pages.
3. Prevents Manipulation – Since PageRank distributes weight based on outbound links,
spammy pages with excessive outbound links do not accumulate much authority.
4. Improves Search Engine Ranking – Pages with higher PageRank are more likely to appear at
the top of search results, leading to better visibility.
1. Determining Authority – Pages with higher PageRank are considered more authoritative
and are more likely to appear at the top of search results.
2. Reducing Spam – Links from high-quality sources are weighted more, reducing the impact
of low-quality or manipulated links.
3. Improving User Experience – Helps rank credible and informative pages higher, making
search results more relevant.
4. Foundation for Modern Algorithms – Although Google now uses hundreds of ranking
factors, PageRank remains a core concept in search engine optimization (SEO).
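A minimal sketch of the iterative computation described above (damping factor 0.85, uniform initialization; the three-page link graph is invented for illustration):

def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank over a dict {page: [outgoing links]}."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                  # start with equal scores
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in pages}         # random-jump component
        for p, outs in links.items():
            if not outs:                              # dangling page: spread evenly
                for q in pages:
                    new[q] += d * pr[p] / n
            else:
                for q in outs:
                    new[q] += d * pr[p] / len(outs)   # share rank over outlinks
        pr = new
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print({p: round(v, 3) for p, v in pagerank(graph).items()})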
6. Explain how link analysis algorithms like HITS (Hypertext Induced Topic Search)
contribute to improving search engine relevance.
Link analysis algorithms evaluate the structure of hyperlinks between web pages to determine their
importance, authority, and relevance. These algorithms help search engines rank pages more
effectively by identifying authoritative sources and relevant hubs within a given topic. One such
algorithm is HITS (Hypertext Induced Topic Search), which improves search engine relevance by
analyzing the link structure beyond simple keyword matching.
The HITS algorithm, developed by Jon Kleinberg in 1999, classifies web pages into two categories:
1. Authorities – Pages that are highly referenced by other important pages on a topic.
2. Hubs – Pages that link to multiple authoritative sources on a topic.
1. Sampling Phase – Given a search query, HITS collects a small set of web pages related to the
query, typically those retrieved by traditional search engines.
2. Iterative Authority-Hub Calculation – The algorithm assigns two scores to each page:
o Authority Score (A(p)): The sum of hub scores of all pages linking to it.
o Hub Score (H(p)): The sum of authority scores of all pages it links to.
Formally, A(p) = Σ H(i) over all pages i that link to p, and H(p) = Σ A(j) over all pages j that p links to. After several
iterations, the scores converge, distinguishing authoritative sources from hub pages.
1. Topic-Specific Ranking – Unlike PageRank, which ranks pages globally, HITS focuses on
query-relevant pages, improving ranking for specific topics.
2. Identifying Expert Pages – HITS recognizes authoritative sources (e.g., government or
academic websites) by considering who references them.
3. Improving Query Expansion – By identifying strong hubs, HITS finds additional relevant
resources for a given query.
4. Handling New and Dynamic Content – Since HITS dynamically analyzes link structure for
each query, it can better rank emerging pages that are not yet globally popular.
5. Complementing PageRank – While PageRank focuses on global importance, HITS refines
results within specific topics, making search results more contextually relevant.
7. Discuss the impact of web information retrieval on modern search engine technologies
and user experiences.
Web Information Retrieval (IR) is the process of extracting relevant data from the vast and ever-
growing collection of online documents. It serves as the foundation of modern search engines,
helping them efficiently retrieve, rank, and present relevant web pages to users. Advances in IR
have transformed search engine technologies and significantly improved user experiences by
enhancing accuracy, speed, and personalization.
8. Discuss applications of link analysis in information retrieval systems beyond web search.
9. Compare and contrast pairwise and listwise learning to rank approaches. Discuss their
advantages and limitations.
Learning to Rank (LTR) is a machine learning technique used in information retrieval systems to
optimize the ranking of documents based on relevance to a given query. LTR methods can be
categorized into three main approaches: pointwise, pairwise, and listwise. While pairwise
approaches focus on ranking relative document pairs, listwise methods optimize the ordering of an
entire list of documents. Understanding the differences between these two approaches helps in
selecting the most suitable ranking model for specific applications.
Pairwise approaches convert the ranking problem into a classification or regression task by
comparing pairs of documents. The model learns a function that determines which document in a
given pair should be ranked higher.
Limitations of pairwise approaches:
Ignores global ranking order: It does not optimize the ranking of the entire result list, which
may lead to suboptimal rankings.
Higher computational complexity: Since all possible document pairs are considered, it can
be expensive for large-scale datasets.
Listwise approaches consider the ranking of an entire list of documents rather than individual
pairs. They directly optimize ranking measures such as NDCG (Normalized Discounted Cumulative
Gain) or MAP (Mean Average Precision).
Advantages of listwise approaches:
Optimizes global ranking: Takes into account the entire ranked list, leading to better
ranking accuracy.
Directly maximizes ranking metrics: Unlike pairwise methods, listwise approaches optimize
metrics like NDCG, which directly correlate with ranking quality.
Better performance for complex ranking tasks: Provides more accurate rankings, especially
in search engines and recommendation systems.
Limitations of listwise approaches:
More computationally intensive: Requires handling the entire list of documents, making it
slower for large-scale datasets.
Higher data requirements: Needs well-labeled training data, which may not always be
available.
Difficult to implement: More complex than pairwise methods due to direct optimization of
ranking measures.
10. Discuss the role of supervised learning techniques in learning to rank and their impact
on search engine result quality.
Learning to Rank (LTR) is a machine learning approach used to optimize the ordering of search
results based on user relevance. Supervised learning techniques play a crucial role in LTR by
leveraging labeled training data to learn ranking models. These techniques help search engines
improve relevance, personalization, and user satisfaction.
Supervised learning techniques in LTR use query-document pairs with relevance labels to train
models that predict ranking scores. These models learn patterns from historical user interactions,
clicks, and content features to improve ranking effectiveness.
11. How does supervised learning for ranking differ from traditional relevance feedback
methods in Information Retrieval? Discuss their respective advantages and limitations.
Information Retrieval (IR) systems aim to rank documents based on relevance to user queries.
Supervised learning for ranking and traditional relevance feedback are two techniques used to
improve ranking quality. While supervised learning leverages labeled training data to learn ranking
models, relevance feedback relies on user-provided relevance judgments to refine query results
dynamically.
1. Supervised Learning for Ranking
Supervised Learning to Rank (LTR) uses labeled training data to train machine learning models that
optimize the ranking of search results.
How It Works:
A dataset is created with query-document pairs labeled with relevance scores (e.g., click-
through rates, user feedback, editorial judgments).
The model is trained using pointwise, pairwise, or listwise approaches to predict ranking
scores.
The trained model is applied to new search queries to generate optimized ranked results.
Common algorithms include RankSVM, LambdaMART, and neural networks like BERT-
based ranking models.
Advantages
Disadvantages
Relevance feedback is an older, manual approach where users provide feedback on retrieved
documents, which the system uses to refine search results.
How It Works:
The user marks some search results as relevant or non-relevant after an initial search.
The system reweights query terms based on user feedback (e.g., Rocchio algorithm, Pseudo-
Relevance Feedback).
The updated query retrieves a new set of improved results.
Advantages
12. Describe the process of feature selection and extraction in learning to rank. What are the
key features used to train ranking models, and how are they selected or engineered?
1. Introduction
Feature selection and extraction are critical steps in Learning to Rank (LTR), where machine
learning models are trained to optimize the ranking of documents based on query relevance.
Features represent various aspects of the query-document relationship and are used to predict
relevance scores. Proper selection and engineering of features improve ranking model
performance, reduce overfitting, and enhance computational efficiency.
Feature selection and extraction in LTR involve multiple steps to identify and refine the most
informative features for ranking models.
1. Feature Identification – Potential features are collected from query, document, and user
interaction data.
2. Feature Engineering – Raw features are transformed into meaningful inputs using
techniques like normalization, scaling, and encoding.
3. Feature Selection – Redundant or irrelevant features are removed using statistical and
machine learning techniques such as mutual information, correlation analysis, or feature
importance scores.
4. Feature Evaluation – Selected features are validated using cross-validation or ranking
metrics like NDCG (Normalized Discounted Cumulative Gain) and MAP (Mean Average
Precision).
LTR models use three main types of features: query-dependent features, document-specific
features, and query-document interaction features.
Query-Dependent Features
Document-Specific Features
These features describe document characteristics independent of the query.
Query-Document Interaction Features
These features measure how well a query matches a particular document.
BM25 score – Measures term relevance based on frequency and document length.
Cosine similarity – Measures similarity between query and document vector
representations.
Word embedding similarity – Uses models like Word2Vec or BERT to compute semantic
similarity.
To improve model efficiency, irrelevant or redundant features are removed using selection
methods.
Filter Methods – Rank features based on statistical scores (e.g., mutual information, chi-
square test).
Wrapper Methods – Use machine learning models to evaluate feature subsets (e.g.,
recursive feature elimination).
Embedded Methods – Select features during model training using techniques like L1
regularization (Lasso) or tree-based feature importance.
13. Describe web graph representation in link analysis. How are web pages and hyperlinks
represented in a web graph OR Explain how web graphs are represented in link analysis.
Discuss the concepts of nodes, edges, and directed graphs in the context of web pages and
hyperlinks.
1. Introduction
Web graph representation is a fundamental concept in link analysis, where web pages and
hyperlinks are modeled as a directed graph. This representation helps in analyzing the structure of
the web, identifying important pages, and improving search engine ranking algorithms like
PageRank and HITS.
1. Nodes (Vertices) represent web pages. Each webpage is treated as a unique node in the
graph.
2. Edges (Links) represent hyperlinks between web pages. If page A contains a hyperlink to
page B, a directed edge is drawn from node A to node B.
3. Directed Graph is formed since hyperlinks have a direction, meaning a link from page A to
page B does not imply a link back from B to A.
3. Properties of Web Graphs
Scale-Free Nature – The web graph follows a power-law distribution where a few web
pages (hubs) have a significantly higher number of links.
Connectivity – Some pages have high in-degree (many incoming links) and act as
authoritative sources, while others have high out-degree (many outgoing links) and act as
hubs.
Clusters and Communities – The web graph naturally forms topic-specific clusters where
pages on similar topics are more densely connected.
PageRank Algorithm – Uses link structure to assign importance to web pages based on
incoming links.
HITS Algorithm – Identifies hubs (pages linking to many relevant pages) and authorities
(pages that are linked to by many hubs).
Web Crawling and Indexing – Helps search engines efficiently explore and rank web pages.
Spam Detection – Identifies unnatural link patterns used for ranking manipulation.
14. Discuss the difference between the PageRank and HITS algorithms.
Link analysis is evolving with advancements in artificial intelligence, graph theory, and information
retrieval (IR). One of the key trends is the integration of Graph Neural Networks (GNNs), which
enhance traditional link analysis by learning complex relationships in large-scale graphs. Another
emerging area is dynamic and temporal link analysis, which considers evolving web structures and
social networks over time, rather than treating them as static entities.
Personalization is becoming a crucial aspect of link analysis, where algorithms adapt to user
preferences and behaviors, improving search rankings and recommendations. Hybrid approaches
that combine link-based ranking methods like PageRank with deep learning and NLP techniques,
such as BERT, are also gaining prominence, leading to more relevant search results.
Another growing concern is misinformation detection on the web and social media. Link analysis is
being used to identify coordinated fake news networks and track the spread of misleading content.
Additionally, privacy concerns are driving research into decentralized and privacy-preserving link
analysis, where federated learning methods allow analysis without compromising user data.
In social network analysis, link analysis plays a fundamental role in identifying influential users,
detecting communities, and understanding information flow. Centrality measures like betweenness
and eigenvector centrality help in ranking key individuals within a network. Platforms like
Facebook and LinkedIn utilize link prediction to recommend new friends or professional
connections based on shared links and mutual interests.
Beyond social and commercial applications, link analysis is used in epidemic modeling, helping
researchers understand how diseases or information spreads through networks. This is particularly
relevant in public health strategies and crisis management.
16. How do link analysis algorithms contribute to combating web spam and improving
search engine relevance?
How Link Analysis Algorithms Combat Web Spam and Improve Search Engine Relevance
17. Consider a simplified web graph with the following link structure:
Page A has links to pages B, C, and D.
Page B has links to pages C and E.
Page C has links to pages A and D.
Page D has a link to page E.
Page E has a link to page A.
Using the initial authority and hub scores of 1 for all pages, calculate the authority and hub
scores for each page after one/two iteration(s) of the HITS algorithm.
The HITS (Hyperlink-Induced Topic Search) algorithm assigns each page two scores:
Authority Score: A page’s importance based on the number of high-quality hubs linking to it.
Hub Score: A page’s importance based on the number of high-quality authorities it links to.
Authority Update
The new authority score of a page is the sum of the hub scores of all pages linking to it.
Hub Update
The new hub score of a page is the sum of the authority scores of all pages it links to.
Page Authority Score (After 1st Iteration) Hub Score (After 1st Iteration)
A 2 5
B 1 4
C 2 4
D 2 2
E 2 2
Authority and hub updates are then repeated using the scores from the first iteration.
Page Authority Score (After 2nd Iteration) Hub Score (After 2nd Iteration)
A 6 23
B 5 15
C 9 15
D 9 6
E 6 6
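The same two iterations can be reproduced with a short script (unnormalized scores; authorities are updated from the previous hub scores and hubs from the new authority scores, as in the hand calculation):

graph = {"A": ["B", "C", "D"], "B": ["C", "E"],
         "C": ["A", "D"], "D": ["E"], "E": ["A"]}   # link structure of this problem

auth = {p: 1.0 for p in graph}
hub = {p: 1.0 for p in graph}

for it in range(2):
    # Authority update: sum of hub scores of pages linking in
    auth = {p: sum(hub[q] for q, outs in graph.items() if p in outs) for p in graph}
    # Hub update: sum of the new authority scores of pages linked to
    hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
    print(f"Iteration {it + 1}:", {p: (auth[p], hub[p]) for p in graph})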
A → B, C
B→C
C → A, D
D→A
Each page starts with an initial authority score and hub score of 1.
Iteration 1
The authority score of a page is the sum of the hub scores of the pages linking to it.
The hub score of a page is the sum of the authority scores of the pages it links to.
Page Authority Score (After 1st Iteration) Hub Score (After 1st Iteration)
A 2 3
B 1 2
C 2 3
D 1 2
Iteration 2
Page Authority Score (After 2nd Iteration) Hub Score (After 2nd Iteration)
A 5 8
B 3 5
C 5 8
D 3 5
A → B, C
B→D
C → B, D
D → A, C
Each page starts with an initial authority score and hub score of 1.
The authority score of a page is the sum of the hub scores of the pages linking to it.
The hub score of a page is the sum of the authority scores of the pages it links to.
Page Authority Score (After 1st Iteration) Hub Score (After 1st Iteration)
A 1 4
B 2 2
C 2 4
D 2 3
A → B, C
B → C, D
C → A, D
D→B
Each page starts with an initial authority score and hub score of 1.
The authority score of a page is the sum of the hub scores of the pages linking to it.
The hub score of a page is the sum of the authority scores of the pages it links to.
Page Authority Score (After 1st Iteration) Hub Score (After 1st Iteration)
A 1 4
B 2 4
C 2 3
D 2 2
Page Authority Score (After 2nd Iteration) Hub Score (After 2nd Iteration)
A 3 14
B 6 15
C 8 10
D 7 6
Unit III
1. How do web crawlers handle dynamic web content during crawling? Explain techniques
such as AJAX crawling, HTML parsing, URL normalization and session handling for
dynamic content extraction. Explain the challenges associated with handling dynamic
web content during crawling.
Web crawlers primarily extract static HTML content, but dynamic web pages—generated using
JavaScript, AJAX, or user interactions—pose significant challenges. Crawlers use specialized
techniques to handle such content effectively.
1. AJAX Crawling
AJAX (Asynchronous JavaScript and XML) allows web pages to load data dynamically
without refreshing. Crawlers handle AJAX content using:
o Headless Browsers (e.g., Puppeteer, Selenium) to render JavaScript-driven content.
o API Simulation to interact with backend services directly.
o Pre-rendering Services that generate static HTML snapshots of dynamic content.
2. HTML Parsing
Crawlers parse and analyze HTML using tools like BeautifulSoup and lxml. Dynamic
elements can be extracted by:
o Identifying hidden data within meta tags, JSON-LD, or inline JavaScript.
o Extracting structured data (e.g., schema.org) embedded in the page.
o Following iframes and embedded resources for additional content.
3. URL Normalization
Many dynamic sites generate multiple URLs for the same content due to parameters (e.g.,
session IDs, tracking codes). Normalization helps avoid duplicate crawling by:
o Removing unnecessary query parameters.
o Converting relative URLs to absolute URLs.
o Ensuring consistent URL case sensitivity and structure (a short sketch follows this list).
4. Session Handling and Cookies
Websites often require user authentication or store state information using cookies and
session IDs. Crawlers handle this by:
o Managing cookies to maintain session continuity.
o Using authentication mechanisms (OAuth, tokens, or login automation) to access
restricted content.
o Avoiding session-based URLs to prevent duplicate crawling.
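As an illustration of technique 3 (URL normalization), a minimal sketch using only the standard library; the set of tracking parameters is an assumption made for the example:

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}

def normalize_url(url):
    """Lowercase scheme/host, drop tracking parameters, and sort the remaining query."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in TRACKING_PARAMS]
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", "", urlencode(sorted(query)), ""))

print(normalize_url("HTTP://Example.com/shop?b=2&utm_source=ad&a=1"))
# http://example.com/shop?a=1&b=2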
Challenges in Handling Dynamic Web Content
1. JavaScript Execution
Many modern websites rely on JavaScript for rendering content, requiring headless
browsers or JavaScript-enabled crawlers, which increase crawling overhead.
2. Anti-Crawling Mechanisms
Websites implement CAPTCHAs, IP blocking, or bot detection techniques (e.g., behavior
analysis, fingerprinting) to restrict automated crawling.
3. Infinite Scrolling and Pagination
Content loading dynamically via infinite scrolling makes it difficult for crawlers to
determine when to stop fetching data. Special strategies (e.g., simulating user scrolling) are
required.
4. Dynamic URL Generation
Some websites generate random or temporary URLs that change with every session, making
it hard to track and normalize them.
5. Data Consistency and Freshness
Since dynamic content updates frequently, crawlers must decide how often to revisit pages
without overwhelming the server.
6. Ethical and Legal Considerations
Crawling dynamic content may violate a site's robots.txt rules, terms of service, or data
privacy policies. Crawlers must ensure compliance to avoid legal risks.
2. Describe the role of AJAX crawling scheme and the use of sitemaps in crawling dynamic
web content. Provide examples of how these techniques are implemented in practice.
Modern web applications rely on AJAX (Asynchronous JavaScript and XML) to load content
dynamically. Traditional crawlers struggle to extract such content since it isn't present in the initial
HTML source. The AJAX crawling scheme and sitemaps help address this challenge
AJAX allows web pages to fetch data asynchronously without refreshing, making crawling more
complex. To handle this, crawlers use:
b. Headless Browsers
Crawlers use headless browsers (e.g., Puppeteer, Selenium) to simulate user interactions
and load JavaScript-rendered content.
Example: A news website loads articles via AJAX. A Selenium-based crawler can wait for
content to load and extract data.
Sitemaps provide structured lists of URLs to help search engines discover and crawl web pages
efficiently.
a. XML Sitemaps
Websites provide an XML file listing all pages to ensure important URLs are indexed.
Example: An e-commerce website generates a sitemap.xml file with product pages that load
via AJAX, ensuring crawlers can find them.
b. Dynamic Sitemaps
Implementation Examples
3. Compare and contrast local and global similarity measures for near-duplicate detection.
Provide examples of scenarios where each measure is suitable.
Near-duplicate detection identifies documents that share significant content but may have minor
variations. Local similarity measures focus on small parts of a document, while global similarity
measures analyze the overall content.
1. Local Similarity Measures
Local similarity methods compare small portions (substrings, words, or tokens) of documents to
detect near-duplicates.
Characteristics
1. Shingling (k-grams)
o Splits text into overlapping k-length substrings (shingles).
o Example: "web crawling techniques" → {web, craw, rawl, awli, ling, ...}
o Uses Jaccard similarity on shingle sets.
2. Locality-Sensitive Hashing (LSH)
o Hashes shingles and compares only similar hash values.
o Used in duplicate web page detection.
3. Edit Distance (Levenshtein Distance)
o Counts the minimum operations (insert, delete, replace) needed to transform one
text into another.
o Useful for typo detection and OCR error correction.
2. Global Similarity Measures
Global similarity methods compare documents as whole entities, focusing on overall content rather
than specific parts.
Characteristics
1. Cosine Similarity
o Represents documents as TF-IDF vectors and calculates the cosine of the angle
between them.
o Example: Used in news article clustering.
2. Jaccard Similarity (Set-Based)
o Measures overlap between word/token sets.
o Used in document clustering and duplicate webpage detection.
3. Euclidean Distance on Word Embeddings
o Compares semantic similarity using word embeddings (Word2Vec, BERT).
o Example: Used in semantic search engines.
1. Plagiarism Detection
Academic Integrity: Universities use near-duplicate detection in tools like Turnitin and
Copyscape to identify plagiarized assignments, research papers, and thesis submissions.
Code Plagiarism: Platforms like Moss (Measure of Software Similarity) detect copied
programming code in student submissions and open-source repositories.
Web Index Optimization: Search engines like Google remove near-duplicate pages to avoid
redundancy and improve search quality.
Canonicalization: Helps determine the canonical version of a page when multiple URLs
contain similar content (e.g., product descriptions across different retailers).
News Deduplication: Platforms like Google News and Yahoo News group near-duplicate
articles from different publishers to avoid repetition.
Fake News Detection: Identifies copied or manipulated news articles spreading
misinformation.
Product Deduplication: Online marketplaces like Amazon, eBay, and Flipkart identify
duplicate product listings across different sellers.
Price Comparison: Ensures accurate price comparisons by removing duplicate product
pages with slight variations.
Efficient Web Crawling: Search engines use SimHash and Locality-Sensitive Hashing (LSH)
to avoid indexing near-duplicate pages, reducing storage costs.
Content Scraping Detection: Websites detect and block bots scraping content for
unauthorized reproduction.
Duplicate Post Detection: Platforms like Twitter, Facebook, and Reddit identify near-
duplicate posts and filter spam.
Meme & Fake Review Detection: Detects repeated fake reviews on platforms like Yelp and
Amazon.
Duplicate Research Paper Detection: Databases like PubMed and arXiv filter near-duplicate
scientific publications.
Clinical Trial Deduplication: Ensures accuracy in medical research by detecting redundant
trial data.
Extractive text summarization selects key sentences from a document to form a concise summary
while preserving the original meaning. Several techniques, including graph-based methods and
sentence scoring approaches, help identify important content.
1. Graph-Based Methods
Graph-based approaches model text as a graph where nodes represent sentences, and edges
capture relationships between them. Ranking algorithms are then used to extract the most
important sentences.
a. TextRank Algorithm
Builds a graph of sentences linked by content overlap and applies a PageRank-style ranking to pick the top sentences.
b. LexRank Algorithm
Similar to TextRank but uses Jaccard or Cosine similarity to connect sentences in the graph.
Applies Markov chains to compute sentence importance, selecting highly ranked sentences.
More effective for multi-document summarization.
Sentence scoring techniques assign importance scores to each sentence based on linguistic,
statistical, or machine learning methods.
Sentences are ranked based on the frequency of important words (TF) and their rarity
across documents (IDF).
High TF-IDF scores indicate key sentences for extraction.
Short sentences may lack sufficient information and are often discarded.
Sentences containing named entities (e.g., people, places, organizations) are prioritized.
Uses supervised learning models (e.g., SVM, Random Forest, Neural Networks) to score
sentences based on:
o Word frequency
o Sentence position
o Semantic similarity
Trained using human-labeled summaries to learn important patterns.
Abstractive text summarization involves generating a summary that paraphrases and condenses
the original text, rather than simply extracting key sentences. While this approach provides more
human-like summaries, it presents several challenges.
Issue: Summaries may be grammatically incorrect or lack logical flow, making them hard to
read.
Example: "The company profits increased, however, declined last year."
Solution: Pre-trained language models like GPT-4, T5, and BART improve fluency using
context-aware generation.
Issue: Standard models struggle with long documents due to limited context window size
(e.g., 512-1024 tokens in Transformer models).
Example: Summarizing a 50-page legal document accurately.
Solution: Longformer, BigBird, and hierarchical attention mechanisms allow models to
process longer inputs efficiently.
d. Domain-Specific Challenges
e. Computational Complexity
Issue: Training large-scale abstractive models requires high computational power and
memory.
Example: Training a Transformer-based summarization model on limited hardware is
expensive.
Solution: Optimizations like knowledge distillation, quantization, and pruning reduce model
size while maintaining accuracy.
a. Transformer-Based Models
Models like RL-Sum use reinforcement learning to improve factual accuracy and readability.
Helps reduce hallucinations and redundancy in generated summaries.
GPT-4 and LLaMA models enable summarization with minimal labeled data.
Reduces the need for large, task-specific datasets.
7. Discuss common evaluation metrics used to assess the quality of text summaries, such as
ROUGE and BLEU. Explain how these metrics measure the similarity between generated
summaries and reference summaries.
Evaluating the quality of generated summaries is crucial to ensure they are accurate, coherent, and
informative. Several automatic metrics, such as ROUGE and BLEU, measure the similarity between a
generated summary and human-written reference summaries.
ROUGE is the most widely used metric for summarization, focusing on n-gram recall, precision, and
F1-score between the generated and reference summaries.
ROUGE compares overlapping words or phrases between summaries. Key variants include ROUGE-N (n-gram overlap, e.g., ROUGE-1 for unigrams and ROUGE-2 for bigrams), ROUGE-L (longest common subsequence), and ROUGE-S (skip-bigrams).
Strengths:
✔Works well for extractive summarization (word overlap is high).
✔Simple and efficient for large-scale evaluations.
Limitations:
✘Fails to capture paraphrased or semantically similar sentences.
✘Doesn't consider summary coherence or readability.
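A simplified ROUGE-1 sketch makes the overlap idea concrete (real evaluations normally use a dedicated ROUGE implementation rather than hand-rolled code):

from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap ROUGE-1 precision, recall and F1 (simplified illustration)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())   # clipped unigram matches
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

reference = "the quick brown fox jumps over the lazy dog"
generated = "a fast brown fox leaps over a sleepy dog"
print([round(x, 2) for x in rouge_1(generated, reference)])   # [0.44, 0.44, 0.44]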
Originally designed for machine translation, BLEU is also used for summarization. It measures n-
gram precision rather than recall.
Example Calculation:
Reference: "The quick brown fox jumps over the lazy dog."
Generated: "A fast brown fox leaps over a sleepy dog."
Strengths:
✔Useful when precision is more important than recall.
✔Good for machine-generated summaries with strict phrase matching.
Limitations:
✘Ignores synonyms and sentence structure.
✘Biased against abstractive summarization, which may use different words.
8. Discuss different approaches for question answering in information retrieval, including
keyword-based, document retrieval, and passage retrieval methods.
Question Answering (QA) in Information Retrieval (IR) focuses on retrieving relevant information
from large text corpora to provide precise answers to user queries. Different approaches include
keyword-based methods, document retrieval, and passage retrieval, each offering distinct
advantages depending on the query complexity and information need.
This approach relies on keyword matching between the user's question and documents in the
collection. It is fast and efficient but lacks semantic understanding.
How It Works
Example
Advantages
Disadvantages
✘Ignores context and word variations (e.g., “capital” vs. “main city”).
✘Struggles with complex or natural language queries.
This approach retrieves full documents relevant to the query using advanced ranking algorithms. It
is commonly used in search engines and digital libraries.
How It Works
Uses vector space models, BM25, or dense retrieval (e.g., BERT-based models).
Scores documents based on query-document similarity.
Returns entire documents rather than specific answers.
Example
Advantages
Disadvantages
This approach retrieves specific text passages instead of entire documents, improving precision for
direct answers. It is widely used in modern search engines and AI-powered chatbots.
How It Works
Example
Advantages
9. Explain how natural language processing techniques such as Named Entity Recognition
(NER) and semantic parsing contribute to question answering systems.
Natural Language Processing (NLP) techniques such as Named Entity Recognition (NER) and
Semantic Parsing play a crucial role in improving the accuracy and efficiency of Question Answering
(QA) systems. These techniques help in understanding, analyzing, and extracting meaningful
information from text, allowing QA models to provide precise and contextually relevant answers.
Named Entity Recognition (NER) is an NLP technique that identifies and classifies entities (such as
names of people, places, organizations, dates, and numerical values) in a given text.
1. Entity Extraction: The system scans the input query and extracts important named entities.
2. Entity Type Recognition: The extracted entities are classified into predefined categories
such as Person, Location, Date, Organization, Number, etc.
3. Query Understanding: By identifying key entities, the system understands the context of the
question.
4. Answer Retrieval: The QA model searches for relevant passages containing similar entities
and ranks them based on relevance.
Example
Benefits of NER in QA
1. Improves precision by focusing on relevant entities instead of searching through the entire
document.
2. Enhances query understanding, allowing the system to differentiate between similar terms
(e.g., Apple the company vs. apple the fruit).
3. Reduces search space, making the retrieval process faster and more efficient.
Semantic Parsing converts a natural language question into a structured, machine-readable representation. It typically proceeds as follows:
1. Syntax Analysis: The input question is broken down into grammatical components (subject,
verb, object).
2. Meaning Representation: The system translates the question into a structured query format,
such as SQL for database retrieval or SPARQL for querying knowledge graphs.
3. Information Retrieval: The structured query is executed against a database, knowledge
graph, or text corpus to fetch relevant information.
4. Answer Generation: The retrieved data is transformed into a human-readable response.
Example
10. Provide examples of question answering systems and evaluate their effectiveness in
providing precise answers.
1. Google Search is a widely used QA system that provides featured snippets and passage-
based answers directly in search results. It uses BERT-based models and dense retrieval
techniques to extract the most relevant passage from indexed documents. It is highly
accurate for factual queries, provides fast retrieval, and is continually updated with fresh
information. However, it may misinterpret complex queries, is biased toward popular
sources, and does not always provide direct answers for ambiguous or multi-step questions.
2. IBM Watson is an advanced QA system used for enterprise solutions, healthcare, and
finance. It processes structured and unstructured data, leveraging deep NLP and knowledge
graphs. It is highly effective for domain-specific queries, uses deep reasoning to provide
context-aware answers, and can analyze both structured and unstructured data. However, it
is computationally expensive, not always user-friendly, and limited to trained domains,
making it less effective in open-ended QA.
3. OpenAI ChatGPT is an AI-powered conversational agent capable of answering both factual
and reasoning-based questions. It uses transformer-based deep learning models and has
been trained on vast amounts of text. It handles complex and open-ended questions,
provides contextual understanding, and generates human-like responses. However, it may
generate incorrect or outdated answers, is prone to hallucination, and lacks precise
citations, making fact-checking difficult.
4. Wolfram Alpha is a computational knowledge engine that answers math, science, and data-
driven questions by computing results rather than searching for text-based answers. It is
highly accurate for mathematical and scientific queries, provides step-by-step solutions, and
generates results from structured data. However, it is limited to structured and factual
queries, does not support conversational interactions, and requires precise query
formulation for accurate results.
5. Siri, Alexa, and Google Assistant are voice-based QA systems integrated into smart
assistants that process spoken queries and return answers using NLP and web search. They
are optimized for voice-based interactions, provide quick response times, and integrate
with smart devices. However, they have limited reasoning capabilities, struggle with multi-
turn complex queries, and may have voice recognition errors that affect accuracy.
6. Different QA systems excel in different scenarios. Google Search is best for quick factual
lookups. IBM Watson is powerful for enterprise and domain-specific tasks. ChatGPT is ideal
for conversational, reasoning-based, and explanatory questions. Wolfram Alpha provides
precise computations for math and science. Voice Assistants are convenient for real-time
voice-based QA.
7. Future advancements in deep learning, knowledge graphs, and hybrid retrieval-generation
models will further improve QA systems, making them more accurate, context-aware, and
interactive.
11. Discuss the challenges associated with question answering, including ambiguity
resolution, answer validation, and handling of incomplete or noisy queries.
1. Ambiguity Resolution – Many questions contain ambiguous terms or phrases that can be
interpreted in multiple ways. For example, "Who is the president?" could refer to different
countries or organizations. QA systems must rely on context-awareness techniques like
Named Entity Recognition (NER) and Word Sense Disambiguation (WSD) to resolve
ambiguity effectively. However, incorrect context identification can lead to misleading
answers.
2. Answer Validation – Even when a system retrieves a relevant response, verifying its
correctness is difficult. Some sources may provide outdated, biased, or inaccurate
information. QA models need fact-checking mechanisms, external knowledge bases, and
citation techniques to ensure reliable answers. Misinformation remains a challenge,
especially when systems generate answers based on incomplete or biased training data.
3. Handling Incomplete Queries – Users often provide vague or underspecified questions, such
as "How long does it take?" without specifying what process they are referring to. QA
systems must infer missing details using historical data or predefined templates. However,
excessive reliance on inference can introduce errors, leading to incorrect or contextually
irrelevant responses.
4. Noisy Queries – User queries may contain typographical errors, grammatical mistakes,
informal language, or inconsistent phrasing. For example, "wht is captl of Frnace?" requires
spell correction and normalization before processing. NLP techniques such as fuzzy
matching, synonym recognition, and language models help correct errors, but highly
unstructured inputs remain challenging.
5. Scalability & Efficiency – Large-scale QA systems must process vast amounts of textual data
in real time while maintaining high accuracy. Indexing, caching, and parallel computing can
improve response time, but balancing computational efficiency with accuracy is difficult. As
data volumes grow, ensuring fast and precise question answering requires continuous
optimization of search and retrieval algorithms.
6. Bias & Fairness – QA systems are prone to bias due to imbalanced training data or biased
sources. If a dataset favors certain viewpoints or demographics, the system may generate
skewed answers. Ensuring fairness requires diverse training data, bias-mitigation
techniques, and transparency in answer generation. However, defining and enforcing
fairness in QA remains an ongoing research challenge.
7. Context Awareness in Conversational QA – Multi-turn question answering, where responses
depend on previous context, presents additional difficulties. For instance, if a user asks,
"What is the weather in Paris?" followed by "And tomorrow?" the system must retain
context. Maintaining coherence across conversations requires memory-augmented
architectures, but errors in context tracking can lead to irrelevant or repetitive answers.
8. Multilingual & Cross-Language QA – Handling questions in multiple languages requires
large multilingual datasets and effective cross-lingual embeddings. Many QA models
perform well in English but struggle with low-resource languages due to limited training
data. Translating queries accurately while preserving meaning is another challenge in cross-
lingual question answering.
9. Contradictory Information Handling – When different sources provide conflicting answers,
selecting the most accurate one is difficult. For example, some websites might report
different historical events or scientific claims. QA systems must assess source credibility,
detect inconsistencies, and provide balanced viewpoints where necessary. However,
automating contradiction detection remains an open problem.
10. Security & Misinformation Risks – Malicious users can manipulate QA systems by injecting
false information into training data or crafting adversarial queries to exploit model
weaknesses. Additionally, QA models trained on internet data may inadvertently generate
harmful or misleading responses. Ensuring robustness against misinformation attacks and
adversarial queries requires enhanced security mechanisms and continuous model
evaluation.
12. Explain how collaborative filtering algorithms such as user-based and item-based
methods work. Discuss techniques to address the cold start problem in collaborative
filtering.
Collaborative filtering (CF) is a recommendation technique that suggests items to users based on
the preferences of similar users or the similarity between items. It is widely used in applications
such as e-commerce, streaming services, and online advertising. There are two main types of
collaborative filtering: user-based and item-based methods.
User-based collaborative filtering operates on the assumption that users with similar preferences in
the past will have similar preferences in the future. The process involves:
Finding Similar Users – The system identifies users with similar rating patterns using
similarity measures like cosine similarity, Pearson correlation, or Jaccard similarity.
Predicting Ratings – Once similar users are identified, the system predicts a user’s rating for
an item based on ratings from similar users.
Generating Recommendations – The system recommends items with the highest predicted
ratings that the user has not yet interacted with.
For example, if User A and User B have rated multiple movies similarly, and User B has rated a
movie that User A has not seen, the system may recommend that movie to User A.
Item-based collaborative filtering focuses on item similarity rather than user similarity. The key
steps include:
Computing Item Similarities – Items are compared based on how users have rated them.
Similarity measures such as cosine similarity and adjusted cosine similarity are commonly
used.
Predicting Ratings – The system predicts a user's rating for an item by analyzing their
ratings for similar items.
Recommending Items – Highly similar items to those a user has interacted with are
recommended.
For instance, if many users who liked "The Lord of the Rings" also liked "The Hobbit," then "The
Hobbit" would be recommended to users who enjoyed "The Lord of the Rings."
The cold start problem arises when a recommendation system lacks sufficient data for new users or
new items. Since CF relies on historical interactions, new users or items may not have enough
ratings for meaningful recommendations. Several techniques help mitigate this issue: gathering basic preferences during onboarding, falling back on popularity-based or content-based recommendations, exploiting demographic or item-attribute information, and combining collaborative filtering with content-based methods in hybrid recommenders.
13. Describe content-based filtering approaches, including feature extraction and similarity
measures used in content-based recommendation systems.
Content-based filtering (CBF) is a recommendation approach that suggests items to users based on
the characteristics of items they have previously interacted with. Instead of relying on user
interactions like collaborative filtering, CBF analyzes item attributes and user preferences to
generate recommendations.
Content-based filtering works by comparing the features of items to those of items the user has
previously liked. The process involves:
Feature Extraction – Identifying and extracting relevant attributes from items (e.g.,
keywords in articles, genre in movies, ingredients in recipes).
User Profile Construction – Creating a profile based on user preferences by analyzing their
past interactions.
Similarity Measurement – Comparing the features of items to those preferred by the user to
generate recommendations.
Feature extraction is a crucial step where relevant attributes are identified from items. Some
common feature extraction techniques include TF-IDF weighting of keywords for textual items, bag-of-words or word-embedding representations of descriptions, and structured metadata such as genre, author, tags, or category.
Once features are extracted, similarity measures are used to compare items and generate
recommendations. Some common similarity measures include:
Cosine Similarity – Measures the cosine of the angle between two feature vectors,
commonly used in text-based and numeric feature comparisons.
Euclidean Distance – Computes the direct distance between feature vectors, used for
numerical attributes.
Jaccard Similarity – Measures the similarity between sets, often used for categorical
features like genre or tags.
Pearson Correlation – Evaluates the linear correlation between features, useful in rating-
based similarity comparisons.
Advantages:
Does not depend on other users' data, can recommend new or niche items as long as their features are known, and produces recommendations that are easy to explain in terms of item attributes.
Limitations:
Struggles with new items lacking user interactions (cold start problem for items).
Over-specialization can occur, where users are only recommended similar items without
diversity.
Requires high-quality feature extraction for accurate recommendations.
14. Discuss the advantages and limitations of online evaluation methods compared to offline
evaluation methods, such as test collections and user studies.
Evaluating recommendation systems, search engines, or information retrieval models requires both
online and offline evaluation methods. Each approach has its advantages and limitations based on
real-world applicability, scalability, and reliability.
Online evaluation involves testing models in real-time using live user interactions. This is
commonly done through A/B testing, multi-armed bandits, and click-through rate (CTR) analysis in
a deployed system.
Advantages of Online Evaluation
1. Real-User Feedback – Directly captures how real users interact with recommendations,
ensuring practical relevance.
2. Dynamic Adaptation – Can adjust recommendations dynamically based on evolving user
behavior.
3. Business-Centric Metrics – Allows measuring actual business goals like conversion rates,
revenue impact, and engagement.
4. Scalability – Can handle large-scale evaluations across diverse user bases without artificial
constraints.
Offline evaluation involves assessing models using test collections, benchmark datasets, and user
studies before deployment. Metrics such as precision, recall, NDCG (Normalized Discounted
Cumulative Gain), and MAP (Mean Average Precision) are commonly used.
Advantages of Offline Evaluation
1. Fast and Cost-Effective – Does not require a live system, allowing rapid testing of multiple
models.
2. Controlled Environment – Eliminates external influences, making comparisons more
reliable.
3. Safe Experimentation – Avoids negative impact on actual users, preventing potential
dissatisfaction.
4. Replicability – Can be consistently applied across different models, ensuring fair
comparisons.
Limitations of Offline Evaluation
1. Limited Real-World Relevance – User behavior in live environments may differ from test
collections, reducing practical applicability.
2. Lack of Engagement Metrics – Cannot measure real-world user satisfaction or engagement
directly.
3. Cold Start and Data Biases – May not capture changes in user preferences or new content
effectively.
4. Assumes Static Data – Real-world data evolves, whereas offline evaluations use fixed
datasets, limiting adaptability.
The Jaccard similarity between two sets A and B is calculated using the formula:
J(A, B) = |A∩B| / |A∪B|
Given sets:
A = {"apple", "banana", "orange", "grape"}
B = {"apple", "orange", "grape", "kiwi"}
Step 1: Compute Intersection |A∩B|
A∩B = {"apple", "orange", "grape"}
So, |A∩B| = 3
Step 2: Compute Union |A∪B|
A∪B = {"apple", "banana", "orange", "grape", "kiwi"}
So, |A∪B| = 5
Step 3: Compute Jaccard Similarity
J(A, B) = 3/5 = 0.6
Final Answer:
The Jaccard similarity between the two sets is 0.6 (or 60%).
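The same calculation in a couple of lines of Python:

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

A = {"apple", "banana", "orange", "grape"}
B = {"apple", "orange", "grape", "kiwi"}
print(jaccard(A, B))   # 0.6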
The Jaccard similarity between two sets A and B is calculated using the formula:
J(A, B) = |A∩B| / |A∪B|
where |A∩B| is the number of elements common to both sets and |A∪B| is the total number of distinct elements across the two sets.
Document 1:
A={"machine","learning","algorithm","data","science"}
Document 2:
B={"algorithm","data","science","model","prediction"}
A∩B={"algorithm","data","science"}
∣A∩B∣=3
Step 3: Compute Union A∪B
A∪B={"machine","learning","algorithm","data","science","model","prediction"}
∣A∪B∣=7
Step 4: Compute Jaccard Similarity
J(A, B) = |A∩B| / |A∪B| = 3/7 ≈ 0.4286
Final Answer
The Jaccard similarity between the two documents is 0.4286 (or 42.86%).
The Jaccard similarity between two sets A and B is given by the formula:
J(A, B) = |A∩B| / |A∪B|
where |A∩B| is the number of items common to both transactions and |A∪B| is the total number of distinct items.
Transaction Document 1:
A={"bread","milk","eggs","cheese"}
Transaction Document 2:
B={"bread","butter","milk","yogurt"}
A∩B={"bread","milk"}
∣A∩B∣=2
A∪B={"bread","milk","eggs","cheese","butter","yogurt"}
|A∪B| = 6
J(A, B) = |A∩B| / |A∪B| = 2/6 ≈ 0.3333
Final Answer
The Jaccard similarity between the two transaction documents is 0.3333 (or 33.33%).
The Jaccard similarity between two sets A and B is given by the formula:
J(A, B) = |A∩B| / |A∪B|
where |A∩B| is the number of features common to both products and |A∪B| is the total number of distinct features.
Product Document 1:
A={"smartphone","camera","battery","display"}
Product Document 2:
B={"smartphone","camera","storage","processor"}
A∩B={"smartphone","camera"}
|A∩B| = 2
A∪B={"smartphone","camera","battery","display","storage","processor"}
∣A∪B∣=6
J(A, B) = |A∩B| / |A∪B| = 2/6 ≈ 0.3333
Final Answer
The Jaccard similarity between the two product documents is 0.3333 (or 33.33%).