
T.Y.B.Sc. (CS) Sem VI Information Retrieval Question Bank

Unit 1

1. Define k-gram indexing and explain its significance in Information Retrieval systems.

K-Gram Indexing is a technique used in Information Retrieval (IR) to enhance text searching by
breaking words into overlapping sequences of k contiguous characters, called k-grams.

For example, if k = 3 (3-gram), the word "search" is broken into the following 3-grams:

• ^se (adding ^ as a start marker)
• sea
• ear
• arc
• rch
• ch$ (adding $ as an end marker)

A k-gram index is a data structure that maps these k-grams to the words that contain them.

Significance in Information Retrieval Systems

K-gram indexing is particularly useful in the following ways:

1. Efficient Wildcard Query Processing


o Queries like "com*tion" (for words like "communication", "computation") can be
efficiently processed by looking up the k-grams of the query's literal parts (e.g., com, tio, ion)
instead of scanning the full index.
2. Spelling Correction & Fuzzy Matching
o By indexing k-grams, a system can suggest correct words when a user makes a typo.
For example, if "informaton" is entered, the system can find "information" by
comparing common k-grams.
3. Autocomplete & Query Suggestions
o K-gram indexing enables predictive text input by quickly retrieving terms that
match the beginning of a query.
4. Plagiarism Detection & Similarity Matching
o It helps in detecting similar texts across large datasets by comparing k-gram
distributions.
5. Language Processing & Named Entity Recognition
o Useful in NLP applications where partial matching of words or phrases is needed.

2. Describe the process of constructing a k-gram index. Highlight the key steps involved and
the data structures used.

Constructing a k-gram index involves breaking words into k-grams and storing them efficiently for
fast retrieval. The process includes the following key steps:
1. Tokenization of Text Corpus

 Extract words from the text collection (documents, database, or query inputs).
 Convert words to lowercase for case-insensitive matching.
 Remove punctuation and special characters if needed.

Example Input (Word List):

["search", "seeking", "seam"]

2. Generating K-Grams

 For each word, generate overlapping k-grams (substrings of length k).


 Add special markers (^ for start and $ for end) to preserve word boundaries.

Example (3-Grams for "search"):

^se, sea, ear, arc, rch, ch$

3. Building the K-Gram Inverted Index

 Create a hash table (dictionary/map) where:


o Key: A unique k-gram.
o Value: A set or list of words containing that k-gram.

Example K-Gram Index (for words "search", "seeking", "seam"):

{
"^se" : {"search", "seeking", "seam"},
"sea" : {"search", "seam"},
"ear" : {"search"},
"arc" : {"search"},
"rch" : {"search"},
"ch$" : {"search"},
"see" : {"seeking"},
"eek" : {"seeking"},
"eki" : {"seeking"},
"kin" : {"seeking"},
"ing$" : {"seeking"},
"eam" : {"seam"},
"am$" : {"seam"}
}

4. Storing the K-Gram Index

 Use an inverted index for fast lookup.


 Store in-memory using hash maps (dictionaries) or tries (prefix trees).
 Optionally, store on disk using database tables or key-value stores.
Common Data Structures Used:

 HashMap (Dictionary): Fast lookup of k-grams.


 Trie (Prefix Tree): Efficient storage and lookup of k-grams in lexicographic order.
 Inverted Index: Maps k-grams to words efficiently for wildcard queries.

5. Query Processing Using K-Gram Index

 When a user submits a query (e.g., "com*tion"), the system:


o Breaks the query into k-grams of its literal parts (e.g., com, omp, tio, ion).
o Retrieves words containing these k-grams from the index.
o Intersects results to find words matching the wildcard pattern.
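A minimal Python sketch of steps 1–5 above, assuming a tiny in-memory vocabulary, k = 3, and ^/$ boundary markers (the word list and helper names are illustrative):

from collections import defaultdict
import re

def kgrams(word, k=3):
    # Pad with boundary markers and take every overlapping k-character substring.
    padded = "^" + word + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_kgram_index(vocabulary, k=3):
    index = defaultdict(set)          # k-gram -> set of words containing it
    for word in vocabulary:
        for gram in kgrams(word, k):
            index[gram].add(word)
    return index

def wildcard_query(pattern, index, vocabulary, k=3):
    # K-grams are taken only from the literal parts of the pattern (around the *).
    grams = set()
    for part in ("^" + pattern + "$").split("*"):
        grams |= {part[i:i + k] for i in range(len(part) - k + 1)}
    candidates = set(vocabulary)
    for gram in grams:
        candidates &= index.get(gram, set())
    # Post-filter removes any remaining false positives that contain all
    # the k-grams but do not actually match the wildcard pattern.
    regex = re.compile("^" + pattern.replace("*", ".*") + "$")
    return {w for w in candidates if regex.match(w)}

vocab = ["search", "seeking", "seam", "competition", "computation", "decomposition"]
idx = build_kgram_index(vocab)
print(wildcard_query("comp*tion", idx, vocab))   # {'competition', 'computation'} (order may vary)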

3. Explain how wildcard queries are handled in k-gram indexing. Discuss the challenges
associated with wildcard queries and potential solutions.

Wildcard queries, such as "comp*tion" (matching "competition", "computation"), are efficiently
processed using a k-gram index instead of scanning the entire vocabulary. The approach involves
breaking the query into k-grams, retrieving candidate words from the index, and filtering results.

Steps to Process a Wildcard Query

1. Convert the Query into K-Grams
o Split the query at the wildcard into its literal parts ("comp" and "tion") and, for k = 3,
extract their k-grams (boundary markers such as ^co and on$ can also be added):
o "com", "omp", "tio", "ion"
2. Look Up the K-Grams in the K-Gram Index
o Retrieve the word list for each k-gram, for example:
o "com" → {"competition", "computation", "decomposition"}
o "omp" → {"competition", "computation", "decomposition"}
o "tio" → {"competition", "computation", "decomposition"}
o "ion" → {"competition", "computation", "decomposition"}
3. Intersect the Retrieved Word Lists
o Only keep words appearing in all lists:
o {"competition", "computation", "decomposition"}
o This ensures that every candidate contains all required k-grams.
4. Post-Filtering for Exact Match
o Verify that each candidate actually matches the wildcard pattern (comp*tion).
o This removes false positives: "decomposition" contains every query k-gram but does not
begin with "comp", so it is eliminated, leaving {"competition", "computation"}.

Challenges with Wildcard Queries

1. High Computational Cost


o If the wildcard appears at the start ("*tion"), many k-grams may match, leading to
large candidate sets.
o Searching large inverted lists can be expensive.

Solution:

o Use optimized indexing techniques like suffix tries or suffix arrays for prefix/suffix-based lookups.
2. Excessive Candidate Words
o Queries like "a*" may return thousands of results.

Solution:

o Set a limit on retrieved words or use frequency-based pruning.


3. Handling Multiple Wildcards
o Queries like "*comp*tion" are difficult since they match words at any position.

Solution:

o Use n-gram tokenization instead of only k-grams.
o Leverage regular expressions for final filtering.
4. Memory Usage of K-Gram Index
o Storing all k-grams for a large vocabulary consumes significant space.

Solution:

o Use compressed indexing techniques like Bloom filters or hash-based structures.

4. Explain the concept of average precision in evaluating IR systems.

Average Precision (AP) is a metric used to evaluate the effectiveness of an Information Retrieval
(IR) system by measuring how well relevant documents are ranked within search results. It
calculates the mean precision at different recall levels for a single query.

How Average Precision is Calculated

1. Retrieve and Rank Documents


o The IR system returns a ranked list of documents based on a query.
2. Identify Relevant Documents
o Mark documents as relevant (R) or non-relevant (N) based on ground truth.
3. Compute Precision at Each Relevant Document’s Position
o Precision at rank k (P@k) = (# of relevant documents retrieved up to k) / k
o Only consider ranks where a relevant document appears.
4. Compute Average Precision
o AP is the mean of precision values at positions where relevant documents appear.

Formula for AP:

AP = (1 / R) × Σ (k = 1 to n) [ P(k) × rel(k) ]

Where:

o R = Total number of relevant documents.
o P(k) = Precision at rank k.
o rel(k) = 1 if the document at rank k is relevant, else 0.
o n = Total number of retrieved documents.

Example Calculation

Suppose a search engine retrieves 5 documents for a query. The relevant documents are marked
with ✅, and non-relevant ones are marked ❌.

Rank Document Relevant?


1 D1 ✅Y
2 D2 ❌N
3 D3 ✅Y
4 D4 ✅Y
5 D5 ❌N

Step 1: Compute Precision at Each Relevant Document

Precision at rank k = (Number of relevant documents retrieved up to k) ÷ (k)

Rank (k) Relevant? Precision @ k


1 ✅Y 1/1 = 1.00
2 ❌N -
3 ✅Y 2/3 = 0.67
4 ✅ Y 3/4 = 0.75
5 ❌N -

Step 2: Compute AP

AP = (1.00 + 0.67 + 0.75) / 3 = 2.42 / 3 ≈ 0.807

Final AP Score = 0.807 (~80.7%)

This means that, on average, the search system retrieves relevant documents with 80.7% precision
across different ranks.
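A short Python sketch of this calculation (the function name and the 0/1 relevance list are illustrative; exact fractions give ≈ 0.806, while the table above rounds each P@k to two decimals first and reports 0.807):

def average_precision(relevances, total_relevant=None):
    """relevances: list of 0/1 flags for the ranked results of one query."""
    if total_relevant is None:
        total_relevant = sum(relevances)
    hits, precisions = 0, []
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # precision at each relevant rank
    return sum(precisions) / total_relevant if total_relevant else 0.0

print(round(average_precision([1, 0, 1, 1, 0]), 3))   # 0.806 (= (1 + 2/3 + 3/4) / 3)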
Significance of Average Precision

1. Ranks Precision Across Multiple Recall Levels


o AP considers ranking order, unlike simple precision or recall.
2. Handles Uneven Distribution of Relevant Documents
o Gives higher scores when relevant documents appear earlier in the ranking.
3. Used for Mean Average Precision (MAP)
o AP is computed per query; MAP averages AP across multiple queries for overall
system performance.

5. Discuss the process of relevance judgments and their importance in performance


evaluation.

Relevance judgments are the process of assessing whether a retrieved document is relevant to a
given query in an Information Retrieval (IR) system. These judgments are made by human
assessors or automated techniques and serve as the ground truth for evaluating search
performance.

2. Process of Relevance Judgments

1. Defining the Query Set


o A set of test queries is selected for evaluation.
o Example: "best programming languages in 2025".
2. Retrieving Candidate Documents
o The IR system retrieves a ranked list of documents based on each query.
3. Manual or Automated Assessment
o Human Experts:
 Review the documents and label them as relevant (R) or non-relevant (N).
 May use graded relevance (e.g., highly relevant, partially relevant, not
relevant).
o Automated Methods:
 Use user behavior (clicks, dwell time) to infer relevance.
4. Building a Ground Truth Dataset
o The query-document pairs are stored with their relevance labels.
o Example:
o Query: "best programming languages in 2025"
o D1: "Python is the most popular language in 2025." → Relevant
o D2: "History of Java programming language." → Non-Relevant

3. Importance of Relevance Judgments in Performance Evaluation

1. Creates a Benchmark for Evaluation


o Relevance judgments provide a gold standard to compare IR system outputs.
2. Enables Computation of IR Metrics
o Metrics like Precision, Recall, Mean Average Precision (MAP), and NDCG depend on
relevance labels.
3. Improves Search Engine Ranking
o Helps train machine learning models for ranking (e.g., Google’s PageRank, Learning-
to-Rank).
4. Handles Query Ambiguity
o Clarifies subjective interpretations of relevance (e.g., "apple" as fruit vs. tech
company).
5. Enhances User Satisfaction
o Systems optimized using relevance judgments return more useful results, improving
user experience.

6. Given the following document-term matrix:


Document Terms
Doc1 cat, dog, fish
Doc2 cat, bird, fish
Doc3 dog, bird, elephant
Doc4 cat, dog, elephant
Construct the posting list for each term: cat, dog, fish, bird, elephant.

A posting list (or inverted index) maps each term to a list of documents where the term appears.
Based on the given document-term matrix:

Document Terms
Doc1 cat, dog, fish
Doc2 cat, bird, fish
Doc3 dog, bird, elephant
Doc4 cat, dog, elephant

Posting List for Each Term

1. cat → {Doc1, Doc2, Doc4}


2. dog → {Doc1, Doc3, Doc4}
3. fish → {Doc1, Doc2}
4. bird → {Doc2, Doc3}
5. elephant → {Doc3, Doc4}

Final Representation (Posting Lists)

cat → [Doc1, Doc2, Doc4]


dog → [Doc1, Doc3, Doc4]
fish → [Doc1, Doc2]
bird → [Doc2, Doc3]
elephant → [Doc3, Doc4]
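A minimal Python sketch of building these posting lists from the document-term matrix (the dictionary literal simply mirrors the table in the question):

from collections import defaultdict

docs = {
    "Doc1": ["cat", "dog", "fish"],
    "Doc2": ["cat", "bird", "fish"],
    "Doc3": ["dog", "bird", "elephant"],
    "Doc4": ["cat", "dog", "elephant"],
}

postings = defaultdict(list)            # term -> list of documents containing it
for doc_id, terms in docs.items():      # documents are processed in insertion order
    for term in terms:
        postings[term].append(doc_id)

print(dict(postings))
# {'cat': ['Doc1', 'Doc2', 'Doc4'], 'dog': ['Doc1', 'Doc3', 'Doc4'], 'fish': ['Doc1', 'Doc2'], ...}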

7. Consider the following document-term matrix:


Document Terms
Doc1 apple, banana, grape
Doc2 apple, grape, orange
Doc3 banana, orange, pear
Doc4 apple, grape, pear
Create the posting list for each term: apple, banana, grape, orange, pear.

A posting list (or inverted index) associates each term with a list of documents that contain it.

Given Document-Term Matrix

Document Terms
Doc1 apple, banana, grape
Doc2 apple, grape, orange
Doc3 banana, orange, pear
Doc4 apple, grape, pear

Posting List for Each Term

1. apple → {Doc1, Doc2, Doc4}


2. banana → {Doc1, Doc3}
3. grape → {Doc1, Doc2, Doc4}
4. orange → {Doc2, Doc3}
5. pear → {Doc3, Doc4}

Final Representation (Posting Lists)

apple → [Doc1, Doc2, Doc4]


banana → [Doc1, Doc3]
grape → [Doc1, Doc2, Doc4]
orange → [Doc2, Doc3]
pear → [Doc3, Doc4]

8. Given the inverted index with posting lists:


Term Posting List
cat Doc1, Doc2, Doc4
dog Doc1, Doc3, Doc4
fish Doc1, Doc2
Calculate the Term Document Matrix and find the documents which contain both 'cat'
and 'fish' using the Boolean Retrieval Model.

Step 1: Construct the Term-Document Matrix

Using the given inverted index, we create a Term-Document Matrix (binary representation, where 1
indicates the presence of a term in a document, and 0 indicates absence):

Document   cat   dog   fish

Doc1       1     1     1
Doc2       1     0     1
Doc3       0     1     0
Doc4       1     1     0

Step 2: Find Documents Containing Both "cat" AND "fish"

Using the Boolean Retrieval Model, we perform an AND operation between the columns for cat and
fish.

Document cat (1/0) fish (1/0) AND (cat, fish)


Doc1 1 1 ✅(1)
Doc2 1 1 ✅(1)
Doc3 0 0 ❌(0)
Doc4 1 0 ❌(0)

✅Documents that contain both "cat" and "fish":


Doc1, Doc2 🎯

Final Answer

Using the Boolean Retrieval Model, the documents that contain both "cat" and "fish" are:

Doc1, Doc2

This allows efficient retrieval of relevant documents in Information Retrieval (IR) systems using
Boolean queries. 🎯
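A small Python sketch of this Boolean AND as a set intersection over the given posting lists (variable names are illustrative):

postings = {
    "cat":  {"Doc1", "Doc2", "Doc4"},
    "dog":  {"Doc1", "Doc3", "Doc4"},
    "fish": {"Doc1", "Doc2"},
}

result = postings["cat"] & postings["fish"]   # Boolean AND = set intersection
print(sorted(result))                         # ['Doc1', 'Doc2']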

9. Given the following term-document matrix for a set of documents:


Term Doc1 Doc2 Doc3 Doc4
cat 15 28 0 0
dog 18 0 32 25
fish 11 19 13 0
Total No of terms in Doc1, Doc2, Doc3 & Doc4 are 48, 85, 74 and 30 respectively.
Calculate the TF-IDF score for each term-document pair using the following TF and IDF
calculations:
• Term Frequency (TF) = (Number of occurrences of term in document) / (Total number
of terms in the
document)
• Inverse Document Frequency (IDF) = log(Total number of documents / Number of
documents
containing the term) + 1

Step 1: Given Data

We have a term-document matrix and total term counts for each document:
Term Doc1 Doc2 Doc3 Doc4
cat 15 28 0 0
dog 18 0 32 25
fish 11 19 13 0

Total terms per document:

 Doc1 = 48
 Doc2 = 85
 Doc3 = 74
 Doc4 = 30

Step 2: Calculate Term Frequency (TF)

TF = (Number of occurrences of term in document) / (Total number of terms in the document)

Term TF in Doc1 TF in Doc2 TF in Doc3 TF in Doc4


cat 15/48 = 0.3125 28/85 = 0.3294 0 0
dog 18/48 = 0.375 0 32/74 = 0.4324 25/30 = 0.8333
fish 11/48 = 0.2292 19/85 = 0.2235 13/74 = 0.1757 0

Step 3: Calculate Inverse Document Frequency (IDF)

Total number of documents N = 4.

 cat appears in 2 documents → IDF = log(4/2) + 1 = log(2) + 1 ≈ 1.693


 dog appears in 3 documents → IDF = log(4/3) + 1 = log(1.33) + 1 ≈ 1.176
 fish appears in 3 documents → IDF = log(4/3) + 1 = log(1.33) + 1 ≈ 1.176

Step 4: Compute TF-IDF

TF−IDF=TF×IDF

Term   TF-IDF in Doc1            TF-IDF in Doc2            TF-IDF in Doc3            TF-IDF in Doc4

cat    0.3125 × 1.693 = 0.529    0.3294 × 1.693 = 0.558    0                         0
dog    0.375 × 1.176 = 0.441     0                         0.4324 × 1.176 = 0.508    0.8333 × 1.176 = 0.980
fish   0.2292 × 1.176 = 0.270    0.2235 × 1.176 = 0.263    0.1757 × 1.176 = 0.207    0

Final TF-IDF Scores

Term Doc1 Doc2 Doc3 Doc4


cat 0.529 0.558 0 0
dog 0.441 0 0.508 0.980
fish 0.270 0.263 0.207 0

These TF-IDF scores represent the importance of each term in each document, helping in ranking
and retrieval in Information Retrieval (IR) systems. 🎯
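A minimal Python sketch of the TF and IDF formulas used above. The logarithm base is an assumption (natural log here, which reproduces 1.693 for 'cat'); terms appearing in 3 of 4 documents may come out slightly different from the rounded IDF used in the table:

import math

counts = {"Doc1": {"cat": 15, "dog": 18, "fish": 11},
          "Doc2": {"cat": 28, "fish": 19},
          "Doc3": {"dog": 32, "fish": 13},
          "Doc4": {"dog": 25}}
doc_lengths = {"Doc1": 48, "Doc2": 85, "Doc3": 74, "Doc4": 30}
terms = ["cat", "dog", "fish"]
n_docs = len(counts)

df = {t: sum(1 for d in counts.values() if t in d) for t in terms}
idf = {t: math.log(n_docs / df[t]) + 1 for t in terms}      # IDF = log(N / df) + 1

tfidf = {doc: {t: (counts[doc].get(t, 0) / doc_lengths[doc]) * idf[t] for t in terms}
         for doc in counts}
print(round(tfidf["Doc1"]["cat"], 3))   # 0.3125 * 1.693 ≈ 0.529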

10. Given the term-document matrix and the TF-IDF scores calculated from Problem 4,
calculate the cosine similarity between each pair of documents (Doc1, Doc2), (Doc1,
Doc3), (Doc1, Doc4), (Doc2, Doc3), (Doc2,Doc4), and (Doc3, Doc4).

Step 1: Given TF-IDF Matrix

From the previous problem, we have the TF-IDF matrix:

Term Doc1 Doc2 Doc3 Doc4


cat 0.529 0.558 0 0
dog 0.441 0 0.508 0.980
fish 0.270 0.263 0.207 0

Each document is represented as a TF-IDF vector:

 Doc1 → (0.529, 0.441, 0.270)


 Doc2 → (0.558, 0, 0.263)
 Doc3 → (0, 0.508, 0.207)
 Doc4 → (0, 0.980, 0)

Step 2: Cosine Similarity Formula

The cosine similarity between two vectors A and B is calculated as:

cosine similarity(A, B) = (A · B) / (||A|| × ||B||)

Where:

• A · B is the dot product of vectors A and B.
• ||A|| is the magnitude (Euclidean norm) of vector A: ||A|| = sqrt(A1² + A2² + A3²).

Step 3: Compute Magnitudes of Vectors

||Doc1|| = sqrt(0.529² + 0.441² + 0.270²) ≈ 0.740
||Doc2|| = sqrt(0.558² + 0² + 0.263²) ≈ 0.617
||Doc3|| = sqrt(0² + 0.508² + 0.207²) ≈ 0.549
||Doc4|| = sqrt(0² + 0.980² + 0²) = 0.980

Step 4: Compute Cosine Similarities

For example, for (Doc1, Doc2):
cos(Doc1, Doc2) = (0.529 × 0.558 + 0.441 × 0 + 0.270 × 0.263) / (0.740 × 0.617) = 0.366 / 0.456 ≈ 0.803
The remaining pairs are computed in the same way, giving the scores below.

Step 5: Final Cosine Similarity Scores

Document Pair Cosine Similarity


(Doc1, Doc2) 0.803
(Doc1, Doc3) 0.690
(Doc1, Doc4) 0.597
(Doc2, Doc3) 0.160
(Doc2, Doc4) 0.000
(Doc3, Doc4) 0.926

Interpretation

 Doc1 and Doc2 (0.803) → High similarity, meaning they contain overlapping terms.
 Doc1 and Doc3 (0.690) → Moderate similarity.
 Doc1 and Doc4 (0.597) → Some similarity.
 Doc2 and Doc3 (0.160) → Low similarity.
 Doc2 and Doc4 (0.000) → No similarity (no shared terms).
 Doc3 and Doc4 (0.926) → Very high similarity, meaning they are very close in content.
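A short Python sketch of the cosine similarity computation on the TF-IDF vectors above (the vector values are copied from the table):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = {"Doc1": (0.529, 0.441, 0.270),
           "Doc2": (0.558, 0.0, 0.263),
           "Doc3": (0.0, 0.508, 0.207),
           "Doc4": (0.0, 0.980, 0.0)}

print(round(cosine(vectors["Doc1"], vectors["Doc2"]), 3))   # ≈ 0.803
print(round(cosine(vectors["Doc3"], vectors["Doc4"]), 3))   # ≈ 0.926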

11. Consider the following queries expressed in terms of TF-IDF weighted vectors:
Query1: cat: 0.5, dog: 0.5, fish: 0
Query2: cat: 0, dog: 0.5, fish: 0.5
Calculate the cosine similarity between each query and each document from the term-
document matrix in Problem 4.

Step 1: Cosine Similarity Formula

For two vectors A and B:

cos(θ) = (A · B) / (||A|| × ||B||)

where:

• A · B is the dot product: A · B = A1B1 + A2B2 + A3B3
• ||A|| is the magnitude (Euclidean norm): ||A|| = sqrt(A1² + A2² + A3²)
Step 2: Given Vectors

Query Vectors

 Query1 (Q1) = (0.5, 0.5, 0)


 Query2 (Q2) = (0, 0.5, 0.5)

Document Vectors

 D1 = (0.8, 0.2, 0.1)


 D2 = (0.1, 0.7, 0.3)
 D3 = (0.2, 0.5, 0.9)
Final Results (Rounded to 4 Decimal Places)

D1 D2 D3
Q1 0.8513 0.7365 0.4719
Q2 0.2554 0.9206 0.9439

12. Given the following term-document matrix:

Term Doc1 Doc2 Doc3 Doc4


apple 22 9 0 40
banana 14 0 12 0
orange 0 23 14 0

Total No of terms in Doc1, Doc2, Doc3 & Doc4 are 65, 48, 36 and 92 respectively. calculate
the TF-IDF score for each term-document pair.

Step 1: Given Data Term-Document Matrix


Term Doc1 Doc2 Doc3 Doc4

Apple 22 9 0 40

Banana 14 0 12 0

Orange 0 23 14 0
Total number of terms in each document:

 Total terms in Doc1 = 65


 Total terms in Doc2 = 48
 Total terms in Doc3 = 36
 Total terms in Doc4 = 92

Total number of documents: N=4

Step 2: Compute Term Frequency (TF)

TF = (Number of occurrences of term in document) / (Total number of terms in the document)

Term     TF in Doc1       TF in Doc2       TF in Doc3       TF in Doc4

Apple    22/65 = 0.3385   9/48 = 0.1875    0                40/92 = 0.4348
Banana   14/65 = 0.2154   0                12/36 = 0.3333   0
Orange   0                23/48 = 0.4792   14/36 = 0.3889   0

Step 3: Compute Document Frequency (DF)

DF=Number of documents containing the term

Term DF
Apple 3 (appears in Doc1, Doc2, Doc4)
Banana 2 (appears in Doc1, Doc3)
Orange 2 (appears in Doc2, Doc3)
Final TF-IDF Table
Term Doc1 Doc2 Doc3 Doc4

Apple 0.0423 0.0234 0.0000 0.0543

Banana 0.1493 0.0000 0.2310 0.0000

Orange 0.0000 0.3321 0.2696 0.0000


13. Suppose you have a test collection with 50 relevant documents for a given query. Your
retrieval system returns 30 documents, out of which 20 are relevant. Calculate the Recall,
Precision, and F-score for this retrieval.
• Recall = (Number of relevant documents retrieved) / (Total number of relevant
documents)
• Precision = (Number of relevant documents retrieved) / (Total number of documents
retrieved)
• F-score = 2 * (Precision * Recall) / (Precision + Recall)

Given Data

 Total relevant documents (R) = 50


 Retrieved documents (T) = 30
 Relevant documents retrieved (RR) = 20

Step 1: Recall Calculation

Recall = (Number of relevant documents retrieved) / (Total number of relevant documents)
Recall = 20 / 50 = 0.4 (or 40%)

Step 2: Precision Calculation

Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
Precision = 20 / 30 = 0.6667 (or 66.67%)

Step 3: F-score Calculation

F-score = 2 × (Precision × Recall) / (Precision + Recall)
F-score = 2 × (0.6667 × 0.4) / (0.6667 + 0.4) = 2 × 0.2667 / 1.0667 = 0.5
F-score = 0.5 (or 50%)

Final Results

 Recall = 0.4 (40%)


 Precision = 0.6667 (66.67%)
 F-score = 0.5 (50%)
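A minimal Python sketch of the Recall, Precision and F-score formulas used in Problems 13–17, checked against the numbers above (the function name is illustrative):

def evaluate(total_relevant, retrieved, relevant_retrieved):
    recall = relevant_retrieved / total_relevant
    precision = relevant_retrieved / retrieved
    f_score = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f_score

r, p, f = evaluate(total_relevant=50, retrieved=30, relevant_retrieved=20)
print(round(r, 4), round(p, 4), round(f, 4))   # 0.4 0.6667 0.5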

14. You have a test collection containing 100 relevant documents for a query. Your retrieval
system retrieves 80 documents, out of which 60 are relevant. Calculate the Recall,
Precision, and F-score for this retrieval.

Given:

 Total relevant documents (R) = 100


 Retrieved documents (N) = 80
 Relevant retrieved documents (r) = 60

Step 1: Recall Calculation

Recall = (Number of relevant documents retrieved) / (Total number of relevant documents)
Recall = 60 / 100 = 0.6 (or 60%)

Step 2: Precision Calculation

Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
Precision = 60 / 80 = 0.75 (or 75%)

Step 3: F-score Calculation

F-score = 2 × (Precision × Recall) / (Precision + Recall)
F-score = 2 × (0.75 × 0.6) / (0.75 + 0.6) = 2 × 0.45 / 1.35 = 0.6667
F-score = 0.6667 (or 66.67%)

Final Results

 Recall = 60% (0.6)


 Precision = 75% (0.75)
 F-score = 66.67% (0.6667)

15. In a test collection, there are a total of 50 relevant documents for a query. Your retrieval
system retrieves 60 documents, out of which 40 are relevant. Calculate the Recall,
Precision, and F-score for this retrieval

Given Data:

 Total relevant documents (R) = 50


 Retrieved documents (N) = 60
 Relevant retrieved documents (r) = 40

Step 1: Recall Calculation

Recall = (Number of relevant documents retrieved) / (Total number of relevant documents)
Recall = 40 / 50 = 0.8 (or 80%)

Step 2: Precision Calculation

Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
Precision = 40 / 60 = 0.6667 (or 66.67%)

Step 3: F-score Calculation

F-score = 2 × (Precision × Recall) / (Precision + Recall)
F-score = 2 × (0.6667 × 0.8) / (0.6667 + 0.8) = 2 × 0.5333 / 1.4667 = 0.7273
F-score = 0.7273 (or 72.73%)

Final Results

 Recall = 80% (0.8)


 Precision = 66.67% (0.6667)
 F-score = 72.73% (0.7273)

16. You have a test collection with 200 relevant documents for a query. Your retrieval
system retrieves 150 documents, out of which 120 are relevant. Calculate the Recall,
Precision, and F-score for this retrieval.

Given Data:

 Total relevant documents (R) = 200


 Retrieved documents (N) = 150
 Relevant retrieved documents (r) = 120
Step 1: Recall Calculation

Recall = (Number of relevant documents retrieved) / (Total number of relevant documents)
Recall = 120 / 200 = 0.6 (or 60%)

Step 2: Precision Calculation

Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
Precision = 120 / 150 = 0.8 (or 80%)

Step 3: F-score Calculation

F-score = 2 × (Precision × Recall) / (Precision + Recall)
F-score = 2 × (0.8 × 0.6) / (0.8 + 0.6) = 2 × 0.48 / 1.4 = 0.6857
F-score = 0.6857 (or 68.57%)

Final Results

 Recall = 60% (0.6)


 Precision = 80% (0.8)
 F-score = 68.57% (0.6857)
17. In a test collection, there are 80 relevant documents for a query. Your retrieval system
retrieves 90 documents, out of which 70 are relevant. Calculate the Recall, Precision, and
F-score for this retrieval.
Given Data:

 Total relevant documents (R) = 80


 Retrieved documents (N) = 90
 Relevant retrieved documents (r) = 70

Step 1: Recall Calculation

Recall = (Number of relevant documents retrieved) / (Total number of relevant documents)
Recall = 70 / 80 = 0.875 (or 87.5%)

Step 2: Precision Calculation

Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
Precision = 70 / 90 = 0.7778 (or 77.78%)

Step 3: F-score Calculation

F-score = 2 × (Precision × Recall) / (Precision + Recall)
F-score = 2 × (0.7778 × 0.875) / (0.7778 + 0.875) = 2 × 0.6806 / 1.6528 = 0.8235
F-score = 0.8235 (or 82.35%)


18. Construct 2-gram, 3-gram and 4-gram index for the following terms:
a. banana b. pineapple c. computer d. programming e. elephant f. database.
Given Terms:
Banana
Pineapple
Computer
Programming
Elephant
database

Step 1: Construct n-grams

1. 2-gram (Bigram) Index

Term Padded Term 2-grams (Bigrams)

banana $banana$ $b, ba, an, na, an, na, a$

pineapple $pineapple$ $p, pi, in, ne, ea, ap, pp, pl, le, e$

computer $computer$ $c, co, om, mp, pu, ut, te, er, r$

programming $programming$ $p, pr, ro, og, gr, ra, am, mm, mi, in, ng, g$

elephant $elephant$ $e, el, le, ep, ph, ha, an, nt, t$

database $database$ $d, da, at, ta, ab, ba, as, se, e$

2. 3-gram (Trigram) Index

Term Padded Term 3-grams (Trigrams)

banana $banana$ $ba, ban, ana, nan, ana, na$

pineapple $pineapple$ $pi, pin, ine, nea, eap, app, ppl, ple, le$

computer $computer$ $co, com, omp, mpu, put, ute, ter, er$

programming $programming$ $pr, pro, rog, ogr, gra, ram, amm, mmi, min, ing, ng$

elephant $elephant$ $el, ele, lep, eph, pha, han, ant, nt$

database $database$ $da, dat, ata, tab, aba, bas, ase, se$
3. 4-gram (Four-gram) Index

Term Padded Term 4-grams (Four-grams)

banana $banana$ $ban, bana, anan, nana, ana$

pineapple $pineapple$ $pin, pine, inea, neap, eapp, appl, pple, ple$

computer $computer$ $com, comp, ompu, mput, pute, uter, ter$

programming $programming$ $pro, prog, rogr, ogra, gram, ramm, ammi, mmin, ming, ing$

elephant $elephant$ $ele, elep, leph, epha, phan, hant, ant$

database $database$ $dat, data, atab, taba, abas, base, ase$
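A minimal Python sketch of the n-gram generation used in these tables, assuming '$' padding on both ends as above:

def ngrams(term, n):
    padded = "$" + term.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngrams("banana", 2))   # ['$b', 'ba', 'an', 'na', 'an', 'na', 'a$']
print(ngrams("banana", 3))   # ['$ba', 'ban', 'ana', 'nan', 'ana', 'na$']

for word in ["banana", "pineapple", "computer", "programming", "elephant", "database"]:
    print(word, ngrams(word, 4))   # the 4-gram rows of the table above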

19. Calculate the Levenshtein distance between the following pair of words:
a. kitten and sitting b. intention and execution c. robot and orbit d. power and flower

The Levenshtein distance (Edit Distance) between two words is the minimum number of single-
character edits (insertions, deletions, or substitutions) required to change one word into the other.

(a) kitten → sitting


Step Operation

kitten → sitten Substitute 'k' → 's'

sitten → sittin Substitute 'e' → 'i'

sittin → sitting Insert 'g' at the end

Levenshtein Distance = 3

(b) intention → execution


Step Operation

intention → ntention Delete 'i'

ntention → etention Substitute 'n' → 'e'

etention → exention Substitute 't' → 'x'

exention → exection Substitute 'n' → 'c'

exection → execution Insert 'u'

Levenshtein Distance = 5
(c) robot → orbit

Step Operation

robot → oobot Substitute 'r' → 'o'

oobot → orbot Substitute 'o' → 'r'

orbot → orbit Substitute 'o' → 'i'

Levenshtein Distance = 3

(d) power → flower


Step Operation

power → fpower Insert 'f' at the beginning

fpower → flower Substitute 'p' → 'l'

Levenshtein Distance = 2

Final Results:
Word Pair Levenshtein Distance

kitten → sitting 3

intention → execution 5

robot → orbit 3

power → flower 2
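A minimal dynamic-programming sketch in Python of the Levenshtein distance (insertions, deletions and substitutions all cost 1), which reproduces the distances above:

def levenshtein(s, t):
    prev = list(range(len(t) + 1))            # distances for the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,       # deletion
                            curr[j - 1] + 1,   # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

pairs = [("kitten", "sitting"), ("intention", "execution"),
         ("robot", "orbit"), ("power", "flower")]
for a, b in pairs:
    print(a, b, levenshtein(a, b))   # 3, 5, 3, 2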

20. Using the Soundex algorithm, encode the following:


a. Williams b. Gonzalez c. Harrison d. Parker e. Jackson f. Thompson

The Soundex algorithm encodes words (mainly names) into a four-character code consisting of a
letter followed by three digits, which represent similar sounding consonants. The general steps are:

1. Keep the first letter of the word (uppercase).


2. Convert the rest of the letters into digits using the mapping below:
o B, F, P, V → 1
o C, G, J, K, Q, S, X, Z → 2
o D, T → 3
o L→4
o M, N → 5
o R→6
3. Remove vowels (A, E, I, O, U), H, W, and Y unless they are the first letter.
4. If two adjacent letters have the same code, keep only one.
5. Pad the result with zeros if necessary to make a 4-character code.

Encoding the Given Words:

(a) Williams

1. First letter: W
2. Convert remaining letters: I(ignored), L(4), L(4), I(ignored), A(ignored), M(5), S(2)
3. Remove duplicates: W-4-5-2
4. Final Soundex: W452

(b) Gonzalez

1. First letter: G
2. Convert remaining letters: O(ignored), N(5), Z(2), A(ignored), L(4), E(ignored), Z(2)
3. Remove duplicates: G-5-2-4
4. Final Soundex: G524

(c) Harrison

1. First letter: H
2. Convert remaining letters: A(ignored), R(6), R(6), I(ignored), S(2), O(ignored), N(5)
3. Remove duplicates: H-6-2-5
4. Final Soundex: H625

(d) Parker

1. First letter: P
2. Convert remaining letters: A(ignored), R(6), K(2), E(ignored), R(6)
3. Remove duplicates: P-6-2-6
4. Final Soundex: P626

(e) Jackson

1. First letter: J
2. Convert remaining letters: A(ignored), C(2), K(2), S(2), O(ignored), N(5)
3. Remove duplicates: J-2-5
4. Final Soundex (padded with 0): J250

(f) Thompson

1. First letter: T
2. Convert remaining letters: H(ignored), O(ignored), M(5), P(1), S(2), O(ignored), N(5)
3. Remove duplicates: T-5-1-2
4. Final Soundex: T512
Final Soundex Codes:

Word Soundex Code


Williams W452
Gonzalez G524
Harrison H625
Parker P626
Jackson J250
Thompson T512
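A minimal Python sketch of the Soundex steps above (a simplified implementation; it reproduces the codes in the table for these six names):

def soundex(name):
    name = name.upper()
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4", **dict.fromkeys("MN", "5"), "R": "6"}
    first = name[0]
    encoded = [codes.get(ch, "") for ch in name]   # vowels, H, W, Y get no digit
    digits = []
    prev = encoded[0]                              # code of the first letter, for duplicate collapsing
    for ch, code in zip(name[1:], encoded[1:]):
        if ch in "HW":
            continue                               # H and W do not separate duplicate codes
        if code and code != prev:
            digits.append(code)
        prev = code                                # a vowel (empty code) resets the previous code
    return (first + "".join(digits) + "000")[:4]

for n in ["Williams", "Gonzalez", "Harrison", "Parker", "Jackson", "Thompson"]:
    print(n, soundex(n))   # W452 G524 H625 P626 J250 T512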
Unit II

1. Compare and contrast the Naive Bayes and Support Vector Machines (SVM) algorithms
for text classification. Highlight their strengths and weaknesses.

Both Naïve Bayes (NB) and Support Vector Machines (SVM) are widely used in text classification
tasks such as spam detection, sentiment analysis, and topic classification. However, they differ
significantly in their approach, performance, and suitability for different types of text datasets.

1. Key Differences: Naïve Bayes vs. SVM


Feature                            Naïve Bayes (NB)                                   Support Vector Machines (SVM)

Algorithm Type                     Probabilistic (Bayesian approach)                  Discriminative (margin-based classification)
Mathematical Basis                 Bayes' theorem with independence assumption        Finds an optimal hyperplane for classification
Training Speed                     Very fast, linear time complexity O(n)             Slower, quadratic or higher complexity O(n^2)
Scalability                        Highly scalable to large datasets                  Less scalable, expensive for very large datasets
Feature Dependence                 Assumes features are conditionally independent     No independence assumption, considers feature interactions
Handling of High-Dimensional Data  Performs well due to independence assumption       Works well with high-dimensional sparse data
Performance with Small Data        Works well, even with small datasets               Requires more data for robust decision boundaries
Sensitivity to Noisy Data          Can be affected by noisy features                  More robust to outliers and noise
Interpretability                   Easily interpretable (probabilistic outputs)       Less interpretable (black-box nature)
Parameter Tuning                   Minimal tuning required                            Requires careful tuning (kernel, C, gamma)
Handling of Imbalanced Data        Works well if probabilities are adjusted           Can be biased; requires techniques like class weighting
2. Strengths and Weaknesses

Naïve Bayes (NB)

✅Strengths:

 Extremely fast training and classification.


 Works well with small datasets and high-dimensional text data.
 Simple and interpretable (probabilities assigned to each class).
 Works well for binary and multi-class classification.

❌Weaknesses:

 Assumes independence of features (which is rarely true in text data).


 Can be easily misled by correlated features.
 Performs worse when features are dependent or for complex classification tasks.

Support Vector Machines (SVM)

✅Strengths:

 Works well in high-dimensional spaces (like TF-IDF or word embeddings).


 Effective with complex decision boundaries (useful for nuanced classification).
 Robust to overfitting and noise, especially with proper kernel selection.
 Can handle imbalanced datasets better with appropriate techniques (e.g., weighted classes).

❌Weaknesses:

 Computationally expensive for large datasets.


 Requires tuning of hyperparameters (e.g., kernel type, C, gamma).
 Difficult to interpret compared to Naïve Bayes.

2. Compare and contrast the effectiveness of K-means and hierarchical clustering in text
data analysis. Discuss their suitability for different types of text corpora and retrieval tasks.

1. Comparison of Effectiveness

Feature            K-means                                                     Hierarchical Clustering

Algorithm Type     Partition-based (iterative)                                 Agglomerative (bottom-up) or Divisive (top-down)
Scalability        Highly scalable for large datasets                          Less scalable due to high computational complexity
Time Complexity    O(n·k·i) (n = data points, k = clusters, i = iterations)    O(n^2 log n) for agglomerative
Cluster Shape      Assumes spherical clusters                                  Can find clusters of arbitrary shapes
Memory Usage       Low (stores only cluster centroids)                         High (stores the pairwise distance matrix)
Interpretability   Less interpretable, requires a predefined k                 Dendrogram provides a hierarchical structure
Stability          Sensitive to initialization                                 More stable as it does not rely on initialization

2. Suitability for Different Types of Text Corpora

K-means Clustering

 Best for:
o Large-scale text datasets (e.g., news articles, social media posts).
o Applications where predefined clusters are needed (e.g., topic modeling, document
categorization).
o High-dimensional text representations (TF-IDF, word embeddings) where efficiency
is critical.

Hierarchical Clustering

 Best for:
o Small to medium-sized corpora (e.g., research papers, legal documents).
o Tasks requiring hierarchical structures (e.g., taxonomy generation, document
organization).
o Exploratory analysis where the number of clusters is unknown.

3. Suitability for Retrieval Tasks


Retrieval Task                       Best Choice    Reason

Topic Modeling                       K-means        Efficient, handles large datasets, works well with word embeddings
Document Classification              K-means        Assigns predefined labels, scalable for large corpora
Hierarchical Document Organization   Hierarchical   Generates a tree-like structure for nested categories
Keyword-Based Search Optimization    K-means        Clusters similar terms efficiently
Taxonomy Generation                  Hierarchical   Provides a structured representation of relationships

3. Discuss challenges and issues in applying clustering techniques to large-scale text data.

1. High Dimensionality of Text Data


Text data is often represented using high-dimensional feature spaces, making it difficult for
clustering algorithms to measure similarity effectively. Traditional methods like K-means
struggle in such spaces, leading to poor cluster quality. Dimensionality reduction techniques
like PCA, t-SNE, or word embeddings (Word2Vec, BERT) can help but add computational
complexity.
2. Data Sparsity
Text representation methods like TF-IDF create sparse matrices, which reduce clustering
efficiency. Many clustering algorithms, such as K-means, rely on dense vector spaces for
better performance. Word embeddings and semantic representations can help address this
issue.
3. Scalability Issues
Clustering large text datasets is computationally expensive. Hierarchical clustering has a
high time complexity, making it infeasible for big data. Even K-means, though more scalable,
requires multiple iterations, which can be resource-intensive. Mini-batch K-means and
distributed frameworks like Apache Spark can help manage scalability.
4. Choosing the Optimal Number of Clusters
Most clustering algorithms require a predefined number of clusters, which is challenging to
determine. Methods like the Elbow Method, Silhouette Score, or Gap Statistics provide
estimates, but results are often heuristic and dataset-dependent. Hybrid approaches, such
as using hierarchical clustering to determine k before applying K-means, can improve
results.
5. Overlapping and Ambiguous Text Categories
Many text documents belong to multiple topics, making hard clustering methods less
effective. K-means and hierarchical clustering struggle with documents that exhibit multiple
themes. Soft clustering techniques like Gaussian Mixture Models or topic modeling
approaches like Latent Dirichlet Allocation (LDA) offer better solutions.
6. Noise and Sensitivity to Initialization
K-means is highly sensitive to initial cluster centroids, leading to inconsistent results.
Additionally, text data often contains noise, such as stopwords, typos, and synonyms, which
can distort clustering. Proper text preprocessing, including stopword removal, stemming,
and lemmatization, helps improve clustering performance.
7. Computational Complexity and Memory Constraints
Large-scale text clustering requires significant memory and processing power. Hierarchical
clustering, which stores pairwise distance matrices, becomes infeasible for massive
datasets. Streaming and online clustering algorithms, such as BIRCH or Mini-batch K-means,
help address computational limitations.
8. Evaluating Cluster Quality
Clustering is an unsupervised task, making it difficult to evaluate results. Metrics like
Silhouette Score, Davies-Bouldin Index, and Adjusted Rand Index provide quantitative
assessments but may not always align with human judgment. Manual evaluation and
comparison with known benchmarks improve interpretation.
9. Language and Contextual Challenges
Words often have multiple meanings depending on context, making traditional clustering
approaches ineffective. For example, "bank" could refer to a financial institution or a
riverbank. Context-aware embeddings like BERT and semantic similarity techniques
improve clustering accuracy by capturing nuanced relationships between words.
10. Hybrid and Advanced Approaches
To address these challenges, hybrid approaches combining multiple techniques are often
necessary. Using dimensionality reduction before clustering, leveraging distributed
computing, and applying deep learning-based embeddings enhance clustering performance.
Combining unsupervised clustering with supervised learning can also refine results.

4. Explain link analysis and the PageRank algorithm. How does PageRank work to
determine the importance of web pages?

5. Describe the PageRank algorithm and how it calculates the importance of web pages
based on their incoming links. Discuss its role in web search ranking.

What is Link Analysis?

Link analysis is a technique used to evaluate relationships between interconnected entities,


typically in the form of networks or graphs. In the context of web search, link analysis examines the
structure of hyperlinks between web pages to assess their authority, relevance, and importance. It
is widely used in search engine ranking, social network analysis, and fraud detection.

One of the most famous applications of link analysis is the PageRank algorithm, developed by Larry
Page and Sergey Brin at Google, which revolutionized web search by ranking web pages based on
their link structure rather than just keyword matching.

Introduction to PageRank

The PageRank algorithm was developed by Larry Page and Sergey Brin at Google to measure the
importance of web pages based on their link structure. It is a graph-based ranking algorithm that
assigns a numerical score to each web page, representing its importance within a network of linked
pages.

The fundamental idea behind PageRank is that a web page is more important if many other
important pages link to it. This concept is based on the assumption that hyperlinks act as votes of
confidence, where links from authoritative pages pass more value than those from less credible
sources.

PageRank Algorithm Formula

PR(P) = (1 - d) / N + d × Σ over all pages Pi linking to P of [ PR(Pi) / L(Pi) ]

Where:

• PR(P) = PageRank score of page P.
• N = Total number of web pages in the network.
• d = Damping factor (typically set to 0.85) to account for the probability that a user randomly
jumps to another page instead of following links.
• Pi = Web pages linking to P.
• PR(Pi) = PageRank score of linking page Pi.
• L(Pi) = Number of outbound links on page Pi.

The process starts with equal PageRank values for all pages and iteratively updates them based on
incoming links until the values converge.
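A minimal Python sketch of this iterative computation on a small hypothetical three-page graph (the link structure, damping factor and iteration count are illustrative assumptions; it also assumes every page has at least one outlink):

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                 # start with equal PageRank values
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming      # PR(P) = (1-d)/N + d * Σ PR(Pi)/L(Pi)
        pr = new
    return pr

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_graph))   # C and A end up with the highest scores in this toy graph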

Key Characteristics of PageRank

1. Authority-Based Ranking – Pages with more incoming links from authoritative sources have
higher ranks.
2. Link Weighting – A link from a highly-ranked page (e.g., Wikipedia) is more valuable than a
link from a low-ranked page.
3. Damping Factor – Accounts for random user behavior, ensuring that pages without links
still get some rank.
4. Iterative Computation – The algorithm runs multiple iterations until PageRank values
stabilize.

How PageRank Determines the Importance of Web Pages

1. More Links = Higher Rank – Pages that receive links from multiple sources are considered
more important.
2. Quality Matters – Links from high PageRank pages contribute more to a page’s ranking than
links from low-ranked pages.
3. Prevents Manipulation – Since PageRank distributes weight based on outbound links,
spammy pages with excessive outbound links do not accumulate much authority.
4. Improves Search Engine Ranking – Pages with higher PageRank are more likely to appear at
the top of search results, leading to better visibility.

Role of PageRank in Web Search Ranking

1. Determining Authority – Pages with higher PageRank are considered more authoritative
and are more likely to appear at the top of search results.
2. Reducing Spam – Links from high-quality sources are weighted more, reducing the impact
of low-quality or manipulated links.
3. Improving User Experience – Helps rank credible and informative pages higher, making
search results more relevant.
4. Foundation for Modern Algorithms – Although Google now uses hundreds of ranking
factors, PageRank remains a core concept in search engine optimization (SEO).
6. Explain how link analysis algorithms like HITS (Hypertext Induced Topic Search)
contribute to improving search engine relevance.

Link analysis algorithms evaluate the structure of hyperlinks between web pages to determine their
importance, authority, and relevance. These algorithms help search engines rank pages more
effectively by identifying authoritative sources and relevant hubs within a given topic. One such
algorithm is HITS (Hypertext Induced Topic Search), which improves search engine relevance by
analyzing the link structure beyond simple keyword matching.

HITS Algorithm: Hypertext Induced Topic Search

The HITS algorithm, developed by Jon Kleinberg in 1999, classifies web pages into two categories:

1. Authorities – Pages that are highly referenced by other important pages on a topic.
2. Hubs – Pages that link to multiple authoritative sources on a topic.

HITS operates in two phases:

1. Sampling Phase – Given a search query, HITS collects a small set of web pages related to the
query, typically those retrieved by traditional search engines.
2. Iterative Authority-Hub Calculation – The algorithm assigns two scores to each page:
o Authority Score (A(p)): The sum of hub scores of all pages linking to it.
o Hub Score (H(p)): The sum of authority scores of all pages it links to.

These scores are updated iteratively using:

A(p) = Σ H(i) over all pages i that link to p
H(p) = Σ A(j) over all pages j that p links to

where i ranges over pages linking to p and j ranges over pages linked from p. After several
iterations, the scores converge, distinguishing authoritative sources from hub pages.
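A minimal Python sketch of these authority/hub updates on a hypothetical link graph (the graph and the normalization choice are assumptions for illustration):

import math

def hits(links, iterations=20):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to p
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # Hub score: sum of authority scores of pages that p links to
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        # Normalize so the scores stay comparable across iterations
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": []}
authorities, hubs = hits(toy_graph)   # C emerges as the strongest authority, A as the strongest hub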

How HITS Improves Search Engine Relevance

1. Topic-Specific Ranking – Unlike PageRank, which ranks pages globally, HITS focuses on
query-relevant pages, improving ranking for specific topics.
2. Identifying Expert Pages – HITS recognizes authoritative sources (e.g., government or
academic websites) by considering who references them.
3. Improving Query Expansion – By identifying strong hubs, HITS finds additional relevant
resources for a given query.
4. Handling New and Dynamic Content – Since HITS dynamically analyzes link structure for
each query, it can better rank emerging pages that are not yet globally popular.
5. Complementing PageRank – While PageRank focuses on global importance, HITS refines
results within specific topics, making search results more contextually relevant.
7. Discuss the impact of web information retrieval on modern search engine technologies
and user experiences.

Web Information Retrieval (IR) is the process of extracting relevant data from the vast and ever-
growing collection of online documents. It serves as the foundation of modern search engines,
helping them efficiently retrieve, rank, and present relevant web pages to users. Advances in IR
have transformed search engine technologies and significantly improved user experiences by
enhancing accuracy, speed, and personalization.

Impact on Modern Search Engine Technologies

1. Improved Ranking Algorithms


Traditional search engines relied on keyword-based matching, but modern IR techniques
incorporate advanced ranking algorithms like PageRank, HITS, and machine learning-based
models to prioritize high-quality, authoritative content. Deep learning models, such as BERT
and RankBrain, enable search engines to understand the context and intent behind user
queries.
2. Semantic Search and Natural Language Processing (NLP)
Web IR has shifted from simple keyword matching to semantic search, which understands
the meaning of words and their relationships. NLP techniques, including named entity
recognition (NER), sentiment analysis, and word embeddings (Word2Vec, BERT), allow
search engines to process queries more like humans, improving accuracy and relevance.
3. Personalization and User Intent Recognition
Modern search engines leverage user behavior analysis, location data, and browsing history
to tailor search results to individual preferences. By utilizing collaborative filtering and
reinforcement learning, search engines can provide personalized recommendations,
improving relevance and engagement.
4. Indexing and Scalability Enhancements
As web content grows exponentially, efficient indexing techniques such as inverted indexes,
distributed computing, and parallel processing help search engines store and retrieve data
quickly. Technologies like Google’s Caffeine indexing system allow real-time indexing,
ensuring that search results remain fresh and up to date.
5. Multimedia and Multimodal Search
Advances in IR have expanded search capabilities beyond text. Image, video, voice, and even
multimodal search (combining text and visual inputs) enable users to retrieve information
in diverse formats. Reverse image search, voice assistants, and AI-powered video search
exemplify how search engines now handle various media types.
6. Handling Large-Scale and Noisy Data
With billions of web pages and user-generated content, search engines must filter out spam,
duplicate content, and misinformation. Machine learning classifiers, knowledge graphs, and
fact-checking algorithms help improve result quality by prioritizing trustworthy sources.

Impact on User Experience

1. Faster and More Accurate Search Results


With advanced indexing and ranking algorithms, users receive near-instantaneous and
highly relevant results. Predictive search suggestions and query auto-completion reduce the
effort required to find information.
2. Conversational and Voice Search
Web IR innovations have enabled voice-based assistants like Google Assistant, Siri, and
Alexa, allowing users to perform searches hands-free. Conversational AI models like
ChatGPT and Bard further enhance interaction by providing human-like responses to
complex queries.
3. Context-Aware and Localized Search
Search engines now consider geolocation, device type, and browsing history to deliver
location-specific and personalized results. For example, a search for "restaurants near me"
provides results based on real-time GPS data.
4. Enhanced Information Presentation
Web IR has evolved beyond simple text listings. Search engines now offer rich snippets,
featured answers, knowledge panels, and interactive carousels, improving the way users
consume information without clicking multiple links.
5. Improved Accessibility
Advances in speech-to-text, screen readers, and adaptive interfaces make search engines
more accessible to users with disabilities, ensuring inclusivity in information retrieval.
6. Reduced Search Friction
Features like zero-click searches, smart answers, and AI-driven summarization reduce the
time users spend sifting through multiple pages, providing instant answers within search
results.

8. Discuss applications of link analysis in information retrieval systems beyond web search.

Link analysis is a technique used to examine relationships between interconnected entities in a


network. While it is widely known for its role in web search engines, it has several other
applications in information retrieval across different domains. These applications leverage graph-
based ranking, connectivity patterns, and authority detection to enhance data retrieval,
organization, and analysis.

1. Social Network Analysis


Link analysis is used in social media platforms and professional networks to identify
influencers, detect communities, and recognize fake accounts. It helps in recommending
friends, groups, or content based on user interactions and connectivity patterns.
2. Citation and Academic Research Analysis
Academic databases utilize link analysis to rank research papers based on citations,
measure author impact, and track emerging research trends. It helps researchers find
influential studies and improves the organization of scientific literature.
3. Fraud Detection and Cybersecurity
Financial institutions and cybersecurity systems use link analysis to detect fraudulent
transactions, prevent money laundering, and identify phishing or spam networks. It enables
the recognition of suspicious connections and patterns in transactional data.
4. Recommender Systems
E-commerce, streaming services, and online learning platforms apply link analysis to
suggest products, content, or courses based on user behavior and interaction history. It
enhances personalized recommendations and improves user engagement.
5. Healthcare and Bioinformatics
Medical research and disease tracking systems utilize link analysis to study disease spread
patterns, analyze protein interactions, and optimize patient referral networks. It plays a
crucial role in epidemiology and drug discovery.
6. Intelligence and Crime Investigation
Law enforcement agencies use link analysis to map criminal networks, track online
radicalization, and analyze communication metadata. It assists in identifying key suspects
and their associations in crime investigations.
7. Supply Chain and Logistics Optimization
Link analysis helps optimize delivery routes, manage supplier relationships, and mitigate
risks in supply chains. It improves efficiency in transportation and logistics by identifying
the best paths and connections between suppliers and distributors.

9. Compare and contrast pairwise and listwise learning to rank approaches. Discuss their
advantages and limitations.

Learning to Rank (LTR) is a machine learning technique used in information retrieval systems to
optimize the ranking of documents based on relevance to a given query. LTR methods can be
categorized into three main approaches: pointwise, pairwise, and listwise. While pairwise
approaches focus on ranking relative document pairs, listwise methods optimize the ordering of an
entire list of documents. Understanding the differences between these two approaches helps in
selecting the most suitable ranking model for specific applications.

1. Pairwise Learning to Rank

Pairwise approaches convert the ranking problem into a classification or regression task by
comparing pairs of documents. The model learns a function that determines which document in a
given pair should be ranked higher.

Advantages of Pairwise Approaches

 Simple and efficient: Easier to implement compared to listwise methods.


 Scalable: Works well with large datasets since it only considers pairs instead of full
document lists.
 Effective in ranking optimization: Focuses on improving relative ranking rather than
absolute relevance scores.

Limitations of Pairwise Approaches

 Ignores global ranking order: It does not optimize the ranking of the entire result list, which
may lead to suboptimal rankings.
 Higher computational complexity: Since all possible document pairs are considered, it can
be expensive for large-scale datasets.

Examples of Pairwise Algorithms

 RankSVM: Uses Support Vector Machines (SVMs) to learn ranking functions.


 RankBoost: Applies boosting techniques to refine rankings iteratively.
2. Listwise Learning to Rank

Listwise approaches consider the ranking of an entire list of documents rather than individual
pairs. They directly optimize ranking measures such as NDCG (Normalized Discounted Cumulative
Gain) or MAP (Mean Average Precision).

Advantages of Listwise Approaches

 Optimizes global ranking: Takes into account the entire ranked list, leading to better
ranking accuracy.
 Directly maximizes ranking metrics: Unlike pairwise methods, listwise approaches optimize
metrics like NDCG, which directly correlate with ranking quality.
 Better performance for complex ranking tasks: Provides more accurate rankings, especially
in search engines and recommendation systems.

Limitations of Listwise Approaches

 More computationally intensive: Requires handling the entire list of documents, making it
slower for large-scale datasets.
 Higher data requirements: Needs well-labeled training data, which may not always be
available.
 Difficult to implement: More complex than pairwise methods due to direct optimization of
ranking measures.

Examples of Listwise Algorithms

 LambdaMART: A gradient boosting framework optimized for ranking tasks.


 ListNet: Uses a probability distribution over ranking lists to optimize ordering.

10. Discuss the role of supervised learning techniques in learning to rank and their impact
on search engine result quality.

Learning to Rank (LTR) is a machine learning approach used to optimize the ordering of search
results based on user relevance. Supervised learning techniques play a crucial role in LTR by
leveraging labeled training data to learn ranking models. These techniques help search engines
improve relevance, personalization, and user satisfaction.

1. Role of Supervised Learning in Learning to Rank

Supervised learning techniques in LTR use query-document pairs with relevance labels to train
models that predict ranking scores. These models learn patterns from historical user interactions,
clicks, and content features to improve ranking effectiveness.

Key supervised learning approaches in LTR include:

 Pointwise models: Treat ranking as a regression or classification problem by predicting


individual document relevance scores.
 Pairwise models: Compare pairs of documents to determine which one should rank higher
(e.g., RankSVM, RankBoost).
 Listwise models: Optimize the ranking of entire document lists using metrics like NDCG
(Normalized Discounted Cumulative Gain) (e.g., LambdaMART, ListNet).

Supervised learning algorithms used in LTR include:

 Decision Trees and Gradient Boosting (LambdaMART, XGBoost, LightGBM)


 Neural Networks (DeepRank, BERT-based ranking models)
 Support Vector Machines (RankSVM)

2. Impact on Search Engine Result Quality

Supervised learning techniques significantly enhance search engine performance by improving


relevance, efficiency, and user experience.

1. Improved Search Result Relevance


o Trained models learn to rank documents based on context, query intent, and
historical user behavior.
o Reduces irrelevant results, increasing user engagement and satisfaction.
2. Personalized Search Results
o Machine learning models adapt rankings based on user preferences, location, and
browsing history.
o Personalized search enhances the click-through rate (CTR) and search experience.
3. Better Handling of Ambiguous Queries
o Supervised learning helps distinguish between different user intents by analyzing
past interactions.
o Improves results for queries with multiple interpretations (e.g., "Apple" as a fruit or
a company).
4. Optimizing for User Engagement Metrics
o LTR models incorporate click-through rates, dwell time, and bounce rates to refine
rankings.
o Reduces reliance on keyword matching alone, improving search relevance.
5. Adaptability to Changing Trends
o Machine learning models can update dynamically as new trends and user behaviors
emerge.
o Ensures search engines remain relevant in rapidly evolving domains (e.g., news
search, e-commerce).
6. Handling Noisy and Large-Scale Data
o Supervised learning techniques efficiently rank results in big data environments.
o Deep learning models (BERT, RankNet) help extract meaning from complex,
unstructured text.

11. How does supervised learning for ranking differ from traditional relevance feedback
methods in Information Retrieval? Discuss their respective advantages and limitations.

Information Retrieval (IR) systems aim to rank documents based on relevance to user queries.
Supervised learning for ranking and traditional relevance feedback are two techniques used to
improve ranking quality. While supervised learning leverages labeled training data to learn ranking
models, relevance feedback relies on user-provided relevance judgments to refine query results
dynamically.
1. Supervised Learning for Ranking

Supervised Learning to Rank (LTR) uses labeled training data to train machine learning models that
optimize the ranking of search results.

How It Works:

 A dataset is created with query-document pairs labeled with relevance scores (e.g., click-
through rates, user feedback, editorial judgments).
 The model is trained using pointwise, pairwise, or listwise approaches to predict ranking
scores.
 The trained model is applied to new search queries to generate optimized ranked results.
 Common algorithms include RankSVM, LambdaMART, and neural networks like BERT-
based ranking models.

Advantages

 Automates ranking improvements at scale.


 Provides personalized search results based on user behavior.
 Learns complex ranking patterns from large datasets.
 Optimizes directly for ranking metrics like NDCG and MAP.

Disadvantages

 Requires large labeled datasets, which can be costly to create.


 Computationally expensive, especially with deep learning models.
 Some models (e.g., neural networks) lack interpretability.

2. Traditional Relevance Feedback Methods

Relevance feedback is an older, manual approach where users provide feedback on retrieved
documents, which the system uses to refine search results.

How It Works:

 The user marks some search results as relevant or non-relevant after an initial search.
 The system reweights query terms based on user feedback (e.g., the Rocchio algorithm or Pseudo-Relevance Feedback); a short sketch of the Rocchio update follows this list.
 The updated query retrieves a new set of improved results.
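
To make the Rocchio reweighting concrete, here is a minimal sketch in Python. It assumes the query and documents are already represented as TF-IDF vectors (NumPy arrays); the alpha/beta/gamma weights shown are commonly cited defaults, not values prescribed by this text.

import numpy as np

def rocchio_update(query_vec, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    # Return a reweighted query vector from user relevance feedback.
    # query_vec        : TF-IDF vector of the original query
    # relevant_docs    : list of TF-IDF vectors marked relevant by the user
    # nonrelevant_docs : list of TF-IDF vectors marked non-relevant
    new_query = alpha * query_vec
    if relevant_docs:
        new_query += beta * np.mean(relevant_docs, axis=0)
    if nonrelevant_docs:
        new_query -= gamma * np.mean(nonrelevant_docs, axis=0)
    # Negative term weights are usually clipped to zero before re-querying.
    return np.maximum(new_query, 0)

# Toy example over a 4-term vocabulary
q = np.array([1.0, 0.0, 0.5, 0.0])
rel = [np.array([0.8, 0.1, 0.6, 0.0])]
nonrel = [np.array([0.0, 0.9, 0.0, 0.2])]
print(rocchio_update(q, rel, nonrel))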

Advantages

 Improves search accuracy by refining query results based on user feedback.


 Does not require extensive labeled training data.
 Computationally efficient and easy to implement.
 Useful in specialized domains where expert feedback enhances retrieval.
Disadvantages

 Requires active user participation, making it less scalable.


 Only refines query expansion and does not improve ranking models.
 Limited generalization, as feedback only benefits individual queries.
 Prone to errors if users provide incorrect relevance feedback.

12. Describe the process of feature selection and extraction in learning to rank. What are the
key features used to train ranking models, and how are they selected or engineered?

1. Introduction

Feature selection and extraction are critical steps in Learning to Rank (LTR), where machine
learning models are trained to optimize the ranking of documents based on query relevance.
Features represent various aspects of the query-document relationship and are used to predict
relevance scores. Proper selection and engineering of features improve ranking model
performance, reduce overfitting, and enhance computational efficiency.

2. Feature Selection and Extraction Process

Feature selection and extraction in LTR involve multiple steps to identify and refine the most
informative features for ranking models.

1. Feature Identification – Potential features are collected from query, document, and user
interaction data.
2. Feature Engineering – Raw features are transformed into meaningful inputs using
techniques like normalization, scaling, and encoding.
3. Feature Selection – Redundant or irrelevant features are removed using statistical and
machine learning techniques such as mutual information, correlation analysis, or feature
importance scores.
4. Feature Evaluation – Selected features are validated using cross-validation or ranking
metrics like NDCG (Normalized Discounted Cumulative Gain) and MAP (Mean Average
Precision).

3. Key Features in Learning to Rank Models

LTR models use three main types of features: query-dependent features, document-specific
features, and query-document interaction features.

Query-Dependent Features

These features capture properties of the query that influence ranking.

 Query length – Number of words in the query.


 Query term frequency – Frequency of query terms in the dataset.
 Query specificity – Measures how rare or common the query terms are.

Document-Specific Features
These features describe document characteristics independent of the query.

 PageRank – Measures the importance of a webpage based on incoming links.


 Document length – Number of words or tokens in the document.
 TF-IDF score – Importance of terms based on their frequency in the document relative to
the entire corpus.

Query-Document Interaction Features

These features represent how well a document matches a query.

 BM25 score – Measures term relevance based on frequency and document length.
 Cosine similarity – Measures similarity between query and document vector
representations.
 Word embedding similarity – Uses models like Word2Vec or BERT to compute semantic
similarity.
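
As a rough illustration of how such features could be assembled into a single training vector, the sketch below combines query length, document length, and TF-IDF cosine similarity. It assumes scikit-learn is available; the function name and the choice of these three features are illustrative, not a standard feature set.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def query_document_features(query, document, corpus_vectorizer):
    # Build a small feature vector for one (query, document) pair.
    q_vec = corpus_vectorizer.transform([query])
    d_vec = corpus_vectorizer.transform([document])
    return [
        len(query.split()),                     # query length
        len(document.split()),                  # document length
        cosine_similarity(q_vec, d_vec)[0, 0],  # TF-IDF cosine similarity
    ]

corpus = ["information retrieval ranks documents",
          "learning to rank uses query document features"]
vectorizer = TfidfVectorizer().fit(corpus)
print(query_document_features("learning to rank", corpus[1], vectorizer))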

4. Feature Selection Techniques

To improve model efficiency, irrelevant or redundant features are removed using selection
methods.

 Filter Methods – Rank features based on statistical scores (e.g., mutual information, chi-
square test).
 Wrapper Methods – Use machine learning models to evaluate feature subsets (e.g.,
recursive feature elimination).
 Embedded Methods – Select features during model training using techniques like L1
regularization (Lasso) or tree-based feature importance.

13. Describe web graph representation in link analysis. How are web pages and hyperlinks
represented in a web graph OR Explain how web graphs are represented in link analysis.
Discuss the concepts of nodes, edges, and directed graphs in the context of web pages and
hyperlinks.

1. Introduction

Web graph representation is a fundamental concept in link analysis, where web pages and
hyperlinks are modeled as a directed graph. This representation helps in analyzing the structure of
the web, identifying important pages, and improving search engine ranking algorithms like
PageRank and HITS.

2. Nodes, Edges, and Directed Graphs in Web Graphs

1. Nodes (Vertices) represent web pages. Each webpage is treated as a unique node in the
graph.
2. Edges (Links) represent hyperlinks between web pages. If page A contains a hyperlink to
page B, a directed edge is drawn from node A to node B.
3. Directed Graph is formed since hyperlinks have a direction, meaning a link from page A to
page B does not imply a link back from B to A.
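
A directed web graph of this kind is typically stored as an adjacency list (or an adjacency matrix). A minimal sketch using a plain Python dictionary for an illustrative four-page graph:

# Adjacency list: each node (page) maps to the pages it links to.
web_graph = {
    "A": ["B", "C"],   # page A links to B and C
    "B": ["C"],
    "C": ["A", "D"],
    "D": ["A"],
}

# Out-degree: number of outgoing links from a page.
out_degree = {page: len(links) for page, links in web_graph.items()}

# In-degree: number of incoming links to a page.
in_degree = {page: 0 for page in web_graph}
for links in web_graph.values():
    for target in links:
        in_degree[target] += 1

print(out_degree)   # {'A': 2, 'B': 1, 'C': 2, 'D': 1}
print(in_degree)    # {'A': 2, 'B': 1, 'C': 2, 'D': 1}
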
3. Properties of Web Graphs

 Scale-Free Nature – The web graph follows a power-law distribution where a few web
pages (hubs) have a significantly higher number of links.
 Connectivity – Some pages have high in-degree (many incoming links) and act as
authoritative sources, while others have high out-degree (many outgoing links) and act as
hubs.
 Clusters and Communities – The web graph naturally forms topic-specific clusters where
pages on similar topics are more densely connected.

4. Applications in Link Analysis

 PageRank Algorithm – Uses link structure to assign importance to web pages based on
incoming links.
 HITS Algorithm – Identifies hubs (pages linking to many relevant pages) and authorities
(pages that are linked to by many hubs).
 Web Crawling and Indexing – Helps search engines efficiently explore and rank web pages.
 Spam Detection – Identifies unnatural link patterns used for ranking manipulation.

14. Discuss the difference between the PageRank and HITS algorithms.

Comparison of PageRank and HITS Algorithms

 Ranking Basis – PageRank: importance is based on the overall link structure of the web. HITS: importance is based on hub and authority relationships.
 Query Dependency – PageRank: query-independent (precomputed for all pages). HITS: query-dependent (computed for a subset of pages retrieved for a query).
 Computation Scope – PageRank: computed over the entire web graph. HITS: computed only on a subgraph relevant to a specific query.
 Working Principle – PageRank: assigns rank to pages based on the ranks of linking pages using the random surfer model. HITS: assigns two scores, an Authority Score (importance of a page based on incoming links) and a Hub Score (importance of a page based on outgoing links to authorities).
 Algorithm Iteration – PageRank: iteratively distributes ranks among linked pages until convergence. HITS: iteratively updates hub and authority scores until convergence.
 Spam Resistance – PageRank: more resistant to link spam since highly linked authoritative pages retain higher ranks. HITS: more vulnerable to manipulation via artificial hub creation.
 Focus – PageRank: provides a global importance ranking of web pages. HITS: provides topic-specific rankings for relevant pages.
 Use Case – PageRank: used in general web search ranking by search engines like Google. HITS: used for topic-sensitive ranking, useful for specific information retrieval tasks.
15. Discuss future directions and emerging trends in link analysis and its role in modern IR
systems. OR Discuss how link analysis can be used in social network analysis and
recommendation systems.

1. Future Directions and Emerging Trends in Link Analysis

Link analysis is evolving with advancements in artificial intelligence, graph theory, and information
retrieval (IR). One of the key trends is the integration of Graph Neural Networks (GNNs), which
enhance traditional link analysis by learning complex relationships in large-scale graphs. Another
emerging area is dynamic and temporal link analysis, which considers evolving web structures and
social networks over time, rather than treating them as static entities.

Personalization is becoming a crucial aspect of link analysis, where algorithms adapt to user
preferences and behaviors, improving search rankings and recommendations. Hybrid approaches
that combine link-based ranking methods like PageRank with deep learning and NLP techniques,
such as BERT, are also gaining prominence, leading to more relevant search results.

Another growing concern is misinformation detection on the web and social media. Link analysis is
being used to identify coordinated fake news networks and track the spread of misleading content.
Additionally, privacy concerns are driving research into decentralized and privacy-preserving link
analysis, where federated learning methods allow analysis without compromising user data.

2. Link Analysis in Social Network Analysis and Recommendation Systems

In social network analysis, link analysis plays a fundamental role in identifying influential users,
detecting communities, and understanding information flow. Centrality measures like betweenness
and eigenvector centrality help in ranking key individuals within a network. Platforms like
Facebook and LinkedIn utilize link prediction to recommend new friends or professional
connections based on shared links and mutual interests.

In recommendation systems, link analysis enhances content discovery by analyzing user-item interaction graphs. Streaming services such as Netflix and YouTube leverage link-based techniques
to suggest personalized content, improving user engagement. Another important application is
anomaly detection, where link structures are analyzed to identify fraudulent activities, fake
accounts, or bot networks.

Beyond social and commercial applications, link analysis is used in epidemic modeling, helping
researchers understand how diseases or information spreads through networks. This is particularly
relevant in public health strategies and crisis management.

16. How do link analysis algorithms contribute to combating web spam and improving
search engine relevance?

How Link Analysis Algorithms Combat Web Spam and Improve Search Engine Relevance

1. Identifying Link Manipulation and Spam Networks


Link analysis algorithms detect unnatural link patterns, such as excessive reciprocal links,
link farms, and artificially inflated inbound links. This helps search engines filter out
spammy and low-quality pages from rankings.
2. PageRank and Trust-Based Ranking
Algorithms like PageRank distribute ranking power based on the quality of incoming links.
By prioritizing links from authoritative sources and penalizing suspicious links, search
engines reduce the impact of link spam tactics like paid links and private blog networks
(PBNs).
3. HITS Algorithm and Link Authority Differentiation
The HITS (Hyperlink-Induced Topic Search) algorithm differentiates between hubs (pages
with many outbound links) and authorities (highly referenced pages). This distinction helps
search engines demote spam hubs that artificially link to unrelated or low-quality content.
4. Spam Detection Through Link Neighborhoods
Search engines analyze link neighborhoods—the set of linked pages around a particular
website. If a page is frequently linked by known spam sources, it is likely to be spam itself
and may receive a lower ranking or deindexing.
5. TrustRank Algorithm for Filtering Spam
TrustRank assigns high trust to manually vetted authoritative sites and propagates this
trust through outbound links. Pages linked by trusted sources gain credibility, while those
in spam-heavy networks are penalized, reducing the effectiveness of spam techniques.
6. Machine Learning in Link Spam Detection
Advanced machine learning models combined with link analysis detect evolving spam
tactics. Features such as link velocity (rate of acquiring links), anchor text diversity, and
backlink quality help classify spam pages versus genuine content.
7. Graph-Based Anomaly Detection
Search engines use graph-based methods to detect anomalies in link structures, such as
unnatural clusters of interlinked pages or sudden spikes in backlinks. This allows them to
flag and neutralize spammy SEO tactics.
8. Combatting Cloaking and Hidden Links
Some spammers use cloaking (showing different content to search engines and users) or
hidden links (placing invisible links on a page). Link analysis helps detect unusual link
placements and penalizes such practices.

17. Consider a simplified web graph with the following link structure:
Page A has links to pages B, C, and D.
Page B has links to pages C and E.
Page C has links to pages A and D.
Page D has a link to page E.
Page E has a link to page A.
Using the initial authority and hub scores of 1 for all pages, calculate the authority and hub
scores for each page after one/two iteration(s) of the HITS algorithm.

The HITS (Hyperlink-Induced Topic Search) algorithm assigns each page two scores:

 Authority Score: A page’s importance based on the number of high-quality hubs linking to it.
 Hub Score: A page’s importance based on the number of high-quality authorities it links to.

Step 1: Initial Scores

All pages start with an initial authority and hub score of 1.


Page Authority Score (Initial) Hub Score (Initial)
A 1 1
B 1 1
C 1 1
D 1 1
E 1 1

Step 2: First Iteration

Authority Update

The new authority score of a page is the sum of the hub scores of all pages linking to it.

 A’s authority score = Hub(C) + Hub(E) = 1 + 1 = 2
 B’s authority score = Hub(A) = 1
 C’s authority score = Hub(A) + Hub(B) = 1 + 1 = 2
 D’s authority score = Hub(A) + Hub(C) = 1 + 1 = 2
 E’s authority score = Hub(B) + Hub(D) = 1 + 1 = 2

Hub Update

The new hub score of a page is the sum of the (updated) authority scores of all pages it links to.

 A’s hub score = Authority(B) + Authority(C) + Authority(D) = 1 + 2 + 2 = 5
 B’s hub score = Authority(C) + Authority(E) = 2 + 2 = 4
 C’s hub score = Authority(A) + Authority(D) = 2 + 2 = 4
 D’s hub score = Authority(E) = 2
 E’s hub score = Authority(A) = 2

Page Authority Score (After 1st Iteration) Hub Score (After 1st Iteration)
A 2 5
B 1 4
C 2 4
D 2 2
E 2 2

Step 3: Second Iteration

Authority Update

 A’s authority score = Hub(C) + Hub(E) = 4 + 2 = 6
 B’s authority score = Hub(A) = 5
 C’s authority score = Hub(A) + Hub(B) = 5 + 4 = 9
 D’s authority score = Hub(A) + Hub(C) = 5 + 4 = 9
 E’s authority score = Hub(B) + Hub(D) = 4 + 2 = 6

Hub Update

 A’s hub score = Authority(B) + Authority(C) + Authority(D) = 5 + 9 + 9 = 23
 B’s hub score = Authority(C) + Authority(E) = 9 + 6 = 15
 C’s hub score = Authority(A) + Authority(D) = 6 + 9 = 15
 D’s hub score = Authority(E) = 6
 E’s hub score = Authority(A) = 6

Page Authority Score (After 2nd Iteration) Hub Score (After 2nd Iteration)
A 6 23
B 5 15
C 9 15
D 9 6
E 6 6
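
The hand calculation above can be reproduced with a short, unnormalised HITS sketch in Python (real implementations normalise the scores after every iteration; normalisation is omitted here so the numbers match the worked example):

def hits(graph, iterations):
    # Unnormalised HITS: graph maps each page to the pages it links to.
    auth = {p: 1.0 for p in graph}
    hub = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # Authority of p = sum of hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
        # Hub of p = sum of the updated authority scores of pages p links to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
    return auth, hub

graph = {"A": ["B", "C", "D"], "B": ["C", "E"],
         "C": ["A", "D"], "D": ["E"], "E": ["A"]}
print(hits(graph, 2))
# Authorities: A=6, B=5, C=9, D=9, E=6 ; Hubs: A=23, B=15, C=15, D=6, E=6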

18. Consider a web graph with the following link structure:


Page A has links to pages B and C.
Page B has a link to page C.
Page C has links to pages A and D.
Page D has a link to page A.
Perform two iterations of the HITS algorithm to calculate the authority and hub scores for
each page. Assume the initial authority and hub scores are both 1 for all pages.

Step 1: Given Web Graph Link Structure

 A → B, C
 B→C
 C → A, D
 D→A

Each page starts with an initial authority score and hub score of 1.

Iteration 1

Authority Score Update

The authority score of a page is the sum of the hub scores of the pages linking to it.

 A's authority score = Hub(C) + Hub(D) = 1 + 1 = 2


 B's authority score = Hub(A) = 1
 C's authority score = Hub(A) + Hub(B) = 1 + 1 = 2
 D's authority score = Hub(C) = 1
Hub Score Update

The hub score of a page is the sum of the authority scores of the pages it links to.

 A's hub score = Authority(B) + Authority(C) = 1 + 2 = 3


 B's hub score = Authority(C) = 2
 C's hub score = Authority(A) + Authority(D) = 2 + 1 = 3
 D's hub score = Authority(A) = 2

Page Authority Score (After 1st Iteration) Hub Score (After 1st Iteration)
A 2 3
B 1 2
C 2 3
D 1 2

Iteration 2

Authority Score Update

 A's authority score = Hub(C) + Hub(D) = 3 + 2 = 5


 B's authority score = Hub(A) = 3
 C's authority score = Hub(A) + Hub(B) = 3 + 2 = 5
 D's authority score = Hub(C) = 3

Hub Score Update

 A's hub score = Authority(B) + Authority(C) = 3 + 5 = 8


 B's hub score = Authority(C) = 5
 C's hub score = Authority(A) + Authority(D) = 5 + 3 = 8
 D's hub score = Authority(A) = 5

Page Authority Score (After 2nd Iteration) Hub Score (After 2nd Iteration)
A 5 8
B 3 5
C 5 8
D 3 5

19. Given the following link structure:


Page A has links to pages B and C.
Page B has a link to page D.
Page C has links to pages B and D.
Page D has links to pages A and C.
Using the initial authority and hub scores of 1 for all pages, calculate the authority and hub
scores for each page after one iteration of the HITS algorithm.
Step 1: Given Web Graph Link Structure

 A → B, C
 B→D
 C → B, D
 D → A, C

Each page starts with an initial authority score and hub score of 1.

Step 2: First Iteration of HITS Algorithm

Authority Score Update

The authority score of a page is the sum of the hub scores of the pages linking to it.

 A's authority score = Hub(D) = 1


 B's authority score = Hub(A) + Hub(C) = 1 + 1 = 2
 C's authority score = Hub(A) + Hub(D) = 1 + 1 = 2
 D's authority score = Hub(B) + Hub(C) = 1 + 1 = 2

Hub Score Update

The hub score of a page is the sum of the authority scores of the pages it links to.

 A's hub score = Authority(B) + Authority(C) = 2 + 2 = 4


 B's hub score = Authority(D) = 2
 C's hub score = Authority(B) + Authority(D) = 2 + 2 = 4
 D's hub score = Authority(A) + Authority(C) = 1 + 2 = 3

Results After One Iteration

Page Authority Score (After 1st Iteration) Hub Score (After 1st Iteration)
A 1 4
B 2 2
C 2 4
D 2 3

20. Consider a web graph with the following link structure:


Page A has links to pages B and C.
Page B has links to pages C and D.
Page C has links to pages A and D.
Page D has a link to page B.
Perform two iterations of the HITS algorithm to calculate the authority and hub scores for
each page. Assume the initial authority and hub scores are both 1 for all pages.
Step 1: Given Web Graph Link Structure

 A → B, C
 B → C, D
 C → A, D
 D→B

Each page starts with an initial authority score and hub score of 1.

Step 2: First Iteration of HITS Algorithm

Authority Score Update

The authority score of a page is the sum of the hub scores of the pages linking to it.

 A's authority score = Hub(C) = 1


 B's authority score = Hub(A) + Hub(D) = 1 + 1 = 2
 C's authority score = Hub(A) + Hub(B) = 1 + 1 = 2
 D's authority score = Hub(B) + Hub(C) = 1 + 1 = 2

Hub Score Update

The hub score of a page is the sum of the authority scores of the pages it links to.

 A's hub score = Authority(B) + Authority(C) = 2 + 2 = 4


 B's hub score = Authority(C) + Authority(D) = 2 + 2 = 4
 C's hub score = Authority(A) + Authority(D) = 1 + 2 = 3
 D's hub score = Authority(B) = 2

Page Authority Score (After 1st Iteration) Hub Score (After 1st Iteration)
A 1 4
B 2 4
C 2 3
D 2 2

Step 3: Second Iteration of HITS Algorithm

Authority Score Update

 A's authority score = Hub(C) = 3


 B's authority score = Hub(A) + Hub(D) = 4 + 2 = 6
 C's authority score = Hub(A) + Hub(B) = 4 + 4 = 8
 D's authority score = Hub(B) + Hub(C) = 4 + 3 = 7
Hub Score Update

 A's hub score = Authority(B) + Authority(C) = 6 + 8 = 14


 B's hub score = Authority(C) + Authority(D) = 8 + 7 = 15
 C's hub score = Authority(A) + Authority(D) = 3 + 7 = 10
 D's hub score = Authority(B) = 6

Page Authority Score (After 2nd Iteration) Hub Score (After 2nd Iteration)
A 3 14
B 6 15
C 8 10
D 7 6
Unit III

1. How do web crawlers handle dynamic web content during crawling? Explain techniques
such as AJAX crawling, HTML parsing, URL normalization and session handling for
dynamic content extraction. Explain the challenges associated with handling dynamic
web content during crawling.

Web crawlers primarily extract static HTML content, but dynamic web pages—generated using
JavaScript, AJAX, or user interactions—pose significant challenges. Crawlers use specialized
techniques to handle such content effectively.

Techniques for Crawling Dynamic Web Content

1. AJAX Crawling
AJAX (Asynchronous JavaScript and XML) allows web pages to load data dynamically
without refreshing. Crawlers handle AJAX content using:
o Headless Browsers (e.g., Puppeteer, Selenium) to render JavaScript-driven content.
o API Simulation to interact with backend services directly.
o Pre-rendering Services that generate static HTML snapshots of dynamic content.
2. HTML Parsing
Crawlers parse and analyze HTML using tools like BeautifulSoup and lxml. Dynamic
elements can be extracted by:
o Identifying hidden data within meta tags, JSON-LD, or inline JavaScript.
o Extracting structured data (e.g., schema.org) embedded in the page.
o Following iframes and embedded resources for additional content.
3. URL Normalization (see the sketch after this list)
Many dynamic sites generate multiple URLs for the same content due to parameters (e.g.,
session IDs, tracking codes). Normalization helps avoid duplicate crawling by:
o Removing unnecessary query parameters.
o Converting relative URLs to absolute URLs.
o Ensuring consistent URL case sensitivity and structure.
4. Session Handling and Cookies
Websites often require user authentication or store state information using cookies and
session IDs. Crawlers handle this by:
o Managing cookies to maintain session continuity.
o Using authentication mechanisms (OAuth, tokens, or login automation) to access
restricted content.
o Avoiding session-based URLs to prevent duplicate crawling.
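
A minimal sketch of the URL normalization step (point 3 above) using Python's urllib.parse; the list of tracking/session parameters is illustrative, not exhaustive:

from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode, urljoin

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}  # illustrative

def normalize_url(url, base=None):
    # Normalize a URL before adding it to the crawl frontier.
    if base:
        url = urljoin(base, url)            # relative -> absolute
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in TRACKING_PARAMS]  # drop tracking/session params
    return urlunparse((
        parts.scheme.lower(),               # consistent case for scheme
        parts.netloc.lower(),               # and host
        parts.path or "/",
        parts.params,
        urlencode(sorted(query)),           # stable parameter order
        "",                                 # drop fragments (#...)
    ))

print(normalize_url("HTTP://Example.com/shop?utm_source=ad&id=42"))
# http://example.com/shop?id=42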

Challenges in Crawling Dynamic Web Content

1. JavaScript Execution
Many modern websites rely on JavaScript for rendering content, requiring headless
browsers or JavaScript-enabled crawlers, which increase crawling overhead.
2. Anti-Crawling Mechanisms
Websites implement CAPTCHAs, IP blocking, or bot detection techniques (e.g., behavior
analysis, fingerprinting) to restrict automated crawling.
3. Infinite Scrolling and Pagination
Content loading dynamically via infinite scrolling makes it difficult for crawlers to
determine when to stop fetching data. Special strategies (e.g., simulating user scrolling) are
required.
4. Dynamic URL Generation
Some websites generate random or temporary URLs that change with every session, making
it hard to track and normalize them.
5. Data Consistency and Freshness
Since dynamic content updates frequently, crawlers must decide how often to revisit pages
without overwhelming the server.
6. Ethical and Legal Considerations
Crawling dynamic content may violate a site's robots.txt rules, terms of service, or data
privacy policies. Crawlers must ensure compliance to avoid legal risks.

2. Describe the role of AJAX crawling scheme and the use of sitemaps in crawling dynamic
web content. Provide examples of how these techniques are implemented in practice.

Modern web applications rely on AJAX (Asynchronous JavaScript and XML) to load content
dynamically. Traditional crawlers struggle to extract such content since it isn't present in the initial
HTML source. The AJAX crawling scheme and sitemaps help address this challenge.

1. AJAX Crawling Scheme

AJAX allows web pages to fetch data asynchronously without refreshing, making crawling more
complex. To handle this, crawlers use:

a. Pre-rendering (Snapshot-based Crawling)

 Websites can provide pre-rendered static HTML versions of AJAX-driven pages.


 Search engines like Google use the Chrome Rendering Engine to process JavaScript and
extract dynamic content.
 Example: A web page using Vue.js loads product details dynamically. A pre-rendering
service like Rendertron generates a static HTML version for crawlers.

b. Headless Browsers

 Crawlers use headless browsers (e.g., Puppeteer, Selenium) to simulate user interactions
and load JavaScript-rendered content.
 Example: A news website loads articles via AJAX. A Selenium-based crawler can wait for
content to load and extract data.

c. API Calls Simulation

 Some websites expose AJAX-loaded content through public APIs.


 Crawlers can fetch content directly using REST API calls, bypassing JavaScript execution.
 Example: A stock market site updates prices via AJAX. A crawler fetches real-time data from
an API instead of parsing the web page.
2. Role of Sitemaps in Crawling Dynamic Content

Sitemaps provide structured lists of URLs to help search engines discover and crawl web pages
efficiently.

a. XML Sitemaps

 Websites provide an XML file listing all pages to ensure important URLs are indexed.
 Example: An e-commerce website generates a sitemap.xml file with product pages that load
via AJAX, ensuring crawlers can find them.

b. Dynamic Sitemaps

 Websites with frequently changing content generate sitemaps dynamically.


 Example: A blog platform with user-generated posts updates its sitemap daily with new
articles.

c. URL Hash Fragments Handling

 AJAX-based applications use # fragments (example.com/#page1).


 Search engines ignore fragments, but Google’s escaped fragment (?_escaped_fragment_=)
method was historically used to serve a crawlable version.

Implementation Examples

1. Google AJAX Crawling (Pre-rendering)


1. Google supports JavaScript crawling but recommends pre-rendering for complex
sites.
2. Example: A React-based SPA (Single Page Application) uses Next.js server-side
rendering (SSR) to ensure content is accessible to crawlers.
2. Sitemaps for AJAX-loaded Pages
1. A real estate website loads property listings via AJAX.
2. It generates a sitemap.xml file listing property URLs to ensure indexing.
3. Headless Browser for Crawling
1. A web scraper uses Puppeteer to crawl a news website that loads articles
dynamically.
2. The script waits for AJAX content to load, extracts text, and stores it.

3. Compare and contrast local and global similarity measures for near-duplicate detection.
Provide examples of scenarios where each measure is suitable.

Near-duplicate detection identifies documents that share significant content but may have minor
variations. Local similarity measures focus on small parts of a document, while global similarity
measures analyze the overall content.
1. Local Similarity Measures

Local similarity methods compare small portions (substrings, words, or tokens) of documents to
detect near-duplicates.

Characteristics

 Focus on small overlapping text portions.


 Effective for detecting partial plagiarism or content reuse.
 Robust against minor edits, paraphrasing, and reordered sections.

Examples of Local Similarity Measures

1. Shingling (k-grams)
o Splits text into overlapping k-length substrings (shingles).
o Example: "web crawling techniques" → {web, craw, rawl, awli, ling, ...}
o Uses Jaccard similarity on shingle sets.
2. Locality-Sensitive Hashing (LSH)
o Hashes shingles and compares only similar hash values.
o Used in duplicate web page detection.
3. Edit Distance (Levenshtein Distance)
o Counts the minimum operations (insert, delete, replace) needed to transform one
text into another.
o Useful for typo detection and OCR error correction.
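
A minimal sketch of character shingling combined with Jaccard similarity (k = 4 is an arbitrary illustrative choice):

def shingles(text, k=4):
    # Return the set of overlapping character k-grams (shingles) of a text.
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(set_a, set_b):
    # Jaccard similarity: |intersection| / |union|.
    return len(set_a & set_b) / len(set_a | set_b)

doc1 = "web crawling techniques"
doc2 = "web crawling methods"
print(round(jaccard(shingles(doc1), shingles(doc2)), 2))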

Suitable Scenarios for Local Similarity

 Detecting plagiarized or slightly modified documents.


 Finding paraphrased content in academic writing.
 Identifying similar product descriptions on e-commerce sites.

2. Global Similarity Measures

Global similarity methods compare documents as whole entities, focusing on overall content rather
than specific parts.

Characteristics

 Measure similarity using entire document structure and meaning.


 Suitable for detecting fully rewritten or restructured duplicates.
 Less sensitive to small edits or paraphrasing.

Examples of Global Similarity Measures

1. Cosine Similarity
o Represents documents as TF-IDF vectors and calculates the cosine of the angle
between them.
o Example: Used in news article clustering.
2. Jaccard Similarity (Set-Based)
o Measures overlap between word/token sets.
o Used in document clustering and duplicate webpage detection.
3. Euclidean Distance on Word Embeddings
o Compares semantic similarity using word embeddings (Word2Vec, BERT).
o Example: Used in semantic search engines.

Suitable Scenarios for Global Similarity

 Identifying duplicate or highly similar articles.


 Grouping similar legal documents or patents.
 Finding similar research papers in academic databases.

4. Provide examples of applications where near-duplicate page detection is critical, such as detecting plagiarism and identifying duplicate content in search results.

Near-duplicate detection is crucial in various domains where redundant or slightly modified content can impact performance, credibility, and efficiency. Below are some key applications:

1. Plagiarism Detection

 Academic Integrity: Universities use near-duplicate detection in tools like Turnitin and
Copyscape to identify plagiarized assignments, research papers, and thesis submissions.
 Code Plagiarism: Platforms like Moss (Measure of Software Similarity) detect copied
programming code in student submissions and open-source repositories.

2. Duplicate Content Filtering in Search Engines

 Web Index Optimization: Search engines like Google remove near-duplicate pages to avoid
redundancy and improve search quality.
 Canonicalization: Helps determine the canonical version of a page when multiple URLs
contain similar content (e.g., product descriptions across different retailers).

3. News Aggregation and Clustering

 News Deduplication: Platforms like Google News and Yahoo News group near-duplicate
articles from different publishers to avoid repetition.
 Fake News Detection: Identifies copied or manipulated news articles spreading
misinformation.

4. E-commerce and Product Catalog Management

 Product Deduplication: Online marketplaces like Amazon, eBay, and Flipkart identify
duplicate product listings across different sellers.
 Price Comparison: Ensures accurate price comparisons by removing duplicate product
pages with slight variations.

5. Legal and Patent Document Analysis


 Patent Similarity Detection: Identifies near-duplicate patent filings to prevent intellectual
property conflicts.
 Legal Case Deduplication: Law firms use near-duplicate detection to find similar legal cases
and judgments for research.

6. Web Crawling and Data Scraping

 Efficient Web Crawling: Search engines use SimHash and Locality-Sensitive Hashing (LSH)
to avoid indexing near-duplicate pages, reducing storage costs.
 Content Scraping Detection: Websites detect and block bots scraping content for
unauthorized reproduction.

7. Social Media and Spam Detection

 Duplicate Post Detection: Platforms like Twitter, Facebook, and Reddit identify near-
duplicate posts and filter spam.
 Meme & Fake Review Detection: Detects repeated fake reviews on platforms like Yelp and
Amazon.

8. Biomedical and Scientific Literature Analysis

 Duplicate Research Paper Detection: Databases like PubMed and arXiv filter near-duplicate
scientific publications.
 Clinical Trial Deduplication: Ensures accuracy in medical research by detecting redundant
trial data.

5. Describe common techniques used in extractive text summarization, such as graph-based methods and sentence scoring approaches.

Extractive text summarization selects key sentences from a document to form a concise summary
while preserving the original meaning. Several techniques, including graph-based methods and
sentence scoring approaches, help identify important content.

1. Graph-Based Methods

Graph-based approaches model text as a graph where nodes represent sentences, and edges
capture relationships between them. Ranking algorithms are then used to extract the most
important sentences.

a. TextRank Algorithm

 Based on Google’s PageRank, TextRank constructs a graph where:


o Nodes: Sentences in the document.
o Edges: Weighted by sentence similarity (e.g., cosine similarity between TF-IDF
vectors).
 Sentences with higher centrality (i.e., more connections) are extracted for summarization.
b. LexRank Algorithm

 Similar to TextRank but uses Jaccard or Cosine similarity to connect sentences in the graph.
 Applies Markov chains to compute sentence importance, selecting highly ranked sentences.
 More effective for multi-document summarization.

2. Sentence Scoring Approaches

Sentence scoring techniques assign importance scores to each sentence based on linguistic,
statistical, or machine learning methods.

a. Term Frequency-Inverse Document Frequency (TF-IDF)

 Sentences are ranked based on the frequency of important words (TF) and their rarity
across documents (IDF).
 High TF-IDF scores indicate key sentences for extraction.

b. Position-Based Scoring (Lead-3 Heuristic)

 Assumes earlier sentences in a document are more important.


 Often used in news summarization, where the first three sentences of an article provide a
good summary.

c. Sentence Length and Named Entity Recognition (NER)

 Short sentences may lack sufficient information and are often discarded.
 Sentences containing named entities (e.g., people, places, organizations) are prioritized.

d. Machine Learning-Based Scoring

 Uses supervised learning models (e.g., SVM, Random Forest, Neural Networks) to score
sentences based on:
o Word frequency
o Sentence position
o Semantic similarity
 Trained using human-labeled summaries to learn important patterns.
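
A minimal sketch of the TF-IDF sentence-scoring idea from approach (a), assuming scikit-learn is available; sentence splitting is simplified to a period split for brevity:

from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, num_sentences=2):
    # Score sentences by the sum of their TF-IDF weights and keep the top ones.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1          # total TF-IDF weight per sentence
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(top[:num_sentences])     # preserve original sentence order
    return ". ".join(sentences[i] for i in keep) + "."

text = ("Information retrieval systems index large document collections. "
        "Extractive summarization selects key sentences from a document. "
        "The weather was pleasant yesterday.")
print(extractive_summary(text, num_sentences=2))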

6. Discuss challenges in abstractive text summarization and recent advancements in neural network-based approaches.

Abstractive text summarization involves generating a summary that paraphrases and condenses
the original text, rather than simply extracting key sentences. While this approach provides more
human-like summaries, it presents several challenges.

1. Challenges in Abstractive Summarization

a. Information Loss & Hallucination


 Issue: The model may introduce incorrect or misleading information not present in the
original text.
 Example: A summarization model might generate a false claim about an event when
summarizing news.
 Solution: Reinforcement learning techniques and factual consistency checks help reduce
hallucinations.

b. Lack of Coherence & Fluency

 Issue: Summaries may be grammatically incorrect or lack logical flow, making them hard to
read.
 Example: "The company profits increased, however, declined last year."
 Solution: Pre-trained language models like GPT-4, T5, and BART improve fluency using
context-aware generation.

c. Handling Long Documents

 Issue: Standard models struggle with long documents due to limited context window size
(e.g., 512-1024 tokens in Transformer models).
 Example: Summarizing a 50-page legal document accurately.
 Solution: Longformer, BigBird, and hierarchical attention mechanisms allow models to
process longer inputs efficiently.

d. Domain-Specific Challenges

 Issue: Summarization in specialized fields (medicine, law, finance) requires understanding of technical jargon and structured data.
 Example: Summarizing a medical research paper without losing critical findings.
 Solution: Fine-tuned domain-specific models (e.g., BioBART for medical text, LEGAL-BERT
for law) improve accuracy.

e. Computational Complexity

 Issue: Training large-scale abstractive models requires high computational power and
memory.
 Example: Training a Transformer-based summarization model on limited hardware is
expensive.
 Solution: Optimizations like knowledge distillation, quantization, and pruning reduce model
size while maintaining accuracy.

2. Recent Advancements in Neural Network-Based Approaches

a. Transformer-Based Models

 BART (Bidirectional and Auto-Regressive Transformers):


o Pre-trained on text corruption and denoising tasks.
o Outperforms previous models in news and legal summarization.
 T5 (Text-to-Text Transfer Transformer):
o Treats summarization as a sequence-to-sequence (Seq2Seq) problem.
o Performs well across multiple languages and domains.
 PEGASUS (Pre-training with Gap-Sentences):
o Pre-trained by masking important sentences and learning to predict them.
o Achieves state-of-the-art results in news and scientific paper summarization.

b. Reinforcement Learning for Summarization

 Models like RL-Sum use reinforcement learning to improve factual accuracy and readability.
 Helps reduce hallucinations and redundancy in generated summaries.

c. Few-Shot and Zero-Shot Summarization

 GPT-4 and LLaMA models enable summarization with minimal labeled data.
 Reduces the need for large, task-specific datasets.

d. Multi-Document and Cross-Lingual Summarization

 Multi-document summarization models generate summaries from multiple sources (e.g., Google News aggregation).
 Cross-lingual summarization (e.g., English to Hindi summary generation) improves
accessibility.

7. Discuss common evaluation metrics used to assess the quality of text summaries, such as
ROUGE and BLEU. Explain how these metrics measure the similarity between generated
summaries and reference summaries.

Evaluating the quality of generated summaries is crucial to ensure they are accurate, coherent, and
informative. Several automatic metrics, such as ROUGE and BLEU, measure the similarity between a
generated summary and human-written reference summaries.

1. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

ROUGE is the most widely used metric for summarization, focusing on n-gram recall, precision, and
F1-score between the generated and reference summaries.

How ROUGE Works

ROUGE compares overlapping words or phrases between summaries. Key variants include:

 ROUGE-N (N-gram Overlap): Measures the overlap of n-grams (sequences of words).


o ROUGE-1 (unigram overlap) → Measures word-level similarity.
o ROUGE-2 (bigram overlap) → Captures phrase-level similarity.
o ROUGE-L (Longest Common Subsequence) → Considers sentence structure and
fluency.
Example Calculation:
Reference Summary: "The cat sat on the mat."
Generated Summary: "A cat was sitting on a mat."

 ROUGE-1 = High (many overlapping words).


 ROUGE-2 = Moderate (some phrase overlap).
 ROUGE-L = Good (similar sentence structure).

Strengths:
✔Works well for extractive summarization (word overlap is high).
✔Simple and efficient for large-scale evaluations.

Limitations:
✘Fails to capture paraphrased or semantically similar sentences.
✘Doesn't consider summary coherence or readability.
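
A minimal sketch of ROUGE-1 computed by hand (real evaluations usually rely on a library such as rouge-score and typically add stemming and stopword handling, which are omitted here):

from collections import Counter

def rouge_1(reference, generated):
    # Unigram precision, recall and F1 between a reference and a generated summary.
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    overlap = sum((ref_counts & gen_counts).values())   # clipped unigram matches
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "The cat sat on the mat"
generated = "A cat was sitting on a mat"
print(rouge_1(reference, generated))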

2. BLEU (Bilingual Evaluation Understudy)

Originally designed for machine translation, BLEU is also used for summarization. It measures n-
gram precision rather than recall.

How BLEU Works

 Compares generated summary n-grams to reference summary n-grams.


 Uses brevity penalty to discourage overly short summaries.
 BLEU-1, BLEU-2, etc., correspond to different n-gram lengths.

Example Calculation:
Reference: "The quick brown fox jumps over the lazy dog."
Generated: "A fast brown fox leaps over a sleepy dog."

 BLEU-1 = Moderate (word-level similarity is decent).


 BLEU-2 = Low (bigram overlap is limited).

Strengths:
✔Useful when precision is more important than recall.
✔Good for machine-generated summaries with strict phrase matching.

Limitations:
✘Ignores synonyms and sentence structure.
✘Biased against abstractive summarization, which may use different words.
8. Discuss different approaches for question answering in information retrieval, including
keyword-based, document retrieval, and passage retrieval methods.

Question Answering (QA) in Information Retrieval (IR) focuses on retrieving relevant information
from large text corpora to provide precise answers to user queries. Different approaches include
keyword-based methods, document retrieval, and passage retrieval, each offering distinct
advantages depending on the query complexity and information need.

1. Keyword-Based Question Answering

This approach relies on keyword matching between the user's question and documents in the
collection. It is fast and efficient but lacks semantic understanding.

How It Works

 Uses Boolean retrieval or TF-IDF-based ranking.


 Matches exact keywords in queries with indexed documents.
 Often implemented in early search engines (e.g., traditional IR models like BM25).

Example

Query: "Capital of France?"

 The system searches for documents containing "capital" and "France".


 Retrieves documents where both terms appear but may not directly provide the answer.

Advantages

✔Simple and computationally efficient.


✔Works well for structured databases or FAQs.

Disadvantages

✘Ignores context and word variations (e.g., “capital” vs. “main city”).
✘Struggles with complex or natural language queries.

2. Document Retrieval-Based Question Answering

This approach retrieves full documents relevant to the query using advanced ranking algorithms. It
is commonly used in search engines and digital libraries.

How It Works

 Uses vector space models, BM25, or dense retrieval (e.g., BERT-based models).
 Scores documents based on query-document similarity.
 Returns entire documents rather than specific answers.
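
A sketch of the BM25 scoring formula mentioned above; k1 and b are the usual tuning constants, and the toy corpus is purely illustrative:

import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    # Score one document (a list of tokens) against a query using the BM25 formula.
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)              # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed BM25 IDF
        tf = doc_terms.count(term)
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (tf * (k1 + 1)) / denom
    return score

corpus = [["who", "discovered", "gravity"],
          ["isaac", "newton", "formulated", "the", "theory", "of", "gravity"],
          ["the", "capital", "of", "france", "is", "paris"]]
query = ["discovered", "gravity"]
print([round(bm25_score(query, doc, corpus), 3) for doc in corpus])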

Example

Query: "Who discovered gravity?"

 The system retrieves a document about Isaac Newton’s contributions.


 The user must manually find the exact answer.

Advantages

✔More effective than keyword-based matching.


✔Uses semantic search techniques (e.g., BM25, Word2Vec).
✔Good for broad or exploratory searches.

Disadvantages

✘User must read full documents to extract answers.


✘Can retrieve irrelevant documents if query interpretation is incorrect.

3. Passage Retrieval-Based Question Answering

This approach retrieves specific text passages instead of entire documents, improving precision for
direct answers. It is widely used in modern search engines and AI-powered chatbots.

How It Works

 Uses sentence embeddings (e.g., BERT, T5, Dense Passage Retrieval).


 Identifies the most relevant paragraph or sentence from documents.
 Some systems extract exact answer spans using reading comprehension models.

Example

Query: "Who discovered gravity?"

 Instead of retrieving full articles, the system extracts:


o "Isaac Newton formulated the theory of gravity in the 17th century."

Advantages

✔Provides direct and concise answers.


✔Uses deep learning-based retrieval models for better accuracy.
✔Reduces user effort in finding information.
Disadvantages

✘Requires powerful NLP models (e.g., BERT, T5).


✘Struggles with ambiguous or complex queries.

9. Explain how natural language processing techniques such as Named Entity Recognition
(NER) and semantic parsing contribute to question answering systems.

Natural Language Processing (NLP) techniques such as Named Entity Recognition (NER) and
Semantic Parsing play a crucial role in improving the accuracy and efficiency of Question Answering
(QA) systems. These techniques help in understanding, analyzing, and extracting meaningful
information from text, allowing QA models to provide precise and contextually relevant answers.

1. Named Entity Recognition (NER) in Question Answering

Named Entity Recognition (NER) is an NLP technique that identifies and classifies entities (such as
names of people, places, organizations, dates, and numerical values) in a given text.

How NER Works in QA Systems

1. Entity Extraction: The system scans the input query and extracts important named entities.
2. Entity Type Recognition: The extracted entities are classified into predefined categories
such as Person, Location, Date, Organization, Number, etc.
3. Query Understanding: By identifying key entities, the system understands the context of the
question.
4. Answer Retrieval: The QA model searches for relevant passages containing similar entities
and ranks them based on relevance.

Example

Query: "Who is the president of France?"

 NER extracts: Entity = "France" (Location), "President" (Title)


 The system looks for passages where the entity "France" is linked to a "President".
 Answer: "Emmanuel Macron."
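
A small sketch of the entity-extraction step, assuming spaCy and its small English model (en_core_web_sm) are installed:

import spacy

# One-time setup: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

question = "Who is the president of France?"
doc = nlp(question)

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. France GPE (geopolitical entity)

# The extracted entities can then be used to restrict retrieval to passages
# that mention the same entities.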

Benefits of NER in QA

1. Improves precision by focusing on relevant entities instead of searching through the entire
document.
2. Enhances query understanding, allowing the system to differentiate between similar terms
(e.g., Apple the company vs. apple the fruit).
3. Reduces search space, making the retrieval process faster and more efficient.

2. Semantic Parsing in Question Answering


Semantic parsing is the process of converting natural language questions into structured
representations (such as logical forms, SQL queries, or knowledge graph queries) that machines can
process to generate answers.

How Semantic Parsing Works in QA Systems

1. Syntax Analysis: The input question is broken down into grammatical components (subject,
verb, object).
2. Meaning Representation: The system translates the question into a structured query format,
such as SQL for database retrieval or SPARQL for querying knowledge graphs.
3. Information Retrieval: The structured query is executed against a database, knowledge
graph, or text corpus to fetch relevant information.
4. Answer Generation: The retrieved data is transformed into a human-readable response.

Example

Query: "What is the capital of Germany?"

 The system converts the question into a structured query:


SQL Query: SELECT Capital FROM Countries WHERE Name = "Germany";
SPARQL Query: SELECT ?capital WHERE { ?country rdf:type dbo:Country ; dbo:name
"Germany" ; dbo:capital ?capital . }
 The query fetches the answer "Berlin."

Benefits of Semantic Parsing in QA

1. Handles complex and multi-hop queries by breaking them into sub-questions.


2. Enables interaction with structured data sources like databases and knowledge graphs.
3. Improves accuracy by enforcing grammatical and logical constraints on question
interpretation.

10. Provide examples of question answering systems and evaluate their effectiveness in
providing precise answers.

1. Google Search is a widely used QA system that provides featured snippets and passage-
based answers directly in search results. It uses BERT-based models and dense retrieval
techniques to extract the most relevant passage from indexed documents. It is highly
accurate for factual queries, provides fast retrieval, and is continually updated with fresh
information. However, it may misinterpret complex queries, is biased toward popular
sources, and does not always provide direct answers for ambiguous or multi-step questions.
2. IBM Watson is an advanced QA system used for enterprise solutions, healthcare, and
finance. It processes structured and unstructured data, leveraging deep NLP and knowledge
graphs. It is highly effective for domain-specific queries, uses deep reasoning to provide
context-aware answers, and can analyze both structured and unstructured data. However, it
is computationally expensive, not always user-friendly, and limited to trained domains,
making it less effective in open-ended QA.
3. OpenAI ChatGPT is an AI-powered conversational agent capable of answering both factual
and reasoning-based questions. It uses transformer-based deep learning models and has
been trained on vast amounts of text. It handles complex and open-ended questions,
provides contextual understanding, and generates human-like responses. However, it may
generate incorrect or outdated answers, is prone to hallucination, and lacks precise
citations, making fact-checking difficult.
4. Wolfram Alpha is a computational knowledge engine that answers math, science, and data-
driven questions by computing results rather than searching for text-based answers. It is
highly accurate for mathematical and scientific queries, provides step-by-step solutions, and
generates results from structured data. However, it is limited to structured and factual
queries, does not support conversational interactions, and requires precise query
formulation for accurate results.
5. Siri, Alexa, and Google Assistant are voice-based QA systems integrated into smart
assistants that process spoken queries and return answers using NLP and web search. They
are optimized for voice-based interactions, provide quick response times, and integrate
with smart devices. However, they have limited reasoning capabilities, struggle with multi-
turn complex queries, and may have voice recognition errors that affect accuracy.
6. Different QA systems excel in different scenarios. Google Search is best for quick factual
lookups. IBM Watson is powerful for enterprise and domain-specific tasks. ChatGPT is ideal
for conversational, reasoning-based, and explanatory questions. Wolfram Alpha provides
precise computations for math and science. Voice Assistants are convenient for real-time
voice-based QA.
7. Future advancements in deep learning, knowledge graphs, and hybrid retrieval-generation
models will further improve QA systems, making them more accurate, context-aware, and
interactive.

11. Discuss the challenges associated with question answering, including ambiguity
resolution, answer validation, and handling of incomplete or noisy queries.

1. Ambiguity Resolution – Many questions contain ambiguous terms or phrases that can be
interpreted in multiple ways. For example, "Who is the president?" could refer to different
countries or organizations. QA systems must rely on context-awareness techniques like
Named Entity Recognition (NER) and Word Sense Disambiguation (WSD) to resolve
ambiguity effectively. However, incorrect context identification can lead to misleading
answers.
2. Answer Validation – Even when a system retrieves a relevant response, verifying its
correctness is difficult. Some sources may provide outdated, biased, or inaccurate
information. QA models need fact-checking mechanisms, external knowledge bases, and
citation techniques to ensure reliable answers. Misinformation remains a challenge,
especially when systems generate answers based on incomplete or biased training data.
3. Handling Incomplete Queries – Users often provide vague or underspecified questions, such
as "How long does it take?" without specifying what process they are referring to. QA
systems must infer missing details using historical data or predefined templates. However,
excessive reliance on inference can introduce errors, leading to incorrect or contextually
irrelevant responses.
4. Noisy Queries – User queries may contain typographical errors, grammatical mistakes,
informal language, or inconsistent phrasing. For example, "wht is captl of Frnace?" requires
spell correction and normalization before processing. NLP techniques such as fuzzy
matching, synonym recognition, and language models help correct errors, but highly
unstructured inputs remain challenging.
5. Scalability & Efficiency – Large-scale QA systems must process vast amounts of textual data
in real time while maintaining high accuracy. Indexing, caching, and parallel computing can
improve response time, but balancing computational efficiency with accuracy is difficult. As
data volumes grow, ensuring fast and precise question answering requires continuous
optimization of search and retrieval algorithms.
6. Bias & Fairness – QA systems are prone to bias due to imbalanced training data or biased
sources. If a dataset favors certain viewpoints or demographics, the system may generate
skewed answers. Ensuring fairness requires diverse training data, bias-mitigation
techniques, and transparency in answer generation. However, defining and enforcing
fairness in QA remains an ongoing research challenge.
7. Context Awareness in Conversational QA – Multi-turn question answering, where responses
depend on previous context, presents additional difficulties. For instance, if a user asks,
"What is the weather in Paris?" followed by "And tomorrow?" the system must retain
context. Maintaining coherence across conversations requires memory-augmented
architectures, but errors in context tracking can lead to irrelevant or repetitive answers.
8. Multilingual & Cross-Language QA – Handling questions in multiple languages requires
large multilingual datasets and effective cross-lingual embeddings. Many QA models
perform well in English but struggle with low-resource languages due to limited training
data. Translating queries accurately while preserving meaning is another challenge in cross-
lingual question answering.
9. Contradictory Information Handling – When different sources provide conflicting answers,
selecting the most accurate one is difficult. For example, some websites might report
different historical events or scientific claims. QA systems must assess source credibility,
detect inconsistencies, and provide balanced viewpoints where necessary. However,
automating contradiction detection remains an open problem.
10. Security & Misinformation Risks – Malicious users can manipulate QA systems by injecting
false information into training data or crafting adversarial queries to exploit model
weaknesses. Additionally, QA models trained on internet data may inadvertently generate
harmful or misleading responses. Ensuring robustness against misinformation attacks and
adversarial queries requires enhanced security mechanisms and continuous model
evaluation.

12. Explain how collaborative filtering algorithms such as user-based and item-based
methods work. Discuss techniques to address the cold start problem in collaborative
filtering.

Collaborative filtering (CF) is a recommendation technique that suggests items to users based on
the preferences of similar users or the similarity between items. It is widely used in applications
such as e-commerce, streaming services, and online advertising. There are two main types of
collaborative filtering: user-based and item-based methods.

1. User-Based Collaborative Filtering

User-based collaborative filtering operates on the assumption that users with similar preferences in
the past will have similar preferences in the future. The process involves:

 Finding Similar Users – The system identifies users with similar rating patterns using
similarity measures like cosine similarity, Pearson correlation, or Jaccard similarity.
 Predicting Ratings – Once similar users are identified, the system predicts a user’s rating for
an item based on ratings from similar users.
 Generating Recommendations – The system recommends items with the highest predicted
ratings that the user has not yet interacted with.
For example, if User A and User B have rated multiple movies similarly, and User B has rated a
movie that User A has not seen, the system may recommend that movie to User A.

2. Item-Based Collaborative Filtering

Item-based collaborative filtering focuses on item similarity rather than user similarity. The key
steps include:

 Computing Item Similarities – Items are compared based on how users have rated them.
Similarity measures such as cosine similarity and adjusted cosine similarity are commonly
used.
 Predicting Ratings – The system predicts a user's rating for an item by analyzing their
ratings for similar items.
 Recommending Items – Highly similar items to those a user has interacted with are
recommended.

For instance, if many users who liked "The Lord of the Rings" also liked "The Hobbit," then "The
Hobbit" would be recommended to users who enjoyed "The Lord of the Rings."
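
A minimal sketch of the item-item similarity step on a toy user-item rating matrix, using cosine similarity (the ratings and the treatment of unrated items as 0 are illustrative simplifications):

import numpy as np

# Rows = users, columns = items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],   # user 1
    [4, 5, 1, 0],   # user 2
    [1, 0, 5, 4],   # user 3
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Item-item similarity: compare rating columns.
n_items = ratings.shape[1]
sim = np.array([[cosine(ratings[:, i], ratings[:, j])
                 for j in range(n_items)] for i in range(n_items)])

print(np.round(sim, 2))
# Items 0 and 1, rated similarly by users 1 and 2, come out as highly similar.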

Cold Start Problem in Collaborative Filtering

The cold start problem arises when a recommendation system lacks sufficient data for new users or
new items. Since CF relies on historical interactions, new users or items may not have enough
ratings for meaningful recommendations. Several techniques help mitigate this issue:

1. Hybrid Models – Combining collaborative filtering with content-based filtering allows recommendations based on item attributes when user interaction data is sparse.
2. Bootstrapping with Demographics – Using demographic data (e.g., age, location) to make
initial recommendations before behavioral data is available.
3. Popularity-Based Recommendations – Suggesting popular or highly-rated items as an initial step before enough user-specific data is gathered (a minimal sketch of this fallback follows the list).
4. Active Learning & Onboarding – Asking new users for preferences through surveys or
interactive onboarding to collect early preference data.
5. Cross-Domain Recommendations – Utilizing a user’s preferences from another domain (e.g.,
book purchases to recommend movies) to generate initial suggestions.
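A minimal sketch of the popularity-based fallback mentioned in technique 3 (the interaction log below is hypothetical):

from collections import Counter

# Hypothetical (user, item) interaction log.
interactions = [("u1", "A"), ("u2", "A"), ("u3", "B"), ("u1", "C"), ("u4", "A"), ("u2", "B")]
known_users = {user for user, _ in interactions}

def recommend(user, top_n=2):
    # New users with no history get the globally most popular items;
    # existing users would instead go through the collaborative-filtering path.
    if user not in known_users:
        popularity = Counter(item for _, item in interactions)
        return [item for item, _ in popularity.most_common(top_n)]
    return []  # placeholder for the CF-based path

print(recommend("new_user"))  # ['A', 'B']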

13. Describe content-based filtering approaches, including feature extraction and similarity
measures used in content-based recommendation systems.

Content-based filtering (CBF) is a recommendation approach that suggests items to users based on
the characteristics of items they have previously interacted with. Instead of relying on user
interactions like collaborative filtering, CBF analyzes item attributes and user preferences to
generate recommendations.

1. Working of Content-Based Filtering

Content-based filtering works by comparing the features of items to those of items the user has
previously liked. The process involves:
 Feature Extraction – Identifying and extracting relevant attributes from items (e.g.,
keywords in articles, genre in movies, ingredients in recipes).
 User Profile Construction – Creating a profile based on user preferences by analyzing their
past interactions.
 Similarity Measurement – Comparing the features of items to those preferred by the user to
generate recommendations.

2. Feature Extraction in Content-Based Filtering

Feature extraction is a crucial step where relevant attributes are identified from items. Some
common feature extraction techniques include:

 TF-IDF (Term Frequency-Inverse Document Frequency) – Used in text-based recommendations to extract important words from documents (see the short sketch after this list).
 Bag of Words (BoW) – Represents text documents as a set of words without considering
order but with frequency counts.
 Word Embeddings (Word2Vec, BERT) – Captures semantic relationships between words,
improving recommendation quality.
 Metadata-Based Features – Extracts structured attributes like genre, cast, director (for
movies), or ingredients (for recipes).
 Visual & Audio Features – Uses deep learning models to extract visual (image-based) or
audio (music-based) features.
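As a short illustration of TF-IDF feature extraction, scikit-learn's TfidfVectorizer can turn item descriptions into feature vectors (the descriptions below are made up, and the sketch assumes scikit-learn is installed):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical item descriptions.
descriptions = [
    "epic fantasy adventure with wizards and a dark lord",
    "fantasy adventure about a dragon and a small hero",
    "romantic comedy set in a big city",
]

vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(descriptions)  # sparse matrix: items x vocabulary terms
print(features.shape)
print(vectorizer.get_feature_names_out())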

3. Similarity Measures in Content-Based Filtering

Once features are extracted, similarity measures are used to compare items and generate
recommendations. Some common similarity measures include:

 Cosine Similarity – Measures the cosine of the angle between two feature vectors,
commonly used in text-based and numeric feature comparisons.
 Euclidean Distance – Computes the direct distance between feature vectors, used for
numerical attributes.
 Jaccard Similarity – Measures the similarity between sets, often used for categorical
features like genre or tags.
 Pearson Correlation – Evaluates the linear correlation between features, useful in rating-
based similarity comparisons.
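Two of these measures written out directly, as a small sketch with invented feature values:

import math

def cosine(u, v):
    # Cosine similarity between two equal-length numeric feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def jaccard(a, b):
    # Jaccard similarity between two sets of categorical features (e.g. genre tags).
    return len(a & b) / len(a | b)

print(cosine([1, 0, 2], [2, 1, 2]))                        # numeric feature vectors
print(jaccard({"action", "sci-fi"}, {"action", "drama"}))  # categorical features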

4. Advantages and Limitations of Content-Based Filtering

Advantages:

 Provides personalized recommendations based on user preferences.


 No dependency on other users’ interactions, avoiding cold start problems for users.
 Works well for niche items that lack collaborative filtering data.

Limitations:
 Struggles with new items lacking user interactions (cold start problem for items).
 Over-specialization can occur, where users are only recommended similar items without
diversity.
 Requires high-quality feature extraction for accurate recommendations.

5. Enhancements and Hybrid Approaches

To overcome limitations, CBF is often combined with collaborative filtering in hybrid recommendation systems. Techniques such as content-boosted collaborative filtering and feature augmentation improve performance by integrating both item features and user behavior data.

Content-based filtering remains a strong approach for personalized recommendations, especially when high-quality item attributes are available.

14. Discuss the advantages and limitations of online evaluation methods compared to offline
evaluation methods, such as test collections and user studies.

Evaluating recommendation systems, search engines, or information retrieval models requires both
online and offline evaluation methods. Each approach has its advantages and limitations based on
real-world applicability, scalability, and reliability.

1. Online Evaluation Methods

Online evaluation involves testing models in real-time using live user interactions. This is
commonly done through A/B testing, multi-armed bandits, and click-through rate (CTR) analysis in
a deployed system.
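For example, a click-through-rate comparison in an A/B test reduces to a simple ratio per variant (the counts below are hypothetical):

# Hypothetical impression and click counts collected for two live variants.
variants = {
    "A (current ranker)": {"impressions": 10000, "clicks": 420},
    "B (new ranker)": {"impressions": 10000, "clicks": 465},
}

for name, stats in variants.items():
    ctr = stats["clicks"] / stats["impressions"]
    print(f"{name}: CTR = {ctr:.2%}")
# A full A/B test would also apply a significance test before concluding one variant is better.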

Advantages of Online Evaluation

1. Real-User Feedback – Directly captures how real users interact with recommendations,
ensuring practical relevance.
2. Dynamic Adaptation – Can adjust recommendations dynamically based on evolving user
behavior.
3. Business-Centric Metrics – Allows measuring actual business goals like conversion rates,
revenue impact, and engagement.
4. Scalability – Can handle large-scale evaluations across diverse user bases without artificial
constraints.

Limitations of Online Evaluation

1. Resource-Intensive – Requires deployment in a live system, making it expensive and time-consuming.
2. Risk of Negative Impact – Poor recommendations can lead to a bad user experience,
harming user trust.
3. Delayed Insights – Takes time to gather meaningful user interaction data, making rapid
model iteration difficult.
4. External Factors – User behavior is influenced by seasonality, trends, or external events,
making evaluation noisy.

2. Offline Evaluation Methods

Offline evaluation involves assessing models using test collections, benchmark datasets, and user
studies before deployment. Metrics such as precision, recall, NDCG (Normalized Discounted
Cumulative Gain), and MAP (Mean Average Precision) are commonly used.
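Two of these metrics, precision@k and average precision, can be sketched as follows (the ranking and the relevance judgments are invented):

def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k ranked items that are relevant.
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    # Average of precision@k taken at each rank where a relevant item appears.
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # hypothetical system ranking
relevant = {"d1", "d2"}             # hypothetical relevance judgments
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 2 = 0.5

MAP is then the mean of these average precision values over a set of test queries.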

Advantages of Offline Evaluation

1. Fast and Cost-Effective – Does not require a live system, allowing rapid testing of multiple
models.
2. Controlled Environment – Eliminates external influences, making comparisons more
reliable.
3. Safe Experimentation – Avoids negative impact on actual users, preventing potential
dissatisfaction.
4. Replicability – Can be consistently applied across different models, ensuring fair
comparisons.

Limitations of Offline Evaluation

1. Limited Real-World Relevance – User behavior in live environments may differ from test
collections, reducing practical applicability.
2. Lack of Engagement Metrics – Cannot measure real-world user satisfaction or engagement
directly.
3. Cold Start and Data Biases – May not capture changes in user preferences or new content
effectively.
4. Assumes Static Data – Real-world data evolves, whereas offline evaluations use fixed
datasets, limiting adaptability.

15. Given two sets of shingles representing web pages:


{ "apple", "banana", "orange", "grape" }
{ "apple", "orange", "grape", "kiwi" }
Compute the Jaccard similarity between the two pages using the formula J(A, B) = |A ∩ B| / |A
∪ B|.

The Jaccard similarity between two sets A and B is calculated using the formula:

J(A, B) = |A ∩ B| / |A ∪ B|

Given the sets:


A={"apple","banana","orange","grape"}

B={"apple","orange","grape","kiwi"}
Step 1: Compute Intersection ∣A∩B∣

The common elements between A and B are:

A∩B={"apple","orange","grape"}

So, ∣A∩B∣=3

Step 2: Compute Union ∣A∪B∣

The unique elements in both sets are:

A∪B={"apple","banana","orange","grape","kiwi"}

So, ∣A∪B∣=5

Step 3: Compute Jaccard Similarity

J(A, B) = 3 / 5 = 0.6

Final Answer:

The Jaccard similarity between the two sets is 0.6 (or 60%).
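The same computation takes only a few lines of Python; the small helper below also reproduces the answers to the following questions:

def jaccard(a, b):
    # J(A, B) = |A intersection B| / |A union B| for two Python sets.
    return len(a & b) / len(a | b)

A = {"apple", "banana", "orange", "grape"}
B = {"apple", "orange", "grape", "kiwi"}
print(jaccard(A, B))  # 0.6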

16. Given two sets of tokens representing documents:


Document 1: { "machine", "learning", "algorithm", "data", "science" }
Document 2: { "algorithm", "data", "science", "model", "prediction" }
Compute the Jaccard similarity between the two documents.

The Jaccard similarity between two sets A and B is calculated using the formula:

J(A, B) = |A ∩ B| / |A ∪ B|

where:

 ∣A∩B∣ is the number of common elements between both sets.


 ∣A∪B∣ is the number of unique elements in both sets combined.

Step 1: Define the Sets

Document 1:
A={"machine","learning","algorithm","data","science"}

Document 2:

B={"algorithm","data","science","model","prediction"}

Step 2: Compute Intersection A∩B

The common elements in both sets:

A∩B={"algorithm","data","science"}
∣A∩B∣=3
Step 3: Compute Union A∪B

The unique elements across both sets:

A∪B={"machine","learning","algorithm","data","science","model","prediction"}
∣A∪B∣=7

Step 4: Compute Jaccard Similarity

J(A, B) = |A ∩ B| / |A ∪ B| = 3 / 7 ≈ 0.4286

Final Answer

The Jaccard similarity between the two documents is 0.4286 (or 42.86%).

17. Given two sets of terms representing customer transaction documents:


Transaction Document 1: { "bread", "milk", "eggs", "cheese" }
Transaction Document 2: { "bread", "butter", "milk", "yogurt" }
Compute the Jaccard similarity between the two transaction documents.

The Jaccard similarity between two sets A and B is given by the formula:

J(A, B) = |A ∩ B| / |A ∪ B|

where:

 ∣A∩B∣ is the number of common elements in both sets.


 ∣A∪B∣ is the number of unique elements in both sets combined.
Step 1: Define the Sets

Transaction Document 1:

A={"bread","milk","eggs","cheese"}

Transaction Document 2:

B={"bread","butter","milk","yogurt"}

Step 2: Compute Intersection A∩B

The common elements in both sets:

A∩B={"bread","milk"}
∣A∩B∣=2

Step 3: Compute Union A∪B

The unique elements across both sets:

A∪B={"bread","milk","eggs","cheese","butter","yogurt"}
∣A∪B∣=6

Step 4: Compute Jaccard Similarity

J(A, B) = |A ∩ B| / |A ∪ B| = 2 / 6 ≈ 0.3333

Final Answer

The Jaccard similarity between the two transaction documents is 0.3333 (or 33.33%).

18. Given two sets of features representing product description documents:


Product Document 1: { "smartphone", "camera", "battery", "display" }
Product Document 2: { "smartphone", "camera", "storage", "processor" }
Compute the Jaccard similarity between the features of the two product documents.

The Jaccard similarity between two sets A and B is given by the formula:

J(A, B) = |A ∩ B| / |A ∪ B|
where:

 ∣A∩B∣ is the number of common elements in both sets.


 ∣A∪B∣ is the number of unique elements in both sets combined.

Step 1: Define the Sets

Product Document 1:

A={"smartphone","camera","battery","display"}

Product Document 2:

B={"smartphone","camera","storage","processor"}

Step 2: Compute Intersection A∩B

The common elements in both sets:

A∩B={"smartphone","camera"}
∣A∩B∣=2

Step 3: Compute Union A∪B

The unique elements across both sets:

A∪B={"smartphone","camera","battery","display","storage","processor"}
∣A∪B∣=6

Step 4: Compute Jaccard Similarity

J(A, B) = |A ∩ B| / |A ∪ B| = 2 / 6 ≈ 0.3333

Final Answer

The Jaccard similarity between the two product documents is 0.3333 (or 33.33%).
