
INDEX

SR. NO.  PRACTICAL NAME (DATE OF PERFORM)

1. Document Indexing and Retrieval (11-12-2024)
    Implement an inverted index construction algorithm.
    Build a simple document retrieval system using the constructed index.
2. Retrieval Models (11-12-2024)
    Implement the Boolean retrieval model and process queries.
    Implement the vector space model with TF-IDF weighting and cosine similarity.
3. Spelling Correction in IR Systems (08-01-2025)
    Develop a spelling correction module using edit distance algorithms.
    Integrate the spelling correction module into an information retrieval system.
4. Evaluation Metrics for IR Systems (15-01-2025)
    Calculate precision, recall, and F-measure for a given set of retrieval results.
    Use an evaluation toolkit to measure average precision and other evaluation metrics.
5. Text Categorization (22-01-2025)
    Implement a text classification algorithm (e.g., Naive Bayes or Support Vector Machines).
    Train the classifier on a labelled dataset and evaluate its performance.
6. Clustering for Information Retrieval (05-02-2025)
    Implement a clustering algorithm (e.g., K-means or hierarchical clustering).
    Apply the clustering algorithm to a set of documents and evaluate the clustering results.
7. Web Crawling and Indexing (05-02-2025)
    Develop a web crawler to fetch and index web pages.
    Handle challenges such as robots.txt, dynamic content, and crawling delays.
8. Link Analysis and PageRank (12-02-2025)
    Implement the PageRank algorithm to rank web pages based on link analysis.
    Apply the PageRank algorithm to a small web graph and analyze the results.
9. Learning to Rank (19-02-2025)
    Implement a learning to rank algorithm (e.g., RankSVM or RankBoost).
    Train the ranking model using labelled data and evaluate its effectiveness.
10. Advanced Topics in Information Retrieval (26-02-2025)
    Implement a text summarization algorithm (e.g., extractive or abstractive).
    Build a question-answering system using techniques such as information extraction.

Practical No. 1
Aim: Document Indexing and Retrieval
 Implement an inverted index construction algorithm.
 Build a simple document retrieval system using the constructed index.

Program:
# Define the documents
document1 = "The quick brown fox jumped over the lazy dog."
document2 = "The lazy dog slept in the sun."

# Step 1: Tokenize the documents
# Convert each document to lowercase and split it into words
# (note: this simple whitespace split keeps punctuation, so "dog." and "dog"
# are treated as distinct terms)
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()

# Combine the tokens into a list of unique terms
terms = list(set(tokens1 + tokens2))

# Step 2: Build the inverted index
# Create an empty dictionary to store the inverted index
inverted_index = {}

# For each term, record the documents that contain it
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents

# Step 3: Print the inverted index
for term, documents in inverted_index.items():
    print(term, "->", ", ".join(documents))
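The aim's second bullet asks for a simple retrieval system on top of this index. A minimal lookup sketch (an addition, not part of the original program), assuming the inverted_index built above:

def retrieve(query):
    """AND-style retrieval: return the documents containing every query term."""
    result = None
    for term in query.lower().split():
        docs = set(inverted_index.get(term, []))
        result = docs if result is None else result & docs
    return sorted(result) if result else []

print(retrieve("lazy"))       # ['Document 1', 'Document 2']
print(retrieve("quick fox"))  # ['Document 1']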

Output:

Practical No. 2
Aim: Retrieval Models
 Implement the Boolean retrieval model and process queries.
 Implement the vector space model with TF-IDF weighting and cosine
similarity.

Program:
import pandas
from contextlib import redirect_stdout

terms = []
keys = []
vec_Dic = {}
dicti = {}
dummy_List = []  # scratch list reused for building rows, cleared after each use

def filter(documents, rows, cols):
    for i in range(rows):
        for j in range(cols):
            if j == 0:
                # first column holds the name of the document in the csv file
                keys.append(documents.loc[i].iat[j])
            else:
                dummy_List.append(documents.loc[i].iat[j])
                # add the term to the term list if it is not already present
                if documents.loc[i].iat[j] not in terms:
                    terms.append(documents.loc[i].iat[j])
        # map the document name to its list of terms
        dicti.update({documents.loc[i].iat[0]: dummy_List.copy()})
        dummy_List.clear()

def bool_Representation(dicti, rows, cols):
    terms.sort()
    for i in dicti:
        for j in terms:
            # append 1 if the term occurs in the document, else 0
            if j in dicti[i]:
                dummy_List.append(1)
            else:
                dummy_List.append(0)
        # store the boolean vector for this document
        vec_Dic.update({i: dummy_List.copy()})
        dummy_List.clear()

def query_Vector(query):
    '''Represent the query as a boolean vector over the sorted term list'''
    qvect = []
    for i in terms:
        if i in query:
            qvect.append(1)
        else:
            qvect.append(0)
    return qvect

def prediction(q_Vect):
    '''Rank the documents for the given query by counting matching
    positions between the query vector and each document vector'''
    dictionary = {}
    listi = []
    count = 0
    term_Len = len(terms)  # number of terms in the term list
    for i in vec_Dic:
        for t in range(term_Len):
            if q_Vect[t] == vec_Dic[i][t]:
                count += 1
        dictionary.update({i: count})
        count = 0  # reset the count for the next document
    for i in dictionary:
        listi.append(dictionary[i])
    listi = sorted(listi, reverse=True)
    ans = ''
    with open('output.txt', 'w') as f:
        with redirect_stdout(f):
            print("ranking of the documents")
            for count, i in enumerate(listi):
                key = check(dictionary, i)  # get the key for a known value
                if count == 0:
                    ans = key  # the most relevant document
                print(key, "rank is", count + 1)
                dictionary.pop(key)
            print(ans, "is the most relevant document for the given query")

def check(dictionary, val):
    '''Return the key whose value equals val'''
    for key, value in dictionary.items():
        if val == value:
            return key

def main():
    # read the data from the csv file as a dataframe
    documents = pandas.read_csv(r'D:\deore\documents.csv')
    rows = len(documents)
    cols = len(documents.columns)
    filter(documents, rows, cols)
    bool_Representation(dicti, rows, cols)
    print("Enter query")
    query = input().split(' ')  # split the query into a list of strings
    q_Vect = query_Vector(query)  # boolean vector for the query
    prediction(q_Vect)

main()
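The program above implements only the Boolean model. For the aim's second bullet, a minimal vector space sketch with TF-IDF weighting and cosine similarity, using scikit-learn (an assumed library choice; the corpus and query below are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the quick brown fox jumped over the lazy dog",
    "the lazy dog slept in the sun",
    "information retrieval ranks documents by relevance",
]
query = "lazy dog"

# TF-IDF weighting for documents and query in one shared vocabulary
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_vector, doc_vectors).ravel()
for rank, idx in enumerate(scores.argsort()[::-1], start=1):
    print(f"rank {rank}: document {idx + 1} (score {scores[idx]:.3f})")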

Output:

Practical No. 3
Aim: Spelling Correction in IR Systems
 Develop a spelling correction module using edit distance algorithms.
 Integrate the spelling correction module into an information retrieval
system.
Program:
# Importing necessary libraries
import nltk
from nltk.metrics.distance import edit_distance
from nltk.corpus import words

# Download the 'words' corpus
nltk.download('words')

# List of correct words from the 'words' corpus
correct_words = words.words()

# List of incorrect spellings that need to be corrected
incorrect_words = ['happpy', 'azmaing', 'intelliengt', 'natuer', 'ashy']

# Print the incorrect words
print("Incorrect Words:", incorrect_words)
print("========= Result =========")

# Find the closest correct spelling for each word by edit distance
for word in incorrect_words:
    # Calculate the edit distance between the word and every correct word
    temp = [(edit_distance(word, w), w) for w in correct_words]
    # Print the correct word with the minimum edit distance
    print(f"Incorrect word: {word} => Corrected word: {sorted(temp, key=lambda val: val[0])[0][1]}")
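For the aim's second bullet, one way to integrate the module into an IR system is to correct each query term against the index vocabulary before lookup. A sketch (an addition, reusing edit_distance from above with a hypothetical toy index in the style of Practical 1):

def correct_term(term, vocabulary):
    """Return the vocabulary word with the smallest edit distance to term."""
    return min(vocabulary, key=lambda w: edit_distance(term, w))

def corrected_search(query, inverted_index):
    """Correct each query term against the index vocabulary, then look it up."""
    vocabulary = list(inverted_index.keys())
    results = {}
    for term in query.lower().split():
        corrected = term if term in vocabulary else correct_term(term, vocabulary)
        results[corrected] = inverted_index.get(corrected, [])
    return results

# Example with a toy index:
index = {"lazy": ["Doc1", "Doc2"], "dog": ["Doc2"], "quick": ["Doc1"]}
print(corrected_search("lazzy dgo", index))  # terms are corrected before lookup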

Output:

Practical No. 4
Aim: Evaluation Metrics for IR Systems
 Calculate precision, recall, and F-measure for a given set of retrieval
results.
 Use an evaluation toolkit to measure average precision and other
evaluation metrics.
Program 1:
from sklearn.metrics import precision_score

# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg

# define predictions
pred_pos = [0 for _ in range(10)] + [1 for _ in range(90)]  # 90 positive, 10 negative predictions
pred_neg = [1 for _ in range(30)] + [0 for _ in range(9970)]  # 30 positive, 9970 negative predictions
y_pred = pred_pos + pred_neg

# calculate precision
precision = precision_score(y_true, y_pred, average='binary')
print('Precision: %.3f' % precision)

Output:

Program 2: Recall
from sklearn.metrics import recall_score

# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg

# define predictions
pred_pos = [0 for _ in range(10)] + [1 for _ in range(90)]
pred_neg = [0 for _ in range(10000)]
y_pred = pred_pos + pred_neg

# calculate recall
recall = recall_score(y_true, y_pred, average='binary')
print('Recall: %.3f' % recall)

Program 3: F-measure
from sklearn.metrics import f1_score

# define actual labels
act_pos = [1 for _ in range(100)]  # 100 positive instances
act_neg = [0 for _ in range(10000)]  # 10000 negative instances
y_true = act_pos + act_neg  # combine actual positive and negative labels

# define predictions
pred_pos = [0 for _ in range(5)] + [1 for _ in range(95)]  # 95 positive, 5 negative predictions
pred_neg = [1 for _ in range(55)] + [0 for _ in range(9945)]  # 55 positive, 9945 negative predictions
y_pred = pred_pos + pred_neg  # combine predicted positive and negative labels

# calculate F1 score
score = f1_score(y_true, y_pred, average='binary')

# print the F1 score (F-measure)
print('F-Measure: %.3f' % score)
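For the aim's second bullet, scikit-learn can serve as the evaluation toolkit. A sketch computing average precision from ranked retrieval scores (the relevance judgements and scores below are illustrative):

from sklearn.metrics import average_precision_score, precision_recall_curve

# Relevance judgements (1 = relevant) and retrieval scores for 8 documents
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

# Average precision summarises the precision-recall curve in one number
ap = average_precision_score(y_true, y_scores)
print('Average Precision: %.3f' % ap)

# Precision and recall at each score threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r in zip(precision, recall):
    print('precision=%.3f recall=%.3f' % (p, r))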

Output:

Practical No. 5
Aim: Text Categorization
 Implement a text classification algorithm (e.g., Naive Bayes or Support
Vector Machines).
 Train the classifier on a labelled dataset and evaluate its performance.
Program:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset
texts = [
    "I love this movie, it's amazing!",
    "What a great day, I feel so happy!",
    "This is the worst product I've ever bought.",
    "I'm so sad and disappointed.",
    "Absolutely fantastic experience, will buy again!",
    "Terrible customer service, very unhappy.",
    "I'm excited about this new opportunity!",
    "The food was awful and I got sick."
]

# Corresponding labels (1 for positive, 0 for negative)
labels = [1, 1, 0, 0, 1, 0, 1, 0]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Convert text data into feature vectors
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_vectors, y_train)

# Make predictions
y_pred = classifier.predict(X_test_vectors)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)

Output:

Practical No. 6
Aim: Clustering for Information Retrieval
 Implement a clustering algorithm (e.g., K-means or hierarchical
clustering).
 Apply the clustering algorithm to a set of documents and evaluate the
clustering results.
Program:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

dataset = pd.read_csv("D:/Amruta/Mall_Customers.csv")
print(dataset)

# Use the third and fourth columns (annual income, spending score) as features
x = dataset.iloc[:, [3, 4]].values
print(x)

# Compute WCSS for k = 1..10 to pick the number of clusters (elbow method)
wcss_list = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss_list)
plt.title("The Elbow Method Graph")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (Within-Cluster Sum of Squares)")
plt.show()
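The program above clusters customer data; to apply K-means to documents and evaluate the result (the aim's second bullet), a sketch using TF-IDF features and the silhouette score (the corpus below is illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

documents = [
    "the stock market fell sharply today",
    "investors worry about rising interest rates",
    "the football team won the championship",
    "the striker scored two goals in the final",
]

# Represent documents as TF-IDF vectors
X = TfidfVectorizer().fit_transform(documents)

# Cluster the documents into two groups
kmeans = KMeans(n_clusters=2, init='k-means++', random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette score measures cluster cohesion/separation (closer to 1 is better)
print("Cluster labels:", labels)
print("Silhouette score: %.3f" % silhouette_score(X, labels))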

Output:

Practical No. 7
Aim: Web Crawling and Indexing
 Develop a web crawler to fetch and index web pages.
 Handle challenges such as robots.txt, dynamic content, and crawling
delays.
Program:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Simple Web Crawler
class SimpleWebCrawler:
    def __init__(self, base_url, max_pages=10):
        self.base_url = base_url
        self.max_pages = max_pages
        self.visited = set()

    def crawl(self, url, depth=0):
        # Stop at already-visited pages or once the depth limit is reached
        if url in self.visited or depth >= self.max_pages:
            return
        try:
            response = requests.get(url)
            if response.status_code != 200:
                return

            print(f"Crawling: {url}")
            self.visited.add(url)

            soup = BeautifulSoup(response.text, 'html.parser')

            # Follow only links that stay within the base site
            for link in soup.find_all('a', href=True):
                next_url = urljoin(url, link['href'])
                if next_url.startswith(self.base_url):
                    self.crawl(next_url, depth + 1)

        except Exception as e:
            print(f"Failed to crawl {url}: {e}")

# Usage example
if __name__ == "__main__":
    start_url = "https://www.tpointtech.com/"
    crawler = SimpleWebCrawler(base_url=start_url, max_pages=5)
    crawler.crawl(start_url)
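The crawler above does not yet address the challenges named in the aim. A sketch of one way to respect robots.txt and add a crawl delay, using urllib.robotparser and time.sleep (handling JavaScript-rendered dynamic content would additionally need a browser-driven tool such as Selenium; polite_fetch is a hypothetical helper, not part of the original program):

import time
import requests
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def polite_fetch(url, base_url, delay=1.0, user_agent="SimpleWebCrawler"):
    """Fetch url only if robots.txt allows it, pausing between requests."""
    rp = RobotFileParser()
    rp.set_url(urljoin(base_url, "/robots.txt"))
    rp.read()
    if not rp.can_fetch(user_agent, url):
        print(f"Blocked by robots.txt: {url}")
        return None
    time.sleep(delay)  # crawl delay so the server is not overloaded
    return requests.get(url, headers={"User-Agent": user_agent})

In SimpleWebCrawler, the parser would sensibly be created once in __init__ and reused for every request rather than re-read per fetch.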

Output:

Practical No. 8
Aim: Link Analysis and PageRank
 Implement the PageRank algorithm to rank web pages based on link
analysis.
 Apply the PageRank algorithm to a small web graph and analyze the
results.

Program:
import numpy as np

def pagerank(graph, alpha=0.85, tol=1.0e-6, max_iter=100):
    """Computes PageRank scores for a directed graph."""
    n = len(graph)
    if n == 0:
        return {}

    # Initialize ranks uniformly
    ranks = np.ones(n) / n

    # Construct the transition matrix M
    M = np.zeros((n, n))
    nodes = list(graph.keys())
    for i, node in enumerate(graph):
        links = graph[node]
        if links:
            M[i, [nodes.index(dest) for dest in links]] = 1 / len(links)
        else:
            M[i, :] = 1 / n  # distribute rank to all pages (dangling node)

    # Power iteration method
    for _ in range(max_iter):
        new_ranks = alpha * np.dot(M.T, ranks) + (1 - alpha) / n * np.ones(n)

        # Check for convergence (change in ranks below tolerance)
        if np.linalg.norm(new_ranks - ranks, ord=1) < tol:
            break

        ranks = new_ranks

    return {node: ranks[i] for i, node in enumerate(graph)}

if __name__ == "__main__":
    # Example web graph
    web_graph = {
        "A": ["B", "C"],
        "B": ["C", "D"],
        "C": ["A"],
        "D": ["C"]
    }

    # Compute PageRank
    ranks = pagerank(web_graph)

    # Print PageRank scores
    print("PageRank Scores:")
    for page, rank in ranks.items():
        print(f"{page}: {rank:.4f}")
Output:

Practical No. 9
Aim: Learning to Rank
 Implement a learning to rank algorithm (e.g., RankSVM or
RankBoost).
 Train the ranking model using labelled data and evaluate its
effectiveness.
Program:
# NOTE: gensim.summarization is available only in gensim versions before 4.0
from gensim.summarization import summarize

def extractive_summary(text, ratio=0.2):
    """
    Generate an extractive summary of the given text using Gensim's summarize function.

    Parameters:
    - text (str): The input text to summarize.
    - ratio (float): The ratio of sentences to include in the summary (default is 0.2).

    Returns:
    - str: The extractive summary.
    """
    try:
        summary = summarize(text, ratio=ratio)
        return summary
    except ValueError:
        return "Input text is too short to summarize."

# Example usage:
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the
natural intelligence displayed by humans. Leading AI textbooks define the field as the study
of "intelligent agents": any device that perceives its environment and takes actions to
maximize its chance of success at some goal. Colloquially, the term "artificial intelligence"
is applied when a machine mimics "cognitive" functions that humans associate with the human
mind, such as "learning" and "problem-solving". As machines become increasingly capable,
tasks considered to require "intelligence" are often removed from the definition of AI. A
quip in Tesler's Theorem says "AI is whatever hasn't been done yet." For instance, optical
character recognition is frequently excluded from things considered to be AI, having become
a routine technology. Modern machine capabilities generally classified as AI include
successfully understanding human speech, competing at the highest level in strategic game
systems (such as chess and Go), self-driving cars, and more.
"""

print("Extractive Summary:")
print(extractive_summary(text))
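The program above performs extractive summarization (it corresponds to the first bullet of Practical 10) rather than learning to rank. A minimal pairwise RankSVM-style sketch for the stated aim, using scikit-learn's LinearSVC on difference vectors; the feature values and relevance labels below are illustrative assumptions:

import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

# Illustrative feature vectors for four documents of one query,
# with graded relevance labels (higher = more relevant)
X = np.array([[0.9, 0.2], [0.6, 0.5], [0.3, 0.8], [0.1, 0.1]])
y = np.array([3, 2, 1, 0])

# Pairwise transform: each pair with different labels yields a
# difference vector labelled by the sign of the label difference
X_pairs, y_pairs = [], []
for i, j in combinations(range(len(X)), 2):
    if y[i] != y[j]:
        X_pairs.append(X[i] - X[j])
        y_pairs.append(np.sign(y[i] - y[j]))
        X_pairs.append(X[j] - X[i])  # mirrored pair balances the two classes
        y_pairs.append(np.sign(y[j] - y[i]))

# A linear SVM on the differences learns a RankSVM-style weight vector
model = LinearSVC(C=1.0)
model.fit(np.array(X_pairs), np.array(y_pairs))

# Rank documents by their scores under the learned weights
scores = X @ model.coef_.ravel()
print("Ranking (best first):", np.argsort(-scores))

# Evaluate effectiveness: fraction of document pairs ordered correctly
pairs = list(combinations(range(len(X)), 2))
correct = sum((scores[i] > scores[j]) == (y[i] > y[j]) for i, j in pairs)
print("Pairwise accuracy: %.2f" % (correct / len(pairs)))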

Output:

Practical No. 10
Aim: Advanced Topics in Information Retrieval
 Implement a text summarization algorithm (e.g., extractive or
abstractive).
 Build a question-answering system using techniques such as
information extraction.
Program:
from transformers import pipeline

def abstractive_summary(text, min_length=30, max_length=130):
    """
    Generate an abstractive summary of the given text using Hugging Face's
    transformers pipeline.

    Parameters:
    - text (str): The input text to summarize.
    - min_length (int): Minimum length of the summary.
    - max_length (int): Maximum length of the summary.

    Returns:
    - str: The abstractive summary.
    """
    summarizer = pipeline("summarization")
    summary = summarizer(text, min_length=min_length, max_length=max_length)
    return summary[0]['summary_text']

# Example usage (the sample text from Practical 9):
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the
natural intelligence displayed by humans. Leading AI textbooks define the field as the study
of "intelligent agents": any device that perceives its environment and takes actions to
maximize its chance of success at some goal. Modern machine capabilities generally classified
as AI include successfully understanding human speech, competing at the highest level in
strategic game systems (such as chess and Go), self-driving cars, and more.
"""

print("Abstractive Summary:")
print(abstractive_summary(text))
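For the aim's second bullet, a minimal question-answering sketch using the transformers question-answering pipeline (the context passage is taken from the sample text above; the pipeline's default extractive QA model is assumed):

from transformers import pipeline

# The QA pipeline extracts an answer span from a context passage
qa = pipeline("question-answering")

context = (
    "Artificial intelligence (AI) is intelligence demonstrated by machines, "
    "in contrast to the natural intelligence displayed by humans."
)

result = qa(question="What is artificial intelligence?", context=context)
print(f"Answer: {result['answer']} (score: {result['score']:.3f})")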

Output:
