IR
Practical No. 1
- Document Indexing and Retrieval
- Implement an inverted index construction algorithm.
- Build a simple document retrieval system using the constructed index.
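No listing accompanies this practical, so the following is only a minimal sketch of the two objectives above. It assumes lowercase alphanumeric tokenization and AND semantics for multi-term queries; the function names and the toy collection are illustrative.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the IDs of documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical toy collection: document ID -> text
docs = {
    1: "Information retrieval finds relevant documents",
    2: "An inverted index maps terms to documents",
    3: "Retrieval systems use an inverted index",
}
index = build_inverted_index(docs)
print(search(index, "inverted index"))
print(search(index, "retrieval documents"))
```

Storing postings as sorted lists keeps the index easy to inspect; a larger system would typically intersect the sorted lists directly instead of converting them to sets.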
Practical No. 2
- Retrieval Models
- Implement the Boolean retrieval model and process queries
- Implement the vector space model with TF-IDF weighting and cosine similarity
import pandas
from contextlib import redirect_stdout
terms = []
keys = []
vec_Dic = {}
dicti = {}
dummy_List = []
def query_Vector(query):
    '''In this function we represent the query in the form of boolean vector'''
    # query vector which is returned at the end of the function
    qvect = []
    for i in terms:
        if i in query:
            qvect.append(1)
        else:
            qvect.append(0)
    # return the query vector which is obtained in the boolean form
    return qvect
def check(dictionary, val):
    '''Helper assumed from its use in prediction(): return the first key
    whose value equals val'''
    for key, value in dictionary.items():
        if value == val:
            return key
    return None

def prediction(q_Vect):
    '''In this function we make the prediction regarding which document is related
    to the given query by performing the boolean operations'''
    dictionary = {}
    listi = []
    count = 0
    # number of terms present in the term list
    term_Len = len(terms)
    for i in vec_Dic:
        for t in range(term_Len):
            if q_Vect[t] == vec_Dic[i][t]:
                count += 1
        dictionary.update({i: count})
        # reinitialisation of count variable to 0
        count = 0
    for i in dictionary:
        # here we append the count value to list
        listi.append(dictionary[i])
    listi = sorted(listi, reverse=True)
    ans = ''
    with open('C:\\my\\output.txt', 'w') as f:
        with redirect_stdout(f):
            print("ranking of the documents")
            for count, i in enumerate(listi):
                # Function call to get the key when the value is known
                key = check(dictionary, i)
                if count == 0:
                    # to store the name of the document which is most relevant
                    ans = key
                print(key, "rank is", count+1)
                dictionary.pop(key)
            # to print the name of the document which is most relevant
            print(ans, "is the most relevant document for the given query")
def main():
    # to read the data from the csv file as a dataframe
    documents = pandas.read_csv(r'C:\my\Book1.csv')
    # to get the number of rows
    rows = len(documents)
    # to get the number of columns
    cols = len(documents.columns)
    # filter() and bool_Representation() are helper functions that do not
    # appear in this listing; they are expected to fill terms, dicti and vec_Dic
    filter(documents, rows, cols)
    bool_Representation(dicti, rows, cols)
    print("Enter query")
    query = input()
    # splitting the query as a list of strings
    query = query.split(' ')
    # function call to represent the query in the form of boolean vector
    q_Vect = query_Vector(query)
    prediction(q_Vect)

main()
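The listing above covers only the Boolean model. For the second objective of this practical, the vector space model, the following is a minimal sketch that assumes whitespace tokenization, raw term counts as TF and idf = log(N / df). The function names and the three toy documents are illustrative and are not taken from the original program.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Return (vocabulary, idf table, TF-IDF vectors) for tokenized documents."""
    vocab = sorted({term for doc in tokenized_docs for term in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents in which each term occurs
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, idf, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document index) pairs, best match first."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab, idf, vectors = tf_idf_vectors(tokenized)
    q_tf = Counter(query.lower().split())
    q_vec = [q_tf[term] * idf[term] for term in vocab]
    scores = [(cosine_similarity(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical toy collection
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
results = rank("dog cat", docs)
for score, doc_index in results:
    print(f"doc {doc_index}: {score:.3f}")
```

Unlike the Boolean match count, the cosine score rewards rare terms (high idf) and is independent of document length.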
Practical No. 3
- Spelling correction in IR system
- Develop a spelling correction module using edit distance algorithms.
- Integrate the spelling correction module into an information retrieval system
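No listing accompanies this practical. The following is a minimal sketch of the objectives above, assuming Levenshtein distance (unit cost for insertion, deletion and substitution) and a small in-memory vocabulary. The names correct, correct_query and max_distance are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct(word, vocabulary, max_distance=2):
    """Return the closest vocabulary term within max_distance, else the word itself."""
    if word in vocabulary:
        return word
    best = min(vocabulary, key=lambda term: (edit_distance(word, term), term))
    return best if edit_distance(word, best) <= max_distance else word


def correct_query(query, vocabulary):
    """Correct each query term before it is passed to the retrieval system."""
    return " ".join(correct(word, vocabulary) for word in query.lower().split())


# Hypothetical index vocabulary
vocabulary = {"information", "retrieval", "system", "index", "query"}
print(correct_query("informtion retreival syatem", vocabulary))
```

In a retrieval system the vocabulary would normally be the set of index terms, so that a corrected query only contains words the index can actually match.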
Practical No. 4
- Evaluation Metrics for IR systems
- Calculate precision, recall and F-measure for a given set of retrieval results
- Use an evaluation toolkit to measure average precision and other metrics
Precision 4.1
from sklearn.metrics import precision_score
# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg
# define predictions
# actual positives: 10 predicted negative, 90 predicted positive
pred_pos = [0 for _ in range(10)] + [1 for _ in range(90)]
# actual negatives: 30 predicted positive, 9970 predicted negative
pred_neg = [1 for _ in range(30)] + [0 for _ in range(9970)]
y_pred = pred_pos + pred_neg
# calculate precision
precision = precision_score(y_true, y_pred, average='binary')
print('Precision: %.3f' % precision)
Recall 4.2
from sklearn.metrics import recall_score
# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg
# define predictions
pred_pos = [0 for _ in range(10)] + [1 for _ in range(90)]
pred_neg = [0 for _ in range(10000)]
y_pred = pred_pos + pred_neg
# calculate recall
recall = recall_score(y_true, y_pred, average='binary')
print('Recall: %.3f' % recall)
F1 score 4.3
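from sklearn.metrics import f1_score
# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg
# define predictions
# actual positives: 5 predicted negative, 95 predicted positive
pred_pos = [0 for _ in range(5)] + [1 for _ in range(95)]
# actual negatives: 55 predicted positive, 9945 predicted negative
pred_neg = [1 for _ in range(55)] + [0 for _ in range(9945)]
# combine predicted positive and negative labels
y_pred = pred_pos + pred_neg
# calculate F1 score
score = f1_score(y_true, y_pred, average='binary')
print('F1 score: %.3f' % score)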
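The listings in this practical score label vectors with scikit-learn. For a ranked result list, which is the usual output of a retrieval system, the same metrics and average precision can be computed directly. The following is a minimal sketch in plain Python rather than an evaluation toolkit; the document IDs and relevance judgments are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


# Hypothetical ranked results and relevance judgments for one query
ranked = ["d3", "d1", "d7", "d4", "d9"]
relevant = {"d1", "d3", "d5"}

precision, recall, f1 = precision_recall_f1(ranked, relevant)
ap = average_precision(ranked, relevant)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}  AP: {ap:.3f}")
```

Average precision depends on the order of the results, whereas precision, recall and F1 treat the result list as an unordered set.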
Practical No. 5
- Text categorization
- Implement a text classification algorithm
- Train the classifier on a labelled dataset and evaluate its performance
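import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset
texts = [
    "I love this movie, it's amazing!",
    "What a great day, I feel so happy!",
    "This is the worst product I've ever bought.",
    "I'm so sad and disappointed.",
    "Absolutely fantastic experience, will buy again!",
    "Terrible customer service, very unhappy.",
    "I'm excited about this new opportunity!",
    "The food was awful and I got sick."
]
# Sentiment labels for the texts above (assumed; the label list does not
# appear in the listing)
labels = ["positive", "positive", "negative", "negative",
          "positive", "negative", "positive", "negative"]

# Split into training and test sets (split parameters are assumed)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# Convert the texts to bag-of-words count vectors
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

# Train a multinomial Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_vectors, y_train)

# Make predictions
y_pred = classifier.predict(X_test_vectors)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)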
Practical No. 6
- Clustering for Information Retrieval
- Implement a clustering algorithm
- Apply the clustering algorithm to a set of documents and evaluate the result
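import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Load the dataset (raw string so the backslashes are not treated as escapes)
dataset = pd.read_csv(r"C:\my\mall_customers.csv")
print(dataset)
# Use the columns at positions 3 and 4 as the features to cluster
x = dataset.iloc[:, [3, 4]].values
print(x)

# Fit k-means for k = 1..10 and record the WCSS (inertia) of each model
wcss_list = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

# Plot WCSS against k; the "elbow" of the curve suggests a suitable k
plt.plot(range(1, 11), wcss_list)
plt.title("The Elbow Method Graph")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (Within-Cluster Sum of Squares)")
plt.show()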
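The listing above clusters numeric customer data, while the objective of this practical mentions documents. The following is a minimal sketch of k-means on bag-of-words document vectors in plain Python, evaluated with the same WCSS measure. The evenly spaced seeding is an assumption made so that the run is deterministic (the listing above uses k-means++ instead), and the four toy documents are hypothetical.

```python
def bag_of_words(docs):
    """Return (vocabulary, count vectors) using whitespace tokenization."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = [[float(tokens.count(word)) for word in vocab] for tokens in tokenized]
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Plain k-means; returns (cluster label per vector, WCSS)."""
    n = len(vectors)
    # Assumed deterministic seeding: evenly spaced documents as initial centroids
    centroids = [list(vectors[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every vector
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(v, centroids[lab]) for v, lab in zip(vectors, labels))
    return labels, wcss


# Hypothetical toy documents: two about sports, two about finance
docs = [
    "cricket match score team",
    "football team match goal",
    "stock market price share",
    "market share price profit",
]
vocab, vectors = bag_of_words(docs)
labels, wcss = kmeans(vectors, k=2)
print("Cluster labels:", labels)
print("WCSS:", wcss)
```

With real document collections, TF-IDF weights and length normalization usually give better clusters than raw counts.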
Practical No. 7
- Web Crawling and Indexing
- Develop a web crawler to fetch and index web pages
- Handle challenges such as robots.txt, dynamic content, and crawl delays
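import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Only the print/except lines and the usage example survive in the listing;
# the rest of this class is an assumed minimal reconstruction.
class SimpleWebCrawler:
    def __init__(self, base_url, max_pages=10):
        self.base_url = base_url
        self.max_pages = max_pages
        self.visited = set()

    def crawl(self, url):
        if url in self.visited or len(self.visited) >= self.max_pages:
            return
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            print(f"Crawling: {url}")
            self.visited.add(url)
            soup = BeautifulSoup(response.text, "html.parser")
            # Follow links that stay on the same site
            for link in soup.find_all("a", href=True):
                next_url = urljoin(url, link["href"])
                if next_url.startswith(self.base_url):
                    self.crawl(next_url)
        except Exception as e:
            print(f"Failed to crawl {url}: {e}")

# Usage example
if __name__ == "__main__":
    start_url = "https://www.tpointtech.com/"
    crawler = SimpleWebCrawler(base_url=start_url, max_pages=5)
    crawler.crawl(start_url)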
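The crawler above does not yet address robots.txt rules or crawl delays, which the objectives list as challenges. The following is a minimal sketch using the standard library's urllib.robotparser. It assumes the robots.txt rules are already available as text lines, so no network access is needed here. The user-agent name, the example.com URLs and the helper names are hypothetical, and dynamic (JavaScript-rendered) content is outside the scope of this sketch.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "SimpleWebCrawler"  # hypothetical user-agent name


def allowed_urls(robots_lines, urls, user_agent=USER_AGENT, default_delay=1.0):
    """Return (URLs permitted by robots.txt, delay in seconds between requests)."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules supplied as a list of text lines
    # Use the site's Crawl-delay if it declares one, else fall back to a default
    delay = parser.crawl_delay(user_agent) or default_delay
    permitted = [url for url in urls if parser.can_fetch(user_agent, url)]
    return permitted, delay


def crawl_politely(urls, delay, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    for position, url in enumerate(urls):
        if position > 0:
            sleep(delay)
        fetch(url)


# Hypothetical robots.txt content and candidate URLs
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
    "https://example.com/about.html",
]

permitted, delay = allowed_urls(robots_lines, candidates)
fetched, pauses = [], []
# Record the calls instead of fetching and sleeping, to keep the demo offline
crawl_politely(permitted, delay, fetch=fetched.append, sleep=pauses.append)
print("Permitted:", permitted)
print("Delay between requests:", delay)
```

In a live crawler, RobotFileParser.set_url() followed by read() would download the site's robots.txt, and fetch would be the function that requests and parses each page.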
Practical No. 8
- Link Analysis and PageRank
- Implement the PageRank algorithm to rank web pages based on link analysis
- Apply the PageRank algorithm to a small web graph and analyze the results
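import numpy as np
import networkx as nx

def pagerank(graph, alpha=0.85, tol=1.0e-6, max_iter=100):
    """Computes PageRank scores for a directed graph."""
    nodes = list(graph.nodes())
    n = len(graph)
    if n == 0:
        return {}
    index = {node: i for i, node in enumerate(nodes)}
    # Initialize ranks
    ranks = np.ones(n) / n
    # Create transition matrix M (assumed column-stochastic: M[j, i] is the
    # probability of moving from page i to page j)
    M = np.zeros((n, n))
    for node in nodes:
        out_links = list(graph.successors(node))
        if out_links:
            for target in out_links:
                M[index[target], index[node]] = 1.0 / len(out_links)
        else:
            # Dangling page: assumed to link to every page with equal probability
            M[:, index[node]] = 1.0 / n
    # Power iteration (loop reconstructed around the surviving convergence test)
    for _ in range(max_iter):
        new_ranks = alpha * M @ ranks + (1 - alpha) / n
        # Check for convergence (if the change in ranks is less than tolerance)
        if np.linalg.norm(new_ranks - ranks, ord=1) < tol:
            break
        ranks = new_ranks
    return dict(zip(nodes, ranks))

if __name__ == "__main__":
    # Small example web graph (assumed; the graph definition is not in the listing)
    web_graph = nx.DiGraph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")])
    # Compute PageRank
    ranks = pagerank(web_graph)
    for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
        print(f"{page}: {score:.4f}")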
# Example usage:
text = """
Machine learning is a subset of artificial intelligence that focuses on building systems that learn from data.
These systems improve their performance as they are exposed to more data over time, without being explicitly programmed.
"""
print("Abstractive Summary:")
print(abstractive_summary(text))
# Example usage:
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions to maximize its chance of success at some goal. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with the human mind, such as "learning" and "problem-solving". As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI. A quip in Tesler's Theorem says "AI is whatever hasn't been done yet."
For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.
Modern machine capabilities generally classified as AI include successfully understanding human speech, competing at the highest level in strategic game systems (such as chess and Go), self-driving cars, and more.
"""
print("Extractive Summary:")
print(extractive_summary(text))
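The two usage examples above call abstractive_summary and extractive_summary, but neither function is defined in this listing. As a stand-in for the extractive case only, the following is a minimal sketch that scores each sentence by the average frequency of its non-stopword terms and keeps the top sentences in their original order. The stopword list, the scoring rule and the num_sentences parameter are assumptions, not the original implementation.

```python
import re
from collections import Counter

# Small illustrative stopword list (assumed)
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "to", "was", "with",
}


def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence positions by score (ties broken by position), then restore order
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


# Hypothetical sample text
sample = (
    "Search engines index documents. "
    "Search engines rank documents by relevance. "
    "The weather was pleasant."
)
print(extractive_summary(sample, num_sentences=1))
```

An abstractive summarizer generates new sentences and typically relies on a trained sequence-to-sequence model, so it is not sketched here.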