IR

The document outlines various practical exercises related to Information Retrieval (IR) systems, including document indexing, retrieval models, spelling correction, evaluation metrics, text categorization, clustering, web crawling, link analysis, and advanced topics like text summarization and learning to rank algorithms. Each practical includes code implementations and methodologies for tasks such as building inverted indexes, calculating precision and recall, and developing web crawlers. The exercises aim to provide hands-on experience with key concepts and techniques in the field of IR.

Practical No. 1
- Document Indexing and Retrieval
- Implement an inverted index construction algorithm.
- Build a simple document retrieval system using the constructed index (a retrieval sketch follows the index-construction code below).

# Define the documents


document1 = "The quick brown fox jumped over the lazy dog."
document2 = "The lazy dog slept in the sun."

# Step 1: Tokenize the documents


# Convert each document to lowercase and split it into words
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()
# Combine the tokens into a list of unique terms
terms = list(set(tokens1 + tokens2))

# Step 2: Build the inverted index


# Create an empty dictionary to store the inverted index
inverted_index = {}
# For each term, find the documents that contain it
for term in terms:
    documents = []
    if term in tokens1:
        documents.append("Document 1")
    if term in tokens2:
        documents.append("Document 2")
    inverted_index[term] = documents

# Step 3: Print the inverted index


for term, documents in inverted_index.items():
    print(term, "->", ",".join(documents))
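
To cover the retrieval part of this practical, here is a minimal sketch that queries the index above with boolean AND semantics; the retrieve_documents helper is illustrative and not part of the original code:

# Minimal retrieval on top of the inverted index built above: intersect
# the posting lists of all query terms (boolean AND semantics).
def retrieve_documents(query, inverted_index):
    result = None
    for term in query.lower().split():
        postings = set(inverted_index.get(term, []))
        result = postings if result is None else result & postings
    return sorted(result) if result else []

# Example queries against the two sample documents
print(retrieve_documents("lazy dog", inverted_index))   # ['Document 2'] (doc 1 indexed "dog." with the period)
print(retrieve_documents("quick fox", inverted_index))  # ['Document 1']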

Practical No. 2
- Retrieval Models
- Implement the Boolean retrieval model and process queries
- Implement the vector space model with TF-IDF weighting and cosine similarity (a TF-IDF sketch follows the Boolean script below)

import pandas
from contextlib import redirect_stdout

terms = []
keys = []
vec_Dic = {}
dicti = {}
dummy_List = []
# scratch list used for intermediate operations, cleared after each use

def filter(documents, rows, cols):
    for i in range(rows):
        for j in range(cols):
            if j == 0:
                # first column has the name of the document in the csv file
                keys.append(documents.loc[i].iat[j])
            else:
                dummy_List.append(documents.loc[i].iat[j])
                # dummy list to update the terms in the dictionary
                if documents.loc[i].iat[j] not in terms:
                    # add the term to the list if it is not already present
                    terms.append(documents.loc[i].iat[j])
        copy = dummy_List.copy()
        dicti.update({documents.loc[i].iat[0]: copy})
        # adding the key-value pair to the dictionary
        dummy_List.clear()
        # clearing the dummy list

def bool_Representation(dicti, rows, cols):
    terms.sort()
    for i in dicti:
        for j in terms:
            # if the term is present in the document we append 1, else 0
            if j in dicti[i]:
                dummy_List.append(1)
            else:
                dummy_List.append(0)
        # the 1/0 entries form the boolean representation of document i
        copy = dummy_List.copy()
        # copying the dummy list to a different list
        vec_Dic.update({i: copy})
        # adding the key-value pair to the dictionary
        dummy_List.clear()
        # clearing the dummy list

def query_Vector(query):
    '''Represent the query as a boolean vector over the sorted term list'''
    qvect = []
    for i in terms:
        if i in query:
            qvect.append(1)
        else:
            qvect.append(0)
    # return the query vector in boolean form
    return qvect

def prediction(q_Vect):
    '''Predict which document is most relevant to the given query by
    counting matching positions between the query vector and each document vector'''
    dictionary = {}
    listi = []
    count = 0
    term_Len = len(terms)
    # number of terms present in the term list
    for i in vec_Dic:
        for t in range(term_Len):
            if q_Vect[t] == vec_Dic[i][t]:
                count += 1
        dictionary.update({i: count})
        count = 0
        # reset the count variable for the next document
    for i in dictionary:
        listi.append(dictionary[i])
        # collect the match counts into a list
    listi = sorted(listi, reverse=True)
    ans = ''
    with open('C:\\my\\output.txt', 'w') as f:
        with redirect_stdout(f):
            print("ranking of the documents")
            for count, i in enumerate(listi):
                key = check(dictionary, i)
                # function call to get the key when the value is known
                if count == 0:
                    ans = key
                    # store the name of the most relevant document
                print(key, "rank is", count + 1)
                dictionary.pop(key)
            print(ans, "is the most relevant document for the given query")

def check(dictionary, val):
    '''Return the key when the value is known'''
    for key, value in dictionary.items():
        if val == value:
            return key

def main():
    documents = pandas.read_csv('C:\\my\\Book1.csv')
    # read the data from the csv file as a dataframe
    rows = len(documents)
    # number of rows
    cols = len(documents.columns)
    # number of columns
    filter(documents, rows, cols)
    bool_Representation(dicti, rows, cols)
    print("Enter query")
    query = input()
    query = query.split(' ')
    # split the query into a list of strings
    q_Vect = query_Vector(query)
    # represent the query as a boolean vector
    prediction(q_Vect)

main()
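
The second bullet of this practical (vector space model with TF-IDF weighting and cosine similarity) is not covered by the script above. A minimal sketch using scikit-learn; the two sample documents and the query are illustrative, not taken from Book1.csv:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The lazy dog slept in the sun.",
]
query = "lazy dog"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)  # TF-IDF matrix, one row per document
query_vector = vectorizer.transform([query])  # TF-IDF vector for the query

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_vector, doc_vectors).flatten()
for rank, idx in enumerate(scores.argsort()[::-1], start=1):
    print(f"Rank {rank}: Document {idx + 1} (score {scores[idx]:.3f})")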

Practical No. 3
- Spelling Correction in an IR System
- Develop a spelling correction module using edit distance algorithms.
- Integrate the spelling correction module into an information retrieval system (an integration sketch follows the code below)

# Importing necessary libraries


import nltk
from nltk.metrics.distance import edit_distance
from nltk.corpus import words

# Downloading and importing the 'words' corpus


nltk.download('words')

# List of correct words from the 'words' corpus


correct_words = words.words()

# List of incorrect spellings that need to be corrected


incorrect_words = ['happpy', 'azmaing', 'intelliengt', 'natuer', 'ashy']
# Printing the incorrect words
print("Incorrect Words:", incorrect_words)
print("========= Result =========")

# Loop to find correct spellings based on edit distance


for word in incorrect_words:
    # Calculate the edit distance between the word and all correct words
    temp = [(edit_distance(word, w), w) for w in correct_words]
    # Print the closest correct word (minimum edit distance)
    print(f"Incorrect word: {word} => Corrected word: {sorted(temp, key=lambda val: val[0])[0][1]}")
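
To integrate the correction step into a retrieval system, the corrected query terms can be looked up in an index such as the one from Practical No. 1. A minimal sketch; the correct_query helper and the reuse of inverted_index are assumptions, not part of the original code:

def correct_query(query, vocabulary):
    """Replace each query term with its closest word from the index vocabulary."""
    corrected = []
    for term in query.lower().split():
        if term in vocabulary:
            corrected.append(term)
        else:
            corrected.append(min(vocabulary, key=lambda w: edit_distance(term, w)))
    return corrected

# Correct the query against the index vocabulary, then look up each term
vocabulary = list(inverted_index.keys())  # terms from Practical No. 1's index
for term in correct_query("lazzy dgo", vocabulary):
    print(term, "->", inverted_index.get(term, []))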

Practical No. 4
- Evaluation Metrics for IR Systems
- Calculate precision, recall, and F-measure for a given set of retrieval results
- Use an evaluation toolkit to measure average precision and other metrics (a sketch follows the F-measure code below)

Precision 4.1

from sklearn.metrics import precision_score


# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg

# define predictions
pred_pos = [0 for _ in range(10)] + [1 for _ in range(90)]    # 90 positive predictions, 10 negative predictions
pred_neg = [1 for _ in range(30)] + [0 for _ in range(9970)]  # 30 positive predictions, 9970 negative predictions
y_pred = pred_pos + pred_neg

# calculate precision
precision = precision_score(y_true, y_pred, average='binary')
print('Precision: %.3f' % precision)

Recall 4.2

from sklearn.metrics import recall_score

# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg

# define predictions
pred_pos = [0 for _ in range(10)] + [1 for _ in range(90)]
pred_neg = [0 for _ in range(10000)]
y_pred = pred_pos + pred_neg

# calculate recall
recall = recall_score(y_true, y_pred, average='binary')
print('Recall: %.3f' % recall)

F1 score 4.3

from sklearn.metrics import f1_score

# define actual labels


act_pos = [1 for _ in range(100)] # 100 positive instances
act_neg = [0 for _ in range(10000)] # 10000 negative instances
y_true = act_pos + act_neg # combine actual positive and negative labels

# define predictions
pred_pos = [0 for _ in range(5)] + [1 for _ in range(95)]     # 95 positive predictions, 5 negative predictions
pred_neg = [1 for _ in range(55)] + [0 for _ in range(9945)]  # 55 positive predictions, 9945 negative predictions
y_pred = pred_pos + pred_neg # combine predicted positive and negative labels

# calculate F1 score
score = f1_score(y_true, y_pred, average='binary')

# print the F1 score (F-measure)


print('F-Measure: %.3f' % score)
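
The second bullet of this practical also mentions average precision and other toolkit metrics. A minimal sketch on the same labels; note that average_precision_score is normally fed ranked scores or probabilities, so reusing the hard 0/1 predictions here is a simplification:

from sklearn.metrics import average_precision_score, precision_recall_fscore_support

# Average precision over the same ground truth and predictions as above
ap = average_precision_score(y_true, y_pred)
print('Average Precision: %.3f' % ap)

# Precision, recall and F-measure in a single call
precision, recall, fscore, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
print('Precision: %.3f, Recall: %.3f, F-Measure: %.3f' % (precision, recall, fscore))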

Practical No. 5
-​ Text categorization
-​ Implement a text classification algorithm
-​ Train the classifier on a labelled dataset and evaluate its performance

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset
texts = [
    "I love this movie, it's amazing!",
    "What a great day, I feel so happy!",
    "This is the worst product I've ever bought.",
    "I'm so sad and disappointed.",
    "Absolutely fantastic experience, will buy again!",
    "Terrible customer service, very unhappy.",
    "I'm excited about this new opportunity!",
    "The food was awful and I got sick."
]

# Corresponding labels (1 for positive, 0 for negative)


labels = [1, 1, 0, 0, 1, 0, 1, 0]

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Convert text data into feature vectors


vectorizer = CountVectorizer()

X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

# Train a Naive Bayes classifier


classifier = MultinomialNB()
classifier.fit(X_train_vectors, y_train)

# Make predictions
y_pred = classifier.predict(X_test_vectors)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:")
print(report)

Practical No. 6
- Clustering for Information Retrieval
- Implement a clustering algorithm
- Apply the clustering algorithm to a set of documents and evaluate the results (an evaluation sketch follows the elbow-method code below)

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv(r"C:\my\mall_customers.csv")
print(dataset)
x = dataset.iloc[:, [3, 4]].values
print(x)
from sklearn.cluster import KMeans
wcss_list = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss_list)
plt.title("The Elbow Method Graph")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS (Within-Cluster Sum of Squares)")
plt.show()
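
The elbow plot only helps choose the number of clusters; to evaluate the result, the final model can be fitted and scored. A minimal sketch, assuming k = 5 (read the actual elbow off the plot above):

from sklearn.metrics import silhouette_score

k = 5  # assumed elbow point; adjust after inspecting the plot
kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42)
labels = kmeans.fit_predict(x)

print("Cluster sizes:", np.bincount(labels))
print("Silhouette score: %.3f" % silhouette_score(x, labels))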

Practical No. 7
- Web Crawling and Indexing
- Develop a web crawler to fetch and index web pages
- Handle challenges such as robots.txt, dynamic content, and crawling delays (a sketch follows the crawler below)

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Simple Web Crawler


class SimpleWebCrawler:
    def __init__(self, base_url, max_pages=10):
        self.base_url = base_url
        self.max_pages = max_pages
        self.visited = set()

    def crawl(self, url, depth=0):
        if url in self.visited or depth >= self.max_pages:
            return
        try:
            response = requests.get(url)
            if response.status_code != 200:
                return

            print(f"Crawling: {url}")
            self.visited.add(url)

            soup = BeautifulSoup(response.text, 'html.parser')

            # follow only links that stay on the same site
            for link in soup.find_all('a', href=True):
                next_url = urljoin(url, link['href'])
                if next_url.startswith(self.base_url):
                    self.crawl(next_url, depth + 1)

        except Exception as e:
            print(f"Failed to crawl {url}: {e}")

# Usage example
if __name__ == "__main__":
    start_url = "https://www.tpointtech.com/"
    crawler = SimpleWebCrawler(base_url=start_url, max_pages=5)
    crawler.crawl(start_url)
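
The practical also asks to handle robots.txt and crawling delays, which the simple crawler above skips; JavaScript-rendered (dynamic) content would additionally need a headless browser such as Selenium or Playwright. A minimal sketch of the robots.txt check and a fixed delay; the helper names allowed_by_robots and polite_get are illustrative:

import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="*"):
    """Check the site's robots.txt before fetching a URL."""
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        return True  # if robots.txt cannot be read, fall back to allowing the fetch
    return rp.can_fetch(user_agent, url)

def polite_get(url, delay=1.0):
    """Fetch a URL only if robots.txt allows it, then pause before the next request."""
    if not allowed_by_robots(url):
        print(f"Blocked by robots.txt: {url}")
        return None
    response = requests.get(url, timeout=10)
    time.sleep(delay)  # crawling delay between requests
    return response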

Practical No. 8
- Link Analysis and PageRank
- Implement the PageRank algorithm to rank web pages based on link analysis
- Apply the PageRank algorithm to a small web graph and analyse the results

import numpy as np
import networkx as nx
def pagerank(graph, alpha=0.85, tol=1.0e-6, max_iter=100):
    """Computes PageRank scores for a directed graph."""
    n = len(graph)
    if n == 0:
        return {}

    # Initialize ranks uniformly
    ranks = np.ones(n) / n
    # Create transition matrix M
    M = np.zeros((n, n))

    # Construct the transition matrix
    for i, node in enumerate(graph):
        links = graph[node]
        if links:
            M[i, [list(graph.keys()).index(dest) for dest in links]] = 1 / len(links)
        else:
            M[i, :] = 1 / n  # distribute rank to all pages (handling dangling nodes)

    # Power iteration method
    for _ in range(max_iter):
        new_ranks = alpha * np.dot(M.T, ranks) + (1 - alpha) / n * np.ones(n)

        # Check for convergence (if the change in ranks is less than tolerance)
        if np.linalg.norm(new_ranks - ranks, ord=1) < tol:
            break

        ranks = new_ranks

    return {node: ranks[i] for i, node in enumerate(graph)}

if __name__ == "__main__":

    # Example web graph
    web_graph = {
        "A": ["B", "C"],
        "B": ["C", "D"],
        "C": ["A"],
        "D": ["C"]
    }

    # Compute PageRank
    ranks = pagerank(web_graph)

    # Print PageRank scores
    print("PageRank Scores:")
    for page, rank in ranks.items():
        print(f"{page}: {rank:.4f}")

Practical No. 9
- Advanced Topics in Information Retrieval

a) Implement a text summarization algorithm.
   Build a question answering system using techniques such as information extraction (a QA sketch follows the summarizer below).

from transformers import pipeline

def abstractive_summary(text, min_length=30, max_length=130):
    summarizer = pipeline("summarization")
    summary = summarizer(text, min_length=min_length, max_length=max_length)
    return summary[0]['summary_text']

# Example usage:
text = """
Machine learning is a subset of artificial intelligence that focuses on building systems that learn
from data.
These systems improve their performance as they are exposed to more data over time, without
being explicitly programmed.
"""
print("Abstractive Summary:")
print(abstractive_summary(text))
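
Part (a) also asks for a question answering system, which the summarizer above does not cover. A minimal sketch using the Hugging Face question-answering pipeline; the default model choice is left to the library, and the question and context below are illustrative:

def answer_question(question, context):
    qa = pipeline("question-answering")
    result = qa(question=question, context=context)
    return result['answer']

# Example usage:
context = (
    "Machine learning is a subset of artificial intelligence that focuses on "
    "building systems that learn from data."
)
print(answer_question("What is machine learning a subset of?", context))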

b) Implement a learning to rank algorithm.
   Train the ranking model using labelled data and evaluate its effectiveness (a pointwise sketch follows the extractive summarizer below).

from gensim.summarization import summarize

def extractive_summary(text, ratio=0.2):
    """
    Generate an extractive summary of the given text using Gensim's summarize function.
    Note: gensim.summarization was removed in Gensim 4.0, so this requires gensim < 4.0.
    Parameters:
    - text (str): The input text to summarize.
    - ratio (float): The ratio of sentences to include in the summary (default is 0.2).
    Returns:
    - str: The extractive summary.
    """
    try:
        summary = summarize(text, ratio=ratio)
        return summary
    except ValueError:
        return "Input text is too short to summarize."

# Example usage:
text = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural
intelligence displayed by humans. Leading AI textbooks define the
field as the study of "intelligent agents": any device that perceives its environment and takes
actions to maximize its chance of success at some goal. Colloquially, the term "artificial
intelligence" is applied when a machine mimics "cognitive" functions that humans associate with
the human mind, such as "learning" and "problem-solving". As machines become increasingly
capable, tasks considered to require "intelligence" are often removed from the definition of AI. A
quip in Tesler's Theorem says "AI is whatever hasn't been done yet."
For instance, optical character recognition is frequently excluded from things considered to be
AI, having become a routine technology.
Modern machine capabilities generally classified as AI include successfully understanding
human speech, competing at the highest level in strategic game systems (such as chess and
Go), self-driving cars, and more.
"""

print("Extractive Summary:")
print(extractive_summary(text))
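
Part (b) names a learning to rank algorithm, which the extractive summarizer above does not implement. A minimal pointwise sketch; the feature matrix and relevance labels are synthetic, and pointwise regression is only one of several learning-to-rank approaches:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic query-document features (e.g. TF-IDF score, BM25 score, in-link count)
# with graded relevance labels (0 = irrelevant, 1 = partially relevant, 2 = relevant)
X_train = np.array([[0.9, 0.8, 10], [0.2, 0.1, 1], [0.7, 0.6, 5],
                    [0.1, 0.2, 0], [0.8, 0.9, 8], [0.3, 0.2, 2]])
y_train = np.array([2, 0, 1, 0, 2, 0])

# Pointwise learning to rank: regress relevance from features, then rank by predicted score
model = LinearRegression()
model.fit(X_train, y_train)

X_test = np.array([[0.85, 0.7, 7], [0.15, 0.1, 1], [0.5, 0.4, 3]])
scores = model.predict(X_test)
ranking = scores.argsort()[::-1]
print("Predicted ranking of test documents (best first):", ranking.tolist())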
