IR Practical

The document outlines practical exercises in Information Retrieval, including document indexing, retrieval models, spelling correction, evaluation metrics, clustering, web crawling, and link analysis. Each practical includes code demonstrating algorithms such as inverted index construction, Boolean retrieval, TF-IDF weighting with cosine similarity, K-means clustering, and PageRank. The exercises aim to provide hands-on experience with key concepts and techniques in information retrieval systems.

PRACTICAL NO : 1

Name: Class:TYCS Date:

Subject: Information Retrieval Roll no:

Aim: Document Indexing and Retrieval

Implement an inverted index construction algorithm.


Build a simple document retrieval system using the constructed index.

Code:

import nltk
from nltk.corpus import stopwords

document1 = "The quick brown fox jumped over the lazy dog"
document2 = "The lazy dog slept in the sun"

nltk.download('stopwords')
stopWords = stopwords.words('english')

# Tokenize both documents and collect the vocabulary
tokens1 = document1.lower().split()
tokens2 = document2.lower().split()
terms = list(set(tokens1 + tokens2))

# Map each non-stopword term to the documents it appears in,
# recording its frequency per document
inverted_index = {}
occ_num_doc1 = {}
occ_num_doc2 = {}

for term in terms:
    if term in stopWords:
        continue
    documents = []
    if term in tokens1:
        documents.append("Document 1")
        occ_num_doc1[term] = tokens1.count(term)
    if term in tokens2:
        documents.append("Document 2")
        occ_num_doc2[term] = tokens2.count(term)
    inverted_index[term] = documents

# Print each term with its postings list and term frequencies
for term, documents in inverted_index.items():
    print(term, "->", end=" ")
    for doc in documents:
        if doc == "Document 1":
            print(f"{doc} ({occ_num_doc1.get(term, 0)}),", end=" ")
        else:
            print(f"{doc} ({occ_num_doc2.get(term, 0)}),", end=" ")
    print()
Output:
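The aim also calls for a simple retrieval system on top of the constructed index. A minimal sketch (not part of the original listing; the retrieve function name is illustrative) that returns the documents containing every non-stopword query term:

def retrieve(query, index):
    # Intersect the postings lists of all non-stopword query terms
    doc_sets = []
    for term in query.lower().split():
        if term in stopWords:
            continue
        doc_sets.append(set(index.get(term, [])))
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)

print(retrieve("lazy dog", inverted_index))  # both documents contain 'lazy' and 'dog'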
PRACTICAL NO : 2
Name: Class:TYCS Date:

Subject: Information Retrieval Roll no:

Aim: Retrieval Models

● Implement the Boolean retrieval model and process queries.

● Implement the vector space model with TF-IDF weighting and cosine similarity.

A. Implement the Boolean retrieval model and process queries.

Code:

documents = {
    1: "apple banana orange",
    2: "apple banana",
    3: "banana orange",
    4: "apple"
}

# Build a term -> set-of-document-ids inverted index
def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        terms = set(text.split())
        for term in terms:
            if term not in index:
                index[term] = {doc_id}
            else:
                index[term].add(doc_id)
    return index

inverted_index = build_index(documents)

# AND: intersect the postings of all query terms
def boolean_and(operands, index):
    if not operands:
        return list(range(1, len(documents) + 1))
    result = index.get(operands[0], set())
    for term in operands[1:]:
        result = result.intersection(index.get(term, set()))
    return list(result)

# OR: union of the postings of all query terms
def boolean_or(operands, index):
    result = set()
    for term in operands:
        result = result.union(index.get(term, set()))
    return list(result)

# NOT: complement of a term's postings over all documents
def boolean_not(operand, index, total_docs):
    operand_set = set(index.get(operand, set()))
    all_docs_set = set(range(1, total_docs + 1))
    return list(all_docs_set.difference(operand_set))

query1 = ["apple", "banana"]
query2 = ["apple", "orange"]

result1 = boolean_and(query1, inverted_index)
result2 = boolean_or(query2, inverted_index)
result3 = boolean_not("orange", inverted_index, len(documents))

print("Documents containing 'apple' and 'banana':", result1)
print("Documents containing 'apple' or 'orange':", result2)
print("Documents not containing 'orange':", result3)

Output:
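The three operators can also be composed to answer a compound query; a small illustrative sketch (the compound query below is an example, not from the original listing):

# (apple AND banana) OR (NOT orange)
left = set(boolean_and(["apple", "banana"], inverted_index))
right = set(boolean_not("orange", inverted_index, len(documents)))
print("(apple AND banana) OR (NOT orange):", sorted(left | right))
# apple AND banana -> {1, 2}; NOT orange -> {2, 4}; union -> [1, 2, 4]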
B. Implement the vector space model with TF-IDF weighting and cosine similarity.

Code:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import nltk
from nltk.corpus import stopwords
import numpy as np
from numpy.linalg import norm

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright."]

nltk.download('stopwords')
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
transformer = TfidfTransformer()

# Term-count vectors for the train and test sentences
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)

# Cosine similarity between two vectors
cx = lambda a, b: round(np.inner(a, b) / (norm(a) * norm(b)), 3)

for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cx(vector, testV)
        print(cosine)

# TF-IDF weights for the count vectors
transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

Output:
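The listing computes cosine similarity on raw count vectors and TF-IDF weights in separate steps. A shorter equivalent sketch, assuming scikit-learn's TfidfVectorizer and cosine_similarity (both standard scikit-learn APIs), combines the two:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit TF-IDF on the training sentences, transform the test sentence,
# then compare the test vector against each training vector.
tfidf_vec = TfidfVectorizer(stop_words='english')
train_tfidf = tfidf_vec.fit_transform(train_set)
test_tfidf = tfidf_vec.transform(test_set)
print(cosine_similarity(test_tfidf, train_tfidf))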
PRACTICAL NO : 3
Name: Class:TYCS Date:

Subject: Information Retrieval Roll no:

Aim: Spelling Correction in IR Systems

● Develop a spelling correction module using edit distance algorithms.

● Integrate the spelling correction module into an information retrieval system.

Code:

# Recursive Levenshtein edit distance between str1[:m] and str2[:n]
def editDistance(str1, str2, m, n):
    if m == 0:
        return n
    if n == 0:
        return m
    if str1[m-1] == str2[n-1]:
        return editDistance(str1, str2, m-1, n-1)
    return 1 + min(editDistance(str1, str2, m, n-1),    # insertion
                   editDistance(str1, str2, m-1, n),    # deletion
                   editDistance(str1, str2, m-1, n-1))  # substitution

str1 = "saturday"
str2 = "monday"
print("Edit Distance is:", editDistance(str1, str2, len(str1), len(str2)))

Output:
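The second part of the aim, integrating the correction module into a retrieval system, is not covered by the listing. A minimal sketch under assumed names (the vocabulary list, correct function, and distance threshold are illustrative): map each query term to its nearest vocabulary term by edit distance before lookup.

vocabulary = ["saturday", "monday", "sunday", "holiday"]  # assumed index vocabulary

def correct(word, vocab, max_distance=2):
    # Pick the vocabulary term with the smallest edit distance,
    # keeping the original word if nothing is close enough.
    best = min(vocab, key=lambda v: editDistance(word, v, len(word), len(v)))
    if editDistance(word, best, len(word), len(best)) <= max_distance:
        return best
    return word

print(correct("mondey", vocabulary))  # -> monday

Note that the recursive editDistance above is exponential in the string lengths, so a real vocabulary would call for a dynamic-programming version.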
PRACTICAL NO : 4
Name: Class:TYCS Date:

Subject: Information Retrieval Roll no:

Aim: Evaluation Metrics for IR Systems

A. Calculate precision, recall, and F-measure for a given set of retrieval results.

B. Use an evaluation toolkit to measure average precision and other evaluation metrics.

A. Calculate precision, recall, and F-measure for a given set of retrieval results.

Code:

def calculate_metrics(retrieved_set, relevant_set):
    true_positive = len(retrieved_set.intersection(relevant_set))
    false_positive = len(retrieved_set.difference(relevant_set))
    false_negative = len(relevant_set.difference(retrieved_set))
    '''
    (Optional) PPT values:
    true_positive = 20
    false_positive = 10
    false_negative = 30
    '''
    print("True Positive: ", true_positive,
          "\nFalse Positive: ", false_positive,
          "\nFalse Negative: ", false_negative, "\n")
    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

retrieved_set = set(["doc1", "doc2", "doc3"])
relevant_set = set(["doc1", "doc4"])

precision, recall, f_measure = calculate_metrics(retrieved_set, relevant_set)
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F-measure: {f_measure}")

Output:
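For the sets above, the values can be checked by hand: TP = |{doc1}| = 1, FP = |{doc2, doc3}| = 2, FN = |{doc4}| = 1, so precision = 1/3 ≈ 0.333, recall = 1/2 = 0.5, and F-measure = 2 × (1/3) × (1/2) / (1/3 + 1/2) = 0.4.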
B. Use an evaluation toolkit to measure average precision and other evaluation metrics.

Code:

from sklearn.metrics import average_precision_score

y_true = [0, 1, 1, 0, 1, 1]

y_scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

average_precision = average_precision_score(y_true, y_scores)

print(f"Average precision-recall score: {average_precision}")

Output:
PRACTICAL NO : 5
Name: Class:TYCS Date:

Subject: Information Retrieval Roll no:

Aim: Clustering for Information Retrieval

Implement a clustering algorithm (e.g., K-means or hierarchical clustering).

Apply the clustering algorithm to a set of documents and evaluate the clustering results.

Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "Cats are known for their agility and grace",
    "Dogs are often called ‘man’s best friend’.",
    "Some dogs are trained to assist people with disabilities.",
    "The sun rises in the east and sets in the west.",
    "Many cats enjoy climbing trees and chasing toys.",
]

# Represent each document as a TF-IDF vector, then cluster with K-means
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print(kmeans.labels_)
Output:
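The aim also asks to evaluate the clustering results, which the listing does not do. A minimal sketch, assuming scikit-learn's silhouette_score (a standard scikit-learn metric; choosing it here is this sketch's assumption):

from sklearn.metrics import silhouette_score

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
score = silhouette_score(X, kmeans.labels_)
print(f"Silhouette score: {score:.3f}")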
PRACTICAL NO : 6
Name: Class:TYCS Date:

Subject: Information Retrieval Roll no:

Aim: Web Crawling and Indexing

A) Develop a web crawler to fetch and index web pages.

B) Handle challenges such as robots.txt, dynamic content, and crawling delays.

Code:

import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def get_html(url):
    # Fetch a page with a browser-like User-Agent, returning None on failure
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
               "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 "
               "Safari/537.3"}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP Error: {errh}")
    except requests.exceptions.RequestException as err:
        print(f"Request Error: {err}")
    return None

def save_robots_txt(url):
    # Download the site's robots.txt and cache it locally
    try:
        robots_url = urljoin(url, '/robots.txt')
        robots_content = get_html(robots_url)
        if robots_content:
            with open('robots.txt', 'wb') as file:
                file.write(robots_content.encode('utf-8-sig'))
    except Exception as e:
        print(f"Error saving robots.txt: {e}")

def load_robots_txt():
    try:
        with open('robots.txt', 'rb') as file:
            return file.read().decode('utf-8-sig')
    except FileNotFoundError:
        return None

def extract_links(html, base_url):
    # Collect absolute URLs from all anchor tags on the page
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for link in soup.find_all('a', href=True):
        absolute_url = urljoin(base_url, link['href'])
        links.append(absolute_url)
    return links

def is_allowed_by_robots(url, robots_content):
    parser = RobotFileParser()
    parser.parse(robots_content.split('\n'))
    return parser.can_fetch('*', url)

def crawl(start_url, max_depth=3, delay=1):
    visited_urls = set()

    def recursive_crawl(url, depth, robots_content):
        if depth > max_depth or url in visited_urls or not is_allowed_by_robots(url, robots_content):
            return
        visited_urls.add(url)
        time.sleep(delay)  # politeness delay between requests
        html = get_html(url)
        if html:
            print(f"Crawling {url}")
            links = extract_links(html, url)
            for link in links:
                recursive_crawl(link, depth + 1, robots_content)

    save_robots_txt(start_url)
    robots_content = load_robots_txt()
    if not robots_content:
        print("Unable to retrieve robots.txt. Crawling without restrictions.")
        robots_content = ''  # empty rule set: everything is allowed
    recursive_crawl(start_url, 1, robots_content)

crawl('https://wikipedia.com', max_depth=2, delay=2)

Output:
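The crawler above uses a fixed politeness delay. robots.txt can also advertise a per-site Crawl-delay; a small sketch that honors it, using the standard RobotFileParser.crawl_delay method (the helper name and default value are this sketch's assumptions):

def get_crawl_delay(robots_content, default_delay=1):
    # Return the site's Crawl-delay for any user agent, or a default.
    parser = RobotFileParser()
    parser.parse(robots_content.split('\n'))
    delay = parser.crawl_delay('*')  # None if robots.txt sets no Crawl-delay
    return delay if delay is not None else default_delay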
PRACTICAL NO : 7
Name: Class:TYCS Date:

Subject: Information Retrieval Roll no:

Aim: Link Analysis and PageRank

A) Implement the PageRank algorithm to rank web pages based on link analysis.

B) Apply the PageRank algorithm to a small web graph and analyse the results.

Code:

import numpy as np

def page_rank(graph, damping_factor=0.85, max_iterations=100, tolerance=1e-6):
    num_nodes = len(graph)
    page_ranks = np.ones(num_nodes) / num_nodes  # start from a uniform distribution
    for _ in range(max_iterations):
        prev_page_ranks = np.copy(page_ranks)
        for node in range(num_nodes):
            # Nodes whose outgoing links point at this node
            incoming_links = [i for i, v in enumerate(graph) if node in v]
            if not incoming_links:
                continue
            page_ranks[node] = (1 - damping_factor) / num_nodes + \
                damping_factor * sum(prev_page_ranks[link] / len(graph[link])
                                     for link in incoming_links)
        # Stop once the ranks change by less than the tolerance
        if np.linalg.norm(page_ranks - prev_page_ranks, 2) < tolerance:
            break
    return page_ranks

if __name__ == "__main__":
    # Adjacency lists: web_graph[i] holds the pages that page i links to
    web_graph = [
        [1, 2],
        [0, 2],
        [0, 1],
        [1, 2],
    ]
    result = page_rank(web_graph)
    for i, pr in enumerate(result):
        print(f"Page {i}: {pr}")


Output:
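As a sanity check, the same graph can be run through networkx's built-in pagerank function (a standard networkx API, assuming networkx is installed); its values should be close to those of the implementation above:

import networkx as nx

# Build a directed graph from the same adjacency lists and compare.
G = nx.DiGraph()
for src, targets in enumerate(web_graph):
    for dst in targets:
        G.add_edge(src, dst)
print(nx.pagerank(G, alpha=0.85))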
