Ir Practical

The document provides code for several practical exercises involving natural language processing and information retrieval tasks: (1) code to demonstrate bitwise operations by building a term-document incidence matrix from a corpus of plays and computing the bitwise AND of selected term vectors; (2) an implementation of the PageRank algorithm that computes rank values for the nodes of a graph over multiple iterations; (3) a dynamic programming solution for the Levenshtein edit distance between two strings; (4) a program that computes the cosine similarity between two text documents after stop word removal; (5) MapReduce-style code to count the frequency of each letter across a dataset in a case-insensitive manner. It also includes exercises on stop word removal, a simple web crawler, and XML parsing with topic-specific PageRank.

TYCS SEMESTER VI INFORMATION RETRIEVAL PRACTICAL MANUAL

PRACTICAL NO:-01
AIM: WRITE A PROGRAM TO DEMONSTRATE BITWISE OPERATION.

CODE:
plays={"Anthony and Cleopatra":"Anthony is there, Brutus is Caeser is with Cleopatra mercy
worser.",
"Julius Ceaser":"Anthony is there, Brutus is Caeser is but Calpurnia is.",
"The Tempest":"mercy worser","Hamlet":"Caeser and Brutus are present with mercy and
worser",
"Othello":"Caeser is present with mercy and worser","Macbeth":"Anthony is there,
Caeser, mercy."}
words=["Anthony","Brutus","Caeser","Calpurnia","Cleopatra","mercy","worser"]
vector_matrix=[[0 for i in range(len(plays))] for j in range(len(words))]

text_list=list((plays.values()))

for i in range(len(words)):
for j in range(len(text_list)):
if words[i] in text_list[j]:
vector_matrix[i][j]=1
else:
vector_matrix[i][j]=0

for i in vector_matrix:
print(i)
result=[]

string_list=[]
for vector in vector_matrix:
mystring = ""
for digit in vector:
mystring += str(digit)
string_list.append(int(mystring,2))
#print(string_list)

print("The output is ",bin(string_list[0]&string_list[1]&(string_list[2])).replace("0b",""))

OUTPUT:


PRACTICAL NO:-02
AIM: IMPLEMENT PAGE RANK ALGORITHM.

CODE:

import numpy as np
import scipy as sc
import pandas as pd
from fractions import Fraction

def display_format(my_vector, my_decimal):
    # round the rank vector for readable printing (np.float was removed from NumPy; plain float works)
    return np.round(my_vector.astype(float), decimals=my_decimal)

my_dp = Fraction(1, 3)

# column-stochastic link matrix of the 3-node web graph
Mat = np.matrix([[0, 0, 1],
                 [Fraction(1, 2), 0, 0],
                 [Fraction(1, 2), 1, 0]])

# teleportation matrix: every entry 1/3
Ex = np.zeros((3, 3))
Ex[:] = my_dp

Damp = 0.7
# damped transition matrix: follow links with probability 0.7, teleport with 0.3
Al = Damp * Mat + ((1 - Damp) * Ex)

# initial rank vector: 1/3 for each node
r = np.matrix([my_dp, my_dp, my_dp])
r = np.transpose(r)
previous_r = r

for i in range(1, 100):
    r = Al * r
    print(display_format(r, 3))
    # stop once the rank vector no longer changes
    if (previous_r == r).all():
        break
    previous_r = r

print("Final:\n", display_format(r, 3))
print("sum", np.sum(r))

OUTPUT:

[[0.333]
 [0.217]
 [0.45 ]]
[[0.415]
 [0.217]
 [0.368]]
[[0.358]
 [0.245]
 [0.397]]
[[0.378]
 [0.225]
 [0.397]]
[[0.378]
 [0.232]
 [0.39 ]]
[[0.373]
 [0.232]
 [0.395]]
[[0.376]
 [0.231]
 [0.393]]
[[0.375]
 [0.232]
 [0.393]]
[[0.375]
 [0.231]
 [0.394]]
[[0.375]
 [0.231]
 [0.393]]
[[0.375]
 [0.231]
 [0.393]]


PRACTICAL NO:-03
AIM: IMPLEMENT DYNAMIC PROGRAMMING ALGORITHM FOR COMPUTING
THE EDIT DISTANCE BETWEEN STRINGS S1 AND S2. (HINT. LEVENSHTEIN
DISTANCE)

CODE:
def editDistance(str1, str2, m, n):
    # base cases: if one string is empty, insert all remaining characters of the other
    if m == 0:
        return n
    if n == 0:
        return m
    # last characters match: no edit needed for them
    if str1[m-1] == str2[n-1]:
        return editDistance(str1, str2, m-1, n-1)
    return 1 + min(editDistance(str1, str2, m, n-1),   # Insert
                   editDistance(str1, str2, m-1, n),   # Remove
                   editDistance(str1, str2, m-1, n-1)) # Replace

str1 = "sunday"
str2 = "saturday"
print(editDistance(str1, str2, len(str1), len(str2)))
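The recursive version above recomputes the same subproblems many times and runs in exponential time. Since the aim calls for dynamic programming, here is a minimal tabulated sketch of the same Levenshtein recurrence (editDistanceDP is an illustrative name, not part of the manual's listing):

def editDistanceDP(str1, str2):
    m, n = len(str1), len(str2)
    # dp[i][j] = edit distance between str1[:i] and str2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                dp[i][j] = j          # insert all of str2[:j]
            elif j == 0:
                dp[i][j] = i          # remove all of str1[:i]
            elif str1[i-1] == str2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(dp[i][j-1],    # insert
                                   dp[i-1][j],    # remove
                                   dp[i-1][j-1])  # replace
    return dp[m][n]

print(editDistanceDP("sunday", "saturday"))  # prints 3, matching the recursive version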
OUTPUT:


PRACTICAL NO:-04
AIM: WRITE A PROGRAM TO COMPUTE SIMILARITY BETWEEN TWO TEXT
DOCUMENTS.
CODE:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# X = input("Enter first string: ").lower()
# Y = input("Enter second string: ").lower()
X = open('file1.txt', 'r').read()
Y = open('file2.txt', 'r').read()

# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)

# sw contains the list of stopwords
sw = stopwords.words('english')
l1 = []; l2 = []

# remove stop words from each string
X_set = {w for w in X_list if not w in sw}
Y_set = {w for w in Y_list if not w in sw}

# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    if w in X_set: l1.append(1)  # create binary term vectors
    else: l1.append(0)
    if w in Y_set: l2.append(1)
    else: l2.append(0)

c = 0
# cosine formula: dot product divided by the product of the vector norms
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity: ", cosine)

OUTPUT:


PRACTICAL NO:-05
AIM: WRITE A MAP-REDUCE PROGRAM TO COUNT THE NUMBER OF
OCCURRENCES OF EACH ALPHABETIC CHARACTER IN THE GIVEN DATASET.
THE COUNT FOR EACH LETTER SHOULD BE CASE-INSENSITIVE (I.E., INCLUDE
BOTH UPPER-CASE AND LOWER-CASE VERSIONS OF THE LETTER; IGNORE
NON-ALPHABETIC CHARACTERS).

CODE:

Text="""MapReduce is a processing technique and a program model for distributed


computing based on java. The MapReduce algorithm contains two important tasks, namely
Map and Reduce. Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). Secondly, reduce task,
which takes the output from a map as an input and combines those data tuples into a smaller
set of tuples. As the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
Map stage − The map or mapper’s job is to process the input
data. Generally the input data is in the form of file or directory
and is stored in the Hadoop file system (HDFS). The input file is
passed to the mapper function line by line. The mapper processes
the data and creates several small chunks of data.
Reduce stage − This stage is the combination of the Shuffle
stage and the Reduce stage. The Reducer’s job is to process the
data that comes from the mapper. After processing, it produces a
new set of output, which will be stored in the HDFS.
"""
# Cleaning text and lower casing all words
for char in '-.,\n':
Text=Text.replace(char,' ')
Text = Text.lower()# split returns a list of words delimited by sequences of whitespace
(including tabs, newlines, etc, like re's \s)


word_list = Text.split()

from collections import Counter
# most_common() would give the same counts; its result is not stored here
Counter(word_list).most_common()

# Initializing dictionary
d = {}
# counting number of times each word comes up in list of words (in dictionary)
for word in word_list:
    d[word] = d.get(word, 0) + 1

# reverse the keys and values so they can be sorted as tuples
word_freq = []
for key, value in d.items():
    word_freq.append((value, key))
word_freq.sort(reverse=True)
print(word_freq)
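The code above produces word frequencies, while the aim asks for per-letter counts. A minimal map-reduce-style sketch for case-insensitive letter frequencies (assuming the same Text variable; Counter plays the role of the reducer):

from collections import Counter

# map phase: emit one lower-cased letter per alphabetic character, ignoring everything else
mapped = (ch.lower() for ch in Text if ch.isalpha())

# reduce phase: aggregate the emitted letters into per-letter counts
letter_freq = Counter(mapped)
for letter, count in sorted(letter_freq.items()):
    print(letter, count)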

OUTPUT:


PRACTICAL NO:-06
AIM: WRITE A PROGRAM FOR PRE-PROCESSING OF A TEXT DOCUMENT: STOP
WORD REMOVAL.
CODE:

1. Install nltk

!pip install nltk

2. Download stopwords in nltk

import nltk
nltk.download("stopwords")


import nltk
from nltk.corpus import stopwords
set(stopwords.words('english'))

3. Download punkt in nltk


import nltk
nltk.download('punkt')

4. Stopwords coding
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence,showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]
# the loop below rebuilds filtered_sentence; it is equivalent to the comprehension above
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)


print(word_tokens)
print(filtered_sentence)
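Note that NLTK's stopword list is lower case, so capitalized tokens such as "This" pass through the filter above. A minimal tweak (assuming the word_tokens and stop_words built above) lower-cases each token before testing it:

# NLTK stopwords are lower case, so compare against the lower-cased token
filtered_lower = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_lower)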

OUTPUT:

PRACTICAL NO:-07
AIM: WRITE A PROGRAM TO IMPLEMENT SIMPLE WEB CRAWLER.
A Web Crawler is a program that navigates the Web and finds new or updated pages for indexing. The Crawler starts with seed websites or a wide range of popular URLs (also known as the frontier) and searches in depth and width for hyperlinks to extract.

A Web Crawler must be kind and robust. Kindness for a Crawler means that it respects the rules set by robots.txt and avoids visiting a website too often. Robustness refers to the ability to avoid spider traps and other malicious behavior. Other good attributes for a Web Crawler are distribution across multiple machines, expandability, continuity, and the ability to prioritize based on page quality.
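For the robots.txt half of kindness, Python's standard urllib.robotparser can check a URL before it is fetched; a minimal sketch (the URLs are illustrative):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://en.wikipedia.org/robots.txt")
rp.read()
# only crawl the page if the site's robots.txt permits it for our user agent
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Information_retrieval"))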
Steps to create a web crawler
The basic steps to write a Web Crawler are (a minimal sketch of this loop follows the list):

● Pick a URL from the frontier
● Fetch the HTML code
● Parse the HTML to extract links to other URLs
● Check if you have already crawled the URLs and/or if you have seen the same content before; if not, add it to the index
● For each extracted URL, confirm that it agrees to be checked (robots.txt, crawling frequency)
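The listing below fetches and parses a single page; as a complement, here is a minimal, hypothetical sketch of the crawl loop described above (the seed URL and page limit are illustrative, and politeness delays are omitted for brevity):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

frontier = ["https://en.wikipedia.org/wiki/Information_retrieval"]  # illustrative seed
seen = set(frontier)
index = {}

while frontier and len(index) < 10:          # page limit keeps the sketch short
    url = frontier.pop(0)                    # pick a URL from the frontier
    html = requests.get(url).text            # fetch the HTML code
    index[url] = html                        # add the page to the index
    soup = BeautifulSoup(html, 'lxml')       # parse the HTML to extract links
    for a in soup.find_all('a', href=True):
        link = urljoin(url, a['href'])
        if link.startswith('http') and link not in seen:  # skip already-seen URLs
            seen.add(link)
            frontier.append(link)

print(len(index), "pages crawled")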

CODE:

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/States_and_union_territories_of_India"
res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')

states = []
# skip the header row, then take the first cell of every row of the wikitable
for items in soup.find('table', class_='wikitable').find_all('tr')[1::1]:
    data = items.find_all(['th', 'td'])
    #print(data[0].text)
    states.append(data[0].text)
print(states)

OUTPUT:

PRACTICAL NO:-08
AIM: WRITE A PROGRAM TO PARSE XML TEXT, GENERATE WEB GRAPH AND
COMPUTE TOPIC SPECIFIC PAGE RANK.

CODE:

XML file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>


<root testAttr="testValue">
The Tree
<children>
<child name="Jack">First</child>


<child name="Rose">Second</child>
<child name="Blue Ivy">
Third
<grandchildren>
<data>One</data>
<data>Two</data>
<unique>Twins</unique>
</grandchildren>
</child>
<child name="Jane">Fourth</child>
</children>
</root>

import xml.etree.ElementTree as ET

tree = ET.parse('items.xml')
root = tree.getroot()

# print the text of every grandchild of the root
print('Expertise Data:')
for elem in root:
    for subelem in elem:
        print(subelem.text)
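The parser above covers the XML half of the aim. For the web-graph and topic-specific PageRank half, a minimal sketch (assuming a hypothetical graph.xml of <page>/<link> elements and the networkx library) could look like:

import xml.etree.ElementTree as ET
import networkx as nx

# hypothetical graph.xml: <graph><page name="A"><link>B</link>...</page></graph>
G = nx.DiGraph()
for page in ET.parse('graph.xml').getroot():
    for link in page.findall('link'):
        G.add_edge(page.get('name'), link.text)

# topic-specific PageRank biases the teleport step towards the topic pages
topic = {'A': 1}  # hypothetical topic set
print(nx.pagerank(G, alpha=0.85, personalization=topic))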

OUTPUT:

Expertise Data:
First
Second
Third
Fourth
