
IR Assignment4

The document presents an assignment on implementing an Information Retrieval (IR) system using the vector space model. It explains the theory behind the vector model, including term frequency and inverse document frequency, and provides a Python program that executes the IR system. The program includes functions for processing documents, calculating term weights, and performing searches based on user queries.

Uploaded by

vinayostwal707

WALCHAND INSTITUTE OF TECHNOLOGY, SOLAPUR

INFORMATION TECHNOLOGY
2021-22 SEMESTER - II
ASSIGNMENT - 4
Subject: Information Retrieval

Name: Ayush pande Roll no: 74 Class: Final Year Btech IT

Title: Implementation of IR system using Vector model

Theory: A representation that is often used for text documents is the vector space
model. In the vector space model, a document D is represented as an m-dimensional
vector, where each dimension corresponds to a distinct term and m is the total number
of terms used in the collection of documents. The document vector is written as
$D = (w_1, w_2, \ldots, w_m)$, where $w_i$ is the weight of term $t_i$ and indicates
its importance. If document D does not contain term $t_i$, then the weight $w_i$ is zero.
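
As an illustration of this representation, the following minimal sketch builds document vectors over a shared vocabulary. The toy corpus and the helper names `build_vocabulary` and `to_vector` are assumptions made for this example, and the raw term count is used as the weight purely for simplicity.

```python
# Hypothetical toy corpus; the texts and names are illustrative only.
docs = {
    "d1": "information retrieval with vector model",
    "d2": "vector space model for retrieval",
}

def build_vocabulary(docs):
    """Return the sorted list of distinct terms across all documents."""
    terms = set()
    for text in docs.values():
        terms.update(text.split())
    return sorted(terms)

def to_vector(text, vocab):
    """Return one weight per vocabulary term; here the weight is the raw count."""
    tokens = text.split()
    return [tokens.count(term) for term in vocab]

vocab = build_vocabulary(docs)  # m = len(vocab) dimensions
vectors = {name: to_vector(text, vocab) for name, text in docs.items()}
```

Every vector has the same length m, and a term that is absent from a document contributes a zero in its dimension.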
In the simplest vector approach the term weights are binary: a term is assigned the
value 1 if it occurs in the document and the value 0 otherwise. A more sophisticated
measure is the tf-idf scheme. In this approach each term is assigned a weight that is
based on how often it appears in a particular document and how frequently it occurs in
the entire document collection. The first part of the tf-idf scheme is called the term
frequency $tf_i$, the number of occurrences of term $t_i$ in document D. The second part
is called the inverse document frequency and is calculated as follows:
$idf_i = \log \frac{n}{df_i}$
• Here n is the total number of documents in the collection and $df_i$ is the number of
documents in which term $t_i$ appears at least once. The weight of term $t_i$ in
document D is the product of the term frequency and the inverse document
frequency: $w_i = tf_i \cdot idf_i$
• The assumptions behind tf-idf are based on two characteristics of text
documents. First, the more times a term appears in a document, the more relevant
it is to the topic of the document. Second, the more documents in the collection
a term occurs in, the more poorly it discriminates between documents.
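
The tf-idf weighting described above can be sketched on a small example. The tokenized toy corpus and the helper names below are assumptions made for illustration, and the logarithm is taken to base 2, as in the program that follows.

```python
import math

# Hypothetical tokenized toy corpus (illustrative, not from the assignment).
docs = {
    "d1": ["vector", "model", "vector"],
    "d2": ["boolean", "model"],
    "d3": ["vector", "space", "model"],
    "d4": ["probabilistic", "model"],
}

def document_frequency(term, docs):
    """Number of documents in which the term appears at least once."""
    return sum(1 for tokens in docs.values() if term in tokens)

def idf(term, docs):
    """Inverse document frequency, log2(n / df); 0.0 for unseen terms."""
    df = document_frequency(term, docs)
    return math.log2(len(docs) / df) if df else 0.0

def tf_idf(term, doc_id, docs):
    """Weight of a term in a document: term frequency times idf."""
    return docs[doc_id].count(term) * idf(term, docs)

weight_vector = tf_idf("vector", "d1", docs)  # tf = 2, idf = log2(4 / 2)
weight_model = tf_idf("model", "d1", docs)    # "model" occurs in every document
```

In this corpus "model" appears in all four documents, so its idf and weight are zero: it cannot discriminate between documents. The term "vector" appears twice in d1 and in only half of the documents, so it receives a positive weight.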

Program Code:
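import glob
import math
import re
import sys
from collections import defaultdict
from functools import reduce

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK stopwords corpus and tokenizer data to be downloaded.
STOPWORDS = set(stopwords.words("english"))
CORPUS = "docs/*"  # glob pattern matching the document files
document_filenames = dict()  # document id -> file path
N = 0  # number of documents in the corpus
vocabulary = set()  # all distinct terms in the corpus
postings = defaultdict(dict)  # term -> {document id: term frequency}
document_frequency = defaultdict(int)  # term -> number of documents containing it
length = defaultdict(float)  # document id -> Euclidean length of its vector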

def main():
    get_corpus()
    initialize_terms_and_postings()
    initialize_document_frequencies()
    initialize_lengths()
    # Keep answering queries until an empty query exits the program.
    while True:
        scores = do_search()
        print_scores(scores)

def get_corpus():
    global document_filenames, N
    documents = glob.glob(CORPUS)
    N = len(documents)
    document_filenames = dict(zip(range(N), documents))

def initialize_terms_and_postings():
    global vocabulary, postings
    for id in document_filenames:
        with open(document_filenames[id], "r") as f:
            document = f.read()
        document = remove_special_characters(document)
        document = remove_digits(document)
        terms = tokenize(document)
        unique_terms = set(terms)
        vocabulary = vocabulary.union(unique_terms)
        # Record how often each term occurs in this document.
        for term in unique_terms:
            postings[term][id] = terms.count(term)

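def tokenize(document):
    terms = word_tokenize(document)
    # Lowercase before the stopword check so that capitalized stopwords
    # such as "The" are filtered out as well.
    terms = [term.lower() for term in terms if term.lower() not in STOPWORDS]
    return terms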

def initialize_document_frequencies():
    global document_frequency
    for term in vocabulary:
        document_frequency[term] = len(postings[term])

def initialize_lengths():
    global length
    for id in document_filenames:
        l = 0
        for term in vocabulary:
            l += term_frequency(term, id) ** 2
        length[id] = math.sqrt(l)

def term_frequency(term, id):
    if id in postings[term]:
        return postings[term][id]
    else:
        return 0.0

def inverse_document_frequency(term):
    if term in vocabulary:
        return math.log(N / document_frequency[term], 2)
    else:
        return 0.0

def print_scores(scores):
    print("-" * 42)
    print("| %s | %-30s |" % ("Score", "Document"))
    print("-" * 42)
    # Only documents with a non-zero score are listed.
    for (id, score) in scores:
        if score != 0.0:
            print("| %s | %-30s |" % (str(score)[:5], document_filenames[id]))
    print("-" * 42, end="\n\n")

def do_search():
    query = tokenize(input("Search query >> "))
    # An empty query ends the program.
    if query == []:
        sys.exit()
    scores = sorted(
        [(id, similarity(query, id)) for id in range(N)],
        key=lambda x: x[1],
        reverse=True,
    )
    return scores

def intersection(sets):
    return reduce(set.intersection, [s for s in sets])

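def similarity(query, id):
    # Sum the tf-idf weights of the query terms found in the document,
    # then normalize by the length of the document vector.
    if length[id] == 0:
        return 0.0  # empty document: avoid division by zero
    score = 0.0
    for term in query:
        if term in vocabulary:
            score += term_frequency(term, id) * inverse_document_frequency(term)
    return score / length[id]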

def remove_special_characters(text):
    regex = re.compile(r"[^a-zA-Z0-9\s]")
    return re.sub(regex, "", text)

def remove_digits(text):
    regex = re.compile(r"\d")
    return re.sub(regex, "", text)

if __name__ == "__main__":
    main()
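
The program reads its corpus from docs/ and depends on NLTK data. To try the scoring logic without either, the following is a minimal sketch that applies the same ranking idea (sum of tf times idf over the query terms, divided by the document vector length) to an in-memory corpus. The corpus `DOCS`, the whitespace tokenizer, and the helper names `build_index` and `rank` are assumptions made for this example, not part of the assignment code.

```python
import math
from collections import Counter

# Hypothetical in-memory corpus standing in for the files under docs/.
DOCS = {
    "a.txt": "the vector model ranks documents by similarity",
    "b.txt": "boolean retrieval returns documents without ranking",
    "c.txt": "term weights use term frequency",
}

def tokenize(text):
    # Simplified stand-in for the NLTK tokenizer: lowercase, split on whitespace.
    return text.lower().split()

def build_index(docs):
    """Return (postings, lengths): term -> {doc: tf} and doc -> vector length."""
    postings, lengths = {}, {}
    for name, text in docs.items():
        counts = Counter(tokenize(text))
        for term, tf in counts.items():
            postings.setdefault(term, {})[name] = tf
        lengths[name] = math.sqrt(sum(tf * tf for tf in counts.values()))
    return postings, lengths

def rank(query, docs, postings, lengths):
    """Score each document as sum(tf * idf) / length, highest first."""
    n = len(docs)
    scores = []
    for name in docs:
        total = 0.0
        for term in tokenize(query):
            if term in postings:
                idf = math.log2(n / len(postings[term]))
                total += postings[term].get(name, 0) * idf
        scores.append((name, total / lengths[name] if lengths[name] else 0.0))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

postings, lengths = build_index(DOCS)
results = rank("vector documents", DOCS, postings, lengths)
for name, score in results:
    print("%-6s %.3f" % (name, score))
```

For the query "vector documents", a.txt contains both terms and ranks first, b.txt contains only the more common term "documents" and ranks second, and c.txt contains neither term and scores zero.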

Screenshots/Output:
