IR Assignment4
INFORMATION TECHNOLOGY
2021-22 SEMESTER - II
ASSIGNMENT - 4
Subject: Information Retrieval
Theory: A representation often used for text documents is the vector space
model. In the vector space model, a document D is represented as an m-dimensional
vector, where each dimension corresponds to a distinct term and m is the total number
of terms used in the collection of documents. The document vector is written as
$D = (w_1, w_2, \dots, w_m)$, where $w_i$ is the weight of term $t_i$ and indicates
its importance. If document D does not contain term $t_i$, then the weight $w_i$ is zero.
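As a minimal illustration of this representation, the sketch below maps two toy documents onto a shared term space. The sample documents, the helper name `to_vector`, whitespace tokenization, and the use of raw term counts as weights are all assumptions made for the example, not part of the assignment.

```python
from collections import Counter

# Hypothetical toy collection; whitespace tokenization is an assumption.
docs = {
    "D1": "information retrieval ranks documents",
    "D2": "vector space model ranks documents by similarity",
}

# One dimension per distinct term in the collection (m = len(terms)).
terms = sorted({t for text in docs.values() for t in text.split()})


def to_vector(text):
    """Return the m-dimensional weight vector of a document.

    Raw term counts stand in for the weights here; a term that does not
    occur in the document gets weight 0.
    """
    counts = Counter(text.split())
    return [counts[t] for t in terms]


vectors = {name: to_vector(text) for name, text in docs.items()}
for name, vec in vectors.items():
    print(name, vec)
```

Both vectors have the same length m, so they can be compared dimension by dimension regardless of how long each document is.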
In the simplest form of the vector approach, the term weights are binary and only
indicate whether or not a term appears in a document: a term is assigned the value 1
if it occurs in the document and 0 otherwise. A more sophisticated measure is the
tf-idf scheme. In this approach, each term is assigned a weight based on how often it
appears in a particular document and how frequently it occurs in the entire document
collection. The first part of the tf-idf scheme is the term frequency $tf_i$, the
number of occurrences of term $t_i$ in document D. The second part is the inverse
document frequency, calculated as follows:
$$idf_i = \log \frac{n}{n_i}$$
• Here n is the total number of documents in the collection and $n_i$ is the number
of documents in which term $t_i$ appears at least once (the program code uses a
base-2 logarithm). The weight of term $t_i$ in document D is the product of the
term frequency and the inverse document frequency:
$$w_i = tf_i \cdot idf_i$$
• The assumptions behind tf-idf are based on two characteristics of text
documents. First, the more times a term appears in a document, the more relevant
it is to the topic of the document. Second, the more times a term occurs in all
documents in the collection, the more poorly it discriminates between
documents.
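The weighting can be checked by hand on a small example. The sketch below assumes a base-2 logarithm (matching `math.log(..., 2)` in the program code) and a made-up, pre-tokenized collection of four documents; the names `docs`, `doc_freq`, `idf`, and `tf_idf` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical four-document collection, already tokenized.
docs = [
    ["the", "cat", "cat"],
    ["the", "dog"],
    ["the", "cat", "dog"],
    ["the", "bird", "bird"],
]
n = len(docs)

# n_i: number of documents in which each term appears at least once.
doc_freq = Counter(term for doc in docs for term in set(doc))


def idf(term):
    """Inverse document frequency, log2(n / n_i); 0.0 for unseen terms."""
    return math.log2(n / doc_freq[term]) if doc_freq[term] else 0.0


def tf_idf(term, doc):
    """Weight of a term in a document: term frequency times idf."""
    return doc.count(term) * idf(term)


for i, doc in enumerate(docs):
    weights = {t: tf_idf(t, doc) for t in sorted(set(doc))}
    print(f"D{i + 1}: {weights}")
```

In this toy collection, "the" occurs in every document and therefore receives weight 0, while "bird", which is confined to a single document, gets the largest idf. This mirrors the second assumption above: a term spread across the whole collection discriminates poorly between documents.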
Program Code:
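import glob
import math
import re
import sys
from collections import defaultdict
from functools import reduce

# NLTK supplies the stopword list and tokenizer used below; the corresponding
# NLTK data packages must be downloaded once with nltk.download().
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
CORPUS = "docs/*"  # glob pattern matching the document files

document_filenames = dict()  # document id -> file name
N = 0  # number of documents in the corpus
vocabulary = set()  # all distinct terms in the corpus
postings = defaultdict(dict)  # term -> {document id: term frequency}
document_frequency = defaultdict(int)  # term -> number of documents containing it
length = defaultdict(float)  # document id -> Euclidean length of its tf vector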

def main():
    get_corpus()
    initialize_terms_and_postings()
    initialize_document_frequencies()
    initialize_lengths()
    while True:
        scores = do_search()
        print_scores(scores)

def get_corpus():
    global document_filenames, N
    documents = glob.glob(CORPUS)
    N = len(documents)
    document_filenames = dict(zip(range(N), documents))

def initialize_terms_and_postings():
    global vocabulary, postings
    for id in document_filenames:
        with open(document_filenames[id], "r") as f:
            document = f.read()
        document = remove_special_characters(document)
        document = remove_digits(document)
        terms = tokenize(document)
        unique_terms = set(terms)
        vocabulary = vocabulary.union(unique_terms)
        for term in unique_terms:
            postings[term][id] = terms.count(term)

def tokenize(document):
    # Lowercase before filtering so capitalized stopwords ("The") are removed too.
    terms = [term.lower() for term in word_tokenize(document)]
    return [term for term in terms if term not in STOPWORDS]

def initialize_document_frequencies():
    global document_frequency
    for term in vocabulary:
        document_frequency[term] = len(postings[term])

def initialize_lengths():
    global length
    for id in document_filenames:
        l = 0
        for term in vocabulary:
            l += term_frequency(term, id) ** 2
        length[id] = math.sqrt(l)

def inverse_document_frequency(term):
    if term in vocabulary:
        return math.log(N / document_frequency[term], 2)
    else:
        return 0.0
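
def print_scores(scores):
    print("-" * 42)
    print("| %s | %-30s |" % ("Score", "Document"))
    print("-" * 42)
    # Assumed reconstruction: the row-printing loop is missing from the listing.
    for id, score in scores:
        if score != 0.0:
            print("| %5.2f | %-30s |" % (score, document_filenames[id]))
    print("-" * 42)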

def do_search():
    query = tokenize(input("Search query >> "))
    if not query:
        sys.exit()
    scores = sorted(
        [(id, similarity(query, id)) for id in range(N)],
        key=lambda x: x[1],
        reverse=True,
    )
    return scores
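
def intersection(sets):
    return reduce(set.intersection, [s for s in sets])

# term_frequency() is called by other functions but is missing from the
# listing; this is an assumed reconstruction based on the postings structure.
def term_frequency(term, id):
    return postings[term].get(id, 0) if term in postings else 0

# The "def" line and loop header of similarity() are missing from the listing;
# they are reconstructed here from the call in do_search().
def similarity(query, id):
    score = 0.0
    for term in query:
        if term in vocabulary:
            score += term_frequency(term, id) * inverse_document_frequency(term)
    return score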

def remove_special_characters(text):
    regex = re.compile(r"[^a-zA-Z0-9\s]")
    return re.sub(regex, "", text)

def remove_digits(text):
    regex = re.compile(r"\d")
    return re.sub(regex, "", text)

if __name__ == "__main__":
    main()
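
The listing computes a per-document `length`, but the similarity code shown never uses it. One common use for such a value is to divide each raw score by the document's vector length, so that long documents are not favored simply for containing more terms. The sketch below is self-contained and does not need NLTK; the toy postings, the names `idf`, `doc_length`, and `normalized_score`, and the choice to normalize by the term-frequency length are assumptions for illustration, not the assignment's confirmed method.

```python
import math

# Hypothetical postings: term -> {doc_id: term frequency}.
postings = {
    "cat": {0: 2, 2: 1},
    "dog": {1: 1, 2: 1},
    "bird": {3: 2},
}
N = 4  # number of documents in the toy collection


def idf(term):
    """log2(N / df) for known terms, 0.0 otherwise."""
    return math.log2(N / len(postings[term])) if term in postings else 0.0


def doc_length(doc_id):
    """Euclidean length of the document's term-frequency vector."""
    return math.sqrt(sum(p.get(doc_id, 0) ** 2 for p in postings.values()))


def normalized_score(query, doc_id):
    """Sum of tf * idf over the query terms, divided by the document length."""
    raw = sum(postings.get(t, {}).get(doc_id, 0) * idf(t) for t in query)
    length = doc_length(doc_id)
    return raw / length if length else 0.0


# Rank all documents for a two-term query, best match first.
ranking = sorted(
    ((d, normalized_score(["cat", "dog"], d)) for d in range(N)),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

With this query, document 2 ranks first because it is the only one containing both query terms, and document 3 scores zero because it contains neither.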
Screenshots/Output: