Lab3 IR BIM
Assignment 3
Submitted by:
Saqlain Nawaz 2020-CS-135
Supervised by:
Sir Khaldoon Syed Khurshid
NLTK Library: Install the NLTK library if you haven't already. You can install it using the
following command:
pip install nltk
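In addition to the package itself, the NLTK data used by this tool (stopword lists, the Punkt tokenizer behind word_tokenize, and the part-of-speech tagger) may need to be downloaded once. For example:
import nltk
nltk.download('stopwords')                   # stopword lists used for filtering
nltk.download('punkt')                       # tokenizer models for word_tokenize
nltk.download('averaged_perceptron_tagger')  # POS tagger used by nltk.pos_tag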
Running the Tool: Place your text documents in the same directory as the tool. Save the
code in a Python file (e.g., text_search.py). You can run the tool by executing the Python
script.
python text_search.py
Imports (Libraries)
import os
import string
import math
import nltk
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
The "Imports" section includes various Python libraries and modules that are used in the
program. Each import statement serves a specific purpose and contributes to the
functionality of the code. Here's an explanation of each import statement and its role:
import os
● os is a Python module that provides a way to interact with the operating system. In
this program, it is used to manipulate file paths and directories, specifically to access
and process text documents stored in a directory.
import string
● string provides common string constants such as string.punctuation, which are useful for stripping punctuation from documents and queries during preprocessing.
import math
● math provides basic mathematical functions (e.g., sqrt and log) that can support the document-scoring computations.
import nltk
● nltk stands for the Natural Language Toolkit, a widely used library for natural language processing (NLP) and text analysis in Python. It is used extensively in this program for text preprocessing, tokenization, part-of-speech tagging, and stemming.
Variables
stop_words = set(stopwords.words('english'))
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}
stemmer = PorterStemmer()
● stop_words: the set of English stopwords provided by NLTK. Common words such as "the", "is", and "and" carry little discriminative value for retrieval, so they are filtered out during indexing and query processing.
● unwanted_chars: a set of quotation marks, dashes, and ellipsis characters that should be stripped from the text before indexing.
● stemmer: an instance of the Porter Stemmer, used to reduce words to their root or base form. In this code it ensures that different inflected forms of a word (e.g., "runs," "running," "run") are treated as the same term during indexing and searching, which is particularly important for improving the accuracy of the inverted index and the search results.
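As a small illustration (not part of the submitted script), the stemmer maps several surface forms of a word onto a single index term:
stemmer = PorterStemmer()
for w in ["connection", "connected", "connecting", "runs", "running"]:
    print(w, "->", stemmer.stem(w))
# connection -> connect
# connected -> connect
# connecting -> connect
# runs -> run
# running -> run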
Functions
def create_index(dir_path)
def create_index(dir_path):
    # Initialize an empty dictionary for the inverted index
    inverted_index = {}
    # Process every text document in the directory
    for filename in os.listdir(dir_path):
        if not filename.endswith('.txt'):
            continue
        try:
            with open(os.path.join(dir_path, filename), encoding='utf-8') as f:
                text = f.read()
            # Tokenize (lowercased, to stay consistent with the query handling)
            # and tag each token with its part of speech
            tagged_words = nltk.pos_tag(word_tokenize(text.lower()))
            # For each word, if it's a noun or verb, stem it and add an entry in
            # the inverted index pointing to this filename
            for word, pos in tagged_words:
                if (pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP']
                        and word not in stop_words):
                    stemmed_word = stemmer.stem(word)
                    if stemmed_word not in inverted_index:
                        inverted_index[stemmed_word] = []
                    inverted_index[stemmed_word].append(filename)
        except UnicodeDecodeError:
            print(f"Skipping file {filename} due to UnicodeDecodeError")
    # The full function also builds a binary term-document matrix and returns it
    # alongside the inverted index (sketched below); only the index part is shown here.
    return inverted_index
def represent_query(query, inverted_index)
def represent_query(query, inverted_index):
    # Tokenize the lowercased query
    words = word_tokenize(query.lower())
    # Remove stopwords and stem the remaining query words
    stemmed_words = [stemmer.stem(word) for word in words if word not in stop_words]
    # Build a binary query vector over the vocabulary of the inverted index
    query_vector = {}
    for term in inverted_index:
        if term in stemmed_words:
            query_vector[term] = 1
        else:
            query_vector[term] = 0
    return query_vector
Explanation
1. The represent_query function takes two parameters: query, which is the user's
search query, and inverted_index, which is the dictionary that stores the inverted
index of terms in the documents.
2. In this line, the query is converted to lowercase using query.lower(). This
ensures that the query is case-insensitive, so it can match terms regardless of their
letter casing. The result is a lowercase version of the query.
3. The word_tokenize function is used to tokenize the lowercase query. It breaks the
query into individual words and stores them in the words list.
4. In this line, the code iterates through each word in the words list. For each word, it
checks if it's not in the stop_words set, ensuring that common stopwords are
filtered out.
5. If a word is not in stop_words, it is stemmed using the stemmer.stem(word)
function, which reduces the word to its root form. The list of stemmed words is stored
in the stemmed_words variable.
6. Next, an empty query_vector dictionary is created and the binary query vector is built:
query_vector = {}
for term in inverted_index:
    if term in stemmed_words:
        query_vector[term] = 1
    else:
        query_vector[term] = 0
7. Here, a loop iterates through each term in the inverted_index. It checks if each
term is present in the stemmed_words list, which contains the stemmed and filtered
words from the user's query.
8. If a term is found in stemmed_words, it's assigned a weight of 1 in the
query_vector, indicating that it's part of the query.
9. If a term is not found in stemmed_words, it's assigned a weight of 0 in the
query_vector, indicating that it's not part of the query.
return query_vector
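For instance, with a small hypothetical index (file names made up purely for illustration), the resulting query vector would look like this:
inverted_index = {'cat': ['a.txt'], 'dog': ['b.txt'], 'run': ['a.txt', 'b.txt']}
print(represent_query("running dogs", inverted_index))
# {'cat': 0, 'dog': 1, 'run': 1}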
rank_documents(scores)
def rank_documents(scores):
    # Sort (document, score) pairs by score in descending order
    ranked_documents = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return ranked_documents
1. The rank_documents function takes a scores dictionary as input, where the keys
are document names and the values are their corresponding scores.
2. The sorted function is used to sort the scores dictionary items (document, score)
based on the score (x[1]) in descending order (reverse=True).
3. The sorted items are stored in the ranked_documents variable.
4. The function returns the ranked_documents, which is a list of document-score
pairs sorted by score in descending order.
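A quick example with made-up scores:
print(rank_documents({'a.txt': 1, 'b.txt': 3, 'c.txt': 2}))
# [('b.txt', 3), ('c.txt', 2), ('a.txt', 1)]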
retrieve_top_k_documents(ranked_documents, k)
def retrieve_top_k_documents(ranked_documents, k):
    # Keep only the k highest-scoring documents
    top_k_documents = ranked_documents[:k]
    return top_k_documents
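The main loop below also calls a score_documents function that is not reproduced in this report. A minimal sketch, assuming each document's score is simply the number of query terms it contains (the dot product of the binary query vector with the document's column of the binary term-document matrix), might look like this:
def score_documents(query_vector, binary_td_matrix):
    # Hypothetical reconstruction: count, per document, the query terms it contains
    scores = defaultdict(int)
    for term, weight in query_vector.items():
        if weight == 1:
            for document, present in binary_td_matrix.get(term, {}).items():
                scores[document] += present
    return dict(scores)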
def present_results(top_k_documents):
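    # (Not in the original listing) A minimal body, assuming the function simply
    # prints each retrieved document together with its relevance score:
    for document, score in top_k_documents:
        print(f"{document}: {score}")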
Flow of Code:
This part of the code snippet is where the main functionality of the program is executed. It
allows users to interact with the search system, enter queries, and retrieve relevant
documents. Let's break it down:
dir_path = os.path.dirname(os.path.abspath(__file__))
1. dir_path is assigned the absolute path of the directory containing the program file
(__file__ represents the current script's file path).
inverted_index, binary_td_matrix = create_index(dir_path)
2. inverted_index and binary_td_matrix are assigned the results returned by
the create_index function, which builds an inverted index and a binary
term-document matrix for the documents in the specified directory.
3. The inverted index stores terms as keys and the documents they appear in as
values. The binary term-document matrix stores terms as keys and documents with
binary weights (1 for presence, 0 for absence) as values.
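For example, with two hypothetical documents a.txt and b.txt, the two structures could look like this:
inverted_index = {'cat': ['a.txt'], 'run': ['a.txt', 'b.txt']}
binary_td_matrix = {'cat': {'a.txt': 1, 'b.txt': 0},
                    'run': {'a.txt': 1, 'b.txt': 1}}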
print(inverted_index)
4. The code prints the inverted index to the console. This is for debugging or
informational purposes, displaying the terms and their associated documents in the
inverted index.
while True:
    query = input("Enter a search query (or 'exit' to quit): ")
    if query.lower() == 'exit':
        break
    query_vector = represent_query(query, inverted_index)
    scores = score_documents(query_vector, binary_td_matrix)
    ranked_documents = rank_documents(scores)
    top_k_documents = retrieve_top_k_documents(ranked_documents, 2)
    present_results(top_k_documents)
5. This part starts a loop that allows the user to interact with the search system until
they decide to exit by typing 'exit.'
6. Inside the loop, it reads the user's search query, which is input using the input
function. If the query is 'exit,' the loop breaks, and the program ends.
7. If the user enters a search query, the program proceeds with the following steps:
○ query_vector is generated by calling the represent_query function,
which creates a vector representation of the query.
○ scores are calculated by calling the score_documents function, which
computes relevance scores for documents based on the query.
○ ranked_documents contains all scored documents, sorted by their relevance
scores in descending order, as returned by the rank_documents function.
○ top_k_documents retrieves the top two documents from the ranked list
using the retrieve_top_k_documents function.
○ Finally, the relevant documents and their scores are presented to the user
using the present_results function.
This loop allows users to perform multiple search queries and obtain results for each query
until they decide to exit the program by typing 'exit'.
Data Flow Diagram
Block Diagram