
Information Retrieval

Assignment 3

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science


University of Engineering and Technology
Lahore Pakistan
Introduction
The program described here is a small document retrieval and ranking system. It builds an inverted index and a binary term-document matrix over a collection of text files, lets users type search queries, and retrieves the documents in which the query terms appear.

Purpose of the Program:

The main purpose of this program is to showcase a basic document retrieval and ranking system. It serves as a simple example demonstrating key information retrieval concepts. The primary goals and components of the program include:

1. Document Preprocessing: The program processes a set of text documents located in a specified directory. It tokenizes the documents into sentences and words, removes punctuation and unwanted characters, and tags words with their parts of speech using the Natural Language Toolkit (NLTK).
2. Inverted Index and Binary Term-Document Matrix: The program creates an inverted index, a data structure that maps terms to the documents in which they appear. It also constructs a binary term-document matrix, which records the presence or absence of each term in each document. These data structures enable efficient document retrieval (see the toy example after this list).
3. Query Processing: Users can input search queries. The program tokenizes and processes these queries, preparing them for matching against the indexed documents.
4. Scoring Documents: The program scores documents based on the number of query terms they contain. Documents containing more query terms receive higher scores, indicating their potential relevance to the query.
5. Ranking and Presentation: Documents are ranked by their scores, and the top-K documents are presented to the user. This lets users quickly identify and access the most relevant documents for their queries.
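
To make these two structures concrete, here is a toy illustration over a hypothetical two-document corpus (invented for illustration, not taken from the program's own data):

# Toy corpus (hypothetical):
#   d1.txt: "cats chase mice"        d2.txt: "dogs chase cats"
#
# Inverted index -- each stemmed term maps to the documents containing it:
#   {'cat': ['d1.txt', 'd2.txt'], 'chase': ['d1.txt', 'd2.txt'],
#    'mice': ['d1.txt'], 'dog': ['d2.txt']}
#
# Binary term-document matrix -- 1 marks presence, 0 absence:
#             d1.txt   d2.txt
#   cat         1        1
#   chase       1        1
#   mice        1        0
#   dog         0        1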

Installation and Setup:


Python: Ensure you have Python installed on your system. This tool is compatible with
Python 3.

NLTK Library: Install the NLTK library if you haven't already. You can install it using the
following command:
pip install nltk
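
Depending on your NLTK installation, you may also need a one-time download of the tokenizer, stopword, and tagger data the program relies on (resource names can vary across NLTK versions):

import nltk
nltk.download('punkt')                        # sentence and word tokenizers
nltk.download('stopwords')                    # English stopword list
nltk.download('averaged_perceptron_tagger')   # part-of-speech tagger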

Running the Tool: Place your text documents in the same directory as the tool. Save the
code in a Python file (e.g., text_search.py). You can run the tool by executing the Python
script.
python text_search.py

Explanation and Guide

Imports (Libraries)
import os
import string
import math
import nltk
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

The "Imports" section includes various Python libraries and modules that are used in the
program. Each import statement serves a specific purpose and contributes to the
functionality of the code. Here's an explanation of each import statement and its role:

import os

● os is a Python module that provides a way to interact with the operating system. In
this program, it is used to manipulate file paths and directories, specifically to access
and process text documents stored in a directory.

import string

● string is a module in Python's standard library that provides a collection of string constants and functions for text manipulation. In this program, it is used to access the set of punctuation characters to remove from text documents.

import math

● math is a standard Python module that provides mathematical functions and constants. In this program, it is used to perform mathematical operations, such as calculating the Euclidean norm for score normalization.
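
For instance, Euclidean-norm score normalization (the kind of operation math supports; this step is not exercised by the code listed below) looks like this:

import math

scores = {'d1': 3.0, 'd2': 4.0}                           # hypothetical raw scores
norm = math.sqrt(sum(s * s for s in scores.values()))     # Euclidean norm = 5.0
normalized = {d: s / norm for d, s in scores.items()}     # {'d1': 0.6, 'd2': 0.8}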

import nltk

● nltk stands for the Natural Language Toolkit, which is a powerful library for natural
language processing (NLP) and text analysis in Python. It is used extensively in this
program for text preprocessing, tokenization, part-of-speech tagging, and stemming.

from collections import defaultdict

● collections is a Python module that provides specialized container datatypes. This program imports the defaultdict class, which creates dictionaries with default values for missing keys, a convenience for inverted indexing and word counting.
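
For illustration, defaultdict removes the need for explicit "is this key present?" checks when counting:

from collections import defaultdict

counts = defaultdict(int)             # missing keys default to 0
for w in ['run', 'run', 'jump']:
    counts[w] += 1                    # no 'if w not in counts' check needed
print(dict(counts))                   # {'run': 2, 'jump': 1}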

from nltk.corpus import stopwords

● nltk.corpus.stopwords provides a list of common English stopwords. Stopwords are words that are often removed from text data because they are considered non-informative (e.g., "the," "and," "in"). They are used here to filter out common words during text preprocessing.

from nltk.stem import PorterStemmer

● nltk.stem.PorterStemmer is a stemming algorithm included in the NLTK library. Stemming is the process of reducing words to their root form (e.g., "running" to "run"). The Porter stemmer is used to normalize words in the text for indexing and retrieval.

from nltk.tokenize import word_tokenize, sent_tokenize

● nltk.tokenize.word_tokenize breaks text into individual words, and nltk.tokenize.sent_tokenize splits text into sentences. Both are used when preprocessing documents in create_index below, making the content easier to process and analyze.

Variables

stop_words = set(stopwords.words('english'))
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}
stemmer = PorterStemmer()

stop_words = set(stopwords.words('english'))

● Explanation: The variable stop_words is assigned a set of English stopwords using NLTK's stopwords.words('english'). These stopwords will be used to filter out common words from the text documents being processed. This filtering helps reduce the size of the inverted index and keeps the focus on content-carrying words.

unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}

● Explanation: The variable unwanted_chars is a set of characters that are considered unwanted and should be removed from the text before processing: various forms of quotes, dashes, and ellipses. If additional unwanted characters are identified, they can be added to this set.

stemmer = PorterStemmer()
● Explanation: Here, an instance of the Porter Stemmer is initialized as the variable stemmer. The Porter Stemmer reduces words to their root or base form, so different inflections of a word (e.g., "running" and "runs," which both stem to "run") are treated as the same term during indexing and searching. Stemming is heuristic, so not every related form collapses to the same stem, but it improves the match rate between queries and the inverted index and thereby the quality of the search results.
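
A quick illustration (outputs of the Porter algorithm; exact stems are heuristic):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('running'))    # 'run'
print(stemmer.stem('runs'))       # 'run'
print(stemmer.stem('runner'))     # 'runner' -- agent nouns often keep their suffix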

Functions

def create_index(dir_path)
def create_index(dir_path):
# Initialize an empty dictionary for the inverted index
inverted_index = {}

1. def create_index(dir_path): This line defines a Python function called create_index. It takes one argument, dir_path, which is the path to the directory containing the text documents you want to index. This function is responsible for building the inverted index and word counts for each document.
2. # Initialize an empty dictionary for the inverted index: This comment explains the purpose of the next line of code, which initializes an empty dictionary named inverted_index to store the inverted index.
3. inverted_index = {}: This line creates an empty Python dictionary called inverted_index. Inverted indexing is a technique used for text retrieval in which words are associated with the documents they appear in. This dictionary will store those associations.
def create_index(dir_path):
# Initialize an empty dictionary for the inverted index
inverted_index = {}
# Initialize a dictionary to store word counts per document
word_counts_per_document = {}

1. # Initialize a dictionary to store word counts per document: This comment explains that the following line initializes a dictionary to store word counts for each document in the directory.
2. word_counts_per_document = {}: This line creates an empty dictionary called word_counts_per_document, used to keep track of the frequency of each word within each document, essentially counting how many times each word appears in each text file. It is crucial for later search and retrieval operations.

    for filename in os.listdir(dir_path):
        if filename.endswith('.txt'):
            try:
                with open(os.path.join(dir_path, filename), 'r',
                          encoding='utf8') as file:
                    sentences = sent_tokenize(file.read().lower())

                word_counts = {}
                for sentence in sentences:
                    sentence_without_punctuation = "".join(
                        [char for char in sentence
                         if char not in string.punctuation
                         and char not in unwanted_chars])
                    words = word_tokenize(sentence_without_punctuation)
                    tagged_words = nltk.pos_tag(words)

                    # For each word, if it's a noun or verb, stem it and add an
                    # entry in the inverted index pointing to this filename
                    for word, pos in tagged_words:
                        if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD',
                                   'VBG', 'VBN', 'VBP'] and word not in stop_words:
                            stemmed_word = stemmer.stem(word)
                            if stemmed_word not in inverted_index:
                                inverted_index[stemmed_word] = []
                            inverted_index[stemmed_word].append(filename)

                            # Update word counts for this document
                            if stemmed_word not in word_counts:
                                word_counts[stemmed_word] = 1
                            else:
                                word_counts[stemmed_word] += 1

                # Store word counts for this document
                word_counts_per_document[filename] = word_counts

            except UnicodeDecodeError:
                print(f"Skipping file {filename} due to UnicodeDecodeError")

    return inverted_index, word_counts_per_document

1. for filename in os.listdir(dir_path): This line sets up a loop that iterates over each file in the directory specified by dir_path. The os.listdir() function returns a list of all files and directories in the given directory, and this loop iterates through the file names.
2. if filename.endswith('.txt'): This line checks if the current filename
ends with the ".txt" extension, which typically indicates a text file.
3. try: This line begins a try-except block to handle potential errors during file
processing.
4. with open(os.path.join(dir_path, filename), 'r',
encoding='utf8') as file: Within the try block, this line opens the current text
file for reading. It uses os.path.join() to create the full path to the file by
combining dir_path with the filename. The file is opened in text mode ('r') and
with the 'utf8' encoding to handle text files encoded in UTF-8.
5. sentences = sent_tokenize(file.read().lower()): This line reads the
content of the file using file.read(), converts the content to lowercase using
.lower(), and then uses sent_tokenize (from NLTK) to split the content into a
list of sentences. This step prepares the text for further processing.
6. word_counts = {}: This line creates an empty dictionary called word_counts to
store word frequencies for the current document. This dictionary will be populated in
the following steps.
7. for sentence in sentences: This line sets up a loop to iterate over each
sentence in the sentences list.
8. sentence_without_punctuation = "".join([char for char in
sentence if char not in string.punctuation and char not in
unwanted_chars]): This line removes punctuation and unwanted characters from
the current sentence. It creates a new string called
sentence_without_punctuation by joining characters that are not in
string.punctuation or unwanted_chars.
9. words = word_tokenize(sentence_without_punctuation): This line
tokenizes the sentence_without_punctuation into a list of words using the
word_tokenize function from NLTK.
10. tagged_words = nltk.pos_tag(words): This line uses nltk.pos_tag to tag
each word in words with its part of speech. The result is stored in the
tagged_words list of word-tag pairs.
11. # For each word, if it's a noun or verb, stem it and add an
entry in the inverted index pointing to this filename: This
comment explains that the code will process each word in the current sentence,
checking if it's a noun or verb, and then stemming it before associating it with the
current filename in the inverted index.
12. for word, pos in tagged_words: This line sets up a loop to iterate over each
word and its corresponding part of speech in the tagged_words list.
13. if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word not in stop_words: This line checks two conditions for each word:
● Whether the word's part of speech (pos) is in the specified list of noun and
verb POS tags. If it is, it's considered for further processing.
● Whether the word is not in the set of stop_words, which are common words
that are often filtered out in text analysis.
14. stemmed_word = stemmer.stem(word): If a word passes the previous
conditions, it is stemmed using the Porter Stemmer. The stemmed word is stored in
the variable stemmed_word.
15. if stemmed_word not in inverted_index: This line checks if the
stemmed_word is not already in the inverted_index.
16. inverted_index[stemmed_word] = []: If the word is not in the inverted index,
it initializes an empty list as the value for that word in the inverted index.
17. inverted_index[stemmed_word].append(filename): Regardless of whether
the word was already in the inverted index or not, it appends the filename of the
current document to the list associated with the stemmed_word. This associates the
word with the document where it appears in the inverted index.
18. if stemmed_word not in word_counts: This line checks if the
stemmed_word is not in the word_counts dictionary.
19. word_counts[stemmed_word] = 1: If the word is not in word_counts, it
initializes it with a count of 1, indicating that this word has been found once in the
current document.
20. else: If the word is already in word_counts, this block of code is executed.
21. word_counts[stemmed_word] += 1: It increments the count for the word in
word_counts to indicate that the word has been found again in the current
document.
22. # Store word counts for this document: This comment explains that the
code is about to store the word counts for the current document.
23. word_counts_per_document[filename] = word_counts: This line stores
the word_counts dictionary (word counts for the current document) in the
word_counts_per_document dictionary with the filename as the key. This
associates the word counts with the document.
24. except UnicodeDecodeError: This is an exception handler that catches
UnicodeDecodeError exceptions. This exception occurs when a file cannot be
decoded using the specified encoding, which can happen when processing text files
with non-standard encodings.
25. print(f"Skipping file {filename} due to UnicodeDecodeError"): If
a UnicodeDecodeError is raised, this line prints a message indicating that the file
is being skipped due to this encoding-related error.
26. return inverted_index, word_counts_per_document: This line returns two
values as a tuple:
● inverted_index: This is a dictionary containing the inverted index, where
each stemmed word is associated with a list of filenames where it appears.
● word_counts_per_document: This is a dictionary containing word counts
for each document, showing how many times each word appears in each
document.
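
To illustrate the shape of these return values, here is a hypothetical example (filenames and contents invented for illustration):

# Hypothetical return values of create_index:
inverted_index = {
    'cat':   ['doc1.txt', 'doc2.txt', 'doc2.txt'],  # appended once per occurrence
    'chase': ['doc1.txt'],
}
word_counts_per_document = {
    'doc1.txt': {'cat': 1, 'chase': 1},
    'doc2.txt': {'cat': 2},
}

Note that the code appends the filename once per matching word occurrence, so a document that contains a term several times appears several times in that term's posting list.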

def represent_query(query, inverted_index)

def represent_query(query, inverted_index):
    # Tokenize and stem the query
    words = word_tokenize(query.lower())
    stemmed_words = [stemmer.stem(word) for word in words
                     if word not in stop_words]

    query_vector = {}
    for term in inverted_index.keys():
        if term in stemmed_words:
            query_vector[term] = 1
        else:
            query_vector[term] = 0

    return query_vector
Explanation

def represent_query(query, inverted_index):
    # Tokenize and stem the query
    words = word_tokenize(query.lower())

1. The represent_query function takes two parameters: query, which is the user's
search query, and inverted_index, which is the dictionary that stores the inverted
index of terms in the documents.
2. In this line, the query is converted to lowercase using query.lower(). This
ensures that the query is case-insensitive, so it can match terms regardless of their
letter casing. The result is a lowercase version of the query.
3. The word_tokenize function is used to tokenize the lowercase query. It breaks the
query into individual words and stores them in the words list.

stemmed_words = [stemmer.stem(word) for word in words if word not in stop_words]

4. In this line, the code iterates through each word in the words list. For each word, it
checks if it's not in the stop_words set, ensuring that common stopwords are
filtered out.
5. If a word is not in stop_words, it is stemmed using the stemmer.stem(word)
function, which reduces the word to its root form. The list of stemmed words is stored
in the stemmed_words variable.

query_vector = {}

6. An empty dictionary, query_vector, is created. This dictionary will represent the query vector, where each term from the inverted index corresponds to its presence or absence in the query.

for term in inverted_index.keys():
    if term in stemmed_words:
        query_vector[term] = 1
    else:
        query_vector[term] = 0

7. Here, a loop iterates through each term in the inverted_index. It checks if each
term is present in the stemmed_words list, which contains the stemmed and filtered
words from the user's query.
8. If a term is found in stemmed_words, it's assigned a weight of 1 in the
query_vector, indicating that it's part of the query.
9. If a term is not found in stemmed_words, it's assigned a weight of 0 in the
query_vector, indicating that it's not part of the query.

return query_vector

10. Finally, the query_vector is returned, representing the presence or absence of


each term from the inverted index in the user's query.
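
A quick usage sketch (the vocabulary below is hypothetical; actual stems depend on the Porter algorithm):

inverted_index = {'cat': ['d1.txt'], 'chase': ['d1.txt'], 'dog': ['d2.txt']}
print(represent_query('cats chase', inverted_index))
# -> {'cat': 1, 'chase': 1, 'dog': 0}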

rank_documents(scores)

def rank_documents(scores):
    # Sort the documents by their scores in descending order
    ranked_documents = sorted(scores.items(), key=lambda x: x[1],
                              reverse=True)
    return ranked_documents

1. The rank_documents function takes a scores dictionary as input, where the keys
are document names and the values are their corresponding scores.
2. The sorted function is used to sort the scores dictionary items (document, score)
based on the score (x[1]) in descending order (reverse=True).
3. The sorted items are stored in the ranked_documents variable.
4. The function returns the ranked_documents, which is a list of document-score
pairs sorted by score in descending order.
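
For example, with a hypothetical scores dictionary:

scores = {'doc1.txt': 3, 'doc2.txt': 1, 'doc3.txt': 2}
print(rank_documents(scores))
# -> [('doc1.txt', 3), ('doc3.txt', 2), ('doc2.txt', 1)]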

retrieve_top_k_documents(ranked_documents, k)

def retrieve_top_k_documents(ranked_documents, k):
    # Select the top-K documents from the ranked list
    top_k_documents = ranked_documents[:k]
    return top_k_documents

1. The retrieve_top_k_documents function takes two parameters: ranked_documents, which is a list of document-score pairs sorted by score in descending order, and k, which is the number of top documents to retrieve.
2. It uses list slicing (ranked_documents[:k]) to select the first k elements from the
ranked_documents list, which represents the top-K documents based on their
scores.
3. The selected top-K documents are stored in the top_k_documents variable.
4. The function returns top_k_documents, which is a list of the top-K document-score
pairs.
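
Continuing the hypothetical example from rank_documents:

ranked = [('doc1.txt', 3), ('doc3.txt', 2), ('doc2.txt', 1)]
print(retrieve_top_k_documents(ranked, 2))
# -> [('doc1.txt', 3), ('doc3.txt', 2)]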
present_results(top_k_documents)

def present_results(top_k_documents):
    # Print the top-K documents and their scores
    for doc, score in top_k_documents:
        print(f"Document: {doc}, Score: {score}")

1. The present_results function takes top_k_documents as input, which is a list of document-score pairs representing the top-K documents.
2. It iterates through each item in the top_k_documents list using a for loop. Each
item consists of a document name (doc) and its corresponding score (score).
3. For each document-score pair, the function prints a formatted message using an
f-string. The message displays the document name and its score.
4. As a result, the function prints information about the top-K documents and their
scores.
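
With the hypothetical top-2 list from above, the printed output would be:

present_results([('doc1.txt', 3), ('doc3.txt', 2)])
# Document: doc1.txt, Score: 3
# Document: doc3.txt, Score: 2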

Flow of Code:
This part of the code snippet is where the main functionality of the program is executed. It
allows users to interact with the search system, enter queries, and retrieve relevant
documents. Let's break it down:

dir_path = os.path.dirname(os.path.abspath(__file__))

inverted_index, binary_td_matrix = create_index(dir_path)

1. dir_path is assigned the absolute path of the directory containing the program file (__file__ represents the current script's file path).
2. inverted_index and binary_td_matrix are assigned the two values returned by the create_index function for the documents in the specified directory.
3. The inverted index stores terms as keys and the documents they appear in as values. The second return value, which the script names binary_td_matrix, is the per-document word-count dictionary; a nonzero count marks a term as present in a document (weight 1), and absence from the dictionary marks it as absent (weight 0), which is how it serves as the binary term-document matrix.

print(inverted_index)

4. The code prints the inverted index to the console. This is for debugging or
informational purposes, displaying the terms and their associated documents in the
inverted index.

while True:
    query = input("Enter a search query (or 'exit' to quit): ")
    if query.lower() == 'exit':
        break

    query_vector = represent_query(query, inverted_index)
    scores = score_documents(query_vector, binary_td_matrix)
    ranked_documents = rank_documents(scores)
    top_k_documents = retrieve_top_k_documents(ranked_documents, 2)
    present_results(top_k_documents)

5. This part starts a loop that allows the user to interact with the search system until they decide to exit by typing 'exit'.
6. Inside the loop, it reads the user's search query using the input function. If the query is 'exit', the loop breaks and the program ends.
7. If the user enters a search query, the program proceeds with the following steps:
○ query_vector is generated by calling the represent_query function,
which creates a vector representation of the query.
○ scores are calculated by calling the score_documents function, which computes relevance scores for documents based on the query (this function's definition is not listed in this document; a sketch follows below).
○ ranked_documents contains the top documents, sorted by their relevance
scores, as returned by the rank_documents function.
○ top_k_documents retrieves the top two documents from the ranked list
using the retrieve_top_k_documents function.
○ Finally, the relevant documents and their scores are presented to the user
using the present_results function.

This loop allows users to perform multiple search queries and obtain results for each query
until they decide to exit the program by typing 'exit'.
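
Since score_documents is called above but not listed in this document, here is a minimal sketch consistent with the behavior described earlier (one point per query term a document contains; argument names are assumed from the calls above):

def score_documents(query_vector, binary_td_matrix):
    # binary_td_matrix is assumed to be the second value returned by
    # create_index, i.e., a mapping of filename -> {term: count}.
    scores = {}
    for doc, term_counts in binary_td_matrix.items():
        # A document earns one point for each query term it contains.
        scores[doc] = sum(weight for term, weight in query_vector.items()
                          if weight == 1 and term in term_counts)
    return scores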
Data Flow Diagram
Block Diagram
