
Information Retrieval

Assignment 3

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science


University of Engineering and Technology
Lahore Pakistan
Introduction
The program described here is a small document retrieval and ranking system. It builds an inverted index and a binary term-document matrix over a collection of text files, lets users type search queries, and retrieves the documents in which the query terms appear.

Purpose of the Program:

The main purpose of this program is to showcase a basic document retrieval and ranking system. It serves as a simple example demonstrating key information retrieval concepts. The primary goals and components of the program include:

1. Document Preprocessing: The program processes a set of text documents located in a specified directory. It tokenizes the documents into sentences and words, removes punctuation and unwanted characters, and tags words with their parts of speech using the Natural Language Toolkit (NLTK).
2. Inverted Index and Binary Term-Document Matrix: The program creates an inverted index, a data structure that maps terms to the documents in which they appear. It also constructs a binary term-document matrix, which records the presence or absence of each term in each document. These data structures enable efficient document retrieval (see the toy example after this list).
3. Query Processing: Users can input search queries. The program tokenizes and processes these queries, preparing them for matching against the indexed documents.
4. Scoring Documents: The program scores documents based on the number of query terms they contain. Documents containing more query terms receive higher scores, indicating their potential relevance to the query.
5. Ranking and Presentation: Documents are ranked by their scores, and the top-K documents are presented to the user. This lets users quickly identify and access the most relevant documents for their queries.
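
To make these two structures concrete, here is a toy illustration over a hypothetical two-document corpus (invented for illustration, not taken from the program's own data):

# Toy corpus (hypothetical):
#   d1.txt: "cats chase mice"        d2.txt: "dogs chase cats"
#
# Inverted index -- each stemmed term maps to the documents containing it:
#   {'cat': ['d1.txt', 'd2.txt'], 'chase': ['d1.txt', 'd2.txt'],
#    'mice': ['d1.txt'], 'dog': ['d2.txt']}
#
# Binary term-document matrix -- 1 marks presence, 0 absence:
#             d1.txt   d2.txt
#   cat         1        1
#   chase       1        1
#   mice        1        0
#   dog         0        1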

Installation and Setup:


Python: Ensure you have Python installed on your system. This tool is compatible with
Python 3.

NLTK Library: Install the NLTK library if you haven't already. You can install it using the
following command:
pip install nltk
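
Depending on your NLTK installation, you may also need a one-time download of the tokenizer, stopword, and tagger data the program relies on (resource names can vary across NLTK versions):

import nltk
nltk.download('punkt')                        # sentence and word tokenizers
nltk.download('stopwords')                    # English stopword list
nltk.download('averaged_perceptron_tagger')   # part-of-speech tagger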

Running the Tool: Place your text documents in the same directory as the tool. Save the
code in a Python file (e.g., text_search.py). You can run the tool by executing the Python
script.
python text_search.py

Explanation and Guide

Imports (Libraries)
import os
import string
import math
import nltk
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

The "Imports" section includes various Python libraries and modules that are used in the
program. Each import statement serves a specific purpose and contributes to the
functionality of the code. Here's an explanation of each import statement and its role:

import os

● os is a Python module that provides a way to interact with the operating system. In
this program, it is used to manipulate file paths and directories, specifically to access
and process text documents stored in a directory.

import string

● string is a module in Python's standard library that provides a collection of string constants and functions for text manipulation. In this program, it is used to access the set of punctuation characters to remove from text documents.

import math

● math is a standard Python module that provides mathematical functions and constants. In this program, it is used to perform mathematical operations, such as calculating the Euclidean norm for score normalization.
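
For instance, Euclidean-norm score normalization (the kind of operation math supports; this step is not exercised by the code listed below) looks like this:

import math

scores = {'d1': 3.0, 'd2': 4.0}                           # hypothetical raw scores
norm = math.sqrt(sum(s * s for s in scores.values()))     # Euclidean norm = 5.0
normalized = {d: s / norm for d, s in scores.items()}     # {'d1': 0.6, 'd2': 0.8}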

import nltk

● nltk stands for the Natural Language Toolkit, which is a powerful library for natural
language processing (NLP) and text analysis in Python. It is used extensively in this
program for text preprocessing, tokenization, part-of-speech tagging, and stemming.

from collections import defaultdict

● collections is a Python module that provides specialized container datatypes. This program imports the defaultdict class, which creates dictionaries with default values for missing keys, a convenience for inverted indexing and word counting.
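
For illustration, defaultdict removes the need for explicit "is this key present?" checks when counting:

from collections import defaultdict

counts = defaultdict(int)             # missing keys default to 0
for w in ['run', 'run', 'jump']:
    counts[w] += 1                    # no 'if w not in counts' check needed
print(dict(counts))                   # {'run': 2, 'jump': 1}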

from nltk.corpus import stopwords

● nltk.corpus.stopwords provides a list of common English stopwords. Stopwords are words that are often removed from text data because they are considered non-informative (e.g., "the," "and," "in"). They are used here to filter out common words during text preprocessing.

from nltk.stem import PorterStemmer

● nltk.stem.PorterStemmer is a stemming algorithm included in the NLTK library. Stemming is the process of reducing words to their root form (e.g., "running" to "run"). The Porter stemmer is used to normalize words in the text for indexing and retrieval.

from nltk.tokenize import word_tokenize, sent_tokenize

● nltk.tokenize.word_tokenize breaks text into individual words, and nltk.tokenize.sent_tokenize splits text into sentences. Both are used when preprocessing documents in create_index below, making the content easier to process and analyze.

Variables

stop_words = set(stopwords.words('english'))
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}
stemmer = PorterStemmer()

stop_words = set(stopwords.words('english'))

● Explanation: The variable stop_words is assigned a set of English stopwords using NLTK's stopwords.words('english'). These stopwords will be used to filter out common words from the text documents being processed. This filtering helps reduce the size of the inverted index and keeps the focus on content-carrying words.

unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}

● Explanation: The variable unwanted_chars is a set of characters that are considered unwanted and should be removed from the text before processing: various forms of quotes, dashes, and ellipses. If additional unwanted characters are identified, they can be added to this set.

stemmer = PorterStemmer()
● Explanation: Here, an instance of the Porter Stemmer is initialized as the variable stemmer. The Porter Stemmer reduces words to their root or base form, so different inflections of a word (e.g., "running" and "runs," which both stem to "run") are treated as the same term during indexing and searching. Stemming is heuristic, so not every related form collapses to the same stem, but it improves the match rate between queries and the inverted index and thereby the quality of the search results.
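
A quick illustration (outputs of the Porter algorithm; exact stems are heuristic):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('running'))    # 'run'
print(stemmer.stem('runs'))       # 'run'
print(stemmer.stem('runner'))     # 'runner' -- agent nouns often keep their suffix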

Functions

def create_index(dir_path)
def create_index(dir_path):
# Initialize an empty dictionary for the inverted index
inverted_index = {}

1. def create_index(dir_path): This line defines a Python function called create_index. It takes one argument, dir_path, which is the path to the directory containing the text documents you want to index. This function is responsible for building the inverted index and word counts for each document.
2. # Initialize an empty dictionary for the inverted index: This comment explains the purpose of the next line of code, which initializes an empty dictionary named inverted_index to store the inverted index.
3. inverted_index = {}: This line creates an empty Python dictionary called inverted_index. Inverted indexing is a technique used for text retrieval in which words are associated with the documents they appear in. This dictionary will store those associations.
def create_index(dir_path):
# Initialize an empty dictionary for the inverted index
inverted_index = {}
# Initialize a dictionary to store word counts per document
word_counts_per_document = {}

1. # Initialize a dictionary to store word counts per document: This comment explains that the following line initializes a dictionary to store word counts for each document in the directory.
2. word_counts_per_document = {}: This line creates an empty dictionary called word_counts_per_document, used to keep track of the frequency of each word within each document, essentially counting how many times each word appears in each text file. It is crucial for later search and retrieval operations.

    for filename in os.listdir(dir_path):
        if filename.endswith('.txt'):
            try:
                with open(os.path.join(dir_path, filename), 'r',
                          encoding='utf8') as file:
                    sentences = sent_tokenize(file.read().lower())

                word_counts = {}
                for sentence in sentences:
                    sentence_without_punctuation = "".join(
                        [char for char in sentence
                         if char not in string.punctuation
                         and char not in unwanted_chars])
                    words = word_tokenize(sentence_without_punctuation)
                    tagged_words = nltk.pos_tag(words)

                    # For each word, if it's a noun or verb, stem it and add an
                    # entry in the inverted index pointing to this filename
                    for word, pos in tagged_words:
                        if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD',
                                   'VBG', 'VBN', 'VBP'] and word not in stop_words:
                            stemmed_word = stemmer.stem(word)
                            if stemmed_word not in inverted_index:
                                inverted_index[stemmed_word] = []
                            inverted_index[stemmed_word].append(filename)

                            # Update word counts for this document
                            if stemmed_word not in word_counts:
                                word_counts[stemmed_word] = 1
                            else:
                                word_counts[stemmed_word] += 1

                # Store word counts for this document
                word_counts_per_document[filename] = word_counts

            except UnicodeDecodeError:
                print(f"Skipping file {filename} due to UnicodeDecodeError")

    return inverted_index, word_counts_per_document

1. for filename in os.listdir(dir_path): This line sets up a loop that iterates over each file in the directory specified by dir_path. The os.listdir() function returns a list of all files and directories in the given directory, and this loop iterates through the file names.
2. if filename.endswith('.txt'): This line checks if the current filename
ends with the ".txt" extension, which typically indicates a text file.
3. try: This line begins a try-except block to handle potential errors during file
processing.
4. with open(os.path.join(dir_path, filename), 'r',
encoding='utf8') as file: Within the try block, this line opens the current text
file for reading. It uses os.path.join() to create the full path to the file by
combining dir_path with the filename. The file is opened in text mode ('r') and
with the 'utf8' encoding to handle text files encoded in UTF-8.
5. sentences = sent_tokenize(file.read().lower()): This line reads the
content of the file using file.read(), converts the content to lowercase using
.lower(), and then uses sent_tokenize (from NLTK) to split the content into a
list of sentences. This step prepares the text for further processing.
6. word_counts = {}: This line creates an empty dictionary called word_counts to
store word frequencies for the current document. This dictionary will be populated in
the following steps.
7. for sentence in sentences: This line sets up a loop to iterate over each
sentence in the sentences list.
8. sentence_without_punctuation = "".join([char for char in
sentence if char not in string.punctuation and char not in
unwanted_chars]): This line removes punctuation and unwanted characters from
the current sentence. It creates a new string called
sentence_without_punctuation by joining characters that are not in
string.punctuation or unwanted_chars.
9. words = word_tokenize(sentence_without_punctuation): This line
tokenizes the sentence_without_punctuation into a list of words using the
word_tokenize function from NLTK.
10. tagged_words = nltk.pos_tag(words): This line uses nltk.pos_tag to tag
each word in words with its part of speech. The result is stored in the
tagged_words list of word-tag pairs.
11. # For each word, if it's a noun or verb, stem it and add an
entry in the inverted index pointing to this filename: This
comment explains that the code will process each word in the current sentence,
checking if it's a noun or verb, and then stemming it before associating it with the
current filename in the inverted index.
12. for word, pos in tagged_words: This line sets up a loop to iterate over each
word and its corresponding part of speech in the tagged_words list.
13. if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word not in stop_words: This line checks two conditions for each word:
● Whether the word's part of speech (pos) is in the specified list of noun and
verb POS tags. If it is, it's considered for further processing.
● Whether the word is not in the set of stop_words, which are common words
that are often filtered out in text analysis.
14. stemmed_word = stemmer.stem(word): If a word passes the previous
conditions, it is stemmed using the Porter Stemmer. The stemmed word is stored in
the variable stemmed_word.
15. if stemmed_word not in inverted_index: This line checks if the
stemmed_word is not already in the inverted_index.
16. inverted_index[stemmed_word] = []: If the word is not in the inverted index,
it initializes an empty list as the value for that word in the inverted index.
17. inverted_index[stemmed_word].append(filename): Regardless of whether
the word was already in the inverted index or not, it appends the filename of the
current document to the list associated with the stemmed_word. This associates the
word with the document where it appears in the inverted index.
18. if stemmed_word not in word_counts: This line checks if the
stemmed_word is not in the word_counts dictionary.
19. word_counts[stemmed_word] = 1: If the word is not in word_counts, it
initializes it with a count of 1, indicating that this word has been found once in the
current document.
20. else: If the word is already in word_counts, this block of code is executed.
21. word_counts[stemmed_word] += 1: It increments the count for the word in
word_counts to indicate that the word has been found again in the current
document.
22. # Store word counts for this document: This comment explains that the
code is about to store the word counts for the current document.
23. word_counts_per_document[filename] = word_counts: This line stores
the word_counts dictionary (word counts for the current document) in the
word_counts_per_document dictionary with the filename as the key. This
associates the word counts with the document.
24. except UnicodeDecodeError: This is an exception handler that catches
UnicodeDecodeError exceptions. This exception occurs when a file cannot be
decoded using the specified encoding, which can happen when processing text files
with non-standard encodings.
25. print(f"Skipping file {filename} due to UnicodeDecodeError"): If
a UnicodeDecodeError is raised, this line prints a message indicating that the file
is being skipped due to this encoding-related error.
26. return inverted_index, word_counts_per_document: This line returns two
values as a tuple:
● inverted_index: This is a dictionary containing the inverted index, where
each stemmed word is associated with a list of filenames where it appears.
● word_counts_per_document: This is a dictionary containing word counts
for each document, showing how many times each word appears in each
document.
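
To illustrate the shape of these return values, here is a hypothetical example (filenames and contents invented for illustration):

# Hypothetical return values of create_index:
inverted_index = {
    'cat':   ['doc1.txt', 'doc2.txt', 'doc2.txt'],  # appended once per occurrence
    'chase': ['doc1.txt'],
}
word_counts_per_document = {
    'doc1.txt': {'cat': 1, 'chase': 1},
    'doc2.txt': {'cat': 2},
}

Note that the code appends the filename once per matching word occurrence, so a document that contains a term several times appears several times in that term's posting list.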

def represent_query(query, inverted_index)

def represent_query(query, inverted_index):
    # Tokenize and stem the query
    words = word_tokenize(query.lower())
    stemmed_words = [stemmer.stem(word) for word in words
                     if word not in stop_words]

    query_vector = {}
    for term in inverted_index.keys():
        if term in stemmed_words:
            query_vector[term] = 1
        else:
            query_vector[term] = 0

    return query_vector
Explanation

def represent_query(query, inverted_index):
    # Tokenize and stem the query
    words = word_tokenize(query.lower())

1. The represent_query function takes two parameters: query, which is the user's
search query, and inverted_index, which is the dictionary that stores the inverted
index of terms in the documents.
2. In this line, the query is converted to lowercase using query.lower(). This
ensures that the query is case-insensitive, so it can match terms regardless of their
letter casing. The result is a lowercase version of the query.
3. The word_tokenize function is used to tokenize the lowercase query. It breaks the
query into individual words and stores them in the words list.

stemmed_words = [stemmer.stem(word) for word in words if word not in stop_words]

4. In this line, the code iterates through each word in the words list. For each word, it
checks if it's not in the stop_words set, ensuring that common stopwords are
filtered out.
5. If a word is not in stop_words, it is stemmed using the stemmer.stem(word)
function, which reduces the word to its root form. The list of stemmed words is stored
in the stemmed_words variable.

query_vector = {}

6. An empty dictionary, query_vector, is created. This dictionary will represent the query vector, where each term from the inverted index corresponds to its presence or absence in the query.

for term in inverted_index.keys():
    if term in stemmed_words:
        query_vector[term] = 1
    else:
        query_vector[term] = 0

7. Here, a loop iterates through each term in the inverted_index. It checks if each
term is present in the stemmed_words list, which contains the stemmed and filtered
words from the user's query.
8. If a term is found in stemmed_words, it's assigned a weight of 1 in the
query_vector, indicating that it's part of the query.
9. If a term is not found in stemmed_words, it's assigned a weight of 0 in the
query_vector, indicating that it's not part of the query.

return query_vector

10. Finally, the query_vector is returned, representing the presence or absence of


each term from the inverted index in the user's query.
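
A quick usage sketch (the vocabulary below is hypothetical; actual stems depend on the Porter algorithm):

inverted_index = {'cat': ['d1.txt'], 'chase': ['d1.txt'], 'dog': ['d2.txt']}
print(represent_query('cats chase', inverted_index))
# -> {'cat': 1, 'chase': 1, 'dog': 0}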

rank_documents(scores)

def rank_documents(scores):
    # Sort the documents by their scores in descending order
    ranked_documents = sorted(scores.items(), key=lambda x: x[1],
                              reverse=True)
    return ranked_documents

1. The rank_documents function takes a scores dictionary as input, where the keys
are document names and the values are their corresponding scores.
2. The sorted function is used to sort the scores dictionary items (document, score)
based on the score (x[1]) in descending order (reverse=True).
3. The sorted items are stored in the ranked_documents variable.
4. The function returns the ranked_documents, which is a list of document-score
pairs sorted by score in descending order.
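
For example, with a hypothetical scores dictionary:

scores = {'doc1.txt': 3, 'doc2.txt': 1, 'doc3.txt': 2}
print(rank_documents(scores))
# -> [('doc1.txt', 3), ('doc3.txt', 2), ('doc2.txt', 1)]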

retrieve_top_k_documents(ranked_documents, k)

def retrieve_top_k_documents(ranked_documents, k):
    # Select the top-K documents from the ranked list
    top_k_documents = ranked_documents[:k]
    return top_k_documents

1. The retrieve_top_k_documents function takes two parameters: ranked_documents, which is a list of document-score pairs sorted by score in descending order, and k, which is the number of top documents to retrieve.
2. It uses list slicing (ranked_documents[:k]) to select the first k elements from the
ranked_documents list, which represents the top-K documents based on their
scores.
3. The selected top-K documents are stored in the top_k_documents variable.
4. The function returns top_k_documents, which is a list of the top-K document-score
pairs.
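
Continuing the hypothetical example from rank_documents:

ranked = [('doc1.txt', 3), ('doc3.txt', 2), ('doc2.txt', 1)]
print(retrieve_top_k_documents(ranked, 2))
# -> [('doc1.txt', 3), ('doc3.txt', 2)]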
present_results(top_k_documents)

def present_results(top_k_documents):
    # Print the top-K documents and their scores
    for doc, score in top_k_documents:
        print(f"Document: {doc}, Score: {score}")

1. The present_results function takes top_k_documents as input, which is a list of document-score pairs representing the top-K documents.
2. It iterates through each item in the top_k_documents list using a for loop. Each
item consists of a document name (doc) and its corresponding score (score).
3. For each document-score pair, the function prints a formatted message using an
f-string. The message displays the document name and its score.
4. As a result, the function prints information about the top-K documents and their
scores.
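
With the hypothetical top-2 list from above, the printed output would be:

present_results([('doc1.txt', 3), ('doc3.txt', 2)])
# Document: doc1.txt, Score: 3
# Document: doc3.txt, Score: 2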

Flow of Code:
This part of the code snippet is where the main functionality of the program is executed. It
allows users to interact with the search system, enter queries, and retrieve relevant
documents. Let's break it down:

dir_path = os.path.dirname(os.path.abspath(__file__))

inverted_index, binary_td_matrix = create_index(dir_path)

1. dir_path is assigned the absolute path of the directory containing the program file (__file__ represents the current script's file path).
2. inverted_index and binary_td_matrix are assigned the two values returned by the create_index function for the documents in the specified directory.
3. The inverted index stores terms as keys and the documents they appear in as values. The second return value, which the script names binary_td_matrix, is the per-document word-count dictionary; a nonzero count marks a term as present in a document (weight 1), and absence from the dictionary marks it as absent (weight 0), which is how it serves as the binary term-document matrix.

print(inverted_index)

4. The code prints the inverted index to the console. This is for debugging or
informational purposes, displaying the terms and their associated documents in the
inverted index.

while True:
    query = input("Enter a search query (or 'exit' to quit): ")
    if query.lower() == 'exit':
        break

    query_vector = represent_query(query, inverted_index)
    scores = score_documents(query_vector, binary_td_matrix)
    ranked_documents = rank_documents(scores)
    top_k_documents = retrieve_top_k_documents(ranked_documents, 2)
    present_results(top_k_documents)

5. This part starts a loop that allows the user to interact with the search system until they decide to exit by typing 'exit'.
6. Inside the loop, it reads the user's search query using the input function. If the query is 'exit', the loop breaks and the program ends.
7. If the user enters a search query, the program proceeds with the following steps:
○ query_vector is generated by calling the represent_query function,
which creates a vector representation of the query.
○ scores are calculated by calling the score_documents function, which computes relevance scores for documents based on the query (this function's definition is not listed in this document; a sketch follows below).
○ ranked_documents contains the top documents, sorted by their relevance
scores, as returned by the rank_documents function.
○ top_k_documents retrieves the top two documents from the ranked list
using the retrieve_top_k_documents function.
○ Finally, the relevant documents and their scores are presented to the user
using the present_results function.

This loop allows users to perform multiple search queries and obtain results for each query
until they decide to exit the program by typing 'exit'.
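
Since score_documents is called above but not listed in this document, here is a minimal sketch consistent with the behavior described earlier (one point per query term a document contains; argument names are assumed from the calls above):

def score_documents(query_vector, binary_td_matrix):
    # binary_td_matrix is assumed to be the second value returned by
    # create_index, i.e., a mapping of filename -> {term: count}.
    scores = {}
    for doc, term_counts in binary_td_matrix.items():
        # A document earns one point for each query term it contains.
        scores[doc] = sum(weight for term, weight in query_vector.items()
                          if weight == 1 and term in term_counts)
    return scores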
Data Flow Diagram
Block Diagram
