Ir Task

The document outlines a lab activity on Boolean retrieval methods in information retrieval, comparing approaches with and without preprocessing. It details the processes of creating inverted indexes, performing Boolean queries, and the impact of text preprocessing techniques like stemming and stopword removal on search accuracy. The conclusion emphasizes that preprocessing enhances search efficiency and accuracy by standardizing text input.

Uploaded by

mariamafzaal45

DEPARTMENT OF CREATIVE TECHNOLOGIES

NAME: MARIAM AFZAAL


REG ID: 231139
CLASS: BS AI IV ‘B’
SUBJECT: INFORMATION RETRIEVAL
SUBMITTED TO: MA’AM FAIZA QAMAR
LAB ACTIVITY:

Boolean Retrieval without Preprocessing:


1. index_without_preprocessing = {}
   for doc_id, text in Chapter_1.items():
       # Split on whitespace only; tokens keep their case and punctuation
       for word in text.split():
           index_without_preprocessing.setdefault(word, []).append(doc_id)

Explanation:

• It loops through all documents in Chapter_1 and processes each word.
• The words are stored directly in index_without_preprocessing without any modifications.
• Each word is mapped to a list of document IDs in which it appears.
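
To see what "without any modifications" means in practice, the sketch below builds the same kind of index over two made-up documents. The docs dictionary is a hypothetical stand-in for Chapter_1, whose contents are not shown here. Case variants and words with attached punctuation end up as separate keys, and a word repeated within one document adds its ID more than once.

```python
# Hypothetical toy corpus standing in for Chapter_1 (not the lab's data).
docs = {
    1: "Praise be to Allah, the Merciful the Compassionate",
    2: "allah is Merciful, most Merciful",
}

raw_index = {}
for doc_id, text in docs.items():
    for word in text.split():
        raw_index.setdefault(word, []).append(doc_id)

print(raw_index["the"])        # [1, 1]  (same ID stored twice)
print(raw_index["Allah,"])     # [1]     (comma is part of the key)
print(raw_index["allah"])      # [2]     (lowercase form is a different key)
print("Allah" in raw_index)    # False
```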

2. allah_docs = set(index_without_preprocessing.get("Allah", []))
   compassionate_docs = set(index_without_preprocessing.get("Compassionate", []))
   result = allah_docs & compassionate_docs
   print("Without Preprocessing (Allah & Compassionate):", result)

Explanation:

• Finds documents containing "Allah" and "Compassionate" separately.
• Uses the & (AND) operator to find the intersection of both sets.
• The result is a set of document IDs containing both words.
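
The effect of the & operator, and of the [] default passed to .get(), can be illustrated with a hand-written index (the entries below are invented for illustration). A term missing from the index yields an empty set, so the AND result is empty as well instead of raising a KeyError.

```python
# Hypothetical hand-written index; the doc IDs are illustrative only.
index = {"Allah": [1, 2, 2], "Compassionate": [2, 3]}

allah_docs = set(index.get("Allah", []))                  # {1, 2}
compassionate_docs = set(index.get("Compassionate", []))  # {2, 3}
both = allah_docs & compassionate_docs                    # {2}

# A lowercase query term is a different key, so it matches nothing.
lower_docs = set(index.get("allah", []))                  # set()
no_match = lower_docs & compassionate_docs                # set()

print(both, no_match)
```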

3. def create_inverted_index(documents):
       inv_index = {}
       for doc_id, text in documents.items():
           for word in text.lower().split():
               word = word.strip(".,!?")  # trim punctuation at both ends
               inv_index.setdefault(word, set()).add(doc_id)
       return inv_index

Explanation:

• Converts all words to lowercase.
• Strips the punctuation marks . , ! ? from the start and end of each word.
• Maps each word to the set of document IDs in which it appears, so an ID is never stored twice.
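
One detail worth noting: str.strip(".,!?") only removes the listed characters from the two ends of a token, not from its middle, and it leaves other punctuation (such as quotes) in place. A token made up entirely of those characters becomes an empty string, which the function above would still add to the index. A small check, using made-up tokens:

```python
# Made-up tokens chosen to show the edge cases of strip(".,!?").
tokens = ["Merciful.", "Allah,", "well-known", "a.b", '"Lord"', "?!"]

cleaned = [t.lower().strip(".,!?") for t in tokens]
print(cleaned)
# ['merciful', 'allah', 'well-known', 'a.b', '"lord"', '']
```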

4. def boolean_retrieval(query, inv_index):
       query_terms = query.lower().split()
       if "and" in query_terms:
           # Query shape: "<term> and <term>"
           term_1_docs = inv_index.get(query_terms[0], set())
           term_2_docs = inv_index.get(query_terms[2], set())
           return term_1_docs & term_2_docs
       elif "or" in query_terms:
           # Query shape: "<term> or <term>"
           term_1_docs = inv_index.get(query_terms[0], set())
           term_2_docs = inv_index.get(query_terms[2], set())
           return term_1_docs | term_2_docs
       else:
           return inv_index.get(query_terms[0], set())

Explanation:

• Converts the query to lowercase and splits it into terms.
• If the query contains "and", it finds the intersection of the documents for the first and third terms.
• If the query contains "or", it finds the union of the documents for the first and third terms.
• If there is only one word, it returns the documents containing that word.
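
Because the function reads its terms from fixed positions (query_terms[0] and query_terms[2]), it handles exactly three query shapes: "a", "a and b", and "a or b". The following is one possible generalisation, not part of the lab code, that evaluates a chain such as "a and b or c" strictly left to right with no operator precedence. The name boolean_retrieval_chain and the toy index are assumptions made for illustration.

```python
def boolean_retrieval_chain(query, inv_index):
    """Evaluate 'a and b or c ...' left to right (no operator precedence)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(inv_index.get(tokens[0], set()))
    # Walk over (operator, term) pairs that follow the first term.
    for op, term in zip(tokens[1::2], tokens[2::2]):
        docs = inv_index.get(term, set())
        if op == "and":
            result &= docs
        elif op == "or":
            result |= docs
        else:
            raise ValueError(f"unknown operator: {op!r}")
    return result

# Hypothetical index for illustration only.
toy_index = {"compassionate": {1, 2}, "merciful": {2, 3}, "lord": {4}}
print(boolean_retrieval_chain("compassionate and merciful or lord", toy_index))
# {2, 4}
```

A consequence of this design, shared with the original function, is that the words "and" and "or" cannot themselves be searched for.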

5. inv_index = create_inverted_index(Chapter_1)
   print("\nInverted Index:\n", inv_index)

Explanation:

• Calls create_inverted_index() to generate the inverted index with basic normalization (lowercasing and punctuation stripping).
• Prints the generated index.

6. query = "compassionate and merciful"
   results = boolean_retrieval(query, inv_index)
   print(f"\nBoolean Retrieval Results for '{query}': {results}")

Explanation:
• Retrieves documents containing both "compassionate" and "merciful" using AND.

7. query = "compassionate or merciful"
   results = boolean_retrieval(query, inv_index)
   print(f"\nBoolean Retrieval Results for '{query}': {results}")

Explanation:
• Retrieves documents containing either "compassionate" or "merciful" using OR.

Boolean Retrieval with Preprocessing:
1. import re
   from nltk.stem import PorterStemmer
   from nltk.corpus import stopwords
   import nltk

   nltk.download('stopwords')
   nltk.download('punkt')

Explanation:

• re is used for text cleaning.
• PorterStemmer is used for stemming words to their root form.
• stopwords are common words (like "and", "the") that are removed.
• nltk.download() ensures the required resources are available.

2. stemmer = PorterStemmer()
   stop_words = set(stopwords.words('english'))

Explanation:

• PorterStemmer() is initialized for word stemming.
• stopwords.words('english') loads a list of common words to ignore; converting it to a set makes membership checks fast.

3. def preprocess_text(text):
       text = text.lower()                  # Convert text to lowercase
       text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
       return text

Explanation:

• Converts text to lowercase to ensure case insensitivity.
• Removes punctuation marks using regex.
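
The pattern [^\w\s] matches every character that is neither a word character (letters, digits, underscore) nor whitespace, and re.sub deletes each match. Unlike strip(), this also removes punctuation inside a token, so "don't" becomes "dont" and "re-use" becomes "reuse", while underscores survive. A quick check on made-up sentences:

```python
import re

def preprocess_text(text):
    text = text.lower()                    # case-insensitive matching
    return re.sub(r"[^\w\s]", "", text)    # drop anything that is not \w or \s

print(preprocess_text("Praise be to Allah, the Lord!"))
# praise be to allah the lord
print(preprocess_text("don't re-use snake_case"))
# dont reuse snake_case
```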

4. def create_inverted_index_with_preprocessing(chapter):
       index = {}
       for doc_id, text in chapter.items():
           text = preprocess_text(text)
           for word in text.split():
               if word not in stop_words and word:    # Stop word removal and empty word check
                   stemmed_word = stemmer.stem(word)  # Stemming
                   index.setdefault(stemmed_word, []).append(doc_id)
       return index

Explanation:

• Preprocesses the text using preprocess_text().
• Splits text into words.
• Removes stopwords and performs stemming.
• Stores the stemmed words in index along with document IDs.
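
Because the index stores document IDs in a list, a stem that occurs several times in one document is recorded several times. That is harmless if the lists are converted to sets at query time, but storing sets directly keeps the index compact. A minimal sketch of that variant follows. To stay runnable without NLTK it takes the stemmer and the stopword set as parameters; toy_stem and toy_stop_words are placeholders invented for this example, not the Porter stemmer or NLTK's list.

```python
import re

def build_index(docs, stem, stop_words):
    """Map each stemmed, non-stopword term to the set of doc IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        cleaned = re.sub(r"[^\w\s]", "", text.lower())
        for word in cleaned.split():
            if word not in stop_words:
                index.setdefault(stem(word), set()).add(doc_id)
    return index

# Placeholders for illustration only (NOT Porter stemming, NOT NLTK's list).
def toy_stem(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

toy_stop_words = {"the", "of", "is"}
toy_docs = {1: "The Lord of the worlds, the worlds", 2: "The world is vast"}

index = build_index(toy_docs, toy_stem, toy_stop_words)
print(index)
# {'lord': {1}, 'world': {1, 2}, 'vast': {2}}
```

With NLTK installed, PorterStemmer().stem and set(stopwords.words('english')) could be passed in place of the two placeholders.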

5. def boolean_retrieval_preprocessed(query, index):
       terms = query.lower().split()
       processed_terms = [stemmer.stem(term) for term in terms if term not in stop_words and term]

       if not processed_terms:
           return set()
       …

Explanation:

• Converts query to lowercase and removes stopwords.
• Stems each query word.
• If "AND" is present, it finds documents common to all terms.
• If "OR" is present, it finds documents containing any of the terms.
• If only one word is given, it returns matching documents.
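
The listing above is truncated, so the part that handles AND and OR is not shown. One point any completion has to deal with: "and" and "or" are themselves in NLTK's English stopword list, so they are absent from processed_terms and the operator has to be read from the unfiltered terms. The sketch below is one way this could be done, not the author's code. The function name boolean_query, the stem and stop_words parameters, and all toy data are assumptions, chosen so the example runs without NLTK.

```python
def boolean_query(query, index, stem, stop_words):
    """AND/OR retrieval over all query terms; operator is detected before filtering."""
    tokens = query.lower().split()
    # Read the operator from the raw tokens, since stopword removal would drop it.
    # Assumption: if both operators appear, the query is treated as AND.
    use_or = "or" in tokens and "and" not in tokens
    terms = [stem(t) for t in tokens
             if t not in ("and", "or") and t not in stop_words]
    if not terms:
        return set()
    # Posting lists may contain duplicates, so convert each one to a set.
    doc_sets = [set(index.get(t, [])) for t in terms]
    return set.union(*doc_sets) if use_or else set.intersection(*doc_sets)

# Placeholder data for illustration only.
toy_index = {"compassion": [1, 2, 2], "merci": [2, 3]}
toy_stems = {"compassionate": "compassion", "merciful": "merci"}
stem = lambda w: toy_stems.get(w, w)   # hand-written stand-in for a stemmer
stop = {"and", "or", "the"}

print(boolean_query("compassionate and merciful", toy_index, stem, stop))  # {2}
print(boolean_query("compassionate or merciful", toy_index, stem, stop))   # {1, 2, 3}
```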

6. inv_index_preprocessed = create_inverted_index_with_preprocessing(Chapter_1)
   print("\nInverted Index with Preprocessing:\n", inv_index_preprocessed)

Explanation:

• Calls create_inverted_index_with_preprocessing() to generate the index.
• Prints the processed inverted index.

7. query = "compassionate and merciful"
   results = boolean_retrieval_preprocessed(query, inv_index_preprocessed)
   print(f"\nBoolean Retrieval Results for '{query}': {results}")

Explanation:

• Retrieves documents containing both "compassionate" and "merciful".

CONCLUSION:
1. Boolean retrieval without preprocessing:

• The text is used as it is, without changing the case, removing punctuation, or filtering common words.
• This can give inaccurate results because “Allah” and “allah” would be treated as different words.

2. Boolean retrieval with preprocessing:

• The text is cleaned (lowercased, punctuation removed, stopwords removed, and words stemmed to their root forms).
• This makes the search more accurate, because documents are found even with slight word variations, and more efficient, because dropping stopwords and merging word forms leaves fewer terms in the index.
