Information Retrieval Practical

These are the practicals and documentation of the Information Retrieval subject

Uploaded by

dummyvesit49

Yash Pahlani D17B 49

Aim: To implement any Information Retrieval modeling technique

Theory:

Information Retrieval (IR) modeling techniques are essential for efficient and
accurate extraction of relevant information from vast document repositories. By
analyzing and structuring data, these techniques facilitate ranking and presentation
of documents in alignment with a user's query. The choice of IR technique depends
on factors such as query complexity, document collection size, and desired
precision-recall trade-offs, highlighting the diverse strategies available to optimize
information retrieval processes.

Here are some commonly used IR modeling techniques:

Boolean Model: The Boolean Model is a fundamental and straightforward
approach to information retrieval. It treats documents and queries as sets of terms
(words), and it uses Boolean operators (AND, OR, NOT) to combine these sets. In
this model, a document is either considered relevant (1) or not relevant (0) to a
query. The Boolean Model provides a way to express complex queries using
logical operators.

Vector Space Model (VSM): The Vector Space Model represents documents and
queries as vectors in a high-dimensional space, where each dimension corresponds
to a term. Terms are typically weighted using techniques like TF-IDF to reflect
their importance in the document. The relevance between a query vector and a
document vector is often computed using the cosine similarity.

Probabilistic Models: Probabilistic models approach information retrieval from a
statistical perspective, estimating the probability that a document is relevant to a
given query. These models aim to find a balance between precision and recall by
ranking documents based on their likelihood of relevance.

Vector Space Model (VSM):

The Vector Space Model (VSM) is a fundamental technique in Information
Retrieval (IR) that transforms textual data into a geometric framework. In this
model, documents and queries are represented as vectors in a high-dimensional
space, with each dimension corresponding to a unique term. By measuring the
similarity between these vectors, the VSM assesses the relevance of documents to
user queries. Originally proposed by Gerard Salton in the 1960s, the VSM has
since become a cornerstone of modern IR systems.

Working of Vector Space Model:

The VSM transforms documents and queries into numerical vectors within a
high-dimensional space. Each dimension corresponds to a unique term in the
vocabulary. The key steps in the VSM's working are:

Term Frequency (TF) Calculation: For each document and query, the frequency of
each term is computed. This forms the term frequency vector.
The term frequency is computed with respect to the i-th term and the j-th document:
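
The formula itself did not survive in this copy. A common normalized definition, assumed here rather than taken from the original, divides the raw count of term i in document j by the total number of terms in that document:

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
```

Here n_{i,j} is the number of occurrences of term i in document j, and the sum in the denominator runs over every term k in document j.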


Inverse Document Frequency (IDF) Calculation: The inverse document frequency
of each term is determined, measuring how rare, and therefore how
discriminating, the term is across the entire document collection.
The inverse document frequency takes into consideration the i-th term and all
the documents in the collection:
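
The formula is missing from this copy as well. The textbook definition, given here as an assumption about what the original showed, is:

```latex
\mathrm{idf}_{i} = \log \frac{N}{\lvert \{\, d_j : t_i \in d_j \,\} \rvert}
```

N is the total number of documents and the denominator is the number of documents that contain term t_i. Implementations often use variants of this formula: scikit-learn's TfidfVectorizer, for example, applies a smoothed form by default, so its numeric values differ from the plain definition.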

Term Weighting (TF-IDF): The product of term frequency and inverse document
frequency results in the TF-IDF score, which reflects the significance of each term
within a document or query.
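
As an illustration of the three steps so far, here is a minimal sketch that computes TF, IDF, and their product by hand. The toy corpus, the helper name tf_idf, and the choice of the plain (unsmoothed) definitions are assumptions made for this example; they are not taken from the practical's own code.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "computer science algorithm",
    "science experiment observation",
    "social science culture",
]
tokenized = [doc.split() for doc in docs]
N = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: (count / total) * math.log(N / df[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 4) for t, v in w.items()})
```

Because "science" occurs in every document of this toy corpus, its IDF is log(3/3) = 0 and its weight vanishes, while a term such as "computer" that occurs in only one document keeps a positive weight.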

Vector Creation: Each document and query is represented as a vector, where each
dimension corresponds to a term and the value is its corresponding TF-IDF score.

Cosine Similarity: The relevance between documents and queries is assessed using
the cosine similarity between their respective vectors. Documents with higher
cosine similarities are considered more relevant.
Cosine similarity is computed using:
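
The formula image is missing from this copy. The standard definition for a query vector q and a document vector d_j is $\cos(q, d_j) = \frac{q \cdot d_j}{\lVert q \rVert \, \lVert d_j \rVert}$. A minimal sketch of that computation over sparse term-weight dictionaries follows; the function name cosine_similarity_dict and the example weights are hypothetical.

```python
import math

def cosine_similarity_dict(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction; treat as no match
    return dot / (norm_a * norm_b)

# Hypothetical weights, chosen only to show the behaviour.
query = {"computer": 1.0, "science": 1.0}
doc_a = {"computer": 0.5, "science": 0.5, "algorithm": 0.7}
doc_b = {"chemistry": 0.8, "atoms": 0.6}

print(round(cosine_similarity_dict(query, doc_a), 4))
print(round(cosine_similarity_dict(query, doc_b), 4))
```

A document that shares no terms with the query scores 0, and a document whose vector points in the same direction as the query scores 1, regardless of document length.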

Advantages of Vector Space Model

Partial Matching: The VSM accommodates partial keyword matches, allowing
relevant documents to be retrieved even if they share only a subset of terms with
the query.

Term Importance: TF-IDF captures term importance, emphasizing rare and
distinctive terms over common ones.

Ranking: Cosine similarity provides a natural ranking mechanism, presenting the
most relevant documents first.

Disadvantages of Vector Space Model

Semantic Gap: The VSM treats terms as independent dimensions, so it has no
understanding of word semantics and struggles to capture context, synonyms, and
meaning.

High-Dimensional Space: As the vocabulary grows, the dimensionality of the
space increases, which raises memory and computation costs.

Query Sparsity: Short queries or those with few relevant terms may result in
imprecise retrieval.

Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data used by newer NLTK releases (3.9+)
nltk.download('stopwords')
# Sample corpus
corpus = [
    'In computer science artificial intelligence sometimes called '
    'machine intelligence is intelligence demonstrated by machines',
    'Experimentation calculation and Observation is called science',
    'Physics is a natural science that involves the study of matter '
    'and its motion through space and time, along with related concepts '
    'such as energy and force',
    'In mathematics and computer science an algorithm is a finite '
    'sequence of well-defined computer-implementable instructions',
    'Chemistry is the scientific discipline involved with elements '
    'and compounds composed of atoms, molecules and ions',
    'Biochemistry is the branch of science that explores the chemical '
    'processes within and related to living organisms',
    'Sociology is the study of society, patterns of social '
    'relationships, social interaction, and culture that surrounds '
    'everyday life',
]

# Preprocess and clean the corpus: lowercase, tokenize, keep alphanumeric tokens
cleaned_corpus = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    tokens = [word for word in tokens if word.isalnum()]
    cleaned_corpus.append(' '.join(tokens))

# Create a TF-IDF vectorizer and build one TF-IDF vector per document
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(cleaned_corpus)

# Query
query = 'computer science'
query_tokens = word_tokenize(query.lower())

# Remove stopwords from the query (the stopword set is built once and reused)
stop_words = set(stopwords.words('english'))
query = ' '.join([word for word in query_tokens if word not in stop_words])

# Transform the query into a vector using the same vectorizer
query_vector = vectorizer.transform([query])

# Calculate cosine similarity between every document and the query
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(doc_vectors, query_vector).flatten()

# Get document indices sorted from most to least similar
related_docs_indices = cosine_similarities.argsort()[::-1]

# Print related documents with cosine similarity values
for i in related_docs_indices:
    tokens = word_tokenize(cleaned_corpus[i])
    filtered_tokens = [word for word in tokens if word not in stop_words]
    data = ' '.join(filtered_tokens)
    similarity_value = cosine_similarities[i]
    print(f"Similarity: {similarity_value:.4f} - {data}")

Output:

Boolean Information Retrieval Model

The Boolean Information Retrieval Model is a fundamental approach in the field of
information retrieval, which involves searching for and retrieving relevant documents
from a collection based on user-defined queries. It operates on the principles of Boolean
logic, which was developed by George Boole in the mid-19th century. The Boolean
model is particularly effective for precise, structured searches, making it well-suited for
certain types of information retrieval tasks.

Working of Boolean Information Retrieval Model

The Boolean Information Retrieval Model operates based on a set of principles derived
from Boolean logic. The key components that define how the model works include
Boolean operators, queries, and documents. Here's a step-by-step explanation of how the
Boolean Information Retrieval Model works:

Document Indexing: Before searching can begin, a collection of documents is indexed.
This involves parsing each document to extract individual terms (words), removing
stopwords (common words like "and," "the," "is"), and creating an index that maps each
term to the documents where it appears.

Query Formulation: Users create queries by combining terms and Boolean operators
(AND, OR, NOT). The terms are the keywords or phrases that users want to search for
within the document collection. The Boolean operators define how the terms are related
and help narrow down or broaden the search.

Boolean Operators:

AND: When users use the AND operator, they are specifying that documents must
contain all the terms connected by AND. This narrows down the search to documents that
satisfy all the conditions.

OR: The OR operator retrieves documents that contain at least one of the terms connected
by OR. It broadens the search by including documents that meet any of the specified
conditions.

NOT: The NOT operator excludes documents that contain the term following it. It refines
the search results by excluding unwanted documents.

Retrieval of Documents: The indexing structure allows the system to efficiently retrieve
documents that match the terms and Boolean operators specified in the query.
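
The steps above map directly onto set operations over posting lists. The following is a minimal sketch under stated assumptions: the index contents, the helper postings, and the variable names are hypothetical and only loosely mirror the sample documents used in the code below.

```python
# Hypothetical inverted index: term -> set of document ids.
index = {
    "taj": {1},
    "monument": {1, 2},
    "victoria": {2},
    "agra": {3},
}
all_docs = {1, 2, 3}

def postings(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())

both = postings("taj") & postings("monument")        # taj AND monument
either = postings("taj") | postings("agra")          # taj OR agra
excluded = postings("monument") - postings("taj")    # monument AND NOT taj
complement = all_docs - postings("monument")         # NOT monument

print(both, either, excluded, complement)
```

AND becomes set intersection, OR becomes union, and NOT becomes a set difference, either against another posting set or against the full set of document ids.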

Advantages of Boolean Information Retrieval Model

Precision Control: Users can precisely define search criteria using Boolean operators,
ensuring accurate retrieval of specific information.

Structured Queries: Ideal for systematic searches where exact term matches are critical,
such as legal or scientific research.

Consistent Results: The same query always produces the same results, ensuring
reproducibility.

Disadvantages of Boolean Information Retrieval Model

No Relevance Ranking: Lacks the ability to rank documents by relevance, leading to
potential difficulties in identifying more important results.

Limited Language Handling: Struggles with variations in language, such as synonyms or
related terms, which can result in missed information.

Complex Query Construction: Formulating intricate queries with multiple terms and
operators can be complex and error-prone.

Binary Output: Documents are classified as either relevant or irrelevant, lacking the
nuance of degrees of relevance.

Code:
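import string
from collections import OrderedDict

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # tokenizer data required by word_tokenize
nltk.download('punkt_tab')  # tokenizer data used by newer NLTK releases (3.9+)
nltk.download('stopwords')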

# Sample documents in the corpus
documents = [
    "Taj Mahal is a beautiful monument",
    "Victoria Memorial is also a monument",
    "I like to visit Agra"
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

dictionary = {}  # term -> list of document numbers (posting list)

for doc_id, doc_text in enumerate(documents):
    tokens = word_tokenize(doc_text.lower())

    # Remove stopwords and punctuation
    tokens = [t for t in tokens
              if t not in stop_words and t not in string.punctuation]

    stemmedwords = [stemmer.stem(token) for token in tokens]

    for term in stemmedwords:
        if term not in dictionary:
            dictionary[term] = []
        # Add 1 to doc_id to match document numbering; skip duplicate entries
        if doc_id + 1 not in dictionary[term]:
            dictionary[term].append(doc_id + 1)

# Sort the index alphabetically by term
OrderedDictionary = OrderedDict(sorted(dictionary.items()))

print("Inverted Index:")
for term, posting_list in OrderedDictionary.items():
    print(term, posting_list)

query = input("Enter query: ")


query = word_tokenize(query.lower())
query = [stemmer.stem(word) for word in query]

print("Processed Query:", query)


result_set = set(range(1, len(documents) + 1))

i = 0
while i < len(query):
    term = query[i]

    # Operators are applied strictly left to right, without precedence.
    # An operator directly followed by another operator is not supported.
    if term in ('and', 'or', 'not'):
        if i + 1 >= len(query):
            print("Ignoring trailing operator:", term)
            break
        i += 1
        next_postings = set(OrderedDictionary.get(query[i], []))
        if term == 'and':
            result_set &= next_postings
        elif term == 'or':
            result_set |= next_postings
        else:  # 'not' excludes documents containing the next term
            result_set -= next_postings
    else:
        result_set = set(OrderedDictionary.get(term, []))

    i += 1

if result_set:
    print("\nMatching Documents:", result_set)
else:
    print("\nNo matching documents.")

Output:

AND

NOT

OR

Conclusion
While the Boolean Information Retrieval Model offers precision and structured searches,
it cannot rank results by relevance or retrieve documents that only partially match a
query. The Vector Space Model addresses these limitations by weighting terms with
TF-IDF and ranking documents by cosine similarity, which makes it the more adaptable
approach, though capturing context and user intent remains a challenge for both models.
