Information Retrieval Practical
Theory:
Information Retrieval (IR) modeling techniques are essential for efficient and
accurate extraction of relevant information from vast document repositories. By
analyzing and structuring data, these techniques facilitate ranking and presentation
of documents in alignment with a user's query. The choice of IR technique depends
on factors such as query complexity, document collection size, and desired
precision-recall trade-offs, highlighting the diverse strategies available to optimize
information retrieval processes.
Vector Space Model (VSM): The Vector Space Model represents documents and
queries as vectors in a high-dimensional space, where each dimension corresponds
to a term. Terms are typically weighted using techniques like TF-IDF to reflect
their importance in the document. The relevance between a query vector and a
document vector is often computed using the cosine similarity.
Yash Pahlani D17B 49
The VSM transforms documents and queries into numerical vectors within a
high-dimensional space, where each dimension corresponds to a unique term in
the vocabulary. The key steps in the VSM are:
Term Frequency (TF) Calculation: For each document and query, the frequency of
each term is computed. This forms the term frequency vector.
The term frequency is computed with respect to the i-th term and j-th document as:

tf(i, j) = n(i, j) / Σ_k n(k, j)

where n(i, j) is the number of times term i occurs in document j, and the
denominator sums the counts of all terms in document j.
Inverse Document Frequency (IDF) Calculation: The inverse document frequency
of each term is determined, representing its importance in the entire document
collection.
The inverse document frequency considers the i-th term across all N documents
in the collection:

idf(i) = log(N / df(i))

where N is the total number of documents and df(i) is the number of documents
containing term i.
Term Weighting (TF-IDF): The product of term frequency and inverse document
frequency results in the TF-IDF score, which reflects the significance of each term
within a document or query.
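The TF-IDF weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the practical's implementation: the three-document corpus is made up for the example, and the log-scaled IDF variant is assumed.

```python
import math

# Illustrative three-document corpus (assumed for this sketch).
docs = [
    "computer science is fun",
    "science of computing",
    "history of art",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    # Term frequency: count of the term divided by the document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # Inverse document frequency: log of total documents over
    # the number of documents containing the term.
    df = sum(1 for d in tokenized if term in d)
    return math.log(N / df) if df else 0.0

def tfidf(term, doc_tokens):
    # TF-IDF: product of the two quantities above.
    return tf(term, doc_tokens) * idf(term)

print(tfidf("science", tokenized[0]))   # common term -> moderate weight
print(tfidf("history", tokenized[2]))   # rare term -> higher weight
```

Note how "history", which appears in only one document, receives a higher IDF (and hence TF-IDF) than "science", which appears in two.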
Vector Creation: Each document and query is represented as a vector, where each
dimension corresponds to a term and the value is its corresponding TF-IDF score.
Cosine Similarity: The relevance between documents and queries is assessed using
the cosine similarity between their respective vectors. Documents with higher
cosine similarities are considered more relevant.
Cosine Similarity is computed using:

cos(d, q) = (d · q) / (||d|| ||q||)

where d · q is the dot product of the document and query vectors, and ||d||,
||q|| are their Euclidean norms.
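Cosine similarity between two weighted vectors can be checked with a small NumPy sketch; the vector values here are illustrative, not taken from the corpus used later.

```python
import numpy as np

# Two illustrative TF-IDF vectors over a shared four-term vocabulary (assumed values).
doc_vec = np.array([0.2, 0.0, 0.5, 0.3])
query_vec = np.array([0.4, 0.0, 0.5, 0.0])

# Cosine similarity: dot product divided by the product of the vector norms.
cos_sim = np.dot(doc_vec, query_vec) / (
    np.linalg.norm(doc_vec) * np.linalg.norm(query_vec)
)
print(round(cos_sim, 4))
```

Because TF-IDF weights are non-negative, the similarity falls between 0 (no shared terms) and 1 (identical direction), which is what makes it usable directly as a relevance score.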
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')
nltk.download('stopwords')

# Sample corpus
corpus = [
    'In computer science artificial intelligence sometimes called machine intelligence is intelligence demonstrated by machines',
    'Experimentation calculation and Observation is called science',
    'Physics is a natural science that involves the study of matter and its motion through space and time, along with related concepts such as energy and force',
    'In mathematics and computer science an algorithm is a finite sequence of well-defined computer-implementable instructions',
    'Chemistry is the scientific discipline involved with elements and compounds composed of atoms, molecules and ions',
    'Biochemistry is the branch of science that explores the chemical processes within and related to living organisms',
    'Sociology is the study of society, patterns of social relationships, social interaction, and culture that surrounds everyday life',
]

# Query
query = 'computer science'
query_tokens = word_tokenize(query.lower())

# Build TF-IDF vectors for the corpus, then project the query into the same space
vectorizer = TfidfVectorizer(stop_words='english')
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity to the query
similarities = cosine_similarity(query_vector, doc_vectors).flatten()
for idx in similarities.argsort()[::-1]:
    print(f'Doc {idx + 1}: similarity = {similarities[idx]:.4f}')
Output:
Boolean Model: The Boolean Information Retrieval Model operates on principles
derived from Boolean logic. Its key components are Boolean operators, queries,
and documents. Here's a step-by-step explanation of how the model works:
Query Formulation: Users create queries by combining terms and Boolean operators
(AND, OR, NOT). The terms are the keywords or phrases that users want to search for
within the document collection. The Boolean operators define how the terms are related
and help narrow down or broaden the search.
Boolean Operators:
AND: When users use the AND operator, they are specifying that documents must
contain all the terms connected by AND. This narrows down the search to documents that
satisfy all the conditions.
OR: The OR operator retrieves documents that contain at least one of the terms connected
by OR. It broadens the search by including documents that meet any of the specified
conditions.
NOT: The NOT operator excludes documents that contain the term following it. It refines
the search results by excluding unwanted documents.
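The three operators map directly onto set operations over posting lists. The following minimal sketch makes that concrete; the inverted index and document IDs here are illustrative, not the ones built in the code below.

```python
# Illustrative inverted index: term -> set of document IDs (assumed).
index = {
    "computer": {1, 4},
    "science":  {1, 2, 3, 4},
    "physics":  {3},
}
all_docs = {1, 2, 3, 4, 5, 6, 7}

# AND: set intersection -- documents containing both terms.
print(index["computer"] & index["science"])   # {1, 4}

# OR: set union -- documents containing either term.
print(index["computer"] | index["physics"])   # {1, 3, 4}

# NOT: set difference against the full collection.
print(all_docs - index["physics"])            # {1, 2, 4, 5, 6, 7}
```

This correspondence between operators and set operations is what makes Boolean retrieval efficient: each query reduces to intersections, unions, and differences of precomputed posting lists.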
Retrieval of Documents: An inverted index maps each term to the documents
containing it, allowing the system to efficiently retrieve documents that match
the terms and Boolean operators specified in the query.
Advantages:
Precision Control: Users can precisely define search criteria using Boolean operators,
ensuring accurate retrieval of specific information.
Structured Queries: Ideal for systematic searches where exact term matches are critical,
such as legal or scientific research.
Consistent Results: The same query always produces the same results, ensuring
reproducibility.
Limitations:
Complex Query Construction: Formulating intricate queries with multiple terms and
operators can be complex and error-prone.
Binary Output: Documents are classified as either relevant or irrelevant, lacking the
nuance of degrees of relevance.
Code:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from collections import OrderedDict
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Sample documents mapped to document IDs
# (the original read these from text files)
documents = {
    1: 'Information retrieval finds relevant documents',
    2: 'Boolean retrieval uses set operations on posting lists',
    3: 'Vector space retrieval ranks documents by similarity',
}

# Build the inverted index: stemmed term -> posting list of document IDs
dictionary = {}
for doc_id, text in documents.items():
    for token in word_tokenize(text.lower()):
        if token in stop_words or token in string.punctuation:
            continue
        term = stemmer.stem(token)
        postings = dictionary.setdefault(term, [])
        if doc_id not in postings:
            postings.append(doc_id)

OrderedDictionary = OrderedDict(sorted(dictionary.items()))
print("Inverted Index:")
for term, posting_list in OrderedDictionary.items():
    print(term, posting_list)

# Evaluate a Boolean query left to right; NOT subtracts from the
# current result set (i.e. queries of the form 'a and not b').
# Query terms are stemmed so they match the index.
query = ['retrieval', 'and', 'boolean']
result_set = set()
i = 0
while i < len(query):
    term = query[i]
    if term == 'and':
        i += 1
        next_term = stemmer.stem(query[i])
        result_set &= set(OrderedDictionary.get(next_term, []))
    elif term == 'or':
        i += 1
        next_term = stemmer.stem(query[i])
        result_set |= set(OrderedDictionary.get(next_term, []))
    elif term == 'not':
        i += 1
        next_term = stemmer.stem(query[i])
        result_set -= set(OrderedDictionary.get(next_term, []))
    else:
        result_set = set(OrderedDictionary.get(stemmer.stem(term), []))
    i += 1

if result_set:
    print("\nMatching Documents:", result_set)
else:
    print("\nNo matching documents.")
Output:
(Screenshots of the query results for the AND, NOT, and OR operators.)
Conclusion:
While the Boolean Information Retrieval Model offers precision and structured
searches, it cannot capture contextual nuance or rank documents by degree of
relevance. The Vector Space Model quantifies text and addresses these
limitations, making it a more adaptable and nuanced approach, though challenges
in contextual understanding and user intent remain open for further work.