Information Retrieval Practical

These are the practicals and documentation of the Information Retrieval subject

Yash Pahlani D17B 49

Aim: To implement any Information Retrieval modeling technique.

Theory:

Information Retrieval (IR) models define how documents and queries are
represented and how the match between them is scored, so that a system can
extract relevant documents from a large collection and rank them against a
user's query. The choice of model depends on factors such as query complexity,
the size of the document collection, and the desired precision-recall trade-off.

Here are some commonly used IR modeling techniques:

Boolean Model: The Boolean Model is a fundamental and straightforward
approach to information retrieval. It treats documents and queries as sets of terms
(words), and it uses Boolean operators (AND, OR, NOT) to combine these sets. In
this model, a document is either considered relevant (1) or not relevant (0) to a
query. The Boolean Model provides a way to express complex queries using
logical operators.

Vector Space Model (VSM): The Vector Space Model represents documents and
queries as vectors in a high-dimensional space, where each dimension corresponds
to a term. Terms are typically weighted using techniques like TF-IDF to reflect
their importance in the document. The relevance between a query vector and a
document vector is often computed using the cosine similarity.

Probabilistic Models: Probabilistic models approach information retrieval from a
statistical perspective, estimating the probability that a document is relevant to a
given query. These models aim to find a balance between precision and recall by
ranking documents based on their likelihood of relevance.

Vector Space Model (VSM):

The Vector Space Model (VSM) is a fundamental technique in Information
Retrieval (IR) that transforms textual data into a geometric framework. In this
model, documents and queries are represented as vectors in a high-dimensional
space, with each dimension corresponding to a unique term. By measuring the
similarity between these vectors, the VSM assesses the relevance of documents to
user queries. Developed by Gerard Salton and colleagues, starting with work on
the SMART system in the 1960s, the VSM has since become a cornerstone of
modern IR systems.

Working of Vector Space Model:

The VSM transforms documents and queries into numerical vectors within a
high-dimensional space. Each dimension corresponds to a unique term in the
vocabulary. The key steps in the VSM's working are:

Term Frequency (TF) Calculation: For each document and query, the frequency of
each term is computed. This forms the term frequency vector.
The term frequency is computed for the i-th term with respect to the j-th document.
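
Several TF variants are in use. One common choice, assumed here for illustration, divides the raw count of term i in document j by the total number of term occurrences in that document. The symbol n_{i,j} (raw count of term i in document j) is notation introduced for this sketch:

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
```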

Inverse Document Frequency (IDF) Calculation: The inverse document frequency
of each term is determined. It measures how rare the term is across the entire
document collection, so terms that occur in few documents receive a higher weight.
The inverse document frequency is computed for the i-th term over all
the documents in the collection.
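
A widely used textbook form, assumed here for illustration, takes the logarithm of the collection size N divided by the number of documents df_i that contain term i (both symbols are notation introduced for this sketch):

```latex
\mathrm{idf}_{i} = \log \frac{N}{\mathrm{df}_{i}}
```

Library implementations often differ in detail. scikit-learn's TfidfVectorizer, for example, applies a smoothed variant by default, ln((1 + N) / (1 + df_i)) + 1, followed by L2 normalization of each vector, so its weights will not match this form numerically.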

Term Weighting (TF-IDF): The product of term frequency and inverse document
frequency results in the TF-IDF score, which reflects the significance of each term
within a document or query.
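
As a minimal sketch of the weighting step, the snippet below computes TF-IDF by hand, assuming a length-normalized term frequency and an unsmoothed natural-log IDF. The toy corpus and the helper names tf, idf, and tf_idf are illustrative and are not part of the practical's code.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "computer science studies algorithms",
    "physics is a natural science",
    "chemistry studies elements",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))


def tf(term, tokens):
    """Raw count of the term divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term):
    """Unsmoothed inverse document frequency (natural logarithm)."""
    return math.log(n_docs / df[term])


def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)


for term in ("computer", "science"):
    print(term, round(tf_idf(term, tokenized[0]), 4))
# computer 0.2747
# science 0.1014
```

"computer" occurs in one document out of three and "science" in two, so the rarer term receives the higher weight even though both appear once in the first document.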

Vector Creation: Each document and query is represented as a vector, where each
dimension corresponds to a term and the value is its corresponding TF-IDF score.
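
The vector creation step can be sketched as follows. This is an illustration under stated assumptions (whitespace tokenization, length-normalized TF, unsmoothed natural-log IDF); the corpus, the query, and the helper to_vector are hypothetical. Query terms that are absent from the vocabulary are simply ignored.

```python
import math
from collections import Counter

# Hypothetical toy corpus and query
docs = ["computer science", "natural science", "computer algorithm design"]
query = "computer science"

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# One dimension per unique term, in a fixed (sorted) order
vocabulary = sorted({t for tokens in tokenized for t in tokens})
df = Counter(t for tokens in tokenized for t in set(tokens))


def to_vector(tokens):
    """Map a token list to a TF-IDF vector over the shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    return [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocabulary]


doc_vectors = [to_vector(tokens) for tokens in tokenized]
query_vector = to_vector(query.split())

print(vocabulary)
# ['algorithm', 'computer', 'design', 'natural', 'science']
```

Documents and the query share the same vocabulary order, which is what makes their vectors directly comparable.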

Cosine Similarity: The relevance between documents and queries is assessed using
the cosine similarity between their respective vectors. Documents with higher
cosine similarities are considered more relevant.
Cosine similarity is the cosine of the angle between the query vector and a document vector.
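
In the standard definition, the dot product of the two vectors is divided by the product of their lengths. Here w_{i,q} and w_{i,j} denote the weight (for example the TF-IDF score) of the i-th term in the query and in the j-th document; these symbol names are introduced for this sketch:

```latex
\cos(\vec{q}, \vec{d}_j)
  = \frac{\vec{q} \cdot \vec{d}_j}{\lVert \vec{q} \rVert \, \lVert \vec{d}_j \rVert}
  = \frac{\sum_{i} w_{i,q} \, w_{i,j}}
         {\sqrt{\sum_{i} w_{i,q}^{2}} \; \sqrt{\sum_{i} w_{i,j}^{2}}}
```

With non-negative weights the value lies between 0 (no shared terms) and 1 (vectors pointing in the same direction).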

Advantages of Vector Space Model

Partial Matching: The VSM accommodates partial keyword matches, allowing
relevant documents to be retrieved even if they share only a subset of terms with
the query.

Term Importance: TF-IDF captures term importance, emphasizing rare and
distinctive terms over common ones.

Ranking: Cosine similarity provides a natural ranking mechanism, presenting the
most relevant documents first.
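
Partial matching and ranking can be seen in a small hand-rolled example. The weights, the three-term vocabulary, and the document names below are hypothetical values chosen for illustration, not output of the practical's code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical weights over the vocabulary [computer, science, physics]
query = [1.0, 1.0, 0.0]
doc_vectors = {
    "doc_a": [1.0, 1.0, 0.0],  # shares both query terms
    "doc_b": [0.0, 1.0, 1.0],  # shares only "science" (partial match)
    "doc_c": [0.0, 0.0, 1.0],  # shares no query term
}

ranking = sorted(doc_vectors,
                 key=lambda name: cosine(query, doc_vectors[name]),
                 reverse=True)
print(ranking)
# ['doc_a', 'doc_b', 'doc_c']
```

doc_b is still retrieved, with a lower score than doc_a, although it contains only one of the two query terms.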

Disadvantages of Vector Space Model

Semantic Gap: The VSM lacks understanding of word semantics, leading to
challenges in capturing context and meaning.

High-Dimensional Space: As the vocabulary grows, the dimensionality of the
space increases, which raises memory and computation costs.

Query Sparsity: Short queries or those with few relevant terms may result in
imprecise retrieval.
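
The semantic gap is easy to reproduce with a plain bag-of-words cosine. The helper cosine_bow and the example phrases are hypothetical; raw term counts are used instead of TF-IDF to keep the sketch short.

```python
import math
from collections import Counter


def cosine_bow(text_a, text_b):
    """Cosine similarity of two texts using raw term counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing terms
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


# Synonyms share no terms, so the score is zero despite the similar meaning
print(cosine_bow("cheap car", "affordable automobile"))   # 0.0

# A literal term overlap scores higher even though the topic differs
print(round(cosine_bow("cheap car", "car wash"), 2))      # 0.5
```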

Code:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download the tokenizer models and the stop-word list
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases
nltk.download('stopwords')

# Build the stop-word set once instead of on every lookup
stop_words = set(stopwords.words('english'))

# Sample corpus
corpus = [
    'In computer science artificial intelligence sometimes called '
    'machine intelligence is intelligence demonstrated by machines',
    'Experimentation calculation and Observation is called science',
    'Physics is a natural science that involves the study of matter '
    'and its motion through space and time, along with related concepts '
    'such as energy and force',
    'In mathematics and computer science an algorithm is a finite '
    'sequence of well-defined computer-implementable instructions',
    'Chemistry is the scientific discipline involved with elements '
    'and compounds composed of atoms, molecules and ions',
    'Biochemistry is the branch of science that explores the chemical '
    'processes within and related to living organisms',
    'Sociology is the study of society, patterns of social '
    'relationships, social interaction, and culture that surrounds '
    'everyday life',
]

# Preprocess and clean the corpus: lowercase, tokenize,
# and keep only purely alphanumeric tokens
cleaned_corpus = []
for doc in corpus:
    tokens = word_tokenize(doc.lower())
    tokens = [word for word in tokens if word.isalnum()]
    cleaned_corpus.append(' '.join(tokens))

# Create a TF-IDF vectorizer and fit it on the cleaned corpus
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(cleaned_corpus)

# Query: lowercase, tokenize, and drop stop words
query = 'computer science'
query_tokens = word_tokenize(query.lower())
query = ' '.join([word for word in query_tokens if word not in stop_words])

# Transform the query into a vector using the same vectorizer
query_vector = vectorizer.transform([query])

# Calculate cosine similarity between every document and the query
cosine_similarities = cosine_similarity(doc_vectors, query_vector).flatten()

# Get the indices of related documents, most similar first
related_docs_indices = cosine_similarities.argsort()[::-1]

# Print related documents (stop words removed) with cosine similarity values
for i in related_docs_indices:
    tokens = word_tokenize(cleaned_corpus[i])
    filtered_tokens = [word for word in tokens if word not in stop_words]
    data = ' '.join(filtered_tokens)
    similarity_value = cosine_similarities[i]
    print(f"Similarity: {similarity_value:.4f} - {data}")

Output:

Boolean Information Retrieval Model

The Boolean Information Retrieval Model is a fundamental approach in the field of
information retrieval, which involves searching for and retrieving relevant documents
from a collection based on user-defined queries. It operates on the principles of Boolean
logic, which was developed by George Boole in the mid-19th century. The Boolean
model is particularly effective for precise, structured searches, making it well-suited for
certain types of information retrieval tasks.

Working of Boolean Information Retrieval Model

The Boolean Information Retrieval Model operates based on a set of principles derived
from Boolean logic. The key components that define how the model works include
Boolean operators, queries, and documents. Here's a step-by-step explanation of how the
Boolean Information Retrieval Model works:

Document Indexing: Before searching can begin, a collection of documents is indexed.
This involves parsing each document to extract individual terms (words), removing
stopwords (common words like "and," "the," "is"), and creating an index that maps each
term to the documents where it appears.
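
The indexing step can be sketched with the standard library alone. The stop-word list below is a tiny hypothetical stand-in, tokenization is a plain whitespace split, and build_inverted_index is an illustrative helper name; stemming is left out for brevity.

```python
from collections import defaultdict

# Tiny hypothetical stop-word list; a real system would use a fuller one
STOP_WORDS = {"a", "also", "and", "i", "is", "the", "to"}


def build_inverted_index(documents):
    """Map each non-stop-word term to the set of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents, start=1):
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index


documents = [
    "Taj Mahal is a beautiful monument",
    "Victoria Memorial is also a monument",
    "I like to visit Agra",
]

index = build_inverted_index(documents)
print(sorted(index["monument"]))  # [1, 2]
print(sorted(index["agra"]))      # [3]
```

Storing postings as sets keeps each document number unique per term and makes the Boolean operators direct set operations.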

Query Formulation: Users create queries by combining terms and Boolean operators
(AND, OR, NOT). The terms are the keywords or phrases that users want to search for
within the document collection. The Boolean operators define how the terms are related
and help narrow down or broaden the search.

Boolean Operators:

AND: When users use the AND operator, they are specifying that documents must
contain all the terms connected by AND. This narrows down the search to documents that
satisfy all the conditions.

OR: The OR operator retrieves documents that contain at least one of the terms connected
by OR. It broadens the search by including documents that meet any of the specified
conditions.

NOT: The NOT operator excludes documents that contain the term following it. It refines
the search results by excluding unwanted documents.

Retrieval of Documents: The indexing structure allows the system to efficiently retrieve
documents that match the terms and Boolean operators specified in the query.
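
Once an index exists, the three operators map onto set intersection, union, and difference. The postings below are hypothetical and docs_for is an illustrative helper:

```python
# Hypothetical postings: term -> set of document numbers
all_docs = {1, 2, 3}
postings = {"taj": {1}, "monument": {1, 2}, "agra": {3}}


def docs_for(term):
    """Return the posting set for a term, or an empty set if it is not indexed."""
    return postings.get(term, set())


and_result = docs_for("taj") & docs_for("monument")  # documents with both terms
or_result = docs_for("taj") | docs_for("agra")       # documents with either term
not_result = all_docs - docs_for("monument")         # documents without the term

print(and_result)  # {1}
print(or_result)   # {1, 3}
print(not_result)  # {3}
```

NOT is evaluated against the full set of document numbers, which is why the system needs to know the whole collection and not only the postings of the query terms.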

Advantages of Boolean Information Retrieval Model

Precision Control: Users can precisely define search criteria using Boolean operators,
ensuring accurate retrieval of specific information.

Structured Queries: Ideal for systematic searches where exact term matches are critical,
such as legal or scientific research.

Consistent Results: The same query always produces the same results, ensuring
reproducibility.

Disadvantages of Boolean Information Retrieval Model

No Relevance Ranking: Lacks the ability to rank documents by relevance, leading to
potential difficulties in identifying more important results.

Limited Language Handling: Struggles with variations in language, such as synonyms or
related terms, which can result in missed information.

Complex Query Construction: Formulating intricate queries with multiple terms and
operators can be complex and error-prone.

Binary Output: Documents are classified as either relevant or irrelevant, lacking the
nuance of degrees of relevance.

Code:
import string
from collections import OrderedDict

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the tokenizer models and the stop-word list
nltk.download('punkt')
nltk.download('punkt_tab')  # required by newer NLTK releases
nltk.download('stopwords')

# Sample documents in the corpus
documents = [
    "Taj Mahal is a beautiful monument",
    "Victoria Memorial is also a monument",
    "I like to visit Agra"
]

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
dictionary = {}

# Build the inverted index: stemmed term -> list of document numbers
for doc_id, doc_text in enumerate(documents, start=1):
    tokens = word_tokenize(doc_text.lower())

    # Remove stopwords and punctuation
    tokens = [t for t in tokens
              if t not in stop_words and t not in string.punctuation]

    stemmedwords = [stemmer.stem(token) for token in tokens]

    for term in stemmedwords:
        postings = dictionary.setdefault(term, [])
        if doc_id not in postings:  # list each document once per term
            postings.append(doc_id)

OrderedDictionary = OrderedDict(sorted(dictionary.items()))

print("Inverted Index:")
for term, posting_list in OrderedDictionary.items():
    print(term, posting_list)

query = input("Enter query: ")

# Stem the query the same way as the documents; "and", "or" and "not"
# are unchanged by the stemmer and act as operators
query = word_tokenize(query.lower())
query = [stemmer.stem(word) for word in query]

print("Processed Query:", query)

# Start from all documents so that a query may begin with NOT.
# Operators are applied left to right, without precedence or parentheses,
# and each operator takes exactly one term (write "taj not agra",
# not "taj and not agra").
result_set = set(range(1, len(documents) + 1))
operators = ('and', 'or', 'not')

i = 0
while i < len(query):
    term = query[i]

    if term in operators:
        if i + 1 >= len(query):
            # Operator without an operand: stop instead of raising IndexError
            print("Ignoring trailing operator:", term)
            break
        i += 1
        next_postings = set(OrderedDictionary.get(query[i], []))
        if term == 'and':
            result_set &= next_postings
        elif term == 'or':
            result_set |= next_postings
        else:
            result_set -= next_postings
    else:
        result_set = set(OrderedDictionary.get(term, []))

    i += 1

if result_set:
    print("\nMatching Documents:", result_set)
else:
    print("\nNo matching documents.")

Output:

AND

NOT

Conclusion
The Boolean Information Retrieval Model offers precise, structured searches, but it
cannot rank results by relevance or return documents that match a query only in part.
The Vector Space Model addresses both limitations by weighting terms and scoring
documents with cosine similarity, which makes it the more flexible of the two approaches.
Neither model captures word meaning or user intent, which remains a challenge for
further development.
