What Is Information Retrieval (IR)
Information Retrieval (IR) is a process used by software programs to organize, store, search,
and retrieve information from a collection of documents. These documents can include text,
images, videos, or other types of multimedia.
Think of IR as a smart search system that helps find the most useful information based on
what a user is looking for. When you enter a query (a search request), the system scans
through the stored documents and brings back the most relevant ones. This is done using
indexing and metadata (extra information about the content, like tags or keywords) to make
searching more efficient.
For example, when you search for something on Google, IR helps by finding and ranking web
pages based on how well they match your search terms.
IR is widely used in many fields to make searching for information easier and faster. One of its
most important applications is in Natural Language Processing (NLP), where IR focuses on
searching and retrieving documents written in natural language (English, Hindi, etc.) based on a
user's query.
For example, when you ask Google a question, it doesn't generate a new answer; it retrieves
relevant web pages that already contain the information you asked for. This is how an IR system
works in NLP.
The most important goal of an IR system is to retrieve only relevant documents—those that
contain useful information for the user.
A perfect IR system would return only relevant documents and ignore all unrelated ones.
However, in reality, systems aren’t perfect, so they try to rank documents based on how well
they match the query.
For example, if a user searches for affordable laptops, a good IR system should not return
results about smartphones or expensive laptops that cost over $2000.
Steps in Information Retrieval
The Information Retrieval (IR) process in Natural Language Processing (NLP) can be broken
down step by step:
IR in NLP Context
1. Indexing
Steps in Indexing: tokenization, stopword removal, stemming/lemmatization, and storing the
resulting terms in an index (each of these steps is described in more detail later in this section).
Indexing Techniques:
● Boolean Model
○ Uses AND, OR, NOT to filter results.
○ Example:
■ Query: "Machine Learning AND Deep Learning"
■ Returns only documents that contain both terms.
● Vector Space Model
○ Represents documents as mathematical vectors to measure similarity.
○ Used in TF-IDF ranking (explained below).
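To make the Boolean model concrete, here is a minimal sketch of an inverted index and an AND query; the three documents and their texts are hypothetical, chosen only to mirror the "Machine Learning AND Deep Learning" example above:

```python
# Minimal sketch of Boolean retrieval over an inverted index.
# The three documents below are hypothetical examples.
from collections import defaultdict

docs = {
    "D1": "machine learning is fun",
    "D2": "deep learning needs machine learning basics",
    "D3": "databases store structured data",
}

# Build the inverted index: term -> set of document IDs containing it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def boolean_and(*terms):
    """Boolean AND query: return only documents containing every query term."""
    result = None
    for term in terms:
        postings = inverted_index.get(term, set())
        result = postings if result is None else result & postings
    return result or set()

print(boolean_and("machine", "learning"))          # {'D1', 'D2'}
print(boolean_and("machine", "learning", "deep"))  # {'D2'} -> only D2 has all terms
```

OR and NOT work the same way, using set union and set difference on the posting lists instead of intersection.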
2. Matching
Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)
where TF(t, d) = (number of times term t appears in document d) / (total number of terms in d),
and IDF(t) = log(N / df(t)), with N the total number of documents and df(t) the number of
documents that contain t.
Example:
Imagine we have this document:
➡ "AI is the future. AI is powerful."
The word "AI" appears 2 times out of 7 words, so TF("AI") = 2/7 ≈ 0.286. A filler word like "is"
also appears twice, but because "is" occurs in almost every document, its IDF (and therefore its
TF-IDF score) is close to zero.
Why is TF-IDF Important?
TF-IDF is important because it downweights common words that appear in almost every
document (their IDF is close to zero) and boosts terms that are frequent in a particular
document but rare across the collection, which are exactly the terms that best distinguish
relevant documents from irrelevant ones.
Information Retrieval (IR) is the process of finding relevant information from a large collection of
unstructured data (e.g., text documents, web pages) based on a user's query. The following
steps walk through this process in more detail.
1. Indexing
Before searching, the system needs to prepare and organize documents efficiently. This step is
called indexing and involves:
● Tokenization (splitting each document into individual words)
● Stopword removal (dropping very common words such as "the", "is", "a")
● Stemming/Lemmatization (reducing words to their root form)
● Storing the processed terms in an index (for example, an inverted index or TF-IDF vectors)
👉 Example:
If we have three documents, D1, D2, and D3, each one is preprocessed and its terms are stored
in the index, as in the sketch below.
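The original example documents are not reproduced here, so this sketch uses three hypothetical texts for D1–D3 (chosen so that D2 contains both "retrieval" and "techniques", matching the ranking example later in this section); the stopword list and stemming rule are simplified stand-ins:

```python
# Sketch: preprocessing and indexing three hypothetical documents.
docs = {
    "D1": "Information retrieval finds relevant documents.",
    "D2": "Modern retrieval techniques rank documents by relevance.",
    "D3": "Fast indexing techniques speed up search systems.",
}

STOPWORDS = {"the", "is", "a", "an", "by", "up", "of"}

def preprocess(text):
    """Tokenize, lowercase, remove stopwords, and crudely 'stem' a trailing s."""
    tokens = text.lower().replace(".", "").split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

# The index maps each document ID to its processed terms.
index = {doc_id: preprocess(text) for doc_id, text in docs.items()}
for doc_id, terms in index.items():
    print(doc_id, terms)
# D2 -> ['modern', 'retrieval', 'technique', 'rank', 'document', 'relevance']
```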
2. Query Processing
The user submits a query (e.g., "fast indexing"), which must be processed similarly to the
documents:
● Tokenization
● Stopword removal
● Stemming/Lemmatization
● Conversion to a vector representation (like TF-IDF)
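A minimal sketch of these query-processing steps in plain Python (the stopword list and the crude stemming rule are the same simplified stand-ins used in the indexing sketch above):

```python
# Sketch: the query is processed with the same steps as the documents.
STOPWORDS = {"the", "is", "a", "an", "for", "of"}

def preprocess(text):
    tokens = text.lower().replace(".", "").split()          # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]       # stopword removal
    # Crude stemming: strip a trailing 's' (a real system would use a Porter
    # stemmer or a lemmatizer instead).
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("fast indexing"))                  # ['fast', 'indexing']
print(preprocess("The best retrieval techniques"))  # ['best', 'retrieval', 'technique']
```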
Now, the system compares the transformed query with the indexed documents.
This is done using similarity measures like:
● Cosine similarity (the angle between the query vector and a document vector)
● Dot product (inner product) of the query and document vectors
● Jaccard similarity (the overlap between the query's and the document's term sets)
👉 Example:
If the user searches for "retrieval techniques", the system scores every document against the
query and ranks them by similarity. D2 is ranked highest because it contains both "retrieval" and
"techniques".
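Putting query processing and matching together, here is a sketch using scikit-learn's TfidfVectorizer and cosine_similarity on the same three hypothetical documents; the exact scores depend on the vectorizer's weighting, only the ordering matters:

```python
# Sketch: rank three hypothetical documents against the query "retrieval techniques"
# using TF-IDF vectors and cosine similarity (requires scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc_ids = ["D1", "D2", "D3"]
docs = [
    "Information retrieval finds relevant documents.",           # D1
    "Modern retrieval techniques rank documents by relevance.",  # D2
    "Fast indexing techniques speed up search systems.",         # D3
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)                    # index the documents
query_vector = vectorizer.transform(["retrieval techniques"])   # process the query the same way

scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
# D2 ranks highest because it contains both "retrieval" and "techniques".
```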
The Vector Space Model (VSM) is a mathematical representation of documents and queries as
vectors in high-dimensional space. It allows similarity calculations between queries and
documents.
How It Works
1. Each document and query is represented as a vector in an n-dimensional space, where
each dimension is a unique word (term).
2. The importance of each word in a document is calculated using TF-IDF (Term
Frequency - Inverse Document Frequency).
3. The similarity between a query and a document is measured using cosine similarity.
Each word in a document is assigned a weight using TF-IDF, which is calculated as:
w(t, d) = TF(t, d) × IDF(t) = TF(t, d) × log(N / df(t))
Where:
● TF(t, d) = how often term t appears in document d, divided by the total number of terms in d
● N = total number of documents in the collection
● df(t) = number of documents that contain term t
To measure similarity, we calculate the cosine of the angle between the query vector q and the
document vector d:
cosine(q, d) = (q · d) / (|q| × |d|)
Where:
● q · d = the dot product of the query and document vectors
● |q| and |d| = the lengths (Euclidean norms) of the query and document vectors
👉 Example: If the query is "deep learning", we compute cosine similarity with each
document.
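Here is a from-scratch sketch of the cosine formula above; the three-term vocabulary and the weights are made up purely for illustration:

```python
# Sketch: cosine similarity between a query vector and document vectors,
# computed directly from the formula cosine(q, d) = (q . d) / (|q| * |d|).
import numpy as np

def cosine(q, d):
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

# Hypothetical TF-IDF weights over the terms ["deep", "learning", "database"].
query = np.array([0.7, 0.7, 0.0])   # query "deep learning"
doc_a = np.array([0.5, 0.6, 0.0])   # a document about deep learning
doc_b = np.array([0.0, 0.1, 0.9])   # a document mostly about databases

print(cosine(query, doc_a))  # close to 1.0 -> very similar, ranked first
print(cosine(query, doc_b))  # close to 0.0 -> not similar, ranked last
```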
Term weighting is the process of assigning numerical values to words in a document or corpus
to reflect their importance in information retrieval (IR). The higher the weight of a term, the
greater its impact on document retrieval and ranking.
Term Frequency (TF)
Definition: Measures how often a term appears within a single document, usually normalized by
the document's length.
Formula:
TF(t, d) = (number of occurrences of t in d) / (total number of terms in d)
Inverse Document Frequency (IDF)
Definition: Measures the importance of a term across multiple documents. Rare terms have
higher IDF values.
Formula:
IDF(t) = log(N / df(t))
where:
● N = total number of documents in the corpus
● df(t) = number of documents containing the term t
TF-IDF
Definition: A combination of TF and IDF that balances word frequency and uniqueness.
Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)
👉 Example: Suppose a common word (for instance, a stopword) appears once in a 7-word
document and occurs in all 3 documents of the corpus:
● TF = 1/7 ≈ 0.142
● IDF = log(3/3) = 0
● TF-IDF = 0.142 × 0 = 0 (common word, not important)
Now suppose a rarer word appears once in a 6-word document but occurs in only 2 of the 3
documents:
● TF = 1/6 ≈ 0.167
● IDF = log(3/2) ≈ 0.176
● TF-IDF = 0.167 × 0.176 ≈ 0.029 (relatively more important)
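These numbers can be reproduced in a few lines of Python (using log base 10, which matches the 0.176 value above):

```python
# Sketch: reproduce the TF-IDF numbers from the example above (log base 10).
import math

def tf_idf(term_count, doc_length, total_docs, docs_with_term):
    tf = term_count / doc_length
    idf = math.log10(total_docs / docs_with_term)
    return tf, idf, tf * idf

# Common word: appears once in a 7-word document, and in all 3 documents.
print(tf_idf(1, 7, 3, 3))   # (0.142..., 0.0, 0.0)

# Rarer word: appears once in a 6-word document, but only in 2 of the 3 documents.
print(tf_idf(1, 6, 3, 2))   # (0.166..., 0.176..., 0.029...)
```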
Zipf’s Law
Definition: The frequency of a word is inversely proportional to its rank in the corpus.
Formula:
f ∝ 1/r (equivalently, f × r ≈ constant)
where:
● f = frequency of a word
● r = rank of the word in terms of frequency
Observation: The most frequent term occurs roughly twice as often as the second most frequent
term, three times as often as the third most frequent, and so on.
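A quick sketch of what Zipf's law predicts, assuming a hypothetical corpus whose most frequent word occurs 1000 times:

```python
# Sketch: word frequencies predicted by Zipf's law (f ∝ 1/r),
# given a hypothetical count for the most frequent word.
top_frequency = 1000   # hypothetical count of the rank-1 word
for rank in range(1, 6):
    predicted = top_frequency / rank
    print(rank, round(predicted))   # 1000, 500, 333, 250, 200
```

On a real corpus, the same check can be done by counting word frequencies (e.g., with collections.Counter), sorting them, and verifying that rank × frequency stays roughly constant.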
Text Preprocessing
Text preprocessing involves cleaning and transforming raw text data into a format suitable for
analysis and modeling. It improves the quality and efficiency of downstream NLP tasks by
addressing noise, inconsistency, and irrelevant information.
Indexing
Indexing in information retrieval involves creating data structures that efficiently store and
retrieve documents based on content or metadata. It represents documents as vectors in a
high-dimensional space, where each dimension corresponds to a unique term.
Query Processing
Query processing ensures accuracy by retrieving relevant documents based on user queries.
The effectiveness of an IR system depends on how well the query is formulated.
● Matching Query Terms with Indexed Documents: The system identifies relevant
documents by comparing indexed terms with query terms.
● Term Weighting in Query Ranking:
○ Terms with high Inverse Document Frequency (IDF) contribute significantly to
relevance.
○ Rare terms in the corpus but frequent in the query indicate high relevance.
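As a small sketch of this point, the IDF values below (computed for hypothetical document frequencies) show why a rare query term contributes far more to the ranking than a common one:

```python
# Sketch: IDF weighting of query terms (hypothetical document frequencies).
import math

total_docs = 10_000
doc_frequency = {"the": 9_800, "retrieval": 120, "lemmatization": 7}

for term in ["the", "retrieval", "lemmatization"]:
    idf = math.log10(total_docs / doc_frequency[term])
    print(term, round(idf, 3))
# 'the'           -> ~0.009 (appears almost everywhere, contributes little)
# 'retrieval'     -> ~1.921 (moderately rare, contributes much more)
# 'lemmatization' -> ~3.155 (very rare, dominates the query's relevance score)
```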
Relevance Feedback