
What is Information Retrieval (IR)?

Information Retrieval (IR) is a process used by software programs to organize, store, search,
and retrieve information from a collection of documents. These documents can include text,
images, videos, or other types of multimedia.

Think of IR as a smart search system that helps find the most useful information based on
what a user is looking for. When you enter a query (a search request), the system scans
through the stored documents and brings back the most relevant ones. This is done using
indexing and metadata (extra information about the content, like tags or keywords) to make
searching more efficient.

For example, when you search for something on Google, IR helps by finding and ranking web
pages based on how well they match your search terms.

Uses of Information Retrieval (IR)

IR is widely used in many fields to make searching for information easier and faster. Here are
some real-world applications:

1. Search Engines (Google, Bing, etc.)
○​ When you type a query into a search engine, IR techniques scan millions of
web pages and bring you the most relevant results.
○​ These search engines use algorithms to rank results based on relevance,
popularity, and context.
2.​ Digital Libraries
○​ Digital libraries store books, research papers, and articles in electronic form.
○​ IR helps users quickly find the right material by searching through a massive
collection of digital documents.
3.​ Enterprise Search (Corporate Data Management)
○​ Large companies store huge amounts of documents, emails, and reports.
○​ IR systems help employees find important files and knowledge without wasting
time.
4.​ E-commerce (Amazon, Flipkart, etc.)
○​ Online shopping websites use IR to help customers find the right products
based on search queries.
○​ When you type "wireless headphones," the system searches its product
database and shows you relevant results.
5.​ Healthcare Information Systems
○​ Doctors and researchers use IR to find medical records, research papers, and
drug information.
○​ For example, if a doctor searches for "treatment for diabetes," an IR system can
pull up scientific articles and case studies.
6.​ Legal Research
○​ Lawyers need to refer to past cases, laws, and legal documents to prepare
arguments.
○​ IR systems help them search through large databases of legal information
efficiently.
7.​ Social Media Search (Facebook, Twitter, Instagram, etc.)
○​ Social media platforms store billions of posts, photos, and videos.
○​ IR helps users search for people, posts, and hashtags quickly and accurately.

Information Retrieval (IR) in Natural Language Processing (NLP)

In NLP (Natural Language Processing), Information Retrieval (IR) focuses on searching and
retrieving documents written in natural language (English, Hindi, etc.) based on a user’s
query.

Imagine you type a question in Google like:

➡ “What are the symptoms of diabetes?”

Google doesn’t generate new answers but retrieves relevant web pages that already contain
this information. This is how an IR system works in NLP.

How IR Systems Work in NLP

● IR systems search through a large collection of text documents.
●​ They try to find the documents that best match a user’s question.
●​ The system does not generate new answers but only tells the user where to find
relevant documents.

Key Concept: Relevance in IR

The most important goal of an IR system is to retrieve only relevant documents—those that
contain useful information for the user.

A perfect IR system would return only relevant documents and ignore all unrelated ones.
However, in reality, systems aren’t perfect, so they try to rank documents based on how well
they match the query.

For example, if you search:

➡ "Best laptops under $1000"

A good IR system should not return results about smartphones or expensive laptops that cost
over $2000.

Steps in Information Retrieval

1. User enters a query in natural language.
2.​ The system searches through a collection of documents.
3.​ Relevant documents are ranked based on similarity to the query.
4.​ Results are displayed to the user (like Google’s search results).

The standard diagram of the Information Retrieval (IR) process in Natural Language
Processing (NLP) shows documents and queries flowing through a representation function
into a matching function. Let's break it down step by step:

Key Components and Flow

1. Documents (Corpus)
○​ This represents the entire set of documents available for search.
○​ These documents are processed and transformed into a structured format using
a representation function.
2.​ Representation Function (Indexing)
○​ Converts raw text documents into structured representations (e.g., term
frequency vectors, TF-IDF, word embeddings).
○​ This processed data is stored in an Index, which acts as a database for quick
retrieval.
3.​ Query (User Input)
○​ A user submits a search query (e.g., "Find articles on deep learning").
○​ The representation function processes the query into the same structured
format as the indexed documents.
4.​ Matching Function (Retrieval & Ranking)
○​ The query representation is compared with the document representations stored
in the index.
○​ The matching function ranks the documents based on similarity (e.g., cosine
similarity, BM25, deep learning models).
5.​ Results
○​ The most relevant documents are retrieved and displayed to the user as search
results.
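
A minimal sketch of this flow in Python may help. The bag-of-words representation and the overlap-based matching function below are illustrative stand-ins, not any particular library's API:

```python
from collections import Counter

def represent(text):
    # Representation function: raw text -> bag-of-words "vector".
    return Counter(text.lower().split())

def match(query_vec, doc_vec):
    # Matching function: how many query-term occurrences the document covers.
    return sum(min(query_vec[t], doc_vec[t]) for t in query_vec)

corpus = {"d1": "deep learning for search", "d2": "cooking recipes"}
index = {doc_id: represent(text) for doc_id, text in corpus.items()}  # the Index

query = represent("deep learning")
ranked = sorted(index, key=lambda d: match(query, index[d]), reverse=True)
print(ranked)  # ['d1', 'd2'] -- d1 matches both query terms
```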

IR in NLP Context

● Text Representation: Uses NLP techniques like tokenization, stemming, lemmatization, word embeddings (Word2Vec, BERT).
●​ Indexing: Efficiently structures document information for faster retrieval.
●​ Query Processing: Converts user input into a machine-readable format.
●​ Ranking & Retrieval: Uses traditional (TF-IDF, BM25) or modern deep learning models
(BERT-based retrieval) for ranking.

Main Components of an IR System

An IR system consists of two main processes:

1. Indexing – Preparing documents for quick search.
2.​ Matching – Finding and ranking documents based on similarity.

1. Indexing

Indexing is the process of organizing text so that it can be searched quickly.

Steps in Indexing:

1. Tokenization – Splitting text into individual words or phrases.
○​ Example:
■​ Sentence: "Artificial Intelligence is amazing!"
■​ Tokens: [‘Artificial’, ‘Intelligence’, ‘is’, ‘amazing’]
2.​ Removing Frequent Words (Stopwords Removal)
○​ Some words appear too often (e.g., is, the, and, of, in) but do not add meaning.
○​ These words are removed to reduce noise in search results.
○​ Example:
■​ Original: "The AI system is intelligent and fast."
■​ After removing stopwords: "AI system intelligent fast."
3.​ Stemming
○​ Converts words to their root form to improve matching.
○​ Example:
■​ ‘running’ → ‘run’
■​ ‘jumping’ → ‘jump’
■​ ‘studies’ → ‘study’
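
As a minimal sketch, the three steps can be chained in Python using NLTK's Porter stemmer (assuming nltk is installed; any stemmer would do, and the stopword list below is a tiny illustrative stand-in):

```python
from nltk.stem import PorterStemmer  # pip install nltk

STOPWORDS = {"is", "the", "and", "of", "in", "a", "an"}  # tiny illustrative list
stemmer = PorterStemmer()

def index_terms(sentence):
    tokens = sentence.lower().replace("!", " ").replace(".", " ").split()  # 1. tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]                     # 2. stopword removal
    return [stemmer.stem(t) for t in tokens]                               # 3. stemming

print(index_terms("Artificial Intelligence is amazing!"))
# ['artifici', 'intellig', 'amaz'] -- stemmer roots are not always dictionary words
```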

Indexing Techniques:

●​ Boolean Model
○​ Uses AND, OR, NOT to filter results.
○​ Example:
■​ Query: "Machine Learning AND Deep Learning"
■ Returns only documents that contain both terms (see the sketch after this list).
●​ Vector Space Model
○​ Represents documents as mathematical vectors to measure similarity.
○​ Used in TF-IDF ranking (explained below).
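
Here is a toy sketch of the Boolean model in Python, where an AND query is just a membership test over per-document term sets (the corpus is made up for illustration):

```python
docs = {
    "d1": "machine learning and deep learning",
    "d2": "machine learning algorithms",
    "d3": "deep learning and AI",
}
term_sets = {d: set(text.lower().split()) for d, text in docs.items()}

def boolean_and(*terms):
    # A document matches only if it contains every query term.
    return sorted(d for d, terms_in_d in term_sets.items()
                  if all(t in terms_in_d for t in terms))

print(boolean_and("machine", "learning"))  # ['d1', 'd2']
print(boolean_and("machine", "deep"))      # ['d1']
```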

2. Matching

Matching is the process of finding how similar a document is to a query.

To do this, IR systems use mathematical formulas to measure similarity.

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a common technique used to determine how important a word is in a document.

Formula:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Let’s break this formula into two parts:

1. Term Frequency (TF)

TF tells us how often a word appears in a document, relative to the document's total word count.

Example:
Imagine we have this document:​
➡ "AI is the future. AI is powerful."

● Total words in the document = 7
● "AI" appears 2 times
● TF(AI) = 2/7 ≈ 0.29

This means "AI" is an important word in this document.

2. Inverse Document Frequency (IDF)

IDF checks how rare a word is across all documents:

IDF(t) = log(N / DF(t)), where N is the total number of documents and DF(t) is the number of documents that contain the term t.

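
A minimal sketch of both statistics in Python, using the example document above (log base 10 is an assumption; natural log and base 2 are also common in practice):

```python
import math

def tf(term, doc_tokens):
    # Term frequency: the share of the document's tokens that are `term`.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus, base=10):
    # Inverse document frequency: rarer terms score higher.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df, base) if df else 0.0

doc = "ai is the future ai is powerful".split()
corpus = [doc, "ai beats humans at chess".split()]

print(round(tf("ai", doc), 2))      # 0.29 -> 2 of 7 tokens
print(round(idf("ai", corpus), 2))  # 0.0  -> "ai" appears in every document
```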
Why is TF-IDF Important?

● Helps rank search results
● Filters out common words
● Highlights important keywords

Example: How Google Search Uses IR & TF-IDF

➡ Query: "Best programming language for AI"

● Google scans millions of documents.
● Uses Indexing (tokenization, stemming, stopword removal).
● Matching is done using TF-IDF and other algorithms.
● Documents with high TF-IDF scores for "programming language" and "AI" will appear higher in search results.

Information Retrieval (IR) Process

Information Retrieval (IR) is the process of finding relevant information from a large collection of
unstructured data (e.g., text documents, web pages) based on a user’s query.

Step-by-Step Procedure of an IR System

1. Indexing the Collection of Documents

Before searching, the system needs to prepare and organize documents efficiently. This step is
called indexing and involves:

● Tokenization: Breaking text into words or phrases.
●​ Stopword Removal: Removing common words like "is", "the", "and".
●​ Stemming/Lemmatization: Reducing words to their root forms (e.g., "running" → "run").
●​ TF-IDF Calculation: Measuring the importance of words in documents.
●​ Inverted Index Creation: A mapping of words to the documents where they appear.

👉 Example:​
If we have three documents:

1. D1: "Information retrieval is important."
2.​ D2: "Retrieval techniques involve indexing."
3.​ D3: "Indexing helps in fast search."

The system creates an inverted index like this (after stopword removal):

information → D1
retrieval → D1, D2
important → D1
techniques → D2
involve → D2
indexing → D2, D3
helps → D3
fast → D3
search → D3

Now, when a user searches for "retrieval techniques", the system can quickly find D1 and D2.
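
A minimal Python sketch of building this inverted index (the stopword list is a tiny illustrative stand-in):

```python
from collections import defaultdict

docs = {
    "D1": "Information retrieval is important.",
    "D2": "Retrieval techniques involve indexing.",
    "D3": "Indexing helps in fast search.",
}
STOPWORDS = {"is", "in"}  # tiny illustrative list

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().replace(".", "").split():
        if token not in STOPWORDS:
            inverted[token].add(doc_id)

print(sorted(inverted["retrieval"]))  # ['D1', 'D2']
```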

2. Query Processing

The user submits a query (e.g., "fast indexing"), which must be processed similarly to the
documents:

●​ Tokenization
●​ Stopword removal
●​ Stemming/Lemmatization
●​ Conversion to a vector representation (like TF-IDF)

👉 Example: Query: "fast retrieval"
●​ Stopword removal → ["fast", "retrieval"]
●​ Stemming → ["fast", "retrieve"]
●​ Convert to vector form.

3. Matching (Comparing Query with Documents)

Now, the system compares the transformed query with the indexed documents.​
This is done using similarity measures like:

● Cosine Similarity (for vector-based models)
●​ BM25 (for ranking documents)

4. Ranking & Retrieval

● The documents are ranked based on their similarity to the query.
●​ The system retrieves the most relevant documents and displays them to the user.

👉 Example:
If the user searches for "retrieval techniques", the system ranks D2 highest because it contains both retrieval and techniques.

Vector Space Model (VSM) of Retrieval

The Vector Space Model (VSM) is a mathematical representation of documents and queries as
vectors in high-dimensional space. It allows similarity calculations between queries and
documents.

How It Works

1.​ Each document and query is represented as a vector in an n-dimensional space, where
each dimension is a unique word (term).
2.​ The importance of each word in a document is calculated using TF-IDF (Term
Frequency - Inverse Document Frequency).
3.​ The similarity between a query and a document is measured using cosine similarity.

TF-IDF Calculation (Weighting Terms)

Each word in a document is assigned a weight using TF-IDF, which is calculated as:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Where:

● TF (Term Frequency) = Number of times a word appears in a document.
●​ IDF (Inverse Document Frequency) = Measures how rare the word is across all
documents.

👉 Example: Let’s say we have the following three documents:

● D1: "machine learning and deep learning"
● D2: "machine learning algorithms"
● D3: "deep learning and AI"

Cosine Similarity (Comparing Query and Document)

To measure similarity, we calculate the cosine of the angle between query and document
vectors:

cosine_similarity(D1, Q) = (D1 · Q) / (||D1|| × ||Q||)

Where:

● D1 · Q = Dot product of document and query vectors.
● ||D1|| = Magnitude of the document vector.
● ||Q|| = Magnitude of the query vector.

👉 Example: If the query is "deep learning", we compute cosine similarity with each
document.

● D1: Cosine similarity = 0.85
●​ D2: Cosine similarity = 0.30
●​ D3: Cosine similarity = 0.95

So, D3 is most relevant!
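
For reference, the same pipeline can be sketched with scikit-learn (assuming it is installed); the exact scores will differ from the illustrative numbers above because of normalization details, but the ranking logic is identical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "machine learning and deep learning",  # D1
    "machine learning algorithms",         # D2
    "deep learning and AI",                # D3
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)            # index the corpus
query_vector = vectorizer.transform(["deep learning"])  # represent the query

scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc_id, score in zip(["D1", "D2", "D3"], scores):
    print(doc_id, round(score, 2))
```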

What is Term Weighting?

Term weighting is the process of assigning numerical values to words in a document or corpus
to reflect their importance in information retrieval (IR). The higher the weight of a term, the
greater its impact on document retrieval and ranking.

Importance of Term Weighting

1. Helps distinguish important terms from less significant ones.
2. Improves document ranking and relevance in IR systems.
3. Enhances search accuracy in text-based applications like search engines and NLP tasks.

Common Word Statistics / Term Weighting Methods

1. Term Frequency (TF)

Definition: Measures how often a term appears in a document.

Formula:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)

Example: Consider a document: "Information retrieval is the process of retrieving information from a database."

● Total words in the document = 11
● TF for "information" = 2 / 11 ≈ 0.18
● TF for "retrieval" = 1 / 11 ≈ 0.09
● TF for "database" = 1 / 11 ≈ 0.09

2. Document Frequency (DF)

Definition: Measures how many documents in a corpus contain a particular term.

Formula:

DF(t) = number of documents in the corpus that contain the term t

Example: Given a corpus with 3 documents:

1. "Data structures and retrieval are important in CS."
2.​ "Information retrieval focuses on fetching relevant data."
3.​ "Storage and retrieval techniques enhance performance."
●​ DF for "retrieval" = 3 (appears in all documents)
●​ DF for "data" = 2 (appears in documents 1 & 2)

3. Inverse Document Frequency (IDF)

Definition: Measures the importance of a term across multiple documents. Rare terms have
higher IDF values.

Formula:

IDF(t) = log(N / DF(t))

where:

● N = total number of documents
● DF(t) = document frequency of term t
Example: Using the previous example where N = 3:

● IDF for "retrieval" = log(3/3) = log(1) = 0 (common word)
●​ IDF for "data" = log(3/2) ≈ 0.176

Rare terms get higher IDF scores.

4. TF-IDF (Term Frequency-Inverse Document Frequency)

Definition: A combination of TF and IDF that balances word frequency and uniqueness.

Formula:

TF-IDF(t, d) = TF(t, d) × IDF(t)
Example: For "retrieval" in Document 1 (8 words):

● TF = 1/8 = 0.125
● IDF = log(3/3) = 0
● TF-IDF = 0.125 × 0 = 0 (common word, not important)

For "data" in Document 2 (7 words):

● TF = 1/7 ≈ 0.143
● IDF = 0.176
● TF-IDF = 0.143 × 0.176 ≈ 0.025 (relatively more important)
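
The worked example can be verified in a few lines of Python (log base 10, as above):

```python
import math

docs = [
    "data structures and retrieval are important in cs",        # Document 1
    "information retrieval focuses on fetching relevant data",  # Document 2
    "storage and retrieval techniques enhance performance",     # Document 3
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in tokenized if term in d)
    return tf * math.log10(N / df)

print(round(tf_idf("retrieval", tokenized[0]), 3))  # 0.0   (IDF = 0)
print(round(tf_idf("data", tokenized[1]), 3))       # 0.025 (1/7 × 0.176)
```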

5. TF-IDF with Document Length Normalization (TF-IDF-DLN)

● Accounts for document length variations.
● Normalizes TF-IDF scores so longer documents don't dominate.

Formula (one common form, dividing the score by document length):

TF-IDF-DLN(t, d) = TF-IDF(t, d) / |d|, where |d| is the number of terms in document d.

This ensures that documents of different lengths contribute fairly.

6. Word Frequency Distribution

Definition: A count of how frequently each word appears in a corpus, often visualized as a histogram.

Example: Given a document:
"AI and ML are important in AI applications."

● "AI" appears twice → High frequency
● "and" appears once → Low frequency

Use Case: Over a large corpus, the most frequent words tend to be stop words (common words like "and", "the"), so the frequency distribution helps identify which words to ignore in search.

7. Zipf’s Law

Definition: The frequency of a word is inversely proportional to its rank in the corpus.

Formula:

f ∝ 1/r (equivalently, f × r ≈ constant)

where:

● f = frequency of a word
● r = rank of the word in terms of frequency

Observation: The most frequent term occurs twice as often as the second-most, three times
as often as the third-most, and so on.
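
A quick way to eyeball Zipf's law on any large text is to rank words by frequency and check whether f × r stays roughly constant (the corpus.txt filename below is a placeholder for any large text file):

```python
from collections import Counter

text = open("corpus.txt").read().lower().split()  # placeholder: any large text file
counts = Counter(text)
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    # Under Zipf's law, freq * rank is roughly constant.
    print(rank, word, freq, freq * rank)
```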

Text Preprocessing

Text preprocessing involves cleaning and transforming raw text data into a format suitable for
analysis and modeling. It improves the quality and efficiency of downstream NLP tasks by
addressing noise, inconsistency, and irrelevant information.

Common Techniques in Text Preprocessing

● Lowercasing: Converting text to lowercase standardizes the text and reduces vocabulary size.
●​ Tokenization: Splitting text into smaller units (tokens), such as words or phrases, to
facilitate further analysis.
●​ Removing Punctuation: Eliminating punctuation marks simplifies text and reduces
noise.
●​ Stemming and Lemmatization:
○​ Stemming: Reduces words to their base form by removing prefixes and suffixes.
○​ Lemmatization: Maps words to their dictionary form (lemma), improving
coherence.
●​ Removing Numbers and Special Characters: Eliminating numerical digits and
symbols focuses on textual content.
●​ Handling HTML Tags and URLs: Extracting only text by removing markup elements
and hyperlinks.
●​ Handling Contractions and Abbreviations: Expanding contractions (e.g., “can’t” →
“cannot”) enhances consistency.
●​ Spell Checking and Correction: Detecting and fixing spelling errors improves analysis
quality.
●​ Text Normalization: Standardizing spellings and variations to achieve consistency and
reduce vocabulary size.
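
A minimal preprocessing pipeline covering several of these techniques might look like this in Python (the contraction table is a tiny illustrative stand-in):

```python
import re

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def preprocess(text):
    text = text.lower()                         # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)        # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    for short, full in CONTRACTIONS.items():    # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)       # drop punctuation and digits
    return text.split()                         # tokenization

print(preprocess("It's <b>great</b>: AI can't fail! See https://example.com"))
# ['it', 'is', 'great', 'ai', 'cannot', 'fail', 'see']
```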

Indexing

Indexing in information retrieval involves creating data structures that efficiently store and
retrieve documents based on content or metadata. It represents documents as vectors in a
high-dimensional space, where each dimension corresponds to a unique term.

● TF-IDF-based Indexing: Word statistics such as TF-IDF scores are used to construct document vectors, with TF-IDF values serving as term weights.
●​ Efficient Retrieval: Indexing enables quick identification of relevant documents by
analyzing term distributions.

Query Processing

Query processing ensures accuracy by retrieving relevant documents based on user queries.
The effectiveness of an IR system depends on how well the query is formulated.

Key Aspects of Query Processing

●​ Matching Query Terms with Indexed Documents: The system identifies relevant
documents by comparing indexed terms with query terms.
●​ Term Weighting in Query Ranking:
○​ Terms with high Inverse Document Frequency (IDF) contribute significantly to
relevance.
○ Terms that are rare in the corpus but appear in the query are strong indicators of relevance.
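
As a sketch, this can be implemented by scoring each document with the summed IDF of the query terms it contains, so rare matched terms dominate the ranking (the corpus below is a toy example):

```python
import math

docs = {
    "D1": {"information", "retrieval", "important"},
    "D2": {"retrieval", "techniques", "indexing"},
    "D3": {"indexing", "helps", "fast", "search"},
}
N = len(docs)

def idf(term):
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log10(N / df) if df else 0.0

def score(query_terms, doc_terms):
    # Sum the IDF weights of the query terms the document actually contains.
    return sum(idf(t) for t in query_terms if t in doc_terms)

query = {"retrieval", "techniques"}
ranking = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranking)  # ['D2', 'D1', 'D3'] -- D2 matches both terms, including the rare one
```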

Relevance Feedback

Relevance feedback is an interactive technique in IR systems where users provide feedback on retrieved documents to refine search results iteratively.

Types of Relevance Feedback

1. Implicit Feedback:
○​ Inferred from user behavior (e.g., dwell time, scrolling, document selection).
○​ Example: The longer a user spends on a document, the more relevant it is
assumed to be.
2.​ Explicit Feedback:
○​ Direct user assessment of document relevance.
○​ Relevance Systems:
■​ Binary Relevance System: A document is either relevant (1) or
irrelevant (0).
■​ Graded Relevance System: Documents are rated on a scale (e.g., "not
relevant," "somewhat relevant," "very relevant").
3.​ Pseudo Feedback (Blind Feedback):
○​ Automates relevance feedback without user interaction.
○​ Enhances retrieval performance by assuming the top-ranked documents are
relevant.
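
One classic formulation of explicit relevance feedback (not named in the text above, but standard in IR) is the Rocchio algorithm: nudge the query vector toward documents marked relevant and away from those marked irrelevant. A minimal sketch, with conventional default weights:

```python
import numpy as np

def rocchio(query_vec, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """All inputs are TF-IDF vectors (numpy arrays with the same dimensions)."""
    new_q = alpha * query_vec
    if len(relevant):
        new_q = new_q + beta * np.mean(relevant, axis=0)
    if len(irrelevant):
        new_q = new_q - gamma * np.mean(irrelevant, axis=0)
    return np.clip(new_q, 0, None)  # negative term weights are usually dropped

q = np.array([1.0, 0.0, 0.0])
rel = np.array([[0.8, 0.6, 0.0]])    # document the user marked relevant
irrel = np.array([[0.0, 0.0, 0.9]])  # document the user marked irrelevant
print(rocchio(q, rel, irrel))        # the query drifts toward the relevant doc
```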
