
Information Retrieval

NINAD MIRAJKAR
[email protected]
MOB: 9321 727 943
Part 1 – Unit 1
Introduction:
- Overview of IR Systems
- Historical Perspectives - Goals of IR
- The impact of the web on IR
- The role of artificial intelligence (AI) in IR.
In today’s lecture..
• We cover some concepts from Unit 1.
• The definition of IR
• An overview of the IR system
• Components of an IR system
• The past, present and future of IR
In Unit 1
Introduction:
Overview of IR Systems - Historical Perspectives - Goals of IR - The impact of the web on IR - The role of
artificial intelligence (AI) in IR.

Text representation: Statistical Characteristics of Text: Zipf's law; Porter stemmer; morphology; index
term selection; using thesauri. Basic Tokenizing,

Indexing: Simple tokenizing, stop-word removal, and stemming; inverted indices; Data Structure and File
Organization for IR - efficient processing with sparse vectors
Definition
“A software program that deals with the organization, storage, retrieval, and evaluation of information from
document repositories, particularly textual information.”

Information Retrieval is the activity of obtaining material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). For example, information retrieval takes place whenever a user enters a query into a search system.

It is not only librarians and professional searchers who engage in information retrieval: nowadays hundreds of millions of people perform IR every day when they use web search engines. Information Retrieval is believed to be the dominant form of information access.
Overview of IR Systems:

Information Retrieval (IR) is a field of study that deals with the organization, storage, retrieval, and
presentation of information from various sources.
IR systems are designed to help users find relevant information efficiently and effectively. They are
essential in managing large volumes of data, such as those found on the web or in databases.
These systems play a crucial role in managing and organizing various types of information, including text
documents, images, audio, and video. The process of information retrieval involves matching user
queries with indexed data and presenting the most relevant results.
Difference between Information and Data
Data

Data is defined as a collection of individual facts or statistics. Data can come in the form of text,
observations, figures, images, numbers, graphs, or symbols. For example, data might include individual
prices, weights, addresses, ages, names, temperatures, dates, or distances.

Data is a raw form of knowledge and, on its own, doesn’t carry any significance or purpose. In other
words, you have to interpret data for it to have meaning. Data can be simple—and may even seem
useless until it is analyzed, organized, and interpreted.

Quantitative data is provided in numerical form, like the weight, volume, or cost of an item.

Qualitative data is descriptive, but non-numerical, like the name, sex, or eye color of a person.
Difference between Information and Data
Information

Information is defined as knowledge gained through study, communication, research, or instruction.


Essentially, information is the result of analyzing and interpreting pieces of data. Whereas data is the
individual figures, numbers, or graphs, information is the perception of those pieces of knowledge.

For example, a set of data could include temperature readings in a location over several years. Without
any additional context, those temperatures have no meaning. However, when you analyze and organize
that information, you could determine seasonal temperature patterns or even broader climate trends.
Only when the data is organized and compiled in a useful way can it provide information that is
beneficial to others.
The Key Differences Between Data vs
Information
•Data is a collection of facts, while information puts those facts into context.
•While data is raw and unorganized, information is organized.
•Data points are individual and sometimes unrelated. Information maps out that data to provide a
big-picture view of how it all fits together.

•Data, on its own, is meaningless. When it’s analyzed and interpreted, it becomes meaningful
information.

•Data does not depend on information; however, information depends on data.


•Data typically comes in the form of graphs, numbers, figures, or statistics. Information is typically
presented through words, language, thoughts, and ideas.

•Data isn’t sufficient for decision-making, but you can make decisions based on information.
Examples of Data vs Information

1. At a restaurant, a single customer’s bill amount is data. However, when the restaurant owners collect
and interpret multiple bills over a range of time, they can produce valuable information, such as what
menu items are most popular and whether the prices are sufficient to cover supplies, overhead, and
wages.

2. The number of likes on a social media post is a single element of data. When that’s combined with
other social media engagement statistics, like followers, comments, and shares, a company can intuit
which social media platforms perform the best and which platforms they should focus on to more
effectively engage their audience.
Historical Perspectives
The roots of Information Retrieval can be traced back to early systems like library catalogs, which
aimed to organize and retrieve books and documents. The development of digital computers and the
internet revolutionized IR, making it possible to index and search vast amounts of information
electronically. Early systems like Boolean and vector space models laid the foundation for modern IR
approaches. In the 1950s and 1960s, the development of electronic computers led to the first digital
information retrieval systems.
One of the earliest information retrieval systems was the Cranfield project, initiated at the Cranfield
Institute of Technology in the UK in the late 1950s. The Cranfield experiments laid the groundwork for systematic evaluation of retrieval systems, introducing test collections and relevance judgments.
In the 1960s, researchers like Gerard Salton made significant contributions to IR by introducing the
SMART (System for the Mechanical Analysis and Retrieval of Text) information retrieval system.
SMART employed the vector space model and statistical techniques for term weighting and ranking.
Past, Present, and Future of Information Retrieval
1. Early Developments: As the volume of information grew, it became necessary to build data structures that give faster access to it. The index is the central data structure for fast retrieval of information; for centuries, indexes were built by manually categorising material into hierarchies.

2. Information Retrieval in Libraries: Libraries were among the first to adopt IR systems. First-generation systems automated existing technologies (the card catalogue), with search based on author name and title. Second-generation systems added searching by subject heading, keywords, etc. Third-generation systems introduced graphical interfaces, electronic forms, hypertext features, and so on.

3. The Web and Digital Libraries: The web is cheaper than many other sources of information, provides greater access through digital communication networks, and gives anyone free access to publish to a very large audience.
Types of IR model

We will discuss a few models in brief. The models are covered in detail in Unit 2 of this course.
Boolean Model
The Boolean model is one of the earliest and simplest IR models. It uses Boolean logic (AND, OR, NOT)
to combine query terms and retrieve documents that match the query. In this model, each document is
represented as a binary vector of terms, where the presence or absence of a term is indicated by 1 or 0,
respectively. Queries are also represented as Boolean expressions. The Boolean model is useful for precise retrieval, but because it returns an unranked result set, it may retrieve far too many or far too few documents for a given query.
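To make this concrete, here is a minimal sketch of Boolean retrieval over a tiny hypothetical collection (the documents and queries are invented for illustration, not taken from the slides):

# Boolean retrieval sketch: each document is reduced to the set of terms it
# contains, and queries combine those sets with AND / OR / NOT (set algebra).
docs = {
    1: "information retrieval with boolean logic",
    2: "vector space models for retrieval",
    3: "boolean algebra and set theory",
}

# Binary presence/absence representation: the set of terms per document.
term_sets = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def matching(term):
    """Return the ids of documents whose term set contains the given term."""
    return {doc_id for doc_id, terms in term_sets.items() if term in terms}

# "retrieval AND boolean" -> intersection of the two posting sets
print(matching("retrieval") & matching("boolean"))                          # {1}
# "(retrieval OR boolean) NOT vector" -> union, then set difference
print((matching("retrieval") | matching("boolean")) - matching("vector"))   # {1, 3}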
Vector Space Model
The vector space model represents both documents and queries as vectors in a multi-dimensional
space. Each dimension corresponds to a term, and the value of each dimension represents the
weight or importance of the term in the document or query. The similarity between a query vector
and a document vector is calculated using various similarity metrics, such as cosine similarity.
Documents with higher similarity scores are considered more relevant. The vector space model is widely used in modern IR systems and allows for more flexible, ranked (partial-match) retrieval than the Boolean model.
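As an illustration, a minimal sketch of the cosine-similarity computation at the heart of the model (the vocabulary, query, and vectors are invented; raw term counts are used as weights rather than any particular weighting scheme):

import math

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Dimensions correspond to the vocabulary ["information", "retrieval", "vector"].
query = [1, 1, 0]   # "information retrieval"
doc_a = [2, 3, 0]   # repeatedly mentions information/retrieval
doc_b = [0, 1, 4]   # mostly about vectors

print(round(cosine(query, doc_a), 2))  # 0.98 -> ranked first
print(round(cosine(query, doc_b), 2))  # 0.17 -> ranked lower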
Probabilistic Model
The probabilistic model treats the retrieval process as a probabilistic event. It calculates the
probability of relevance for each document given a user query. The model estimates the probability
using factors like term frequency, document frequency, and collection frequency. One of the popular
probabilistic models is the Okapi BM25 (Best Matching 25) model, which is commonly used in web
search engines.
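A minimal sketch of BM25 scoring in a common textbook form (k1 = 1.5 and b = 0.75 are typical default parameters, and the document and collection statistics are invented; none of these values come from the slides):

import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Score one document for a query with an Okapi BM25-style formula."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        n = doc_freq.get(term, 0)            # number of documents containing the term
        if n == 0:
            continue
        idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1)   # non-negative IDF variant
        tf = doc_terms.count(term)           # term frequency in this document
        denom = tf + k1 * (1 - b + b * doc_len / avg_doc_len)  # length normalization
        score += idf * (tf * (k1 + 1)) / denom
    return score

doc = "information retrieval systems rank documents".split()
print(round(bm25_score(["retrieval", "ranking"], doc,
                       doc_freq={"retrieval": 3, "ranking": 1},
                       num_docs=10, avg_doc_len=6), 3))   # only "retrieval" matches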
Language Model
The language model IR is based on statistical language modeling techniques. It treats both the
query and the document as language models. The goal is to estimate the probability of generating
the query given the document's language model. The language model approach helps in handling
term dependencies and term co-occurrences. It has been successful in retrieval tasks, particularly in
pseudo-relevance feedback scenarios.
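A minimal sketch of query-likelihood scoring with Jelinek-Mercer smoothing (the smoothing weight lam = 0.5 and the toy texts are illustrative assumptions; real systems estimate the background statistics from the full corpus):

import math

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | document language model), smoothed with the collection model."""
    doc_len, coll_len = len(doc), len(collection)
    log_prob = 0.0
    for term in query:
        p_doc = doc.count(term) / doc_len            # maximum-likelihood estimate
        p_coll = collection.count(term) / coll_len   # background probability
        log_prob += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_prob

collection = "information retrieval ranks documents by relevance to a query".split()
doc = "information retrieval ranks documents".split()
# Higher (less negative) scores mean the document is more likely to "generate" the query.
print(round(query_log_likelihood(["retrieval", "documents"], doc, collection), 3))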
Fuzzy Retrieval Model
The fuzzy retrieval model considers the uncertainty in relevance assessments. Instead of providing
binary relevance judgments (relevant or non-relevant), users can specify degrees of relevance for
documents. This model is useful when the relevance of documents is subjective or ambiguous.
Fuzzy retrieval models often employ fuzzy logic or fuzzy set theory to handle uncertain and
imprecise information.
Latent Semantic Indexing (LSI) Model
The LSI model is a dimensionality reduction technique used in IR to capture the latent semantic
structure in a collection of documents. It transforms the term-document matrix into a
lower-dimensional space by performing singular value decomposition (SVD). LSI allows for
semantic relationships between terms and documents to be captured, enabling better retrieval of
conceptually related documents.
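A minimal sketch of the LSI idea using NumPy's singular value decomposition on a tiny invented term-document matrix (values and vocabulary are made up for illustration):

import numpy as np

# Rows = terms, columns = documents (raw counts, invented for illustration).
# Term order: ["car", "automobile", "engine", "flower"]
A = np.array([[2, 0, 0, 0],
              [0, 2, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 3]], dtype=float)

# Decompose and keep only the k largest singular values (rank-k approximation).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# In the reduced space, "car" and "automobile" receive similar representations
# because both co-occur with "engine", even though they never share a document.
print(np.round(A_k, 2))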
Neural IR Model
Neural IR models utilize neural networks and deep learning techniques to perform various IR tasks.
These models have shown promising results in learning complex patterns and capturing semantic
relationships between words and documents. Neural IR models can be used for tasks like query
understanding, relevance ranking, and document clustering.

---------------------------------------------------------------------------------------------------------------------------------------------------------------
Each IR model has its strengths and weaknesses, and the choice of the appropriate model depends on the specific requirements of the retrieval task, the size and nature of the collection, and the user's information needs. In practice, hybrid models that combine different IR models are often used to achieve more accurate and effective retrieval results.
IR systems consist of several key processes:

1. Document Collection: This is the set of documents that the IR system will index and search. It can range
from a small local database to the entire World Wide Web.
2. Preprocessing: Before indexing, text documents undergo preprocessing steps, including tokenization,
stop-word removal, and stemming. Preprocessing helps standardize the text and reduces variations in
word forms, enabling more accurate retrieval.
3. Indexing: The process of creating an inverted index, which maps terms (words or phrases) to the
documents that contain them. The inverted index allows for efficient lookup of documents containing
specific terms, making retrieval faster.
4. Query Processing: When a user submits a query, it goes through the same preprocessing steps as the
documents. The processed query is then matched against the inverted index to identify relevant
documents.
5. Ranking: The retrieved documents need to be ranked based on their relevance to the query. Various
ranking algorithms, such as TF-IDF (Term Frequency-Inverse Document Frequency) and BM25, assign
scores to documents based on term frequency, document frequency, and other factors.
6. Presentation of Results: The top-ranked documents are presented to the user in a user-friendly format,
often as a list of titles and snippets.
Components of an IR system
The components of an Information Retrieval (IR) model are the fundamental building blocks that
define how the model operates and processes information. These components collectively
determine how the model indexes, retrieves, and ranks documents to match user queries. The main
components of an IR model include:
1. Document Representation: This component defines how documents are represented in the IR
system. It involves converting raw text or other types of data (e.g., images, audio) into a suitable
format that can be indexed and processed. Document representation is crucial for capturing the
content and characteristics of each document, enabling efficient retrieval.
2. Query Representation: Query representation determines how user queries are processed and
matched against indexed documents. Queries can be represented as sets of terms, vectors, or other
data structures that facilitate matching with document representations.
3. Indexing: Indexing involves the creation of data structures that efficiently map terms to the
documents that contain them. The most common indexing structure is the inverted index, which
lists terms along with pointers to the documents in which they appear. Indexing plays a vital role in
speeding up the retrieval process by enabling quick access to relevant documents.
4. Term Weighting: Term weighting assigns importance scores to terms based on their frequency
and relevance. Weighting helps to distinguish between important and less important terms during
retrieval. Popular term weighting schemes include TF-IDF (Term Frequency-Inverse Document
Frequency) .
5. Retrieval Algorithm: The retrieval algorithm determines how documents are ranked and selected
for presentation to the user. Different IR models use various algorithms, such as Boolean operations,
vector similarity measures, or probabilistic scoring functions, to determine document relevance.
6. Ranking Strategy: The ranking strategy dictates how the retrieved documents are ordered to
present the most relevant ones at the top of the result list. The ranking strategy may consider
factors like document similarity to the query, document popularity, or user feedback.
7. Relevance Feedback (optional): Some IR models incorporate relevance feedback, where users
provide feedback on the initial search results, and the model refines the results based on this
feedback. Relevance feedback helps improve the accuracy of retrieval by incorporating user
preferences.
8. Query Processing: Query processing involves various preprocessing steps applied to user queries
before they are matched against the indexed data. These steps may include tokenization, stop-word
removal, stemming, and other text processing techniques to standardize and enhance query
representation.
9. Presentation of Results: This component deals with how the search results are presented to the
user. The results can be displayed as a ranked list, snippets of text, images, or any other format that
facilitates user understanding and interaction.
10. Evaluation Metrics (for model assessment): To assess the performance of an IR model,
evaluation metrics are used. These metrics measure the quality of retrieval results, such as
precision, recall, F1-score, mean average precision (MAP), and normalized discounted cumulative
gain (NDCG).
---------------------------------------------------------------------------------------------------------------------------------------------------------------
These components work together in an IR model to enable effective and efficient information
retrieval. The specific implementation and design of these components vary across different IR
models, reflecting the diversity of approaches used to address the challenges of information
retrieval.
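To make the evaluation metrics listed in component 10 concrete, here is a minimal sketch of precision, recall, and average precision for a single query (the ranked list and relevance judgments are invented for illustration):

def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Average precision for one query (the mean over many queries gives MAP)."""
    hits, total = 0, 0.0
    for i, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / i          # precision at each rank where a relevant doc appears
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2"]        # system output, best first
relevant = {"d1", "d2"}                  # ground-truth judgments
print(precision_recall(ranked, relevant))    # (0.5, 1.0)
print(average_precision(ranked, relevant))   # (1/2 + 2/4) / 2 = 0.5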
Sequence of activities: how the system works
The flowchart illustrates the typical sequence of operations in an IR system, from the user query to the presentation of relevant search results. It highlights how each component contributes to the overall retrieval process and how the system functions as a cohesive unit to deliver accurate and useful information to the user.

User Query → Query Processing → Query Representation → Indexing → Term Weighting → Retrieval Algorithm → Ranking Strategy → Presentation of Results → User Receives Results

Note: the specific retrieval algorithm does its work at the Retrieval Algorithm stage; the rest of the steps are followed in general and are independent of the algorithm chosen.
Explanation of Flowchart:
User Query: The process starts when a user enters a query to search for specific information.
Query Preprocessing: The user query undergoes preprocessing steps, such as tokenization,
stop-word removal, and stemming, to standardize and enhance its representation.
Query Representation: The preprocessed query is converted into a suitable data structure, such as a
set of terms or a vector, that can be matched against the indexed data.
Indexing: The IR system creates an inverted index that maps terms to the documents containing
them. This index allows for efficient retrieval of relevant documents during the search process.
Term Weighting: Term weighting assigns importance scores to the terms in the query and
documents based on their frequency and relevance.
Retrieval Algorithm: The IR system uses a retrieval algorithm (e.g., Boolean operations, vector
similarity measures, probabilistic scoring) to identify relevant documents that match the user's
query.
Ranking Strategy: The retrieved documents are ranked based on their relevance to the query using a
ranking strategy, such as sorting by similarity scores or incorporating user feedback.
Presentation of Results: The top-ranked documents are presented to the user in a user-friendly
format, such as a list of titles and snippets, images, or other relevant information.
User Receives Results: The user receives the search results, and based on the presented documents,
they can interact with the retrieved information.
The Impact of the Web on IR
The advent of the World Wide Web brought new challenges and opportunities to Information
Retrieval. Web search engines, such as Google, Bing, and Yahoo, have become an integral part of our
daily lives. The sheer size and dynamic nature of the web necessitated the development of
specialized algorithms and techniques to crawl, index, and rank web pages effectively.
The emergence of the World Wide Web in the 1990s brought unprecedented challenges and
opportunities to information retrieval. The web presented an enormous and dynamic collection of
interconnected documents, making traditional IR techniques less effective.
Web search engines like Google revolutionized IR by combining efficient crawling, indexing, and
ranking algorithms. Google's PageRank algorithm, developed by Larry Page and Sergey Brin,
revolutionized web search by considering the link structure between web pages to determine
relevance and authority.
The Role of Artificial Intelligence (AI) in
IR
Artificial Intelligence plays a crucial role in modern IR systems. AI techniques, such as natural
language processing (NLP), machine learning, and deep learning, are employed to improve various
aspects of IR, including query understanding, relevance ranking, and user personalization.
AI-powered systems can learn from user interactions and feedback, leading to more accurate and
context-aware search results.
1. Query Understanding: AI-powered systems can better understand user queries, accounting for
synonyms, word order, and context, leading to more accurate results.
2. Relevance Ranking: Machine learning models can learn from user interactions and feedback to
improve ranking algorithms and present more relevant results.
3. Personalization: AI enables the creation of personalized IR experiences, tailoring search results to
individual user preferences and behaviors.
4. Natural Language Search: Voice-based and natural language search has become more prevalent,
allowing users to interact with IR systems using spoken language.
AI-powered IR systems continue to evolve, and ongoing research in this area is essential for further
improvements and innovations in the field.
References Links
1. https://bloomfire.com/blog/data-vs-information/

2. https://www.geeksforgeeks.org/what-is-information-retrieval/

3. https://www.geeksforgeeks.org/difference-between-information-and-data/
4. The Cranfield Project
5. The SMART Information Retrieval System
6. Google's PageRank Algorithm
Part 2 – Unit 1
Text representation:
Statistical Characteristics of Text: Zipf's law;
Porter stemmer; morphology;
Index term selection;
Using thesauri.
In this lecture..
1. Text Representation in Information Retrieval
2. Statistical Characteristics of Text: Zipf's Law:
3. Porter Stemmer and Morphology:
4. Index Term Selection and Thesauri:
Text Representation in Information
Retrieval
Statistical Characteristics of Text: Zipf's
Law:
Zipf's law states that in a given corpus* of natural language text, the frequency of any word is inversely
proportional to its rank in the frequency table.
In simpler terms, a few words occur very frequently (like "the," "and," "is") while the majority of words
occur rarely. Understanding Zipf's law helps in designing efficient indexing and retrieval strategies, such
as term weighting and relevance ranking.
Zipf's law, proposed by linguist George Zipf, is an empirical law that describes the frequency distribution
of words in natural language.
*corpus = collection
Zipf's law has significant implications for
information retrieval:
1. Term Frequency Weighting: Inverted index-based IR systems often use term frequency (TF)
weighting to rank documents. Zipf's law suggests that frequent terms are less informative than rare
terms. Therefore, TF-IDF weighting, which considers both term frequency and inverse document
frequency, helps in giving more importance to rare terms.
2. Stop-Word Removal: Since common words appear frequently and have little discriminative power,
they are often removed during preprocessing. Stop-word lists are created, containing words like "a,"
"an," "the," etc., which are omitted during indexing.
Mathematically, Zipf's law can be
expressed as:
f(r) = C / r^s

Where:
• f(r) is the frequency of the word at rank r
• C is a constant
• s is the exponent typically close to 1
Zipf's Law
Zipf's Law is an empirical law that describes the frequency distribution of words in natural language
texts. It states that the frequency of any word is inversely proportional to its rank in the frequency
table. In simpler terms, a few words occur very frequently, while the majority of words occur rarely.
Mathematically, Zipf's Law can be expressed as: f(r) = C / r^s
Explanation:
Zipf's Law suggests that the most frequent word in a language will appear approximately twice as
often as the second most frequent word, three times as often as the third most frequent word, and
so on. This power-law distribution is common in various natural phenomena, and it has significant
implications for information retrieval and language processing.
Let's illustrate Zipf's Law using a simple
example
Example: Consider the following text corpus containing seven sentences:
1. "Information retrieval is essential for data science.“
2. "Information retrieval systems aim to find relevant documents.“
3. "Data science is an interdisciplinary field.“
4. "Information retrieval deals with indexing and searching.“
5. "Data science includes statistics and machine learning.“
6. "Search engines use information retrieval techniques.“
7. "Data science is widely used in various industries."
Step 1: Tokenization and Word Frequency
Count
Tokenize the text corpus (lowercasing each word) and count the frequency of each word:

Word               Frequency
information        4
retrieval          4
data               4
science            4
is                 3
and                2
essential          1
for                1
systems            1
aim                1
to                 1
find               1
relevant           1
documents          1
an                 1
interdisciplinary  1
field              1
deals              1
with               1
indexing           1
searching          1
includes           1
statistics         1
machine            1
learning           1
search             1
engines            1
use                1
techniques         1
widely             1
used               1
in                 1
various            1
industries         1
Step 2: Rank Calculation
Rank the words based on their frequencies:

Rank  Word         Frequency
1     information  4
2     retrieval    4
3     data         4
4     science      4
5     is           3
6     and          2
7     essential    1
8     for          1
9     systems      1
10    aim          1
...   ...          ...
Step 3: Zipf's Law Calculation
Now, calculate the expected frequency for each rank using the Zipf's Law formula.
Let's assume C = 15 (an arbitrary constant) and s = 1 (the exponent for Zipf's Law is often close to 1).

Rank  Word         Frequency  Expected Frequency (Zipf's Law)
1     information  4          15 / 1^1 = 15
2     retrieval    4          15 / 2^1 = 7.5
3     data         4          15 / 3^1 = 5
4     science      4          15 / 4^1 = 3.75
5     is           3          15 / 5^1 = 3
6     and          2          15 / 6^1 = 2.5
7     essential    1          15 / 7^1 = 2.14
8     for          1          15 / 8^1 = 1.88
9     systems      1          15 / 9^1 = 1.67
10    aim          1          15 / 10^1 = 1.5
...   ...          ...        ...
Step 4: Comparison and Analysis
Compare the actual word frequencies with the expected frequencies based on Zipf's Law. You will
notice that while the actual word frequencies do not precisely match the expected frequencies, there
is a clear trend of the frequencies decreasing with increasing ranks, as predicted by Zipf's Law.
Keep in mind that Zipf's Law is a statistical observation and may not hold exactly for every corpus.
Nevertheless, it provides valuable insights into the distribution of word frequencies in natural
language texts. In information retrieval, Zipf's Law is used in term weighting and ranking algorithms
to prioritize rare and discriminative terms for better retrieval performance.
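The whole worked example can be reproduced with a short sketch (pure Python; C = 15 and s = 1 are the same illustrative choices used above):

import re
from collections import Counter

corpus = [
    "Information retrieval is essential for data science.",
    "Information retrieval systems aim to find relevant documents.",
    "Data science is an interdisciplinary field.",
    "Information retrieval deals with indexing and searching.",
    "Data science includes statistics and machine learning.",
    "Search engines use information retrieval techniques.",
    "Data science is widely used in various industries.",
]

# Step 1: tokenize (lowercased words) and count frequencies.
tokens = re.findall(r"[a-z]+", " ".join(corpus).lower())
counts = Counter(tokens)

# Steps 2-3: rank by frequency and compare with the Zipf prediction C / r^s.
C, s = 15, 1
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    expected = C / rank ** s
    print(f"{rank:>2}  {word:<12} actual={freq}  zipf={expected:.2f}")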
Why is the value of C = 15?
I used an arbitrary constant value of 15 in the example of Zipf's Law to demonstrate how the
expected frequencies can be calculated for illustration purposes. In reality, the value of the constant
(C) in Zipf's Law is not fixed and can vary depending on the specific corpus or collection of
documents being analyzed.
In Zipf's Law, the constant (C) represents the scaling factor that adjusts the overall distribution of
word frequencies. The value of C is typically determined empirically based on the observed
frequency of the most frequent word (often referred to as the "top word"). Different collections or
corpora will have different most frequent words, leading to varying values of C.
To calculate a more accurate value for the constant C, you would need to:
1. Observe the actual frequency of the most frequent word in the corpus.
2. Plug that frequency and the corresponding rank (usually 1) into the Zipf's Law equation: f(r) = C / r^s
where r is the rank and s is the exponent, which is typically close to 1.
By solving for C using the actual frequency and rank, you can determine the appropriate constant
value for the specific collection you are working with. The goal is to find the C value that best fits the
observed word frequencies to the power-law distribution represented by Zipf's Law.
Keep in mind that Zipf's Law is an empirical observation and may not precisely hold for every corpus.
The value of C and the exponent s might vary slightly depending on the size and nature of the
dataset being analyzed. Nonetheless, Zipf's Law remains a valuable tool for understanding the
distribution of word frequencies in natural language texts and has important implications for
information retrieval and language modeling.
so how do I take the value of constant for a
different example? can I take any random
value, say 21????
Yes, you can determine the value of the constant (C) in Zipf's Law for a different example by
following a similar approach as mentioned before. The goal is to find the value of C that best fits the
observed word frequencies to the power-law distribution represented by Zipf's Law.
Here's a step-by-step approach to calculate the value of the constant (C) for a different example:
1. Observe the actual frequency of the most frequent word in your corpus (let's call this frequency "f1").
2. Determine the rank of the most frequent word (it is typically 1 since it is the most frequent word).
3. Plug the observed frequency and rank into the Zipf's Law equation: f(r) = C / r^s
4. Rearrange the equation to solve for C: C = f(r) * r^s
5. Substitute the values of f(r) and r into the equation to find the value of C.
For example, let's say you have a corpus, and the most frequent word is "apple," and it occurs 100
times in the entire collection. Using the Zipf's Law equation with s = 1 (the exponent is often close to
1), and the rank of "apple" is 1:
C = 100 * 1^1 = 100
So, in this case, the value of the constant C would be 100 for your specific example.
It's important to note that the constant C is not always an integer value, and it can vary based on the
characteristics of the corpus being analyzed. You may also choose to use a different exponent value
(s) depending on the specific fit of Zipf's Law to your data. Experimenting with different values of C
and s can help you gain insights into the distribution of word frequencies in your corpus.
Porter Stemmer and Morphology
Stemming is the process of reducing words to their base or root form (stem) to improve recall in IR
systems.
OR
Stemming is the process of reducing inflected (or sometimes derived) words to their word stems or
roots. The Porter stemming algorithm is a widely used technique to normalize words by removing
common suffixes. By reducing words to their base form, stemming helps in overcoming variations in
word forms and enhances the recall of IR systems.
The Porter stemming algorithm follows a set of rules to remove common suffixes from words, such
as "-ing," "-ed," "-s," etc. The goal is to map related word forms to the same stem, so variations of a
word can be matched during retrieval. For example, the words "running" and "runs" would both be stemmed to "run" (an irregular form such as "ran," however, is not handled by simple suffix stripping).

Stem = originate in or be caused by.


Think of the many words in Indian languages that originate from Sanskrit. In Sanskrit, words are derived from a root word, e.g. बध् (badh), meaning to bind, to restrain, to loathe, to be disgusted with, to shrink from.
Note how a single root word can mean different things depending on how it is used (remember context). Also note how the root word changes over time and in sister languages like Hindi and Marathi, where the word with the same meaning is 'bandhana'.
How the Porter Stemmer Works:
The Porter Stemmer follows a series of rules and steps to transform words into their stems. The
algorithm consists of five main phases, each targeting specific suffixes in the word. During each
phase, a set of rules is applied sequentially to remove suffixes, simplifying the word to its base form.
Phase 1: Handling Plurals
• Example: "cats" → "cat"
Phase 2: Handling Past Tense Verbs and Adjectives
• Example: "jumped" → "jump"
Phase 3: Handling Verb Endings
• Example: "running" → "run"
Phase 4: Handling Adjective Endings
• Example: "higher" → "high"
Phase 5: Handling Suffixes
• Example: "agreement" → "agre"
The Porter Stemmer is based on a set of heuristic rules, and it attempts to apply the rules in a
specific order to achieve effective stemming results. While the Porter Stemmer is widely used and
effective for many applications, it is not perfect and can produce stemmings that are not actual
words or may not fully capture the intended root of certain words.
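If NLTK is installed, the Porter stemmer can be tried directly (a minimal sketch; the outputs come from NLTK's implementation and may differ slightly from the simplified phase examples above):

from nltk.stem import PorterStemmer   # requires: pip install nltk

stemmer = PorterStemmer()
for word in ["cats", "jumped", "running", "happily", "replacement"]:
    print(word, "->", stemmer.stem(word))
# e.g. "running" -> "run" and "happily" -> "happili"; some stems are not
# real words, which is expected behaviour for a rule-based suffix stripper.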
Morphology
Morphology deals with the study of word structure and formation. Understanding the morphology of
languages can help in designing more sophisticated stemmers and improving the performance of
IR systems for languages with complex word forms and inflections.
Morphology Concepts:
Morphology is the study of the structure and formation of words in a language. It deals with the
internal structure of words and the rules governing how words are formed from smaller units called
morphemes. Morphemes are the smallest grammatical units that carry meaning. Understanding
morphology is essential for various natural language processing tasks, including stemming,
lemmatization, and language generation.
It is essential to know the nuances of a language in order to capture its context, its grammatical rules, and the way its words are formed.
Morphemes:
• Free Morphemes: These are morphemes that can stand alone as words and carry meaning by
themselves. Examples include "book," "run," and "happy."
• Bound Morphemes: These are morphemes that cannot stand alone as words and must be attached
to free morphemes to convey meaning. Examples include prefixes like "un-" (e.g., "unhappy") and
suffixes like "-ed" (e.g., "jumped").
Inflectional and Derivational Morphemes:
• Inflectional Morphemes: These morphemes are used to indicate grammatical information, such as
tense, number, or case, without changing the basic meaning of a word. In English, inflectional
morphemes include verb tense markers ("-ed" for past tense) and plural markers ("-s" for plural
nouns).
• Derivational Morphemes: These morphemes are used to create new words or change the meaning
or part of speech of a word. For example, adding the derivational morpheme "-ly" to the adjective
"quick" creates the adverb "quickly."
Lemmatisation vs. Stemming: Lemmatisation is another normalisation process, similar in purpose to stemming. However, lemmatisation aims to transform words to their base or dictionary form, known as the lemma. Unlike stemming, lemmatisation considers the meaning and context of words, resulting in more linguistically accurate results. For example, lemmatisation would convert both "am" and "is" to the lemma "be," whereas stemming might treat them as separate stems.
Importance of Morphology (just for quick
reference)
Examples:
1. Morphemes in the word "unhappily":
• "un-" is a bound morpheme (prefix) meaning "not."
• "happy" is a free morpheme (base form).
• "-ly" is a bound morpheme (suffix) used to form an adverb.
2. Stemming using the Porter Stemmer:
• Input: "running"
• Output: "run"
3. Lemmatisation:
• Input: "running"
• Output: "run"
Example: 'The cat was running happily'
Stemming:
Using the Porter Stemmer algorithm, we can stem each word in the phrase:
1. The → The (stop word, no stemming)
2. cat → cat (no change, stem remains the same)
3. was → was (no change, stem remains the same)
4. running → run
5. happily → happili
Stemmed phrase: 'The cat was run happili'
Lemmatisation:
Lemmatisation aims to find the lemma or base form of each word based on its meaning and
context. We'll use a lemmatiser that applies appropriate rules for each word:
1. The → The (stop word, no lemmatisation)
2. cat → cat (no change, lemma remains the same)
3. was → be
4. running → run
5. happily → happily
Lemmatised phrase: 'The cat be run happily'
As you can see, stemming and lemmatisation produced different results. Both reduced "running" to "run," but stemming turned "happily" into "happili," which is not an actual word, while lemmatisation left "happily" unchanged. Lemmatisation also recognised "was" as a form of the verb "be" and replaced it with that lemma, something a suffix-stripping stemmer cannot do.
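The same comparison can be reproduced with NLTK (a minimal sketch; the WordNet lemmatizer needs its corpus downloaded and a part-of-speech hint to map verbs correctly; note that NLTK's Porter stemmer actually clips "was" to "wa", a further reminder that stems need not be real words):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)    # one-time corpus download
nltk.download("omw-1.4", quiet=True)    # needed by some NLTK versions

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "happily", "was"]
print([stemmer.stem(w) for w in words])                    # ['run', 'happili', 'wa']
print([lemmatizer.lemmatize(w, pos="v") for w in words])   # ['run', 'happily', 'be']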
Index Term Selection and Thesauri
Index terms, also known as keywords or descriptors, play a crucial role in representing and indexing
documents. Proper selection of index terms improves the accuracy of retrieval. Thesauri are
structured vocabularies that provide synonyms and hierarchical relationships between words. They
aid in query expansion and disambiguation, thereby enhancing the retrieval process.
Index term selection is a critical process in information retrieval, where relevant terms are chosen to
represent the content of documents. Effective index terms should capture the key concepts of the
document and reflect the users' likely queries.
Several techniques are used for index term selection:
1. Frequency-Based Methods: Terms that occur frequently within a particular document but appear in relatively few documents across the collection are often considered significant for indexing, because such terms are more discriminating.
2. Mutual Information: Mutual information measures the statistical dependency between terms and
documents. Terms with high mutual information scores are likely to be relevant and discriminative.
3. Information Gain: Information gain is another measure used to select informative terms for indexing.
It calculates the reduction in uncertainty about the document class (relevant or non-relevant) when
observing a particular term.
Thesauri, such as WordNet, are structured vocabularies that provide synonyms, antonyms, and
hierarchical relationships between words. Thesauri can be used to enhance query expansion, where
synonyms or related terms are added to the original query to improve retrieval.
Index Term Selection
Index term selection is a critical step in information retrieval and involves identifying and choosing
the most relevant and representative terms from documents to build the index. These index terms,
also known as keywords or terms of importance, serve as access points for retrieving documents in
response to user queries. Effective index term selection plays a crucial role in improving search
accuracy and the overall performance of an information retrieval system.
Process of Index Term Selection
The process of index term selection can involve several techniques and considerations:
1. Tokenization: The text in each document is split into individual tokens or words, which form the
initial set of candidate index terms.
2. Stop-Word Removal: Common words with little semantic value, known as stop words (e.g., "the,"
"and," "is"), are often removed from the candidate terms since they do not significantly contribute to
the document's content.
3. Stemming/Lemmatization: To consolidate variations of words, stemming or lemmatisation may be
applied to reduce words to their base forms or lemmas. This helps group together different inflected
forms of the same word under a common index term.
4. Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF is a widely used term weighting
scheme that assigns weights to terms based on their importance in a document relative to the entire
collection. Terms with high TF-IDF scores are considered more informative and are likely to be
selected as index terms.
5. Relevance Ranking: Terms may also be ranked based on their relevance to the document or the
query. More relevant terms are given higher priority during the selection process.
6. Domain-Specific Considerations: Depending on the domain or subject matter of the collection,
domain-specific criteria or heuristics may be applied to select terms relevant to that domain.
Examples of Index Term Selection:
Consider a document collection on the topic of "Artificial Intelligence."
Example 1: Document Title - "Introduction to Artificial Intelligence"
Tokenized words: ["Introduction", "to", "Artificial", "Intelligence"]
After stop-word removal: ["Introduction", "Artificial", "Intelligence"]
Stemming/Lemmatization: No change
TF-IDF weights: Suppose "Introduction" has a low IDF score, while "Artificial" and "Intelligence" have
high IDF scores due to their importance in the collection.
Selected index terms: ["Artificial", "Intelligence"]
Example 2: Document Title - "Deep Learning Techniques in Artificial Intelligence"
Tokenized words: ["Deep", "Learning", "Techniques", "in", "Artificial", "Intelligence"]
After stop-word removal: ["Deep", "Learning", "Techniques", "Artificial", "Intelligence"]
Stemming/Lemmatization: No change
TF-IDF weights: Suppose "Deep" and "Learning" have high IDF scores, while "Techniques" and
"Artificial" have moderate IDF scores.
Selected index terms: ["Deep", "Learning", "Techniques", "Artificial", "Intelligence"]
In these examples, the selected index terms are the words that are most informative and relevant to
the document's content and the overall collection. These terms will be used to build the index,
facilitating efficient and accurate retrieval of documents during user searches.
*In short, this process tokenizes every word in a document or heading, removes stop words such as articles and conjunctions, applies stemming or lemmatisation, and finally uses the term frequency and inverse document frequency weights to decide which terms are selected for indexing.
So this sounds easy, right!

What could go wrong??

Well it all depends on the documents you have and how each
term is eventually ranked.
How is the TF-IDF calculated? And what is
it?
TF-IDF (Term Frequency-Inverse Document Frequency) is a popular term weighting scheme used in
information retrieval and text mining to measure the importance of a term in a document relative to
a collection of documents. TF-IDF takes into account both the frequency of a term in a document
(TF) and its rarity across the entire collection (IDF). The formula to calculate TF-IDF is as follows:
TF-IDF = TF(term, document) * IDF(term)
Where:
• TF(term, document) is the term frequency of the term in the document. It represents how many
times the term appears in the document.
• IDF(term) is the inverse document frequency of the term. It measures the rarity of the term across
the entire collection of documents.
The TF-IDF value for a specific term in a particular document is the product of its term frequency
and its inverse document frequency.
Term Frequency (TF):
The term frequency (TF) of a term in a document is a measure of how frequently the term appears in
that document. It is calculated as the ratio of the number of times the term occurs in the document
to the total number of terms in the document. TF is typically normalized to prevent bias towards longer documents; the formula below divides the raw count by the total number of terms in the document (dividing by the maximum term frequency in the document is another common normalization).
Formula for normalized TF (TF_norm):
TF_norm(term, document) = (Number of occurrences of term in the document) / (Total number of
terms in the document)
Inverse Document Frequency (IDF):
The inverse document frequency (IDF) of a term is a measure of how rare the term is across the
entire collection of documents. It is calculated as the logarithm of the ratio of the total number of
documents in the collection (N) to the number of documents that contain the term (n).
Formula for IDF:
IDF(term) = log(N / n)
Where:
• N: Total number of documents in the collection.
• n: Number of documents that contain the term.
The IDF value is higher when the term is rare in the collection and lower when the term is more
common.
Calculating TF-IDF:
Once the TF and IDF values are calculated for each term in a document, you can compute the TF-IDF
score for that term in the document using the formula mentioned at the beginning:
TF-IDF(term, document) = TF_norm(term, document) * IDF(term)
The TF-IDF score reflects the importance of a term in a specific document relative to the entire
collection. Terms with high TF-IDF scores are considered more informative and discriminative and
are often used for ranking and retrieval purposes in information retrieval systems. They indicate that
the term is frequent in the document (high TF) and rare in the collection (high IDF), suggesting that
the term is highly relevant to the content of the document. Conversely, terms with low TF-IDF scores
are either very common across all documents or very rare in the document, indicating lower
relevance to the document's content.
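A minimal sketch implementing exactly the formulas above, i.e. normalized TF multiplied by log(N / n), over a tiny invented collection:

import math

docs = {
    "d1": "information retrieval is essential for data science".split(),
    "d2": "data science includes statistics and machine learning".split(),
    "d3": "search engines use information retrieval techniques".split(),
}
N = len(docs)   # total number of documents in the collection

def tf_norm(term, doc_tokens):
    """Normalized TF: occurrences of the term / total terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    """IDF: log(N / n), where n is the number of documents containing the term."""
    n = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(N / n) if n else 0.0

def tf_idf(term, doc_id):
    return tf_norm(term, docs[doc_id]) * idf(term)

print(round(tf_idf("essential", "d1"), 3))  # 0.157 - rare in the collection, higher weight
print(round(tf_idf("data", "d1"), 3))       # 0.058 - appears in more documents, lower weight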
Thesauri
A thesaurus is a controlled vocabulary that organizes and relates
words and phrases to express concepts and their relationships. It
serves as a knowledge organization system, helping users find
synonyms, antonyms, hierarchical relationships, and related terms.
Thesauri play a vital role in information retrieval and are commonly
used in search engines, libraries, and databases to improve search
precision and recall.
How Thesauri Work: Thesauri are typically organized as a
hierarchical structure or network, with broader and narrower terms
representing the hierarchical relationships between concepts.
Synonyms and related terms are also linked to provide alternative
and related access points to the same or similar information.
Note: Thesauri is named after the ancient Dinosaur Thesaurus.

Note: Don’t take the above note seriously !!


Thesaurus Concepts
1. Hierarchical Relationships: Terms in a thesaurus are arranged in a hierarchical structure, with broader terms representing more general concepts and narrower terms representing specific sub-concepts. For example:
   • General concept: Animal
     • Broader term: Mammal
       • Narrower term: Cat
2. Synonyms: Synonyms are different words or phrases that have similar or identical meanings. They are cross-referenced in the thesaurus, allowing users to access information using different but equivalent terms. For example:
   • Synonyms: Car, Automobile, Vehicle
3. Related Terms: Related terms are terms that are conceptually or contextually related to a specific term but may not have the exact same meaning. For example:
   • Car → Related terms: Driver, Road, Traffic
Example of Thesauri
Consider a thesaurus entry for the term "Dog":
• Term: Dog
• Synonyms: Canine, Pooch
• Broader Term: Animal
• Narrower Term: Beagle
• Related Term: Bark, Pet, Leash

In this example, the thesaurus provides information about synonyms for "Dog" (Canine, Pooch), its
broader concept (Animal), a narrower term representing a specific type of dog (Beagle), and related
terms that are conceptually linked to "Dog" (Bark, Pet, Leash). Users can utilize this information to
find relevant documents or resources associated with different terms that relate to the concept of
"Dog."
By integrating thesauri into an information retrieval system, users can access a broader range of
relevant information and discover alternative ways of expressing their information needs, enhancing
the effectiveness and flexibility of the retrieval process.
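WordNet, mentioned earlier as a typical thesaurus, can be queried programmatically; a minimal sketch using NLTK's WordNet interface (requires the wordnet corpus; the returned names are WordNet's own and may differ from the illustrative "Dog" entry above):

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)    # needed by some NLTK versions

dog = wn.synsets("dog")[0]                              # first noun sense of "dog"
print(dog.lemma_names())                                # synonyms in this synset
print([h.lemma_names() for h in dog.hypernyms()])       # broader terms (hypernyms)
print([h.lemma_names() for h in dog.hyponyms()][:3])    # a few narrower terms (hyponyms)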
Another common identified example of
Thesauri
Students. Yes, read below

It is often observed that students write the exact same thing twice during exams!! If you do that, you won't be invited to the Thesaurus club!!
Part 3 – Unit 1
Basic Tokenizing.
Indexing: Simple tokenizing.
Stop-word removal, and stemming.
Inverted indices; Data Structure and File Organization for IR.
Efficient processing with sparse vectors.
In this lecture..
1. Basic Tokenizing and Indexing in Information Retrieval.
2. Simple Tokenizing
3. Stop-Word Removal and Stemming
4. Inverted Indices
5. Data Structure and File Organization for IR
6. Efficient Processing with Sparse Vectors
7. Additional Topics:
7.1 Evaluation Metrics in Information Retrieval
7.2 Query Expansion Techniques
7.3 Relevance Ranking Algorithms
7.4 Web Search and PageRank
7.5 Personalized Information Retrieval
Simple Tokenizing
Tokenization is the process of breaking text into individual tokens or words. In basic tokenizing,
sentences are segmented into words using space or punctuation as delimiters. Tokenization is a
fundamental step in the text processing pipeline for IR systems.
Steps in Simple Tokenization:
1. Text Input: The process begins with a text input, which could be a sentence, paragraph, or an entire
document.
2. Splitting by Whitespace: The text is split into tokens using whitespace (spaces, tabs, newlines) as
the delimiter. Each whitespace-separated sequence becomes a token.
3. Handling Punctuation: Punctuation marks like periods, commas, exclamation marks, and question
marks are generally treated as separate tokens. However, this can vary based on the specific
tokenizer and application.
Example: Consider the following sentence: "Natural language processing is a fascinating field!"
Using simple tokenization, the sentence would be tokenized into the following tokens: "Natural,"
"language," "processing," "is," "a," "fascinating," "field," and "!".
Advantages of Simple Tokenization:
1. Simplicity: Simple tokenization is easy to implement and understand, making it a good starting point for text
processing tasks.
2. Readability: Tokens are typically representative of actual words, which can aid in human interpretation of the
processed text.

Limitations of Simple Tokenization:


1. Ambiguity: Punctuation marks and contractions can introduce ambiguity. For instance, "I'm" can be tokenized
into "I" and "'m," which might not be the desired behavior.
2. Special Cases: Simple tokenization may not handle certain cases well, like hyphenated words, numeric
expressions, or acronyms.
3. Language Variants: Different languages have unique punctuation and tokenization rules. Simple tokenization
might not account for these variations.
4. Stemming and Lemmatization: If stemming or lemmatization is required, basic tokenization might generate
misleading tokens that don't accurately represent the base forms.
Usage and Modifications:
Simple tokenization can be sufficient for certain basic text analysis tasks or as a preprocessing step for more
complex tokenization techniques. However, in many real-world applications, more advanced tokenization
methods are used to handle the limitations and challenges of simple tokenization.
Advanced Tokenization Techniques:
1. Word Tokenization: This approach segments text into words, handling punctuation, contractions, hyphenated
words, and more.
2. Sentence Tokenization: It splits text into sentences, accounting for various punctuation marks and sentence
structures.
3. Subword Tokenization: Involves breaking down words into subword units, useful for languages with complex
word formations or for handling rare words.
4. Byte-Pair Encoding (BPE): A subword tokenization method that breaks down words into smaller subword
units based on their frequency in a given corpus.
5. Tokenizer Libraries: Many programming languages and NLP frameworks offer tokenization libraries that
handle more complex tokenization needs.
While simple tokenization serves as a basic introduction to text segmentation, more advanced tokenization
techniques are necessary for handling the intricacies and nuances of real-world text data.
Stop-Word Removal and Stemming
Stop words are common words that occur frequently in a language and often do not carry significant
meaning (e.g., "a," "an," "the"). Removing stop words helps reduce the size of the index and improves
query efficiency. We've already discussed the importance of stemming in the previous section.
Example (Stop word): Consider the sentence: "The quick brown fox jumps over the lazy dog."
After stop word removal, the sentence becomes: "quick brown fox jumps lazy dog."
Stemming: Stemming is the process of reducing words to their base or root form. It involves
removing prefixes, suffixes, and other morphological variations to obtain a common base word.
Stemming helps consolidate words with similar meanings and reduces inflected forms to a common
representation.
Example: Consider the words "running" and "runs."
After stemming, both words are reduced to the common stem "run" (an irregular form such as "ran" is not handled by suffix-stripping stemmers).
Stop Word Removal and Stemming in Combination:
Using stop word removal and stemming together can enhance the effectiveness of text processing.
By removing common, non-informative words (stop words) and reducing words to their base forms
(stemming), the resulting text becomes more focused and relevant for analysis.
Example: Original sentence: "The running foxes are faster than the lazy dogs."
After stop word removal and stemming (using the Porter stemmer): "run fox faster lazi dog"; note that "lazi" is not an actual word, which is typical of stemmer output.
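A minimal sketch of this combined preprocessing step, using a small hand-picked stop-word list (an illustrative assumption, not a standard list) and NLTK's Porter stemmer:

from nltk.stem import PorterStemmer   # requires: pip install nltk

STOP_WORDS = {"the", "a", "an", "is", "are", "than", "over"}   # tiny illustrative list
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then stem each token."""
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The running foxes are faster than the lazy dogs"))
# ['run', 'fox', 'faster', 'lazi', 'dog']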
Advantages:
1. Improved Analysis Efficiency: Removing stop words reduces the number of tokens to be processed,
making analysis faster.
2. Reduced Noise: Stop word removal eliminates words that add little semantic value, leaving behind
more significant words.
3. Consolidation of Vocabulary: Stemming reduces variations of words to a common form, which
helps in counting and analyzing word occurrences.
Limitations:
1. Potential Loss of Information: Removing stop words may lead to a loss of context, especially in
some languages where stop words contribute to grammatical structure.
2. Stemming Accuracy: Stemming algorithms might not always produce accurate base forms, leading
to incorrect representations.
Inverted Indices
Inverted indices are a core data structure in IR systems. They map terms (words or phrases) to the
documents that contain them. This structure allows for efficient retrieval of documents containing
specific terms and is essential for speeding up search operations. The inverted index is constructed
during the indexing phase, where terms are mapped to the documents that contain them.
The structure of an inverted index is similar to a dictionary, where each term serves as a key, and the
corresponding value is a list of document identifiers (or pointers) where the term appears. In the
case of large collections, posting lists (lists of documents containing the term) can be compressed
to reduce memory requirements.
Inverted indices facilitate rapid retrieval of relevant documents for a given query. When a user
submits a query, it is tokenized, and the corresponding terms are looked up in the inverted index to
obtain the relevant documents.
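A minimal sketch of building and querying an inverted index (a dictionary mapping each term to a sorted posting list of document ids; the documents are invented for illustration):

from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Indexing: map every term to the set of documents that contain it.
postings = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        postings[term].add(doc_id)
index = {term: sorted(ids) for term, ids in postings.items()}

def search(query):
    """AND semantics: return ids of documents containing every query term."""
    sets = [set(index.get(term, [])) for term in query.split()]
    return sorted(set.intersection(*sets)) if sets else []

print(index["sales"])             # [1, 2, 3]  (the posting list for "sales")
print(search("home sales july"))  # [2, 3]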
What next??
How is data stored once it has gone through all these processes (stemming, indexing, etc.)?
What are the challenges it faces when it is stored in certain ways?
Efficient processing of information in IR heavily relies on appropriate data structures and file
organization. Various data structures like hash tables, B-trees, and trie-based structures can be used
for indexing and querying. File organization techniques, such as sequential and direct access
methods, influence the speed and performance of IR systems.
Data Structure and File Organization for
IR
Now we shall discuss what data structures are present and how files are organized in two parts.
We will look at different data structures, their pros and cons. And later we will discuss the different
file organization methods.
However, each of these data structures plays a vital role in information retrieval and data
organization, and the choice of which structure to use depends on the specific requirements of the
application and the characteristics of the dataset being managed.
Data Structure for IR
Efficient data structures and file organization are crucial for information retrieval systems to handle
large-scale collections effectively. Several data structures and file organizations are commonly used:
1. Hash Tables: Hash tables can be used for quick lookups of term-to-posting list mappings. However,
they can suffer from collisions, which can degrade performance.
2. B-trees: B-trees are balanced tree data structures that allow for efficient insertion, deletion, and
lookup operations. They are commonly used for indexing and organizing large datasets.
3. Trie-based Structures: Tries are tree-like structures used for storing and searching strings efficiently.
They are suitable for prefix-based searches and have applications in autocomplete functionality.
Hash Tables:
Hash tables are data structures used to store key-value pairs and are commonly employed in
information retrieval systems for efficient indexing and retrieval of data. They provide fast access to
information by using a hash function to map keys to specific locations in an array.
How Hash Tables Work:
1. Hash Function: A hash function takes a key as input and computes an index or bucket location in the
array where the corresponding value will be stored.
2. Collision Handling: Since multiple keys can sometimes hash to the same index (known as a
collision), hash tables use collision resolution techniques to manage such situations. Common
collision resolution methods include chaining (using linked lists to store multiple values at the same
index) and open addressing (probing nearby locations to find an empty slot).
Pros and Cons of Hash Tables:
Pros:
• Fast average-case time complexity for insertion, deletion, and retrieval (O(1)).
• Efficient space utilization when the hash function is well-designed and the load factor is low.
• Widely used in various applications due to their simplicity and effectiveness.
Cons:
• Worst-case time complexity for operations can be O(n) if many collisions occur, degrading
performance.
• Hash functions must be carefully designed to distribute keys evenly to minimize collisions.
• Resizing the hash table can be computationally expensive when the load factor exceeds a certain
threshold.
• Hash tables do not support range queries efficiently as they only allow retrieval of single key-value
pairs.
Example: Suppose we have a collection of documents, and we want to build a hash table to store the
term frequencies of each term in the collection. The keys would be the terms, and the values would
be their corresponding term frequencies.
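In Python, the built-in dict (itself a hash table) or collections.Counter can play this role; a brief sketch with a made-up document:

# Counting term frequencies with a hash table (Python's dict / Counter).
from collections import Counter

document = "to be or not to be"
term_frequencies = Counter(document.split())  # term -> frequency
print(term_frequencies)  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})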
Next we will look at a practical example of hash tables.
Practical Example of Hash Tables: Phone Book Contacts
Let's consider a practical example of using a hash table to implement a simple phone book contacts
application. The goal is to store a list of contacts with their phone numbers and efficiently retrieve
contact information based on the contact name.
Step 1: Creating the Hash Table. We'll start by creating a hash table to store the contacts. For simplicity, let's assume we have a limited number of contacts, and the phone book can store up to 10 contacts. We'll use an array-based implementation for the hash table.
Step 2: Hash Function. Next, we need a hash function to convert the contact name into an index where the contact will be stored in the hash table. For this example, we'll use a simple hash function that calculates the sum of ASCII values of characters in the name and takes the modulo of the hash table size.
Step 3: Inserting Contacts. Now, we'll implement a function to insert contacts into the phone book using the hash table.
Step 4: Retrieving Contacts. To retrieve contact information, we'll implement a function that takes the contact name and returns the corresponding phone number.
Step 5: Putting It All Together. Now, let's insert some contacts and retrieve their phone numbers:
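A minimal Python sketch pulling Steps 1-5 together (the contact names and numbers are made up for illustration):

# Phone book contacts stored in a fixed-size hash table with linear probing.
hash_table_size = 10
phone_book = [None] * hash_table_size          # Step 1: the hash table

def hash_function(name):
    """Step 2: sum of ASCII values of the name, modulo the table size."""
    return sum(ord(ch) for ch in name) % hash_table_size

def insert_contact(name, number):
    """Step 3: insert a contact, probing linearly on collisions."""
    index = hash_function(name)
    for i in range(hash_table_size):
        slot = (index + i) % hash_table_size
        if phone_book[slot] is None or phone_book[slot][0] == name:
            phone_book[slot] = (name, number)
            return
    raise RuntimeError("Phone book is full")

def get_number(name):
    """Step 4: retrieve a number by probing from the hashed slot."""
    index = hash_function(name)
    for i in range(hash_table_size):
        slot = (index + i) % hash_table_size
        if phone_book[slot] is None:
            return None                        # not found
        if phone_book[slot][0] == name:
            return phone_book[slot][1]
    return None

# Step 5: insert some (made-up) contacts and retrieve their numbers.
insert_contact("Alice", "555-0101")
insert_contact("Bob", "555-0202")
print(get_number("Alice"))   # 555-0101
print(get_number("Carol"))   # None (not in the phone book)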
Working of Hash Tables:
In this example, the hash table is created with 10 slots (hash_table_size = 10). The hash function
converts the contact name into an index, where the contact information will be stored. If two
contacts hash to the same index due to a collision, linear probing is used to find the next available
slot. When retrieving a phone number, the hash function is applied to the contact name to locate the
correct slot, and the phone number is returned.
Hash tables provide efficient access to contact information based on the contact name. The time
complexity for insertion, retrieval, and deletion is O(1) on average (assuming a good hash function),
making hash tables a practical and effective data structure for various applications, including phone
book contacts, databases, and caching systems.
B-trees
B-trees are self-balancing tree data structures designed to efficiently store and retrieve large
amounts of data in blocks or pages. They are commonly used in databases and information retrieval
systems for indexing and organizing data on disk.
How B-trees Work:
1. Node Structure: B-trees consist of internal nodes and leaf nodes. Internal nodes store keys and
pointers to child nodes, while leaf nodes store actual data entries or references to data blocks.
2. Balance: B-trees maintain balance, ensuring that all leaf nodes are at the same level. This balance
reduces the number of disk accesses required for retrieval and insertion.
Pros and Cons of B-trees:
Pros:
• Efficient for large datasets and disk-based storage.
• Maintains balanced structure, leading to predictable and consistent performance.
• Reduces the number of disk I/O operations, making it suitable for databases and file systems.
Cons:
• More complex to implement and maintain compared to simple data structures like hash tables.
• Insertions and deletions require tree restructuring, leading to higher overhead compared to hash
tables for small datasets.
• In-memory B-trees can be less efficient than other tree structures, such as binary search trees or
AVL trees, for smaller datasets.
Example: Consider a database containing a large number of user records. B-trees can be used to
index the user records efficiently based on a unique identifier, such as a user ID.
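A full B-tree implementation (insertion with node splitting, deletion, and rebalancing) runs long, so the sketch below shows only the node layout and the search procedure on a small hand-built tree of hypothetical user IDs:

# Simplified B-tree search: each node holds sorted keys and, for internal
# nodes, one child pointer per key "gap". Insertion/deletion are omitted.
import bisect

class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted list of keys in this node
        self.children = children or []    # empty list => leaf node

def btree_search(node, key):
    """Return True if key is present in the subtree rooted at node."""
    i = bisect.bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return True
    if not node.children:                 # reached a leaf without finding key
        return False
    return btree_search(node.children[i], key)

# A small hand-built B-tree indexing hypothetical user IDs.
root = BTreeNode(
    keys=[40, 80],
    children=[
        BTreeNode([10, 20, 30]),
        BTreeNode([50, 60, 70]),
        BTreeNode([90, 100]),
    ],
)

print(btree_search(root, 60))   # True
print(btree_search(root, 65))   # False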
Trie-based structures
Trie (pronounced "try") is a tree-like data structure used for efficiently storing and retrieving strings
or sequences. Trie-based structures, such as prefix trees and compressed trie structures (like
Patricia trie), are used in information retrieval for various tasks involving string matching and
searching.
How Trie-Based Structures Work:
1. Node Structure: In a trie, each node represents a single character of a string. Nodes are linked based
on the characters they represent, forming a hierarchical tree-like structure.
2. Prefix Matching: Trie-based structures excel at prefix matching, making them efficient for
autocompletion and searching tasks.
Pros and Cons of Trie-Based Structures
Pros:
• Excellent for tasks requiring string matching and prefix search, such as autocompletion and spell
checking.
• Space-efficient when there are many common prefixes in the dataset, as shared prefixes are
represented only once.
Cons:
• Inefficient for storing large datasets of long strings as they can be memory-intensive.
• High space overhead when there is little repetition of prefixes.
• More complex to implement compared to simple data structures like arrays or linked lists.
Example: For a search engine's autocompletion feature, a trie-based structure can be used to
efficiently store and retrieve a large number of search queries for real-time suggestions as the user
types.
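A minimal trie sketch supporting insertion and prefix-based autocompletion over a few made-up queries:

# A minimal trie supporting insertion and prefix-based autocompletion.
class TrieNode:
    def __init__(self):
        self.children = {}       # character -> child TrieNode
        self.is_end = False      # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def autocomplete(self, prefix):
        """Return all stored words that start with the given prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        self._collect(node, prefix, results)
        return results

    def _collect(self, node, path, results):
        if node.is_end:
            results.append(path)
        for ch, child in node.children.items():
            self._collect(child, path + ch, results)

# Hypothetical search queries stored for autocompletion.
trie = Trie()
for query in ["information", "inverted index", "index", "indri"]:
    trie.insert(query)
print(trie.autocomplete("ind"))  # ['index', 'indri']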
File Organization for IR
File organization involves how the indexed data is stored on disk to optimize access times.
Sequential access methods are suitable for processing documents in order, while direct access
methods, such as hashing and indexing, allow for faster retrieval of specific documents.
Efficient Processing with Sparse Vectors
In IR, document-term matrices are often very sparse since most documents contain only a small
subset of the entire vocabulary. Efficient storage and processing techniques for sparse vectors are
critical to avoid unnecessary memory usage and computational overhead.
Sparse vectors can be represented using various data structures, such as:
1. Compressed Sparse Row (CSR) Format: This format stores only the non-zero elements of a sparse
matrix, along with row and column indices. It reduces memory requirements by omitting zero
elements.
2. Inverted Index Compression: Posting lists in the inverted index can be compressed to save space.
Techniques like variable-byte encoding, delta encoding, and Golomb coding are used to represent
integers more efficiently.
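As a brief illustration of the compression idea, here is a sketch of variable-byte encoding applied to the gaps of a hypothetical posting list (the common scheme in which the final byte of each integer has its high bit set):

# Variable-byte encoding of posting-list gaps: each integer is split into
# 7-bit chunks; the final byte of each integer has its high bit set.
def vb_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunks = []
        while True:
            chunks.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunks[-1] += 128            # mark the last byte of this number
        out.extend(chunks)
    return bytes(out)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

# Store gaps between sorted document IDs instead of the IDs themselves.
doc_ids = [5, 8, 130, 1030]                      # hypothetical posting list
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = vb_encode(gaps)
print(len(encoded), vb_decode(encoded))          # 5 [5, 3, 122, 900]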
Efficient processing of sparse vectors is essential for high-performance IR systems, especially when
dealing with large-scale collections containing millions or billions of documents.
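A minimal sketch of the CSR idea in plain Python (libraries such as scipy.sparse provide production implementations; the 3 x 5 document-term matrix here is a made-up toy example):

# Compressed Sparse Row (CSR) representation of a small document-term matrix
# and a sparse dot product with a dense query vector.
# Dense matrix (3 documents x 5 terms), mostly zeros:
#   doc0: [0, 2, 0, 0, 1]
#   doc1: [0, 0, 0, 3, 0]
#   doc2: [1, 0, 0, 0, 4]
data    = [2, 1, 3, 1, 4]        # non-zero values, row by row
indices = [1, 4, 3, 0, 4]        # column (term) index of each value
indptr  = [0, 2, 3, 5]           # row i occupies data[indptr[i]:indptr[i+1]]

def csr_row_dot(row, query):
    """Dot product of one CSR row with a dense query vector."""
    start, end = indptr[row], indptr[row + 1]
    return sum(data[k] * query[indices[k]] for k in range(start, end))

query = [0, 1, 0, 0, 1]          # dense query vector over the 5 terms
scores = [csr_row_dot(r, query) for r in range(len(indptr) - 1)]
print(scores)                    # [3, 0, 4]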