Unit III - Information Retrieval
Design Features of Information Retrieval Systems
The following diagram illustrates the process of information
retrieval (IR):
[Figure: the information retrieval process]
St. Joseph’s College of Engineering
ML1701 Natural Language Processing Department of AML 2024 - 2025
In ad-hoc retrieval, the user enters a natural-language query that
describes the required information, and the IR system returns the
documents related to it. For example, when we search for something on
the Internet, the results include pages that are relevant to our
requirement, but they may include some non-relevant pages too.
Separating the relevant documents from the non-relevant ones is the
ad-hoc retrieval problem.
Classical IR Model
Non-Classical IR Model
Alternative IR Model
Inverted Index
An inverted index lists, for every word, all documents that contain it
and the frequency of its occurrences in each document. This makes it
easy to find the ‘hits’ for a query word.
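The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production index: the function name `build_inverted_index` and the toy documents are assumptions made for the example, and tokenization is a plain whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

# Hypothetical toy collection.
docs = {
    "d1": "social policy and economic policy",
    "d2": "economic growth",
    "d3": "social media",
}

index = build_inverted_index(docs)
hits = index["policy"]  # documents that contain "policy", with frequencies
```

Looking up a query word is then a single dictionary access, which is why the structure suits finding hits quickly.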
Stemming
For example, the query term “economic” defines the set of documents
that are indexed with the term “economic”.
Now, what is the result of combining terms with the Boolean AND
operator? It defines a document set that is smaller than or equal to
the document set of any of the single terms. For example, the query
with terms “social” and “economic” produces the set of documents that
are indexed with both terms, that is, the intersection of the two
sets.
Now, what is the result of combining terms with the Boolean OR
operator? It defines a document set that is bigger than or equal to
the document set of any of the single terms. For example, the query
with terms “social” or “economic” produces the set of documents that
are indexed with either the term “social” or the term “economic”,
that is, the union of the two sets.
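The AND and OR cases map directly onto set intersection and union. The sketch below assumes a hypothetical `postings` table that records which documents are indexed with each term; the document ids are made up for illustration.

```python
# Hypothetical postings: term -> set of ids of documents indexed with that term.
postings = {
    "social": {"d1", "d3", "d4"},
    "economic": {"d1", "d2"},
}

def boolean_and(term_a, term_b):
    """Documents indexed with both terms (set intersection)."""
    return postings.get(term_a, set()) & postings.get(term_b, set())

def boolean_or(term_a, term_b):
    """Documents indexed with either term (set union)."""
    return postings.get(term_a, set()) | postings.get(term_b, set())

both = boolean_and("social", "economic")
either = boolean_or("social", "economic")
```

Here `both` can never be larger than either single-term set, and `either` can never be smaller, which matches the size claims made above.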
The top-ranked document in response to the terms car and insurance
is d2, because the angle between q and d2 is the smallest. Both
concepts, car and insurance, are salient in d2 and hence have high
weights. Documents d1 and d3 also mention both terms, but in each
case one of them is not a centrally important term in the document.
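The ranking can be reproduced with a small cosine-similarity sketch. The figure with the actual vectors is not available in these notes, so the weights below are hypothetical values chosen only so that both terms are salient in d2 and one term dominates in d1 and d3.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Axes are (car, insurance); all weights are hypothetical.
q = (1.0, 1.0)
docs = {
    "d1": (0.9, 0.2),  # mostly about "car"
    "d2": (0.7, 0.7),  # both terms are salient
    "d3": (0.1, 0.9),  # mostly about "insurance"
}

ranked = sorted(docs, key=lambda name: cosine(q, docs[name]), reverse=True)
```

A larger cosine means a smaller angle, so the first entry of `ranked` is the document whose direction is closest to the query.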
Term Weighting
Term weighting means assigning weights to the terms in the vector
space. The higher the weight of a term, the greater its impact on the
cosine. More important terms should be assigned higher weights in the
model. The question that arises is how to model this.
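One common answer is tf-idf weighting, where a term counts for more if it is frequent in a document but rare across the collection. The notes do not show the formula at this point, so the variant below (raw term frequency times log(N / df)) is an assumption, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    term_counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    return {
        d: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        for d, counts in term_counts.items()
    }

# Hypothetical toy collection.
docs = {
    "d1": "the car insurance",
    "d2": "the car repair",
    "d3": "the best insurance",
}

weights = tf_idf(docs)
```

A word such as "the", which occurs in every document, gets weight zero, while a word found in only one document gets the highest weight.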
Relevance Feedback
Relevance feedback takes the output that is initially returned for the
given query. This initial output is used to gather information from
the user about which results are relevant, and that information is
used to perform a new, refined query. Feedback can be classified as
follows:
Explicit Feedback
Pseudo Feedback
where wij is the weight of the term ki in relation to the document dj and
wiq is the weight of the term ki in relation to the query q.
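The formula these symbols belong to did not survive in the notes. Assuming it is the usual cosine similarity of the vector model, with t (an assumed symbol) denoting the number of index terms, it would read:

```latex
\mathrm{sim}(d_j, q) =
  \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}
       {\sqrt{\sum_{i=1}^{t} w_{ij}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{iq}^{2}}}
```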
Finally, the classic Probabilistic Model (PM) was the last classic
model, proposed by Robertson and Spärck Jones. Its main idea is, given
a user query q and a document dj, to estimate the probability that the
user considers dj relevant; this is the probability of
4. Reinforcement Learning
5. Semantic Search
6. Topic Models
7. Graph-based Models
1. Pointwise Approach
o Description: Treats individual document-query pairs as
separate training examples. The model predicts relevance
scores for each document based on features.
o Example: Using regression to predict a relevance score
(e.g., 0 to 5) for each document.
2. Pairwise Approach
o Description: Compares pairs of documents for the same query.
It learns to rank one document higher than another based on
their relevance.
o Example: A model is trained on pairs, e.g., to predict that
Document A is more relevant than Document B for a given
query.
3. Listwise Approach
o Description: Considers the entire list of documents for a
query. It evaluates how well the entire list is ranked rather
than individual pairs or points.
o Example: Loss functions that reflect the order of documents
in the whole list, for example ones based on NDCG, which
takes into account the position of relevant documents.
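To make the listwise idea concrete, here is a minimal NDCG sketch. It assumes graded relevance labels listed in ranked order and uses the linear-gain form of DCG (relevance divided by log2(rank + 1)); other variants, such as the exponential gain 2^rel - 1, are also in use. The two example rankings are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: the gain at each rank is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance labels of the returned documents, in ranked order (hypothetical).
good_ranking = [3, 2, 1, 0]
poor_ranking = [0, 1, 2, 3]

good_score = ndcg(good_ranking)
poor_score = ndcg(poor_ranking)
```

Both lists contain the same documents, yet the score changes with their order, which is exactly what a pointwise loss cannot see.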
Key Components
Advantages
Applications
Key Concepts
1. End-to-End Learning
o Description: The entire retrieval process, from input to
ranking, is trained as a single model, optimizing for
relevance directly based on user interactions.
2. Two-Stage Retrieval
o Description: Combines traditional methods (like keyword-based
search) with neural models. The first stage retrieves a
broad set of candidates, and the second stage ranks them
using neural techniques.
3. Query and Document Interaction Models
o Description: Focus on modeling the interaction between
queries and documents to improve ranking. Techniques
Advantages
Challenges
Applications
Key Concepts
How It Works
1. Document Representation
o Instead of using sparse term vectors (where many terms
have zero frequency), documents are represented as vectors
of averaged or summed word embeddings.
o This results in dense vectors that capture the overall
meaning of the document.
2. Query Representation
o Queries are similarly transformed into dense vectors using
the same word embeddings, allowing for direct comparison
with document vectors.
3. Similarity Calculation
o Similarity between the query and documents can be
computed using cosine similarity or other distance metrics
on the dense vectors.
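The three steps above can be sketched as follows. The 3-dimensional embeddings are hypothetical toy values (real embeddings such as Word2Vec or GloVe are learned and have hundreds of dimensions), and the names `embed` and `EMBEDDINGS` are assumptions made for this example.

```python
import math

# Hypothetical toy embeddings; real ones are learned from large corpora.
EMBEDDINGS = {
    "car": (0.9, 0.1, 0.0),
    "automobile": (0.85, 0.15, 0.0),
    "insurance": (0.1, 0.9, 0.0),
    "policy": (0.2, 0.8, 0.1),
    "banana": (0.0, 0.1, 0.9),
    "fruit": (0.0, 0.2, 0.8),
}

def embed(text):
    """Steps 1 and 2: average the embeddings of the known words in the text."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words in text")
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))

def cosine(u, v):
    """Step 3: cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

documents = {"doc_auto": "automobile policy", "doc_food": "banana fruit"}
query_vector = embed("car insurance")
scores = {name: cosine(query_vector, embed(text)) for name, text in documents.items()}
best = max(scores, key=scores.get)
```

Note that the query shares no word with `doc_auto`, yet that document scores highest because its averaged vector points in a similar direction.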
Advantages
Challenges
Applications
Key Components
Learning Process
Key Concepts
Learning Algorithms
1. Model-Free Methods:
o Q-Learning: A value-based method that updates Q-values
for state-action pairs based on the received rewards,
Advantages
Challenges
Applications
5. Semantic Search
Key Concepts
1. Understanding Intent:
o Semantic search focuses on deciphering what users actually
mean when they search, considering synonyms, related
concepts, and user context.
2. Contextual Relevance:
o It evaluates the meaning of the search query in relation to
the content of documents, rather than just matching
keywords.
3. Natural Language Processing (NLP):
o Utilizes NLP techniques to analyze and understand the
structure and meaning of language, enhancing the ability to
process user queries.
Techniques
1. Entity Recognition:
o Identifying and classifying key entities in text (e.g., people,
places, organizations) helps in understanding relationships
and context.
2. Word Embeddings:
o Uses vector representations of words (e.g., Word2Vec,
GloVe) that capture semantic meaning and relationships,
allowing for more nuanced matching.
3. Knowledge Graphs:
o Structures data in a way that defines relationships between
entities, enhancing the search engine’s ability to provide
relevant results.
4. Contextualized Embeddings:
o Models like BERT (Bidirectional Encoder Representations
from Transformers) consider the context of words within
sentences, improving comprehension of queries.
Advantages
Applications
Challenges
6. Topic Models
Key Concepts
Applications
Advantages
Challenges
1. PageRank:
o Description: An algorithm developed by Google's founders,
Larry Page and Sergey Brin, to rank web pages based on the
structure of links (edges) between them.
o Mechanism: Assigns a score to each page based on the
number and quality of links pointing to it, simulating the
way users might navigate the web.
2. HITS (Hyperlink-Induced Topic Search):
o Description: A link analysis algorithm that identifies two
types of pages: hubs (pages that link to many other pages)
and authorities (pages that are linked to by many hubs).
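The PageRank mechanism can be illustrated with a short power-iteration sketch. The four-page link graph is hypothetical, the damping factor 0.85 is a conventional choice rather than something stated in these notes, and the sketch assumes every linked page also appears as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

Page C is linked from three pages and ends up on top, while D, which nobody links to, keeps only the baseline share.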
Applications
Advantages
Challenges
7. Alternative Models of IR
The cluster model, the fuzzy model, and the latent semantic indexing
(LSI) model are examples of alternative IR models.
What is Clustering?
The task of grouping data points based on their similarity to each
other is called clustering or cluster analysis. It belongs to the
branch of unsupervised learning, which aims at gaining insights from
unlabelled data points; unlike supervised learning, there is no
target variable.
Clustering aims at forming groups of homogeneous data points from
a heterogeneous dataset. It evaluates similarity using a metric such
as Euclidean distance, cosine similarity, or Manhattan distance, and
then groups the points with the highest similarity together.
For example, in the graph below, three circular clusters form on the
basis of distance.
[Figure: three circular clusters of data points]
More recently this idea has been used for language modeling.
Equation 102 showed that to avoid sparse data problems in the
language modeling approach to IR, the model of document d can be
interpolated with a collection model. But the collection contains
many documents with terms untypical of d. By replacing the collection
model with a model derived from d's cluster, we get more accurate
estimates of the occurrence probabilities of terms in d.
Clustering can also speed up search. As we saw in Section 6.3.2,
search in the vector space model amounts to finding the nearest
neighbors to the query. The inverted index supports fast
nearest-neighbor search for the standard IR setting. However,
sometimes we may not be able to use an inverted index efficiently,
e.g., in latent semantic indexing (Chapter 18). In such cases, we
could compute the similarity of the query to every document, but this
is slow. The cluster hypothesis offers an alternative: find the
clusters that are closest to the query and only consider documents
from these clusters. Within this much smaller set, we can compute
similarities exhaustively and rank documents in the usual way. Since
there are many fewer clusters than documents, finding the closest
cluster is fast; and since the documents matching a query are all
similar to each other, they tend to be in the same clusters. While
this algorithm is inexact, the expected decrease in search quality is
small. This is essentially the application of clustering that was
covered in Section 7.1.6.
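The search shortcut described above can be sketched as follows. The clustering is assumed to be given already; the document vectors, cluster memberships, and the function name `cluster_search` are hypothetical and serve only to show the two-step lookup.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))

def cluster_search(query, clusters, docs, top_clusters=1):
    """Rank only the documents in the clusters whose centroids are closest to the query."""
    centroids = {cid: centroid([docs[d] for d in members])
                 for cid, members in clusters.items()}
    nearest = sorted(centroids, key=lambda cid: cosine(query, centroids[cid]),
                     reverse=True)[:top_clusters]
    candidates = [d for cid in nearest for d in clusters[cid]]
    # Exhaustive similarity computation, but only within the selected clusters.
    return sorted(candidates, key=lambda d: cosine(query, docs[d]), reverse=True)

# Hypothetical document vectors and a precomputed clustering.
docs = {"d1": (0.9, 0.1), "d2": (0.8, 0.3), "d3": (0.1, 0.9), "d4": (0.2, 0.8)}
clusters = {"c1": ["d1", "d2"], "c2": ["d3", "d4"]}

results = cluster_search((1.0, 0.2), clusters, docs)
```

Only two centroid comparisons and two document comparisons are made here instead of four document comparisons; the saving grows with collection size, at the cost of possibly missing a relevant document in a cluster that was not selected.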
Semantic hashing
In semantic hashing [21], documents are mapped to memory addresses
by means of a neural network in such a way that semantically similar
documents are located at nearby addresses. The deep neural network
essentially builds a graphical model of the word-count vectors
obtained from a large set of documents. Documents similar to a query
document can then be found by simply accessing all the addresses that
differ by only a few bits from the address of the query document. This
way of extending the efficiency of hash-coding to approximate
matching is much faster than locality-sensitive hashing, which was the
fastest method available when semantic hashing was proposed.
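The lookup step, accessing all addresses that differ by only a few bits, can be sketched without the neural network. The 4-bit codes below are hypothetical stand-ins for the addresses a trained network would produce, and the names `neighbors_within` and `lookup` are assumptions made for this example.

```python
from itertools import combinations

def neighbors_within(address, n_bits, max_flips):
    """Yield every address that differs from `address` in at most `max_flips` bits."""
    for k in range(max_flips + 1):
        for positions in combinations(range(n_bits), k):
            flipped = address
            for pos in positions:
                flipped ^= 1 << pos  # flip one bit
            yield flipped

# Hypothetical 4-bit codes; in semantic hashing these come from a trained network.
table = {
    0b1010: ["doc_a"],
    0b1011: ["doc_b"],
    0b0101: ["doc_c"],
}

def lookup(query_address, n_bits=4, max_flips=1):
    """Collect the documents stored at the query address and at nearby addresses."""
    found = []
    for address in neighbors_within(query_address, n_bits, max_flips):
        found.extend(table.get(address, []))
    return found

similar = lookup(0b1010)
```

The cost depends on the number of bits and allowed flips, not on the number of stored documents, which is where the speed comes from.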
Latent semantic indexing
Latent semantic indexing (LSI) is an indexing and retrieval method
that uses a mathematical technique called singular value
decomposition (SVD) to identify patterns in the relationships between
the terms and concepts contained in an unstructured collection of text.
LSI is based on the principle that words that are used in the same
contexts tend to have similar meanings. A key feature of LSI is its
ability to extract the conceptual content of a body of text by
establishing associations between those terms that occur in
similar contexts.[22]
LSI is also an application of correspondence analysis, a multivariate
statistical technique developed by Jean-Paul Benzécri[23] in the early
1970s, to a contingency table built from word counts in documents.
Called "latent semantic indexing" because of its ability to
correlate semantically related terms that are latent in a collection of
text, it was first applied to text at Bellcore in the late 1980s. The
method, also called latent semantic analysis (LSA), uncovers the
underlying latent semantic structure in the usage of words in a body of
text and how it can be used to extract the meaning of the text in response
to user queries, commonly referred to as concept searches. Queries, or
concept searches, against a set of documents that have undergone LSI
will return results that are conceptually similar in meaning to the search
criteria even if the results don’t share a specific word or words with the
search criteria.
Benefits of LSI
LSI helps overcome synonymy by increasing recall, one of the most
problematic constraints of Boolean keyword queries and vector space
models.[18] Synonymy is often the cause of mismatches in the
vocabulary used by the authors of documents and the users
of information retrieval systems.[24] As a result, Boolean or keyword
queries often return irrelevant results and miss information that is
relevant.
LSI is also used to perform automated document categorization. In fact,
several experiments have demonstrated that there are a number of
correlations between the way LSI and humans process and categorize
text.[25] Document categorization is the assignment of documents to one
or more predefined categories based on their similarity to the
conceptual content of the categories.[26] LSI uses example documents
to establish the conceptual basis for each category. During
categorization processing, the concepts contained in the documents
being categorized are compared to the concepts contained in the
example items, and a category (or categories) is assigned to the
documents based on the similarities between the concepts they contain
and the concepts that are contained in the example documents.
Dynamic clustering based on the conceptual content of documents can
also be accomplished using LSI. Clustering is a way to group
documents based on their conceptual similarity to each other without
using example documents to establish the conceptual basis for each
cluster. This is very useful when dealing with an unknown collection
of unstructured text.
Because it uses a strictly mathematical approach, LSI is inherently
independent of language. This enables LSI to elicit the semantic
content of information written in any language without requiring the
Challenges to LSI
1. Query Understanding:
o Parsing and Tokenization: NLP breaks down user queries
into components like keywords, phrases, and entities,
helping the system understand the user’s intent.
o Synonym Expansion: By recognizing synonyms and
related terms, NLP can broaden the scope of search results
to include variations of the query terms.
2. Document Indexing:
o Text Normalization: Processes like stemming (reducing
words to their base forms) and lemmatization (reducing
words to their dictionary form) help in indexing documents
consistently.
o Named Entity Recognition (NER): Identifies and
categorizes entities (e.g., names of people, organizations) in
documents, enhancing the relevance of search results.
3. Relevance Ranking:
o Semantic Analysis: Beyond keyword matching, NLP
assesses the meaning of words in context, improving the
accuracy of search results.
o Topic Modeling: Techniques like Latent Dirichlet
Allocation (LDA) help in understanding the themes or
topics within documents, aiding in better ranking and
retrieval.
4. Query Expansion and Refinement:
o Relevance Feedback: NLP analyzes user interactions to
refine and improve search results based on what users find
relevant or not.
o Contextual Understanding: Techniques like BERT
(Bidirectional Encoder Representations from Transformers)
enable systems to understand the context of queries better,
improving accuracy in results.
5. Text Classification and Categorization:
o Document Classification: NLP categorizes documents
into predefined classes or topics, which can be useful for
organizing large corpora and improving retrieval efficiency.
9. Relation Matching
1. Entity Recognition:
o Named Entity Recognition (NER): Identifies and
classifies entities (e.g., people, organizations, locations) in
text. For example, in the sentence "Steve Jobs founded
Apple," NER identifies "Steve Jobs" and "Apple" as
entities.
2. Relation Extraction:
o Pattern-Based Extraction: Uses predefined patterns or
rules to identify relationships between entities. For instance,
1. Rule-Based Methods:
o Regular Expressions: Simple patterns for matching
relationships, often used in conjunction with NER.
o Heuristic Rules: Manually crafted rules based on domain
knowledge.
2. Machine Learning Approaches:
o Feature-Based Models: Uses features like syntactic
patterns, part-of-speech tags, and entity types to train
classifiers.
o Sequence Models: Models like LSTM (Long Short-Term
Memory) and CRFs for extracting relationships in
sequences.
3. Deep Learning Approaches:
o Transformer Models: BERT, GPT, and other transformer-
based models can be fine-tuned for relation extraction tasks
by leveraging their ability to understand context and
semantics.
o Entity-Aware Models: Specialized models that
incorporate entity embeddings to improve relation
matching.
4. Evaluation Metrics:
o Precision, Recall, and F1-Score: Common metrics used to
evaluate the performance of relation extraction systems.
Precision is the fraction of extracted relations that are
correct, recall is the fraction of true relations that are
extracted, and the F1-score is the harmonic mean of the two.
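As a small illustration of the list above, the sketch below combines a rule-based extractor (a regular expression for "X founded Y") with the three evaluation metrics. The pattern, the second example sentence, and the gold-standard set are hypothetical; a real system would take entities from NER rather than from capitalization.

```python
import re

# Hypothetical rule: "<Entity> founded <Entity>", where an entity is
# one or more capitalized words.
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
FOUNDED = re.compile(rf"({ENTITY}) founded ({ENTITY})")

def extract_relations(text):
    """Return (subject, relation, object) triples matched by the rule."""
    return {(subj, "founded", obj) for subj, obj in FOUNDED.findall(text)}

def precision_recall_f1(predicted, gold):
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

text = "Steve Jobs founded Apple. Microsoft was founded by Bill Gates."
predicted = extract_relations(text)
gold = {("Steve Jobs", "founded", "Apple"), ("Bill Gates", "founded", "Microsoft")}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

The rule finds the active-voice sentence but misses the passive one, so everything it extracts is correct while half of the true relations are missed. This is typical of hand-written patterns.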
Applications
CLIR systems have improved so much that the most accurate
multilingual and cross-lingual ad hoc information retrieval systems
today are nearly as effective as monolingual systems.[3] Other related
information access tasks, such as media monitoring, information
filtering and routing, sentiment analysis, and information extraction
require more sophisticated models and typically more processing and
analysis of the information items of interest. Much of that processing
needs to be aware of the specifics of the target languages it is deployed
in.
Mostly, the various mechanisms of variation in human language pose
coverage challenges for information retrieval systems: texts in a
collection may treat a topic of interest but use terms or expressions
which do not match the expression of information need given by the
user. This can be true even in a mono-lingual case, but this is especially
true in cross-lingual information retrieval, where users may know the
target language only to some extent. The benefits of CLIR technology
for users with poor to moderate competence in the target language have
been found to be greater than for those who are fluent.[4] Specific
technologies in place for CLIR services include morphological analysis
to handle inflection, decompounding or compound splitting to handle
compound terms, and translation mechanisms to translate a query
from one language to another.
The first workshop on CLIR was held in Zürich during the SIGIR-96
conference.[5] Workshops have been held yearly since 2000 at the
meetings of the Cross Language Evaluation Forum (CLEF).
Researchers also convene at the annual Text Retrieval Conference
(TREC) to discuss their findings regarding different systems and
methods of information retrieval, and the conference has served as a
point of reference for the CLIR subfield.[6] Early CLIR experiments
were conducted at TREC-6, held at the National Institute of Standards
and Technology (NIST) on November 19–21, 1997.[7]
Translation Approaches
CLIR requires the ability to represent and match information in the
same representation space even if the query and the document
collection are in different languages. The fundamental problem in
CLIR is to match terms in different languages that describe the same or
a similar meaning. The strategy of mapping between different language
representations is usually machine translation. In CLIR, this translation
process can be carried out in several ways:
Document translation [2] maps the document representation
into the query representation space, as illustrated in Figure 2.
Query translation [3] maps the query representation into the
document representation space, as illustrated in Figure 3.
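Query translation can be illustrated with a deliberately simple sketch. It assumes a hypothetical Spanish-to-English dictionary in place of a machine translation system and scores English documents by term overlap; all names and data are made up for the example.

```python
# Hypothetical bilingual dictionary; a real system would use machine translation.
ES_TO_EN = {"seguro": ["insurance"], "coche": ["car", "automobile"]}

english_docs = {
    "d1": "cheap car insurance",
    "d2": "automobile repair manual",
    "d3": "fruit recipes",
}

def translate_query(query):
    """Map the query into the document language, keeping all translations of a term."""
    translated = []
    for term in query.lower().split():
        translated.extend(ES_TO_EN.get(term, [term]))  # unknown terms are kept as-is
    return translated

def retrieve(query):
    """Rank documents by how many translated query terms they contain."""
    terms = set(translate_query(query))
    scores = {doc_id: len(terms & set(text.split()))
              for doc_id, text in english_docs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_id for doc_id in ranked if scores[doc_id] > 0]

results = retrieve("seguro coche")
```

Translating the short query is cheap compared with translating every document, which is one reason query translation is attractive; document translation would instead convert `english_docs` once, at indexing time.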
one approach or the other using the same machine translation system
[6], and the effectiveness is more dependent on the translation direction
[7].
Recent Progress
Very recently, cross-lingual word embeddings and neural network
based information retrieval systems have become increasingly popular.
Cross-lingual word embeddings can represent words in different
languages in the same vector space by learning a mapping from
monolingual embeddings even from no bilingual supervision. Neural
information retrieval can build better representations for documents
and queries and learn to rank directly from relevance labels. Here we
briefly discuss three recent papers in this direction.
DUET
This is the paper Learning to Match using Local and Distributed
Representations of Text for Web Search, WWW 2017 by Bhaskar Mitra,
Fernando Diaz, and Nick Craswell.
MUSE
The second paper is Word translation without parallel data, ICLR 2018
by Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato,
Ludovic Denoyer, Hervé Jégou.
The paper studies cross-lingual word embeddings where word
embeddings for two languages are aligned in the same representation
space (Figure 5). State-of-the-art methods for cross-lingual word
embeddings rely on bilingual supervision such as dictionaries or
parallel corpora. Recent studies try to alleviate the bilingual
Unsupervised CLIR
Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed
to group similar data points:
Hard Clustering: In this type of clustering, each data point either
belongs to a cluster completely or does not belong to it at all. For
example, say there are 4 data points and we have to group them into
2 clusters. Each data point will then belong to either cluster 1 or
cluster 2.
Data Points   Clusters
A             C1
B             C2
C             C2
D             C1
Soft Clustering: In this type of clustering, instead of assigning
each data point to a single cluster, the probability or likelihood
of that point belonging to each cluster is evaluated. For example,
say there are 4 data points and we have to group them into 2
clusters. We then evaluate the probability of each data point
belonging to each of the two clusters. This probability is
calculated for all data points.
Data Points   Probability of C1   Probability of C2
A             0.91                0.09
B             0.3                 0.7
C             0.17                0.83
D             1                   0
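The relationship between the two tables can be shown in code: picking the most probable cluster for each point turns a soft clustering into a hard one. The probabilities are those from the soft clustering table; the function name `harden` is an assumption made for this example.

```python
# Soft assignments taken from the table above.
soft = {
    "A": {"C1": 0.91, "C2": 0.09},
    "B": {"C1": 0.30, "C2": 0.70},
    "C": {"C1": 0.17, "C2": 0.83},
    "D": {"C1": 1.00, "C2": 0.00},
}

def harden(soft_assignments):
    """Assign each point to its most probable cluster."""
    return {point: max(probs, key=probs.get)
            for point, probs in soft_assignments.items()}

hard = harden(soft)
```

The result reproduces the hard clustering table, but the soft version keeps extra information: B's assignment to C2 is far less certain than D's assignment to C1.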
Uses of Clustering
Before we look at the types of clustering algorithms, let us go
through their use cases. Clustering algorithms are mainly used for:
Market Segmentation – Businesses use clustering to group their
customers and use targeted advertisements to attract a larger
audience.
4. Distribution-based Clustering
However, the following are some potential challenges that come with
these systems: