
ML1701 Natural Language Processing Department of AML 2024 - 2025

UNIT III - INFORMATION RETRIEVAL


Design Features of Information Retrieval systems - Information
Retrieval Models – Classical Information Retrieval Models - Non-
classical models of IR -Alternative Models of IR - Evaluation of the IR
System- Natural Language Processing in IR -Relation Matching -
Knowledge-based Approaches - Conceptual Graphs in IR -Cross-
lingual Information Retrieval
1.Design Features of Information Retrieval systems:

Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. The system assists users in finding the information they require, but it does not explicitly return the answers to their questions. Instead, it informs the user of the existence and location of documents that might contain the required information. The documents that satisfy the user's requirement are called relevant documents. A perfect IR system would retrieve only relevant documents.

With the help of the following diagram, we can understand the process
of information retrieval (IR) −


It is clear from the above diagram that a user who needs information will have to formulate a request in the form of a query in natural language. The IR system will then respond by retrieving the relevant output, in the form of documents, containing the required information.

2.Classical Problem in Information Retrieval (IR) System


The main goal of IR research is to develop a model for retrieving
information from the repositories of documents. Here, we are going to
discuss a classical problem, named ad-hoc retrieval problem, related
to the IR system.

In ad-hoc retrieval, the user must enter a query in natural language that describes the required information. The IR system will then return the documents related to the desired information. For example, suppose we are searching for something on the Internet: the system returns some pages that are exactly relevant to our requirement, but some non-relevant pages may be returned as well. This is the ad-hoc retrieval problem.

3.Information Retrieval (IR) Model

Mathematical models are used in many scientific areas with the objective of understanding some phenomenon in the real world. A model of information retrieval predicts and explains what a user will find relevant for a given query. An IR model is basically a pattern that defines the above-mentioned aspects of the retrieval procedure and consists of the following −

 A model for documents.


 A model for queries.
 A matching function that compares queries to documents.

Mathematically, a retrieval model consists of −

D − Representation for documents.

Q − Representation for queries.

F − The modeling framework for D and Q, along with the relationship between them.

R(q, di) − A similarity function which orders the documents with respect to the query. It is also called the ranking function.

Types of Information Retrieval (IR) Model

An information retrieval (IR) model can be classified into the following three types −

Classical IR Model

It is the simplest IR model and the easiest to implement. This model is based on mathematical knowledge that is easily recognized and understood. Boolean, Vector and Probabilistic are the three classical IR models.

Non-Classical IR Model

It is completely opposite to the classical IR model. Such IR models are based on principles other than similarity, probability and Boolean operations. The information logic model, situation theory model and interaction model are examples of non-classical IR models.

Alternative IR Model

It is an enhancement of the classical IR model, making use of specific techniques from other fields. The cluster model, fuzzy model and latent semantic indexing (LSI) model are examples of alternative IR models.

Design features of Information retrieval (IR) systems

Let us now learn about the design features of IR systems −

Inverted Index

The primary data structure of most IR systems is the inverted index. We can define an inverted index as a data structure that lists, for every word, all documents that contain it and the frequency of its occurrences in each document. It makes it easy to search for 'hits' of a query word.
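A minimal sketch of such an index in Python (the toy documents and variable names are illustrative only):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: frequency of the word in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

docs = {1: "information retrieval system",
        2: "retrieval of textual information",
        3: "database system design"}

index = build_inverted_index(docs)
print(index["retrieval"])   # {1: 1, 2: 1} -> 'hits' for the query word
print(index["system"])      # {1: 1, 3: 1}
```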

Stop Word Elimination

Stop words are high-frequency words that are deemed unlikely to be useful for searching. They carry little semantic weight. All such words are kept in a list called a stop list. For example, articles such as "a", "an", "the" and prepositions such as "in", "of", "for", "at" are stop words. The size of the inverted index can be significantly reduced by using a stop list. As per Zipf's law, a stop list covering a few dozen words reduces the size of the inverted index by almost half. On the other hand, eliminating stop words may sometimes remove a term that is useful for searching. For example, if we eliminate the word "A" from "Vitamin A", the remaining term loses its meaning.

Stemming

Stemming, the simplified form of morphological analysis, is the


heuristic process of extracting the base form of words by chopping off
the ends of words. For example, the words laughing, laughs, laughed
would be stemmed to the root word laugh.
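Both preprocessing steps can be sketched as follows, assuming the NLTK library with its English stop-word list and Porter stemmer is available (one possible choice among many):

```python
# pip install nltk; run nltk.download('stopwords') once before using the stop list
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_list = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase the text, remove stop words, and stem the remaining tokens."""
    tokens = text.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in stop_list]

print(preprocess("The children are laughing at the jokes"))
# e.g. ['children', 'laugh', 'joke']  (exact stems depend on the stemmer)
```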

In our subsequent sections, we will discuss about some important and


useful IR models.

The Boolean Model

It is the oldest information retrieval (IR) model. The model is based on


set theory and the Boolean algebra, where documents are sets of terms
and queries are Boolean expressions on terms. The Boolean model can
be defined as −

 D − A set of words, i.e., the indexing terms present in a document.


Here, each term is either present (1) or absent (0).
 Q − A Boolean expression, where terms are the index terms and
operators are logical products − AND, logical sum − OR and
logical difference − NOT


 F − Boolean algebra over sets of terms as well as over sets of


documents
If we talk about relevance, then in the Boolean IR model the relevance prediction can be defined as follows −
 R − A document is predicted as relevant to the query expression if and only if it satisfies the query expression, for example −

((text ∨ information) ∧ retrieval ∧ ¬ theory)


We can explain this model by viewing a query term as an unambiguous definition of a set of documents.

For example, the query term "economic" defines the set of documents that are indexed with the term "economic".

Now, what would be the result after combining terms with the Boolean AND operator? It will define a document set that is smaller than or equal to the document set of any of the single terms. For example, the query with terms "social" and "economic" will produce the set of documents that are indexed with both terms; in other words, the intersection of both sets.

Now, what would be the result after combining terms with the Boolean OR operator? It will define a document set that is bigger than or equal to the document set of any of the single terms. For example, the query with terms "social" or "economic" will produce the set of documents that are indexed with either the term "social" or the term "economic"; in other words, the union of both sets.
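Because the Boolean model treats each term as a set of documents, these operations map directly onto set intersection, union and difference. A small sketch with made-up postings:

```python
# Postings: the set of document ids indexed with each term (toy data)
postings = {
    "social":   {1, 2, 5},
    "economic": {2, 3, 5, 7},
    "theory":   {3, 7},
}

# social AND economic -> intersection
print(postings["social"] & postings["economic"])          # {2, 5}

# social OR economic -> union
print(postings["social"] | postings["economic"])          # {1, 2, 3, 5, 7}

# (social AND economic) AND NOT theory -> set difference
print((postings["social"] & postings["economic"]) - postings["theory"])  # {2, 5}
```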

Advantages of the Boolean Model

The advantages of the Boolean model are as follows −

 The simplest model, which is based on sets.


 Easy to understand and implement.
 It only retrieves exact matches.
 It gives the user a sense of control over the system.


Disadvantages of the Boolean Model

The disadvantages of the Boolean model are as follows −

 The model’s similarity function is Boolean. Hence, there would


be no partial matches. This can be annoying for the users.
 In this model, the Boolean operator usage has much more
influence than a critical word.
 The query language is expressive, but it is complicated too.
 No ranking for retrieved documents.

Vector Space Model

Due to the above disadvantages of the Boolean model, Gerard Salton


and his colleagues suggested a model, which is based on Luhn’s
similarity criterion. The similarity criterion formulated by Luhn states,
“the more two representations agreed in given elements and their
distribution, the higher would be the probability of their representing
similar information.”
Consider the following important points to understand more about the
Vector Space Model −

 The index representations (documents) and the queries are


considered as vectors embedded in a high dimensional Euclidean
space.
 The similarity measure of a document vector to a query vector is
usually the cosine of the angle between them.
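A minimal sketch of the cosine measure between a query vector and document vectors (the weight values are made up for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q  = [0.7, 0.7]   # query weights for the terms (car, insurance)
d2 = [0.6, 0.8]   # both terms salient -> small angle, high cosine
d1 = [1.0, 0.1]   # 'insurance' barely present
print(cosine(q, d2), cosine(q, d1))   # d2 scores higher than d1
```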

Vector Space Representation with Query and Document


The query and documents are represented in a two-dimensional vector space. The terms are car and insurance. There is one query and three documents in the vector space.

The top ranked document in response to the terms car and insurance
will be the document d2 because the angle between q and d2 is the
smallest. The reason behind this is that both the concepts car and
insurance are salient in d2 and hence have the high weights. On the
other side, d1 and d3 also mention both the terms but in each case, one
of them is not a centrally important term in the document.

Term Weighting

Term weighting means assigning weights to the terms in the vector space. The higher the weight of a term, the greater its impact on the cosine similarity. More weight should be assigned to the more important terms in the model. The question that arises here is how we can model this.

One way to do this is to use the raw count of a word in a document as its term weight. However, would that be an effective method?

Another method, which is more effective, is to use term frequency


(tfij), document frequency (dfi) and collection frequency (cfi).


Term Frequency (tfij)

It may be defined as the number of occurrences of wi in dj. The information captured by term frequency is how salient a word is within the given document; in other words, the higher the term frequency, the better that word describes the content of that document.

Document Frequency (dfi)

It may be defined as the total number of documents in the collection in which wi occurs. It is an indicator of informativeness. Semantically focused words occur several times within a document, unlike semantically unfocused words.


The informativeness of a term is usually captured through the inverse document frequency,

idf_t = log(N / n_t)

Here,

N = total number of documents in the collection

n_t = number of documents containing term t

A common tf-idf weighting then combines both factors, e.g. w_t,d = tf_t,d × idf_t.
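A small sketch of this weighting on a toy collection; it computes tf, the idf factor log(N/n_t), and their product for every term of every document:

```python
import math
from collections import Counter

docs = ["car insurance car", "car repair", "life insurance policy"]
tokenized = [d.split() for d in docs]
N = len(tokenized)                                  # documents in the collection

# n_t: number of documents containing term t
n_t = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(doc_tokens):
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(N / n_t[t]) for t in tf}

for i, doc in enumerate(tokenized):
    print(i, tf_idf(doc))
# 'car' occurs in 2 of 3 docs -> low idf; 'policy' occurs in 1 doc -> high idf
```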

User Query Improvement

The primary goal of any information retrieval system must be accuracy − to produce relevant documents as per the user's requirement. However, the question that arises here is how we can improve the output by improving the user's query formulation. Certainly, the output of any IR system depends on the user's query, and a well-formulated query will produce more accurate results. The user can improve his/her query with the help of relevance feedback, an important aspect of any IR model.

Relevance Feedback

Relevance feedback takes the output that is initially returned for the given query. This initial output can be used to gather user information and to determine whether it is relevant enough to be used for performing a new query. The feedback can be classified as follows −

Explicit Feedback

It may be defined as the feedback that is obtained from the assessors of


relevance. These assessors will also indicate the relevance of a
document retrieved from the query. In order to improve query retrieval
performance, the relevance feedback information needs to be
interpolated with the original query.

Assessors or other users of the system may indicate the relevance


explicitly by using the following relevance systems −

 Binary relevance system − This relevance feedback system


indicates that a document is either relevant (1) or irrelevant (0)
for a given query.
 Graded relevance system − The graded relevance feedback
system indicates the relevance of a document, for a given query,

on the basis of grading by using numbers, letters or descriptions.


The description can be like “not relevant”, “somewhat relevant”,
“very relevant” or “relevant”.
Implicit Feedback
It is the feedback that is inferred from user behavior. The behavior
includes the duration of time user spent viewing a document, which
document is selected for viewing and which is not, page browsing and
scrolling actions, etc. One of the best examples of implicit feedback
is dwell time, which is a measure of how much time a user spends
viewing the page linked to in a search result.

Pseudo Feedback

It is also called Blind feedback. It provides a method for automatic local


analysis. The manual part of relevance feedback is automated with the
help of Pseudo relevance feedback so that the user gets improved
retrieval performance without an extended interaction. The main
advantage of this feedback system is that it does not require assessors
like in explicit relevance feedback system.

Consider the following steps to implement this feedback −

 Step 1 − First, the results returned by the initial query are taken to be relevant. The assumed-relevant set is typically drawn from the top 10-50 results.
 Step 2 − Now, select the top 20-30 terms from these documents using, for instance, the term frequency (tf)-inverse document frequency (idf) weight.
 Step 3 − Add these terms to the query and match the returned documents. Then return the most relevant documents.
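These steps can be sketched as a small function; `search` and `top_terms` are hypothetical placeholders standing in for an existing ranking function and a tf-idf term selector:

```python
def pseudo_relevance_feedback(query, search, top_terms,
                              n_docs=10, n_terms=20):
    """Blind feedback: expand the query with terms from the top-ranked documents.

    search(query)      -> ranked list of documents (assumed to exist)
    top_terms(docs, k) -> k highest tf-idf terms in those documents (assumed)
    """
    # Step 1: assume the top-ranked results of the initial query are relevant
    assumed_relevant = search(query)[:n_docs]

    # Step 2: pick the highest-weighted terms from those documents
    expansion = top_terms(assumed_relevant, n_terms)

    # Step 3: add the terms to the query and re-run the search
    expanded_query = query + " " + " ".join(expansion)
    return search(expanded_query)
```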


4. Classic Information Retrieval

An Information Retrieval (IR) system's general purpose is to help users find relevant information. The IR system must present views of the documents to users, allowing quick selection of items that fully or partially satisfy their information need, which is specified through a query. In other words, every IR system's primary goal is to achieve high effectiveness, that is, to maximize the satisfaction its users obtain relative to the effort they spend on their queries.

Significant advances in electronic and computer technology have led to the advent of what has been termed the information age. These days, according to Mitra, information is readily available through electronic and networked media in colossal quantities and on an enormous variety of subjects. This information explosion has resulted in a great demand for efficient and effective means of organizing and indexing the data, so that we may retrieve useful information when it is required through queries.


Once there is a need to perform digital, effective, and efficient searches for any type of document in different collections, an Information Retrieval (IR) system is responsible for processing and indexing the collection based on different strategies and for assigning significant weights to each document. In this respect, an IR system aims to provide relevant information to its users for different queries over a given collection.

However, ensuring effectiveness for the results produced by an


IR system involves an appropriate modeling of the collection
documents, in order to properly produce a similarity function that
assigns effective similarity scores to collection documents in relation
to the performed query.

In this sense, to define similarity functions, the following IR models, called the classical models, were initially proposed: the Boolean model, the probabilistic model, and the vector space model.

A huge number of IR models have been proposed in the literature; Figure 2 shows just a few methods related to textual databases:

Figure 2: IR taxonomy for Text Retrieval


The classic Boolean Model (BM) is based on set theory and Boolean algebra. Thus, documents are represented as a set of indexing terms, while queries are represented as a Boolean expression on terms, using AND, OR and NOT operators. Moreover, BM uses the idea of exact matches between the user's query and the collection documents; there is no partial satisfaction in this model, and the response generated by BM is always binary: the document is (1) or is not (0) relevant. Therefore, the similarity between a document dj ∈ D and a query q ∈ Q can be formulated, in a general way, as

sim(dj, q) = 1 if some conjunctive component c(q) of the query matches c(dj), and 0 otherwise,

where c(q) corresponds to any conjunctive component from query q and c(dj) corresponds to the conjunctive component from document dj.

The Vector Space Model (VSM) recognizes the limitations of BM and proposes an algebraic solution that can perform partial matches. In VSM, the indexing terms are mutually independent and are represented as vectors in a t-dimensional space, where t is the number of indexing terms. Thereby, the logical view of documents and queries in VSM is t-dimensional vectors, built through a weighting scheme called TF-IDF, which aims to assign significant weights to the indexing terms. Therefore, the similarity degree between a document dj and a query q is calculated as the correlation between the two vectors, that is, the cosine of the angle between them:

sim(dj, q) = Σi (wij × wiq) / ( sqrt(Σi wij²) × sqrt(Σi wiq²) )

where wij is the weight of the term ki in relation to the document dj and wiq is the weight of the term ki in relation to the query q.
Finally, the classic Probabilistic Model (PM) was the last classical model to be proposed, by Robertson and Sparck Jones. The main idea of PM consists of, given a user query q and a document dj, estimating the probability that the user considers dj relevant; this is the probability that dj ∈ R, where R is the set of documents relevant to the given query. Thus, the similarity function in PM is calculated as the odds of relevance,

sim(dj, q) ∝ P(R | dj) / P(R̄ | dj)

where dj is a vector representation built with binary weights that indicate the absence or presence of indexing terms, and R̄ is the complement of R. Note that this estimate is hampered by a significant lack of information about the properties of the ideal set R. Because of that, Croft and Harper propose a simple method that generates a ranking function without any prior relevance information (see References).

Finally, it is important to mention that IR modeling is context dependent and is directly influenced by factors such as the type of the documents, whether the collection is homogeneous (e.g., news articles) or heterogeneous (e.g., Web pages), the size of the documents, and even practical aspects of the IR system's design. So which of these models is the best? It depends!


5.Non-classical models of Information Retrieval (IR)

Non-classical models of Information Retrieval (IR) have emerged to


address the limitations of traditional models by incorporating more
advanced techniques, such as machine learning and natural language
processing. Here’s an overview of some prominent non-classical
models:

1. Learning to Rank (LTR)

 Basics: Combines various features of documents and queries to


train models that predict the relevance of documents.
 Techniques: Includes supervised learning approaches, using
labeled data to improve ranking.


2. Neural Information Retrieval

 Basics: Leverages deep learning architectures (e.g., neural


networks) to model complex relationships in data.
 Models: Includes approaches like Siamese networks for
similarity matching and transformers for context understanding.

3. Vector Space with Embeddings

 Basics: Uses word embeddings (e.g., Word2Vec, GloVe) to


capture semantic relationships between terms, improving context
awareness.
 Pros: Better handling of synonyms and word meanings.

4. Reinforcement Learning

 Basics: Treats IR as a sequential decision-making problem,


where the model learns from user interactions and feedback over
time.
 Applications: Can adapt dynamically to user behavior and
preferences.

5. Semantic Search

 Basics: Focuses on understanding the intent and context behind


queries rather than just matching keywords.
 Techniques: Incorporates natural language processing and
knowledge graphs to enhance understanding.

6. Topic Models

 Basics: Models such as Latent Dirichlet Allocation (LDA)


identify abstract topics within documents, allowing for more
thematic searches.
 Pros: Helps in clustering and discovering patterns in large
collections of documents.


7. Graph-based Models

 Basics: Uses graph structures to represent documents and their


relationships (e.g., citations, co-occurrences).
 Examples: PageRank and HITS algorithms, which leverage
graph connectivity for ranking.

1. Learning to Rank (LTR)

Learning to Rank (LTR) is a machine learning approach specifically


designed for improving the ranking of search results. It uses labeled
training data to learn the relevance of documents to queries and
produces better ranking models than traditional methods. Here’s a
concise overview:

Types of Learning to Rank

1. Pointwise Approach
o Description: Treats individual document-query pairs as
separate training examples. The model predicts relevance
scores for each document based on features.
o Example: Using regression to predict a relevance score
(e.g., 0 to 5) for each document.
2. Pairwise Approach
o Description: Focuses on comparing pairs of document-
query pairs. It learns to rank one document higher than
another based on their relevance.
o Example: A model is trained to distinguish between pairs,
like Document A is more relevant than Document B for a
given query.
3. Listwise Approach
o Description: Considers the entire list of documents for a
query. It evaluates how well the entire list is ranked rather
than individual pairs or points.
o Example: Loss functions that measure the order of
documents in the list (e.g., NDCG, which takes into account
the position of relevant documents).


Key Components

 Features: The model uses various features to determine


relevance, including:
o Term frequency
o Inverse document frequency
o Click-through rates
o Document length
o Semantic similarity
 Training Data: Requires labeled data indicating the relevance of
documents to queries, often collected from user interactions or
expert annotations.
 Models: Various algorithms can be used for LTR, including:
o Decision Trees (e.g., Gradient Boosting)
o Support Vector Machines (SVM)
o Neural Networks
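To make the pointwise approach concrete, here is a minimal sketch using scikit-learn's GradientBoostingRegressor on invented feature vectors and graded relevance labels; the feature columns mirror the list above, and none of the numbers come from the text:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Each row: [term frequency, inverse document frequency, click-through rate, doc length]
X_train = [[3, 1.2, 0.30, 250],
           [0, 0.0, 0.01, 900],
           [5, 2.1, 0.45, 120],
           [1, 0.8, 0.05, 600]]
y_train = [3, 0, 5, 1]             # graded relevance labels (0-5)

model = GradientBoostingRegressor().fit(X_train, y_train)

# Score new query-document pairs, then sort documents by predicted relevance
candidates = [[2, 1.0, 0.20, 300], [4, 1.9, 0.40, 150]]
scores = model.predict(candidates)
ranking = sorted(zip(scores, ["doc_A", "doc_B"]), reverse=True)
print(ranking)
```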

Advantages

 Improved Relevance: Can significantly enhance the quality of


search results compared to traditional methods.
 Adaptability: Models can be continuously refined with new data
and user feedback.
 Flexibility: Can incorporate a wide range of features for better
performance.

Applications

LTR is widely used in search engines, recommendation systems, and


any application where ranking of items is crucial, allowing for more
personalized and relevant user experiences.

2. Neural Information Retrieval

Neural Information Retrieval (Neural IR) is an advanced approach


that utilizes deep learning techniques to improve the process of
retrieving relevant information from large datasets. It addresses some
limitations of traditional IR models by capturing complex patterns and
relationships within data. Here’s a concise overview:

Key Concepts

1. Deep Learning Models


o Neural Networks: Various architectures, including
feedforward networks, recurrent neural networks (RNNs),
and convolutional neural networks (CNNs), are used to
model the interactions between queries and documents.
o Transformers: State-of-the-art models like BERT
(Bidirectional Encoder Representations from Transformers)
leverage attention mechanisms to understand context and
semantics in language.
2. Embeddings
o Word Embeddings: Techniques like Word2Vec and
GloVe convert words into dense vectors, capturing
semantic relationships and improving the model’s
understanding of language.
o Document and Query Embeddings: Documents and
queries are represented as high-dimensional vectors,
allowing for effective similarity comparisons.
3. Relevance Scoring
o Ranking Mechanisms: Neural IR models often output a
relevance score for documents given a query, allowing for
ranking based on these scores.

Types of Neural IR Approaches

1. End-to-End Learning
o Description: The entire retrieval process, from input to
ranking, is trained as a single model, optimizing for
relevance directly based on user interactions.
2. Two-Stage Retrieval
o Description: Combines traditional methods (like keyword-
based search) with neural models. The first stage retrieves a
broad set of candidates, and the second stage ranks them
using neural techniques.
3. Query and Document Interaction Models
o Description: Focus on modeling the interaction between
queries and documents to improve ranking. Techniques


include attention mechanisms to focus on relevant parts of


the text.

Advantages

 Contextual Understanding: Captures semantic meaning,


allowing for better handling of synonyms and nuanced queries.
 Flexibility: Can incorporate various features and adapt to
different types of data.
 Scalability: Handles large datasets effectively through parallel
processing capabilities of neural networks.

Challenges

 Data Requirements: Requires substantial labeled training data


for effective learning.
 Interpretability: Neural models can be complex and less
interpretable than traditional IR models.
 Computational Resources: Often requires significant
computational power for training and inference.

Applications

Neural IR is widely used in modern search engines, recommendation


systems, chatbots, and any domain where understanding natural
language is crucial for retrieving relevant information.

3. Vector Space with Embeddings

Vector Space with Embeddings is a method in Information Retrieval


(IR) that enhances traditional vector space models by incorporating
word embeddings. This approach allows for a more nuanced
representation of text, capturing semantic relationships between words
and documents. Here’s a brief overview:

Key Concepts

1. Vector Space Model (VSM)


o Basics: In a traditional VSM, documents and queries are represented as vectors in a multi-dimensional space. Each dimension corresponds to a term (word) in the vocabulary.
o Similarity Measurement: Typically uses cosine similarity
to evaluate how similar the vectors (documents or queries)
are.
2. Word Embeddings
o Definition: Dense vector representations of words that
capture their meanings, contexts, and relationships.
Common models include:
 Word2Vec: Learns embeddings based on word co-
occurrences in a corpus.
 GloVe: Combines global word co-occurrence
statistics to produce embeddings.
o Pros: Embeddings enable the model to understand
synonyms and relationships (e.g., "king" - "man" +
"woman" ≈ "queen").

How It Works

1. Document Representation
o Instead of using sparse term vectors (where many terms
have zero frequency), documents are represented as vectors
of averaged or summed word embeddings.
o This results in dense vectors that capture the overall
meaning of the document.
2. Query Representation
o Queries are similarly transformed into dense vectors using
the same word embeddings, allowing for direct comparison
with document vectors.
3. Similarity Calculation
o Similarity between the query and documents can be
computed using cosine similarity or other distance metrics
on the dense vectors.
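A minimal sketch of this pipeline, assuming a tiny hand-made embedding table (a real system would load pretrained Word2Vec or GloVe vectors instead):

```python
import numpy as np

# Toy 3-dimensional "embeddings"; real ones are learned and much larger
emb = {
    "car":        np.array([0.90, 0.10, 0.00]),
    "automobile": np.array([0.85, 0.15, 0.05]),
    "insurance":  np.array([0.10, 0.90, 0.20]),
}

def embed(text):
    """Represent a text as the average of its word embeddings."""
    vecs = [emb[w] for w in text.split() if w in emb]
    return np.mean(vecs, axis=0)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = embed("automobile insurance")
doc   = embed("car insurance")
print(cos(query, doc))   # high similarity although the surface words differ
```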


Advantages

 Semantic Understanding: Better handles synonyms, word


variations, and contextual meanings compared to traditional bag-
of-words approaches.
 Dimensionality Reduction: Dense vectors significantly reduce
the dimensionality of the representation space, making
computations more efficient.
 Robustness: More resilient to noise in text, as embeddings
capture broader meanings.

Challenges

 Training Data: Requires a substantial corpus for effective


training of embeddings.
 Out-of-Vocabulary Words: Rare or new words not present in
the training data may not have embeddings.
 Interpretability: Dense vectors can be less interpretable than
traditional term-based representations.

Applications

Vector space with embeddings is widely used in search engines,


recommendation systems, and natural language processing tasks like
sentiment analysis, document classification, and chatbots.

4.Reinforcement Learning (RL)

Reinforcement Learning (RL) is a branch of machine learning


focused on training agents to make decisions by interacting with an
environment. The goal is to maximize cumulative rewards through
trial-and-error learning. Here’s a more detailed overview:

Key Components

1. Agent: The learner or decision-maker that takes actions in the


environment.


2. Environment: The context or setting where the agent operates.


The agent receives feedback from the environment based on its
actions.
3. States (s): Representations of the current situation of the agent
within the environment.
4. Actions (a): Choices made by the agent that affect the state of the
environment.
5. Rewards (r): Feedback signals received after taking an action,
indicating the immediate benefit or penalty associated with that
action.

Learning Process

1. Policy (π): A strategy that the agent employs to decide which


action to take based on the current state. It can be:
o Deterministic: Always yields the same action for a given
state.
o Stochastic: Provides a probability distribution over actions.
2. Value Function (V): Estimates the expected cumulative reward
from a given state. It helps the agent evaluate the desirability of
different states.
3. Q-Function (Q): Represents the expected cumulative reward of
taking a specific action in a specific state, guiding the agent’s
decision-making process.

Key Concepts

 Exploration vs. Exploitation:


o Exploration: Trying out new actions to discover their
effects.
o Exploitation: Leveraging known actions that yield high
rewards based on past experiences.

Learning Algorithms

1. Model-Free Methods:
o Q-Learning: A value-based method that updates Q-values
for state-action pairs based on the received rewards,


learning optimal policies without a model of the


environment.
o SARSA (State-Action-Reward-State-Action): Similar to
Q-learning but updates the Q-values based on the actual
action taken, incorporating the next action's value.
2. Policy Gradient Methods:
o Basics: Directly optimize the policy itself using gradients.
These methods adjust policy parameters based on the
rewards received.
o Examples: REINFORCE and Actor-Critic methods, where
an actor learns the policy and a critic evaluates it.
3. Deep Reinforcement Learning:
o Description: Integrates deep learning with RL, using neural
networks to handle complex, high-dimensional state spaces
(e.g., images). Examples include Deep Q-Networks (DQN)
and Proximal Policy Optimization (PPO).
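As a small illustration of the tabular case, the following sketch implements the standard Q-learning update with an epsilon-greedy policy; the states, actions and reward values are invented placeholders:

```python
from collections import defaultdict
import random

alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration
actions = ["left", "right"]
Q = defaultdict(float)                      # Q[(state, action)] -> value

def choose_action(state):
    # Epsilon-greedy: explore occasionally, otherwise exploit the best known action
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# One illustrative transition (state names and reward are made up)
s = "s0"
a = choose_action(s)
q_update(s, a, reward=1.0, next_state="s1")
print(dict(Q))
```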

Advantages

 Adaptability: Can learn optimal behaviors in dynamic and


uncertain environments.
 Generalization: Capable of transferring learned strategies to
similar tasks.
 Real-Time Decision-Making: Effective in sequential decision-
making scenarios.

Challenges

 Sample Efficiency: Requires extensive interactions with the


environment, which can be time-consuming.
 Credit Assignment Problem: Difficulties in determining which
actions are responsible for long-term rewards.
 Exploration Dilemmas: Balancing exploration of new strategies
with exploiting known ones can be tricky.

Applications

 Game Playing: RL has achieved remarkable success in games


like Go (AlphaGo) and video games (OpenAI Five).

 Robotics: Teaching robots to perform tasks through


reinforcement learning.
 Autonomous Vehicles: Training self-driving cars to navigate
complex environments.
 Recommendation Systems: Optimizing content delivery based
on user interactions.

5.Semantic Search

Semantic Search is an advanced search technique that aims to improve


search accuracy by understanding the context and intent behind user
queries rather than relying solely on keyword matching. Here’s a
concise overview:

Key Concepts

1. Understanding Intent:
o Semantic search focuses on deciphering what users actually
mean when they search, considering synonyms, related
concepts, and user context.
2. Contextual Relevance:
o It evaluates the meaning of the search query in relation to
the content of documents, rather than just matching
keywords.
3. Natural Language Processing (NLP):
o Utilizes NLP techniques to analyze and understand the
structure and meaning of language, enhancing the ability to
process user queries.

Techniques

1. Entity Recognition:
o Identifying and classifying key entities in text (e.g., people,
places, organizations) helps in understanding relationships
and context.
2. Word Embeddings:
o Uses vector representations of words (e.g., Word2Vec,
GloVe) that capture semantic meaning and relationships,
allowing for more nuanced matching.

3. Knowledge Graphs:
o Structures data in a way that defines relationships between
entities, enhancing the search engine’s ability to provide
relevant results.
4. Contextualized Embeddings:
o Models like BERT (Bidirectional Encoder Representations
from Transformers) consider the context of words within
sentences, improving comprehension of queries.

Advantages

 Improved Relevance: Provides results that are more aligned with


user intent, leading to higher satisfaction.
 Handling Variability: Better manages variations in language,
such as synonyms or different phrasing.
 User Experience: Offers a more intuitive search experience by
anticipating user needs and providing comprehensive answers.

Applications

 Search Engines: Major search engines like Google use semantic


search to enhance query understanding and deliver relevant
results.
 E-commerce: Helps customers find products more effectively by
understanding queries in the context of their needs.
 Chatbots and Virtual Assistants: Enhances conversational
interfaces by interpreting user requests more accurately.

Challenges

 Complexity: Building and maintaining systems that effectively


understand semantics can be technically challenging.
 Data Requirements: Requires large amounts of data to train
models for effective understanding.
 Interpretability: Semantic models, especially deep learning
ones, can be less interpretable than traditional keyword-based
approaches.


6. Topic Models

Topic Models are statistical models used in natural language


processing (NLP) and information retrieval to discover abstract themes
or topics within a collection of documents. They help to summarize
large datasets by identifying the underlying patterns and structures.
Here’s an overview of key concepts and techniques in topic modeling:

Key Concepts

1. Topic: A distribution over words that often co-occur in


documents. Each topic is characterized by a set of keywords that
are statistically significant.
2. Document Representation: Documents are represented as
mixtures of topics, where each document can contain multiple
topics in varying proportions.
3. Latent Variables: Topic models assume that the topics are latent
(hidden) variables that explain the observed words in documents.

Common Topic Modeling Techniques

1. Latent Dirichlet Allocation (LDA):


o Description: One of the most popular topic modeling
techniques. It assumes a generative process where
documents are produced from a mixture of topics.
o Mechanism:
 Each document is treated as a mixture of topics.
 Each topic is a distribution over words.
 LDA uses Dirichlet distributions to model the topic
proportions and word distributions.
o Output: Assigns probabilities to each topic for each
document and to each word for each topic.
2. Non-Negative Matrix Factorization (NMF):
o Description: A linear algebra-based approach that
factorizes a document-term matrix into two lower-
dimensional non-negative matrices.
o Mechanism: Decomposes the matrix into a topics matrix
and a document matrix, revealing the latent topics.


o Output: Each document is represented as a combination of topics, similar to LDA.
3. Latent Semantic Analysis (LSA):
o Description: A technique that uses singular value
decomposition (SVD) to reduce the dimensionality of the
document-term matrix.
o Mechanism: Identifies latent structures in the data, linking
similar terms and documents.
o Output: Represents documents in a reduced space, making
it easier to identify themes.
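A brief sketch of LDA on a toy corpus using scikit-learn; the corpus, the number of topics and the parameter values are arbitrary choices for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "stock market trading shares investors",
    "football match goal players league",
    "market investors shares economy",
    "league players football season goal",
]

vec = CountVectorizer()
X = vec.fit_transform(corpus)                        # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:]]   # 4 strongest words per topic
    print(f"topic {k}: {top}")

print(lda.transform(X)[0])   # topic mixture of the first document
```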

Applications

 Document Clustering: Grouping similar documents based on the


topics they cover.
 Information Retrieval: Enhancing search engines by indexing
documents based on identified topics.
 Content Recommendation: Suggesting articles or products
based on the topics of interest.
 Text Summarization: Summarizing large corpora by
highlighting the main themes.

Advantages

 Dimensionality Reduction: Helps to reduce the complexity of


text data by summarizing information.
 Interpretability: Provides a clear representation of the themes
present in a corpus, making it easier to understand large datasets.
 Flexibility: Can be applied to various types of textual data across
different domains.

Challenges

 Choosing the Number of Topics: Determining the optimal


number of topics can be subjective and requires careful tuning.
 Interpretation: Topics may sometimes be hard to interpret or
may not align perfectly with human understanding.
 Data Requirements: Requires a substantial amount of text data
to produce meaningful topics.

7. Graph-based Models

Graph-Based Models in information retrieval and natural language processing leverage the relationships between entities (documents, terms, users, etc.) to improve tasks like search, recommendation, and clustering. Here's a concise overview:
Key Concepts

1. Graphs: Structures made up of nodes (vertices) and edges


(connections between nodes). In IR, nodes can represent
documents, terms, or entities, while edges represent relationships
or similarities between them.
2. Types of Graphs:
o Directed Graphs: Edges have a direction, indicating a one-
way relationship (e.g., citation networks).
o Undirected Graphs: Edges indicate a two-way relationship
(e.g., co-occurrence of terms in documents).
3. Graph Representations:
o Adjacency Matrix: A square matrix used to represent a finite graph, where the entry at row i, column j indicates the presence or absence of an edge between nodes i and j.
o Edge List: A list of all edges in the graph, representing
connections between nodes.

Common Graph-Based Models

1. PageRank:
o Description: An algorithm developed by Google to rank
web pages based on the structure of links (edges) between
them.
o Mechanism: Assigns a score to each page based on the
number and quality of links pointing to it, simulating the
way users might navigate the web.
2. HITS (Hyperlink-Induced Topic Search):
o Description: A link analysis algorithm that identifies two
types of pages: hubs (pages that link to many other pages)
and authorities (pages that are linked to by many hubs).


o Mechanism: Assigns scores to hubs and authorities based on their relationships, allowing the identification of topic-specific resources.
3. Graph Neural Networks (GNNs):
o Description: Deep learning models designed to operate on
graph-structured data.
o Mechanism: Nodes in the graph are updated based on their
neighbors' features, allowing for the capture of local and
global patterns.
o Applications: Useful for tasks like node classification, link
prediction, and graph clustering.
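To illustrate the PageRank mechanism described above, the following sketch runs power iteration on a small made-up link graph; the damping factor 0.85 is the conventional choice:

```python
import numpy as np

# Adjacency: links[i] = pages that page i links to (toy graph of 4 pages)
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85                        # number of pages, damping factor

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)            # start with a uniform score
for _ in range(50):                   # power iteration
    rank = (1 - d) / n + d * M @ rank

print(rank)                           # page 2 collects the most link mass
```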

Applications

 Search Engines: Improve relevance by considering the


relationships between documents and terms, leading to better
ranking algorithms.
 Recommendation Systems: Analyze user-item interactions in a
graph format to provide personalized recommendations.
 Social Network Analysis: Explore relationships between users
and content to identify influential nodes and communities.
 Knowledge Graphs: Store and query complex relationships
between entities, enhancing search capabilities with contextual
understanding.

Advantages

 Rich Representation: Captures complex relationships that are


often missed in traditional models.
 Flexibility: Can be applied to various domains and data types,
from text to social networks.
 Dynamic Updates: Graphs can be updated as new data comes in,
allowing for real-time insights.

Challenges

 Scalability: Large graphs can be computationally intensive to


process and analyze.


 Data Quality: The effectiveness of graph-based models depends


on the quality and completeness of the relationships represented.
 Complexity: Designing and interpreting graph structures can be
non-trivial, requiring domain expertise.

6. Alternative Models of IR
 Cluster model,
 fuzzy model and
 latent semantic indexing (LSI) models are examples of alternative IR models.
What is Clustering?
The task of grouping data points based on their similarity with each other is called Clustering or Cluster Analysis. This method falls under the branch of Unsupervised Learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, we do not have a target variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates the similarity based on a metric like Euclidean distance, Cosine similarity, Manhattan distance, etc. and then groups the points with the highest similarity scores together.
For example, in the graph given below, we can clearly see that there are 3 circular clusters forming on the basis of distance.

Clustering in information retrieval

 The cluster hypothesis states the fundamental assumption we


make when using clustering in information retrieval.
 Cluster hypothesis. Documents in the same cluster behave
similarly with respect to relevance to information needs.
 The hypothesis states that if there is a document from a cluster
that is relevant to a search request, then it is likely that other
documents from the same cluster are also relevant.
 This is because clustering puts together documents that share
many terms. The cluster hypothesis essentially is the contiguity


hypothesis. In both cases, we posit that similar documents behave


similarly with respect to relevance.

Table 16.1 shows some of the main applications of clustering in


information retrieval. They differ in the set of documents that they
cluster - search results, collection or subsets of the collection - and the
aspect of an information retrieval system they try to improve - user
experience, user interface, effectiveness or efficiency of the search
system. But they are all based on the basic assumption stated by the
cluster hypothesis.

Figure 16.2: Clustering of search results to improve recall. None of the top hits cover the animal sense of jaguar, but users can easily access it by clicking on the cat cluster in the Clustered Results panel on the left (third arrow from the top).

The first application mentioned in Table 16.1 is search result


clustering where by search results we mean the documents that were


returned in response to a query. The default presentation of search


results in information retrieval is a simple list. Users scan the list from
top to bottom until they have found the information they are looking
for. Instead, search result clustering clusters the search results, so that
similar documents appear together. It is often easier to scan a few
coherent groups than many individual documents. This is particularly
useful if a search term has different word senses. The example in
Figure 16.2 is jaguar. Three frequent senses on the web refer to the car,
the animal and an Apple operating system. The Clustered Results panel
returned by the Vivísimo search engine (http://vivisimo.com) can be a
more effective user interface for understanding what is in the search
results than a simple list of documents.

Figure 16.3: An example of a user session in Scatter-Gather. A collection of New York Times news stories is clustered (``scattered'') into eight clusters (top row). The user manually gathers three of these into a smaller collection International Stories and performs another scattering operation. This process repeats until a small cluster with relevant documents is found (e.g., Trinidad).


A better user interface is also the goal of Scatter-Gather , the second


application in Table 16.1 . Scatter-Gather clusters the whole collection
to get groups of documents that the user can select or gather. The
selected groups are merged and the resulting set is again clustered. This
process is repeated until a cluster of interest is found. An example is
shown in Figure 16.3 .

Automatically generated clusters like those in Figure 16.3 are not as


neatly organized as a manually constructed hierarchical tree like the
Open Directory at http://dmoz.org. Also, finding descriptive labels for clusters automatically is a difficult problem (Section 17.7).
But cluster-based navigation is an interesting alternative to keyword
searching, the standard information retrieval paradigm. This is
especially true in scenarios where users prefer browsing over searching
because they are unsure about which search terms to use.

As an alternative to the user-mediated iterative clustering in Scatter-


Gather, we can also compute a static hierarchical clustering of a
collection that is not influenced by user interactions (``Collection
clustering'' in Table 16.1 ). Google News and its precursor, the
Columbia NewsBlaster system, are examples of this approach. In the
case of news, we need to frequently recompute the clustering to make
sure that users can access the latest breaking stories. Clustering is well
suited for access to a collection of news stories since news reading is
not really search, but rather a process of selecting a subset of stories
about recent events.

The fourth application of clustering exploits the cluster hypothesis


directly for improving search results, based on a clustering of the entire
collection. We use a standard inverted index to identify an initial set of
documents that match the query, but we then add other documents from
the same clusters even if they have low similarity to the query. For
example, if the query is car and several car documents are taken from
a cluster of automobile documents, then we can add documents from
this cluster that use terms other than car (automobile, vehicle etc). This
can increase recall since a group of documents with high mutual
similarity is often relevant as a whole.


More recently this idea has been used for language modeling. Equation (102) showed that, to avoid sparse data problems in the language modeling approach to IR, the model of document d can be interpolated with a collection model. But the collection contains many documents with terms untypical of d. By replacing the collection model with a model derived from d's cluster, we get more accurate estimates of the occurrence probabilities of terms in d.
Clustering can also speed up search. As we saw in Section 6.3.2, search in the vector space model amounts to finding the nearest neighbors to the query. The inverted index supports fast nearest-neighbor search for the standard IR setting.
However, sometimes we may not be able to use an inverted index
efficiently, e.g., in latent semantic indexing (Chapter 18 ). In such
cases, we could compute the similarity of the query to every document,
but this is slow. The cluster hypothesis offers an alternative: Find the
clusters that are closest to the query and only consider documents from
these clusters. Within this much smaller set, we can compute
similarities exhaustively and rank documents in the usual way. Since
there are many fewer clusters than documents, finding the closest
cluster is fast; and since the documents matching a query are all similar
to each other, they tend to be in the same clusters. While this algorithm
is inexact, the expected decrease in search quality is small. This is
essentially the application of clustering that was covered in
Section 7.1.6.
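A rough sketch of this cluster-pruning idea, using k-means over tf-idf vectors from scikit-learn; the corpus and parameter values are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = ["car engine repair", "car insurance price", "football match result",
        "league football season", "vehicle insurance claim", "goal match replay"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def cluster_search(query, k=3):
    q = vec.transform([query])
    # 1. find the cluster whose centroid is closest to the query
    best = cosine_similarity(q, km.cluster_centers_).argmax()
    # 2. rank only the documents inside that cluster
    members = [i for i, c in enumerate(km.labels_) if c == best]
    sims = cosine_similarity(q, X[members]).ravel()
    return sorted(zip(sims, [docs[i] for i in members]), reverse=True)[:k]

print(cluster_search("automobile insurance"))
```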

Exercises.

 Define two documents as similar if they have at least two proper


names like Clinton or Sarkozy in common. Give an example of
an information need and two documents, for which the cluster
hypothesis does not hold for this notion of similarity.
 Make up a simple one-dimensional example (i.e. points on a line)
with two clusters where the inexactness of cluster-based retrieval
shows up. In your example, retrieving clusters close to the query
should do worse than direct nearest neighbor search.


Semantic hashing

In semantic hashing,[21] documents are mapped to memory addresses by means of a neural network in such a way that semantically similar documents are located at nearby addresses. A deep neural network essentially builds a graphical model of the word-count vectors obtained from a large set of documents. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality-sensitive hashing, which is the fastest current method.
Latent semantic indexing
Latent semantic indexing (LSI) is an indexing and retrieval method
that uses a mathematical technique called singular value
decomposition (SVD) to identify patterns in the relationships between
the terms and concepts contained in an unstructured collection of text.
LSI is based on the principle that words that are used in the same
contexts tend to have similar meanings. A key feature of LSI is its
ability to extract the conceptual content of a body of text by
establishing associations between those terms that occur in
similar contexts.[22]
LSI is also an application of correspondence analysis, a multivariate
statistical technique developed by Jean-Paul Benzécri[23] in the early
1970s, to a contingency table built from word counts in documents.
Called "latent semantic indexing" because of its ability to
correlate semantically related terms that are latent in a collection of
text, it was first applied to text at Bellcore in the late 1980s. The
method, also called latent semantic analysis (LSA), uncovers the
underlying latent semantic structure in the usage of words in a body of
text and how it can be used to extract the meaning of the text in response
to user queries, commonly referred to as concept searches. Queries, or
concept searches, against a set of documents that have undergone LSI


will return results that are conceptually similar in meaning to the search
criteria even if the results don’t share a specific word or words with the
search criteria.
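A compact sketch of the LSI idea using truncated SVD on a tf-idf matrix; scikit-learn's TruncatedSVD is used here as one possible implementation and the documents are toy examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is driven on the road",
        "the truck is driven on the highway",
        "a cat sat on the mat"]

X = TfidfVectorizer().fit_transform(docs)            # term-document weights
svd = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent concepts
Z = svd.fit_transform(X)                             # documents in concept space

# The car/road and truck/highway documents end up close in the latent space
# even though they share few query words, unlike the cat/mat document.
print(cosine_similarity(Z)[0])
```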

Benefits of LSI
LSI helps overcome synonymy by increasing recall, one of the most
problematic constraints of Boolean keyword queries and vector space
models.[18] Synonymy is often the cause of mismatches in the
vocabulary used by the authors of documents and the users
of information retrieval systems.[24] As a result, Boolean or keyword
queries often return irrelevant results and miss information that is
relevant.
LSI is also used to perform automated document categorization. In fact,
several experiments have demonstrated that there are a number of
correlations between the way LSI and humans process and categorize
text.[25] Document categorization is the assignment of documents to one
or more predefined categories based on their similarity to the
conceptual content of the categories.[26] LSI uses example documents
to establish the conceptual basis for each category. During
categorization processing, the concepts contained in the documents
being categorized are compared to the concepts contained in the
example items, and a category (or categories) is assigned to the
documents based on the similarities between the concepts they contain
and the concepts that are contained in the example documents.
Dynamic clustering based on the conceptual content of documents can
also be accomplished using LSI. Clustering is a way to group
documents based on their conceptual similarity to each other without
using example documents to establish the conceptual basis for each
cluster. This is very useful when dealing with an unknown collection
of unstructured text.
Because it uses a strictly mathematical approach, LSI is inherently
independent of language. This enables LSI to elicit the semantic
content of information written in any language without requiring the


use of auxiliary structures, such as dictionaries and thesauri. LSI can


also perform cross-linguistic concept searching and example-based
categorization. For example, queries can be made in one language, such
as English, and conceptually similar results will be returned even if they
are composed of an entirely different language or of multiple
languages.
LSI is not restricted to working only with words. It can also process
arbitrary character strings. Any object that can be expressed as text can
be represented in an LSI vector space. For example, tests with
MEDLINE abstracts have shown that LSI is able to effectively classify
genes based on conceptual modeling of the biological information
contained in the titles and abstracts of the MEDLINE citations.[27]
LSI automatically adapts to new and changing terminology, and has
been shown to be very tolerant of noise (i.e., misspelled words,
typographical errors, unreadable characters, etc.).[28] This is especially
important for applications using text derived from Optical Character
Recognition (OCR) and speech-to-text conversion. LSI also deals
effectively with sparse, ambiguous, and contradictory data.
Text does not need to be in sentence form for LSI to be effective. It can
work with lists, free-form notes, email, Web-based content, etc. As
long as a collection of text contains multiple terms, LSI can be used to
identify patterns in the relationships between the important terms and
concepts contained in the text.
LSI has proven to be a useful solution to a number of conceptual
matching problems.[29][30] The technique has been shown to capture key
relationship information, including causal, goal-oriented, and
taxonomic information.

Additional uses of LSI
It is generally acknowledged that the ability to work with text on a
semantic basis is essential to modern information retrieval systems. As
a result, the use of LSI has significantly expanded in recent years as
earlier challenges in scalability and performance have been overcome.
LSI is being used in a variety of information retrieval and text
processing applications, although its primary application has been for
concept searching and automated document categorization.[36] Below
are some other ways in which LSI is being used:

 Information discovery[37] (eDiscovery, Government/Intelligence community, Publishing)
 Automated document classification (eDiscovery,
Government/Intelligence community, Publishing)[38]
 Text summarization[39] (eDiscovery, Publishing)
 Relationship discovery[40] (Government, Intelligence community,
Social Networking)
 Automatic generation of link charts of individuals and
organizations[41] (Government, Intelligence community)
 Matching technical papers and grants with reviewers[42] (Government)
 Online customer support[43] (Customer Management)
 Determining document authorship[44] (Education)
 Automatic keyword annotation of images[45]
 Understanding software source code[46] (Software Engineering)
 Filtering spam[47] (System Administration)
 Information visualization[48]
 Essay scoring[49] (Education)
 Literature-based discovery[50]
 Stock returns prediction[7]
 Dream Content Analysis (Psychology) [8]
LSI is increasingly being used for electronic document discovery
(eDiscovery) to help enterprises prepare for litigation. In eDiscovery,
the ability to cluster, categorize, and search large collections of
unstructured text on a conceptual basis is essential. Concept-based
searching using LSI has been applied to the eDiscovery process by
leading providers as early as 2003.[51]

Challenges to LSI

Early challenges to LSI focused on scalability and performance. LSI
requires relatively high computational performance and memory in
comparison to other information retrieval techniques.[52] However,
with the implementation of modern high-speed processors and the
availability of inexpensive memory, these considerations have been
largely overcome. Real-world LSI applications routinely process more
than 30 million documents through the full matrix and SVD
computations. A fully
scalable (unlimited number of documents, online training)
implementation of LSI is contained in the open source gensim software
package.[53]
Another challenge to LSI has been the alleged difficulty in determining
the optimal number of dimensions to use for performing the SVD. As
a general rule, fewer dimensions allow for broader comparisons of the
concepts contained in a collection of text, while a higher number of
dimensions enable more specific (or more relevant) comparisons of
concepts. The actual number of dimensions that can be used is limited
by the number of documents in the collection. Research has
demonstrated that around 300 dimensions will usually provide the best
results with moderate-sized document collections (hundreds of
thousands of documents) and perhaps 400 dimensions for larger
document collections (millions of documents).[54] However, recent
studies indicate that 50-1000 dimensions are suitable depending on the
size and nature of the document collection.[55] Checking the proportion
of variance retained, similar to PCA or factor analysis, to determine the
optimal dimensionality is not suitable for LSI. A synonym test or
prediction of missing words are two possible methods for finding the
correct dimensionality.[56] When LSI topics are used as features in
supervised learning methods, one can use prediction error
measurements to find the ideal dimensionality.
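
The gensim package mentioned above exposes the number of retained
dimensions as the num_topics parameter of its LsiModel; a minimal sketch
of sweeping that parameter is shown below. The toy corpus and the
candidate values are assumptions, and in practice each setting would be
scored with a synonym test, missing-word prediction, or a downstream
supervised task, as the text suggests.

from gensim.corpora import Dictionary
from gensim.models import LsiModel

# Tiny tokenized corpus, assumed for illustration.
texts = [
    ["cardiac", "surgery", "valve"],
    ["heart", "treatment", "patient"],
    ["stock", "market", "returns"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for k in (2, 3):                        # candidate dimensionalities to compare
    lsi = LsiModel(corpus, id2word=dictionary, num_topics=k)
    print(k, lsi.projection.s)          # singular values of the retained dimensions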

8.Natural Language Processing (NLP) in IR


Natural Language Processing (NLP) plays a crucial role in Information
Retrieval (IR), enhancing how systems understand, search, and
organize information. Here’s an overview of how NLP is applied in IR:


1. Query Understanding:
o Parsing and Tokenization: NLP breaks down user queries
into components like keywords, phrases, and entities,
helping the system understand the user’s intent.
o Synonym Expansion: By recognizing synonyms and
related terms, NLP can broaden the scope of search results
to include variations of the query terms.
2. Document Indexing:
o Text Normalization: Processes like stemming (reducing
words to their base forms) and lemmatization (reducing
words to their dictionary form) help in indexing documents
consistently (a short preprocessing sketch appears at the
end of this section).
o Named Entity Recognition (NER): Identifies and
categorizes entities (e.g., names of people, organizations) in
documents, enhancing the relevance of search results.
3. Relevance Ranking:
o Semantic Analysis: Beyond keyword matching, NLP
assesses the meaning of words in context, improving the
accuracy of search results.
o Topic Modeling: Techniques like Latent Dirichlet
Allocation (LDA) help in understanding the themes or
topics within documents, aiding in better ranking and
retrieval.
4. Query Expansion and Refinement:
o Relevance Feedback: NLP analyzes user interactions to
refine and improve search results based on what users find
relevant or not.
o Contextual Understanding: Techniques like BERT
(Bidirectional Encoder Representations from Transformers)
enable systems to understand the context of queries better,
improving accuracy in results.
5. Text Classification and Categorization:
o Document Classification: NLP categorizes documents
into predefined classes or topics, which can be useful for
organizing large corpora and improving retrieval efficiency.

o Sentiment Analysis: Helps in retrieving documents based
on the sentiment expressed, which can be useful in
applications like customer feedback analysis.
6. Summarization and Extraction:
o Automatic Summarization: Provides concise summaries
of documents, which can help users quickly assess the
relevance of search results.
o Information Extraction: Extracts specific information
(e.g., facts, figures) from documents, making it easier to
find precise answers to queries.
7. Multilingual Retrieval:
o Translation and Transcription: NLP facilitates searching
across different languages by translating queries and
documents or by transcribing spoken content.
NLP continues to evolve, and its integration with IR systems is
becoming more sophisticated, thanks to advancements in machine
learning and deep learning technologies.
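
A minimal sketch of the indexing-side preprocessing mentioned in point 2
is given below, using NLTK. The sample sentence is an assumption, and the
"punkt" tokenizer data must be downloaded once before use.

# Requires: nltk.download("punkt") on first use.
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
doc = "Apple announced new retrieval features for its search engines"

tokens = word_tokenize(doc.lower())                  # tokenization
index_terms = [stemmer.stem(t) for t in tokens]      # stemming for consistent indexing
print(index_terms)    # e.g. "features" -> "featur", "engines" -> "engin"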

9.Relation matching

Relation matching in the context of Natural Language Processing
(NLP) and Information Retrieval (IR) is about identifying and linking
relationships between entities in text. This can be crucial for tasks like
knowledge graph construction, question answering, and semantic
search. Here’s a breakdown of how relation matching works and its
applications:

Key Concepts in Relation Matching

1. Entity Recognition:
o Named Entity Recognition (NER): Identifies and
classifies entities (e.g., people, organizations, locations) in
text. For example, in the sentence "Steve Jobs founded
Apple," NER identifies "Steve Jobs" and "Apple" as
entities.
2. Relation Extraction:
o Pattern-Based Extraction: Uses predefined patterns or
rules to identify relationships between entities. For instance,
the pattern "X founded Y" can be used to extract the
relationship "founded" between "Steve Jobs" and "Apple."
o Machine Learning Approaches: Employs supervised
learning models to recognize relationships based on labeled
training data. Techniques include classification algorithms
and sequence models like CRFs (Conditional Random
Fields).
o Deep Learning Models: Utilizes models like BERT or
GPT to understand and extract relationships based on
contextual understanding. These models are often fine-
tuned on specific datasets to improve performance in
relation extraction.
3. Semantic Role Labeling:
o Understanding Relationships: Semantic role labeling
involves identifying roles that entities play in a sentence,
which helps in understanding their relationships. For
example, in "Barack Obama was born in Honolulu," the role
of "Barack Obama" is the subject and "Honolulu" is the
place of birth.
4. Knowledge Graph Construction:
o Linking Entities: Relation matching helps in creating and
populating knowledge graphs by linking entities with their
relationships. For instance, a knowledge graph might link
"Albert Einstein" with "Theory of Relativity" through the
relationship "developed."
5. Question Answering:
o Contextual Relation Understanding: In question
answering systems, relation matching helps in determining
the correct answer by understanding the relationships
between entities in the query and the information in the
corpus.
6. Semantic Search:
o Enhanced Query Understanding: Relation matching
improves search by understanding the underlying
relationships between terms in a query, leading to more
relevant results. For example, searching for "CEO of

42
St. Joseph’s College of Engineering
ML1701 Natural Language Processing Department of AML 2024 - 2025

Microsoft" involves understanding that "CEO" and


"Microsoft" are related in a specific way.

Techniques and Tools for Relation Matching

1. Rule-Based Methods:
o Regular Expressions: Simple patterns for matching
relationships, often used in conjunction with NER (a brief
sketch follows this list).
o Heuristic Rules: Manually crafted rules based on domain
knowledge.
2. Machine Learning Approaches:
o Feature-Based Models: Uses features like syntactic
patterns, part-of-speech tags, and entity types to train
classifiers.
o Sequence Models: Models like LSTM (Long Short-Term
Memory) and CRFs for extracting relationships in
sequences.
3. Deep Learning Approaches:
o Transformer Models: BERT, GPT, and other transformer-
based models can be fine-tuned for relation extraction tasks
by leveraging their ability to understand context and
semantics.
o Entity-Aware Models: Specialized models that
incorporate entity embeddings to improve relation
matching.
4. Evaluation Metrics:
o Precision, Recall, and F1-Score: Common metrics used to
evaluate the performance of relation extraction systems.
Precision measures accuracy, recall measures
completeness, and F1-score balances both.
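
The following sketch ties together the rule-based extraction of point 1
and the metrics of point 4: a single regular expression for the
"X founded Y" pattern, scored with precision, recall, and F1 against a
tiny hand-made gold set. The sentences, pattern, and gold labels are all
assumptions used only for illustration.

import re

sentences = [
    "Steve Jobs founded Apple",
    "Bill Gates founded Microsoft",
    "Honolulu is the birthplace of Barack Obama",
]
gold = {("Steve Jobs", "Apple"), ("Bill Gates", "Microsoft")}   # assumed gold relations

pattern = re.compile(r"^(?P<subj>.+?) founded (?P<obj>.+)$")    # pattern "X founded Y"
predicted = set()
for s in sentences:
    m = pattern.match(s)
    if m:
        predicted.add((m.group("subj"), m.group("obj")))

tp = len(predicted & gold)
precision = tp / len(predicted) if predicted else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(predicted, precision, recall, f1)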

Applications

 Knowledge Graphs: Enhancing knowledge graphs with rich and
accurate relationships between entities.
 Customer Support: Automating the extraction of relationships
from customer queries to provide better support.

 Information Extraction: Extracting structured information from
unstructured text for databases and applications.
 Content Recommendation: Improving recommendation
systems by understanding relationships between entities in user
preferences.
Relation matching is a vital component of modern NLP systems and is
continually evolving with advancements in machine learning and AI.

10.Cross-language information retrieval (CLIR)


Cross-language information retrieval (CLIR) is a subfield of
information retrieval dealing with retrieving information written in a
language different from the language of the user's query.[1] The term
"cross-language information retrieval" has many synonyms, of which
the following are perhaps the most frequent: cross-lingual information
retrieval, translingual information retrieval, multilingual information
retrieval. The term "multilingual information retrieval" refers more
generally both to technology for retrieval of multilingual collections
and to technology that has been adapted from handling material in one
language to handling material in another. The term Multilingual Information Retrieval
(MLIR) involves the study of systems that accept queries for
information in various languages and return objects (text, and other
media) of various languages, translated into the user's language. Cross-
language information retrieval refers more specifically to the use case
where users formulate their information need in one language and the
system retrieves relevant documents in another. To do so, most CLIR
systems use various translation techniques. CLIR techniques can be
classified into different categories based on different translation
resources:[2]
 Dictionary-based CLIR techniques
 Parallel corpora based CLIR techniques
 Comparable corpora based CLIR techniques
 Machine translator based CLIR techniques


CLIR systems have improved so much that the most accurate
multilingual and cross-lingual ad-hoc information retrieval systems
today are nearly as effective as monolingual systems.[3] Other related
information access tasks, such as media monitoring, information
filtering and routing, sentiment analysis, and information extraction
require more sophisticated models and typically more processing and
analysis of the information items of interest. Much of that processing
needs to be aware of the specifics of the target languages it is deployed
in.
The various mechanisms of variation in human language pose
coverage challenges for information retrieval systems: texts in a
collection may treat a topic of interest but use terms or expressions
which do not match the expression of information need given by the
user. This can be true even in a mono-lingual case, but this is especially
true in cross-lingual information retrieval, where users may know the
target language only to some extent. The benefits of CLIR technology
for users with poor to moderate competence in the target language has
been found to be greater than for those who are fluent.[4] Specific
technologies in place for CLIR services include morphological analysis
to handle inflection, decompounding or compound splitting to handle
compound terms, and translation mechanisms to translate a query
from one language to another.
The first workshop on CLIR was held in Zürich during the SIGIR-96
conference.[5] Workshops have been held yearly since 2000 at the
meetings of the Cross Language Evaluation Forum (CLEF).
Researchers also convene at the annual Text Retrieval Conference
(TREC) to discuss their findings regarding different systems and
methods of information retrieval, and the conference has served as a
point of reference for the CLIR subfield.[6] Early CLIR experiments
were conducted at TREC-6, held at the National Institute of Standards
and Technology (NIST) on November 19–21, 1997.[7]


CLIR and its Motivation


Cross-lingual Information Retrieval is the task of retrieving relevant
information when the document collection is written in a different
language from the user query. Figure 1 below shows a typical
architecture of a CLIR system. There are many situations where CLIR
becomes essential because the information is not in the user’s native
language.

Translation Approaches
CLIR requires the ability to represent and match information in the
same representation space even if the query and the document
collection are in different languages. The fundamental problem in
CLIR is to match terms in different languages that describe the same or
a similar meaning. The strategy of mapping between different language
representations is usually machine translation. In CLIR, this translation
can be carried out in several ways.
Document translation [2] maps the document representation into the
query representation space, as illustrated in Figure 2.
Query translation [3] maps the query representation into the document
representation space, as illustrated in Figure 3.

Pivot language or Interlingua [4,5] maps both document and query
representations to a third space.

Query translation is generally considered the most appropriate approach.
The query is short and thus faster to translate than the document, and it is
more flexible and allows more interaction with users if the user
understands the translation. However, query translation can suffer from
translation ambiguity, and this problem is even more obvious for the
short query text due to the limited context. By contrast, document
translation can provide more accurate translation thanks to richer
contexts. Document translation also has the advantage that once the
translated document is retrieved, the user can directly understand it,
while the query translation still needs a post-retrieval translation.
However, several experiments show no clear advantage of one approach
over the other when the same machine translation system is used [6];
the effectiveness depends more on the translation direction [7].
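
As a concrete, if simplified, illustration of dictionary-based query
translation (the first category listed earlier), the sketch below expands
each query term into all of its candidate translations. The bilingual
lexicon and the query are assumptions, and a real system would still
need to resolve the resulting translation ambiguity.

# Assumed toy English-to-Spanish lexicon; ambiguous terms keep several candidates.
bilingual_lexicon = {
    "heart": ["corazon"],
    "attack": ["ataque", "infarto"],
    "treatment": ["tratamiento"],
}

def translate_query(query_terms, lexicon):
    translated = []
    for term in query_terms:
        # Fall back to the original term when no translation is known.
        translated.extend(lexicon.get(term, [term]))
    return translated

print(translate_query(["heart", "attack", "treatment"], bilingual_lexicon))
# ['corazon', 'ataque', 'infarto', 'tratamiento']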

Conferences and Data Sets


There are several data sets available for CLIR. The first is the TREC
(Text REtrieval Conference) organized by the National Institute of
Standards and Technology (NIST). It starts with English to Spanish and
then more languages are added including French, German, Italian,
Dutch, Chinese, Arabic, and so on. The second data set is from CLEF
(Cross-Language Experiment Forum). It focuses on European
languages where the first experiments include English, German,
French, and Italian documents using queries in Dutch, English, French,
German, Italian, Spanish, Swedish, and Finnish. The third one is the
NTCIR series of workshops organized by the National Institute for
Informatics (NII) of Japan. They emphasize Asian languages such as
Japanese, Chinese, Korean, Vietnamese, and Mongolian.

Recent Progress
Very recently, cross-lingual word embeddings and neural network
based information retrieval systems have become increasingly popular.
Cross-lingual word embeddings can represent words in different
languages in the same vector space by learning a mapping between
monolingual embeddings, even without bilingual supervision. Neural
information retrieval can build better representations for documents
and queries and learn to rank directly from relevance labels. Here we
briefly discuss three recent papers in this direction.

DUET
This is the paper Learning to Match using Local and Distributed
Representations of Text for Web Search, WWW 2017 by Bhaskar Mitra,
Fernando Diaz, and Nick Craswell.

In traditional information retrieval approaches, we build a local
representation by discrete terms in the text. The relevance of a
document is based on the exact matches of query terms in the body text.
On the other hand, models such as latent semantic analysis and latent
Dirichlet allocation learn low dimensional vector representations of
terms. The query and the document are matched in the latent semantic
space. In this work, they propose a document ranking model consisting
of two separate deep neural network sub-models. The first sub-model
matches the query and the document using a local representation of
text, while the second learns a distributed representations for queries
and documents before matching them.
The overall architecture is shown in Figure 4.

MUSE
The second paper is Word translation without parallel data, ICLR 2018
by Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato,
Ludovic Denoyer, Hervé Jégou.
The paper studies cross-lingual word embeddings where word
embeddings for two languages are aligned in the same representation
space (Figure 5). State-of-the-art methods for cross-lingual word
embeddings rely on bilingual supervision such as dictionaries or
parallel corpora. Recent studies try to alleviate the bilingual

supervision need by using character-level information and iterative
training.
However, these approaches do not achieve performance on par with
supervised methods. This work proposes to learn a mapping to align
monolingual word embedding spaces in an unsupervised way without
any parallel data.
The experiments also demonstrate that their method is even better than
existing supervised methods on some language pairs.
Check out their implementation and Multilingual word Embeddings at
MUSE.
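
A minimal sketch of the mapping step behind such cross-lingual
embeddings is given below: an orthogonal (Procrustes) transformation
computed from a few seed word pairs. MUSE obtains those pairs without
supervision (adversarially); here a tiny seed dictionary of index pairs
is simply assumed, and the embeddings are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))             # source-language word embeddings (rows)
Y = rng.normal(size=(5, 4))             # target-language word embeddings (rows)
seed_pairs = [(0, 0), (1, 1), (2, 2)]   # assumed seed translation pairs (indices)

A = np.stack([X[i] for i, _ in seed_pairs])
B = np.stack([Y[j] for _, j in seed_pairs])

# Orthogonal Procrustes: W = U V^T from the SVD of B^T A maps X into Y's space.
U, _, Vt = np.linalg.svd(B.T @ A)
W = U @ Vt
X_mapped = X @ W.T                      # now comparable to Y via cosine similarity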

Unsupervised CLIR

The third paper is Unsupervised Cross-Lingual Information Retrieval
using Monolingual Data Only, SIGIR 2018 by Robert Litschko, Goran
Glavaš, Simone Paolo Ponzetto, Ivan Vulić.
They propose an unsupervised CLIR framework. To this end, they
leverage shared cross-lingual word embedding spaces induced solely
from monolingual corpora in two languages through an iterative
process based on adversarial neural networks. The information retrieval
is performed by calculating semantic similarity directly from the cross-
lingual embedding space. This does not require any bilingual
supervision or relevance labels of documents.
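
Once such a shared space exists, retrieval itself reduces to ranking the
documents by similarity to the query in that space, with no bilingual
supervision or relevance labels. The sketch below uses random vectors as
placeholders for the query and document representations (for example,
averaged word embeddings), which is an assumption made only for
illustration.

import numpy as np

rng = np.random.default_rng(1)
doc_vectors = rng.normal(size=(4, 50))   # placeholder document vectors in the shared space
query_vector = rng.normal(size=(50,))    # placeholder query vector in the same space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query_vector, d) for d in doc_vectors]
ranking = np.argsort(scores)[::-1]       # indices of documents, best match first
print(ranking, scores)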

Clustering, introduced earlier as a way to group documents by their
conceptual similarity, does not require that the clusters formed be
circular in shape. The shape of clusters can be arbitrary, and there are
many algorithms that work well at detecting arbitrarily shaped clusters
(a brief sketch using one such algorithm follows).
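
One such algorithm is DBSCAN, which grows clusters from dense regions of
points and can therefore follow crescent or ring shapes. The two-moons
data and the eps/min_samples values below are assumptions chosen only to
illustrate the idea.

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)   # two crescent-shaped groups
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # typically {0, 1}, possibly with a few points labelled -1 as noise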

Types of Clustering
Broadly speaking, there are 2 types of clustering that can be performed
to group similar data points:
 Hard Clustering: In this type of clustering, each data point either
belongs to a cluster completely or it does not. For example, say there
are 4 data points and we have to cluster them into 2 clusters; each
data point will belong to either cluster 1 or cluster 2.
Data Point   Cluster
A            C1
B            C2
C            C2
D            C1
 Soft Clustering: In this type of clustering, instead of assigning
each data point to a single cluster, a probability or likelihood of
that point belonging to each cluster is evaluated. For example, say
there are 4 data points and we have to cluster them into 2 clusters;
we then evaluate, for every data point, its probability of belonging
to each of the two clusters (a short sketch contrasting hard and soft
clustering follows the table).
Data Point   Probability of C1   Probability of C2
A            0.91                0.09
B            0.3                 0.7
C            0.17                0.83
D            1                   0
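
A short sketch contrasting the two kinds of assignment on the same toy
points is given below; KMeans is used for hard clustering and a Gaussian
mixture for soft clustering, and the data values are assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

points = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.1], [7.9, 8.3]])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
soft = GaussianMixture(n_components=2, random_state=0).fit(points).predict_proba(points)

print(hard)   # one cluster label per point, e.g. [0 0 1 1]
print(soft)   # one row of probabilities per point, each row summing to 1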
Uses of Clustering
Before we look at the types of clustering algorithms, we will go through
the main use cases of clustering. Clustering algorithms are mainly
used for:
 Market Segmentation – Businesses use clustering to group their
customers and use targeted advertisements to attract more
audience.

 Market Basket Analysis – Shop owners analyze their sales and
figure out which items are most often bought together by customers.
For example, according to a study in the USA, diapers and beer were
usually bought together by fathers.
 Social Network Analysis – Social media sites use your data to
understand your browsing behaviour and provide you with targeted
friend recommendations or content recommendations.
 Medical Imaging – Doctors use Clustering to find out diseased
areas in diagnostic images like X-rays.
 Anomaly Detection – Clustering can be used to find outliers in a
real-time data stream or to flag potentially fraudulent transactions.
 Simplify working with large datasets – Each cluster is given a
cluster ID after clustering is complete, so an entire feature set can
be reduced to its cluster ID. Clustering is effective when it can
represent a complicated case with a straightforward cluster ID, and
using the same principle it can make complex datasets simpler.
There are many more use cases for clustering, but these are some of the
major and most common ones. Moving forward, we will discuss the
clustering algorithms that help perform the above tasks.
Types of Clustering Algorithms
At the surface level, clustering helps in the analysis of unstructured
data. Graphing, the shortest distance, and the density of the data points
are a few of the elements that influence cluster formation. Clustering
is the process of determining how related the objects are based on a
metric called the similarity measure. Similarity metrics are easier to
locate in smaller sets of features. It gets harder to create similarity
measures as the number of features increases. Depending on the type
of clustering algorithm being utilized in data mining, several
techniques are employed to group the data from the datasets. In this
part, the clustering techniques are described. Various types of
clustering algorithms are:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)


4. Distribution-based Clustering

11.Evaluation measures (information retrieval)


[Refer PDF]
Knowledge-based systems (KBSes)
What are knowledge-based systems?
Knowledge-based systems (KBSes) are computer programs that use a
centralized repository of data known as a knowledge base to provide a
method for problem-solving. Knowledge-based systems are a form of
artificial intelligence (AI) designed to capture the knowledge of human
experts to support decision-making. An expert system is an example of
a knowledge-based system because it relies on human expertise.

KBSes can assist in decision-making, human learning and creating a
companywide knowledge-sharing platform, for example. KBS can be
used as a broad term, but these programs are generally distinguished by
the way they represent knowledge and use a reasoning system to derive
new knowledge.

A basic KBS works using a knowledge base and an inference engine.
The knowledge base is a repository of data that contains a collection of
information in a given field -- such as medical data. The inference
engine processes and locates data based on requests, similar to a search
engine. A reasoning system is used to draw conclusions from data
provided and make decisions based on if-then rules, logic programming
or constraint handling rules. Users interact with the system through a
user interface.
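
A very small sketch of the if-then reasoning described above is shown
below: a forward-chaining loop that keeps applying rules to the known
facts until nothing new can be derived. The facts and rules are toy
assumptions, not taken from any particular system.

facts = {"fever", "cough"}                         # assumed starting facts
rules = [
    ({"fever", "cough"}, "possible_flu"),          # if fever and cough then possible_flu
    ({"possible_flu"}, "recommend_rest"),          # if possible_flu then recommend_rest
]

changed = True
while changed:                                     # repeat until no rule adds a new fact
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(facts)   # {'fever', 'cough', 'possible_flu', 'recommend_rest'}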


What are knowledge-based systems used for?


Knowledge-based systems are commonly used to aid in solving
complex problems and to support human learning. KBSes have been
developed for numerous applications. For example, an early
knowledge-based system, Mycin, was created to help doctors diagnose
diseases. Healthcare has remained an important market for knowledge-
based systems, which are now referred to as clinical decision support
systems in the health sciences context.

Knowledge-based systems have also been used in applications as
diverse as avalanche path analysis, industrial equipment fault diagnosis
and cash management.

Knowledge-based systems and artificial intelligence


While a subset of artificial intelligence, classical knowledge-based
systems differ in their approach to some of the newer developments in
AI.

AI is organized in a top-down system that uses methods of statistical
pattern detection such as big data, deep learning and data mining, for
example. AI, in this sense, includes neural network systems which use
deep learning and focus on pattern recognition problems such as facial
recognition. By comparison, KBSes handle large amounts
of unstructured data while integrating knowledge based on that data on
a large scale.

Types of knowledge-based systems


Some example types of knowledge-based systems include the
following:

 Blackboard systems. These systems enable multiple sources to
input new information into a system to help create solutions to
potential problems. Blackboard systems rely heavily on updates
from human experts.
 Case-based systems. These systems use case-based reasoning to
create solutions to a problem. This system works by reviewing past
data of similar situations.
 Classification systems. These systems analyze different data to
understand its classification status.
 Eligibility analysis systems. These systems are used to determine a
user's eligibility for a specific service. A system asks a user guided
questions until it receives a disqualifying answer.
 Expert systems. These are a common type of KBS that simulate
human expert decision-making in a particular field. Expert systems
provide solutions for problems as well as the explanations behind
them. For example, they could be used for calculations and
predictions.
 Intelligent tutoring systems. These systems are designed to
support human learning and education. Intelligent tutoring
systems provide users with instructions and give feedback based on
performance or questions.
 Medical diagnosis systems. These systems help diagnose patients
by inputting data or having a patient answer a series of questions.
Based on the responses, the KBS identifies a diagnosis and makes

recommendations medical professionals can use to determine a
patient's treatment.
 Rule-based systems. These systems rely on human-specified rules
to analyze or change data to reach a desired outcome. For example,
rule-based systems might use if-then rules.
Advantages and challenges of knowledge-based systems
Knowledge-based systems offer the following benefits:

 Aid in expert decision-making, especially when a human expert isn't
available.
 Provide efficient documentation for users to access quickly.
 Create new knowledge by referring to and reviewing existing stored
data.
 Group data by analyzing and classifying different inputted data.
 Handle large amounts of structured and unstructured data.
 Improve decision-making processes, enabling users to work at
higher levels of expertise.

However, the following are some potential challenges that come with
these systems:

 Difficult to maintain, as some systems might require continual
updating, and organizational policies or procedures might change
over time and require corresponding updates.
 Potential anomalies such as circular dependencies or repetitive rules
might appear in some systems.
 Require a large amount of accurate data.
 Require training for new users to understand the system.
 Some data could be considered abstract, making it difficult for a
system to make decisions.
