
International Journal of Scientific Research in Computer Science, Engineering and Information Technology

ISSN : 2456-3307 (www.ijsrcseit.com)
doi : https://doi.org/10.32628/CSEIT2390133

A Survey of Numerous Text Similarity Approach

Joyinee Dasgupta, Priyanka Kumari Mishra, Selvakuberan Karuppasamy, Arpana Dipak Mahajan and Subhashini Lakshminarayanan
Advanced Technology Centers, India

ABSTRACT

One of the most common NLP use cases is text similarity. Every domain comes with a variety of use cases. The most common uses of text similarity include finding related articles/news/genres, efficient use of search engines, classification of related issues on any topic, etc. It serves as a framework for many text analytics use cases. Methods to solve text similarity use cases have been around for a while, but the main drawbacks of the old methods are loss of dependency information, difficulty remembering long conversations, exploding gradient problems, etc. Recent advanced deep learning-based models pay attention to both contiguous and distant words, making their learning ability more rigorous. This paper focuses on various text similarity techniques that can be used in everyday life to solve these use cases.

Keywords : Natural Language Processing, Euclidean Distance, Cosine Similarity, Jaccard Distance, Word Embeddings, Language Models, Universal Sentence Encoders

Article Info
Publication Issue : Volume 9, Issue 1, January-February-2023
Page Number : 184-194
Article History : Accepted: 01 Feb 2023, Published: 11 Feb 2023

I. INTRODUCTION

Natural Language Processing is one of the fields that has advanced the most recently and has assisted us in resolving a wide range of issues in many sectors of society. It consists of text categorization, text similarity, text production, text summarization, machine translation, chatbots, information retrieval systems, and other related technologies. The majority of industries are currently dealing with the issue of text similarity. Data preparation, which involves cleaning the data by removing noise and undesirable characters, is a necessary step in text similarity or any other natural language processing task. Text similarity leverages mathematical concepts like vectors, trigonometry, and linear algebra to calculate the similarity between two sentences. Similarity measures are used in a wide range of applications, including automated document linking, information retrieval systems, paraphrase detection, search engines, text classification, etc.

II. LITERATURE SURVEY

This study examines how semantic similarity techniques have changed throughout the years, separating them according to their underlying methodologies. The measurement of semantic equivalence between pieces of text is known as Semantic Textual Similarity (STS) [1].

Instead of a simple yes or no answer, semantic similarity algorithms typically provide a ranking or percentage of similarity across texts. Semantic relatedness and semantic similarity are sometimes used interchangeably [2]. The goal of this review is to present a thorough analysis of the different semantic similarity methodologies, including the most recent developments based on deep neural techniques. Text classification approaches have been widely employed to speed up multimedia data processing in numerous multimedia applications, such as video/image tagging and multimedia recommendation [3]–[7].

In [8], the Jaccard and Cosine similarity metrics for measuring text similarity are compared: text similarity was measured with the Jaccard similarity index, and for the Cosine similarity algorithm the text was converted into a vector space model using the "Word2Vec" method before the distance between vectors was calculated. Kaundal et al. reviewed two methods for calculating short text semantic similarity (STSS): a vector space model and a knowledge-based model that used WordNet [7]. In [9], three approaches are used to organise the existing studies on text similarity: string-based, knowledge-based, and corpus-based similarity. Each method is based on a distinct perspective, and they all measure how closely two short texts are related. A definitive perspective on this subject is also provided by the introduction of datasets, which are frequently used as benchmarks for assessing approaches in this area. The utilisation of methods that incorporate several viewpoints yields better outcomes. In [10], a small distance between feature vectors suggests a high level of similarity, whereas a big distance suggests a low level of similarity. A few of the distance metrics used in calculating document similarity are Euclidean distance, Cosine distance, and the Jaccard coefficient.

III. RESULTS AND DISCUSSION

A. Perception of Similarity

Similarity is the measurement of how alike or how different a set of things is. In other words, it is a metric that helps decide how much two objects resemble or differ from each other. If the distance between two objects is large, it shows their dissimilarity, and vice versa. This metric generally lies between 0 and 1, 0 being highly dissimilar and 1 being highly similar. Similarity is a highly domain-dependent metric. Say we want to compare two laptops with the same colour, screen size, positioning of the keyboard, webcam, type of charging point and socket used, etc. We might call them similar if the features of comparison are almost the same; but if the features that matter for the use case, such as the OS used, the memory, the presence of a graphics card, or the speed, are different, we could call the same laptops dissimilar. So the perception of similarity or dissimilarity completely depends on the use case and the business objective, and we need to be very clear about it when measuring similarity.

Although the concept may seem straightforward, similarity is the cornerstone of many machine learning approaches. For instance, similarity is used by the K-Nearest-Neighbours classifier to categorise new entities, and by K-means clustering to allocate data points to suitable groups. Even recommendation algorithms use neighbourhood-based collaborative filtering techniques that identify a user's neighbours based on similarity.

Example: let's take two very simple sentences.

1. I am hungry.
2. I want to eat something.

As humans, when we read through the sentences we can understand that these two sentences are similar, but for a machine it's very difficult to
understand the context, even though both sentences convey the same meaning. Let's get into the various similarity techniques used to measure it.

B. Measures

1. Text Distance

Text distance describes the semantic similarity between two pieces of text in terms of a distance. Length, distribution, and semantic distance are the three ways to measure it.

• Length Distance

Length distance measures text similarity by computing the distance between the vector representations of the texts, using the texts' numerical properties.

• Euclidean Distance

The Pythagorean rule is used to determine the separation between two points: the similarity score decreases as the distance between two vectors increases, and vice versa. The output of the Euclidean distance lies between 0 and infinity, which makes it very difficult to state whether a pair is similar or dissimilar, so normalisation is used to convert the score to a value between 0 and 1 that measures the distance between a pair of sentences. Using Python to calculate the Euclidean distance:
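The original code listing survives only as an image, so the following is a minimal sketch assuming scikit-learn count vectors; with a different vectorisation (e.g., the word embeddings that the 4.32 in the paper's screenshot suggests) the exact value will differ:

```python
# Minimal sketch: Euclidean distance between count vectors of two sentences,
# followed by the 1/e**distance normalisation described above.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import euclidean
import numpy as np

sentences = ["I am hungry.", "I want to eat something."]
vectors = CountVectorizer().fit_transform(sentences).toarray()

distance = euclidean(vectors[0], vectors[1])
normalised = np.exp(-distance)  # maps [0, inf) distances into (0, 1]
print(distance, normalised)
```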
When a raw Euclidean distance such as the 4.32 in the paper's screenshot is first calculated, it is difficult to state whether the pair of sentences is similar or dissimilar, so we normalise it using Euler's constant (1/e raised to the power of the distance).

• Cosine Similarity

Cosine similarity, the cosine of the angle between two vectors, is used to calculate how similar two vectors are to one another; it identifies when two vectors point in nearly the same direction. The cosine similarity is 1 if the angle between the vectors is 0 degrees. Using Python to calculate the cosine similarity:
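The original listing is again an image; this sketch assumes spaCy's en_core_web_md vectors (doc.similarity is the cosine of the averaged word vectors), since plain count or TF-IDF vectors would give these two sentences a score of 0:

```python
# Minimal sketch: cosine similarity over averaged word vectors with spaCy.
# Assumes the model has been fetched: python -m spacy download en_core_web_md
import spacy

nlp = spacy.load("en_core_web_md")
doc_a = nlp("I am hungry.")
doc_b = nlp("I want to eat something.")

print(doc_a.similarity(doc_b))  # cosine of the two averaged sentence vectors
```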
calculated, it’s difficult to state whether the pair of and y-coordinates, to put it simply.
sentences is similar or dissimilar. So, we normalise it
using the Euler’s constant (1/e to the power distance).

Here the number 6 signifies the distance between the two sentences; it is based on the strings alone and fails to capture context.

• Hamming Distance

The distance between two strings of equal length is determined by the number of positions at which the corresponding symbols differ. In other words, it determines the minimal number of substitutions needed to convert one string into the other, i.e., the minimal number of errors that could have turned one string into the other.
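A minimal pure-Python sketch of this definition:

```python
# Minimal sketch: Hamming distance = number of positions at which two
# equal-length strings disagree.
def hamming_distance(a: str, b: str) -> int:
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming_distance("karolin", "kathrin"))  # 3 substitutions needed
```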
• Distribution Distance

There are two major downsides to using length distance to assess similarity. First, when a query Q is used to obtain a response A, the associated similarity is not symmetrical, yet length distances are only adequate for symmetrical problems where Sim(A, B) = Sim(B, A). Secondly, there is a danger in utilizing distance and length to gauge similarity without understanding the statistical properties of the data. When establishing whether two articles come from the same distribution, the distribution distance is employed to measure how similar the two papers are. We briefly discuss several distribution distance approaches.

• Kullback-Leibler Divergence

For a given random variable or set of occurrences, KL divergence is a metric for assessing the relative difference between two probability distributions. Relative entropy is another name for KL divergence.

• Jensen-Shannon Divergence

The degree to which the label distributions of several facets diverge entropically from one another is gauged by the Jensen-Shannon divergence (JS). It is symmetric and is based on the Kullback-Leibler divergence. The smaller the JS distance, the more similar the distributions of the two documents. The Jensen-Shannon divergence is calculated as:

JS = ½ × [KL(P_a || P) + KL(P_d || P)], where P = ½ (P_a + P_d)
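A sketch assuming SciPy and two toy word-probability distributions over a shared vocabulary (note that SciPy's jensenshannon returns the square root of the JS divergence):

```python
# Minimal sketch: KL and Jensen-Shannon divergence between two
# toy unigram distributions P_a and P_d over the same vocabulary.
import numpy as np
from scipy.special import rel_entr                 # elementwise p*log(p/q)
from scipy.spatial.distance import jensenshannon

p_a = np.array([0.5, 0.3, 0.2])  # word probabilities in document a
p_d = np.array([0.4, 0.2, 0.4])  # word probabilities in document d

kl = rel_entr(p_a, p_d).sum()        # KL(P_a || P_d): asymmetric
js = jensenshannon(p_a, p_d) ** 2    # symmetric JS divergence
print(kl, js)
```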
• Semantic Distance

Semantic distance is a distance metric defined over a set of texts or phrases in which the idea of distance between objects is based on the similarity of their meaning or semantic content rather than lexicographical similarity. These are mathematical instruments used to gauge the strength of the semantic connection between language units, concepts, or instances through a numerical description derived from comparing the data evidencing their meaning or outlining their characteristics.

• Word Mover's Distance

WMD makes use of the outcomes of cutting-edge embedding methods like GloVe and Word2vec, which produce word embeddings of extraordinary quality and scale organically to extremely large data sets; these embedding methods show how word vector operations often preserve semantic links. WMD treats text documents as the weighted point clouds of embedded words that their word-vector embeddings form. The two documents are separated by the smallest total distance that the words of text document A must travel to precisely match the point cloud of text document B.
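Gensim exposes WMD directly on word vectors; a sketch assuming its downloader and the glove-wiki-gigaword-50 vectors (the underlying optimal-transport solver requires the POT package):

```python
# Minimal sketch: Word Mover's Distance over pretrained GloVe vectors.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # downloads ~66 MB once

doc_a = "i am hungry".split()
doc_b = "i want to eat something".split()

print(vectors.wmdistance(doc_a, doc_b))  # smaller distance = more similar
```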

2. Text Representation

Texts may be similar to one another both lexically and thematically. Words are lexically comparable if their character patterns are similar. Two words are semantically similar if they have the same meaning, oppose one another, are employed in the same way, are used in the same context, or are examples of one another. Lexical similarity is evaluated with a variety of text representation metrics, whilst semantic similarity is introduced with the help of the string-based methodology, the corpus-based method, semantic text matching, and the graph-structure-based approach.

• String Based

String similarity metrics are used to determine how similar or dissimilar (distant) two text strings are when they are compared or approximately matched. The most common string similarity metrics, as used in the symmetric package, are represented in this survey.

• Phrase Based

The phrase-based method's fundamental building block is the phrase, and its primary approaches include the Dice coefficient, Jaccard, and others.

• Jaccard Index

The Jaccard index, commonly known as the Jaccard similarity coefficient, treats data elements as sets. It is calculated by dividing the size of the intersection of the two sets by the size of their union. Consider the same illustration:

Sentence 1: I am hungry.
Sentence 2: I want to eat something.

We would first do text normalisation to reduce words to their roots and lemmas before computing the Jaccard similarity. In the case of our example sentences, there are no words to normalise away, thus we can proceed. A Python function for Jaccard similarity:
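A minimal sketch treating each sentence as a set of lower-cased tokens:

```python
# Minimal sketch: Jaccard similarity = |intersection| / |union| of token sets.
def jaccard_similarity(a: str, b: str) -> float:
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

# One shared token ("i") out of seven distinct tokens -> 1/7 ≈ 0.143
print(jaccard_similarity("I am hungry", "I want to eat something"))
```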
• Dice

Dice's coefficient is defined as twice the number of terms shared by both strings, divided by the total number of terms in both strings.
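The same token-set sketch adapted to Dice's definition:

```python
# Minimal sketch: Dice coefficient = 2 * shared terms / total terms.
def dice_coefficient(a: str, b: str) -> float:
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

print(dice_coefficient("I am hungry", "I want to eat something"))  # 2/8 = 0.25
```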
• Character Based

Character-based similarity calculations express the similarity between two texts based on the similarity of the characters within them. LCS (Longest Common Substring), Jaro similarity, edit distance, and other related techniques are introduced below.

• LCS

A typical illustration of a character-based similarity measure is LCS. The Longest Common Substring (LCS) algorithm considers the longest continuous chain of characters that can be found in both strings.

• Edit Distance

Levenshtein distance, also known as edit distance, is a metric used to compare the similarity of two strings, known as the source and target strings. The number of editing operations (deletions, insertions, or substitutions) required to transform the source string into the target string is the distance between them; the similarity between two strings increases as the distance between them decreases.
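A minimal dynamic-programming sketch of the edit distance:

```python
# Minimal sketch: Levenshtein (edit) distance with a rolling DP row.
def levenshtein(source: str, target: str) -> int:
    prev = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3 edits
```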
• Jaro Similarity

Jaro similarity is a measure of resemblance between two strings. The Jaro distance has a value between 0 and 1, where 0 indicates that there is no resemblance between the two strings and 1 indicates that they are equal.

• Corpus Based

Corpus-based similarity analyses information from a huge corpus to determine the semantic similarity between terms. The corpus-based method calculates text similarity using data from the corpus; this data may either carry a linguistic characteristic or a likelihood of co-occurrence.

• Bag-of-Words Model

The fundamental tenet of the bag-of-words method is to analyse a document as a collection of words without considering their order of use. The most often used bag-of-words techniques are LSA (latent semantic analysis), TF-IDF (term frequency-inverse document frequency), and BOW (bag of words).

• TF-IDF

The TF-IDF vectorizer transforms each text into its vector representation, so that each text can be treated as a collection of points in a multidimensional space.
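A sketch of this pipeline with scikit-learn, comparing the resulting points by cosine similarity:

```python
# Minimal sketch: TF-IDF vectors for a small corpus, compared pairwise.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I am hungry",
    "I want to eat something",
    "The restaurant serves something to eat",
]
tfidf = TfidfVectorizer().fit_transform(corpus)

print(cosine_similarity(tfidf))  # 3x3 matrix of document similarities
```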

• Shallow Window-Based Methods

The shallow window-based techniques differ from the bag-of-words model in significant ways, one of which is the semantic distance between words, which isn't considered in the bag-of-words model. Low-dimensional real-valued vectors, on the other hand, can be learned from unlabelled, unstructured text by creating word vectors with shallow window-based techniques, which spatially cluster similar words together.

• GloVe

GloVe is a global vector method that blends count-based techniques (such as PCA, principal component analysis) with direct prediction techniques like word2vec.
With the help of GloVe's global matrix factorization method, a matrix is produced that shows whether or not a document contains particular words. The GloVe model fits the word vectors such that the dot product of two words' vectors equals the logarithm of the likelihood of those words occurring together. The GloVe model can be written down as follows (J. Pennington, R. Socher, and C. Manning):
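The formula image is missing from this copy; from the cited Pennington, Socher, and Manning paper, the log-bilinear model and its weighted least-squares objective are:

```latex
w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}, \qquad
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

where X_ij counts how often word j occurs in the context of word i, w and w̃ are the word and context vectors, b and b̃ are biases, and f is a weighting function that caps the influence of very frequent pairs.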
Using statistics to determine the relationship between words is the core premise of the GloVe word embedding. In contrast to an occurrence matrix, the co-occurrence matrix lets you know how frequently a particular word pair appears together: each value in the co-occurrence matrix represents a word pair that commonly appears together.

• BERT

BERT's main technological advancement is applying Transformer, the popular attention model, with bidirectional training to language modelling. BERT uses the Transformer attention mechanism to determine the contextual links between words (or subwords) in a document. The Transformer's basic configuration consists of two separate mechanisms: an encoder that reads the text input and a decoder that produces the task prediction.
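A sketch of BERT-based sentence similarity, assuming the sentence-transformers package and its all-MiniLM-L6-v2 checkpoint:

```python
# Minimal sketch: semantic similarity with a BERT-family sentence encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["I am hungry.", "I want to eat something."])

# High score despite zero word overlap, since the encoder captures context.
print(util.cos_sim(embeddings[0], embeddings[1]))
```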

• Word2Vec

The Word2Vec model converts words into vectors. You can then use the cosine similarity formula to calculate a similarity value from the word-vector data that the Word2Vec model has provided. There are two architectures for Word2Vec: the continuous bag-of-words model (CBOW) and the skip-gram model. As an illustration, consider the CBOW model, which consists of the input, projection (mapping), and output layers and predicts the middle word from its surrounding context. The one-hot vector of each context word is the first input the model gets. The shared input weight matrix W is then multiplied by each one-hot vector, and the hidden-layer vector is updated to be the average of the resulting word vectors. The final probability distribution is produced by multiplying the hidden-layer vector by the output weight matrix W' and then applying the SoftMax function to it.
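A sketch of training a tiny CBOW model with gensim (sg=0 selects CBOW; a real model needs a far larger corpus or pretrained vectors):

```python
# Minimal sketch: train a toy CBOW Word2Vec model and compare two words.
from gensim.models import Word2Vec

corpus = [
    ["i", "am", "hungry"],
    ["i", "want", "to", "eat", "something"],
    ["hungry", "people", "want", "to", "eat"],
]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv.similarity("hungry", "eat"))  # cosine of the two word vectors
```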
• Matrix Factorization Method

Matrix factorization is a mathematical technique that discovers latent features describing the interactions among entities such as texts. There are several different methods to calculate text similarity using matrix factorization.
• LSA

Latent semantic analysis is a method for constructing a vector representation of a document. When documents are represented as vectors, they can be compared for similarity by figuring out how far apart the vectors are from one another.
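A sketch of LSA as truncated SVD over TF-IDF vectors, with document similarity computed in the latent space:

```python
# Minimal sketch: LSA = truncated SVD on TF-IDF document vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I am hungry",
    "I want to eat something",
    "The stock market fell sharply today",
]
tfidf = TfidfVectorizer().fit_transform(corpus)
latent = TruncatedSVD(n_components=2).fit_transform(tfidf)

print(cosine_similarity(latent))  # similarities in the latent semantic space
```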
• LDA

The unsupervised technique Latent Dirichlet Allocation (LDA) gives each document a score for each specified topic (let's suppose we choose to search for 5 distinct themes in our corpus). While Dirichlet is a sort of probability distribution, latent is another synonym for hidden (i.e., properties that cannot be assessed directly). LDA views each text document as a mixture of topics, and each topic as a mixture of words. Each term and each topic is covered one by one: each word is assigned a topic at random, and the frequency with which it appears in that topic, and together with which other terms, is evaluated. As a result, it is a particularly popular tool for identifying text similarities, since it groups related words, documents, or phrases.
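A sketch with scikit-learn's LatentDirichletAllocation, using the per-document topic mixtures as the representation to compare (two topics here, for a toy corpus):

```python
# Minimal sketch: LDA topic mixtures as document representations.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "I am hungry and want to eat something",
    "The restaurant serves good food to eat",
    "The stock market fell sharply today",
]
counts = CountVectorizer(stop_words="english").fit_transform(corpus)
topics = LatentDirichletAllocation(n_components=2,
                                   random_state=0).fit_transform(counts)

print(cosine_similarity(topics))  # each row of `topics` is a topic mixture
```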
(Fig: LDA topic modelling; reference - Multilingual_topic_modelling_for_tracking_COVID-19.pdf)

• Knowledge Graph

A graph kernel compares graph substructures to determine how similar two graphs are to one another. It allows structural information in text to be considered when determining document similarity; a graph kernel's ability to calculate similarity depends on the data represented in the graphs. A weighted co-occurrence graph serves as the first representation of the text document. Then, using a similarity matrix based on word similarities, it is turned into an enriched document network by automatically constructing related nodes and edges (relationships). Matching terms and patterns contribute to document similarity based on their relevance, since a supervised term-weighting method is employed to weight the terms and their relationships. Thanks to graph enrichment, the similarity measure can go beyond exact term and association matching. We utilise the data in the enriched weighted graphs to calculate the similarity between texts using an edge-walk graph kernel: the kernel function takes two weighted co-occurrence graphs as input and outputs a similarity score based on how closely the relevant text in the two documents matches.

• Graph Neural Network

To model data with many tiers of information and connections, a graph neural network (GNN) must be utilised to specify the hierarchical relationships in the data. The graph neural network, a connectionist model, captures a graph's dependencies via message passing between its nodes. In contrast to a normal neural network, a graph neural network keeps a state that can represent information from its neighbourhood at any depth. Relationships between words can be automatically discovered, generated, and then reconstructed during processing using a special kind of GNN called a WRGNN.

• Semantic Text Matching

In NLP, semantic matching techniques compare two utterances to see whether their meanings are similar. Semantic text matching is the process of comparing the semantic similarity of source and destination text fragments. Based on LSA, deep learning extracts the hierarchical semantic structure from the query and content; in this case, a new expression is produced as a result of the text being encoded to extract features.

The four primary approaches used for single semantic text matching are the deep structured semantic model (DSSM), the convolutional latent semantic model (CDSSM), architecture-I (for matching two phrases), and architecture-II (of the convolutional matching model).

• DSSM

Deep structured semantic models (DSSM) were first applied in the search industry. A semantic similarity model is trained using click-exposure records from a large number of search results: Deep Neural Networks (DNN) represent the query and title as low-dimensional semantic vectors, and the cosine distance between the two vectors gauges their relationship. The context loss of DSSM is partially mitigated by switching from the DNN to a CNN (Convolutional Neural Network). When deep learning first emerged, CNNs and long short-term memory (LSTM) networks were proposed, and these structures were also used to develop DSSM variants; the feature extraction layer is replaced by a CNN or LSTM, giving a completely different network structure.

• ARC-I

The DSSM model's shortcomings in capturing question and document sequences and context information are addressed by using a CNN module to produce the proposed ARC-I and ARC-II. The ARC-I model is an interactive learning paradigm based on representational learning. The key distinction between the two models and the original DSSM model is the inclusion of convolution and pooling layers to extract sentence-level features from the words. ARC-I creates a number of combinatorial associations between nearby feature maps using convolution layers with different terms, and the most significant portions of these connections are retrieved by max-pooling layers; the text representation is then delivered to DSSM.

• Multi-semantic Document Matching

Significant local information is lost when complicated texts are compressed into a single vector based on a single meaning: according to the deep learning model of document expression based on multiple semantics, a single-granularity vector is not fine enough to characterise a piece of text. Multi-semantic expression and extensive interactive work before matching let us examine local text similarities and synthesise the degree of text matching. The two main multi-semantic techniques are MatchPyramid and Multi-View Bi-LSTM (MV-LSTM).

• MV-LSTM

Three components make up MV-LSTM. First, each positional sentence representation is an individual sentence representation created using a bidirectional long short-term memory (Bi-LSTM) network. Second, several similarity functions combine to generate a similarity matrix or tensor via interactions between the various positional sentence representations. Finally, a multilayer perceptron and k-Max pooling are used to aggregate these interactions into the final matching score.

(Fig: Illustration of MV-LSTM; reference - A Deep Architecture for Semantic Matching with Multiple Positional Sentence Representations (aaai.org))

• MatchPyramid

In this part, we introduce MatchPyramid, a novel deep architecture for text matching. By seeing the matching matrix as an image and treating text
matching as image recognition, the fundamental concept is revealed.

(Fig: MatchPyramid on Text Matching; reference - Text Matching as Image Recognition (aaai.org))

IV. CONCLUSION

Generally, traditional embedding methods like Word2Vec and Doc2Vec yield good results when the task only requires the broad sense of the text; this is reflected in tasks like semantic text similarity or paraphrase identification, where they perform better than state-of-the-art deep learning techniques. On the other hand, when the task necessitates something more specific than just the general meaning, such as sentiment analysis or sequence labelling, more complex contextual techniques perform better. Therefore, wherever possible, start with a short and simple process before moving on to one that requires more effort as needed.

V. REFERENCES

[1]. P. Bambroo and A. Awasthi, "LegalDB: Long DistilBERT for Legal Document Classification".
[2]. D. Chandrasekaran and V. Mago, "Evolution of Semantic Similarity — A Survey," vol. 54, no. 2, 2021.
[3]. X. Deng, Y. Li, J. Weng, and J. Zhang, "Feature selection for text classification: A review," Multimed. Tools Appl., vol. 78, no. 3, pp. 3797–3816, 2019, doi: 10.1007/s11042-018-6083-5.
[4]. Z. Huang et al., Context-aware legal citation recommendation using deep learning, vol. 1, no. 1. Association for Computing Machinery, 2021, doi: 10.1145/3462757.3466066.
[5]. S. Yang, G. Huang, B. Ofoghi, and J. Yearwood, "Short text similarity measurement using context-aware weighted biterms," Concurr. Comput. Pract. Exp., vol. 34, no. 8, pp. 1–11, 2022, doi: 10.1002/cpe.5765.
[6]. D. W. Prakoso, A. Abdi, and C. Amrit, "Short text similarity measurement methods: a review," Soft Comput., vol. 25, no. 6, pp. 4699–4723, 2021, doi: 10.1007/s00500-020-05479-2.
[7]. A. Kaundal, "A Review on WordNet and Vector Space Analysis for Short-text Semantic Similarity," Int. J. Innov. Eng. Technol., vol. 8, no. 1, pp. 135–142, 2017, doi: 10.21172/ijiet.81.018.
[8]. A. W. Qurashi, V. Holmes, and A. P. Johnson, "Document Processing: Methods for Semantic Text Similarity Analysis," INISTA 2020 - 2020 Int. Conf. Innov. Intell. Syst. Appl. Proc., pp. 0–5, 2020, doi: 10.1109/INISTA49547.2020.9194665.
[9]. T. Nora Raju, P. A. Rahana, R. Moncy, S. Ajay, and S. K. Nambiar, "Sentence Similarity - A State of Art Approaches," Proc. Int. Conf. Comput. Commun. Secur. Intell. Syst. IC3SIS 2022, pp. 0–5, 2022, doi: 10.1109/IC3SIS54991.2022.9885721.
[10]. R. Singh and S. Singh, "Text Similarity Measures in News Articles by Vector Space Model Using NLP," J. Inst. Eng. Ser. B, vol. 102, no. 2, pp. 329–338, 2021, doi: 10.1007/s40031-020-00501-5.

Cite this article as :

Joyinee Dasgupta, Priyanka Kumari Mishra, Selvakuberan Karuppasamy, Arpana Dipak Mahajan, "A Survey of Numerous Text Similarity Approach", International Journal of Scientific Research in Computer Science, Engineering and Information Technology (IJSRCSEIT), ISSN : 2456-3307, Volume 9, Issue 1, pp. 184-194, January-February 2023. Available at doi : https://doi.org/10.32628/CSEIT2390133
Journal URL : https://ijsrcseit.com/CSEIT2390133
