A Survey of Numerous Text Similarity Approaches
Article Info
Publication Issue: Volume 9, Issue 1, January-February-2023
Page Number: 184-194
Article History: Accepted: 01 Feb 2023; Published: 11 Feb 2023

ABSTRACT
One of the most common NLP use cases is text similarity. Every domain comes with its own variety of use cases. The most common uses of text similarity include finding related articles/news/genres, efficient use of search engines, classification of related issues on any topic, etc. It serves as a framework for many text analytics use cases. Methods to solve text similarity use cases have been around for a while, but the main drawbacks of the older methods are loss of dependency information, difficulty remembering long conversations, exploding-gradient problems, etc. Recent advanced deep learning-based models pay attention to both contiguous and distant words, making their learning ability more rigorous. This paper focuses on various text similarity techniques that can be used to solve these everyday use cases.
Keywords: Natural Language Processing, Euclidean Distance, Cosine Similarity, Jaccard Distance, Word Embeddings, Language Models, Universal Sentence Encoders
Copyright: © the author(s), publisher and licensee Technoscience Academy. This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
• Length Distance (see the sketch below)
• Manhattan Distance
• Kullback-Leibler Divergence
• Jensen-Shannon Divergence
2. Text Representation
• Phrase Based
• LCS (Longest Common Subsequence)
• Bag-of-words model
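As a concrete illustration of the length-distance measures listed above, the following minimal sketch computes them over two toy bag-of-words distributions (the vectors and values are hypothetical, chosen only for demonstration):

```python
import numpy as np
from scipy.spatial.distance import cityblock, jensenshannon
from scipy.special import rel_entr

# Toy word distributions for two texts (hypothetical values).
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.3, 0.1])

print(cityblock(p, q))           # Manhattan distance
print(rel_entr(p, q).sum())      # Kullback-Leibler divergence KL(p || q)
print(jensenshannon(p, q) ** 2)  # Jensen-Shannon divergence (SciPy returns its square root)
```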
analysis). With the help of GloVe's global matrix factorization method, a matrix that records whether or not a document contains a given word is produced. Using statistics to determine the relationship between words is the core premise of the GloVe word embedding. In contrast to the occurrence matrix, the co-occurrence matrix captures how frequently a particular word pair appears together; each of its values represents a word pair that commonly appears together. The GloVe model learns word vectors such that the dot product of two word vectors equals the logarithm of the probability of those words occurring together, and the model can be written down as follows (J. Pennington, R. Socher, and C. Manning):
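In the notation of Pennington et al. (2014), that objective is the weighted least-squares loss

\[
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
\]

where \(V\) is the vocabulary size, \(X_{ij}\) counts how often words \(i\) and \(j\) co-occur, \(w_i\) and \(\tilde{w}_j\) are the word and context vectors, \(b_i\) and \(\tilde{b}_j\) are bias terms, and \(f\) is a weighting function that caps the influence of very frequent pairs.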
The one-hot vectors of the context words are the first input the model gets. The shared input weight matrix W is then multiplied by each one-hot vector, and the resulting word vectors are averaged to form the hidden-layer vector. The final probability distribution is produced by multiplying the hidden-layer vector by the output weight matrix and then applying the softmax function to it.
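A minimal NumPy sketch of this forward pass follows; the vocabulary size, embedding dimension, and random weights are illustrative assumptions rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                      # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, d))    # shared input weight matrix W
W_out = rng.normal(size=(d, V))   # output weight matrix

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def cbow_forward(context_ids):
    # Multiply the input matrix by each one-hot vector (a row lookup) ...
    context_vecs = np.array([one_hot(i, V) @ W_in for i in context_ids])
    h = context_vecs.mean(axis=0)        # ... average into the hidden-layer vector ...
    scores = h @ W_out                   # ... project to vocabulary scores ...
    e = np.exp(scores - scores.max())
    return e / e.sum()                   # ... and apply softmax.

probs = cbow_forward([1, 3, 5])          # predict the centre word from its context
print(probs.argmax(), probs.sum())
```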
• Matrix Factorization Method

Matrix factorization is a complex mathematical technique that discovers latent features describing the interactions among entities, texts, and so on. There are several ways to compute text similarity using matrix factorization, including the following (an LSA example is sketched after the list).

• LSA
• Word2Vec
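As a sketch of the first method in this list, the classic LSA recipe (TF-IDF weighting followed by truncated SVD) can be applied to document similarity; the documents and component count below are toy choices, assuming scikit-learn is available:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock markets fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(docs)          # term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Cosine similarity in the latent space reflects topical relatedness:
# the two cat sentences score high with each other, low with the third.
print(cosine_similarity(lsa))
```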
search for 5 distinct themes in our corpus). While Dirichlet is a sort of probability distribution, "latent" is another word for hidden (i.e., properties that cannot be observed directly). LDA views each text document as a mixture of topics and each topic as a mixture of words. Each term and each topic is considered one by one: every word is initially assigned a topic at random, and the frequency with which it appears in that topic, and with which other terms, is then evaluated. As a result, LDA is a particularly popular tool for identifying text similarities, since it groups related words, documents, or phrases (Reference: Multilingual topic modelling for tracking COVID-19).
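A minimal sketch of this idea, assuming scikit-learn; the documents and topic count are toy choices standing in for the themes mentioned above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "vaccines reduce the spread of the virus",
    "the virus spread despite lockdown measures",
    "the team won the football championship",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixtures

# Documents with similar topic mixtures are treated as similar texts.
print(cosine_similarity(doc_topics))
```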
• Graph Neural Network

To model data that has many tiers of entities and connections, a graph neural network (GNN) can be used to specify the hierarchical relationships in the data. The GNN, a connectionist model, captures a graph's dependencies through message passing between its nodes. In contrast to a standard neural network, a graph neural network keeps a state that can represent information from its neighbourhood at any depth. Relationships between words can be automatically discovered, generated, and then reconstructed during processing using a special kind of GNN called a WRGNN.
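The neighbourhood-state idea can be illustrated with one round of message passing on a toy word graph; this is a generic mean-aggregation scheme for illustration, not the WRGNN variant itself:

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)            # adjacency: word 0 links to words 1 and 2
H = np.eye(3)                                     # initial node states (one-hot)
W = np.random.default_rng(0).normal(size=(3, 3))  # toy learnable weights

deg = A.sum(axis=1, keepdims=True)
messages = (A @ H) / deg          # each node averages its neighbours' states
H_next = np.tanh(messages @ W)    # transform to obtain the updated states
print(H_next)
```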
• Knowledge graph
A graph kernel compares graph substructures to determine how similar two graphs are to one another. It allows structural information in text to be taken into account when determining document similarity. A graph kernel's ability to calculate similarity depends on the data represented in the graphs. A weighted co-occurrence graph serves as the first representation of the text document. Then, using a similarity matrix based on word similarities, it is turned into an enriched document network by automatically constructing related nodes and edges (or relationships). Matching terms and patterns contribute to document similarity according to their relevance, since a supervised term-weighting method is employed to weight the terms and their relationships. The similarity measure can therefore go beyond exact term and association matching thanks to graph enrichment. We utilise the data in the enriched weighted graphs to calculate the similarity between texts using an edge walk graph kernel. The kernel function takes two weighted co-occurrence graphs as input and outputs a similarity score based on how closely the relevant text in the two documents matches.
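A deliberately simplified kernel in this spirit is sketched below: it builds weighted co-occurrence graphs from adjacent word pairs and scores the overlap of their edges. The edge walk kernel described above is more elaborate; this is only a toy stand-in:

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

def cooccurrence_graph(text):
    # Weighted co-occurrence graph: edges are adjacent word pairs.
    return Counter(frozenset(p) for p in pairwise(text.split()) if len(set(p)) == 2)

def edge_kernel(g1, g2):
    # Sum products of weights on shared edges, normalised so that
    # comparing a graph with itself yields 1.0.
    shared = sum(w * g2[e] for e, w in g1.items() if e in g2)
    norm = (sum(w * w for w in g1.values()) * sum(w * w for w in g2.values())) ** 0.5
    return shared / norm if norm else 0.0

g1 = cooccurrence_graph("the cat sat on the mat")
g2 = cooccurrence_graph("the cat lay on the mat")
print(edge_kernel(g1, g2))  # 0.6: most edges are shared
```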
• Semantic Text Matching

In NLP, semantic matching techniques compare two utterances to see whether their meanings are similar. Semantic text matching is the process of comparing the semantic similarity of source and target text fragments. Building on LSA, deep learning extracts the hierarchical semantic structure from the query and the content; in this case, a new representation is produced by encoding the text to extract features. The four primary approaches used for single semantic text matching are architecture-I (for matching two phrases), architecture-II (of the convolutional matching model), the convolutional latent semantic model (CDSSM), and the deep structured semantic model (DSSM). However, significant local information is lost when complicated texts are compressed into a single vector based on a single meaning. In this part, we introduce MatchPyramid, a novel deep architecture for text matching: by seeing the matching matrix as an image and treating text matching as image recognition, the fundamental concept is revealed.
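The matching matrix itself is easy to picture; the toy sketch below uses exact-match indicators, whereas the paper also uses embedding-based similarities between word pairs:

```python
import numpy as np

a = "the cat sat on the mat".split()
b = "a cat is on a mat".split()

# Entry (i, j) says whether word i of text A matches word j of text B.
M = np.array([[1.0 if wa == wb else 0.0 for wb in b] for wa in a])
print(M)  # this "image" is then fed through convolution and pooling layers
```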
Fig: MatchPyramid on Text Matching. (Reference: Text Matching as Image Recognition (aaai.org))
IV. CONCLUSION

Generally, traditional embedding methods like Word2Vec and Doc2Vec yield good results when the task only requires the broad sense of the text. This is reflected in tasks such as semantic text similarity and paraphrase identification, where they can perform better than state-of-the-art deep learning techniques. On the other hand, when the task calls for something more specific than the general meaning, such as sentiment analysis or sequence labelling, more complex contextual techniques perform better. Therefore, wherever possible, start with a short and simple method before moving on to one that requires more effort as needed.