A Cognitive Study On Semantic Similarity Analysis

Abstract—Semantic similarity analysis and modeling is a fundamental task in many pioneering applications of natural language processing today. Owing to their strength in sequential pattern recognition, neural networks such as RNNs and LSTMs have achieved satisfactory results in semantic similarity modeling. However, these solutions are considered inefficient due to their inability to process information in a non-sequential manner, which leads to improper extraction of context. Transformers function as the state-of-the-art architecture due to advantages such as non-sequential data processing and self-attention. In this paper, we perform semantic similarity analysis and modeling on the U.S. Patent Phrase to Phrase Matching dataset using both traditional and transformer-based techniques. We experiment with four variants of the Decoding-enhanced BERT (DeBERTa) and improve their performance through K-Fold cross-validation. The experimental results demonstrate our methodology's enhanced performance compared to traditional techniques, with an average Pearson correlation score of 0.79.

Index Terms—Semantic Similarity, K-Fold Cross Validation, Pearson Correlation, Transformers

I. INTRODUCTION

Semantic similarity is defined as the association between two blocks of text, including sentences, words, and documents. It plays a central role in most NLP tasks performed by researchers worldwide today. The dynamic and versatile nature of human language makes it difficult to standardize the process of semantic similarity [1], and the exponential rise in textual data generation makes finding new semantic analysis techniques essential. The conceptual overview of semantic similarity analysis is depicted in Fig. 1.

Fig. 1: Conceptual Overview of Semantic Similarity Analysis (blocks of text → preprocessing → neural network for similarity estimation → similarity metrics → similarity database)

As mentioned above, semantic similarity analysis is pivotal in various applications like information retrieval, text summarization, speech enhancement, and automatic dialogue generation. In the initial methodologies proposed by researchers worldwide, semantic similarity was calculated based on the number of similar words in two blocks of text. However, this yielded inaccurate results, as there were instances where two blocks of text shared many words but conveyed different meanings. For example, the sentences "Tom and Harry played Badminton and Cricket" and "Tom played Badminton and Harry played Cricket" are lexically similar, but the contexts of these two sentences are not the same. Conversely, the sentences "Jenny knows many languages" and "Jenny is a polyglot", though lexically dissimilar, convey the same meaning. Similarity analysis based on lexical methods is easy to implement, but it fails on sentence pairs that are lexically dissimilar yet semantically similar.

Today's pioneering ML algorithms and their applications use vectorization for feature extraction. The concept of vectorization in NLP [2] is that each word or phrase in a dataset is represented as a vector, i.e., an array of numbers, making feature extraction easy given the computational efficiency of today's computers. Techniques that use this vectorization concept include Bag of Words [3] and TF-IDF [4]. However, the major limitation of these solutions is that they do not consider the broader context of the text block when computing semantic similarity; the broader context shared by two blocks of text is inversely proportional to their semantic distance. (A short sketch at the end of this section makes this limitation concrete.)

Recurrent Neural Networks and Long Short-Term Memory networks, abbreviated as RNNs and LSTMs, respectively, have been considered effective techniques for learning dependencies between blocks of text. While RNNs depend on the most recent previous blocks, LSTMs capture the broader context of the text. However, both are considered inferior to transformers due to their limitation of sequential data processing. The significant advantages of transformers, such as training on large corpora, non-sequential data processing, self-attention, and positional embeddings that replace recurrence, have popularized them for modern-day NLP tasks. This paper provides insight into how modern transformers can be used for the task of semantic similarity analysis. The major contributions of our paper are listed below:

• We present a comprehensive study on the different methodologies used for the process of semantic similarity analysis.
• We perform state-of-the-art preprocessing techniques and exploratory data analysis on the U.S. Patent Phrase-to-Phrase Matching dataset.
• We perform extensive experimentation and analysis of one traditional and four transformer-based techniques to extract context and perform semantic similarity analysis.
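To make the limitation of purely lexical methods concrete, the following minimal sketch (our own illustrative code, not part of any cited work) scores the two sentence pairs above with TF-IDF cosine similarity. The lexically similar pair receives a high score despite conveying different meanings, while the paraphrase pair receives a low one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pairs = [
    # Lexically similar, semantically different
    ("Tom and Harry played Badminton and Cricket",
     "Tom played Badminton and Harry played Cricket"),
    # Lexically dissimilar, semantically similar
    ("Jenny knows many languages",
     "Jenny is a polyglot"),
]

for a, b in pairs:
    # Fit a TF-IDF vocabulary on the pair and compare the two vectors
    tfidf = TfidfVectorizer().fit_transform([a, b])
    score = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    print(f"{score:.2f}  {a!r} vs. {b!r}")
```

A contextual method should invert this ranking, which is precisely what the transformer-based models evaluated later aim to do.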
II. RELATED RESEARCH OVERVIEW

Research on semantic analysis has been a topic of interest since the 20th century, and many solutions have been proposed and implemented on benchmark datasets. This section presents a comprehensive survey of the datasets and methodologies used for semantic similarity analysis. Table I gives an overview of the widely used datasets for semantic similarity.

TABLE I: Popular datasets for semantic similarity

Author                   Dataset                  Word/Sentence pairs   Similarity score range
Rubenstein et al. [5]    R&G                      65                    0-4
Miller et al. [6]        M&C                      30                    0-4
Finkelstein et al. [7]   WS353                    353                   0-10
Agirre et al. [8]        STS2015                  3000                  0-5
Marelli et al. [9]       SICK                     10000                 1-5
USPTO                    Patent Phrase Matching   33000                 0-1

Benajiba et al. [10] proposed a solution involving a Siamese LSTM regression model used to predict the similarity of the SQL templates of two questions. The authors defined a metric called the SQL structure distance to estimate the similarity under the proposed methodology. To reduce the computational cost of the solution, the authors clustered the training samples using a one-hot lexical representation of the questions. Li et al. [11] proposed another solution for semantic similarity in biomedical sentences using a Siamese neural network (SNN) approach. The methodology integrated an interactive self-attention (ISA) mechanism with an SNN, and the proposed solution was validated on three standard biomedical datasets with an average Pearson score of 0.65. Pontes et al. [12] proposed a Siamese CNN + LSTM model in which the CNN extracts the local context while the LSTM extracts the global context. The proposed methodology was evaluated on the SICK dataset with different combinations of local and global context.

Quan et al. [13] proposed a framework combining the capability of word embeddings and an attention weight mechanism by integrating them into a unified network known as the Attention Constituency Vector Tree (ACVT). The proposed solution was validated on 19 benchmark datasets, including STS'12-STS'15, with a Pearson score of 0.75. Shancheng et al. [14] proposed a double sequential network consisting of identical LSTM layers that simultaneously train two sequences of sentences. The outputs of both layers were passed through a dense layer and compressed to obtain the semantic similarity. The proposed solution addressed the particular characteristics of Chinese and achieved higher accuracy than the Baidu Semantic Text Similarity model it was compared against. Yang et al. [15] proposed a methodology involving an extensive semantic network known as Probase; using the current weights and parameters of Probase, semantic similarity was computed on the M&C, WS353-Sim, and R&G datasets.

In recent years, Generative Adversarial Networks (GANs) have gained tremendous popularity in artificial data generation for various tasks, including image sample generation with limited data [16] and text generation. In this vein, Liang et al. [17] addressed the generation and identification of similar sentences using a GAN-based approach. The authors proposed a syntactic and semantic long short-term memory (SSLSTM) algorithm for evaluating semantic similarity, and three variations of a sentence similarity generative adversarial network (SSGAN) were proposed for generating sentences.

The state-of-the-art solutions for natural language processing tasks involve transformers; precisely, transformers in NLP are used to solve tasks involving dependencies over long sequences. In this context, Li et al. [18] introduced a hybrid Cross2self-attention Bi-RNN + BERT model to compute semantic similarity in biomedical data. The methodology was validated on the OHNLP2018 baselines with an increase of 0.6% in the Pearson coefficient. Another approach, using BERT for the semantic similarity of Outlook emails, was proposed by Sanjeev et al. [19]. Some of the standard approaches in NLP for semantic analysis include Word2Vec, proposed by Google in 2013, and the GloVe model. The related research overview is summarized in Table II.

TABLE II: Overview of the existing solutions

Author                  Methodology                                    Dataset used
Benajiba et al. [10]    Siamese LSTM Regression                        WikiSQL
Li et al. [11]          ISA + Siamese NNs                              DBMI, CDD-ful, CDD-ref
Pontes et al. [12]      Siamese CNN + LSTM                             SICK
Quan et al. [13]        Attention Constituency Vector Tree (ACVT)      STS'12-STS'15
Shancheng et al. [14]   Double Seq. NN + LSTM                          Chinese semantic similarity dataset
Yang et al. [15]        Probase                                        M&C, WS353-Sim, and R&G
Liang et al. [17]       SSLSTM + SSGAN                                 SemEval and Quora
Li et al. [18]          Cross2self, BERT                               Biomedical data
Sanjeev et al. [19]     BERT                                           Outlook emails
Our work                DeBERTa + Stratified K-Fold Cross-Validation   U.S. Patent Phrase to Phrase Matching dataset

III. METHODOLOGY

This section deals with the different techniques used to perform the task of semantic similarity analysis. In this work, we compare and analyze the performance of five different techniques: Levenshtein metric similarity and four variants of the DeBERTa model. This section covers the architecture of each model and how it can be fine-tuned on our dataset to perform the required task.

A. Levenshtein Metric

In natural language processing, the Levenshtein distance between two words is defined as the minimum number of single-character edits required to convert one word into the other [20]. It is a string metric used to quantify the disparity between two sequences; in this context, an edit is an insertion, a substitution, or a deletion. Applications of the Levenshtein distance include DNA analysis and plagiarism checking. In this task of semantic similarity analysis, we experiment with the Levenshtein distance approach on our dataset.
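For reference, a minimal dynamic-programming sketch of the metric is shown below (our own illustrative code; normalizing the distance by the longer string's length is one simple way to obtain a similarity score in [0, 1], not necessarily the exact normalization used in our experiments).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize the distance into a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


print(levenshtein("kitten", "sitting"))                       # 3
print(round(levenshtein_similarity("kitten", "sitting"), 2))  # 0.57
```

The example sentence pairs from the introduction already show why this purely character-level metric cannot capture semantics: edits count spelling changes, not meaning.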
B. DeBERTa

As mentioned in Section II, there has been a remarkable rise in the usage of transformers in many NLP tasks like semantic analysis and dialogue generation. BERT (Bidirectional Encoder Representations from Transformers) is a prominent advancement built on the transformer architecture introduced by researchers at Google [21]. The concept behind BERT is that when sequential data is trained in a bidirectional manner, better and deeper inference can be obtained from the data for specific tasks like language understanding. In machine vision, transfer learning is a technique widely used by researchers across the globe to perform various tasks rather than training a model from scratch: existing deep learning models are transformed into objective-specific models by fine-tuning. This approach has gained significance among NLP researchers worldwide, and transfer learning can now be applied to many NLP tasks. BERT employs a transformer, an attention-based mechanism that comprehends the contextual inference between two words in a corpus. In its simplest form, this architecture has an encoder and a decoder: the purpose of the encoder is to comprehend the input text, while the decoder's purpose is to deliver a prediction.

Since 2018, there has been a rapid rise in the design and development of pre-trained language models like GPT, T5, RoBERTa, StructBERT, and DeBERTa [22]. In this work, we emphasize the different versions of DeBERTa and their performance on the U.S. Patent Phrase to Phrase Matching dataset. DeBERTa is a decoding-enhanced BERT with disentangled attention, which introduces two novel techniques: disentangled attention and enhanced masked decoding. The concept of disentangled attention is that each word or token in the input layer is represented by two vectors corresponding to its content and its position in the corpus; this reflects the fact that a word's position also carries significant importance for context extraction. However, though disentangled attention conveys the relative positions of words, the exact positions of words in a corpus must also be determined to avoid semantic disparity. To achieve this, DeBERTa integrates absolute positional word embeddings prior to its softmax layer. Owing to this architecture, DeBERTa is considered significantly superior to counterparts like RoBERTa [23].

The input to this pipeline is the anchor phrase, followed by a separator token and the context phrase. DeBERTa uses a metric known as the cross-attention score to infer the semantic similarity between two blocks of text. Mathematically, the cross-attention score of a block m with respect to another block n can be represented as shown in Eq. 1, where C_m represents the content of word m and Pos_{m|n} represents the position of word m with respect to n. The cross-attention score between blocks m and n thus decomposes into four components: content-to-content, content-to-position, position-to-content, and position-to-position, as shown in Eq. 1:

S_{m,n} = [C_m, Pos_{m|n}] × [C_n, Pos_{n|m}]^T
        = C_m C_n^T + C_m Pos_{n|m}^T + Pos_{m|n} C_n^T + Pos_{m|n} Pos_{n|m}^T    (1)

However, there have been recent improvements in the composition of DeBERTa owing to ELECTRA-style pre-training [24]. This version, known as DeBERTa-V3, has many variants, including DeBERTa-V3-Base, DeBERTa-V3-Small, DeBERTa-V3-XSmall, and mDeBERTa-V3-Base. The initial version of DeBERTa used a masked language modeling (MLM) objective, which is now replaced by replaced token detection (RTD), considered a more sample-efficient pre-training task. The variants of DeBERTa differ in their backbone parameters, vocabulary, hidden size, and number of layers; the architectural specifications of all the variants are given in Table III. Once the data is fed into the model, we perform stratified K-fold cross-validation.

TABLE III: Specifications of the different versions of DeBERTa

Model                 Vocabulary (K)   Backbone Parameters (M)   Hidden Size   Layers
DeBERTa-V3-Base       128              86                        768           12
DeBERTa-V3-Small      128              44                        768           6
DeBERTa-V3-XSmall     128              22                        384           12
mDeBERTa-V3-Base      250              86                        768           12

C. Stratified K-Fold Cross Validation

It is essential to evaluate the model once it has been trained on the input data. A methodological error is introduced if the model simply retains the parameters fitted to the training samples and is then evaluated on the same data: the prediction scores would be perfect on the known labels, yet the model's performance on unseen data would remain unsatisfactory. This condition is known as overfitting. To prevent overfitting, it is essential to hold out a chunk of the data as a test/validation set. However, there is still a risk of overfitting the test/validation set by tweaking parameters until the estimator performs well on it. To address this, we perform stratified K-fold cross-validation [25]. The data is split into K folds; training is performed on K−1 folds while testing is performed on the remaining fold, yielding a more reliable estimate of the model's performance. Mathematically, the cross-validation estimate CV can be represented as in Eq. 2:

CV = (1/N) Σ_{i=1}^{N} L(y_i, f^{K_i}(x_i))    (2)

where y_i denotes the actual score, f^{K_i}(x_i) denotes the prediction on the K_i-th fold, and L is the loss function. Subsequently, we evaluate the model using the Pearson correlation coefficient.
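As a quick numerical check of Eq. 1, the sketch below (illustrative only; toy dimensions of our choosing, with random vectors standing in for DeBERTa's learned content and relative-position embeddings) reads the bracketed product as attention over the combined content + position representations and verifies that it expands into exactly the four components listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension, for illustration only

# Stand-ins for content vectors and relative-position vectors
C_m, C_n = rng.normal(size=d), rng.normal(size=d)
Pos_m_n, Pos_n_m = rng.normal(size=d), rng.normal(size=d)

# Combined score: attention between [content + position] representations
combined = (C_m + Pos_m_n) @ (C_n + Pos_n_m)

# Disentangled expansion: the four components of Eq. 1
expanded = (C_m @ C_n             # content-to-content
            + C_m @ Pos_n_m       # content-to-position
            + Pos_m_n @ C_n       # position-to-content
            + Pos_m_n @ Pos_n_m)  # position-to-position

assert np.isclose(combined, expanded)
print(f"combined = {combined:.4f}, expanded = {expanded:.4f}")
```

In DeBERTa itself the components are computed with separate projection matrices rather than this naive sum, but the sketch shows why the four-way decomposition in Eq. 1 holds.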
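A minimal sketch of the stratified K-fold loop with per-fold Pearson evaluation follows. It uses scikit-learn and SciPy; the ridge regressor over hypothetical precomputed phrase-pair features is a stand-in for our fine-tuned DeBERTa models, and binning the continuous scores (the dataset's labels fall in {0, 0.25, 0.5, 0.75, 1}) is our assumption for how stratification is applied to a regression target.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 16))  # hypothetical phrase-pair features
y = np.clip(0.3 * X[:, 0] + 0.5 + rng.normal(scale=0.1, size=400), 0.0, 1.0)

# Stratify on binned similarity scores so each fold sees every score level
bins = np.digitize(y, [0.125, 0.375, 0.625, 0.875])

fold_scores = []
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, bins), start=1):
    model = Ridge().fit(X[train_idx], y[train_idx])  # train on K-1 folds
    preds = model.predict(X[val_idx])                # predict on held-out fold
    r, _ = pearsonr(y[val_idx], preds)               # per-fold Pearson correlation
    fold_scores.append(r)
    print(f"Fold {fold}: Pearson r = {r:.3f}")

print(f"Average Pearson r = {np.mean(fold_scores):.3f}")
```

The same loop structure carries over directly when the stand-in regressor is replaced by fine-tuning a DeBERTa variant on each training split.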
Fig. 2: Distribution of terms in (a) anchor phrases, (b) target phrases, (c) context tags; (d) distribution of scores.
Fig. 3: (a) Distribution of anchors with respect to targets, (b) distribution of anchor count with respect to context, (c) distribution of target count with respect to context.
Fig. 5: Training and validation losses of DeBERTa-Small in (a) Fold 1, (b) Fold 2, and (c) Fold 3.
Fig. 6: Variation of Pearson scores of DeBERTa-Small in (a) Fold 1, (b) Fold 2, and (c) Fold 3.
We define the validation loss as the model's performance on the validation/test set. The final metric, the Pearson correlation coefficient, gives us a linear measure of the strength of association between two variables. As depicted in Table IV, DeBERTa-base achieved an average training loss of 0.03, a validation loss of 0.026, and a Pearson correlation score of 0.74. In the subsequent sections, we analyze the performance of DeBERTa-V3-Small, DeBERTa-V3-XSmall, and mDeBERTa-V3-Base.

D. Performance of mDeBERTa

In this section, we examine the performance of multilingual DeBERTa on the training and validation sets. The number of epochs is 5, with a batch size of 128. Despite stratified K-fold cross-validation with the number of folds set to 4, the model showed inferior similarity prediction with a very low Pearson coefficient, as reported in Table V.

TABLE V: Performance metrics of mDeBERTa-V3-Base

Fold   Training Loss   Validation Loss   Pearson Correlation
1      0.273200        0.276361          0.116614
2      0.148500        0.141006          0.154153
3      0.147500        0.140739          0.193211
4      0.150100        0.136254          0.175404

E. Performance of DeBERTa-Small

DeBERTa-Small is an abridged version of DeBERTa-base that retains the critical parameters required for prediction. It is trained on the same 160 GB of data as its predecessor, with 44M backbone parameters, a hidden size of 768, and 6 layers. The model achieved a cumulative Pearson score of 0.78 after three cross-validation folds. The performance metrics of DeBERTa-Small are depicted in Table VI.
From Table VI, we can infer that the best performance is illustrated by DeBERTa-Small. We also illustrate the training and validation losses of each fold in Fig. 5 and the variation of the Pearson score in Fig. 6.

TABLE VI: Performance metrics of DeBERTa-V3-Small

Fold   Training Loss   Validation Loss   Pearson Correlation
1      0.003400        0.026166          0.799629
2      0.003300        0.027797          0.782329
3      0.003500        0.025930          0.803020

F. Performance of DeBERTa-XSmall

DeBERTa-XSmall is a simplified version of DeBERTa-Small with only 22M backbone parameters, half the number of its counterpart. This model achieved a cumulative Pearson score of 0.765 after four cross-validation folds. Its fewer backbone parameters and smaller hidden size explain its lower performance relative to DeBERTa-Small. The performance metrics of DeBERTa-XSmall are depicted in Table VII.

TABLE VII: Performance metrics of DeBERTa-V3-XSmall

Fold   Training Loss   Validation Loss   Pearson Correlation
1      0.039200        0.030078          0.774637
2      0.039200        0.031391          0.765988
3      0.038800        0.029105          0.780142
4      0.038700        0.031934          0.755139

V. CONCLUSION

This paper experimented with traditional and transformer-based approaches for semantic similarity modeling on large corpora. We compared our methodology with existing techniques, and the results demonstrated the improved performance of the model. The proposed methodology also illustrated context extraction and showed its importance in similarity modeling. In future work, the execution time and memory footprint could be optimized, leading to more efficient training, and the architecture of the existing model could be improved, leading to enhanced performance.

REFERENCES

[1] D. Chandrasekaran and V. Mago, "Evolution of semantic similarity—a survey," ACM Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–37, 2021.
[2] A. K. Singh and M. Shashi, "Vectorization of text documents for identifying unifiable news articles," Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 7, 2019.
[3] Y. Zhang, R. Jin, and Z.-H. Zhou, "Understanding bag-of-words model: a statistical framework," International Journal of Machine Learning and Cybernetics, vol. 1, no. 1, pp. 43–52, 2010.
[4] S. Qaiser and R. Ali, "Text mining: use of tf-idf to examine the relevance of words to documents," International Journal of Computer Applications, vol. 181, no. 1, pp. 25–29, 2018.
[5] H. Rubenstein and J. B. Goodenough, "Contextual correlates of synonymy," Communications of the ACM, vol. 8, no. 10, pp. 627–633, 1965.
[6] G. A. Miller and W. G. Charles, "Contextual correlates of semantic similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.
[7] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin, "Placing search in context: The concept revisited," in Proceedings of the 10th International Conference on World Wide Web, 2001, pp. 406–414.
[8] E. Agirre, C. Banea, C. Cardie, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea et al., "SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability," in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015, pp. 252–263.
[9] M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, and R. Zamparelli, "A SICK cure for the evaluation of compositional distributional semantic models," in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 2014, pp. 216–223.
[10] Y. Benajiba, J. Sun, Y. Zhang, L. Jiang, Z. Weng, and O. Biran, "Siamese networks for semantic pattern similarity," in 2019 IEEE 13th International Conference on Semantic Computing (ICSC), IEEE, 2019, pp. 191–194.
[11] Z. Li, H. Lin, W. Zheng, M. M. Tadesse, Z. Yang, and J. Wang, "Interactive self-attentive siamese network for biomedical sentence similarity," IEEE Access, vol. 8, pp. 84093–84104, 2020.
[12] E. L. Pontes, S. Huet, A. C. Linhares, and J. Torres-Moreno, "Predicting the semantic textual similarity with siamese CNN and LSTM," CoRR, vol. abs/1810.10641, 2018.
[13] Z. Quan, Z.-J. Wang, Y. Le, B. Yao, K. Li, and J. Yin, "An efficient framework for sentence similarity modeling," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 4, pp. 853–865, 2019.
[14] T. Shancheng, B. Yunyue, and M. Fuyu, "A semantic text similarity model for double short chinese sequences," in 2018 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), 2018, pp. 736–739.
[15] T. Yang, S. Wu, J. Feng, N. Fu, and M. Tian, "Semantic network based approach to compute term semantic similarity," in 2019 3rd International Conference on Electronic Information Technology and Computer Engineering (EITCE), 2019, pp. 654–658.
[16] P. R. Medi, P. Nemani, V. R. Pitta, V. Udutalapally, D. Das, and S. P. Mohanty, "SkinAid: A GAN-based automatic skin lesion monitoring method for IoMT frameworks," in 2021 19th OITS International Conference on Information Technology (OCIT), 2021, pp. 200–205.
[17] Z. Liang and S. Zhang, "Generating and measuring similar sentences using long short-term memory and generative adversarial networks," IEEE Access, vol. 9, pp. 112637–112654, 2021.
[18] Z. Li, H. Lin, C. Shen, W. Zheng, Z. Yang, and J. Wang, "Cross2self-attentive bidirectional recurrent neural network with BERT for biomedical semantic text similarity," in 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2020, pp. 1051–1054.
[19] M. M. Sanjeev, B. Ramalingam, and S. Kumar T.K., "Realtime semantic similarity analysis of bulk outlook emails using BERT," in 2020 International Conference on Advances in Computing, Communication & Materials (ICACCM), 2020, pp. 89–94.
[20] R. Haldar and D. Mukhopadhyay, "Levenshtein distance technique in dictionary lookup methods: An improved approach," arXiv preprint arXiv:1101.1232, 2011.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[22] P. He, X. Liu, J. Gao, and W. Chen, "DeBERTa: Decoding-enhanced BERT with disentangled attention," in International Conference on Learning Representations, 2021.
[23] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[24] P. He, J. Gao, and W. Chen, "DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing," arXiv preprint arXiv:2111.09543, 2021.
[25] T.-T. Wong and N.-Y. Yang, "Dependency analysis of accuracy estimates in k-fold cross validation," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 11, pp. 2417–2427, 2017.