NLP Project
SENTENCE SIMILARITY
Team Members:
Tanvi Paigude (23MAI0002)
Apurva Wankhade (23MAI0047)
ABSTRACT
The growing field of Natural Language Processing (NLP) has witnessed significant
advancements in recent years, with sentence similarity measurement standing out as a crucial
component in various applications such as text summarization, information retrieval, and
question-answering systems. The primary objective of this research is to provide a
comprehensive analysis of the effectiveness and applicability of cosine similarity in capturing
semantic relationships between sentences. This paper explores the integration of deep
learning approaches, such as neural embeddings, to enhance the capabilities of cosine
similarity in capturing subtle nuances and semantic intricacies within sentences. We discuss
the potential synergy between traditional cosine similarity methods and modern neural
network-based models.
INTRODUCTION
Sentence similarity refers to the quantification of how alike two sentences are in terms of
their semantic meaning or content. It is a crucial concept in Natural Language Processing
(NLP) and computational linguistics, aiming to measure the resemblance between sentences
based on their underlying meaning rather than surface-level features. The importance of
sentence similarity is evident in various NLP applications, including information retrieval,
text summarization, question-answering systems, and machine translation. For example, in
information retrieval, accurate sentence similarity metrics help retrieve relevant documents
based on user queries. In text summarization, sentence similarity aids in selecting and
summarizing key information.
Several methods and metrics are employed to determine sentence similarity, and one
commonly used approach is cosine similarity. Cosine similarity measures the cosine of the
angle between two vectors representing sentences in a high-dimensional space. This method
is robust to variations in sentence length and structure, making it suitable for capturing
semantic relationships. Sentence similarity is a fundamental concept in NLP, contributing to
various applications that require a nuanced understanding of the relationships between
sentences. The ability to measure and analyze sentence similarity facilitates the development
of more sophisticated and context-aware language processing systems.
PROBLEM STATEMENT
The existing methods for measuring sentence similarity, such as cosine similarity, play a
crucial role in various natural language processing (NLP) applications. However, these
methods may not effectively capture subtle nuances and semantic intricacies within
sentences, limiting their applicability in tasks requiring a more nuanced understanding of
sentence relationships. This research aims to address this gap by exploring the integration of
deep learning approaches, like neural embeddings, to enhance the effectiveness of cosine
similarity in capturing semantic relationships between sentences. The problem statement,
therefore, revolves around improving the accuracy and applicability of sentence similarity
measurement in NLP through the integration of traditional cosine similarity methods with
modern neural network-based models.
LITERATURE SURVEY
Several studies have investigated semantic similarity in the context of Natural Language
Processing (NLP) and sentence similarity modeling. Si, S. et al. [1] explore semantic
similarity in Chinese finance-domain sentences using Word2Vec and GloVe word
embeddings. Experimental findings suggest that Word2Vec outperforms GloVe,
demonstrating stronger discrimination ability in distinguishing semantically similar and
dissimilar sentence pairs. Optimal Word2Vec parameters for Chinese character embeddings
are identified as a window size of 6 and embedding dimension of 400.
Chandrasekaran, D. et al. [4] explore the evolution of semantic similarity methods in Natural
Language Processing (NLP), spanning from traditional kernel-based techniques to state-of-
the-art transformer-based models. Categorizing approaches into knowledge-based, corpus-
based, deep neural network–based, and hybrid methods, the survey offers a comprehensive
overview, assessing the strengths and weaknesses of each, providing valuable insights for
researchers tackling the challenging task of estimating semantic similarity in diverse text
data.
Zhao, Y. et al. [5] introduce Enhanced Inter-sentence Attention (EIA), a novel multi-head
self-attention architecture for Semantic Textual Similarity (STS) tasks. By incorporating
attention between sentence pairs, EIA effectively captures semantic relations, achieving
improved performance on benchmark datasets. The proposed architecture, demonstrated on
RoBERTa, outperforms strong baseline models, showcasing its potential as a versatile plug-
and-play unit for enhancing various transformer-based language models in large-scale pre-
trained applications.
Quan, Z. et al. [6] introduce an attention constituency vector tree (ACVT) structure for
sentence similarity modelling, combining syntactic information, semantic features, and
attention weights. The proposed ACVT kernel is tailored for measuring sentence similarity,
demonstrating effectiveness across various datasets and outperforming state-of-the-art
models. The model's key strengths lie in its versatility as a general framework and its
efficiency, avoiding time-consuming training once word embeddings are available. Future
work may explore further enhancements and comparisons with larger training datasets.
Piroozfar, P. et al. [8] introduce an enhanced method for cross-lingual semantic similarity
between Persian and English sentences, utilizing ensemble models with transformers.
Achieving a remarkable 95.28% correlation rate, the approach outperforms previous
techniques, addressing challenges without relying on machine translation and controlling
complexity. The study highlights the potential applications in machine translation,
information retrieval systems, and search engines, suggesting future improvements,
particularly for the Persian language.
Gupta, A. et al. [9] present an enhanced algorithm for semantic similarity computation,
focusing on WordNet noun IS-A and verb relationships to achieve more accurate results
compared to existing methods. The proposed approach employs disambiguation, edge-based
methodology, and semantic vectors to calculate similarity between words, sentences, and
paragraphs. The algorithm demonstrates superior performance with high Pearson correlation
coefficients of 0.875 for word similarity and 0.879 for sentence similarity, showcasing its
potential for advancing search accuracy in the vast landscape of available data.
Kale, N. et al. [10] focus on the growing importance of automated text summarization to
efficiently manage large volumes of data. It explores the integration of modern artificial
intelligence, optimized deep learning methods, and computational cognitive models to
enhance document summarization. The study evaluates the effectiveness of these models
based on precision, recall, and F-measure, emphasizing the superiority of human text
summarization behaviour in comparison to existing techniques.
Beken Fikri et al. [11] introduce a hybrid text summarization model for efficient data
collection in behavioral biology, leveraging T5 transformer preprocessing and combining
seq2seq and stacked LSTM with attention mechanisms. The model addresses challenges in
accessing scattered biomedical information but has a limitation of offering only high-level
document overviews. Future work proposes a multi-task learning strategy to enhance
summarization depth.
Sheikh Abujar et al. [12], in their paper 'Sentence Similarity Estimation for Text
Summarization Using Deep Learning', introduce a sentence similarity measure using both
lexical and semantic approaches, emphasizing the need for further development in Bengali
language resources. It notes the instability of Bengali WordNet compared to its English
counterparts. The research favors an unsupervised approach due to data constraints but
acknowledges the potential of supervised learning with a larger dataset. The paper suggests
exploring additional semantic and syntactic analyses for sentence similarity and proposes
combining these with lexical similarities for improved results. Identifying leading sentences
is crucial for better text summarization, with centroid sentences aiding in post-processing.
The importance of evaluating the summarizer and considering backtracking methods is
highlighted for optimal results.
'A Novel Hybrid Methodology of Measuring Sentence Similarity' by Tak-Sung Heo et al. [13]
compares the performance of models using only deep learning versus models incorporating
their proposed method. Evaluation metrics include the Pearson and Spearman correlation
coefficients. Both correlation coefficients are higher when considering both deep learning and
lexical relationships, compared to using only deep learning.
Li, Y., McLean, D., et al. [16] introduce an algorithm for measuring semantic similarity between
short texts or sentences, focusing on efficient computation without the need for high-
dimensional processing. The method utilizes semantic information from a lexical database
and corpus statistics, enabling adaptation to different domains. The proposed approach,
validated through experiments, demonstrates significant correlation with human intuition,
making it applicable in various text-related tasks such as knowledge representation and
discovery.
Wang, Z. et al. [17] of the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA,
propose Sentence Similarity Learning by Lexical Decomposition and Composition, a model
for evaluating sentence similarity by breaking down and reconstructing lexical semantics. To
address the lexical gap issue, the model uses context vectors to represent each word. Various
methods are employed to decompose word vectors into similar and dissimilar components,
extracting features from both sentence similarity and dissimilarity. The model utilizes a two-
channel CNN with diverse n-gram filters to capture features at different levels. Experimental
results demonstrate the model's effectiveness in tasks such as answer sentence selection and
paraphrase identification.
'Survey on Sentence Similarity Evaluation using Deep Learning' by J. Ramaprabha et al. [19]
discusses how detecting semantic equivalence between questions that use different
vocabulary and syntactic structures poses a challenge. In online forums like Quora and Stack
Overflow, maintaining a high-quality knowledge base is crucial to avoid redundant
information. Ensuring that each unique question exists only once helps writers avoid
repeating answers, and readers can easily find the information they seek. The example
provided illustrates the need to identify duplicate questions with similar intent, such as those
related to effective weight loss strategies.
Zhang, P. et al. [20] propose a multi-model nonlinear fusion algorithm incorporating the Jaccard
coefficient, TF-IDF, and word2vec-CNN to measure sentence similarity. By leveraging
weighted vectors and a fully connected neural network, the model achieves an 84% matching
accuracy and a 75% F1 value, demonstrating improved global understanding of sentence
features compared to fine-grained extraction methods.
Tien et al. [21] introduce the M-MaxLSTM-CNN model, leveraging multiple sets of pre-
trained word embeddings to encode diverse linguistic properties into a novel sentence
representation. The proposed approach achieves robust performance across various tasks such
as measuring textual similarity, identifying paraphrase, and recognizing textual entailment,
outperforming state-of-the-art methods without the need for handcrafted features or uniform
dimensions in pre-trained word embeddings. The innovative Multi-level comparison
technique enables effective learning of semantic textual relations.
Shuang Zhang et al. [22], in their paper 'A Survey of Semantic Similarity and its
Application to Social Network Analysis', offer a concise survey of semantic similarity,
covering both semantic similarity between concepts and semantic textual similarity. The
methods for semantic similarity between concepts are classified into four categories based on
the background information resource used. Similarly, methods for semantic textual similarity
are also classified into four categories. The survey highlights the importance of semantic
similarity measures in text-related research and applications, particularly in online social
network analysis. The paper discusses how similarity computation methods play a crucial
role in various aspects of social network analysis.
The paper by Yunhong Xu et al. [23] addresses the impact of Web 2.0 technologies on
communication and collaboration among researchers, leading to information overload. To
alleviate this issue, the paper advocates for the development of researcher recommendation
agents to offer personalized suggestions for potential collaborations. The proposed approach
integrates social network analysis and semantic concept analysis in a unified framework to
enhance the effectiveness of personalized researcher recommendations. The paper
emphasizes the improvement in knowledge discovery and exchange, ultimately enhancing
researchers' productivity. Experimental results demonstrate that the proposed approach
outperforms baseline methods, and a case study illustrates its application in real-world
academic contexts.
Saad, S. M. et al. [24] investigate semantic similarity measurement techniques for sentences,
specifically comparing Jaccard, Cosine, and Dice similarity measures. Utilizing WordNet for
word-to-word semantic similarity calculation, the research concludes that Jaccard and Dice
outperform in measuring semantic similarity between sentences, providing valuable insights
for applications like text mining and question answering. Further testing with real data and
human experts is suggested for comprehensive evaluation.
Blagec, K. et al. [25] explore the effectiveness of unsupervised neural embedding models in
estimating semantic similarity of sentences from biomedical literature. Trained on 1.7 million
articles from PubMed, the best model, based on Paragraph Vector Distributed Memory,
outperforms previous state-of-the-art results on the BIOSSES biomedical benchmark set.
Additionally, a supervised model combining string-based similarity metrics with neural
embeddings surpasses ontology-dependent approaches, highlighting the value of neural
network-based models in biomedical semantic similarity estimation. However, challenges
remain in addressing contradictions and negations in biomedical sentences.
Qurashi, A. W. et al. [26] explore the application of Natural Language Processing (NLP)
techniques to measure semantic text similarity in safety-critical documents related to railway
safety. The study employs preprocessing, NLP toolkits, and Jaccard and Cosine similarity
metrics to automate the identification of equivalent rules and procedures. Results indicate that
the Cosine similarity metric outperforms Jaccard in analysing safety-critical documents,
providing a consistent method for maintaining safety instructions. The proposed
methodology, applicable to other domains, holds promise for future analysis on various
hardware platforms.
Soler, A. G. et al. [27] explore usage similarity estimation by utilizing contextualized word
and sentence embeddings (ELMo and BERT) along with supervised models. Leveraging
lexical substitute annotations from context2vec, the proposed models demonstrate superior
performance in both graded and binary similarity tasks, outperforming previous methods and
highlighting the effectiveness of BERT for usage similarity prediction.
Ezzikouri, H. et al. [28] introduce a novel approach for computing semantic similarity
between concepts in WordNet, leveraging set theory and WordNet properties. The method
combines synsets and glosses to enhance similarity scores, with potential applications in
information retrieval, plagiarism detection, and sentiment analysis, addressing the need for
more robust semantic similarity calculations in various domains. The proposed technique
utilizes synonymy relationships based on synsets and glosses to maximize similarity scores
between concepts.
Shahmirzadi, O. et al. [29] evaluate various vector space models for measuring semantic text
similarity, specifically in the context of patent-to-patent similarity. Surprisingly, the simpler
TFIDF model outperforms more complex methods like LSI and D2V in many cases,
especially for longer and more technical texts, or when making finer-grained distinctions
between nearest neighbors. The study suggests that, for the practical application of patent
similarity detection, the cost-effective and simple TFIDF model is often a sensible choice,
while more complex models may be justified only in specific scenarios, such as when dealing
with condensed text and relatively coarse similarity detection.
Pawar, A. et al. [30] introduce a novel methodology for calculating semantic similarity
across various domains by incorporating corpora-based statistics into a standardized
algorithm. Employing an edge-based approach with a lexical database, the method achieves
high correlation values (r=0.8753 for word and r=0.8793 for sentence similarity) on
benchmark standards and human similarity datasets, outperforming other unsupervised
models. The approach involves disambiguating sentences, utilizing information content from
corpora, and forming semantic vectors for accurate similarity calculations, making it a
valuable tool with low computing overhead for professionals in diverse domains.
OPEN-SOURCE TOOLS:
1. NLTK
2. scikit-learn (sklearn)
3. cosine_similarity (from sklearn.metrics.pairwise)
PROPOSED ARCHITECTURE
The proposed architecture facilitates the calculation of cosine similarity between two
sentences, aiding in determining their semantic similarity. It involves a series of key steps,
starting with preprocessing to standardize the sentences for comparison. This includes
tokenization, lowercasing, removing stop words, and stemming or lemmatization to handle
word variations. Following preprocessing, each sentence is transformed into a vector in a
high-dimensional space.
This vectorization process can be achieved using the bag-of-words (BoW) model, where each
dimension represents a unique word, or through word embeddings like Word2Vec, GloVe, or
FastText, which represent words as dense vectors in a continuous space. The cosine similarity
is then calculated between the vector representations of the sentences. This metric measures
the cosine of the angle between the vectors, providing a score between -1 and 1. A score
closer to 1 indicates higher similarity, while 0 indicates no similarity and -1 indicates
dissimilarity.
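As a quick, hand-made illustration of this score (toy vectors only, not the full pipeline described below):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vec_a = np.array([[1.0, 1.0, 0.0]])  # toy vector standing in for sentence A
vec_b = np.array([[1.0, 0.0, 1.0]])  # toy vector standing in for sentence B
print(cosine_similarity(vec_a, vec_b)[0][0])  # 0.5: the vectors partially share a direction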
To interpret the similarity score, a threshold can be set based on the application's
requirements. For instance, a score exceeding a certain threshold may indicate that the
sentences are similar, while scores below the threshold imply dissimilarity. Efficiency can be
optimized by employing techniques such as approximate nearest neighbour search or
dimensionality reduction, especially for large datasets, to speed up the vectorization and
similarity calculation processes.
1. Preprocessing:
Tokenization: Break each sentence into individual words or tokens.
Lowercasing: Convert all words to lowercase to ensure case-insensitive comparison.
Removing stop words: Eliminate common words like "the", "is", "and", etc., as they don't
contribute much to the meaning.
Stemming or Lemmatization: Reduce words to their base or root form to handle
variations like "running" and "ran" to "run".
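A minimal sketch of this preprocessing step using NLTK (assuming the punkt, stopwords, and wordnet resources have already been downloaded; the preprocess helper name is ours):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess(sentence):
    # Tokenize, lowercase, remove stop words, and lemmatize
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(sentence.lower())             # tokenization + lowercasing
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation tokens
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("The quick brown foxes jumped over the lazy dogs"))
# ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']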
2. Vectorization:
After preprocessing, each sentence is represented as a vector in a high-dimensional space.
One common technique is using the bag-of-words (BoW) model. Each dimension in the
vector represents a unique word, and the value at each dimension represents the frequency
of that word in the sentence.
Another approach is to use word embeddings like Word2Vec, GloVe, or FastText to
represent words as dense vectors in a continuous vector space.
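For instance, a bag-of-words sketch using scikit-learn's CountVectorizer (one of several possible vectorizers; TF-IDF weighting is a common alternative):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the quick brown fox jumps over the lazy dog",
             "a brown fox jumps over a lazy dog"]
vectorizer = CountVectorizer()
bow_vectors = vectorizer.fit_transform(sentences)  # one row per sentence
print(vectorizer.get_feature_names_out())          # each column is a unique word
print(bow_vectors.toarray())                       # word frequencies per sentence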
3. Cosine Similarity Calculation:
The cosine similarity between the two sentence vectors A and B is computed as
cosine_similarity(A, B) = (A · B) / (||A|| ||B||)
where A · B denotes the dot product of A and B, and ||A|| and ||B|| are the magnitudes of the
vectors. For example, for A = (1, 2) and B = (2, 4), A · B = 10 and ||A|| ||B|| = √5 · 2√5 = 10,
giving a similarity of 1, as the vectors point in the same direction.
4. Interpretation:
A score closer to 1 indicates higher similarity, while a score near 0 indicates little or no
similarity between the sentences.
5. Thresholding:
A threshold can be set based on the application's requirements; scores above it mark the
sentences as similar, and scores below it as dissimilar.
6. Optimization:
Depending on the size of the dataset and computational resources, you might need to
optimize the vectorization and cosine similarity calculation steps for efficiency.
Techniques like using approximate nearest neighbour search or dimensionality reduction
can be employed to speed up the process.
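A sketch of such an optimization (illustrative only: scikit-learn's NearestNeighbors performs exact search; libraries such as Annoy or FAISS provide true approximate nearest-neighbour indexes):

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((10000, 300)).astype('float32')  # hypothetical sentence vectors

svd = TruncatedSVD(n_components=50)             # dimensionality reduction
X_reduced = svd.fit_transform(X)

nn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(X_reduced)
distances, indices = nn.kneighbors(X_reduced[:1])  # neighbours of the first sentence
print(indices[0])  # indices of the 5 most similar sentences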
IMPLEMENTATION
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# NLTK resources needed once: nltk.download('punkt'); nltk.download('stopwords')

def load_glove_embeddings(embedding_file):
    # Load pre-trained GloVe vectors into a word -> vector dictionary
    embeddings_index = {}
    with open(embedding_file, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

def sentence_to_vector(sentence, embeddings_index):
    # Average the embeddings of the sentence's non-stop-word tokens
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in word_tokenize(sentence.lower()) if t not in stop_words]
    vectors = [embeddings_index[t] for t in tokens if t in embeddings_index]
    if not vectors:
        return None  # Return None if no word in the sentence is in the embeddings
    sentence_vector = np.mean(vectors, axis=0)  # Mean of word vectors represents the sentence
    return sentence_vector

def sentence_similarity(sentence1, sentence2, embeddings_index):
    # Cosine similarity between the mean-pooled vectors of two sentences
    vec1 = sentence_to_vector(sentence1, embeddings_index)
    vec2 = sentence_to_vector(sentence2, embeddings_index)
    if vec1 is None or vec2 is None:
        return 0.0  # No vocabulary overlap with the embeddings
    cosine_sim = cosine_similarity(vec1.reshape(1, -1), vec2.reshape(1, -1))
    return cosine_sim[0][0]

# Example sentences
sentence1 = "The quick brown fox jumps over the lazy dog"
sentence2 = "A brown fox jumps over a lazy dog"
sentence3 = "The sky is blue"
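A usage sketch for the functions above (the GloVe file path is a placeholder; any pre-trained GloVe file, e.g. glove.6B.100d.txt, can be substituted):

embeddings = load_glove_embeddings('glove.6B.100d.txt')  # placeholder path
print(sentence_similarity(sentence1, sentence2, embeddings))  # near-paraphrases: high score
print(sentence_similarity(sentence1, sentence3, embeddings))  # unrelated topics: low score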
REFERENCES
[1] Si, S., Zheng, W., Zhou, L., & Zhang, M. (2019) 'Sentence Similarity Computation in
Question Answering Robot', Journal of Physics: Conference Series, 1237(2), 022093.
doi:10.1088/1742-6596/1237/2/022093.
[2] Lopez-Gazpio, I., Maritxalar, M., Lapata, M., & Agirre, E. (2019) 'Word n-gram
attention models for sentence similarity and inference', Expert Systems with
Applications, 132, pp. 1–11. doi:10.1016/j.eswa.2019.04.054.
[3] Singh, R., Singh, S. (2021) 'Text Similarity Measures in News Articles by Vector
Space Model Using NLP', J. Inst. Eng. India Ser. B, 102, pp. 329–338.
doi:10.1007/s40031-020-00501-5.
[6] Quan, Z., Wang, Z. -J., Le, Y., Yao, B., Li, K., & Yin, J. (2019) 'An Efficient
Framework for Sentence Similarity Modeling', IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 27(4), pp. 853-865. doi:
10.1109/TASLP.2019.2899494.
[7] Wang, X., He, J., Wang, P., Zhou, Y., Sun, T., & Qiu, X. (2024) 'DenoSent: A
Denoising Objective for Self-Supervised Sentence Representation Learning', arXiv
preprint arXiv:2401.13621.
[8] Piroozfar, P., Abdous, M., Minaei Bidgoli, B., et al. (2024) 'Ensemble Transformer
for Cross Lingual Semantic Textual Similarity', Preprint, Research Square,
[https://doi.org/10.21203/rs.3.rs-3860915/v1] (Version 1), 16 January.
[9] Gupta, A., Sharma, K., & Goyal, K. K. (2023) 'Computation of Similarity Between
Two Pair of Sentence Using Word-Net', International Journal of Intelligent Systems
and Applications in Engineering, 11(5s), pp. 458–467.
[10] Kale, N., Dahake, R. P., & Metre, K. V. (2023) 'Text summarization based on human
behavioural learning model', Journal of Integrated Science and Technology, 12(2), p.
741.
[11] Beken Fikri, F., Oflazer, K., & Yanikoglu, B. (2021) 'Semantic Similarity Based
Evaluation for Abstractive News Summarization', In Proceedings of the 1st Workshop
on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pp. 24–33.
Association for Computational Linguistics.
[12] Abujar, S., Hasan, M., Hossain, S.A. (2019) 'Sentence Similarity Estimation for Text
Summarization Using Deep Learning', In: Kulkarni, A., Satapathy, S., Kang, T.,
Kashan, A. (eds) Proceedings of the 2nd International Conference on Data
Engineering and Communication Technology. Advances in Intelligent Systems and
Computing, vol 828. Springer, Singapore.
[13] Yoo, Y., Heo, T.-S., Park, Y., & Kim, K. (2021) 'A Novel Hybrid Methodology of
Measuring Sentence Similarity', Symmetry, 13(8), p. 1442.
[14] Sun, X., Meng, Y., Ao, X., Wu, F., Zhang, T., Li, J., & Fan, C. (2022) 'Sentence
Similarity Based on Contexts', Transactions of the Association for Computational
Linguistics, 10, pp. 573–588. doi:10.1162/tacl_a_00477.
[15] Farouk, M. (2020) 'Measuring Text Similarity Based on Structure and Word
Embedding', Cognitive Systems Research. doi:10.1016/j.cogsys.2020.04.002.
[16] Li, Y., McLean, D., Bandar, Z. A., O’Shea, J. D., & Crockett, K. (2006) 'Sentence
similarity based on semantic nets and corpus statistics', IEEE Transactions on
Knowledge and Data Engineering, 18(8), pp. 1138–1150. doi: 10.1109/tkde.2006.130.
[17] Wang, Z., Mi, H., Ittycheriah, A. (2017) 'Sentence Similarity Learning by Lexical
Decomposition and Composition', arXiv preprint arXiv:1602.07019.
[18] Haque, R., Naskar, S. K., Way, A., Costa-jussa, M. R., & Banchs, R. E. (2010)
'Sentence Similarity-Based Source Context Modelling in PBSMT', 2010 International
Conference on Asian Language Processing. doi: 10.1109/ialp.2010.45.
[19] Ramaprabha, J., Das, S., Mukerjee, P. (2018) 'Survey on Sentence Similarity
Evaluation using Deep Learning', Journal of Physics: Conference Series, 1000,
012070.
[20] Zhang, P., Huang, X., Wang, Y., Jiang, C., He, S., & Wang, H. (2021) 'Semantic
Similarity Computing Model Based on Multi Model Fine-Grained Nonlinear Fusion',
IEEE Access, 9, pp. 8433–8443. doi:10.1109/access.2021.3049378.
[21] Tien, N. H., Le, N. M., Tomohiro, Y., & Tatsuya, I. (2019) 'Sentence modeling via
multiple word embeddings and multi-level comparison for semantic textual
similarity', Information Processing & Management, 56(6), 102090.
doi:10.1016/j.ipm.2019.102090.
[22] Zhang, S., Zheng, X., & Hu, C. (2015) 'A survey of semantic similarity and its
application to social network analysis', 2015 IEEE International Conference on Big
Data (Big Data), Santa Clara, CA, USA, pp. 2362-236
[23] Xu, Y., Guo, X., Hao, J., Ma, J., Lau, R. Y. K., & Xu, W. (2012) 'Combining social
network and semantic concept analysis for personalized academic researcher
recommendation', Decision Support Systems, 54(1), pp. 564–573. ISSN 0167-9236.
[24] Saad, S. M., & Kamarudin, S. S. (2013) 'Comparative analysis of similarity measures
for sentence level semantic measurement of text', 2013 IEEE International Conference
on Control System, Computing and Engineering. doi:10.1109/iccsce.2013.671993.
[25] Blagec, K., Xu, H., Agibetov, A., & Samwald, M. (2019) 'Neural sentence embedding
models for semantic similarity estimation in the biomedical domain', BMC
Bioinformatics, 20(1). doi:10.1186/s12859-019-2789-2.
[26] Qurashi, A. W., Holmes, V., & Johnson, A. P. (2020) 'Document Processing:
Methods for Semantic Text Similarity Analysis', 2020 International Conference on
INnovations in Intelligent SysTems and Applications (INISTA).
doi:10.1109/inista49547.2020.9194665.
[27] Soler, A.G., Apidianaki, M., & Allauzen, A. (2019) 'Word Usage Similarity
Estimation with Sentence Representations and Automatic Substitutes', arXiv preprint
arXiv:1905.08377.
[28] Ezzikouri, H., Madani, Y., Erritali, M., & Oukessou, M. (2019) 'A New Approach for
Calculating Semantic Similarity between Words Using WordNet and Set Theory',
Procedia Computer Science, 151, pp. 1261–1265. doi:10.1016/j.procs.2019.04.182.
[29] Shahmirzadi, O., Lugowski, A., & Younge, K. (2019) 'Text Similarity in Vector
Space Models: A Comparative Study', 2019 18th IEEE International Conference On
Machine Learning And Applications (ICMLA). doi:10.1109/icmla.2019.00120.
[30] Pawar, A., & Mago, V. (2019) 'Challenging the boundaries of unsupervised learning
for semantic similarity', IEEE Access, 1–1. doi:10.1109/access.2019.2891692.