NLP Project
SENTENCE SIMILARITY
Team Members:
Tanvi Paigude (23MAI0002)
Apurva Wankhade (23MAI0047)
ABSTRACT
The growing field of Natural Language Processing (NLP) has witnessed significant
advancements in recent years, with sentence similarity measurement standing out as a crucial
component in various applications such as text summarization, information retrieval, and
question-answering systems. The primary objective of this research is to provide a
comprehensive analysis of the effectiveness and applicability of cosine similarity in capturing
semantic relationships between sentences. This paper explores the integration of deep
learning approaches, such as neural embeddings, to enhance the capabilities of cosine
similarity in capturing subtle nuances and semantic intricacies within sentences. We discuss
the potential synergy between traditional cosine similarity methods and modern neural
network-based models.
INTRODUCTION
Sentence similarity refers to the quantification of how alike two sentences are in terms of
their semantic meaning or content. It is a crucial concept in Natural Language Processing
(NLP) and computational linguistics, aiming to measure the resemblance between sentences
based on their underlying meaning rather than surface-level features. The importance of
sentence similarity is evident in various NLP applications, including information retrieval,
text summarization, question-answering systems, and machine translation. For example, in
information retrieval, accurate sentence similarity metrics help retrieve relevant documents
based on user queries. In text summarization, sentence similarity aids in selecting and
summarizing key information.
Several methods and metrics are employed to determine sentence similarity, and one
commonly used approach is cosine similarity. Cosine similarity measures the cosine of the
angle between two vectors representing sentences in a high-dimensional space. This method
is robust to variations in sentence length and structure, making it suitable for capturing
semantic relationships. Sentence similarity is a fundamental concept in NLP, contributing to
various applications that require a nuanced understanding of the relationships between
sentences. The ability to measure and analyze sentence similarity facilitates the development
of more sophisticated and context-aware language processing systems.
PROBLEM STATEMENT
The existing methods for measuring sentence similarity, such as cosine similarity, play a
crucial role in various natural language processing (NLP) applications. However, these
methods may not effectively capture subtle nuances and semantic intricacies within
sentences, limiting their applicability in tasks requiring a more nuanced understanding of
sentence relationships. This research aims to address this gap by exploring the integration of
deep learning approaches, like neural embeddings, to enhance the effectiveness of cosine
similarity in capturing semantic relationships between sentences. The problem statement,
therefore, revolves around improving the accuracy and applicability of sentence similarity
measurement in NLP through the integration of traditional cosine similarity methods with
modern neural network-based models.
LITERATURE SURVEY
Several studies have investigated semantic similarity in the context of Natural Language
Processing (NLP) and sentence similarity modeling. Si, S. et al. [1] explore semantic
similarity in Chinese finance-domain sentences using Word2Vec and GloVe word
embeddings. Experimental findings suggest that Word2Vec outperforms GloVe,
demonstrating stronger discrimination ability in distinguishing semantically similar and
dissimilar sentence pairs. Optimal Word2Vec parameters for Chinese character embeddings
are identified as a window size of 6 and embedding dimension of 400.
Chandrasekaran, D. et al. [4] explore the evolution of semantic similarity methods in Natural
Language Processing (NLP), spanning from traditional kernel-based techniques to state-of-
the-art transformer-based models. Categorizing approaches into knowledge-based, corpus-
based, deep neural network–based, and hybrid methods, the survey offers a comprehensive
overview, assessing the strengths and weaknesses of each, providing valuable insights for
researchers tackling the challenging task of estimating semantic similarity in diverse text
data.
Zhao, Y. et al. [5] introduce Enhanced Inter-sentence Attention (EIA), a novel multi-head
self-attention architecture for Semantic Textual Similarity (STS) tasks. By incorporating
attention between sentence pairs, EIA effectively captures semantic relations, achieving
improved performance on benchmark datasets. The proposed architecture, demonstrated on
RoBERTa, outperforms strong baseline models, showcasing its potential as a versatile plug-
and-play unit for enhancing various transformer-based language models in large-scale pre-
trained applications.
Quan, Z. et al. [6] introduce an attention constituency vector tree (ACVT) structure for
sentence similarity modelling, combining syntactic information, semantic features, and
attention weights. The proposed ACVT kernel is tailored for measuring sentence similarity,
demonstrating effectiveness across various datasets and outperforming state-of-the-art
models. The model's key strengths lie in its versatility as a general framework and its
efficiency, avoiding time-consuming training once word embeddings are available. Future
work may explore further enhancements and comparisons with larger training datasets.
Piroozfar, P. et al. [8] introduce an enhanced method for cross-lingual semantic similarity
between Persian and English sentences, utilizing ensemble models with transformers.
Achieving a remarkable 95.28% correlation rate, the approach outperforms previous
techniques, addressing challenges without relying on machine translation and controlling
complexity. The study highlights the potential applications in machine translation,
information retrieval systems, and search engines, suggesting future improvements,
particularly for the Persian language.
Gupta, A. et al. [9] present an enhanced algorithm for semantic similarity computation,
focusing on WordNet noun IS-A and verb relationships to achieve more accurate results
compared to existing methods. The proposed approach employs disambiguation, edge-based
methodology, and semantic vectors to calculate similarity between words, sentences, and
paragraphs. The algorithm demonstrates superior performance with high Pearson correlation
coefficients of 0.875 for word similarity and 0.879 for sentence similarity, showcasing its
potential for advancing search accuracy in the vast landscape of available data.
Kale, N. et al. [10] focus on the growing importance of automated text summarization to
efficiently manage large volumes of data. It explores the integration of modern artificial
intelligence, optimized deep learning methods, and computational cognitive models to
enhance document summarization. The study evaluates the effectiveness of these models
based on precision, recall, and F-measure, emphasizing the superiority of human text
summarization behaviour in comparison to existing techniques.
Beken Fikri et al. [11] introduce a hybrid text summarization model for efficient data
collection in behavioral biology, leveraging T5 transformer preprocessing and combining
seq2seq and stacked LSTM with attention mechanisms. The model addresses challenges in
accessing scattered biomedical information but has a limitation of offering only high-level
document overviews. Future work proposes a multi-task learning strategy to enhance
summarization depth.
Sheikh Abujar et al. [12], in their paper 'Sentence Similarity Estimation for Text
Summarization Using Deep Learning', introduce a sentence similarity measure using both
lexical and semantic approaches, emphasizing the need for further development in Bengali
language resources. It notes the instability of Bengali WordNet compared to its English
counterparts. The research favors an unsupervised approach due to data constraints but
acknowledges the potential of supervised learning with a larger dataset. The paper suggests
exploring additional semantic and syntactic analyses for sentence similarity and proposes
combining these with lexical similarities for improved results. Identifying leading sentences
is crucial for better text summarization, with centroid sentences aiding in post-processing.
The importance of evaluating the summarizer and considering backtracking methods is
highlighted for optimal results.
'A Novel Hybrid Methodology of Measuring Sentence Similarity' by Tak-Sung Heo et al. [13]
compares the performance of models using only deep learning versus models incorporating
their proposed method. Evaluation metrics include the Pearson and Spearman correlation
coefficients. Both correlation coefficients are higher when considering both deep learning and
lexical relationships, compared to using only deep learning.
Li, Y., McLean, D., et al. [16] introduce an algorithm for measuring semantic similarity between
short texts or sentences, focusing on efficient computation without the need for high-
dimensional processing. The method utilizes semantic information from a lexical database
and corpus statistics, enabling adaptation to different domains. The proposed approach,
validated through experiments, demonstrates significant correlation with human intuition,
making it applicable in various text-related tasks such as knowledge representation and
discovery.
Wang, Z. et al. [17] of the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA,
propose Sentence Similarity Learning by Lexical Decomposition and Composition, a model
for evaluating sentence similarity by breaking down and reconstructing lexical semantics. To
address the lexical gap issue, the model uses context vectors to represent each word. Various
methods are employed to decompose word vectors into similar and dissimilar components,
extracting features from both sentence similarity and dissimilarity. The model utilizes a two-
channel CNN with diverse n-gram filters to capture features at different levels. Experimental
results demonstrate the model's effectiveness in tasks such as answer sentence selection and
paraphrase identification.
'Survey on Sentence Similarity Evaluation using Deep Learning' by J. Ramaprabha et al. [19]
discusses how detecting semantic equivalence between questions that use different
vocabulary and syntactic structures poses a challenge. In online forums like Quora and Stack
Overflow, maintaining a high-quality knowledge base is crucial to avoid redundant
information. Ensuring that each unique question exists only once helps writers avoid
repeating answers, and readers can easily find the information they seek. The example
provided illustrates the need to identify duplicate questions with similar intent, such as those
related to effective weight loss strategies.
Zhang, P. et al. [20] propose a multi-model nonlinear fusion algorithm incorporating the Jaccard
coefficient, TF-IDF, and word2vec-CNN to measure sentence similarity. By leveraging
weighted vectors and a fully connected neural network, the model achieves an 84% matching
accuracy and a 75% F1 value, demonstrating improved global understanding of sentence
features compared to fine-grained extraction methods.
Tien et al. [21] introduce the M-MaxLSTM-CNN model, leveraging multiple sets of pre-
trained word embeddings to encode diverse linguistic properties into a novel sentence
representation. The proposed approach achieves robust performance across various tasks such
as measuring textual similarity, identifying paraphrase, and recognizing textual entailment,
outperforming state-of-the-art methods without the need for handcrafted features or uniform
dimensions in pre-trained word embeddings. The innovative Multi-level comparison
technique enables effective learning of semantic textual relations.
Shuang Zhang et al. [22], in their paper 'A Survey of Semantic Similarity and its
Application to Social Network Analysis', offer a concise survey of semantic similarity,
covering both semantic similarity between concepts and semantic textual similarity. The
methods for semantic similarity between concepts are classified into four categories based on
the background information resource used. Similarly, methods for semantic textual similarity
are also classified into four categories. The survey highlights the importance of semantic
similarity measures in text-related research and applications, particularly in online social
network analysis. The paper discusses how similarity computation methods play a crucial
role in various aspects of social network analysis.
The paper by Yunhong Xu et al. [23] addresses the impact of Web 2.0 technologies on
communication and collaboration among researchers, leading to information overload. To
alleviate this issue, the paper advocates for the development of researcher recommendation
agents to offer personalized suggestions for potential collaborations. The proposed approach
integrates social network analysis and semantic concept analysis in a unified framework to
enhance the effectiveness of personalized researcher recommendations. The paper
emphasizes the improvement in knowledge discovery and exchange, ultimately enhancing
researchers' productivity. Experimental results demonstrate that the proposed approach
outperforms baseline methods, and a case study illustrates its application in real-world
academic contexts.
Saad, S. M. et al. [24] investigate semantic similarity measurement techniques for sentences,
specifically comparing Jaccard, Cosine, and Dice similarity measures. Utilizing WordNet for
word-to-word semantic similarity calculation, the research concludes that Jaccard and Dice
outperform in measuring semantic similarity between sentences, providing valuable insights
for applications like text mining and question answering. Further testing with real data and
human experts is suggested for comprehensive evaluation.
Blagec, K. et al. [25] explore the effectiveness of unsupervised neural embedding models in
estimating semantic similarity of sentences from biomedical literature. Trained on 1.7 million
articles from PubMed, the best model, based on Paragraph Vector Distributed Memory,
outperforms previous state-of-the-art results on the BIOSSES biomedical benchmark set.
Additionally, a supervised model combining string-based similarity metrics with neural
embeddings surpasses ontology-dependent approaches, highlighting the value of neural
network-based models in biomedical semantic similarity estimation. However, challenges
remain in addressing contradictions and negations in biomedical sentences.
Qurashi, A. W. et al. [26] explore the application of Natural Language Processing (NLP)
techniques to measure semantic text similarity in safety-critical documents related to railway
safety. The study employs preprocessing, NLP toolkits, and Jaccard and Cosine similarity
metrics to automate the identification of equivalent rules and procedures. Results indicate that
the Cosine similarity metric outperforms Jaccard in analysing safety-critical documents,
providing a consistent method for maintaining safety instructions. The proposed
methodology, applicable to other domains, holds promise for future analysis on various
hardware platforms.
Soler, A. G. et al. [27] explore usage similarity estimation by utilizing contextualized word
and sentence embeddings (ELMo and BERT) along with supervised models. Leveraging
lexical substitute annotations from context2vec, the proposed models demonstrate superior
performance in both graded and binary similarity tasks, outperforming previous methods and
highlighting the effectiveness of BERT for usage similarity prediction.
Ezzikouri, H. et al. [28] introduce a novel approach for computing semantic similarity
between concepts in WordNet, leveraging set theory and WordNet properties. The method
combines synsets and glosses to enhance similarity scores, with potential applications in
information retrieval, plagiarism detection, and sentiment analysis, addressing the need for
more robust semantic similarity calculations in various domains. The proposed technique
utilizes synonymy relationships based on synsets and glosses to maximize similarity scores
between concepts.
Shahmirzadi, O. et al. [29] evaluate various vector space models for measuring semantic text
similarity, specifically in the context of patent-to-patent similarity. Surprisingly, the simpler
TFIDF model outperforms more complex methods like LSI and D2V in many cases,
especially for longer and more technical texts, or when making finer-grained distinctions
between nearest neighbors. The study suggests that, for the practical application of patent
similarity detection, the cost-effective and simple TFIDF model is often a sensible choice,
while more complex models may be justified only in specific scenarios, such as when dealing
with condensed text and relatively coarse similarity detection.
Pawar, A. et al. [30] introduce a novel methodology for calculating semantic similarity
across various domains by incorporating corpora-based statistics into a standardized
algorithm. Employing an edge-based approach with a lexical database, the method achieves
high correlation values (r=0.8753 for word and r=0.8793 for sentence similarity) on
benchmark standards and human similarity datasets, outperforming other unsupervised
models. The approach involves disambiguating sentences, utilizing information content from
corpora, and forming semantic vectors for accurate similarity calculations, making it a
valuable tool with low computing overhead for professionals in diverse domains.
OPEN-SOURCE TOOLS:
1. NLTK
2. scikit-learn (sklearn)
3. cosine_similarity (from sklearn.metrics.pairwise)
PROPOSED ARCHITECTURE
The proposed architecture facilitates the calculation of cosine similarity between two
sentences, aiding in determining their semantic similarity. It involves a series of key steps,
starting with preprocessing to standardize the sentences for comparison. This includes
tokenization, lowercasing, removing stop words, and stemming or lemmatization to handle
word variations. Following preprocessing, each sentence is transformed into a vector in a
high-dimensional space.
This vectorization process can be achieved using the bag-of-words (BoW) model, where each
dimension represents a unique word, or through word embeddings like Word2Vec, GloVe, or
FastText, which represent words as dense vectors in a continuous space. The cosine similarity
is then calculated between the vector representations of the sentences. This metric measures
the cosine of the angle between the vectors, providing a score between -1 and 1. A score
closer to 1 indicates higher similarity, while 0 indicates no similarity and -1 indicates
dissimilarity.
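As a quick, hand-made illustration of this score (toy vectors only, not the full pipeline described below):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vec_a = np.array([[1.0, 1.0, 0.0]])  # toy vector standing in for sentence A
vec_b = np.array([[1.0, 0.0, 1.0]])  # toy vector standing in for sentence B
print(cosine_similarity(vec_a, vec_b)[0][0])  # 0.5: the vectors partially share a direction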
To interpret the similarity score, a threshold can be set based on the application's
requirements. For instance, a score exceeding a certain threshold may indicate that the
sentences are similar, while scores below the threshold imply dissimilarity. Efficiency can be
optimized by employing techniques such as approximate nearest neighbour search or
dimensionality reduction, especially for large datasets, to speed up the vectorization and
similarity calculation processes.
1. Preprocessing:
Tokenization: Break each sentence into individual words or tokens.
Lowercasing: Convert all words to lowercase to ensure case-insensitive comparison.
Removing stop words: Eliminate common words like "the", "is", "and", etc., as they don't
contribute much to the meaning.
Stemming or Lemmatization: Reduce words to their base or root form to handle
variations like "running" and "ran" to "run".
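A minimal sketch of this preprocessing step using NLTK (assuming the punkt, stopwords, and wordnet resources have already been downloaded; the preprocess helper name is ours):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess(sentence):
    # Tokenize, lowercase, remove stop words, and lemmatize
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(sentence.lower())             # tokenization + lowercasing
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation tokens
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("The quick brown foxes jumped over the lazy dogs"))
# ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']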
2. Vectorization:
After preprocessing, each sentence is represented as a vector in a high-dimensional space.
One common technique is using the bag-of-words (BoW) model. Each dimension in the
vector represents a unique word, and the value at each dimension represents the frequency
of that word in the sentence.
Another approach is to use word embeddings like Word2Vec, GloVe, or FastText to
represent words as dense vectors in a continuous vector space.
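For instance, a bag-of-words sketch using scikit-learn's CountVectorizer (one of several possible vectorizers; TF-IDF weighting is a common alternative):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the quick brown fox jumps over the lazy dog",
             "a brown fox jumps over a lazy dog"]
vectorizer = CountVectorizer()
bow_vectors = vectorizer.fit_transform(sentences)  # one row per sentence
print(vectorizer.get_feature_names_out())          # each column is a unique word
print(bow_vectors.toarray())                       # word frequencies per sentence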
3. Cosine Similarity Calculation:
The cosine similarity between the two sentence vectors A and B is computed as
cosine_similarity(A, B) = (A · B) / (||A|| ||B||)
where A · B denotes the dot product of A and B, and ||A|| and ||B|| are the magnitudes of the
vectors. For example, for A = (1, 2) and B = (2, 4), A · B = 10 and ||A|| ||B|| = √5 · 2√5 = 10,
giving a similarity of 1, as the vectors point in the same direction.
4. Interpretation:
A score closer to 1 indicates higher similarity, while a score near 0 indicates little or no
similarity between the sentences.
5. Thresholding:
A threshold can be set based on the application's requirements; scores above it mark the
sentences as similar, and scores below it as dissimilar.
6. Optimization:
Depending on the size of the dataset and computational resources, you might need to
optimize the vectorization and cosine similarity calculation steps for efficiency.
Techniques like using approximate nearest neighbour search or dimensionality reduction
can be employed to speed up the process.
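A sketch of such an optimization (illustrative only: scikit-learn's NearestNeighbors performs exact search; libraries such as Annoy or FAISS provide true approximate nearest-neighbour indexes):

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((10000, 300)).astype('float32')  # hypothetical sentence vectors

svd = TruncatedSVD(n_components=50)             # dimensionality reduction
X_reduced = svd.fit_transform(X)

nn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(X_reduced)
distances, indices = nn.kneighbors(X_reduced[:1])  # neighbours of the first sentence
print(indices[0])  # indices of the 5 most similar sentences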
IMPLEMENTATION
import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# NLTK resources needed once: nltk.download('punkt'); nltk.download('stopwords')

def load_glove_embeddings(embedding_file):
    # Load pre-trained GloVe vectors into a word -> vector dictionary
    embeddings_index = {}
    with open(embedding_file, encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

def sentence_to_vector(sentence, embeddings_index):
    # Average the embeddings of the sentence's non-stop-word tokens
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in word_tokenize(sentence.lower()) if t not in stop_words]
    vectors = [embeddings_index[t] for t in tokens if t in embeddings_index]
    if not vectors:
        return None  # Return None if no word in the sentence is in the embeddings
    sentence_vector = np.mean(vectors, axis=0)  # Mean of word vectors represents the sentence
    return sentence_vector

def sentence_similarity(sentence1, sentence2, embeddings_index):
    # Cosine similarity between the mean-pooled vectors of two sentences
    vec1 = sentence_to_vector(sentence1, embeddings_index)
    vec2 = sentence_to_vector(sentence2, embeddings_index)
    if vec1 is None or vec2 is None:
        return 0.0  # No vocabulary overlap with the embeddings
    cosine_sim = cosine_similarity(vec1.reshape(1, -1), vec2.reshape(1, -1))
    return cosine_sim[0][0]

# Example sentences
sentence1 = "The quick brown fox jumps over the lazy dog"
sentence2 = "A brown fox jumps over a lazy dog"
sentence3 = "The sky is blue"
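A usage sketch for the functions above (the GloVe file path is a placeholder; any pre-trained GloVe file, e.g. glove.6B.100d.txt, can be substituted):

embeddings = load_glove_embeddings('glove.6B.100d.txt')  # placeholder path
print(sentence_similarity(sentence1, sentence2, embeddings))  # near-paraphrases: high score
print(sentence_similarity(sentence1, sentence3, embeddings))  # unrelated topics: low score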
REFERENCES
[1] Si, S., Zheng, W., Zhou, L., & Zhang, M. (2019) 'Sentence Similarity Computation in
Question Answering Robot', Journal of Physics: Conference Series, 1237(2), 022093.
doi:10.1088/1742-6596/1237/2/022093.
[2] Lopez-Gazpio, I., Maritxalar, M., Lapata, M., & Agirre, E. (2019) 'Word n-gram
attention models for sentence similarity and inference', Expert Systems with
Applications, 132, pp. 1–11. doi:10.1016/j.eswa.2019.04.054.
[3] Singh, R., Singh, S. (2021) 'Text Similarity Measures in News Articles by Vector
Space Model Using NLP', J. Inst. Eng. India Ser. B, 102, pp. 329–338.
doi:10.1007/s40031-020-00501-5.
[6] Quan, Z., Wang, Z. -J., Le, Y., Yao, B., Li, K., & Yin, J. (2019) 'An Efficient
Framework for Sentence Similarity Modeling', IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 27(4), pp. 853-865. doi:
10.1109/TASLP.2019.2899494.
[7] Wang, X., He, J., Wang, P., Zhou, Y., Sun, T., & Qiu, X. (2024) 'DenoSent: A
Denoising Objective for Self-Supervised Sentence Representation Learning', arXiv
preprint arXiv:2401.13621.
[8] Piroozfar, P., Abdous, M., Minaei Bidgoli, B., et al. (2024) 'Ensemble Transformer
for Cross Lingual Semantic Textual Similarity', Preprint, Research Square,
[https://doi.org/10.21203/rs.3.rs-3860915/v1] (Version 1), 16 January.
[9] Gupta, A., Sharma, K., & Goyal, K. K. (2023) 'Computation of Similarity Between
Two Pair of Sentence Using Word-Net', International Journal of Intelligent Systems
and Applications in Engineering, 11(5s), pp. 458–467.
[10] Kale, N., Dahake, R. P., & Metre, K. V. (2023) 'Text summarization based on human
behavioural learning model', Journal of Integrated Science and Technology, 12(2), p.
741.
[11] Beken Fikri, F., Oflazer, K., & Yanikoglu, B. (2021) 'Semantic Similarity Based
Evaluation for Abstractive News Summarization', In Proceedings of the 1st Workshop
on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pp. 24–33.
Association for Computational Linguistics.
[12] Abujar, S., Hasan, M., Hossain, S.A. (2019) 'Sentence Similarity Estimation for Text
Summarization Using Deep Learning', In: Kulkarni, A., Satapathy, S., Kang, T.,
Kashan, A. (eds) Proceedings of the 2nd International Conference on Data
Engineering and Communication Technology. Advances in Intelligent Systems and
Computing, vol 828. Springer, Singapore.
[13] Yoo, Y., Heo, T.-S., Park, Y., & Kim, K. (2021) 'A Novel Hybrid Methodology of
Measuring Sentence Similarity', Symmetry, 13(8), p. 1442.
[14] Sun, X., Meng, Y., Ao, X., Wu, F., Zhang, T., Li, J., & Fan, C. (2022) 'Sentence
Similarity Based on Contexts', Transactions of the Association for Computational
Linguistics, 10, pp. 573–588. doi:10.1162/tacl_a_00477.
[15] Farouk, M. (2020) 'Measuring Text Similarity Based on Structure and Word
Embedding', Cognitive Systems Research. doi:10.1016/j.cogsys.2020.04.002.
[16] Li, Y., McLean, D., Bandar, Z. A., O’Shea, J. D., & Crockett, K. (2006) 'Sentence
similarity based on semantic nets and corpus statistics', IEEE Transactions on
Knowledge and Data Engineering, 18(8), pp. 1138–1150. doi: 10.1109/tkde.2006.130.
[17] Wang, Z., Mi, H., Ittycheriah, A. (2017) 'Sentence Similarity Learning by Lexical
Decomposition and Composition', arXiv preprint arXiv:1602.07019.
[18] Haque, R., Naskar, S. K., Way, A., Costa-jussa, M. R., & Banchs, R. E. (2010)
'Sentence Similarity-Based Source Context Modelling in PBSMT', 2010 International
Conference on Asian Language Processing. doi: 10.1109/ialp.2010.45.
[19] Ramaprabha, J., Das, S., Mukerjee, P. (2018) 'Survey on Sentence Similarity
Evaluation using Deep Learning', Journal of Physics: Conference Series, 1000,
012070.
[20] Zhang, P., Huang, X., Wang, Y., Jiang, C., He, S., & Wang, H. (2021) 'Semantic
Similarity Computing Model Based on Multi Model Fine-Grained Nonlinear Fusion',
IEEE Access, 9, pp. 8433–8443. doi:10.1109/access.2021.3049378.
[21] Tien, N. H., Le, N. M., Tomohiro, Y., & Tatsuya, I. (2019) 'Sentence modeling via
multiple word embeddings and multi-level comparison for semantic textual
similarity', Information Processing & Management, 56(6), 102090.
doi:10.1016/j.ipm.2019.102090.
[22] Zhang, S., Zheng, X., & Hu, C. (2015) 'A survey of semantic similarity and its
application to social network analysis', 2015 IEEE International Conference on Big
Data (Big Data), Santa Clara, CA, USA, pp. 2362-236
[23] Xu, Y., Guo, X., Hao, J., Ma, J., Lau, R. Y. K., & Xu, W. (2012) 'Combining social
network and semantic concept analysis for personalized academic researcher
recommendation', Decision Support Systems, 54(1), pp. 564–573. ISSN 0167-9236.
[24] Saad, S. M., & Kamarudin, S. S. (2013) 'Comparative analysis of similarity measures
for sentence level semantic measurement of text', 2013 IEEE International Conference
on Control System, Computing and Engineering. doi:10.1109/iccsce.2013.671993.
[25] Blagec, K., Xu, H., Agibetov, A., & Samwald, M. (2019) 'Neural sentence embedding
models for semantic similarity estimation in the biomedical domain', BMC
Bioinformatics, 20(1). doi:10.1186/s12859-019-2789-2.
[26] Qurashi, A. W., Holmes, V., & Johnson, A. P. (2020) 'Document Processing:
Methods for Semantic Text Similarity Analysis', 2020 International Conference on
INnovations in Intelligent SysTems and Applications (INISTA).
doi:10.1109/inista49547.2020.9194665.
[27] Soler, A.G., Apidianaki, M., & Allauzen, A. (2019) 'Word Usage Similarity
Estimation with Sentence Representations and Automatic Substitutes', arXiv preprint
arXiv:1905.08377.
[28] Ezzikouri, H., Madani, Y., Erritali, M., & Oukessou, M. (2019) 'A New Approach for
Calculating Semantic Similarity between Words Using WordNet and Set Theory',
Procedia Computer Science, 151, pp. 1261–1265. doi:10.1016/j.procs.2019.04.182.
[29] Shahmirzadi, O., Lugowski, A., & Younge, K. (2019) 'Text Similarity in Vector
Space Models: A Comparative Study', 2019 18th IEEE International Conference On
Machine Learning And Applications (ICMLA). doi:10.1109/icmla.2019.00120.
[30] Pawar, A., & Mago, V. (2019) 'Challenging the boundaries of unsupervised learning
for semantic similarity', IEEE Access, 1–1. doi:10.1109/access.2019.2891692.