Text Paraphrase Detection

The document discusses text paraphrase detection. It begins with introducing paraphrase detection as determining if two sentences have roughly the same meaning. Common evaluation metrics are accuracy and F1 score, and common corpora used are MRPC, Quora Question Pairs, and GLUE. Sentence BERT is then discussed as an improvement over BERT for paraphrase detection by using sentence embeddings rather than token embeddings. Finally, some optimization techniques for paraphrase detection are covered such as cross-encoding, clustering, knowledge distillation, and augmented SBERT.


Topic 8: Text Paraphrase Detection

Instructor:

Group 6:
● Lê Thanh Tùng (21C11030)
● Lê Trung Thành (21C11001)
● Lại Việt Anh (21C11009)
● Nguyễn Lê Quang Hùng (21C11029)
● Hoàng Minh Thanh
Contents

1. Introduction
2. Related work
3. Sentence BERT (SBERT)
4. Optimization techniques
5. Demo
6. Q & A
1. Introduction
Introduction

❑ Paraphrase Detection
❑ Common Evaluation Metrics
❑ Common Corpora
❑ Text Paraphrase Detection Challenge
What is Paraphrase Detection?

❑ Given two sentences, determine whether they roughly have the same meaning [1].
❑ Usually formalized as a binary classification problem [1].

Example sentence pair [1]:
❑ Mary gave birth to a son in 2000.
❑ He is 14 years old, and his mother is Mary.
Common Evaluation Metrics

❑ Accuracy
❑ F1 score
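
For illustration (an addition to the slides; the labels below are invented), both metrics are one call each in scikit-learn:

# Sketch: accuracy and F1 for a binary paraphrase classifier.
# The gold labels and predictions are made up for illustration.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # 1 = paraphrase, 0 = not a paraphrase
y_pred = [1, 0, 0, 1, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))  # 5/6 ≈ 0.833
print("F1:", f1_score(y_true, y_pred))              # 2PR/(P+R) ≈ 0.857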
Common Corpora

❑ MRPC [2]
❑ Quora Question Pairs [3]
❑ GLUE [4]

[Chart: number of papers mentioning each dataset (GLUE, MRPC, Quora Question Pairs) per year, 2018–2022; counts range from 0 to roughly 700.]
Text Paraphrase Detection Challenge

❑ Plagiarism is a serious problem in science.
❑ However, paraphrase plagiarism has not been extensively explored yet; this task is a preliminary step toward detecting it.
❑ The purpose of this competition is to invite researchers to contribute new methods for our proposed problem and for text paraphrase detection in general.
❑ Completing this task promises to advance techniques for paraphrase plagiarism detection.
Text Paraphrase Detection Challenge

❑ Evaluation metric:
▪ F1 score
❑ Baseline:
▪ sent_tokenize from nltk.tokenize
▪ SBERT embeddings
▪ PyNNDescent for fast Approximate Nearest Neighbors (a sketch of this baseline follows)
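
A minimal end-to-end sketch of this baseline; the model checkpoint and the toy documents are our assumptions, not the competition's exact setup:

# Baseline sketch: sentence-split with nltk, embed with SBERT,
# index and query with PyNNDescent. Checkpoint and data are illustrative.
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
import pynndescent

nltk.download("punkt", quiet=True)  # newer nltk versions may need "punkt_tab"

source_doc = "Mary gave birth to a son in 2000. The boy grew up in Hanoi. Python is popular."
suspicious_doc = "In 2000, Mary had a baby boy. He was raised in Hanoi."

src_sents = sent_tokenize(source_doc)
sus_sents = sent_tokenize(suspicious_doc)

model = SentenceTransformer("all-MiniLM-L6-v2")
src_emb = model.encode(src_sents, normalize_embeddings=True)
sus_emb = model.encode(sus_sents, normalize_embeddings=True)

# Index the source sentences; for each suspicious sentence, a nearby
# neighbor (small cosine distance) suggests a paraphrase.
# n_neighbors is kept tiny only because this toy corpus has 3 sentences.
index = pynndescent.NNDescent(src_emb, metric="cosine", n_neighbors=2)
neighbors, distances = index.query(sus_emb, k=1)
for sent, idx, dist in zip(sus_sents, neighbors[:, 0], distances[:, 0]):
    print(f"{dist:.3f}  {sent!r} -> {src_sents[idx]!r}")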
2. Related work
Methods

❑ Rule Based [5]
❑ Machine Learning Based [6]
❑ Deep Learning Based

Rule Based

Deep Learning Based

❑ Consider sentences as sequences of characters or terms
❑ Represent the given sentences in a vector space
● Lexical
● Syntactic
● Semantic
❑ Compare the similarity between vectors
● Euclidean distance
● Cosine distance
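
For instance, a quick sketch of the two comparisons (the vectors are invented for illustration):

# Sketch: comparing two sentence vectors; the numbers are made up and
# would normally come from a lexical, syntactic, or semantic representation.
import numpy as np

v1 = np.array([0.2, 0.8, 0.1])
v2 = np.array([0.25, 0.7, 0.05])

euclidean = np.linalg.norm(v1 - v2)
cosine_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Cosine distance:    {1 - cosine_sim:.3f}")  # distance = 1 - similarity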
Deep Learning Based

❑ CNN
❑ RNN-based
❑ LSTM
❑ Transformer-based
3. Sentence BERT (SBERT)
BERT in Paraphrase Detection

❑ BERT uses token embeddings.
❑ BERT represents each token as an embedding vector.
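
To make this concrete, a sketch with the Hugging Face transformers library (our choice of library and checkpoint; the slides name neither):

# Sketch: per-token embeddings from BERT's last hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Mary gave birth to a son in 2000.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token, including [CLS] and [SEP].
print(outputs.last_hidden_state.shape)  # (batch_size, num_tokens, 768)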
Why Sentence BERT?

❑ BERT is slow for paraphrase detection: scoring every pair in a dataset of 10,000 sentences requires n(n−1)/2 = 10,000 × 9,999 / 2 ≈ 50,000,000 inference computations,

about 65 hours with BERT.
Why Sentence BERT?

❑ SBERT? See the discussion at:

https://github.com/UKPLab/sentence-transformers/issues/924
Compare SBERT vs BERT

❑ SBERT (Bi-Encoder) vs BERT (Cross-Encoder)
❑ BERT uses token embeddings; SBERT uses sentence embeddings.
❑ BERT uses a classifier head; SBERT uses cosine similarity.
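
A small usage sketch with the sentence-transformers library (the checkpoint name is our illustrative choice):

# Sketch: scoring a sentence pair with SBERT (bi-encoder) + cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How to learn Python online?", "How to learn Python on the web?"]

embeddings = model.encode(sentences, convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {score.item():.3f}")  # higher -> more likely a paraphrase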
Sentence BERT vs BERT

❑ SBERT uses sentence embeddings.


Sentence BERT vs BERT
Retrieve & Re-Rank

❑ For complex semantic search scenarios, a retrieve & re-rank pipeline is advisable.
Semantic Search
❑ Embed all data in your corpus.
Ex:
● How to learn Python online?
● How to learn Python on the web?
● What is Python?
❑ Type:
● Symmetric semantic search: SBERT
● Asymmetric semantic search: MS MARCO models
❑ Method:
● Elasticsearch
● Approximate Nearest Neighbors
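
Putting the two slides together, a compact retrieve & re-rank sketch (the checkpoint names are illustrative assumptions):

# Sketch: retrieve candidates with a bi-encoder, re-rank with a cross-encoder.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

corpus = [
    "How to learn Python online?",
    "How to learn Python on the web?",
    "What is Python?",
]
query = "Where can I study Python on the internet?"

# Step 1: fast retrieval over the embedded corpus (bi-encoder).
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

# Step 2: slower but more accurate re-scoring of the candidates (cross-encoder).
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
for (_, doc), score in sorted(zip(pairs, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")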
4. Optimization techniques
Optimization techniques

❑ Cross-encoder
❑ Clustering
❑ Embedding in Knowledge Graph
❑ Concurrent Paraphrase Mining
❑ Model Distillation
❑ Augmented SBERT (Domain-transfer)
Cross-encoder

❑ SBERT (Bi-Encoder) vs BERT (Cross-Encoder)

[Diagram: the SBERT bi-encoder — two sentences pass through weight-shared SBERT networks, each producing a sentence vector; the two vectors are compared with cosine similarity.]
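
By contrast with the bi-encoder above, a cross-encoder scores the raw sentence pair directly. A sketch (the QQP-trained checkpoint is our illustrative pick):

# Sketch: direct pair scoring with a cross-encoder trained on Quora Question Pairs.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/quora-distilroberta-base")
scores = model.predict([("How to learn Python online?",
                         "How to learn Python on the web?")])
print(scores)  # probability-like paraphrase score for the pair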
Clustering and BERTopic

❑ Clustering and BERTopic
❑ Paraphrase detection on new topics
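
One possible sketch, grouping near-duplicate sentences with sentence-transformers' community detection (threshold and minimum cluster size are illustrative assumptions):

# Sketch: clustering sentence embeddings into paraphrase groups.
from sentence_transformers import SentenceTransformer, util

sentences = [
    "How to learn Python online?",
    "How to learn Python on the web?",
    "What is Python?",
    "What is the Python language?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, convert_to_tensor=True)

# Groups sentences whose pairwise cosine similarity exceeds the threshold.
clusters = util.community_detection(embeddings, threshold=0.75, min_community_size=2)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i}:", [sentences[idx] for idx in cluster])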
Embedding in Knowledge Graph
❑ PyNNDescent: for fast Approximate Nearest Neighbors
❑ Building neighbor graphs
❑ Searching a nearest neighbor graph
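
A minimal build-and-search sketch with PyNNDescent (random vectors stand in for sentence embeddings):

# Sketch: building a nearest-neighbor graph and searching it with PyNNDescent.
import numpy as np
import pynndescent

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 384)).astype(np.float32)  # stand-in data

index = pynndescent.NNDescent(embeddings, metric="cosine", n_neighbors=30)
index.prepare()  # optional: pre-builds the search graph for a faster first query

query = rng.normal(size=(1, 384)).astype(np.float32)
neighbor_ids, distances = index.query(query, k=5)
print(neighbor_ids, distances)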
Concurrent Paraphrase Mining
Concurrent Paraphrase Mining
❑ top_k: for each sentence, we retrieve up to top_k other sentences (sketched below).
❑ Example: 20k sentences are chunked into 20 × 1,000 sentences.
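sentence-transformers ships a paraphrase_mining utility that performs exactly this chunked comparison; a sketch with illustrative parameters:

# Sketch: chunked paraphrase mining instead of scoring all n*(n-1)/2 pairs at once.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "How to learn Python online?",
    "How to learn Python on the web?",
    "What is Python?",
]

pairs = util.paraphrase_mining(
    model,
    sentences,
    top_k=100,               # keep at most top_k matches per sentence
    query_chunk_size=1000,   # sentences scored per query chunk
    corpus_chunk_size=1000,  # corpus sentences compared per chunk
)
for score, i, j in pairs[:5]:
    print(f"{score:.3f}  {sentences[i]} <-> {sentences[j]}")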
Distillation in Paraphrase Detection

❑ Knowledge Distillation
❑ Dimensionality Reduction
❑ Quantization
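
As one concrete instance of dimensionality reduction (a sketch with scikit-learn PCA; the 128-dimension target is an arbitrary choice):

# Sketch: shrinking sentence embeddings with PCA to cut storage and speed up search.
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How to learn Python online?", "What is Python?"] * 100
embeddings = model.encode(sentences)  # 384-dimensional for this checkpoint

pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings)
print(embeddings.shape, "->", reduced.shape)  # (200, 384) -> (200, 128)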
Augmented SBERT
❑ Scenario 1: Limited or small annotated datasets
❑ Step 1: Train a cross-encoder (BERT) on the small (gold) annotated dataset
❑ Step 2.1: Create new pairs by recombination and reduce the pairs via BM25 or semantic search
❑ Step 2.2: Weakly label the new pairs with the cross-encoder (BERT); these are the silver pairs, or (silver) dataset
❑ Step 3: Finally, train a bi-encoder (SBERT) on the extended (gold + silver) training dataset (a compact sketch follows)
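
A compact sketch of the three steps, assuming the classic sentence-transformers training API; checkpoints, data, and hyperparameters are illustrative:

# Sketch: Augmented SBERT, Scenario 1 (gold + silver training).
from torch.utils.data import DataLoader
from sentence_transformers import (CrossEncoder, InputExample,
                                   SentenceTransformer, losses)

# Step 1: train a cross-encoder (BERT) on the small gold dataset.
gold = [InputExample(texts=["How to learn Python online?",
                            "How to learn Python on the web?"], label=1.0)]
cross = CrossEncoder("bert-base-uncased", num_labels=1)
cross.fit(train_dataloader=DataLoader(gold, shuffle=True, batch_size=16), epochs=1)

# Step 2: weakly label recombined pairs (mined via BM25 or semantic search).
candidates = [["What is Python?", "How to learn Python online?"]]
silver = [InputExample(texts=pair, label=float(score))
          for pair, score in zip(candidates, cross.predict(candidates))]

# Step 3: train the bi-encoder (SBERT) on the extended gold + silver dataset.
bi = SentenceTransformer("all-MiniLM-L6-v2")
loader = DataLoader(gold + silver, shuffle=True, batch_size=16)
bi.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(bi))], epochs=1)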
Augmented SBERT

❑ Scenario 2: No annotated datasets
● Step 1: Train a cross-encoder (BERT) from scratch on a source dataset for which we have annotations
● Step 2: Use this cross-encoder (BERT) to label your target dataset, i.e. unlabeled sentence pairs
● Step 3: Finally, train a bi-encoder (SBERT) on the labeled target dataset
5. Demo
SentenceBERT

❑ SentenceBERT
● SentenceBERT
https://colab.research.google.com/drive/1JiiMKFIsnRmESeS3GWLR3nf8J5iI-_1b?usp=sharing
● SentenceBERT in Publications Paper
https://colab.research.google.com/drive/1zyjffqQZVViCH79RPUZGEK-slN3LX3A9?usp=sharing
❑ VietnameseBERT
● Paraphrase detection in Vietnamese using SBERT
https://colab.research.google.com/drive/1vff8gXZufZ70_GF2XrM1f1GzNXVkmWyT?usp=sharing
Q&A
References

1. Wenpeng Yin and Hinrich Schütze. 2015. Convolutional Neural Network for Paraphrase Identification. In Proceedings of NAACL 2015.
2. William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
3. First Quora Dataset Release: Question Pairs. Data @ Quora.
4. Alex Wang et al. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.
5. Rahul Bhagat and Eduard Hovy. 2013. What Is a Paraphrase? Computational Linguistics 39(3): 463–472. https://doi.org/10.1162/COLI_a_00166
6. Tedo Vrbanec and Ana Meštrović. 2020. Corpus-Based Paraphrase Detection Experiments and Review. Information 11, 241. https://doi.org/10.3390/info11050241
Thanks
