Text Paraphrase Detection
Paraphrase Detection
1. Introduction
2. Related work
3. Sentence BERT (SBERT)
4. Optimization techniques
5. Demo
6. Q & A
1. Introduction
❑ Paraphrase Detection
❑ Common Evaluation Metrics
❑ Common Corpora
❑ Text Paraphrase Detection Challenge
What is Paraphrase Detection?
Examples:
❑ Mary gave birth to a son in 2000 [1].
❑ He is 14 years old, and his mother is Mary [1].
Common Evaluation Metrics
❑ Accuracy
❑ F1 score
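A minimal sketch of both metrics for binary paraphrase labels (1 = paraphrase, 0 = not a paraphrase). The function names and the toy labels are illustrative and do not come from the slides.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 means 'paraphrase'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of pairs whose label was predicted correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall on the 'paraphrase' class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy, imbalanced example: 3 paraphrase pairs out of 8.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
acc = accuracy(y_true, y_pred)  # 5 of 8 correct
score = f1(y_true, y_pred)      # 2*1 / (2*1 + 1 + 2)
```

On imbalanced data like this, accuracy can look acceptable while F1 exposes that most true paraphrases were missed, which is one reason both metrics are usually reported.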
Common Corpora
2018
Text Paraphrase Detection Challenge
❑ Evaluation metric:
▪ F1 score
❑ Baseline:
▪ sent_tokenize from nltk.tokenize
▪ SBERT embeddings
▪ PyNNDescent for fast approximate nearest neighbors
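The shape of the baseline (split into sentences, embed each sentence, look up the nearest neighbor) can be sketched with the standard library only. Every component below is a plainly labeled stand-in: a regex splitter instead of sent_tokenize, bag-of-words counts instead of SBERT embeddings, and exact brute-force search instead of PyNNDescent. The example texts are made up.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Stand-in for nltk.tokenize.sent_tokenize: split after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(sentence):
    # Stand-in for an SBERT embedding: sparse bag-of-words counts.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def nearest_neighbor(query_vec, corpus_vecs):
    # Stand-in for PyNNDescent: exact search over every corpus vector.
    scores = [cosine(query_vec, v) for v in corpus_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


source = "The cat sat on the mat. Stock prices fell sharply today."
query = "Today stock prices fell. A cat was sitting on the mat."

src_sents = split_sentences(source)
src_vecs = [embed(s) for s in src_sents]

# For each query sentence, record its closest source sentence and the score.
matches = []
for sent in split_sentences(query):
    idx, score = nearest_neighbor(embed(sent), src_vecs)
    matches.append((sent, src_sents[idx], score))
```

Brute-force search costs one comparison per corpus sentence for every query; an approximate index such as PyNNDescent trades a small loss in recall for much faster lookups on large corpora.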
2. Related work
Methods
Rule-based
Deep learning-based
❑ CNN
❑ RNN-based
❑ LSTM
❑ Transformer-based
3. Sentence BERT (SBERT)
BERT in Paraphrase Detection
❑ Finding the most similar pair among 10,000 sentences with BERT takes about 65 hours
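The 65-hour figure appears to match the example in the SBERT paper (Reimers and Gurevych, 2019), where BERT as a cross-encoder must score every pair among 10,000 sentences. Assuming that is the intended reference, the arithmetic behind it can be sketched as follows; the per-pair time is derived from the two numbers and is not a measured value.

```python
def pair_count(n):
    """Number of unordered sentence pairs among n sentences: n(n-1)/2."""
    return n * (n - 1) // 2


n = 10_000
pairs = pair_count(n)            # cross-encoder: one forward pass per pair
bi_encoder_passes = n            # bi-encoder: one forward pass per sentence

hours = 65
seconds_per_pair = hours * 3600 / pairs  # implied average cost of one pair
```

The quadratic growth in pairs is what makes the cross-encoder impractical for mining, while a bi-encoder needs only n encodings followed by cheap vector comparisons.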
Why Sentence BERT?
❑ SBERT?
https://github.com/UKPLab/sentence-transformers/issues/924
Compare SBERT vs BERT
❑ Cross-encoder
❑ Clustering
❑ Embedding in Knowledge Graph
❑ Concurrent Paraphrase Mining
❑ Model Distillation
❑ Augmented SBERT (Domain-transfer)
Cross-encoder
[Figure: SBERT diagram. Labels: SBERT, Weight, Vector, Cosine-similarity]
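For reference, the cosine similarity used to compare two sentence vectors, with a small worked example. The symbols u and v and the example vectors are illustrative.

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
           = \frac{\sum_{i} u_i v_i}{\sqrt{\sum_{i} u_i^2}\,\sqrt{\sum_{i} v_i^2}}

% Worked example with u = (1, 2, 2) and v = (2, 1, 2):
% u \cdot v = 2 + 2 + 4 = 8, \quad \lVert u \rVert = \lVert v \rVert = 3
\cos(u, v) = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889
```

The value lies in [-1, 1]; scores near 1 indicate that the two sentences point in nearly the same direction in embedding space.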
Clustering and BERTopic
4. Optimization techniques
❑ Knowledge Distillation
❑ Dimensionality Reduction
❑ Quantization
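The slides do not say which quantization scheme is meant. As one common possibility, here is a minimal sketch of symmetric int8 scalar quantization of a single embedding; the function names and the example vector are assumptions for illustration.

```python
def quantize_int8(vec):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs else 1.0
    codes = [round(x / scale) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


embedding = [0.12, -0.5, 0.33, 0.0, 0.49]
codes, scale = quantize_int8(embedding)
restored = dequantize(codes, scale)

# Rounding error per component is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(embedding, restored))
```

Storing one byte per dimension instead of a 4-byte float cuts memory roughly fourfold, at the cost of a bounded reconstruction error per component.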
Augmented SBERT
❑ Scenario 1: Limited or small annotated datasets
❑ Step 1: Train a cross-encoder (BERT) on the small annotated (gold) dataset
❑ Step 2.1: Create new pairs by recombination and reduce them via BM25 or semantic search
❑ Step 2.2: Weakly label the new pairs with the cross-encoder (BERT). These form the silver dataset
❑ Step 3: Finally, train a bi-encoder (SBERT) on the extended (gold + silver) training dataset
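The data flow of the steps above can be sketched without any model. In this sketch, Step 1 is represented by a hypothetical cross_encoder_score callable passed in by the caller, and token overlap (Jaccard) stands in both for BM25 in Step 2.1 and for the trained cross-encoder in the demo call. Step 3, the actual bi-encoder training, is not shown; the function only returns the extended dataset. All names and sentences are illustrative.

```python
def tokens(sentence):
    return set(sentence.lower().split())


def overlap(a, b):
    # Jaccard token overlap: a simple stand-in for BM25 / a learned scorer.
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def augment(gold, cross_encoder_score, top_k=1):
    """Return gold + silver pairs, each as (sentence_a, sentence_b, label)."""
    sentences = sorted({s for a, b, _ in gold for s in (a, b)})
    seen = {frozenset((a, b)) for a, b, _ in gold}
    silver = []
    for s in sentences:
        # Step 2.1: recombine sentences, keep only the top_k closest partners.
        candidates = [
            t for t in sentences if t != s and frozenset((s, t)) not in seen
        ]
        candidates.sort(key=lambda t: overlap(s, t), reverse=True)
        for t in candidates[:top_k]:
            seen.add(frozenset((s, t)))
            # Step 2.2: weak (silver) label from the cross-encoder.
            silver.append((s, t, cross_encoder_score(s, t)))
    return gold + silver  # input for Step 3 (bi-encoder training)


gold = [
    ("a man is playing a guitar", "a man plays the guitar", 1.0),
    ("a woman is cooking dinner", "a dog runs in the park", 0.0),
]
train = augment(gold, cross_encoder_score=overlap)
silver = train[len(gold):]
```

Reducing candidates before labeling matters because recombination grows quadratically with the number of sentences, and most random pairs are trivially dissimilar.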
5. Demo
❑ SentenceBERT
● SentenceBERT
https://colab.research.google.com/drive/1JiiMKFIsnRmESeS3GWLR3nf8J5iI-_1b?usp=sharing
● SentenceBERT in Publications Paper
https://colab.research.google.com/drive/1zyjffqQZVViCH79RPUZGEK-slN3LX3A9?usp=sharing
❑ VietnameseBERT
● Paraphrase detection in Vietnamese using SBERT
https://colab.research.google.com/drive/1vff8gXZufZ70_GF2XrM1f1GzNXVkmWyT?usp=sharing
6. Q & A
References