Text Paraphrase Detection

The document discusses text paraphrase detection. It begins with introducing paraphrase detection as determining if two sentences have roughly the same meaning. Common evaluation metrics are accuracy and F1 score, and common corpora used are MRPC, Quora Question Pairs, and GLUE. Sentence BERT is then discussed as an improvement over BERT for paraphrase detection by using sentence embeddings rather than token embeddings. Finally, some optimization techniques for paraphrase detection are covered such as cross-encoding, clustering, knowledge distillation, and augmented SBERT.


Topic 8: Text Paraphrase Detection

Instructor:

Group 6:
● Lê Thanh Tùng (21C11030)
● Lê Trung Thành (21C11001)
● Lại Việt Anh (21C11009)
● Nguyễn Lê Quang Hùng (21C11029)
● Hoàng Minh Thanh
Contents

1. Introduction
2. Related work
3. Sentence BERT (SBERT)
4. Optimization techniques
5. Demo
6. Q & A
1. Introduction
Introduction

❑ Paraphrase Detection
❑ Common Evaluation Metrics
❑ Common Corpora
❑ Text Paraphrase Detection Challenge
What is Paraphrase Detection?

❑ Given two sentences, determine whether they roughly have the same meaning [1].
❑ Usually formalized as a binary classification problem [1].

Example sentence pair [1]:
❑ Mary gave birth to a son in 2000.
❑ He is 14 years old, and his mother is Mary.
Common Evaluation Metrics

❑ Accuracy
❑ F1 score
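
For illustration (an addition to the slides; the labels below are invented), both metrics are one call each in scikit-learn:

# Sketch: accuracy and F1 for a binary paraphrase classifier.
# The gold labels and predictions are made up for illustration.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # 1 = paraphrase, 0 = not a paraphrase
y_pred = [1, 0, 0, 1, 0, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))  # 5/6 ≈ 0.833
print("F1:", f1_score(y_true, y_pred))              # 2PR/(P+R) ≈ 0.857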
Common Corpora

❑ MRPC [2]
❑ Quora Question Pairs [3]
❑ GLUE [4]

[Chart: number of papers mentioning each dataset (GLUE, MRPC, Quora Question Pairs) per year, 2018–2022; counts range from 0 to roughly 700.]
Text Paraphrase Detection Challenge

❑ Plagiarism is a serious problem in science.
❑ However, paraphrase plagiarism has not been extensively explored yet; this task is a preliminary step toward detecting it.
❑ The purpose of this competition is to invite researchers to contribute new methods for our proposed problem and for text paraphrase detection in general.
❑ Completing this task promises to advance techniques for paraphrase plagiarism detection.
Text Paraphrase Detection Challenge

❑ Evaluation metric:
▪ F1 score
❑ Baseline:
▪ sent_tokenize from nltk.tokenize
▪ SBERT embeddings
▪ PyNNDescent for fast Approximate Nearest Neighbors (a sketch of this baseline follows)
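
A minimal end-to-end sketch of this baseline; the model checkpoint and the toy documents are our assumptions, not the competition's exact setup:

# Baseline sketch: sentence-split with nltk, embed with SBERT,
# index and query with PyNNDescent. Checkpoint and data are illustrative.
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
import pynndescent

nltk.download("punkt", quiet=True)  # newer nltk versions may need "punkt_tab"

source_doc = "Mary gave birth to a son in 2000. The boy grew up in Hanoi. Python is popular."
suspicious_doc = "In 2000, Mary had a baby boy. He was raised in Hanoi."

src_sents = sent_tokenize(source_doc)
sus_sents = sent_tokenize(suspicious_doc)

model = SentenceTransformer("all-MiniLM-L6-v2")
src_emb = model.encode(src_sents, normalize_embeddings=True)
sus_emb = model.encode(sus_sents, normalize_embeddings=True)

# Index the source sentences; for each suspicious sentence, a nearby
# neighbor (small cosine distance) suggests a paraphrase.
# n_neighbors is kept tiny only because this toy corpus has 3 sentences.
index = pynndescent.NNDescent(src_emb, metric="cosine", n_neighbors=2)
neighbors, distances = index.query(sus_emb, k=1)
for sent, idx, dist in zip(sus_sents, neighbors[:, 0], distances[:, 0]):
    print(f"{dist:.3f}  {sent!r} -> {src_sents[idx]!r}")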
2. Related work
Methods

❑ Rule Based [5]
❑ Machine Learning Based [6]
❑ Deep Learning Based

Rule Based

Deep Learning Based

❑ Consider sentences as sequences of characters or terms
❑ Represent the given sentences in a vector space
● Lexical
● Syntactic
● Semantic
❑ Compare the similarity between vectors
● Euclidean distance
● Cosine distance
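
For instance, a quick sketch of the two comparisons (the vectors are invented for illustration):

# Sketch: comparing two sentence vectors; the numbers are made up and
# would normally come from a lexical, syntactic, or semantic representation.
import numpy as np

v1 = np.array([0.2, 0.8, 0.1])
v2 = np.array([0.25, 0.7, 0.05])

euclidean = np.linalg.norm(v1 - v2)
cosine_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Cosine distance:    {1 - cosine_sim:.3f}")  # distance = 1 - similarity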
Deep Learning Based

❑ CNN
❑ RNN-based
❑ LSTM
❑ Transformer-based
3. Sentence BERT (SBERT)
BERT in Paraphrase Detection

❑ BERT uses token embeddings.
❑ BERT represents each token as an embedding vector.
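
To make this concrete, a sketch with the Hugging Face transformers library (our choice of library and checkpoint; the slides name neither):

# Sketch: per-token embeddings from BERT's last hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Mary gave birth to a son in 2000.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token, including [CLS] and [SEP].
print(outputs.last_hidden_state.shape)  # (batch_size, num_tokens, 768)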
Why Sentence BERT?

❑ BERT is slow for paraphrase detection: scoring every pair in a dataset of 10,000 sentences requires n(n−1)/2 = 10,000 × 9,999 / 2 ≈ 50,000,000 inference computations,

about 65 hours with BERT.
Why Sentence BERT?

❑ SBERT? See the discussion at:

https://github.com/UKPLab/sentence-transformers/issues/924
Compare SBERT vs BERT

❑ SBERT (Bi-Encoder) vs BERT (Cross-Encoder)
❑ BERT uses token embeddings; SBERT uses sentence embeddings.
❑ BERT uses a classifier head; SBERT uses cosine similarity.
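
A small usage sketch with the sentence-transformers library (the checkpoint name is our illustrative choice):

# Sketch: scoring a sentence pair with SBERT (bi-encoder) + cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How to learn Python online?", "How to learn Python on the web?"]

embeddings = model.encode(sentences, convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {score.item():.3f}")  # higher -> more likely a paraphrase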
Sentence BERT vs BERT

❑ SBERT uses sentence embeddings.


Sentence BERT vs BERT
Retrieve & Re-Rank

❑ For complex semantic search scenarios, a retrieve & re-rank pipeline is advisable.
Semantic Search
❑ Embed all data in your corpus.
Ex:
● How to learn Python online?
● How to learn Python on the web?
● What is Python?
❑ Type:
● Symmetric semantic search: SBERT
● Asymmetric semantic search: MS MARCO models
❑ Method:
● Elasticsearch
● Approximate Nearest Neighbors
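
Putting the two slides together, a compact retrieve & re-rank sketch (the checkpoint names are illustrative assumptions):

# Sketch: retrieve candidates with a bi-encoder, re-rank with a cross-encoder.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

corpus = [
    "How to learn Python online?",
    "How to learn Python on the web?",
    "What is Python?",
]
query = "Where can I study Python on the internet?"

# Step 1: fast retrieval over the embedded corpus (bi-encoder).
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

# Step 2: slower but more accurate re-scoring of the candidates (cross-encoder).
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
for (_, doc), score in sorted(zip(pairs, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")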
4. Optimization techniques
Optimization techniques

❑ Cross-encoder
❑ Clustering
❑ Embedding in Knowledge Graph
❑ Concurrent Paraphrase Mining
❑ Model Distillation
❑ Augmented SBERT (Domain-transfer)
Cross-encoder

❑ SBERT (Bi-Encoder) vs BERT (Cross-Encoder)

[Diagram: the SBERT bi-encoder — two sentences pass through weight-shared SBERT networks, each producing a sentence vector; the two vectors are compared with cosine similarity.]
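
By contrast with the bi-encoder above, a cross-encoder scores the raw sentence pair directly. A sketch (the QQP-trained checkpoint is our illustrative pick):

# Sketch: direct pair scoring with a cross-encoder trained on Quora Question Pairs.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/quora-distilroberta-base")
scores = model.predict([("How to learn Python online?",
                         "How to learn Python on the web?")])
print(scores)  # probability-like paraphrase score for the pair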
Clustering and BERTopic

❑ Clustering and BERTopic
❑ Paraphrase detection on new topics
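
One possible sketch, grouping near-duplicate sentences with sentence-transformers' community detection (threshold and minimum cluster size are illustrative assumptions):

# Sketch: clustering sentence embeddings into paraphrase groups.
from sentence_transformers import SentenceTransformer, util

sentences = [
    "How to learn Python online?",
    "How to learn Python on the web?",
    "What is Python?",
    "What is the Python language?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, convert_to_tensor=True)

# Groups sentences whose pairwise cosine similarity exceeds the threshold.
clusters = util.community_detection(embeddings, threshold=0.75, min_community_size=2)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i}:", [sentences[idx] for idx in cluster])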
Embedding in Knowledge Graph
❑ PyNNDescent: for fast Approximate Nearest Neighbors
❑ Building neighbor graphs
❑ Searching a nearest neighbor graph
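
A minimal build-and-search sketch with PyNNDescent (random vectors stand in for sentence embeddings):

# Sketch: building a nearest-neighbor graph and searching it with PyNNDescent.
import numpy as np
import pynndescent

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 384)).astype(np.float32)  # stand-in data

index = pynndescent.NNDescent(embeddings, metric="cosine", n_neighbors=30)
index.prepare()  # optional: pre-builds the search graph for a faster first query

query = rng.normal(size=(1, 384)).astype(np.float32)
neighbor_ids, distances = index.query(query, k=5)
print(neighbor_ids, distances)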
Concurrent Paraphrase Mining
Concurrent Paraphrase Mining
❑ top_k: for each sentence, we retrieve up to top_k other sentences (sketched below).
❑ Example: 20k sentences are chunked into 20 × 1,000 sentences.
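sentence-transformers ships a paraphrase_mining utility that performs exactly this chunked comparison; a sketch with illustrative parameters:

# Sketch: chunked paraphrase mining instead of scoring all n*(n-1)/2 pairs at once.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "How to learn Python online?",
    "How to learn Python on the web?",
    "What is Python?",
]

pairs = util.paraphrase_mining(
    model,
    sentences,
    top_k=100,               # keep at most top_k matches per sentence
    query_chunk_size=1000,   # sentences scored per query chunk
    corpus_chunk_size=1000,  # corpus sentences compared per chunk
)
for score, i, j in pairs[:5]:
    print(f"{score:.3f}  {sentences[i]} <-> {sentences[j]}")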
Distillation in Paraphrase Detection

❑ Knowledge Distillation
❑ Dimensionality Reduction
❑ Quantization
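
As one concrete instance of dimensionality reduction (a sketch with scikit-learn PCA; the 128-dimension target is an arbitrary choice):

# Sketch: shrinking sentence embeddings with PCA to cut storage and speed up search.
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["How to learn Python online?", "What is Python?"] * 100
embeddings = model.encode(sentences)  # 384-dimensional for this checkpoint

pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings)
print(embeddings.shape, "->", reduced.shape)  # (200, 384) -> (200, 128)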
Augmented SBERT
❑ Scenario 1: Limited or small annotated datasets
❑ Step 1: Train a cross-encoder (BERT) on the small (gold) annotated dataset
❑ Step 2.1: Create new pairs by recombination and reduce the pairs via BM25 or semantic search
❑ Step 2.2: Weakly label the new pairs with the cross-encoder (BERT); these are the silver pairs, or (silver) dataset
❑ Step 3: Finally, train a bi-encoder (SBERT) on the extended (gold + silver) training dataset (a compact sketch follows)
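
A compact sketch of the three steps, assuming the classic sentence-transformers training API; checkpoints, data, and hyperparameters are illustrative:

# Sketch: Augmented SBERT, Scenario 1 (gold + silver training).
from torch.utils.data import DataLoader
from sentence_transformers import (CrossEncoder, InputExample,
                                   SentenceTransformer, losses)

# Step 1: train a cross-encoder (BERT) on the small gold dataset.
gold = [InputExample(texts=["How to learn Python online?",
                            "How to learn Python on the web?"], label=1.0)]
cross = CrossEncoder("bert-base-uncased", num_labels=1)
cross.fit(train_dataloader=DataLoader(gold, shuffle=True, batch_size=16), epochs=1)

# Step 2: weakly label recombined pairs (mined via BM25 or semantic search).
candidates = [["What is Python?", "How to learn Python online?"]]
silver = [InputExample(texts=pair, label=float(score))
          for pair, score in zip(candidates, cross.predict(candidates))]

# Step 3: train the bi-encoder (SBERT) on the extended gold + silver dataset.
bi = SentenceTransformer("all-MiniLM-L6-v2")
loader = DataLoader(gold + silver, shuffle=True, batch_size=16)
bi.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(bi))], epochs=1)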
Augmented SBERT

❑ Scenario 2: No annotated datasets
● Step 1: Train a cross-encoder (BERT) from scratch on a source dataset for which we have annotations
● Step 2: Use this cross-encoder (BERT) to label your target dataset, i.e. unlabeled sentence pairs
● Step 3: Finally, train a bi-encoder (SBERT) on the labeled target dataset
5. Demo
SentenceBERT

❑ SentenceBERT
● SentenceBERT
https://colab.research.google.com/drive/1JiiMKFIsnRmESeS3GWLR3nf8J5iI-_1b?usp=sharing
● SentenceBERT in Publications Paper
https://colab.research.google.com/drive/1zyjffqQZVViCH79RPUZGEK-slN3LX3A9?usp=sharing
❑ VietnameseBERT
● Paraphrase detection in Vietnamese using SBERT
https://colab.research.google.com/drive/1vff8gXZufZ70_GF2XrM1f1GzNXVkmWyT?usp=sharing
Q&A
References

1. Wenpeng Yin and Hinrich Schütze. 2015. Convolutional Neural Network for Paraphrase Identification. In Proceedings of NAACL 2015.
2. William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
3. First Quora Dataset Release: Question Pairs. Data @ Quora.
4. Alex Wang et al. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.
5. Rahul Bhagat and Eduard Hovy. 2013. What Is a Paraphrase? Computational Linguistics 39(3): 463–472. https://doi.org/10.1162/COLI_a_00166
6. Tedo Vrbanec and Ana Meštrović. 2020. Corpus-Based Paraphrase Detection Experiments and Review. Information 11, 241. https://doi.org/10.3390/info11050241
Thanks
