(2401.06782) Semantic Similarity Matching For Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method
Chang Che
Mechanical engineering
The George Washington University
Atlanta, USA
[email protected]
Abstract
In the realm of patent document analysis, assessing semantic similarity between phrases presents a significant challenge, notably amplifying the inherent complexities of Cooperative Patent Classification (CPC) research. This study first situates these challenges, recognizing early CPC work while acknowledging its past struggles with language barriers and document intricacy, and then underscores the difficulties that persist in CPC research.
To overcome these challenges and bolster the CPC system, this paper presents two key innovations. Firstly, it introduces an ensemble approach that incorporates four BERT-related models, enhancing semantic similarity accuracy through weighted averaging.
Secondly, a novel text preprocessing method tailored for patent documents is introduced,
featuring a distinctive input structure with token scoring that aids in capturing semantic
relationships during CPC context training, utilizing BCELoss. Our experimental findings
conclusively establish the effectiveness of both our Ensemble Model and novel text pro-
cessing strategies when deployed on the U.S. Patent Phrase to Phrase Matching dataset.
Index Terms: Cooperative Patent Classification (CPC), Data Processing Method, DeBERTa
(Decoding-enhanced BERT).
I Introduction
In the realm of patent document analysis, the precise evaluation of semantic similarity
between phrases poses a significant and fundamental challenge. This paper focuses on ad-
dressing this critical task, highlighting its particular relevance within the context of
Cooperative Patent Classification (CPC). While early publications by Lent et al. [1], Larkey
[2], and Gey et al. [3] laid the foundation for CPC, they also exposed limitations related to
language barriers, precision, and adapting to the complexity of patent documents.
Subsequent research efforts aimed to tackle these challenges by proposing innovative solutions, though each had its shortcomings. Chen and Chiu [4] focused on cross-language matching, with potential limitations in handling diverse patent document formats. Al-Shboul and Myaeng [5] employed Wikipedia for effective query expansion but faced limitations with specialized technical terms.
Ever since the introduction of Deep Learning, such techniques have seen extensive utilization in the field of CPC research. Prasad [6] utilized CPC for bioremediation patent search, enhancing domain understanding. Shalaby et al. [7] introduced
LSTM, boosting patent classification accuracy and adaptability to changing taxonomies. Li
et al. [8], in their deep learning approach, demonstrated improved classification accuracy
but required extensive computational resources.
In the ever-evolving landscape of patent analysis and classification, the year 2023 has witnessed the emergence of significant research contributions. Yoo et al. [11] delve into multi-
label classification of Artificial Intelligence-related patents, employing advanced tech-
niques. Ha and Lee [12] focus on evaluating the Cooperative Patent Classification (CPC)
system, with a particular emphasis on patent embeddings. Hoshino et al. [13] explore IPC
prediction using neural networks and CPC’s IPC classification. Additionally, Pais [14] investigates the CPC system’s link to entity identification in patent text analysis. It is essential to acknowledge that these studies may exhibit certain limitations, offering opportunities for further research and refinement in the patent analysis field.
To overcome these challenges and further enhance the capabilities of the CPC system, this
paper introduces an ensemble approach. In contrast to the traditional methods men-
tioned earlier, the ensemble method leverages the strengths of multiple BERT-related
models, including the DeBERTaV3 [15]-related models Microsoft’s DeBERTa-v3-large and MoritzLaurer’s DeBERTa-v3-large-mnli-fever-anli-ling-wanli, as well as Anferico’s BERT for patents [16] and Google’s ELECTRA-large-discriminator [17]. This ensemble approach seeks to pro-
vide a comprehensive solution to the issues faced in previous research, thereby advanc-
ing the field of CPC.
Our approach involves a novel text preprocessing method (V3) that groups and aggregates
anchor and context pairs, resulting in each pair having an associated target list and score
list. This structured input format adheres to a well-defined pattern, including tokens like
[CLS], [SEP], and [TAR], designed to facilitate the model’s understanding and analysis of
patent document content. Our experiment results demonstrate the effectiveness of our
Ensemble Model and novel text processing strategies when applied to the U.S. Patent
Phrase to Phrase Matching dataset. The main contributions of this work can be summarized as follows:
• We introduce an ensemble approach that combines four BERT-related models through weighted averaging to enhance semantic similarity accuracy.
• We propose a novel text preprocessing method (V3) tailored to patent documents, featuring a token-scoring input structure trained with BCELoss.
• Our experiments confirm the efficacy of our Ensemble Model and novel text processing strategies on the U.S. Patent Phrase to Phrase Matching dataset.
The remainder of this paper is structured as follows: The introduction sets the stage for understanding the challenges in CPC research. The related work section provides an overview
of prior research efforts in the field. The algorithm and model section delves into the in-
novative ensemble approach and novel text preprocessing method. The conclusion sec-
tion summarizes the contributions and the potential impact of this research on CPC
analysis.
II RELATED WORK
A number of initial publications established the groundwork for the Cooperative Patent
Classification (CPC) system. Lent et al. [1] explored text data trends, relevant to CPC’s
patent document organization. Larkey [2] contributed to patent search and classification,
aligning with CPC’s goal of effective categorization.
CPC research has spanned language barriers, precision, and deep learning to advance
patent classification and analysis. Notably, Al-Shboul and Myaeng’s work [5] introduced
“Wikipedia-based query phrase expansion” to enhance CPC’s search precision and recall.
Due to the rapid advancements in deep learning, an increasing number of studies are be-
ing employed in the realm of CPC research. Prasad[6] employed Cooperative Patent
Classification (CPC) to conduct a comprehensive search for bioremediation patents, con-
tributing to an enhanced understanding of the patent landscape in this domain. Shalaby
et al. [7] introduced an innovative method using Long Short-Term Memory (LSTM) networks that enhanced the accuracy of patent classification, offering greater adaptability to
changing patent taxonomies and more efficient patent organization and retrieval. Li et
al.’s [8] “DeepPatent” with convolutional neural networks and word embeddings contrib-
utes to evolving and refining CPC’s capabilities. Furthermore, studies enhanced CPC using
BERT techniques, elevating patent document classification accuracy and efficiency. Lee
and Hsiang fine-tuned a BERT model for patent classification in their pioneering work
“PatentBERT” [9].
In the latest research conducted in 2023, the exploration of the Cooperative Patent
Classification (CPC) system has continued to evolve. Yoo et al. [11] examine multi-label
classification of Artificial Intelligence-related patents, utilizing Modified D2SBERT and
Sentence Attention mechanisms. Meanwhile, Ha and Lee’s [12] article explores the effec-
tiveness of the CPC system, focusing on patent embeddings. Hoshino et al. [13] investigate
IPC prediction using neural networks and CPC’s IPC classification for patent document
content. Additionally, Pais [14] delves into the CPC system’s connection with entity linking
in patent text analysis. Together, these studies significantly enhance our comprehension
of the CPC system’s role in patent analysis and classification, reflecting the latest advance-
ments in the field.
III Algorithm and Model
As shown in Fig.1, this ensemble model is designed to offer a comprehensive and fine-
grained understanding of patent texts. The core principle of our ensemble model is the
weighted averaging of predictions from individual models, where the weights are deter-
mined based on their performance on the validation data. Mathematically, this is repre-
sented as:
ŷ_e = ∑_{i=1}^{N} w_i ⋅ ŷ_i   (1)
where ŷ_e represents the ensemble prediction, ŷ_i represents the prediction from the i-th model, and w_i represents the weight assigned to the i-th model. The weights w_i are opti-
mized through a validation process to maximize the ensemble’s overall accuracy and se-
mantic understanding of patent documents. This ensemble approach ensures that the CPC
system benefits from the strengths of each individual model while mitigating their weak-
nesses, resulting in improved accuracy and efficiency in patent document analysis.
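As a minimal sketch of the weighted averaging in Eq. (1): the prediction values and per-model comments below are illustrative, and the weights are the validation-tuned values reported in the results section, not output of any real model.

```python
def ensemble_predict(model_preds, weights):
    """Weighted average of per-model similarity predictions (Eq. 1)."""
    assert len(model_preds) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    n = len(model_preds[0])
    return [sum(w * preds[i] for w, preds in zip(weights, model_preds))
            for i in range(n)]

# Illustrative predicted scores from four models for three phrase pairs.
preds = [
    [0.80, 0.25, 0.50],   # e.g. DeBERTa-v3-large
    [0.70, 0.30, 0.45],   # e.g. BERT for patents
    [0.75, 0.20, 0.55],   # e.g. ELECTRA-large-discriminator
    [0.72, 0.28, 0.50],   # e.g. DeBERTa-v3-large-mnli-fever-anli-ling-wanli
]
weights = [0.35, 0.20, 0.25, 0.20]  # validation-tuned blend weights
print(ensemble_predict(preds, weights))
```

Because the weights sum to one, the blended output stays in the same 0–1 score range as the individual model predictions.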
Our approach involves a meticulous text preprocessing method V3, where anchor and
context pairs are thoughtfully grouped and aggregated, resulting in each pair having an
associated target list and score list. This text preparation is essential for effectively assess-
ing semantic similarity in patent documents. As shown in Fig.2, the heart of our method-
ology lies in the structured input format we employ. This format adheres to a well-defined
pattern, which includes tokens like [CLS], [SEP], and [TAR]. This structured input is de-
signed to facilitate the model’s understanding and analysis of patent document content.
In our model, each token is assigned a score, a process efficiently executed within the
TrainDataset class during data processing. This step ensures that the model can discern
the significance of individual tokens within the Cooperative Patent Classification (CPC)
context. The model’s output is a sequence of the same length as the input, with each token
receiving a predicted score. Even non-target tokens, such as [CLS], [SEP], and [TAR], re-
ceive scores, albeit with a true score set to -1, as they are not directly relevant to the se-
mantic similarity assessment.
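A sketch of how a V3 training example might be assembled. The helper name, the exact token ordering, and the sample anchor/context/targets are our assumptions based on the description above, not the paper's actual code or data.

```python
def build_v3_input(anchor, context, targets, scores, ignore_score=-1.0):
    """Assemble one grouped V3 sequence plus per-token score labels.

    Structural tokens ([CLS], [SEP], [TAR]) receive a true score of -1 so
    they can be excluded from the loss; every word of a target phrase
    carries that pair's similarity score."""
    tokens = ["[CLS]"] + anchor.split() + ["[SEP]"] + context.split() + ["[SEP]"]
    labels = [ignore_score] * len(tokens)
    for target, score in zip(targets, scores):
        tokens.append("[TAR]")
        labels.append(ignore_score)      # [TAR] itself is not scored
        for word in target.split():
            tokens.append(word)
            labels.append(score)         # target tokens share the pair's score
    return tokens, labels

tokens, labels = build_v3_input(
    "abatement", "A47",
    targets=["abatement of pollution", "forest region"],
    scores=[0.5, 0.0],
)
print(tokens)
print(labels)
```

In practice the word-level split would be replaced by the model tokenizer, with [TAR] registered as an additional special token, but the score-assignment logic is the same.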
For effective training and fine-tuning, we use BCELoss, a loss function comparing pre-
dicted scores to ground truth, aiming to align predicted and true scores, enhancing our
model’s patent document phrase similarity assessment. The loss function used is Binary
Cross-Entropy Loss (BCELoss), which is defined as:
ℒ = −(1/N) ∑_{i=1}^{N} ( G_i ⋅ log(P_i) + (1 − G_i) ⋅ log(1 − P_i) )   (2)
Here, ℒ represents the overall loss for a batch of tokens, 𝑁 is the total number of tokens in
the batch, 𝐺𝑖 is the ground truth score for token 𝑖, and 𝑃𝑖 is the predicted score for the
same token. BCELoss guides the model during training to minimize the discrepancies be-
tween predicted and ground truth scores, facilitating the accurate assessment of semantic
similarity between patent document phrases.
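Eq. (2), with the −1-scored structural tokens masked out, can be sketched in pure Python for clarity; the paper presumably uses a framework loss such as PyTorch's BCELoss, so this stand-in is only illustrative.

```python
import math

def bce_loss(pred, truth, ignore_score=-1.0):
    """Binary cross-entropy over token scores (Eq. 2).

    Tokens whose true score is -1 ([CLS], [SEP], [TAR]) are excluded,
    since they are not relevant to the similarity assessment."""
    terms = [(g, p) for g, p in zip(truth, pred) if g != ignore_score]
    n = len(terms)
    return -sum(g * math.log(p) + (1 - g) * math.log(1 - p)
                for g, p in terms) / n

pred  = [0.9, 0.8, 0.1, 0.7]     # predicted token scores
truth = [-1.0, 1.0, 0.0, 0.5]    # first token is structural, ignored
print(round(bce_loss(pred, truth), 4))
```

Note that BCE accepts soft targets, which is what makes it usable here: the dataset's 0.25-step scores act as fractional ground-truth probabilities rather than hard 0/1 labels.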
III-C Datasets
The dataset provided for this task consists of pairs of phrases, which include an anchor
phrase and a target phrase. The primary objective is to evaluate the degree of similarity
between these phrases, utilizing a rating scale that ranges from 0 (indicating no similar-
ity) to 1 (representing identical meaning). This assessment of similarity is unique in that it
is conducted within the context of patent subject classification, specifically based on the Cooperative Patent Classification (CPC) system.
Scores in the dataset range from 0 to 1, with increments of 0.25, each representing a specific level of similarity. The entire dataset contains 48,548 entries with 973 unique anchors, split into training (75%), validation (5%), and test (20%) sets. When splitting the data, all of the entries with the same anchor are kept together in the same set. There are 106 different context CPC classes, and all of them are represented in the training set.
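The anchor-grouped split can be sketched as follows, under the stated 75/5/20 proportions; the function name and row format are illustrative, not the paper's implementation.

```python
import random

def split_by_anchor(rows, fractions=(0.75, 0.05, 0.20), seed=0):
    """Split rows so that every entry sharing an anchor lands in the
    same subset, per the stated train/validation/test proportions."""
    anchors = sorted({r["anchor"] for r in rows})
    random.Random(seed).shuffle(anchors)
    n = len(anchors)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train_a = set(anchors[:n_train])
    val_a = set(anchors[n_train:n_train + n_val])
    train = [r for r in rows if r["anchor"] in train_a]
    val   = [r for r in rows if r["anchor"] in val_a]
    test  = [r for r in rows
             if r["anchor"] not in train_a and r["anchor"] not in val_a]
    return train, val, test

# Toy dataset: 20 anchors with 5 target rows each.
rows = [{"anchor": f"a{i % 20}", "target": f"t{i}"} for i in range(100)]
train, val, test = split_by_anchor(rows)
print(len(train), len(val), len(test))
```

Splitting by anchor rather than by row prevents leakage: a model must generalize to unseen anchor phrases rather than memorizing targets for anchors it saw in training.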
The evaluation metric was the Pearson correlation coefficient (r) between the predicted (ŷ_i) and actual (y_i) similarity scores, where a higher r indicates a stronger linear relationship between predictions and ground truth scores. It is calculated as:

r = ∑_n (x_i − x̄)(y_i − ȳ) / √( ∑_n (x_i − x̄)² ∑_n (y_i − ȳ)² )   (3)
where x_i and y_i represent individual data points, x̄ and ȳ are the means of x and y, respec-
tively, and 𝑛 is the number of data set samples. The Pearson correlation coefficient mea-
sures the strength of the linear relationship between predicted and actual similarity
scores, reflecting model performance in patent phrase similarity.
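Eq. (3) translates directly to code; a minimal implementation with illustrative score values:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two score lists (Eq. 3)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative: true scores on the 0.25-step scale vs. model predictions.
print(pearson_r([0.0, 0.25, 0.5, 1.0], [0.1, 0.3, 0.45, 0.9]))
```

Because r measures only the linear relationship, predictions that are systematically shifted or scaled but correctly ordered can still score close to 1.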
We rigorously evaluated our model’s performance and generalization with a 4-fold Cross-
Validation [19] approach, maintaining label balance using MultiLabelStratifiedKFold. This
method comprehensively assessed effectiveness across dataset subsets.
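A simplified illustration of keeping anchors grouped across the four folds. The paper uses MultiLabelStratifiedKFold (e.g. from the iterative-stratification package), which additionally balances score labels across folds; that stratification is omitted in this round-robin sketch.

```python
def grouped_kfold(anchors, k=4):
    """Assign a fold index to each row such that all rows sharing an
    anchor fall into the same fold (round-robin over unique anchors)."""
    folds = {a: i % k for i, a in enumerate(sorted(set(anchors)))}
    return [folds[a] for a in anchors]

# Illustrative anchor column: repeated anchors must share a fold.
anchors = ["abatement", "abatement", "motor", "motor", "valve", "pump"]
print(grouped_kfold(anchors))
```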
III-E Results
In this section, we present the performance evaluation of our model variants (denoted as
V1, V2, and V3) using the DeBERTa-v3-large architecture. We assessed the model’s capabil-
ities in U.S. Patent Phrase-to-Phrase Matching across different text processing strategies.
Specifically, we considered the following variants:
• V1: The input text utilized the input structure: [CLS] anchor [SEP] target [SEP] context.
• V2: The input text incorporated the input structure: [CLS] anchor [SEP] target [SEP] context [SEP] context…
• V3: The grouped input structure described earlier, in which each anchor-context pair is followed by its full target list, with each target introduced by a [TAR] token.
Our experiment results demonstrate the effectiveness of various text processing strate-
gies, particularly highlighting the superior performance of text preprocessing method V3
among the tested approaches.
The ensemble strategy incorporated these models with different weights to maximize
their collective efficacy. The ensemble model’s impressive performance was demonstrated
by its Cross-Validation (CV) score, with Microsoft’s DeBERTa-v3-large contributing a CV
score of 0.8512, Anferico’s BERT for patents with a CV score of 0.8382, Google’s ELECTRA-
large-discriminator scoring 0.8503 in CV, and MoritzLaurer’s DeBERTa-v3-large-mnli-
fever-anli-ling-wanli achieving a CV score of 0.8385. These models were blended with
weights of 0.35, 0.2, 0.25, and 0.2, respectively, to create the ensemble. The final ensemble
score, measured using the Pearson correlation coefficient, reached an impressive 0.8534,
underscoring the success of this approach in enhancing semantic similarity measurement
for patent documents. The table summarizes these results for clarity.
IV Conclusion
Artificial intelligence, notably in bioinformatics and medicine, is advancing rapidly across diverse fields. Amidst this advancement, our study
delves into the intricate realm of semantic similarity assessment within patent docu-
ments, particularly in the context of the Cooperative Patent Classification (CPC) frame-
work. While prior research laid the CPC foundation, it grappled with language barriers
and precision issues. Subsequent innovative solutions faced constraints, and recent
strides using BERT-related techniques showed promise but raised scalability and text pro-
cessing concerns.
To overcome these challenges and bolster the CPC system, our paper introduces an en-
semble approach, harnessing multiple deep learning models, including DeBERTaV3-re-
lated ones, each meticulously trained with BCELoss. We also present creative data pro-
cessing methods tailored to patent document nuances, featuring an innovative input
structure that assigns scores to individual tokens. The incorporation of BCELoss during
training leverages both predicted and ground truth scores, enabling fine-grained semantic
analysis.
By merging these innovations with traditional similarity assessment, our work aims to
significantly enhance patent document analysis efficiency and precision. Our experimen-
tal findings conclusively establish the effectiveness of both our Ensemble Model and novel
text processing strategies when deployed on the U.S. Patent Phrase to Phrase Matching
dataset.
References
[1] B. Lent, R. Agrawal, and R. Srikant, “Discovering trends in text databases,” in KDD, vol. 97, 1997, pp. 227–230.