(2401.06782) Semantic Similarity Matching For Patent Documents Using Ensemble BERT-related Model and Novel Text Processing Method
Chang Che
Mechanical engineering
The George Washington University
Atlanta, USA
[email protected]
Abstract
In the realm of patent document analysis, assessing semantic similarity between phrases presents a significant challenge, notably amplifying the inherent complexities of Cooperative Patent Classification (CPC) research. This study first situates these challenges, recognizing early CPC work while acknowledging its past struggles with language barriers and document intricacy, and then underscores the difficulties that persist in CPC research.
To overcome these challenges and bolster the CPC system, this paper presents two key innovations. Firstly, it introduces an ensemble approach that incorporates four BERT-related models, enhancing semantic similarity accuracy through weighted averaging.
Secondly, a novel text preprocessing method tailored for patent documents is introduced,
featuring a distinctive input structure with token scoring that aids in capturing semantic
relationships during CPC context training, utilizing BCELoss. Our experimental findings
conclusively establish the effectiveness of both our Ensemble Model and novel text pro-
cessing strategies when deployed on the U.S. Patent Phrase to Phrase Matching dataset.
Index Terms: Cooperative Patent Classification (CPC), Data Processing Method, DeBERTa
(Decoding-enhanced BERT).
I Introduction
In the realm of patent document analysis, the precise evaluation of semantic similarity
between phrases poses a significant and fundamental challenge. This paper focuses on ad-
dressing this critical task, highlighting its particular relevance within the context of
Cooperative Patent Classification (CPC). While early publications by Lent et al. [1], Larkey
[2], and Gey et al. [3] laid the foundation for CPC, they also exposed limitations related to
language barriers, precision, and adapting to the complexity of patent documents.
Subsequent research efforts aimed to tackle these challenges by proposing innovative solutions, though each had its shortcomings. Chen and Chiu [4] focused on cross-language matching, with potential limitations in handling diverse patent document formats. Al-Shboul and Myaeng [5] employed Wikipedia for effective query expansion but faced limitations with specialized technical terms.
Ever since the introduction of Deep Learning, such techniques have seen extensive utilization in the field of CPC research. Prasad [6] utilized CPC for bioremediation patent search, enhancing domain understanding. Shalaby et al. [7] introduced
LSTM, boosting patent classification accuracy and adaptability to changing taxonomies. Li
et al. [8], in their deep learning approach, demonstrated improved classification accuracy
but required extensive computational resources.
In the ever-evolving landscape of patent analysis and classification, the year 2023 has witnessed the emergence of significant research contributions. Yoo et al. [11] delve into multi-
label classification of Artificial Intelligence-related patents, employing advanced tech-
niques. Ha and Lee [12] focus on evaluating the Cooperative Patent Classification (CPC)
system, with a particular emphasis on patent embeddings. Hoshino et al. [13] explore IPC
prediction using neural networks and CPC’s IPC classification. Additionally, Pais [14] investigates the CPC system’s link to entity identification in patent text analysis. It is essential to acknowledge that these studies may exhibit certain limitations, offering opportunities for further research and refinement in the patent analysis field.
To overcome these challenges and further enhance the capabilities of the CPC system, this
paper introduces an ensemble approach. In contrast to the traditional methods men-
tioned earlier, the ensemble method leverages the strengths of multiple BERT-related
models, including the DeBERTaV3 [15]-related models Microsoft’s DeBERTa-v3-large and MoritzLaurer’s DeBERTa-v3-large-mnli-fever-anli-ling-wanli, as well as Anferico’s BERT for patents [16] and Google’s ELECTRA-large-discriminator [17]. This ensemble approach seeks to pro-
vide a comprehensive solution to the issues faced in previous research, thereby advanc-
ing the field of CPC.
Our approach involves a novel text preprocessing method (V3) that groups and aggregates
anchor and context pairs, resulting in each pair having an associated target list and score
list. This structured input format adheres to a well-defined pattern, including tokens like
[CLS], [SEP], and [TAR], designed to facilitate the model’s understanding and analysis of
patent document content. Our experiment results demonstrate the effectiveness of our
Ensemble Model and novel text processing strategies when applied to the U.S. Patent
Phrase to Phrase Matching dataset. The main contributions of this work can be summarized as follows:
• We introduce an ensemble approach that combines four BERT-related models through weighted averaging to enhance semantic similarity accuracy.
• We propose a novel text preprocessing method (V3) tailored to patent documents, featuring a token-scoring input structure trained with BCELoss.
• Our experiments confirm the efficacy of our Ensemble Model and novel text processing strategies on the U.S. Patent Phrase to Phrase Matching dataset.
The remainder of this paper is structured as follows: The introduction sets the stage for understanding the challenges in CPC research. The related work section provides an overview
of prior research efforts in the field. The algorithm and model section delves into the in-
novative ensemble approach and novel text preprocessing method. The conclusion sec-
tion summarizes the contributions and the potential impact of this research on CPC
analysis.
II RELATED WORK
A number of initial publications established the groundwork for the Cooperative Patent
Classification (CPC) system. Lent et al. [1] explored text data trends, relevant to CPC’s
patent document organization. Larkey [2] contributed to patent search and classification,
aligning with CPC’s goal of effective categorization.
CPC research has spanned language barriers, precision, and deep learning to advance
patent classification and analysis. Notably, Al-Shboul and Myaeng’s work [5] introduced
“Wikipedia-based query phrase expansion” to enhance CPC’s search precision and recall.
Due to the rapid advancements in deep learning, an increasing number of studies are be-
ing employed in the realm of CPC research. Prasad[6] employed Cooperative Patent
Classification (CPC) to conduct a comprehensive search for bioremediation patents, con-
tributing to an enhanced understanding of the patent landscape in this domain. Shalaby
et al. [7] introduced an innovative method using Long Short-Term Memory (LSTM) networks that enhanced the accuracy of patent classification, offering greater adaptability to
changing patent taxonomies and more efficient patent organization and retrieval. Li et
al.’s [8] “DeepPatent” with convolutional neural networks and word embeddings contrib-
utes to evolving and refining CPC’s capabilities. Furthermore, studies enhanced CPC using
BERT techniques, elevating patent document classification accuracy and efficiency. Lee
and Hsiang fine-tuned a BERT model for patent classification in their pioneering work
“PatentBERT” [9].
In the latest research conducted in 2023, the exploration of the Cooperative Patent
Classification (CPC) system has continued to evolve. Yoo et al. [11] examine multi-label
classification of Artificial Intelligence-related patents, utilizing Modified D2SBERT and
Sentence Attention mechanisms. Meanwhile, Ha and Lee’s [12] article explores the effec-
tiveness of the CPC system, focusing on patent embeddings. Hoshino et al. [13] investigate
IPC prediction using neural networks and CPC’s IPC classification for patent document
content. Additionally, Pais [14] delves into the CPC system’s connection with entity linking
in patent text analysis. Together, these studies significantly enhance our comprehension
of the CPC system’s role in patent analysis and classification, reflecting the latest advance-
ments in the field.
III Algorithm and Model
As shown in Fig.1, this ensemble model is designed to offer a comprehensive and fine-
grained understanding of patent texts. The core principle of our ensemble model is the
weighted averaging of predictions from individual models, where the weights are deter-
mined based on their performance on the validation data. Mathematically, this is repre-
sented as:
ŷ_e = ∑_{i=1}^{N} w_i ⋅ ŷ_i   (1)
where ŷ_e represents the ensemble prediction, ŷ_i represents the prediction from the i-th model, and w_i represents the weight assigned to the i-th model. The weights w_i are opti-
mized through a validation process to maximize the ensemble’s overall accuracy and se-
mantic understanding of patent documents. This ensemble approach ensures that the CPC
system benefits from the strengths of each individual model while mitigating their weak-
nesses, resulting in improved accuracy and efficiency in patent document analysis.
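As a minimal sketch of the weighted averaging in Eq. (1): the prediction values and per-model comments below are illustrative, and the weights are the validation-tuned values reported in the results section, not output of any real model.

```python
def ensemble_predict(model_preds, weights):
    """Weighted average of per-model similarity predictions (Eq. 1)."""
    assert len(model_preds) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    n = len(model_preds[0])
    return [sum(w * preds[i] for w, preds in zip(weights, model_preds))
            for i in range(n)]

# Illustrative predicted scores from four models for three phrase pairs.
preds = [
    [0.80, 0.25, 0.50],   # e.g. DeBERTa-v3-large
    [0.70, 0.30, 0.45],   # e.g. BERT for patents
    [0.75, 0.20, 0.55],   # e.g. ELECTRA-large-discriminator
    [0.72, 0.28, 0.50],   # e.g. DeBERTa-v3-large-mnli-fever-anli-ling-wanli
]
weights = [0.35, 0.20, 0.25, 0.20]  # validation-tuned blend weights
print(ensemble_predict(preds, weights))
```

Because the weights sum to one, the blended output stays in the same 0–1 score range as the individual model predictions.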
Our approach involves a meticulous text preprocessing method V3, where anchor and
context pairs are thoughtfully grouped and aggregated, resulting in each pair having an
associated target list and score list. This text preparation is essential for effectively assess-
ing semantic similarity in patent documents. As shown in Fig.2, the heart of our method-
ology lies in the structured input format we employ. This format adheres to a well-defined
pattern, which includes tokens like [CLS], [SEP], and [TAR]. This structured input is de-
signed to facilitate the model’s understanding and analysis of patent document content.
In our model, each token is assigned a score, a process efficiently executed within the
TrainDataset class during data processing. This step ensures that the model can discern
the significance of individual tokens within the Cooperative Patent Classification (CPC)
context. The model’s output is a sequence of the same length as the input, with each token
receiving a predicted score. Even non-target tokens, such as [CLS], [SEP], and [TAR], re-
ceive scores, albeit with a true score set to -1, as they are not directly relevant to the se-
mantic similarity assessment.
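A sketch of how a V3 training example might be assembled. The helper name, the exact token ordering, and the sample anchor/context/targets are our assumptions based on the description above, not the paper's actual code or data.

```python
def build_v3_input(anchor, context, targets, scores, ignore_score=-1.0):
    """Assemble one grouped V3 sequence plus per-token score labels.

    Structural tokens ([CLS], [SEP], [TAR]) receive a true score of -1 so
    they can be excluded from the loss; every word of a target phrase
    carries that pair's similarity score."""
    tokens = ["[CLS]"] + anchor.split() + ["[SEP]"] + context.split() + ["[SEP]"]
    labels = [ignore_score] * len(tokens)
    for target, score in zip(targets, scores):
        tokens.append("[TAR]")
        labels.append(ignore_score)      # [TAR] itself is not scored
        for word in target.split():
            tokens.append(word)
            labels.append(score)         # target tokens share the pair's score
    return tokens, labels

tokens, labels = build_v3_input(
    "abatement", "A47",
    targets=["abatement of pollution", "forest region"],
    scores=[0.5, 0.0],
)
print(tokens)
print(labels)
```

In practice the word-level split would be replaced by the model tokenizer, with [TAR] registered as an additional special token, but the score-assignment logic is the same.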
For effective training and fine-tuning, we use BCELoss, a loss function comparing pre-
dicted scores to ground truth, aiming to align predicted and true scores, enhancing our
model’s patent document phrase similarity assessment. The loss function used is Binary
Cross-Entropy Loss (BCELoss), which is defined as:
ℒ = −(1/N) ∑_{i=1}^{N} ( G_i ⋅ log(P_i) + (1 − G_i) ⋅ log(1 − P_i) )   (2)
Here, ℒ represents the overall loss for a batch of tokens, 𝑁 is the total number of tokens in
the batch, 𝐺𝑖 is the ground truth score for token 𝑖, and 𝑃𝑖 is the predicted score for the
same token. BCELoss guides the model during training to minimize the discrepancies be-
tween predicted and ground truth scores, facilitating the accurate assessment of semantic
similarity between patent document phrases.
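Eq. (2), with the −1-scored structural tokens masked out, can be sketched in pure Python for clarity; the paper presumably uses a framework loss such as PyTorch's BCELoss, so this stand-in is only illustrative.

```python
import math

def bce_loss(pred, truth, ignore_score=-1.0):
    """Binary cross-entropy over token scores (Eq. 2).

    Tokens whose true score is -1 ([CLS], [SEP], [TAR]) are excluded,
    since they are not relevant to the similarity assessment."""
    terms = [(g, p) for g, p in zip(truth, pred) if g != ignore_score]
    n = len(terms)
    return -sum(g * math.log(p) + (1 - g) * math.log(1 - p)
                for g, p in terms) / n

pred  = [0.9, 0.8, 0.1, 0.7]     # predicted token scores
truth = [-1.0, 1.0, 0.0, 0.5]    # first token is structural, ignored
print(round(bce_loss(pred, truth), 4))
```

Note that BCE accepts soft targets, which is what makes it usable here: the dataset's 0.25-step scores act as fractional ground-truth probabilities rather than hard 0/1 labels.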
III-C Datasets
The dataset provided for this task consists of pairs of phrases, which include an anchor
phrase and a target phrase. The primary objective is to evaluate the degree of similarity
between these phrases, utilizing a rating scale that ranges from 0 (indicating no similar-
ity) to 1 (representing identical meaning). This assessment of similarity is unique in that it
is conducted within the context of patent subject classification, specifically based on the Cooperative Patent Classification (CPC) system.
Scores in the dataset range from 0 to 1, with increments of 0.25, each representing a specific level of similarity. The entire dataset contains 48,548 entries with 973 unique anchors, split into training (75%), validation (5%), and test (20%) sets. When splitting the data, all of the entries with the same anchor are kept together in the same set. There are 106 different context CPC classes, and all of them are represented in the training set.
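The anchor-grouped split can be sketched as follows, under the stated 75/5/20 proportions; the function name and row format are illustrative, not the paper's implementation.

```python
import random

def split_by_anchor(rows, fractions=(0.75, 0.05, 0.20), seed=0):
    """Split rows so that every entry sharing an anchor lands in the
    same subset, per the stated train/validation/test proportions."""
    anchors = sorted({r["anchor"] for r in rows})
    random.Random(seed).shuffle(anchors)
    n = len(anchors)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train_a = set(anchors[:n_train])
    val_a = set(anchors[n_train:n_train + n_val])
    train = [r for r in rows if r["anchor"] in train_a]
    val   = [r for r in rows if r["anchor"] in val_a]
    test  = [r for r in rows
             if r["anchor"] not in train_a and r["anchor"] not in val_a]
    return train, val, test

# Toy dataset: 20 anchors with 5 target rows each.
rows = [{"anchor": f"a{i % 20}", "target": f"t{i}"} for i in range(100)]
train, val, test = split_by_anchor(rows)
print(len(train), len(val), len(test))
```

Splitting by anchor rather than by row prevents leakage: a model must generalize to unseen anchor phrases rather than memorizing targets for anchors it saw in training.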
The evaluation metric was the Pearson correlation coefficient (r) between the predicted (ŷ_i) and actual (y_i) similarity scores, where a higher r indicates a stronger linear relationship between predictions and ground truth scores. It is calculated as:

r = ∑_n (x_i − x̄)(y_i − ȳ) / √( ∑_n (x_i − x̄)² ∑_n (y_i − ȳ)² )   (3)
where x_i and y_i represent individual data points, x̄ and ȳ are the means of x and y, respec-
tively, and 𝑛 is the number of data set samples. The Pearson correlation coefficient mea-
sures the strength of the linear relationship between predicted and actual similarity
scores, reflecting model performance in patent phrase similarity.
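Eq. (3) translates directly to code; a minimal implementation with illustrative score values:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two score lists (Eq. 3)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative: true scores on the 0.25-step scale vs. model predictions.
print(pearson_r([0.0, 0.25, 0.5, 1.0], [0.1, 0.3, 0.45, 0.9]))
```

Because r measures only the linear relationship, predictions that are systematically shifted or scaled but correctly ordered can still score close to 1.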
We rigorously evaluated our model’s performance and generalization with a 4-fold Cross-
Validation [19] approach, maintaining label balance using MultiLabelStratifiedKFold. This
method comprehensively assessed effectiveness across dataset subsets.
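A simplified illustration of keeping anchors grouped across the four folds. The paper uses MultiLabelStratifiedKFold (e.g. from the iterative-stratification package), which additionally balances score labels across folds; that stratification is omitted in this round-robin sketch.

```python
def grouped_kfold(anchors, k=4):
    """Assign a fold index to each row such that all rows sharing an
    anchor fall into the same fold (round-robin over unique anchors)."""
    folds = {a: i % k for i, a in enumerate(sorted(set(anchors)))}
    return [folds[a] for a in anchors]

# Illustrative anchor column: repeated anchors must share a fold.
anchors = ["abatement", "abatement", "motor", "motor", "valve", "pump"]
print(grouped_kfold(anchors))
```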
III-E Results
In this section, we present the performance evaluation of our model variants (denoted as
V1, V2, and V3) using the DeBERTa-v3-large architecture. We assessed the model’s capabil-
ities in U.S. Patent Phrase-to-Phrase Matching across different text processing strategies.
Specifically, we considered the following variants:
• V1: The input text utilized the input structure: [CLS] anchor [SEP] target [SEP] context.
• V2: The input text incorporated the input structure: [CLS] anchor [SEP] target [SEP] context [SEP] context…
• V3: The grouped input structure described earlier, in which each anchor-context pair is followed by its full target list, with each target introduced by a [TAR] token.
Our experiment results demonstrate the effectiveness of various text processing strate-
gies, particularly highlighting the superior performance of text preprocessing method V3
among the tested approaches.
The ensemble strategy incorporated these models with different weights to maximize
their collective efficacy. The ensemble model’s impressive performance was demonstrated
by its Cross-Validation (CV) score, with Microsoft’s DeBERTa-v3-large contributing a CV
score of 0.8512, Anferico’s BERT for patents with a CV score of 0.8382, Google’s ELECTRA-
large-discriminator scoring 0.8503 in CV, and MoritzLaurer’s DeBERTa-v3-large-mnli-
fever-anli-ling-wanli achieving a CV score of 0.8385. These models were blended with
weights of 0.35, 0.2, 0.25, and 0.2, respectively, to create the ensemble. The final ensemble
score, measured using the Pearson correlation coefficient, reached an impressive 0.8534,
underscoring the success of this approach in enhancing semantic similarity measurement
for patent documents. The table summarizes these results for clarity.
IV Conclusion
Artificial intelligence, notably in bioinformatics and medicine, is advancing rapidly across diverse fields. Amidst this advancement, our study
delves into the intricate realm of semantic similarity assessment within patent docu-
ments, particularly in the context of the Cooperative Patent Classification (CPC) frame-
work. While prior research laid the CPC foundation, it grappled with language barriers
and precision issues. Subsequent innovative solutions faced constraints, and recent
strides using BERT-related techniques showed promise but raised scalability and text pro-
cessing concerns.
To overcome these challenges and bolster the CPC system, our paper introduces an en-
semble approach, harnessing multiple deep learning models, including DeBERTaV3-re-
lated ones, each meticulously trained with BCELoss. We also present creative data pro-
cessing methods tailored to patent document nuances, featuring an innovative input
structure that assigns scores to individual tokens. The incorporation of BCELoss during
training leverages both predicted and ground truth scores, enabling fine-grained semantic
analysis.
By merging these innovations with traditional similarity assessment, our work aims to
significantly enhance patent document analysis efficiency and precision. Our experimen-
tal findings conclusively establish the effectiveness of both our Ensemble Model and novel
text processing strategies when deployed on the U.S. Patent Phrase to Phrase Matching
dataset.
References
[1] B. Lent, R. Agrawal, and R. Srikant, “Discovering trends in text databases,” in KDD, vol. 97, 1997, pp. 227–230.