CIKM2022 Submission 3961
Research: Optimizing singular value based similarity measures for document similarity comparisons Conference acronym ’XX, June 03–05, 2018, Woodstock, NY

how submultiplicative norms can be converted into a metric resembling cosine similarity, providing a family of similarity measures building on the Schatten 𝑝-norm computed using singular values of covariance pooling. We then introduce new similarity measures that are based on the same singular values but map them to similarity scores in a more flexible manner. The new similarity measures have learnable parameters that are tuned for a specific end task and hence can learn to represent relevant information better.

1.3 Patent retrieval as context

We evaluate the measures in the context of patent applications, as an example domain with long but structured documents. Efficient tools for handling patent documents are in high demand due to the high labor cost of manual inspection. This is especially the case for the invalidity search stage, aiming to find relevant patents that could possibly cause issues with e.g. patent infringement, or lead to delays or rejection of the patent application. Over the years, there has been much research on how to automate different parts of the process [1, 3] and on end-to-end solutions [9] for specific tasks. In addition to trying to solve specific tasks, there have been efforts toward creating patent-text-specific language models [4, 15]. Still, the field of patent text processing is far from being solved. The patent domain is a good candidate for exploring richer representations and similarity measures as patent documents can often be tens of pages long and can greatly benefit from richer information.

We explore the value of covariance pooling and singular value based similarity measures in patent similarity comparison tasks. We show that in the case of static embeddings, these similarity measures can provide better results in a full document comparison setting when compared to mean vector representation and that the newly proposed similarity measure using a neural network further improves the similarity comparison accuracy compared to the standard Schatten 𝑝-norm.

2 SIMILARITY MEASURES BASED ON SINGULAR VALUES

This section introduces our technical contributions. We first explain how submultiplicative matrix norms can be used for deriving a similarity measure between two matrices. We provide a family of measures building on the Schatten 𝑝-norm, computed using singular values of the covariance pooling of document matrices. We then proceed to create a family of potentially more expressive matrix similarity measures, building on the same basic distance measure but replacing the matrix norm with alternative functions of the singular values. In particular, we introduce similarity measures with learnable parameters that can be fine-tuned for a given task.

2.1 From matrix norm to similarity measure

Any matrix norm ∥𝐴∥ that has the submultiplicative property ∥𝐴𝐵∥ ≤ ∥𝐴∥ ∥𝐵∥ can be used for constructing a normalized similarity measure between matrices 𝐴 and 𝐵 measured with norm (or norm-like) 𝑆(·). This can be expressed as a general formula

\[
D(A, B, S(\cdot)) := \frac{S(A^T B)}{S(A^T A)^{1/2} \, S(B^T B)^{1/2}}. \tag{1}
\]

This measure can be considered as a natural extension to the standard cosine similarity between vectors. Due to submultiplicativity, it is always within the range [−1, 1]. Even though the measure will not in general be a proper metric, we will have higher similarity when 𝐴 and 𝐵 are similar in terms of the norm and can use it for similarity comparisons.

In this work we build on a particular family of submultiplicative norms called Schatten 𝑝-norms, defined as

\[
S_p(A) := \Big( \sum_n s_n(A)^p \Big)^{1/p}, \tag{2}
\]

where 𝑝 ∈ [1, ∞) and 𝑠𝑛(𝐴) is the 𝑛th singular value of the matrix 𝐴 in descending order. The normalized similarity measure can then be expressed as 𝐷(𝐴, 𝐵, 𝑆𝑝(·)) in the general notation of Eq. (1). This family generalizes several well-known norms: for 𝑝 = 2 we get the Frobenius norm, for 𝑝 = 1 it corresponds to the trace norm, and for 𝑝 = ∞ we get the operator norm. Lagus et al. [13] presented the similarity measure of Eq. (1) in the specific context of the Frobenius norm, but here we consider the general formulation for arbitrary norms and norm-like functions.

For 𝑝 ∈ (0, 1) the Schatten 𝑝-norm becomes a quasinorm since it does not fulfill the triangle inequality, but we still retain the property that 𝐷(𝐴, 𝐵, 𝑆𝑝(·)) ∈ [−1, 1] and hence get a normalized similarity measure. The Schatten 𝑝-quasinorm has recently gained traction in other matrix applications such as low-rank matrix recovery [21] and image denoising [20], and has been shown to have beneficial properties even though it lacks the convexity guarantees that the triangle inequality would give, which makes direct optimization harder.

2.2 Learnable similarity measures

The similarity measure (1) is general and depends on the norm. Instead of assuming a specific norm in advance, we propose using a slightly more flexible parametric family of norms. We can then optimize the parameters of the norm directly for a task where the distance measure is used. The Schatten 𝑝-norm (2) itself has the parameter 𝑝 which can be learned to maximize a task performance, such as retrieval accuracy. Since we only have a single parameter, we can either just evaluate the performance using a grid of alternative choices or directly optimize over 𝑝 using standard gradient-based optimization; we will later show that both approaches work.

For more flexibility, we next propose extensions of the Schatten 𝑝-norm that involve additional control parameters. These extensions are not necessarily interesting as matrix norms as such since there would be no basis for determining the parameters in isolation, but in applications where the norm is used to construct a distance measure, we can determine the parameters to maximize the eventual task performance. We start from the observation that the Schatten 𝑝-norm is based on singular values, and explore how much richer measures we can construct using singular values as the inputs. We consider two alternatives to be used in place of the norm in (1), both of which result in bounded similarities.

The simplest extension

\[
S_{w,p}(A) := \Big( \sum_n \big( w_n s_n(A) \big)^p \Big)^{1/p} \tag{3}
\]
weights each singular value independently but otherwise retains the functional form of the Schatten 𝑝-norm. This generalization is still a norm, since for any matrix 𝐴, we can always find matrix 𝐴′ where 𝑠𝑖(𝐴′) = 𝑤𝑖 𝑠𝑖(𝐴). One motivation for this norm is the observation of Arora et al. [2] that removing the direction of the largest singular vector helps in reducing the effect of the most common words that are not informative in document discrimination. For 𝑝 = 1 (denoted as 𝑆𝑤,1(·) later on) we obtain simple weighting as a special case of the more general weighting. Alternatively, we can interpret the weights 𝑤𝑛 as a form of attention mechanism.

As a still more flexible alternative, we consider directly mapping the singular values of 𝐴ᵀ𝐵 to the similarity with a flexible model. We can then include the normalization within the measure itself, and hence directly get a replacement for Eq. (1). For this, we use a small neural network

\[
D_{NN}(A, B) = \tanh\big( \mathrm{ReLU}( \mathrm{ReLU}( s(A^T B) W_1 ) W_2 ) W_3 \big), \tag{4}
\]

where $W_1 \in \mathbb{R}^{d \times 500}$, $W_2 \in \mathbb{R}^{500 \times 500}$, $W_3 \in \mathbb{R}^{500 \times 1}$, and ReLU(·) is the rectified linear unit activation function. Finally, the hyperbolic tangent tanh(·) activation at the end ensures the outcome is normalized to [−1, 1]. Each layer also has a bias term of suitable size, which is omitted here for conciseness. The network architecture could be further tuned by standard architecture search, but this is not particularly relevant for this work.

3 EXPERIMENTS

We evaluate the proposed similarity measures in the context of patent documents. When patent examiners evaluate the novelty of a patent application, there are different kinds of prior art to be considered. The X citations are prior work that can alone lead to a rejection, while the A citations describe the state of the art, but are not immediate reasons for rejection. Differentiating between these categories of citations can be useful, for example, in retrieval tasks where we want to rank the patents by their relevance to the original document. If we know the relative ordering of each citation class, we can reorder the search results to highlight the most relevant documents. In the case of X and A citations, we should often rank X citations higher as they give stronger grounds for rejecting a patent application. A good similarity measure between patents should satisfy this.

Patents themselves consist of two main parts, claims and description, where the claims part describes the actual claims that are being made and the description part is a more free-form description of the invention overall. For this reason, the claims part is usually much shorter and less noisy than the description part, while the description part is more thorough and thus contains more fine-grained information. We evaluate the similarity measures for both cases to provide two parallel sets of results.

3.1 Data and evaluation

Encoding. We encode the patent documents using English 300-dimensional fastText embeddings [11] and form the covariance matrices of dimensionality 300 × 300 of each document as the document representation.

Training. For the models that require learning the parameters, we use the PyTorch library for gradient-based optimization, with 2000 samples as the training set. We use triplet loss as the loss function, setting one of the models as the distance function and the margin (chosen using hyperparameter optimization) to 0.5. The loss for one instance for the measure in Eq. (1) is then

\[
\mathcal{L}(A, P, N, S(\cdot)) = \max\big( D(A, P, S(\cdot)) - D(A, N, S(\cdot)) + 0.5, \; 0 \big),
\]

and for the neural network model it is

\[
\mathcal{L}(A, P, N) = \max\big( D_{NN}(A, P) - D_{NN}(A, N) + 0.5, \; 0 \big),
\]

where 𝐴 is the encoded original document, 𝑃 is the encoded X citation (the positive sample), 𝑁 is the encoded A citation (the negative sample), and 𝑆(·) is any of the previously defined norm-like models. Optimization is terminated once the result on the validation set decreases for three consecutive evaluations.

Evaluation. Finally we evaluate the trained model using a test set of 1000 triplets, measuring the distance from the anchor to both the positive and negative samples and counting how often the positive sample is closer to the anchor than the negative sample, i.e. the X citation ranks higher than the A citation. As the baseline, we use the standard mean vector combined with cosine similarity.

3.2 Results

Results for all methods are reported in Table 1. We first inspect the accuracy of the similarity measure using the standard Schatten 𝑝-norm. The main observation is that small values of 𝑝 are the best, so that 𝑝 = 1 is the best of the proper norms in both cases and the highest overall accuracy is obtained with quasinorms with 𝑝 < 1. The best 𝑝 clearly outperforms the baseline of mean vector and cosine similarity (Mean); for Claims we improve from 0.566 to 0.601 with 𝑝 = 0.2 and for Descriptions from 0.553 to 0.573 with 𝑝 = 0.5. Large 𝑝 are clearly worse and all 𝑝 > 3 are effectively equivalent to 𝑝 = ∞.

Rather than evaluating the metric for a range of 𝑝, we can just as well optimize over 𝑝. For both cases, the solution, denoted by 𝑆𝑜𝑝𝑡, slightly improves on the one chosen amongst the grid of alternatives as expected, and we get the optimal values of 𝑝 = 0.884 for Claims and 𝑝 = 0.327 for Descriptions. One technical aspect we note is that when 𝑝 ∈ (0, 1) the function is non-convex [16] and can have multiple local optima within this range, but we did not observe this to be a problem in practice.

The weighted extension of the Schatten 𝑝-norm of (3) is denoted here by 𝑆𝑤,𝑝. Figure 1 (a) illustrates the learned weights (as a function of iteration) for fixed 𝑝 = 1, demonstrating how the measure assigns more weight to the first 10 or so singular values. Figure 1 (b) illustrates the behavior of the weights and 𝑝 when optimized jointly, and reveals a quite different phenomenon: instead of small 𝑝 it is now better to use large 𝑝 and down-weight many of the early singular vectors. Even though this alternative way of measuring similarity is interesting, the empirical performance (on test data) is not ideal; both weighted versions outperform the mean baseline for Claims, and 𝑆𝑤,𝑝 also for Descriptions (where 𝑆𝑤,1 falls below it), but neither provides an improvement over 𝑆𝑜𝑝𝑡, and for Claims they remain worse. One advantage of these measures is that, as seen here, the similarity measures only depend on a fairly small number of eigenvalues; we here have 300 × 300 matrices but only need tens of eigenvalues to represent the distance, and hence only need to compute a subset of the eigenvalues.
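As a concrete illustration of Eqs. (1)–(3), the following NumPy sketch computes the normalized similarity from singular values. It is not the authors' implementation: the function names `schatten` and `similarity`, the use of `numpy.linalg.svd`, and the restriction to non-negative weights are assumptions made here for illustration, and 𝐴 and 𝐵 are simply taken to be real matrices of the same shape.

```python
import numpy as np


def schatten(s, p, w=None):
    """Schatten p-(quasi)norm of singular values s (Eq. 2).

    If w is given, each singular value is first scaled by w_n (Eq. 3).
    """
    s = np.asarray(s, dtype=float)
    if w is not None:
        s = np.asarray(w, dtype=float) * s  # assumption: non-negative weights
    if np.isinf(p):
        return float(s.max())  # operator norm as the p -> infinity limit
    return float(np.sum(s ** p) ** (1.0 / p))


def similarity(A, B, p=1.0, w=None):
    """Normalized similarity D(A, B, S) of Eq. (1) with a (weighted) Schatten p-norm."""
    def sv(M):
        # Singular values only, returned in descending order.
        return np.linalg.svd(M, compute_uv=False)

    num = schatten(sv(A.T @ B), p, w)
    den = np.sqrt(schatten(sv(A.T @ A), p, w) * schatten(sv(B.T @ B), p, w))
    return num / den
```

For single-column matrices and 𝑝 = 2 this reduces to the absolute value of the ordinary cosine similarity, which gives a quick sanity check of the construction.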
Figure 1: a) Development of singular value weights as a function of iterations for the model 𝑆 𝑤,1 . b) Development of the weights
and 𝑝 for the model 𝑆 𝑤,𝑝 . Only first 70 out of 300 weights are shown; the rest are effectively zero.
Dataset      Mean   S_0.1  S_0.2  S_0.5  S_1.0  S_1.5  S_2.0  S_3.0  S_5.0  S_∞    S_opt  S_w,1  S_w,p  D_NN
Claims       0.566  0.593  0.601  0.580  0.594  0.577  0.558  0.545  0.545  0.545  0.603  0.588  0.589  0.642
Description  0.553  0.549  0.558  0.573  0.520  0.504  0.496  0.487  0.482  0.482  0.574  0.525  0.574  0.652

Table 1: Numerical results. Mean shows the baseline of mean vector with cosine similarity. Free-form neural network model D_NN is clearly the best for both tasks.
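The neural measure of Eq. (4) and the triplet objective of Section 3.1 can be sketched in a similar way. The snippet below is only a forward pass in NumPy with random, untrained weights; the helper names (`init_params`, `d_nn`, `triplet_hinge`, `triplet_accuracy`) and the initialization scale are assumptions made here, and the PyTorch training loop is not reproduced. The hinge and the accuracy treat their inputs as distances, so smaller values mean closer.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def init_params(d, hidden=500, seed=0):
    """Random (untrained) weights and zero biases with the layer shapes of Eq. (4)."""
    rng = np.random.default_rng(seed)
    sizes = [(d, hidden), (hidden, hidden), (hidden, 1)]
    # Hypothetical initialization: Gaussian scaled by 1/sqrt(fan_in).
    return [(rng.normal(scale=m ** -0.5, size=(m, n)), np.zeros(n)) for m, n in sizes]


def d_nn(A, B, params):
    """Eq. (4): pass the singular values of A^T B through a small MLP with tanh output."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)  # shape (d,)
    (W1, b1), (W2, b2), (W3, b3) = params
    h = relu(relu(s @ W1 + b1) @ W2 + b2)
    return float(np.tanh(h @ W3 + b3)[0])


def triplet_hinge(d_pos, d_neg, margin=0.5):
    """Per-triplet loss max(d_pos - d_neg + margin, 0) with the margin 0.5 from the text."""
    return max(d_pos - d_neg + margin, 0.0)


def triplet_accuracy(pairs):
    """Fraction of (d_pos, d_neg) pairs in which the positive is closer than the negative."""
    pairs = list(pairs)
    return sum(d_pos < d_neg for d_pos, d_neg in pairs) / len(pairs)
```

A call such as `d_nn(A, B, init_params(d=A.shape[1]))` returns a value in [−1, 1] for two matrices with the same shape, matching the bounded range that the tanh output is meant to guarantee.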
The still more flexible neural network measure of Eq. (4), however, works very well. It has the highest accuracy for both Claims and Descriptions, with substantial improvement also over 𝑆𝑜𝑝𝑡. This verifies that singular values of 𝐴ᵀ𝐵 can be used as the basis for measuring similarity between documents more accurately than what the standard Schatten 𝑝-norm can reveal, and importantly the performance remains high also for the full-length documents (Descriptions) that are challenging for all other similarity measures.

4 CONCLUSIONS

We set out to investigate how similarity measures based on matrix norms work in document similarity comparisons in the context of patent retrieval. We focused on similarity measures based on singular values of the inner product of the two document matrices, motivated by the Schatten 𝑝-norm and the similarity measures induced by it. Our main contribution was introducing new parametric similarity measures that build on the same singular values but are fine-tuned for the specific task at hand, and we showed how a direct neural network mapping the singular values to a distance outperforms both the standard mean representation and our attempts at more constrained, and hence more interpretable, similarity measures.

While the investigation was done in the context of static embeddings and patent data, the applicability is not limited to these choices. Likely any full-document comparison task can benefit from richer representations, and rich contextual embeddings, such as the ones produced by transformer models, should enhance the results further.

REFERENCES

[1] Leonidas Aristodemou and Frank Tietze. 2018. The state-of-the-art on Intellectual Property Analytics (IPA): A literature review on artificial intelligence, machine learning and deep learning methods for analysing intellectual property (IP) data. World Patent Information 55 (2018), 37–51.
[2] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. In International conference on learning representations.
[3] Benjamin Balsmeier, Mohamad Assaf, Tyler Chesebro, Gabe Fierro, Kevin Johnson, Scott Johnson, Guan-Cheng Li, Sonja Lück, Doug O’Reagan, Bill Yeh, et al. 2018. Machine learning and natural language processing on the patent corpus: Data, tools, and new measures. Journal of Economics & Management Strategy 27, 3 (2018), 535–553.
[4] Hamid Bekamiri, Daniel S Hain, and Roman Jurowetzki. 2021. PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT. arXiv preprint arXiv:2103.11933 (2021).
[5] Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. 2019. On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae 37, 2 (2019), 165–191.
[6] Minmin Chen. 2017. Efficient vector representation for documents through corruption. arXiv preprint arXiv:1707.02377 (2017).
[7] Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv preprint arXiv:1805.01070 (2018).
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[9] Xiaochen Gao, Zhaoyi Hou, Yifei Ning, Kewen Zhao, Beilei He, Jingbo Shang, and Vish Krishnan. 2022. Towards Comprehensive Patent Approval Predictions: Beyond Traditional Document Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 349–372.
[10] Vivek Gupta, Ankit Saw, Pegah Nokhiz, Praneeth Netrapalli, Piyush Rai, and Partha Talukdar. 2020. P-sif: Document embeddings using partition averaging. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 7863–7870.
[11] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759 (2016).
[12] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International conference on machine learning. PMLR, 957–966.
[13] Jarkko Lagus, Janne Sinkkonen, Arto Klami, et al. 2019. Low-rank approximations of second-order document representations. In Proceedings of the 23rd Conference
on Computational Natural Language Learning (CoNLL). ACL.
[14] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning. PMLR, 1188–1196.
[15] Jieh-Sheng Lee and Jieh Hsiang. 2020. Patent classification by fine-tuning BERT language model. World Patent Information 61 (2020), 101965.
[16] Fanhua Shang, Yuanyuan Liu, Fanjie Shang, Hongying Liu, Lin Kong, and Licheng Jiao. 2020. A unified scalable equivalent formulation for schatten quasi-norms. Mathematics 8, 8 (2020), 1325.
[17] Or Sharir, Barak Peleg, and Yoav Shoham. 2020. The cost of training nlp models: A concise overview. arXiv preprint arXiv:2004.08900 (2020).
[18] Marwan Torki. 2018. A document descriptor using covariance of word vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 527–532.
[19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[20] Yuan Xie, Shuhang Gu, Yan Liu, Wangmeng Zuo, Wensheng Zhang, and Lei Zhang. 2016. Weighted Schatten 𝑝-norm minimization for image denoising and background subtraction. IEEE transactions on image processing 25, 10 (2016), 4842–4857.
[21] Hengmin Zhang, Jianjun Qian, Bob Zhang, Jian Yang, Chen Gong, and Yang Wei. 2019. Low-Rank Matrix Recovery via Modified Schatten-𝑝 Norm Minimization With Convergence Guarantees. IEEE Transactions on Image Processing 29 (2019), 3132–3142.