Data-Driven Supervised Learning for Life Science Data

Published in: Frontiers in Applied Mathematics and Statistics
DOI: 10.3389/fams.2020.553000
Life science data are often encoded in a non-standard way by means of alpha-numeric sequences, graph representations, numerical vectors of variable length, or other formats. Domain-specific or data-driven similarity measures like alignment functions have been employed with great success. The vast majority of more complex data analysis algorithms require fixed-length vectorial input data, asking for substantial preprocessing of life science data. Data-driven measures are widely ignored in favor of simple encodings. These preprocessing steps are not always easy to perform nor particularly effective, with a potential loss of information and interpretability. We present some strategies and concepts of how to employ data-driven similarity measures in the life science context and other complex biological systems. In particular, we show how to use data-driven similarity measures effectively in standard learning algorithms.

Keywords: similarity based learning, non-metric learning, kernel methods, indefinite learning, gershgorin circles

Edited by: Andre Gruning, University of Surrey, United Kingdom
Reviewed by: Anastasiia Panchuk, Institute of Mathematics (NAN Ukraine), Ukraine; Axel Hutt, Inria Nancy - Grand-Est Research Centre, France
*Correspondence: Frank-Michael Schleif, [email protected]
Specialty section: This article was submitted to Dynamical Systems, a section of the journal Frontiers in Applied Mathematics and Statistics
Received: 17 April 2020; Accepted: 24 September 2020; Published: 06 November 2020
Citation: Münch M, Raab C, Biehl M and Schleif F-M (2020) Data-Driven Supervised Learning for Life Science Data. Front. Appl. Math. Stat. 6:553000. doi: 10.3389/fams.2020.553000

INTRODUCTION

Life sciences comprise a broad research field with challenging questions in domains such as (bio-)chemistry, biology, environmental research, or medicine. Not only do recent technological developments allow the generation of large, high-dimensional and very complex data sets in these fields, but also the structure of the measured data representing an object of interest is often challenging. The data may be compositional, such that classical vectorial functions are not easy to apply, and could also be very heterogeneous by combining different measurement sources. Accordingly, new strategies and algorithms are needed to cope with the complexity of life science applications. In general, it is a promising way to reflect characteristic data properties in the employed data processing pipeline. This typically leads to increased performance in tasks such as clustering, classification, and non-linear regression, which are commonly addressed by machine learning methods. One possible way to achieve this is to adapt the used metric according to the underlying data properties and application, respectively [1]. Basically, all machine learning and data analysis algorithms employ the comparison of objects, referred to as similarities or dissimilarities, or more generally as proximities. Hence, the representation of these proximities is a crucial part. These measures enter the modeling algorithm either by means of distance measures, e.g., in the standard k-means algorithm, or by inner products as employed in the famous support vector machine (SVM) [2]. The calculation of these proximities is typically based on a vectorial representation of the input data. If the used machine learning approach is solely based on proximities, a vectorial representation is in general not needed; the pairwise proximity values are sufficient. This approach is referred to as similarity-based learning, where the data are represented by metric pairwise similarities only.
TABLE 1 | List of commonly used non-metric proximity measures in various domains.

Measure                                   Application field
Dynamic Time Warping (DTW) [6]            Time series or spectral alignment
Inner distance [7]                        Shape retrieval, e.g., in robotics
Compression distance [8]                  Generic, used also for text analysis
Smith-Waterman alignment [5]              Bioinformatics
Divergence measures [9]                   Spectroscopy and audio processing
Generalized Lp norm [10]                  Time series analysis
Non-metric modified Hausdorff [11]        Template matching
(Domain-specific) alignment score [12]    Mass spectrometry

We can distinguish similarities, indicating how close or similar two items are to each other, and dissimilarities in the opposite sense. In the following, we expect that these proximities are at least symmetric, but do not necessarily obey metric properties. See, e.g., [3] for an extended discussion.

Non-metric measures are common in many disciplines and occasionally entail so-called non positive semi-definite (non-psd) kernels if a similarity measure is used. This is particularly interesting because many classical learning algorithms can be kernelized [4], but are still expecting a psd measure. As we will outline in this paper, we can be more flexible in the use of a proximity measure as long as some basic assumptions are fulfilled. In particular, it is not necessary, for many real-world life science data, to restrict the analysis pipeline to a vectorial Euclidean representation of the data.

In various domains, like spectroscopy, high throughput sequencing, or medical image analysis, domain-specific measures have been designed and effectively used. Classical sequence alignment functions (e.g., Smith-Waterman [5]) produce non-metric proximity values. There are many more examples and use cases, as listed in Table 1 and detailed later on.

Multiple authors argue that the non-metric part of the data contains valuable information and should not be removed [13, 14]. In this work, we highlight recent achievements in the field of similarity-based learning for non-metric measures and provide conceptual and experimental evidence on a variety of scenarios that non-metric measures are legal and effective tools in analyzing such data. We argue that a restriction to mathematically more convenient, but from the data perspective unreliable, measures is not needed anymore.

Along this line, we first provide an introduction to similarity-based learning in non-metric spaces. Then we provide an outline and discussion of preprocessing techniques, which can be used to implement a non-metric similarity measure within a classical analysis pipeline. In particular, we highlight a novel advanced shift correction approach. Here we extend prior work published by the authors in [15] with novel theoretical findings (Section 2.4, in particular, the eigenvalue approximation via Gershgorin), experimental results (Section 3, with additional experiments and datasets), and an extended discussion. The highlights of this paper:

• We provide a broad study of life science data encoded by proximities only.
• We reveal the limitations of former encodings used to enable standard kernel methods.
• We derive a novel encoding concept widely preserving the data's desired properties while showing considerable performance.
• We improve the efficiency of the encodings using an approximation concept not considered so far, with almost no loss of performance in the classification process.

In the experiments, we show the effectiveness of appropriately preprocessed non-metric measures in a variety of real-life use cases. We conclude by a detailed discussion and provide practical advice in applying non-metric proximity measures in the analysis of life science data.

MATERIALS AND METHODS

Notation and Basic Concepts
Given a set of N data items (like N spectral measurements or N sequences), their pairwise proximity (similarity or dissimilarity) measures can be conveniently summarized in a N × N proximity matrix. These proximities can be very generic in practical applications, but most often come either in the form of symmetric similarities or dissimilarities only. Focusing on one of the respective representation forms is not a substantial restriction. As outlined in [16], a conversion from dissimilarities to similarities is cheap with respect to computational costs. Also, an out-of-sample extension can be easily provided. In the following, we will refer to similarity and dissimilarity type proximity matrices as S and D, respectively. These notions enter into models by means of proximity or score functions f(x, x') ∈ R, where x and x' are the compared objects (both are data items). The objects x, x' may exist in a d-dimensional vector space, so that x ∈ R^d, but can also be given without an explicit vectorial representation, e.g., as biological sequences.

As outlined in [17], the majority of analysis algorithms are applicable only in a tight mathematical setting. In particular, it is expected that f(x, x') obeys a variety of properties. If f(x, x') is a dissimilarity measure, it is often assumed to be a metric measure. Many algorithms become invalid or do not converge if f(x, x') does not fulfill metric properties.

For example, the support vector machine formulation [18] no longer leads to a convex optimization problem [19] when the given input data is non-metric. Prominent solvers, such as sequential minimal optimization (SMO), will converge to only a local optimum [20, 21] and other kernel algorithms may not converge at all. Accordingly, dedicated strategies for non-metric data are very desirable.

The score function f(x, x') could violate the metric properties to different degrees. In general, it is at least expected that f(x, x') obeys the symmetry property such that f(x, x') = f(x', x). This property is a fundamental condition, because a large number of algorithms become meaningless for asymmetric data. We will also make this assumption. In the considered cases, the proximities are either already symmetric or can be symmetrized without expecting a negative impact.
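A minimal numpy sketch of this preprocessing step, using a random score matrix as a stand-in for a domain-specific proximity function (all names and data here are illustrative and not part of the original study): symmetrize the scores and inspect the eigenspectrum to see whether the resulting similarity matrix is psd.

import numpy as np

rng = np.random.default_rng(42)
F = rng.normal(size=(50, 50))          # stand-in for raw pairwise scores f(x, x')
S = (F + F.T) / 2.0                    # enforce symmetry: f(x, x') = f(x', x)
eigvals = np.linalg.eigvalsh(S)        # eigenvalues of the symmetric matrix
print("smallest eigenvalue:", eigvals.min())
print("similarity matrix is psd:", eigvals.min() >= -1e-12)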
While symmetry is a reasonable assumption, the triangle inequality is frequently violated, proximities become negative, or self-dissimilarities are not zero. Such violations can be attributed to noise, as addressed in [22], or are a natural property of the proximity function f.

If noise is the source, often a simple eigenvalue correction [23] can be used, although this can become costly for large datasets. As we will see later on, the noise may cause eigenvalue contributions close to zero. A simple way to eliminate these contributions is to calculate a low-rank approximation of the matrix, which can be realized with small computational cost [24, 25]. In particular, the small eigenvalues could become negative, also leading to problems in the use of classical learning algorithms. A recent analysis of the possible sources of negative eigenvalues is provided in [26]. Such an analysis is particularly helpful in selecting the appropriate eigenvalue correction method applied to the proximity matrix. Non-metric proximity measures are part of the daily work in various domains [27]. An area frequently applying such non-metric proximity measures is the field of bioinformatics, spectroscopy, or alike, where classical sequence alignment algorithms (e.g., Smith-Waterman [5]) produce non-metric proximity values. For such data, some authors argue that the non-metric part of the data contains valuable information and should not be removed [13]. In particular, this is the motivation for our work. Evaluating such data with machine learning models typically asks for discriminative models. In particular, for classification tasks, a separating plane has to be determined in order to separate the given data according to their classes. However, in practice, a linear plane in the original feature space is rarely separating two classes of such complexity. A common generalization is to map the training vectors x_i into a higher dimensional space by a function ϕ. In this space, it is expected that the machine learning model finds a linear separating hyperplane with a maximal margin. The principle behind such a so-called kernel function is explained in more detail in Section 2.1.1. In our setting, the mapping is provided by some data-driven similarity function, which, however, may not lead to a psd kernel and hence has to be preprocessed (for more details, see Section 2.1.4). As a primal representation, we will focus on similarities because the wide majority of algorithms is specified in the kernel space. A brief introduction is given in the following section. (For data given as a dissimilarity matrix, the associated similarity matrix can be obtained, in a non-destructive way, by double centering [17] of the dissimilarity matrix: S = -JDJ/2 with J = (I - 11^T/N), identity matrix I and vector of ones 1.)

Kernels and Kernel Functions
Let X be a collection of N objects x_i, i = 1, 2, . . . , N, in some input space. Further, let ϕ : X → H be a mapping of patterns from X to a high-dimensional or infinite-dimensional Hilbert space H equipped with the inner product ⟨·, ·⟩_H. The transformation ϕ is, in general, a non-linear mapping to a high-dimensional space H and may commonly not be given in an explicit form. Instead, a kernel function k : X × X → R is given which encodes the inner product in H. The kernel k is a positive (semi-)definite function such that k(x, x') = ⟨ϕ(x), ϕ(x')⟩_H for any x, x' ∈ X. The matrix K := Φ^T Φ is an N × N kernel matrix derived from the training data, where Φ := [ϕ(x_1), . . . , ϕ(x_N)] is a matrix of images (column vectors) of the training data in H. The motivation for such an embedding comes with the hope that the non-linear transformation of input data into the higher dimensional H allows for using linear techniques in H. Kernelized methods process the embedded data points in a feature space utilizing only the inner products ⟨·, ·⟩_H (kernel trick) [28], without the need to calculate ϕ explicitly. The specific kernel function can be very generic, but in general the kernel is expected to fulfill the Mercer conditions [28]. Most prominent are the linear kernel with k(x, x') = x^T x' as the Euclidean inner product, or the RBF kernel k(x, x') = exp(-‖x - x'‖² / (2σ²)), with σ as a free parameter.

Support Vector Machine
In this paper, we address data-driven supervised learning; accordingly, our focus is primarily on a domain-specific representation of the data by means of a generic similarity measure. There are many approaches for similarity-based learning and, in particular, kernel methods [28]. We will evaluate our data-driven encodings employing the support vector machine (SVM) as a state-of-the-art supervised kernel method.

Let x_i ∈ X, i ∈ {1, . . . , N} be training points in the input space X, with labels y_i ∈ {-1, 1} representing the class of each point. (In case of more than two classes we use the one-vs-all approach.) The input space X is often considered to be R^d but can be any suitable space due to the kernel trick. For a given positive penalization term C, the SVM is the minimum of the following regularized empirical risk functional:

    min_{ω,ξ,b}  (1/2) ω^T ω + C ∑_{i=1}^{N} ξ_i        (1)

subject to y_i (ω^T ϕ(x_i) + b) ≥ 1 - ξ_i and ξ_i ≥ 0. Here ω is the parameter vector of a separating hyperplane and b a bias term. The variables ξ_i are so-called slack variables. The goal is to find a hyperplane that correctly separates the data while maximizing the sum of distances to the closest positive and negative points (the margin). The parameter C controls the weight of the classification errors (C = ∞ in the separable case). Details can be found in [28].

In case of a positive semi-definite kernel function without metric violations, the underlying optimization problem is easily solved using, e.g., the Sequential Minimal Optimization algorithm [20]. The objective of a SVM is to derive a model from the training set which predicts class labels of unclassified feature sets in the test data. The decision function is given as:

    f(x) = ∑_{i=1}^{N} y_i α_i k(x_i, x) + b,

where the α_i are the optimized Lagrange parameters of the dual formulation of Eq. 1. In case of a non-psd kernel function, the optimization problem of a SVM is no longer convex and only a local optimum is obtained [19, 21]. As a result, the trained SVM model can become inaccurate and incorrect. However, as we will see in Section 2.1.4, there are several methods to handle non-psd kernel matrices within a classical SVM.
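The following sketch shows how such a precomputed similarity matrix can be plugged into a standard SVM implementation. It assumes scikit-learn, a toy dissimilarity matrix, and the double-centering conversion S = -JDJ/2 mentioned above; it is an illustration under these assumptions, not the original experimental code.

import numpy as np
from sklearn.svm import SVC

def double_center(D):
    # S = -J D J / 2 with J = I - 11^T/N, converting dissimilarities to similarities
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    return -J @ D @ J / 2.0

# toy symmetric dissimilarity matrix with zero diagonal (stand-in for alignment scores)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)   # pairwise L1 distances
y = (X[:, 0] > 0).astype(int)

S = double_center(D)                       # similarity matrix used as the kernel
clf = SVC(C=1.0, kernel="precomputed")     # SVM working on pairwise similarities only
clf.fit(S, y)
print("training accuracy:", clf.score(S, y))

If S is not psd, the solver still runs, but only a local optimum is guaranteed; this is exactly the situation the corrections discussed later address.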
Representation in the Krein Space
A Krein space is an indefinite inner product space endowed with a Hilbertian topology. Let K be a real vector space. An inner product space with an indefinite inner product ⟨·, ·⟩_K on K is a bi-linear form where all f, g, h ∈ K and α ∈ R obey the following conditions:

• Symmetry: ⟨f, g⟩_K = ⟨g, f⟩_K;
• Linearity: ⟨αf + g, h⟩_K = α⟨f, h⟩_K + ⟨g, h⟩_K;
• ⟨f, g⟩_K = 0 for all g ∈ K implies f = 0.

An inner product is positive semi-definite if ∀f ∈ K, ⟨f, f⟩_K ≥ 0, negative definite if ∀f ∈ K, ⟨f, f⟩_K < 0, otherwise it is indefinite. A vector space K with inner product ⟨·, ·⟩_K is called an inner product space.

An inner product space (K, ⟨·, ·⟩_K) is a Krein space if we have two Hilbert spaces H+ and H- spanning K such that ∀f ∈ K we have f = f+ + f- with f+ ∈ H+ and f- ∈ H-, and ∀f, g ∈ K, ⟨f, g⟩_K = ⟨f+, g+⟩_{H+} - ⟨f-, g-⟩_{H-}.

As outlined before, indefinite kernels are typically observed by means of domain-specific non-metric similarity functions (such as alignment functions used in biology [29]), by specific kernel functions, e.g., the Manhattan kernel k(x, x') = -‖x - x'‖_1, the tangent distance kernel [30], or divergence measures plugged into standard kernel functions [9]. A finite-dimensional Krein space is a so-called pseudo-Euclidean space.

Given a symmetric dissimilarity matrix with zero diagonal, an embedding of the data in a pseudo-Euclidean vector space determined by the eigenvector decomposition of the associated similarity matrix S is always possible [31], as mentioned above, e.g., by a prior double centering. Given the eigendecomposition of S: S = U Λ U^T, we can compute the corresponding vectorial representation V in the pseudo-Euclidean space by

    V = U_{p+q+z} |Λ_{p+q+z}|^{1/2}        (2)

where Λ_{p+q+z} consists of p positive and q negative non-zero eigenvalues and z zero eigenvalues, and U_{p+q+z} consists of the corresponding eigenvectors. The triplet (p, q, z) is also referred to as the signature of the pseudo-Euclidean space. A detailed presentation of similarity and dissimilarity measures and mathematical aspects of metric and non-metric spaces is provided in [17, 32, 33].
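A small sketch of this embedding (Eq. 2) and of how the signature (p, q, z) can be read off an indefinite similarity matrix; the function name, the tolerance, and the toy matrix are illustrative assumptions.

import numpy as np

def pseudo_euclidean_embedding(S, tol=1e-10):
    # eigendecomposition S = U Lambda U^T, then V = U |Lambda|^(1/2) (Eq. 2)
    eigvals, U = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues descending
    eigvals, U = eigvals[order], U[:, order]
    V = U * np.sqrt(np.abs(eigvals))             # scale each eigenvector column
    p = int(np.sum(eigvals > tol))               # positive eigenvalues
    q = int(np.sum(eigvals < -tol))              # negative eigenvalues
    z = eigvals.size - p - q                     # (near-)zero eigenvalues
    return V, (p, q, z)

# toy symmetric, indefinite similarity matrix
S = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 2.0],
              [1.0, 2.0, 0.0]])
V, signature = pseudo_euclidean_embedding(S)
print("signature (p, q, z):", signature)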
Indefinite Proximity Functions
Proximity functions can be very generic but are often restricted to fulfill metric properties to simplify the mathematical modeling and especially the parameter optimization. In [32], a large variety of such measures was reviewed, and basically most common methods nowadays still make use of metric properties. While this appears to be a reliable strategy, researchers in the fields of, e.g., psychology [34, 35], vision [14, 26, 36, 37] and machine learning [13, 38] have criticized this restriction as inappropriate in multiple cases. In fact, in [38] it was shown that many real-life problems are better addressed by proximity measures which are not restricted to be metric.

The triangle inequality is frequently violated if we consider object comparisons in daily life problems, like the comparisons of text documents, biological sequence data, spectral data or graphs [23, 39, 40]. These data are inherently compositional and a representation as explicit (vectorial) features leads to information loss. As an alternative, tailored dissimilarity measures such as pairwise alignment functions, kernels for structures, or other domain-specific similarity and dissimilarity functions can be used as an interface to the data [41, 42]. Also for vectorial data, non-metric proximity measures are quite common in some disciplines. An example of this type is the use of divergence measures [9, 43, 44], which are very popular for spectral data analysis in chemistry, geo- and medical sciences [45-49], and are not metric in general. Also the popular Dynamic Time Warping (DTW) [6] algorithm provides a non-metric alignment score, which is often used as a proximity measure between two one-dimensional functions of different lengths. In image processing and shape retrieval, indefinite proximities are often obtained by means of the inner distance. This measure specifies the dissimilarity between two objects which are represented by their shape only. Thereby, several seeding points are used and the shortest paths within the shape are calculated, in contrast to the Euclidean distance between the landmarks. Further examples can be found in physics, where problems of the special relativity theory or other research topics naturally lead to indefinite spaces [50].

A list of non-metric proximity measures is provided in Table 1 and some are exemplarily illustrated in Figures 1 and 2. Most of these measures are very popular but often violate the symmetry or triangle inequality condition or both. Hence many standard proximity-based machine learning methods like kernel methods are not easily accessible for these data.

Eigenspectrum Corrections
Although native models for indefinite learning are available (see e.g., [27, 51, 52]), they are not frequently used. This is mainly due to three reasons: 1) the proposed algorithms have, in general, quadratic or cubic complexity [53], 2) the obtained models are non-sparse [54], and 3) the methods are complicated to implement [27, 55]. Considering the wide spread of machine learning frameworks, it would be very desirable to use the algorithms implemented therein, like an efficient support vector machine, instead of having the burden to implement another algorithm and, in general, another numerical solver. Therefore, we focus on eigenspectrum corrections, which can be effectively done in a large number of frameworks without much effort.

A natural way to address the indefiniteness problem and to obtain a psd similarity matrix is to correct the eigenspectrum of the original similarity matrix S. Popular strategies include eigenvalue correction by flipping, clipping, squaring, and shifting. The non-psd similarity matrix S is decomposed by an eigendecomposition: S = U Λ U^T, where U contains the eigenvectors of S and Λ contains the corresponding eigenvalues λ_i. Now, the eigenvalues in Λ can be manipulated to eliminate all negative parts. After the correction, the matrix can be reconstructed, now being psd.
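The following sketch implements this generic recipe (decompose, manipulate the eigenvalues, reconstruct) for the four strategies discussed in the next subsections. It is a plain O(N³) illustration under the stated assumptions, not the optimized procedure proposed later in this paper.

import numpy as np

def correct_eigenspectrum(S, mode="clip"):
    # decompose S = U Lambda U^T, manipulate the eigenvalues, reconstruct
    eigvals, U = np.linalg.eigh(S)
    if mode == "clip":                         # set negative eigenvalues to 0
        eigvals = np.maximum(eigvals, 0.0)
    elif mode == "flip":                       # use absolute eigenvalues
        eigvals = np.abs(eigvals)
    elif mode == "square":                     # square the eigenvalues
        eigvals = eigvals ** 2
    elif mode == "shift":                      # classical shift by |lambda_min|
        eigvals = eigvals - min(eigvals.min(), 0.0)
    else:
        raise ValueError("unknown correction: " + mode)
    return (U * eigvals) @ U.T                 # U diag(Lambda) U^T

# usage: S_psd = correct_eigenspectrum(S, mode="flip")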
FIGURE 1 | Visualization of data-driven data description scenarios. In (A) for some Vibrio bacteria and in (B) for Chromosome data. Both datasets are used in the
experiments.
FIGURE 2 | Preprocessing workflow for creating the Tox-21 datasets. Chemicals represented as SMILE codes are translated to Morgan Fingerprints. The kernel is
created by using an application related pairwise similarity measure on the Morgan Fingerprints, in this case so-called Kulczynski.
FIGURE 3 | Visualization of the various preprocessing techniques for a generic eigenspectrum as obtained from a generic similarity matrix. The black line illustrates
the impact of the respective correction method on the eigenspectrum without reordering of the eigenvalues. (A) Visualization of a sample eigenspectrum with pos./neg.
eigenvalues. (B) Preprocessing of the eigenspectrum from Figure 3A using clip. (C) Preprocessing of the eigenspectrum from Figure 3A using flip. (D) Preprocessing of
the eigenspectrum from Figure 3A using shift.
Clip Eigenvalue Correction
All negative eigenvalues in Λ are set to 0 (see Figure 3B). The spectrum clip leads to the nearest psd matrix S in terms of the Frobenius norm [56]. Such a correction can be achieved by an eigendecomposition of the matrix S, a clipping operator on the eigenvalues, and the subsequent reconstruction. This operation has a complexity of O(N³). The complexity might be reduced by either a low-rank approximation or the approach shown by [22] with roughly quadratic complexity.
Flip Eigenvalue Correction
All negative eigenvalues in Λ are set to λ_i := |λ_i| ∀i, which at least keeps the absolute values of the negative eigenvalues and keeps potentially relevant information [17]. This operation can be calculated with O(N³), or O(N²) if low-rank approaches are used. Flip is illustrated in Figure 3C.

Square Eigenvalue Correction
All eigenvalues in Λ are set to λ_i := λ_i² ∀i, which amplifies large eigenvalues and fades out very small ones. The square eigenvalue correction can be achieved by matrix multiplication [57] with ≈ O(N^2.8).

Classical Shift Eigenvalue Correction
The shift operation was already discussed earlier by different researchers [58] and modifies Λ such that λ_i := λ_i - λ_min ∀i, where λ_min = min_j λ_j. The classical shift eigenvalue correction can be accomplished with linear costs if the smallest eigenvalue λ_min is known. Otherwise, some estimator for λ_min is needed. A few estimators for this purpose have been suggested: analyzing the eigenspectrum on a subsample, making a reasonable guess, or using some low-rank eigendecomposition. In our approach, we suggest employing a power iteration method, for example the von Mises approach, which is fast and accurate [59], or using the Gershgorin circle theorem [60, 61].

A spectrum shift enhances all the self-similarities and, therefore, the eigenvalues by the amount of λ_min and does not change the similarity between any two different data points. However, it may also increase the intrinsic dimensionality of the data space and amplify noise contributions, as shown in Figure 3D. As already mentioned by [23], small eigenvalue contributions could be linked to noise in the original data. If an eigencorrection step now amplifies tiny eigenvalues, this can be considered as a noise amplification.

Limitations
Multiple approaches have been suggested to correct a similarity matrix's eigenspectrum to obtain a psd matrix [17, 27]. Most approaches modify the eigenspectrum in a radical way and are also costly due to an involved cubic eigendecomposition. In particular, the flip, square and clip operators have an apparent strong impact. The flip operator affects all negative eigenvalues by changing the sign, and this will additionally lead to a reorganization of the eigenvalues. The square operator is similar to flip but additionally emphasizes large eigencontributions while fading out eigenvalues below 1. The clip method is useful in case of noise; however, it may also remove valuable contributions. The clip operator only removes eigenvalues, but generally keeps the majority of the eigenvalues unaffected. The classical shift is another alternative operator, changing only the diagonal of the similarity matrix and leading to a shift of the whole eigenspectrum by the provided offset. This may also lead to reorganizations of the eigenspectrum due to new non-zero eigenvalue contributions. While this simple approach seems to be very reasonable, it has the significant drawback that all (!) eigenvalues are shifted, which also affects small or even 0 eigenvalue contributions. While 0 eigenvalues have no contribution in the original similarity matrix, they are artificially raised by the classical shift operator. This may introduce a large amount of noise in the eigenspectrum, which could potentially lead to substantial numerical problems for employed learning algorithms, for example kernel machines. If we consider the number of non-vanishing eigenvalues as a rough estimate of the intrinsic dimension of the data, a classical shift will increase this value. This may accelerate the curse of dimension problem on this modified data [62].

Advanced Shift Correction
To address the aforementioned challenges, we suggest an alternative formulation of the shift correction, subsequently referred to as advanced shift. In particular, we would like to keep the original eigenspectrum structure and aim for a sub-cubic eigencorrection. As mentioned in Section 2.3, the classical shift operator introduces noise artifacts for small eigenvalues. In the advanced shift procedure, we will remove these artificial contributions by a null space correction. This is particularly effective if non-zero, but small, eigenvalues are also taken into account. Accordingly, we apply a low-rank approximation of the similarity matrix as an additional preprocessing step. The procedure is summarized in Algorithm 1.

The first part of the algorithm applies a low-rank approximation to the input similarities S using a restricted SVD or another technique [63]. If the number of samples N ≤ 1000, then the rank parameter is k = 30, otherwise k = 100 (the settings for k are taken as a rule of thumb without further fine-tuning). The shift parameter λ is calculated on the low-rank approximated matrix, using a von Mises or power iteration [59] to determine the respective largest negative eigenvalue of the matrix. As shift parameter, we use the absolute value of λ for further steps. This procedure provides an accurate estimate of the largest negative eigenvalue, instead of making an educated guess as frequently suggested [51]. This is particularly relevant because the scaling of the eigenvalues can be very different between the various datasets, which may lead to an ineffective shift (still with negative eigenvalues left) if the guess is incorrect. The basis B of the nullspace is calculated, again by a restricted SVD. The nullspace matrix N is obtained by calculating the product N = B · B^T. Due to the low-rank approximation, we ensure that small eigenvalues, which are indeed close to 0 due to noise, are shrunk to 0 [64]. In the final step, the original S or the respective low-rank approximated matrix S is shifted by the largest negative eigenvalue λ that is determined by von Mises iteration. By combining the shift with the nullspace matrix N and the identity matrix I, the whole matrix, and not only the diagonal, is affected by the shift. Finally, the doubled shift factor 2 ensures that the largest negative eigenvalue λ of the new matrix S* will not become 0, but is kept as a contribution.

Complexity: The advanced shift approach shown in Algorithm 1 is comprised of various subtasks with different complexities. The low-rank approximation can be achieved with O(N²), as well as the nullspace approximation. The shift parameter is calculated by von Mises iteration with O(N²). Since B is a rectangular N × k matrix, the matrix N can be calculated with O(N²). The final shift operation consists of scalings and additions of N × N matrices and therefore also stays within O(N²).
Algorithm 1: Advanced shift eigenvalue correction.

Advanced_shift(S, k)
    if approximate to low rank then
        S := LowRankApproximation(S, k)
    end if
    λ := |ShiftParameterDetermination(S)|
    B := NullSpace(S)
    N := B · B^T
    S* := S + 2 · λ · (I - N)
    return S*
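A compact Python sketch of Algorithm 1 under the following assumptions: the nullspace basis is taken from a restricted eigendecomposition instead of an SVD routine, the exact shift parameter is computed with a dense eigensolver rather than a von Mises iteration, and the Gershgorin circle bound is included as the cheap alternative estimator mentioned in the text. Function names are illustrative.

import numpy as np

def low_rank_approximation(S, k):
    # keep only the k largest-magnitude eigenvalue contributions of S
    eigvals, U = np.linalg.eigh(S)
    idx = np.argsort(np.abs(eigvals))[::-1][:k]
    return (U[:, idx] * eigvals[idx]) @ U[:, idx].T

def shift_parameter(S):
    # |lambda| of the largest negative eigenvalue (0 if S is already psd)
    return abs(min(np.linalg.eigvalsh(S).min(), 0.0))

def gershgorin_shift_parameter(S):
    # Gershgorin circles: lambda_min >= min_i (S_ii - sum_{j != i} |S_ij|),
    # so this conservative bound can serve as a cheap shift parameter
    radii = np.abs(S).sum(axis=1) - np.abs(np.diag(S))
    return abs(min((np.diag(S) - radii).min(), 0.0))

def advanced_shift(S, k=30, low_rank=True, use_gershgorin=False):
    if low_rank:
        S = low_rank_approximation(S, k)
    lam = gershgorin_shift_parameter(S) if use_gershgorin else shift_parameter(S)
    eigvals, U = np.linalg.eigh(S)
    B = U[:, np.abs(eigvals) < 1e-10]          # basis of the (numerical) nullspace
    N = B @ B.T                                # nullspace projector
    return S + 2.0 * lam * (np.eye(S.shape[0]) - N)

With the nullspace projector N, the shift 2λ(I - N) leaves directions with exactly zero eigenvalue untouched, which is the structural difference to the classical shift.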
such that their contribution is still irrelevant after the correction. The green rectangle (dotted line) highlights the positive parts of the eigenvalues, whose contribution should also be kept unchanged in order not to manipulate the eigenspectrum too aggressively. Figure 5B shows the low-rank representation of the original data of Figure 5A. Here, the major negative and major positive eigenvalues (red/solid and green/dotted rectangle) are still present, but many eigenvalues that had been close to zero before have now been set to exactly 0 (black/dashed rectangle).

Figures 5C and D show the eigenvalues after applying the clip operator to the eigenvalues shown in Figures 5A and B. In both cases, the major positive eigenvalues (green/dotted rectangle) remain unchanged, as well as the positive values close to 0 and exactly 0. However, the negative eigenvalues close to 0 (parts of the orange/dashed rectangle) and, in particular, the major negative eigenvalues (red/solid rectangle) are all set to exactly 0. By using the clip operator, the contribution to the eigenspectrum of both major negative and slightly negative eigenvalues is completely eliminated.

In contrast to clipping, the flip corrector preserves the contribution of the negative and slightly negative eigenvalues, shown in Figures 5E and F. When using the flip corrector, only the negative sign of the eigenvalue is changed; thus, only the diagonal eigenvalue matrix is changed and not the eigenvectors. Since the square operator behaves almost analogously to the flip operator and only squares the negative eigenvalues in addition to flipping them, it is not listed separately here. Squaring the values of a matrix drastically increases the impact of the major eigenvalues compared to the minor eigenvalues. If an essential part of the data's information is located in the small eigenvalues, this part gets a proportionally reduced contribution against the significantly increased major eigenvalues.

The modified eigenspectra after application of the classical shift operator are presented in Figures 5G and H: by increasing all eigenvalues of the spectrum, the part with the larger negative eigenvalues (red/solid rectangle) that had a higher impact now only remains with zero or close to zero contribution. Furthermore, a higher contribution was assigned to those eigenvalues that previously had no or nearly no effect on the eigenspectrum (orange/dashed rectangle). As a result, the classical shift increases the number of non-zero eigencontributions by introducing artificial noise into the data. The same is also evident for the advanced shift without low-rank approximation depicted in Figure 5I. Since there are many eigenvalues close to zero, but not exactly zero, in this data set, all these eigenvalues are also increased in the advanced shift, but this can be cured in the low-rank approach.

Unlike the advanced shift approach without low-rank approximation, depicted in Figure 5I, a low-rank representation of the data leads to a shifting of only those eigenvalues that had relevant contributions before (red/solid rectangle). Eigenvalues with previously nearly zero contribution (orange/dashed rectangle) derive a contribution of exactly zero by the approximation and are therefore not shifted in the advanced shift method.

Considering the description of structure preservation outlined in Section 2.4, we observe that only the flip and the advanced shift correction (only with low-rank approximation) widely preserve the structure of the given eigenspectrum. For all other methods, the eigenspectrum is substantially modified; in particular, contributions are removed, amplified, or artificially introduced. In particular, this also holds for the clip or the classical shift corrector, which, however, are frequently recommended in the literature. Although this section contained results exclusively for the protein dataset, we observed similar findings for other indefinite datasets as well. Our findings show that a more sophisticated treatment of the similarity matrix is needed to obtain a suitable psd matrix. This makes our method more appropriate compared to simpler approaches such as the classic shift or clip.

Materials & Experimental Setup
This section contains a series of experiments to highlight the effectiveness of our approach in combination with a low-rank approximation.
FIGURE 5 | Visualizations of the protein data’s eigenspectra after applying various correction methods. (A) Visualization of the original eigenspectrum with pos. and
neg. eigenvalues of the protein dataset. (B) Low-rank representation of the original eigenspectrum from Figure 5A. (C) Visualization of the original eigenspectrum of
Figure 5A after clipping all neg. eigenvalues. (D) Visualization of the low-rank approximated eigenspectrum after clipping all neg. eigenvalues. (E) Visualization of the
original eigenspectrum of Figure 5A after flipping all neg. eigenvalues. (F) Visualization of the low-rank approximated eigenspectrum after flipping all neg.
eigenvalues. (G) Visualization of the original eigenspectrum of Figure 5A after shifting all neg. eigenvalues. (H) Visualization of the low-rank approximated eigenspectrum
after shifting all neg. eigenvalues. (I) Visualization of the original eigenspectrum of Figure 5A after advanced shift. (J) Visualization of the low-rank approximated
eigenspectrum of Figure 5B after advanced shift.
We evaluate the algorithm for a set of benchmark data that are typically used in the context of proximity-based learning. The data are briefly described in the following and summarized in Table 2, with details given in the references. After a brief overview of the datasets used for the evaluation, the experimental setup and the performance of the different eigenvalue correction methods on the benchmark datasets are presented and discussed in this section.
TABLE 2 | Overview of the different datasets. Details are given in the textual description.

Dataset                              #samples    #classes    signature
Chromosomes                          4,200       21          (2258, 1899, 43)
Flowcyto-1                           612         3           (538, 73, 1)
Flowcyto-2                           612         3           (26, 73, 582)
Flowcyto-3                           612         3           (541, 70, 1)
Flowcyto-4                           612         3           (26, 73, 582)
Prodom                               2,604       53          (1502, 680, 422)
Protein                              213         4           (170, 40, 3)
SwissProt                            10,988      30          (8487, 2500, 1)
Tox-21: AllBit similarity            14,484      2           (2049, 0, 12435)
Tox-21: Asymmetric similarity        14,484      2           (1888, 3407, 9189)
Tox-21: Kulczynski similarity        14,484      2           (2048, 2048, 10388)
Tox-21: McConnaughey similarity      14,484      2           (2048, 2048, 10388)
Vibrio                               1,100       49          (851, 248, 1)

Datasets:
In the experiments, all datasets exhibit indefinite spectral properties and are commonly characterized by pairwise distances or (dis-)similarities. As mentioned above, if the data are given as dissimilarities, a corresponding similarity matrix can be obtained by double centering [17]: S = -JDJ/2 with J = (I - 11^T/N), with identity matrix I and vector of ones 1. These datasets constitute typical examples of non-Euclidean data. In particular, the focus is on proximity-based data from the life science domain. We consider a broad spectrum of domain-specific data: from sequence analysis, mass spectrometry, and chemical structure analysis to flow cytometry. In particular, the latter, flow cytometry [65], could also be important in the analysis of viral data like SARS-CoV-2 [66]. In all cases, dedicated preprocessing steps and (dis-)similarity measures for structures were used by the domain experts to create these data with respect to an appropriate proximity measure. The (dis-)similarity measures are inherently non-Euclidean and cannot be embedded isometrically in a Euclidean vector space. The datasets used for the experiments are described in the following and summarized in Table 2, with details given in the references.

1. Chromosomes: The Copenhagen chromosomes data set constitutes a benchmark from cytogenetics [67] with a signature (2258, 1899, 43). Karyotyping is a crucial process to classify chromosomes into standard classes and the results are routinely used by clinicians to diagnose cancers and genetic diseases. A set of 4,200 human chromosomes from 21 classes (the autosomal chromosomes) are represented by grey-valued images. These are transferred to strings measuring the thickness of their silhouettes. These strings are compared using edit distance with insertion/deletion costs 4.5 [40].

2. Flowcyto: This dissimilarity dataset is based on 612 FL3-A DNA flow cytometer histograms from breast cancer tissues in 256 resolution. The initial data were acquired by M. Nap and N. van Rodijnen of the Atrium Medical Center in Heerlen, The Netherlands, during 2000-2004, using tubes 3, 4, 5, and 6 of a DACO Galaxy flow cytometer. Overall, this data set consists of four datasets, each representing the same data, but with different proximity measure settings. Histograms are labeled in 3 classes: aneuploid (335 patients), diploid (131), and tetraploid (146). Dissimilarities between normalized histograms are computed using the L1 norm, correcting for possible different calibration factors [68].

3. Prodom: the ProDom dataset with signature (1502, 680, 422) consists of 2,604 protein sequences with 53 labels. It contains a comprehensive set of protein families and appeared first in the work of [69]. The pairwise structural alignments were computed by [69]. Each sequence belongs to a group labeled by experts; here, we use the data as provided in [68].

4. Protein: the Protein data set has sequence-alignment similarities for 213 proteins and is used for comparing and classifying protein sequences according to their four classes of globins: heterogeneous globin (G), hemoglobin-A (HA), hemoglobin-B (HB) and myoglobin (M). The signature is (170, 40, 3), where classes one through four contain 72, 72, 39, and 30 points, respectively [70].

5. SwissProt: the SwissProt data set (SWISS), with a signature (8487, 2500, 1), consists of 10,988 points of protein sequences in 30 classes, taken as a subset from the popular SwissProt database of protein sequences [71]. The considered subset of the SwissProt database refers to release 37. A typical protein sequence consists of a string of amino acids, and the length of the full sequences varies between 30 and more than 1,000 amino acids, depending on the sequence. The ten most common classes, such as Globin, Cytochrome b, Protein kinase st, etc., provided by the Prosite labeling [72], were taken, leading to 5,791 sequences. Due to this choice, an associated classification problem maps the sequences to their corresponding Prosite labels. These sequences are compared using Smith-Waterman, which computes a local alignment of sequences [5]. This database is the standard source for identifying and analyzing protein sequences, such that an automated classification and processing technique would be very desirable.

6. Tox-21: The initial intention of the Tox-21 challenges is to predict whether certain chemical compounds have the potential to disrupt processes in the human body that may lead to adverse health effects, i.e., are toxic to humans [73]. This version of the dataset contains 14,484 molecules encoded as Simplified Molecular Input Line Entry Specification (SMILE) codes. SMILE codes are ASCII strings to encode complex chemical structures. For example, Lauryldiethanolamine has the molecular formula C16H35NO2 and is encoded as CCCCCCCCCCCCN(CCO)CCO. Each SMILE code is described as a Morgan fingerprint [74, 75] and encoded as a bit-vector with a length of 2048 via the RDKit framework (https://www.rdkit.org/). The molecules are compared to each other using the non-psd binary similarity metrics AllBit, Kulczynski, McConnaughey, and Asymmetric provided by RDKit (a minimal fingerprint example is sketched after this dataset list). The similarity matrix is constructed based on these pairwise similarities. According to the applied similarity metric, the resulting matrices vary in their signatures: AllBit (2049, 0, 12435), Asymmetric (1888, 3407, 9189), Kulczynski (2048, 2048, 10388), McConnaughey (2048, 2048, 10388). The task of the dataset is binary classification: every given molecule is either toxic or non-toxic, and this label should be predicted by a machine learning algorithm. Note that also graph-based representations for SMILE data are possible [76].

7. Vibrio: Bacteria of the genus Vibrio are Gram-negative, primarily facultative anaerobes, forming motile rods. Contact with contaminated water and consumption of raw seafood are the primary infection factors for Vibrio-associated diseases. Vibrio parahaemolyticus, for instance, is one of the leading causes of foodborne gastroenteritis worldwide. The Vibrio data set consists of 1,100 samples of Vibrio bacteria populations characterized by mass spectra. The spectra comprise approximately 42,000 mass positions. The full data set consists of 49 classes of Vibrio sub-species. The mass spectra are preprocessed with a standard workflow using the BioTyper software [12]. As usual, mass spectra display strong functional characteristics due to the dependency of subsequent masses, such that problem-adapted similarities as described in [12, 77] are beneficial. In our case, similarities are calculated using a specific similarity measure as provided by the BioTyper software [12], with a signature (851, 248, 1).
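The fingerprint and similarity computation for Tox-21 can be sketched with RDKit as follows; the second molecule and the Morgan radius of 2 are illustrative assumptions, and the other binary similarities (AllBit, Asymmetric, McConnaughey) are available analogously in rdkit.DataStructs.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["CCCCCCCCCCCCN(CCO)CCO",   # Lauryldiethanolamine (example from the text)
          "CCO"]                     # hypothetical second molecule
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048) for m in mols]

sim = DataStructs.KulczynskiSimilarity(fps[0], fps[1])   # one entry of the similarity matrix
print("Kulczynski similarity:", sim)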
RESULTS

In this section, we evaluate our strategy of data-driven proximity-based analysis and highlight the performance of the proposed advanced shift correction on the previously mentioned datasets against other eigenvalue correction methods, using a standard SVM classifier. For this purpose, the correction approaches ensure that the input similarity, herein used as a kernel matrix, is psd. This is particularly important for kernel methods to keep the expected convergence properties. During the experiments, we measured the algorithm's mean accuracy and its standard deviation in a ten-fold cross-validation. Additionally, we captured the complexity of the model based on the number of necessary support vectors for the SVM. Therefore, we track the percentage of training data points the SVM model needs as support vectors to indicate the model's complexity.

In each experiment, the parameter C has been selected for each correction method by a grid search on independent data not used during the tests. For better comparability of the considered methods, the results presented here refer exclusively to the use of the low-rank approximated matrices in the SVM. Only when employing the original data for the SVM was no low-rank approximation applied, to ensure that small negative eigenvalues were not inadvertently removed in case of low-rank data. Please note that a low-rank approximation alone does not lead to a psd matrix. Accordingly, convergence problems and uncontrolled information loss, by means of discrimination power, may still occur. Furthermore, both methods for the determination of the shift parameter proposed in Section 2.4 were tested on the low-rank approximated datasets against the other eigenvalue correction methods. The results for the classification performance of the advanced shift methods against the other correction methods are shown in Table 3. In column Adv. Shift, we show the classification performance for the advanced shift with the exact determination of the smallest eigenvalue, whereas column Adv.-GS contains the classification performance of the advanced shift which applied the Gershgorin theorem to approximate the smallest eigenvalue. For the Prodom data, it is known from [27] that the SVM has convergence problems (not converged, subsequently n.c.) on the indefinite input matrix.

In general, the accuracies of the various correction methods are quite similar and rarely differ significantly. As expected, a correction step is needed, and the plain use of uncorrected data is suboptimal, often with a clear drop in performance, or may fail entirely. Also, the use of the classical shift operator cannot be recommended due to suboptimal results in various cases. In summary, the presented advanced shift with the exact determination of the shift parameter performed best, followed by the flip corrector. The results in Table 3 also show that the accuracy of the Gershgorin shift variant is not substantially lower compared to the other methods.

In most cases, the Gershgorin advanced shift performs as well as the clip and the square correction method. Compared to the classic shift, our Gershgorin advanced shift consistently results in much better accuracies. The reason for this is the appropriate preservation of the structure of the eigenspectrum, as shown in Section 2.4. It becomes evident that not only the dominating eigenvalues have to be kept, but the preservation of the entire structure of the eigenspectrum is important to obtain reliable results in general.

As the application of the low-rank approximation to similarity matrices leads to a large number of truly zero eigenvalues, both variants of the advanced shift correction become more effective. Both proposed approaches benefit from eigenspectra with many close-to-zero eigenvalues, which occur in many practical data, especially in complex domains like the life sciences. Surprisingly, the classical shift operator is still occasionally preferred in the literature [51, 58, 78], despite its recurring limitations. The herein proposed advanced shift outperforms the classical shift in almost every experimental setup. In fact, many datasets have an intrinsic low-rank nature, which we employ in our approach but which is not considered in the classical eigenvalue shift. In any case, the classical shift increases the intrinsic dimensionality, even if many eigenvalues already had zero contribution in the original matrix. This leads to substantial performance loss in the classification models, as seen in the results. Considering the results of Table 3, the advanced shift correction is preferable in most scenarios.

In addition to the accuracy of the different correction methods, the number of support vectors of each SVM model was gathered. Table 4 shows the complexity of the generated SVM models in terms of their required support vectors. Thus, the number of support vectors is set in relation to the number of all available training data points required to build a solid decision boundary. The higher this percentage, the more data points were needed to create the separation plane, leading to a more complex model. As explained in [79] or [80], the run time complexity can become considerably higher with an increasing number of support vectors.
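A sketch of this evaluation protocol with scikit-learn (hypothetical toy kernel; the actual experiments use the corrected domain-specific similarity matrices): the precomputed kernel is indexed per fold, and accuracy as well as the fraction of support vectors is recorded.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def evaluate_precomputed(K, y, C=1.0, n_splits=10, seed=0):
    # ten-fold cross-validation for a precomputed (corrected) kernel matrix K
    accs, sv_fracs = [], []
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in cv.split(K, y):
        clf = SVC(C=C, kernel="precomputed")
        clf.fit(K[np.ix_(train, train)], y[train])                 # train-vs-train block
        accs.append(clf.score(K[np.ix_(test, train)], y[test]))    # test-vs-train block
        sv_fracs.append(len(clf.support_) / len(train))            # model complexity
    return np.mean(accs), np.std(accs), np.mean(sv_fracs)

# toy psd kernel as a placeholder for a corrected similarity matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
K, y = X @ X.T, (X[:, 0] > 0).astype(int)
print(evaluate_precomputed(K, y))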
TABLE 3 | Prediction accuracy (mean ± standard deviation) for the various data sets and methods in comparison to the advanced shift method. Column Adv. Shift shows the performance of the advanced shift method and column Adv.-GS provides the performance of the advanced shift using the Gershgorin approach to estimate the minimum eigenvalue.

Dataset                   Adv.-GS         Adv. Shift      Original        Shift           Clip            Flip            Square
Chromosomes               96.90 ± 0.61    97.02 ± 0.86    96.83 ± 0.83    71.38 ± 9.34    97.00 ± 0.69    97.05 ± 1.02    96.45 ± 0.91
Flowcyto-1                69.62 ± 5.28    69.28 ± 5.10    63.74 ± 6.50    66.02 ± 5.45    69.93 ± 6.31    70.26 ± 5.41    70.58 ± 6.09
Flowcyto-2                70.59 ± 4.62    72.4 ± 5.85     62.09 ± 5.36    65.69 ± 6.44    71.39 ± 4.96    70.42 ± 3.84    71.08 ± 2.86
Flowcyto-3                71.25 ± 5.75    70.26 ± 3.58    62.09 ± 0.44    64.55 ± 5.61    70.74 ± 5.70    71.10 ± 4.67    70.75 ± 3.03
Flowcyto-4                70.10 ± 4.68    70.43 ± 6.12    59.88 ± 0.58    63.54 ± 6.97    71.10 ± 4.92    70.25 ± 5.31    68.29 ± 5.68
Prodom                    99.77 ± 0.19    99.85 ± 0.25    n.c.            99.77 ± 0.26    99.77 ± 0.31    99.77 ± 0.25    99.65 ± 0.47
Protein                   98.12 ± 2.31    99.07 ± 2.12    60.40 ± 1.13    58.23 ± 9.91    98.10 ± 3.16    99.02 ± 1.86    98.59 ± 2.15
SwissProt                 97.55 ± 0.36    97.50 ± 0.31    96.46 ± 0.63    96.52 ± 0.37    96.47 ± 0.84    96.53 ± 0.60    97.42 ± 0.39
Tox-21: AllBit            97.22 ± 0.31    97.36 ± 0.49    97.37 ± 0.47    97.38 ± 0.44    97.33 ± 0.52    97.38 ± 0.30    97.35 ± 0.38
Tox-21: Asymmetric        97.33 ± 0.43    97.46 ± 0.44    90.40 ± 2.01    95.28 ± 0.64    96.96 ± 0.46    97.33 ± 0.35    97.18 ± 0.48
Tox-21: Kulczynski        97.34 ± 0.56    97.36 ± 0.39    92.81 ± 2.16    95.28 ± 0.54    97.20 ± 0.26    97.29 ± 0.37    97.30 ± 0.31
Tox-21: McConnaughey      97.31 ± 0.44    97.34 ± 0.41    92.08 ± 2.02    94.97 ± 0.56    97.15 ± 0.50    97.33 ± 0.32    97.15 ± 0.54
Vibrio                    100.0 ± 0.00    100.0 ± 0.00    100.0 ± 0.00    100.0 ± 0.00    100.0 ± 0.00    100.0 ± 0.00    100.0 ± 0.00
TABLE 4 | Average percentage of data points that are needed by the SVM models for building a well-fitting decision hyperplane.

Dataset              Adv.-GS  Adv. Shift  Original  Shift   Clip    Flip    Square
Chromosomes          45.4%    39.7%       43.9%     99.8%   30.3%   30.6%   24.0%
Flowcyto-1           59.4%    60.6%       63.8%     99.7%   63.6%   63.6%   62.9%
Flowcyto-2           59.6%    59.1%       69.5%     96.7%   57.6%   58.3%   57.7%
Flowcyto-3           58.6%    59.3%       65.1%     99.3%   57.8%   58.5%   59.4%
Flowcyto-4           61.2%    59.9%       65.5%     99.5%   59.3%   59.2%   62.7%
Prodom               46.6%    18.7%       n.c.      18.7%   18.7%   18.8%   12.9%
Protein              38.6%    39.6%       80.3%     99.8%   22.9%   23.6%   14.7%
SwissProt            14.1%    13.9%       48.9%     13.9%   13.9%   13.9%   12.2%
Tox-21: AllBit       5.5%     5.5%        5.8%      7.4%    6.5%    7.2%    4.6%
Tox-21: Asymmetric   4.7%     5.4%        7.3%      10.0%   7.6%    7.1%    4.6%
Tox-21: Kulczynski   5.3%     5.9%        8.0%      10.0%   7.2%    7.1%    5.3%
Tox-21: McConnaughey 5.1%     5.6%        8.4%      8.3%    7.6%    7.5%    4.2%
Vibrio               99.9%    99.6%       100.0%    99.5%   99.6%   99.6%   92.0%

advanced shift is preferable to clip and flip eigenvalue correction and competitive to the square correction.
In summary, as also pointed out in previous work, there is no simple solution for handling non-psd matrices or the correction of eigenvalues. The results make evident that the proposed variants of the advanced shift correction are especially useful if the negative eigenvalues are meaningful and a low-rank approximation of the similarity matrix preserves the relevant eigenvalues. The analysis also shows that domain-specific measures, by means of a data-driven analysis, are effectively possible and keep relevant information. The presented strategies allow the use of standard machine learning approaches, such as kernel methods, without much hassle.

DISCUSSION

In this paper, we addressed the topic of data-driven supervised learning by general proximity measures. In particular, we presented an alternative formulation of the classical eigenvalue shift, preserving the structure of the eigenspectrum of the data, such that the inherent data properties are kept. For this advanced shift method, we also presented a novel strategy that approximates the shift parameter based on the Gershgorin circles theorem.
Furthermore, we pointed to the limitations of the classical shift induced by the shift of all eigenvalues, including those with small or zero eigenvalue contributions. Surprisingly, the classical shift eigenvalue correction is nevertheless frequently recommended in the literature, pointing out that only a suitable offset needs to be applied to shift the matrix to psd. However, it is rarely mentioned that this shift affects the entire eigenspectrum and thus increases the contribution of eigenvalues that had no contribution in the original matrix.
As a result of our approach, the eigenvalues that had vanishing contribution before the shift remain irrelevant after the shift. Those eigenvalues with a high contribution keep their relevance, leading to the preservation of the eigenspectrum but with a positive (semi-)definite matrix. In combination with the low-rank approximation, our approach was, in general, better compared to the classical methods. Moreover, the approximated version of the advanced shift via the Gershgorin circles theorem performed as well as the classical methods.
We analyzed the effectiveness of data-driven learning on a broad spectrum of classification problems from the life science domain. The use of domain-specific proximity measures originally caused a number of challenges for practitioners, but with the recent work on indefinite learning, substantial improvements are available. In fact, our experiments with eigenvalue correction methods, especially the advanced shift approach, which keeps the eigenspectrum intact, have shown promising results on many real-life problems. In this way, domain-specific non-standard proximity measures allow the effective analysis of life science data in a data-driven way.
Future work on this subject will include the reduction of the computational costs using advanced matrix approximation and decomposition techniques in the different sub-steps. Another field of interest is a possible adoption of the advanced shift to unsupervised scenarios.
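To make the corrections discussed above concrete, a minimal sketch follows (assuming NumPy; the function names are illustrative, and the advanced-shift routine is a simplified rendering of the structure-preserving idea rather than the authors' exact implementation, which additionally relies on low-rank and null-space approximation). It contrasts the classical eigenvalue corrections with a Gershgorin-based estimate of the shift magnitude and a shift restricted to the non-vanishing part of the spectrum.

import numpy as np

def gershgorin_lower_bound(S):
    """Cheap lower bound on the smallest eigenvalue of a symmetric matrix S:
    every eigenvalue lies in a disc centered at S[i, i] with radius sum_{j != i} |S[i, j]|."""
    radii = np.abs(S).sum(axis=1) - np.abs(np.diag(S))
    return np.min(np.diag(S) - radii)

def correct_eigenvalues(S, method="clip"):
    """Classical spectrum corrections: clip negatives to zero, flip their sign,
    square all eigenvalues, or shift the whole spectrum by |lambda_min|."""
    lam, U = np.linalg.eigh(S)                  # S = U diag(lam) U^T
    if method == "clip":
        lam = np.maximum(lam, 0.0)
    elif method == "flip":
        lam = np.abs(lam)
    elif method == "square":
        lam = lam ** 2
    elif method == "shift":                     # affects all eigenvalues,
        if lam.min() < 0:                       # including the vanishing ones
            lam = lam - lam.min()
    return (U * lam) @ U.T

def advanced_shift_sketch(S, tol=1e-10):
    """Rough illustration of the structure-preserving idea: the shift is applied
    only to non-vanishing eigendirections, so zero eigenvalues stay zero."""
    lam, U = np.linalg.eigh(S)
    shift = -lam.min() if lam.min() < 0 else 0.0
    keep = np.abs(lam) > tol                    # non-vanishing part of the spectrum
    lam[keep] = lam[keep] + shift
    return (U * lam) @ U.T

The Gershgorin bound avoids a full eigendecomposition when only the magnitude of the shift is needed, which is what keeps the approximated variant of the advanced shift cheap.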
Finally, it remains to be said that the analysis of life science data offers tremendous potential for understanding complex processes in domains such as (bio)chemistry, biology, environmental research, or medicine. Many challenges have already been tackled and solved, but there are still many open issues in these areas where the analysis of complex data can be a key component in understanding these processes.

DATA AVAILABILITY STATEMENT

Publicly available datasets were analyzed in this study. This data can be found here: https://bitbucket.fiw.fhws.de:8443/users/popp/repos/proximitydatasetbenchmark/browse.

AUTHOR CONTRIBUTIONS

MM, CR, and FMS contributed conception and design of the study; CR preprocessed and provided the Tox-21 database; MM performed the statistical analysis; MM and FMS wrote the first draft of the manuscript; MM, CR, FMS, and MB wrote sections of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

FUNDING

FMS and MM are supported by the ESF program WiT-HuB/2014-2020, project IDA4KMU, StMBW-W-IX.4-170792. FMS and CR are supported by the FuE program of the StMWi, project OBerA, grant number IUK-1709-0011//IUK530/010.

ACKNOWLEDGMENTS

We thank Gaelle Bonnet-Loosli for providing support with indefinite learning and R. Duin, Delft University, for various support with DisTools and PRTools. We would like to thank Dr. Markus Kostrzewa and Dr. Thomas Maier for providing the Vibrio data set and expertise regarding the biotyping approach, and Dr. Katrin Sparbier for discussions about the SwissProt data (all Bruker Corp.). A related conference publication by the same authors was published at ICPRAM 2020, see [15]; copyright-related material is not affected.
REFERENCES

1. Biehl M, Hammer B, Schneider P, Villmann T. Metric learning for prototype-based classification. In: M Bianchini, M Maggini, F Scarselli, LC Jain, editors. Innovations in Neural Information Paradigms and Applications. Studies in Computational Intelligence, Vol. 247: Springer (2009) p. 183–99.
2. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer (2001).
3. Nebel D, Kaden M, Villmann A, Villmann T. Types of (dis-)similarities and adaptive mixtures thereof for improved classification learning. Neurocomputing (2017) 268:42–54. doi:10.1016/j.neucom.2016.12.091
4. Schölkopf B, Smola A. Learning with Kernels. MIT Press (2002).
5. Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997).
6. Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans Acoust Speech Signal Process (1978) 26:43–49. doi:10.1109/tassp.1978.1163055
7. Ling H, Jacobs DW. Using the inner-distance for classification of articulated shapes. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005); 2005 June 20-26; San Diego, CA, USA: IEEE Computer Society (2005) p. 719–26.
8. Cilibrasi R, Vitányi PMB. Clustering by compression. IEEE Trans Inform Theory (2005) 51:1523–45. doi:10.1109/tit.2005.844059
9. Cichocki A, Amari S-I. Families of alpha- beta- and gamma-divergences: Flexible and robust measures of similarities. Entropy (2010) 12:1532–68. doi:10.3390/e12061532
10. Lee J, Verleysen M. Generalizations of the lp norm for time series and its application to self-organizing maps. In: M Cottrell, editor. 5th Workshop on Self-Organizing Maps. Vol. 1 (2005) p. 733–40.
11. Dubuisson MP, Jain A. A modified Hausdorff distance for object matching. In: Pattern Recognition, 1994. Vol. 1 - Conference A: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference (1994) p. 566–568.
12. Maier T, Klebel S, Renner U, Kostrzewa M. Fast and reliable MALDI-TOF MS-based microorganism identification. Nature Methods (2006) 3:1–2. doi:10.1038/nmeth870
13. Pekalska E, Duin RPW, Günter S, Bunke H. On not making dissimilarities euclidean. In: SSPR&SPR 2004 (2004) p. 1145–1154.
14. Scheirer WJ, Wilber MJ, Eckmann M, Boult TE. Good recognition is non-metric. Patt Recog (2014) 47:2721–2731. doi:10.1016/j.patcog.2014.02.018
15. Münch M, Raab C, Biehl M, Schleif F. Structure preserving encoding of non-euclidean similarity data. In: Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM. INSTICC (SciTePress) (2020) p. 43–51. doi:10.5220/0008955100430051
16. Gisbrecht A, Schleif FM. Metric and non-metric proximity transformations at linear costs. Neurocomputing (2015) 167:643–57. doi:10.1016/j.neucom.2015.04.017
17. Pekalska E, Duin R. The Dissimilarity Representation for Pattern Recognition. World Scientific (2005).
18. Vapnik V. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer (2000).
19. Ying Y, Campbell C, Girolami M. Analysis of SVM with indefinite kernels. In: Y Bengio, D Schuurmans, JD Lafferty, CKI Williams, A Culotta, editors. Advances in Neural Information Processing Systems 22: Curran Associates, Inc. (2009) p. 2205–13.
20. Platt JC. Fast training of support vector machines using sequential minimal optimization. In: Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press (1999) p. 185–208.
21. Lin H, Lin C. A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Neural Comput (2003) 1–32. doi:10.1.1.14.6709
22. Luss R, d'Aspremont A. Support vector machine classification with indefinite kernels. Math Prog Comp (2009) 1:97–118. doi:10.1007/s12532-009-0005-5
23. Chen Y, Garcia E, Gupta M, Rahimi A, Cazzanti L. Similarity-based classification: concepts and algorithms. J Mac Learn Res (2009) 10:747–76.
24. Indyk P, Vakilian A, Yuan Y. Learning-based low-rank approximations. In: HM Wallach, H Larochelle, A Beygelzimer, F d'Alché-Buc, EB Fox, R Garnett, editors. Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019; 2019 December 8-14; Vancouver, BC, Canada (2019) p. 7400–10.
25. Williams CKI, Seeger MW. Using the Nyström method to speed up kernel machines. In: TK Leen, TG Dietterich, V Tresp, editors. Advances in Neural Information Processing Systems 13, Papers from Neural Information Processing Systems (NIPS) 2000. Denver, CO: MIT Press (2000) p. 682–688.
26. Xu W, Wilson R, Hancock E. Determining the cause of negative dissimilarity eigenvalues. LNCS 6854 (2011) p. 589–597.
27. Schleif FM, Tiño P. Indefinite proximity learning: A review. Neural Computation (2015) 27:2039–96. doi:10.1162/neco_a_00770
28. Shawe-Taylor J, Cristianini N. Kernel Methods for Pattern Analysis and Discovery. Cambridge University Press (2004).
29. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol (1981) 147:195–197. doi:10.1016/0022-2836(81)90087-5
30. Haasdonk B, Keysers D. Tangent distance kernels for support vector machines. ICPR (2002) (2):864–868. doi:10.1109/icpr.2002.1048439
31. Goldfarb L. A unified approach to pattern recognition. Patt Recog (1984) 17:575–582. doi:10.1016/0031-3203(84)90056-6
32. Deza M, Deza E. Encyclopedia of Distances. Springer (2009).
33. Ong CS, Mary X, Canu S, Smola AJ. Learning with non-positive kernels. In: CE Brodley, editor. Machine Learning, Proceedings of the Twenty-First International Conference (ICML 2004); 2004 July 4-8; Banff, Alberta, Canada. ACM International Conference Proceeding Series: ACM, Vol. 69 (2004) p. 81. doi:10.1145/1015330.1015443
34. Hodgetts CJ, Hahn U. Similarity-based asymmetries in perceptual matching. Acta Psychologica (2012) 139:291–299. doi:10.1016/j.actpsy.2011.12.003
35. Hodgetts CJ, Hahn U, Chater N. Transformation and alignment in similarity. Cognition (2009) 113:62–79. doi:10.1016/j.cognition.2009.07.010
36. Kinsman T, Fairchild M, Pelz J. Color is not a metric space: implications for pattern recognition, machine learning, and computer vision. In: Proceedings of the Western New York Image Processing Workshop, WNYIPW 2012 (2012) p. 37–40.
37. Van Der Maaten L, Hinton G. Visualizing non-metric similarities in multiple maps. Mac Learn (2012) 87:33–55. doi:10.1007/s10994-011-5273-4
38. Duin RPW, Pekalska E. Non-euclidean dissimilarities: causes and informativeness. In: SSPR&SPR 2010 (2010) p. 324–33.
39. Kohonen T, Somervuo P. How to make large self-organizing maps for nonvectorial data. Neural Networks (2002) 15:945–52. doi:10.1016/s0893-6080(02)00069-2
40. Neuhaus M, Bunke H. Edit distance-based kernel functions for structural pattern classification. Patt Recog (2006) 39:1852–63. doi:10.1016/j.patcog.2006.04.012
41. Gärtner T, Lloyd JW, Flach PA. Kernels and distances for structured data. Mac Learn (2004) 57:205–32. doi:10.1023/B:MACH.0000039777.23772.30
42. Poleksic A. Optimal pairwise alignment of fixed protein structures in subquadratic time. J Bioinform Comput Biol (2011) 9:367–82. doi:10.1142/s0219720011005562
43. Zhang Z, Ooi BC, Parthasarathy S, Tung AKH. Similarity search on Bregman divergence. Proc VLDB Endow (2009) 2:13–24. doi:10.14778/1687627.1687630
44. Schnitzer D, Flexer A, Widmer G. A fast audio similarity retrieval method for millions of music tracks. Multimed Tools Appl (2012) 58:23–40. doi:10.1007/s11042-010-0679-8
45. Mwebaze E, Schneider P, Schleif FM. Divergence based classification in learning vector quantization. Neurocomputing (2010) 74:1429–35. doi:10.1016/j.neucom.2010.10.016
46. Nguyen NQ, Abbey CK, Insana MF. Objective assessment of sonographic quality II: acquisition information spectrum. IEEE Trans Med Imag (2013) 32:691–98. doi:10.1109/tmi.2012.2231963
47. Tian J, Cui S, Reinartz P. Building change detection based on satellite stereo imagery and digital surface models. IEEE Trans Geosci Remote Sens (2013) 52(1):406–417. doi:10.1109/tgrs.2013.2240692
48. van der Meer F. The effectiveness of spectral similarity measures for the analysis of hyperspectral imagery. Int J Appl Earth Obser Geoinf (2006) 8:3–17. doi:10.1016/j.jag.2005.06.001
49. Bunte K, Haase S, Biehl M, Villmann T. Stochastic neighbor embedding (SNE) for dimension reduction and visualization using arbitrary divergences. Neurocomputing (2012) 90:23–45. doi:10.1016/j.neucom.2012.02.034
50. Mohammadi M, Petkov N, Bunte K, Peletier RF, Schleif FM. Globular cluster detection in the GAIA survey. Neurocomputing (2019) 342:164–171. doi:10.1016/j.neucom.2018.10.081
51. Loosli G. TrIK-SVM: an alternative decomposition for kernel methods in Krein spaces. In: M Verleysen, editor. Proceedings of the 27th European Symposium on Artificial Neural Networks (ESANN) 2019. Evere, Belgium: D-side Publications (2019) p. 79–94.
52. Mehrkanoon S, Huang X, Suykens JAK. Indefinite kernel spectral learning. Patt Recog (2018) 78:144–153. doi:10.1016/j.patcog.2018.01.014
53. Schleif F, Tiño P, Liang Y. Learning in indefinite proximity spaces - recent trends. In: 24th European Symposium on Artificial Neural Networks, ESANN 2016; 2016 April 27-29; Bruges, Belgium (2016) p. 113–122.
54. Loosli G, Canu S, Ong CS. Learning SVM in Kreĭn spaces. IEEE Trans Patt Anal Mach Intell (2016) 38:1204–16. doi:10.1109/tpami.2015.2477830
55. Schleif FM, Tiño P. Indefinite core vector machine. Patt Recog (2017) 71:187–195. doi:10.1016/j.patcog.2017.06.003
56. Higham NJ. Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and Its Applications (1988) 103:103–118. doi:10.1016/0024-3795(88)90223-6
57. Strassen V. Gaussian elimination is not optimal. Numer Math (1969) 13:354–356. doi:10.1007/bf02165411
58. Filippone M. Dealing with non-metric dissimilarities in fuzzy central clustering algorithms. International Journal of Approximate Reasoning (2009) 50:363–384. doi:10.1016/j.ijar.2008.08.006
59. Mises RV, Pollaczek-Geiringer H. Praktische Verfahren der Gleichungsauflösung. Z Angew Math Mech (1929) 9:152–164. doi:10.1002/zamm.19290090206
60. Gerschgorin S. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestija Akademii Nauk SSSR, Serija Matematika (1931) 7:749–54.
61. Varga RS. Geršgorin and His Circles. Springer Series in Computational Mathematics. Springer Berlin Heidelberg (2004).
62. Verleysen M, François D. The curse of dimensionality in data mining and time series prediction. In: J Cabestany, A Prieto, FS Hernández, editors. Computational Intelligence and Bioinspired Systems, 8th International Work-Conference on Artificial Neural Networks, IWANN 2005. Lecture Notes in Computer Science, Vol. 3512; 2005 June 8-10; Vilanova i la Geltrú, Barcelona, Spain: Springer (2005) p. 758–770.
63. Sanyal A, Kanade V, Torr PHS. Low rank structure of learned representations. CoRR (2018) abs/1804.07090.
64. Ilic M, Turner IW, Saad Y. Linear system solution by null-space approximation and projection (SNAP). Numer Linear Algebra Appl (2007) 14:61–82.
65. Aghaeepour N, Finak G, Hoos H, Mosmann TR, Brinkman R, Gottardo R, et al. Critical assessment of automated flow cytometry data analysis techniques. Nat Methods (2013) 10:228–238. doi:10.1038/nmeth.2365
66. Ou X, Liu Y, Lei X, Li P, Mi D, Ren L. Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV. Nat Commun (2020) 11:1620. doi:10.1038/s41467-020-15562-9
67. Lundsteen C, Phillip J, Granum E. Quantitative analysis of 6985 digitized trypsin G-banded human metaphase chromosomes. Clin Genet (1980) 18:355–370. doi:10.1111/j.1399-0004.1980.tb02296.x
68. Duin RP. [Dataset]: PRTools (2012).
69. Roth V, Laub J, Buhmann JM, Müller KR. Going metric: denoising pairwise data. In: NIPS (2002) p. 817–824.
70. Hofmann T, Buhmann JM. Pairwise data clustering by deterministic annealing. IEEE Trans Patt Anal Machine Intell (1997) 19:1–14. doi:10.1109/34.566806
71. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res (2003) 31:365–370. doi:10.1093/nar/gkg095
72. Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel R, Bairoch A. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res (2003) 31. doi:10.1093/nar/gkg563
73. Huang R, Xia M, Nguyen DT, Zhao T, Sakamuru S, Zhao J. Tox21Challenge to build predictive models of nuclear receptor and stress response pathways as mediated by exposure to environmental chemicals and drugs. Front Environ Sci (2016) 3:85. doi:10.3389/fenvs.2015.00085
74. Figueras J. Morgan revisited. J Chem Inf Model (1993) 33:717–718. doi:10.1021/ci00015a009
75. Ralaivola L, Swamidass SJ, Saigo H, Baldi P. Graph kernels for chemical informatics. Neural Networks (2005) 18:1093–1110. doi:10.1016/j.neunet.2005.07.009
76. Bacciu D, Lisboa P, Martín JD, Stoean R, Vellido A. Bioinformatics and medicine in the era of deep learning. In: 26th European Symposium on Artificial Neural Networks, ESANN 2018; 2018 April 25-27; Bruges, Belgium (2018) p. 345–354.
77. Barbuddhe SB, Maier T, Schwarz G, Kostrzewa M, Hof H, Domann E. Rapid identification and typing of Listeria species by matrix-assisted laser desorption ionization-time of flight mass spectrometry. Appl Environ Microbiol (2008) 74:5402–5407. doi:10.1128/aem.02689-07
78. Chakraborty J. Non-metric pairwise proximity data. [PhD thesis]: Berlin Institute of Technology (2004).
79. Burges CJC. Simplified support vector decision rules. In: ICML (1996).
80. Osuna E, Girosi F. Reducing the run-time complexity of support vector machines. In: International Conference on Pattern Recognition (1998).

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Münch, Raab, Biehl and Schleif. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.