Combining Entropy Measures For Anomaly Detection
Abstract: The combination of different sources of information is a problem that arises in several
situations, for instance, when data are analysed using different similarity measures. Often, each source
of information is given as a similarity, distance, or kernel matrix. In this paper, we propose a new
class of methods that produce, for anomaly detection purposes, a single Mercer kernel (acting as a
similarity measure) from a set of local entropy kernels while, at the same time, avoiding the task of
model selection. This kernel is used to build an embedding of the data in a variety that allows a
(modified) one-class Support Vector Machine to detect outliers. We study several information
combination schemes and their limiting behaviour, as the data sample size increases, within an
Information Geometry context. In particular, we study the variety containing the given positive
definite kernel matrices, so that the desired kernel combination belongs to that variety.
The proposed methodology has been evaluated on several real and artificial problems.
Keywords: entropy kernel; kernel combination; Karcher mean; anomaly detection; functional data
1. Introduction
Usual Data Mining tasks, such as classification, regression and anomaly detection, are heavily
dependent on the geometry of the underlying data space. Kernel Methods, such as Support Vector
Machines (SVM), provide control over the data space geometry through the use of a Mercer kernel
function [1,2]. Such functions, defined in the next section, induce embeddings of the data in feature
spaces where Mercer kernels act as inner products. The choice of the appropriate kernel, including its
parameters, is a particular case of the model selection problem.
For instance, when working with SVM, a delicate parameterization is needed; otherwise, solutions
might be suboptimal. In other words, the choice of a suitable kernel function and its parameters will
affect both the geometry of the data embedding and the success of the algorithms [3,4]. A typical way to
proceed is by means of cross-validation procedures [5]. However, these parameter calibration strategies,
although intuitive and simple from an applied point of view, have some important drawbacks.
In particular, their computational burden becomes practically relevant when cross-validation
strategies are implemented in problems that involve calibrating a medium to large number of parameters. An appealing
alternative to model selection when working with SVM is to combine or merge different kernel
functions into a single kernel [6,7].
Functional data [8] present the particularity of being intrinsically infinite dimensional.
This peculiarity implies that classical procedures for multivariate data must be adapted or redesigned
to cope with functional data. The statistical distribution of the data is a basic element in addressing outlier
detection problems. Entropies are natural functions to use in anomaly detection, given that
any definition of entropy should produce large values for scattered distributions and small values for
concentrated ones. In addition, statistical distributions are a particular case of functional data,
and in this way entropy comes into play in this context.
In this paper, we present an alternative proposal to solving anomaly detection problems that
avoids the selection of kernel hyperparameters. A novelty of this work is that the methodology
is developed to deal with functional data. We will explore several kernel combination techniques,
including some methods from Information Geometry that respect the geometry of the manifold that
contains the Gram matrices associated with the Mercer kernels involved.
The paper is organized as follows: Section 2 describes the functional data analysis methods used
to produce the data representations from kernels, as well as the minimum entropy method used in this
paper for anomaly detection. Section 3 develops several methods to obtain kernel combinations for the
task of outlier detection. Section 4 illustrates the theory with simulations and examples; and Section 5
concludes the work.
manifold embedded in R^d [9]. For x ∈ X, denote by K_x the function K_x: X → R given by K_x(z) = K(x, z).
There exists a unique Hilbert space H_K of functions on X, made up of the span of the set {K_x | x ∈ X},
such that for all f ∈ H_K and x ∈ X, f(x) = ⟨K_x, f⟩_{H_K}. The Hilbert space H_K is said to be a Reproducing
Kernel Hilbert Space (RKHS) [10]. Next, we describe the use of RKHS for data analysis, differentiating
between the multivariate and functional cases.
In the multivariate case, we consider data sets S = {x_1, ..., x_n} ⊂ X, where X is a compact subset
of R^D. Consider the RKHS H_K and the linear integral operator L_K defined by L_K(f) = ∫_X K(·, s) f(s) ds.
d_K²(x_i, x_j) = ‖φ(x_i) − φ(x_j)‖² = φ(x_i)ᵀφ(x_i) + φ(x_j)ᵀφ(x_j) − 2 φ(x_i)ᵀφ(x_j) = K(x_i, x_i) + K(x_j, x_j) − 2 K(x_i, x_j).    (1)

Equation (1) shows that the choice of the kernel K determines the geometry of the data set after
the transformation X → φ(X).
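As an illustration (not part of the original paper), the kernel-induced distance of Equation (1) can be computed directly from a Gram matrix; the function name below is ours:

```python
import numpy as np

def kernel_distance_matrix(K):
    """Pairwise distances induced by a Gram matrix K via Equation (1):
    d_K(x_i, x_j)^2 = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off

# With the linear kernel K_ij = <x_i, x_j>, d_K is the ordinary Euclidean distance.
X = np.array([[0.0, 0.0], [3.0, 4.0]])
D = kernel_distance_matrix(X @ X.T)
```

For other kernels (Gaussian, polynomial), the same formula yields the Euclidean distance in the corresponding feature space.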
Now, we consider the functional data case, that is, the case where data are functions or,
by generalization, infinite-dimensional objects (such as images, for instance). Let (Ω, F, P) be
a probability space, where F is the σ-algebra on Ω and P a σ-finite measure. We consider random
elements (functions) X(ω, t): Ω × T → R in a metric space (T, τ), where T ⊂ R is compact, and
we assume the X(ω, ·) to be continuous functions. We consider kernels K_X(s, t) = K(X(ω, s), X(ω, t))
(the classical choice is K_X(s, t) = E(X(ω, s) X(ω, t))). Then, there exists a basis {e_i}_{i≥1} of C(T) such that,
for all t ∈ T,

X(ω, t) = Σ_{i=1}^∞ ξ_i(ω) e_i(t),    (2)

for appropriate coefficients, where the e_i are the eigenfunctions associated with the integral operator
of K_X(s, t).
Entropy 2018, 20, 698 3 of 14
In real data analysis, we do not have theoretical random paths, or functional data described by
mathematical equations, but finite samples from such processes. For instance, if we are considering
normal distributions as the object of analysis, we will not know the true mean vectors and covariance
matrices (µ and Σ), but a sample X = {x_i} ⊂ R^n from which we will estimate the covariance matrix
S = (1/n) X Xᵀ. In the case of functions, X will be a compact space or manifold in a Euclidean space,
Y = R, and there will be available sample curves f_n identified with data sets {(x_i, y_i) ∈ X × Y}_{i=1}^n.
Let K: X × X → R be a Mercer kernel and H_K its associated RKHS. Then, the coefficients in Equation (2)
can be approximated by solving the following optimization problem [11]:

min_{f ∈ H_K} (1/n) Σ_{i=1}^n (f(x_i) − y_i)² + γ ‖f‖_K²,    (3)

where γ > 0 and ‖f‖_K² represents the norm of the function f in H_K. The solution, which constitutes
an example of Equation (2), is given by f*(x) = Σ_j λ̂_j φ_j(x), where the λ̂_j are the weights of the
projection of the function corresponding to the sample {(x_i, y_i)} onto the function space generated by
the eigenfunctions of L_K.
Next, we use local entropies for anomaly detection through kernel combinations. For this
preliminary work, we explore linear combinations and Karcher means, to validate the intuition
that using a mean more natural than the arithmetic one will produce better practical results
when positive definite matrices are involved.
H_α(X) = (1/(1−α)) log( Σ_{i≥1} P(Ω_i)^α ), for α ≥ 0 and α ≠ 1.    (4)
The parameter α determines which entropy within the family of α-entropies is being used.
For instance, when α = 0, H_α is the Hartley entropy; when α → 1, H_α converges to the
Shannon entropy; and when α → ∞, H_α converges to the min-entropy. Let S_Ω be the
collection of finite partitions of Ω. For any subset A = ∪_{i=1}^n A_i ∈ S_Ω, the entropy of A can be computed
as follows:
H_α(A) = (1/(1−α)) log( Σ_{i=1}^n P(A_i)^α ).    (5)
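For a discrete distribution, the α-entropy above can be evaluated directly. A small sketch of ours, with the Shannon limit α → 1 handled separately:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi alpha-entropy of a discrete distribution, as in Equations (4)-(5):
    H_alpha = log(sum_i p_i^alpha) / (1 - alpha); alpha -> 1 gives Shannon."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                           # convention: 0 * log 0 = 0
    if abs(alpha - 1.0) < 1e-12:           # Shannon limit
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

# For the uniform distribution every alpha gives log(number of atoms).
H2_uniform = renyi_entropy([0.25] * 4, 2.0)
H0_uniform = renyi_entropy([0.25] * 4, 0.0)   # Hartley entropy
```

As the text notes, scattered distributions yield large values and concentrated distributions small ones, for every admissible α.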
This paves the way to define the ∆-local entropy [13] corresponding to any subset ∆ ∈ F_Ω
as follows:

h_α(∆) = inf { H_α(∆̃) : ∆̃ ∈ S_Ω, ∆ ⊂ ∆̃ }.    (6)
Let (X_1, ..., X_n) be a random sample drawn i.i.d. from P. We would like to compute the
local entropies of the corresponding random sets ∆_1, ..., ∆_n, where ∆_i = Ω ∩ B(X_i^{−1}(ω), r) and
B(X_i^{−1}(ω), r) ⊂ Ω is the open ball with centre ω and a (data-driven) small radius r. In practice,
given a sample S_n = (x_1, ..., x_n), we compute the local entropy using the estimator
ĥ_α(∆_i) = d̄_k(x_i, S_n)/(1 − α), where d̄_k(x_i, S_n) is the average distance from x_i to its k nearest
neighbours. Notice that the locality parameter k in d̄_k(x, S_n), which represents the number of
neighbours taken into account to approximate the local entropy around x, is related to r in
∆_x = Ω ∩ B(x^{−1}(ω), r).
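The estimator ĥ_α(∆_i) = d̄_k(x_i, S_n)/(1 − α) can be sketched as follows (a NumPy illustration assuming α < 1; the function name and the toy data are ours):

```python
import numpy as np

def local_entropy_estimates(S, k, alpha):
    """Estimator h_hat_alpha(Delta_i) = dbar_k(x_i, S_n) / (1 - alpha), where
    dbar_k is the mean distance from x_i to its k nearest neighbours
    (alpha < 1 assumed, so the estimates are non-negative)."""
    S = np.asarray(S, dtype=float)
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    knn = np.sort(D, axis=1)[:, 1:k + 1]   # skip column 0 (self-distance 0)
    return knn.mean(axis=1) / (1.0 - alpha)

# A point far from a tight cluster receives a much larger local entropy.
S = np.vstack([np.zeros((5, 2)) + 0.01 * np.arange(5)[:, None], [[10.0, 10.0]]])
h = local_entropy_estimates(S, k=3, alpha=0.5)
```

Large values of ĥ_α flag the observations whose neighbourhoods are sparsely populated, which is exactly the behaviour an outlier score should have.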
In the next section, we discuss how to avoid model selection problems. To this end, a set of local
entropy kernels is first estimated from the data. Then, we estimate an average local entropy kernel
that takes into account the particular geometry of the space of positive definite matrices. In this way, we
obtain a unique low-dimensional data representation, from which outliers are detected. This approach
includes neither a model selection step nor a parameter estimation procedure.
embeddings φ_j: X → R^{d_j}, where K̃_j(x, y) = φ_j(x)ᵀ φ_j(y). As stated in Equation (1), each of the kernels
induces a kernel distance d_{K_j} on the original data space X, corresponding to the Euclidean distance on
the manifold Z_j = φ_j(X).
Next, we define a new set of transformations, suitable for anomaly detection, in line with the
theory of Section 2.1, by:

ϕ_j(x) = d_{K_j}(φ_j(x), φ_j(S_n)).    (8)
Now, kernel functions are positive definite type functions, i.e., the empirical kernel matrix
K, obtained via the evaluation of the kernel function on the set of n training points, belongs to the
cone of symmetric positive semidefinite matrices P := {K ∈ R^{n×n} | K = Kᵀ, K ⪰ 0}. Let K_1, ..., K_m be
the empirical kernel matrices defined in Equation (9), all of them in P, and let (w_1, ..., w_m)ᵀ be a
suitable non-negative vector of combination parameters; then define the "fusion" kernel K as

K(w_1, ..., w_m) := w_1 K_1 + ... + w_m K_m ⪰ 0.
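A non-negative linear combination of Gram matrices stays inside the cone P, which is the property the fusion kernel relies on. A minimal sketch (function and variable names ours):

```python
import numpy as np

def fusion_kernel(kernels, w):
    """Weighted combination K(w) = sum_j w_j K_j of PSD Gram matrices.
    Non-negative weights keep the result inside the PSD cone P."""
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0), "non-negative weights required to stay in the cone"
    return sum(wj * Kj for wj, Kj in zip(w, kernels))

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3)); B = rng.normal(size=(5, 4))
K1, K2 = A @ A.T, B @ B.T            # two PSD Gram matrices
K = fusion_kernel([K1, K2], [0.3, 0.7])
eigs = np.linalg.eigvalsh(K)         # all eigenvalues non-negative
```

The whole question addressed in this section is how to choose the weights w_j.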
In the context of SVM classification problems, the goal is to find the parameters w_1, ..., w_m
that maximize the optimal margin. Instead, in anomaly detection, the goal is to estimate the
parameters w_1, ..., w_m that produce a suitable data representation. This is achieved when the
regular data in the sample, represented in the coordinate space provided by the fusion kernel K,
exhibit low entropy (equivalently, are scarcely scattered), and the atypical observations are projected
into regions distant from those of the regular data.
Next we consider three particular combination schemes. The first is rather straightforward,
the second proposes the mean in the manifold that contains the kernels, and the third is a weighting
scheme that assigns the weights according to the use of appropriate choices of entropy functions.
Definition 1 (Multivariate sparsity measures). Consider m different sparsity measures φ_1, ..., φ_m and let
K_1, ..., K_m be the corresponding set of Mercer kernels, where K_i(x, y) = φ_i(x)ᵀ φ_i(y). We define a multivariate
concentration measure by Φ = (φ_1, ..., φ_m): X → R^m.
Thus, the kernel corresponding to a multivariate sparsity measure Φ = (φ_1, ..., φ_m) is the sum
of the univariate kernels K_i associated with the φ_i. This fact allows us to interpret linear combinations
of kernels Σ_i w_i K_i as coming from (weighted) multivariate sparsity measures.
where u_j ∈ (0, 1] are some positive constants that may be associated with each kernel matrix K_j.
We refer to [14] for a detailed description of the basics of semidefinite programming.
Proof. Given that λ*_j = E_{K_j}(S_n) / Σ_j E_{K_j}(S_n) ≥ 0 and K_j ⪰ 0, the constraint Σ_{j=1}^m λ_j K_j ⪰ 0 holds. In addition,

Σ_{j=1}^m λ*_j = Σ_{j=1}^m [ E_{K_j}(S_n) / Σ_j E_{K_j}(S_n) ] = 1.

Since all the λ*_j reach their upper bound, the theorem holds and the solution is unique.
The set of positive definite square matrices P is a Riemannian manifold, with inner product
⟨A, B⟩_X = Tr(X^{−1} A X^{−1} B) on the tangent space to P at the point X. The distance between A, B ∈ P
is given by d_P(A, B) = ‖log(A^{−1/2} B A^{−1/2})‖_F, where ‖·‖_F is the Frobenius norm, that is,
‖A‖_F = ( Σ_i Σ_j a_ij² )^{1/2}. Given kernel matrices K_1, ..., K_m, the Karcher mean, denoted onwards as K̃,
is defined as the minimizer of the function f(X) = Σ_{i=1}^m d_P(X, K_i)², and it is the unique solution X ∈ P of the
matrix equation Σ_{i=1}^m log(K_i^{−1} X) = 0.
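A standard way to compute the Karcher mean numerically is the fixed-point iteration sketched below (our own illustration, not the authors' implementation; see [17] for dedicated algorithms). For commuting matrices the result reduces to the matrix geometric mean:

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm

def karcher_mean(mats, iters=20):
    """Fixed-point iteration for the Karcher mean of SPD matrices:
    X <- X^{1/2} exp( (1/m) sum_i log(X^{-1/2} K_i X^{-1/2}) ) X^{1/2},
    converging to the minimizer of f(X) = sum_i d_P(X, K_i)^2."""
    X = sum(mats) / len(mats)            # arithmetic mean as starting point
    for _ in range(iters):
        Xh = sqrtm(X)
        Xih = np.linalg.inv(Xh)
        T = sum(logm(Xih @ K @ Xih) for K in mats) / len(mats)
        X = Xh @ expm(T) @ Xh
        X = np.real((X + X.T) / 2)       # symmetrize against round-off
    return X

# For commuting (here diagonal) SPD matrices, the Karcher mean is the
# element-wise geometric mean: diag(sqrt(1*4), sqrt(9*16)) = diag(2, 12).
K1, K2 = np.diag([1.0, 9.0]), np.diag([4.0, 16.0])
M = karcher_mean([K1, K2])
```

Each iteration maps the matrices to the tangent space at the current estimate, averages there, and maps back, which is what distinguishes this mean from the plain arithmetic average.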
4. Experimental Section
In this section, we illustrate, with the aid of multiple numerical examples and real data
sets, the performance of the proposed methodology when the goal is to detect abnormal
observations in a sample. We consider a list of several kernel functions, namely: (i) the Gaussian
kernel K_G(x_i, x_j) = e^{−σ‖x_i − x_j‖²}, with parameter σ defined on a grid of values
σ ∈ {0.1³, 0.1², 0.1, 1, 10, 50, 100, 500, 10³}; (ii) the linear kernel K_L(x_i, x_j) = ⟨x_i, x_j⟩; and (iii) the second-degree
polynomial kernel K_P(x_i, x_j) = (⟨x_i, x_j⟩ + 1)². As explained in Section 1, the combination
methods proposed can be considered as an alternative to model selection techniques for outlier
detection purposes. Therefore, the results obtained are presented jointly with the single kernel methods.
Our combination methods are denoted as: (i) the average kernel (K̄); (ii) the kernel constructed using
the Karcher mean of the single kernel functions (K̃); and (iii) the minimum entropy linear combination
kernel, or entropy kernel (E_K).
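The single-kernel Gram matrices used in the experiments can be assembled as follows (our sketch of the kernels and σ grid listed above; the function name is ours):

```python
import numpy as np

def gram_matrices(X, sigmas=(0.1**3, 0.1**2, 0.1, 1, 10, 50, 100, 500, 1e3)):
    """Gram matrices for the experimental kernels: the Gaussian kernel
    K_G = exp(-sigma * ||x_i - x_j||^2) over the sigma grid, the linear
    kernel K_L = <x_i, x_j>, and the degree-2 polynomial (<x_i, x_j> + 1)^2."""
    G = X @ X.T                                              # linear Gram
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G   # ||x_i - x_j||^2
    Ks = [np.exp(-s * sq) for s in sigmas]                   # Gaussian grid
    Ks.append(G)                                             # K_L
    Ks.append((G + 1.0) ** 2)                                # K_P
    return Ks

X = np.random.default_rng(1).normal(size=(10, 2))
Ks = gram_matrices(X)
```

These eleven matrices are the inputs that the average, Karcher-mean and entropy-weighted combinations operate on.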
For comparison purposes, we consider several alternative approaches for anomaly detection in
both the multivariate and the functional data frameworks. In the multivariate case, we consider some
alternative well-known techniques in the field of machine learning. These methods are: (i) LOF [18]
and (ii) HiCS [19]. In the functional case, we test our proposals against three widely used depth
measures: the Modified Band Depth (MBD) [20], the Modal Depth (HMD) [21] and the Random Tukey
Depth (RTD) [22]. These depth measures induce an order on the functional data set that can
be used to determine which observations (curves) are far from the deepest, or most central, point,
and can thus be classified as outliers.
Each database presents a set of regular observations and has been contaminated with abnormal or
outlying observations. Let P and N be the number of outliers and normal observations in the sample, respectively,
and let TP = True Positive and TN = True Negative be the respective quantities detected by different
methods. In Tables 1 and 2, we report the following average metrics TPR = TP/P (True Positive
Rate or sensitivity), TNR = TN/N (True Negative Rate or specificity). For the comparison with other
techniques using real data sets, in Tables 4 and 5, we report the area under the ROC curve (AUC) for
each experiment.
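The reported rates can be sketched as follows (our own helper; label 1 marks an outlier):

```python
def sensitivity_specificity(y_true, y_pred):
    """TPR = TP/P (sensitivity) and TNR = TN/N (specificity), as reported
    in Tables 1 and 2; label 1 = outlier, label 0 = normal observation."""
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    P = sum(y_true)
    N = len(y_true) - P
    return TP / P, TN / N

# 2 outliers, 3 normals: one outlier caught, one normal falsely flagged.
tpr, tnr = sensitivity_specificity([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

The AUC values of Tables 4 and 5 summarise the same trade-off over all score thresholds instead of a single one.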
Figure 1. Main data in black (•) and outlying observations in red (∗).
Synthetic functional data: We consider random samples of Gaussian processes {x_1(t), ..., x_n(t)},
with sizes 4000 and 2000, where a proportion ν = 0.1, known a priori, presents an atypical pattern, and
the remaining n(1 − ν) curves are considered the main data. We consider the following generating
processes:
X_l(t) = Σ_{j=1}^2 ξ_j sin(jπt) + ε_l(t), for l = 1, ..., (1 − ν)n,

Y_l(t) = Σ_{j=1}^2 ζ_j sin(jπt) + ε_l(t), for l = 1, ..., νn,
where t ∈ [0, 1], the ε(t) are independent autocorrelated random error functions, and (ξ_1, ξ_2) is
a normally-distributed bivariate random variable (NDMRV) with mean µ_ξ = (1, 2) and diagonal
covariance matrix Σ_ξ = diag(1, 1). To generate the outliers, we consider (ζ_1, ζ_2) NDMRV with
parameters µ_ζ = (4, 5) and Σ_ζ = Σ_ξ. The data are plotted in Figure 2.
Figure 2. Main data in black (—) and outlying observations in red (—).
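A sketch of the generating processes above (our own simplification: white-noise errors stand in for the autocorrelated errors of the paper, and the grid size is an assumption):

```python
import numpy as np

def simulate_curves(n, nu=0.1, m=50, seed=0):
    """Main curves X_l(t) = sum_{j=1,2} xi_j sin(j*pi*t) + eps_l(t) with
    (xi_1, xi_2) ~ N((1, 2), I), and a proportion nu of outliers with
    coefficient mean (4, 5), sampled on m grid points in [0, 1]."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, m)
    basis = np.vstack([np.sin(np.pi * t), np.sin(2 * np.pi * t)])  # (2, m)
    n_out = int(nu * n)
    xi = rng.normal([1.0, 2.0], 1.0, size=(n - n_out, 2))    # main data
    zeta = rng.normal([4.0, 5.0], 1.0, size=(n_out, 2))      # outliers
    eps = 0.1 * rng.normal(size=(n, m))                      # white noise
    curves = np.vstack([xi, zeta]) @ basis + eps
    labels = np.r_[np.zeros(n - n_out), np.ones(n_out)]
    return t, curves, labels

t, curves, labels = simulate_curves(200)
```

The outlying curves share the shape of the main data but have shifted basis coefficients, which is what the entropy kernels are expected to detect.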
Table 1 shows the results of the experiment using synthetic multivariate data. The best results are
marked in bold. It can be observed that the proposed combination methods, namely
the mean, the weighted entropy and the Karcher mean perform as well as the best single kernel in
terms of the TNR. With respect to the TPR, the best combination method is the one based on the
calculation of the Karcher mean.
Table 1. Percentage of TPR (sensitivity) and TNR (specificity) for synthetic multivariate data.
Experiment K_G(σ=0.1³) K_G(σ=0.1²) K_G(σ=0.1) K_G(σ=1) K_G(σ=10) K_G(σ=50) K_G(σ=100) K_G(σ=500) K_G(σ=10³) K_L K_P K̄ K̃ E_K
TPR 69.3 69.3 69.3 65.3 0.0 0.0 0.0 0.0 0.0 69.3 62.6 62.6 69.3 62.6
TNR 97.7 97.7 97.7 97.7 92.5 92.5 92.5 92.5 92.5 97.7 97.7 97.7 97.7 97.7
In Table 2, the results of the experiment using synthetic functional data are presented. In this case,
two of the three proposals, the mean and the weighted entropy, are always able to perform as well as
the best single kernel (the polynomial kernel) in terms of both the TNR and TPR. The method based on
the calculation of the Karcher mean obtains good results with respect to the TNR measure.
Table 2. Percentage of TPR (sensitivity) and TNR (specificity) for synthetic functional data.
Figure 3. NOX (left) and VDP (right) functional data sets. The sample of regular curves in black (“—”),
and abnormal curves in red (“—”).
Table 4 shows the results of the experiment using real multivariate data. It can be observed that
the best overall method on average is the weighted entropy proposal. In particular, this method attains
the best results for two of the six databases (Pima and Cardio), and for the rest of the sets its results
are close to the best ones. Although the proposed methodologies seem to perform systematically
better than other machine learning approaches, it is not clear, in terms of the AUC, whether for some
databases (Glass, Breast Cancer, Breast Cancer Diagnostic and Pima) the difference is statistically
significant.
Table 4. Area under the ROC curve (AUC) for multivariate data sets.
Experiment Glass Vertebral Breast Cancer Breast Cancer (Diag.) Pima Cardio
K_G(σ=0.1³) 88.13 57.5 60.9 94.8 49.5 90.0
K_G(σ=0.1²) 88.51 71.4 60.8 94.7 49.7 89.5
K_G(σ=0.1) 91.49 63.0 60.5 94.8 50.6 69.6
K_G(σ=1) 82.87 66.9 62.2 94.4 71.9 65.3
K_G(σ=10) 76.40 77.9 68.6 86.1 74.8 49.2
K_G(σ=50) 51.49 89.0 68.1 85.2 48.6 46.8
K_G(σ=100) 56.42 79.2 68.1 84.7 43.8 45.4
K_G(σ=500) 53.90 85.8 68.1 65.2 52.4 64.8
K_G(σ=10³) 57.51 79.5 68.1 72.9 62.2 67.0
K_L 87.86 72.8 60.8 94.8 49.1 90.1
K_P 85.85 74.1 59.4 96.3 49.4 94.8
K̄ 85.85 66.4 62.9 94.3 48.8 94.8
K̃ 88.13 82.2 60.7 94.4 78.7 63.4
E_K 88.13 82.2 61.4 94.4 78.7 94.8
LOF 76.8 59.3 56.4 86.9 70.9 59.6
HiCS 80.0 56.6 59.3 94.2 72.4 63.0
In Table 5, the results of the experiment using real functional data are presented. For the VDP
data set, in terms of the AUC measure, the weighted entropy and the mean proposals perform as well
as the best single kernels and the MBD. For the NOx data set, the best overall method is the one based
on the calculation of the Karcher mean, followed closely by the MBD approach.
Table 5. Area under the ROC curve (AUC) for functional data sets.
X(t) = ξ_1 sin(t) + ξ_2 cos(t), for t ∈ [0, π], with (ξ_1, ξ_2)ᵀ ~ N( µ = (0, 0)ᵀ, Σ = [[0.75, −0.5], [−0.5, 0.75]] ),    (14)

that is, (ξ_1, ξ_2) follows a zero-mean bivariate normal distribution with covariance parameters
σ_11 = σ_22 = 0.75 and σ_12 = σ_21 = −0.5. Using the representation techniques introduced in Section 2, we can
represent these curves as points in R² and, moreover, we can estimate (by Maximum Likelihood) a
covariance matrix Σ̂ using this data representation. We replicate the previous generating process 10
times, obtaining 10 estimated covariance matrices, namely Σ̂_i for i = 1, ..., 10. Next, we construct
the mean estimated covariance matrix, Σ̄ = (1/10) Σ_{i=1}^{10} Σ̂_i, and the Karcher mean estimated covariance
matrix, Σ̃ = Σ̃(Σ̂_1, ..., Σ̂_10). The estimates are illustrated in Figure 4 (left), where each ellipse (in grey, "—")
corresponds to the following equation:

(x_i(t), y_i(t))ᵀ = R(θ̂_i) ( (χ²_{2,0.99} λ̂_{1,i})^{1/2} cos(t), (χ²_{2,0.99} λ̂_{2,i})^{1/2} sin(t) )ᵀ, for t ∈ [0, 2π],
where χ²_{2,0.99} is the value of a chi-square with two degrees of freedom that accumulates 0.99 probability,
λ̂_{1,i} and λ̂_{2,i} are the estimated eigenvalues corresponding to each estimate Σ̂_i, and θ̂_i is the estimated
rotation angle with respect to the x_1 axis. In addition, in the same figure, the estimated mean
Σ̄ (its corresponding ellipse estimate) is shown in red ("- - -"), and the Karcher mean Σ̃ in blue
("- - -").
To introduce some anomaly in our data, in Figure 4 (right), we added one ellipse constructed
from an anomalous bivariate distribution whose covariance matrix has elements σ_11 = σ_22 = 7.5 and
σ_12 = σ_21 = −10; this atypical covariance matrix corresponds to a Gaussian stochastic model different
from the baseline introduced in Equation (14).
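The 99th percentile ellipses in Figure 4 can be reproduced from a covariance matrix by eigendecomposition; a sketch of ours (the function name is an assumption, not from the paper):

```python
import numpy as np
from scipy.stats import chi2

def confidence_ellipse(Sigma, level=0.99, m=200):
    """Boundary of the chi-square confidence ellipse for a zero-mean
    bivariate normal: points v with v^T Sigma^{-1} v = chi2_{2,level}."""
    c = chi2.ppf(level, df=2)                 # chi^2_{2,0.99} ~= 9.21
    lam, V = np.linalg.eigh(Sigma)            # eigenvalues and rotation
    t = np.linspace(0, 2 * np.pi, m)
    circle = np.vstack([np.cos(t), np.sin(t)])
    return (V @ np.diag(np.sqrt(c * lam)) @ circle).T   # (m, 2) boundary

Sigma = np.array([[0.75, -0.5], [-0.5, 0.75]])  # covariance of Equation (14)
E = confidence_ellipse(Sigma)
# Every boundary point has squared Mahalanobis distance chi2_{2,0.99}.
q = np.einsum('ij,jk,ik->i', E, np.linalg.inv(Sigma), E)
```

Applying the same construction to Σ̄ and Σ̃ yields the red and blue ellipses of the figure.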
Figure 4. First two coordinates of ten 99th percentile ellipses ("—"). Σ̄ ellipse in red ("- - -") and
Σ̃ ellipse in blue ("- - -"). Left panel: Gaussian scenario; right panel: Gaussian scenario contaminated
with an anomalous covariance matrix in black ("- - -").
It can be observed in Figure 4 (left) that the average covariance matrix and the Karcher mean of the
covariance matrices generate similar 99th percentile ellipses. Since the generated covariance matrices
Σ̂_i are located in a small region within the cone of positive semi-definite matrices, such a region can
be approximated by a linear subspace that contains the average covariance matrix. On the other
hand, in Figure 4 (right), the curvature of the cone becomes apparent through the behaviour of the
two means in the presence of the anomalous covariance matrix, illustrated by the ellipse with a black
dashed line. In this scenario, the Karcher mean of the covariance matrices generates a 99th percentile
ellipse similar to that of the regular scenario (left panel), which shows the robustness of the Karcher
mean in the presence of outliers. Nevertheless, in the contaminated scenario (right panel), the 99th
percentile ellipse generated with the simple average of the covariance matrices changes radically with
respect to the regular scenario. This robustness in the estimation of the covariance matrix allows us
to ensure that the procedure proposed in this paper, based on the estimation of the Karcher mean in
the cone of positive definite matrices, will be useful when solving atypical functional data
identification problems.
Last but not least, the relevant aspect of this numerical example is that, by using the Karcher mean
as an estimator of the centre of the distribution of positive semi-definite matrices, we minimize
the Riemannian distance defined in Section 3 and, as a consequence, the proposed method
is able to identify the anomalous covariance matrix with respect to the pattern given by the rest of
the distributions.
5. Discussion
In this work, we have explored how to combine different sources of information for anomaly
detection within the framework of Entropy measures. We define entropies associated with the
transformation induced by Mercer kernels, both for random variables and for data sets. We propose
a new class of combination methods that generate a single Mercer kernel (acting as a similarity
measure) for anomaly detection purposes from a set of entropy measures in the context of density
estimation. In particular, three combination schemes have been proposed and analysed, namely:
(i) an average of the kernel matrices; (ii) the mean in the manifold that contains the kernels;
and (iii) a weighting scheme that assigns the weights as the solution of an optimization problem
that seeks to maximize a particular kernel entropy. Such proposals, based on the idea of building the
final combined kernel matrix within the same variety where the kernel matrices to be combined live,
seem to be the most successful ones on average.
An innovative application of this methodology is the use of the Karcher mean as part of a method
to identify anomalous covariance matrices. The success of this proposal is due to the fact that the
Karcher mean acts as an estimator of the centre of the distribution of positive semi-definite matrices,
while minimizing the Riemannian distance, allowing the identification of the outlying matrices with
respect to the pattern given by such an estimator.
A relevant aspect for the method's applicability in real problems is its complexity and cost in
comparison with other alternatives. The proposals whose structure is based on a linear combination of
kernel matrices have a very low computational cost, involving only products by constants
and sums of matrices. The proposal based on the Karcher mean has the typical drawback
of any semidefinite programming problem, that is, the computational and memory costs are related
to the size of the matrices involved. Current systems are not able to deal with large dense matrices,
given that processing time and memory grow quasi-exponentially as the size of the matrices increases.
See [28] for a discussion on these aspects and current trends to improve the performance of methods for
the solution of semidefinite programming problems. Most applications for general dense matrices in
semidefinite programming involve a few hundred data cases. Fortunately, in this particular application
(outlier detection), we do not need to work with the full database to succeed. Due to the presence of
statistical regularities, a few thousand data cases will usually be enough to capture all the relevant
statistical aspects of the data set at hand.
Further research is needed, especially regarding the possibility of exploring other
embeddings of the data. For instance, higher-dimensional transformations specific to anomaly
detection could be designed. In this regard, care should be taken with the scaling of such
transformations, as dimensions with large magnitudes relative to the others may lead to
suboptimal results. In this work, for multivariate data, we have compared the proposed methodologies
with some multivariate outlier detection techniques. In the future, systematic experiments comparing
with other well-known methodologies, such as XGBOD [29], LODES [30], iForest [31] or MASS [32],
are to be carried out. Regarding these multivariate techniques, another interesting research line is
the extension of such methodologies to functional data analysis. In this regard, suitable multivariate
representations of functional data similar to those in [2] should be explored.
References
1. Moguerza, J.M.; Muñoz, A. Support vector machines with applications. Stat. Sci. 2006, 21, 322–336.
[CrossRef]
2. Muñoz, A.; González, J. Representing functional data using support vector machines. Pattern Recognit. Lett.
2010, 31, 511–516. [CrossRef]
3. Gold, C.; Sollich, P. Model selection for support vector machine classification. Neurocomputing 2003, 55,
221–249.
4. Chapelle, O.; Vapnik, V.; Bousquet, O.; Mukherjee, S. Choosing multiple parameters for support vector
machines. Mach. Learn. 2002, 46, 131–159. [CrossRef]
5. Wahba, G. Support Vector machines, Reproducing Kernel Hilbert Spaces, and randomized GACV.
In Advances in Kernel Methods: Support Vector Learning; Schoelkopf, B., Burges, C., Smola, A., Eds.; MIT Press:
Cambridge, MA, USA, 1999; pp. 69–88.
6. Lanckriet, G.R.; Cristianini, N.; Bartlett, P.; Ghaoui, L.E.; Jordan, M.I. Learning the kernel matrix with
semidefinite programming. J. Mach. Learn. Res. 2004, 5, 27–72.
7. De Diego, I.M.; Muñoz, A.; Moguerza, J.M. Methods for the combination of kernel matrices within a support
vector framework. Mach. Learn. 2010, 78, 137–172. [CrossRef]
8. Hsing, T.; Eubank, R. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators;
John Wiley & Sons: Hoboken, NJ, USA, 2015.
9. Smola, A.; Gretton, A.; Song, L.; Schölkopf, B. A Hilbert space embedding for distributions. In Proceedings
of the International Conference on Algorithmic Learning Theory, Sendai, Japan, 1–4 October 2007; pp. 13–31.
10. Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer:
New York, NY, USA, 2011.
11. Kimeldorf, G.; Wahba, G. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 1971,
33, 82–94. [CrossRef]
12. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on
Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960.
13. Martos, G.; Hernández, N.; Muñoz, A.; Moguerza, J.M. Entropy measures for stochastic processes with
applications in functional anomaly detection. Entropy 2018, 20, 33. [CrossRef]
14. Vandenberghe, L.; Boyd, S. Semidefinite programming. SIAM Rev. 1996, 38, 49–95. [CrossRef]
15. Karcher, H. Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math. 1977, 30,
509–541. [CrossRef]
16. Arnaudon, M.; Barbaresco, F.; Yang, L. Medians and means in Riemannian geometry: Existence, uniqueness
and computation. arXiv 2011, arXiv:1111.3120.
17. Bini, D.A.; Iannazzo, B. Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra
Appl. 2013, 438, 1700–1710. [CrossRef]
18. Breunig, M.M.; Kriegel, H.-P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers.
In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, Dallas, TX,
USA, 15–18 May 2000; pp. 93–104.
19. Keller, F.; Müller, E.; Böhm, K. HiCS: High Contrast Subspaces for Density-Based Outlier Ranking.
In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Washington, DC,
USA, 1–5 April 2012; pp. 1037–1048.
20. López-Pintado, S.; Romo, J. On the concept of depth for functional data. J. Am. Stat. Assoc. 2009, 104, 718–734.
[CrossRef]
21. Cuevas, A.; Febrero, M.; Fraiman, R. Robust estimation and classification for functional data via
projection-based depth notions. Comput. Stat. 2007, 22, 481–496. [CrossRef]
22. Cuesta-Albertos, J.A.; Nieto-Reyes, A. The random Tukey depth. Comput. Stat. Data Anal. 2008, 52,
4979–4988. [CrossRef]
23. Gelman, A.; Meng, X.L. A note on bivariate distributions that are conditionally normal. Am. Stat. 1991, 45,
125–126.
24. Blake, C.L.; Merz, C.J. UCI Repository of Machine Learning Databases. Available online: http://archive.ics.
uci.edu/ml/index.php (accessed on 10 September 2018).
25. Rayana, S. ODDS Library. Available online: http://odds.cs.stonybrook.edu (accessed on 10 September 2018).
26. Febrero-Bande, M.; de la Fuente, M.O. Statistical computing in functional data analysis: The R package fda.
usc. J. Stat. Softw. 2012, 51, 1–28. [CrossRef]
27. Moguerza, J.M.; Muñoz, A.; Psarakis, S. Monitoring nonlinear profiles using support vector machines.
In Iberoamerican Congress on Pattern Recognition; Springer: Berlin, Germany, 2007; pp. 574–583.
28. Zheng, Y.; Yan, Y.; Liu, S.; Huang, X.; Xu, W. An Efficient Approach to Solve the Large-Scale Semidefinite
Programming Problems. Math. Probl. Eng. 2012, 2012, 764760. [CrossRef]
29. Zhao, Y.; Hryniewicki, M.K. XGBOD: Improving Supervised Outlier Detection with Unsupervised
Representation Learning. In Proceedings of the International Joint Conference on Neural Networks (IJCNN),
Rio, Brazil, 8–13 July 2018.
30. Sathe, S.; Aggarwal, C. LODES: Local Density Meets Spectral Outlier Detection. In Proceedings of the 2016
SIAM International Conference on Data Mining, Miami, FL, USA, 5–7 May 2016; pp. 171–179.
31. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International
Conference on Data Mining (2008), Pisa, Italy, 15–19 December 2008; pp. 413–422.
32. Ting, K.M.; Chuan, T.S.; Liu, F.T. Mass: A New Ranking Measure for Anomaly Detection; Technical Report
TR2009/1; Gippsland School of Information Technology, Monash University: Victoria, Australia, 2009.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).