

Efficient Algorithms for Kernel Aggregation Queries

Tsz Nam Chan, Leong Hou U, Member, IEEE, Reynold Cheng, Member, IEEE, Man Lung Yiu, and Shivansh Mittal

Abstract—Kernel functions support a broad range of applications that require tasks such as density estimation, classification, regression, and outlier detection. For these tasks, a common online operation is to compute the weighted aggregation of kernel function values with respect to a set of points. However, scalable aggregation methods are still unknown for typical kernel functions (e.g., the Gaussian, polynomial, sigmoid, and additive kernels) and weighting schemes. In this paper, we propose a novel and effective bounding technique, leveraging index structures, to speed up the computation of kernel aggregation. In addition, we extend our technique to additive kernel functions, including the χ², intersection, JS, and Hellinger kernels, which are widely used in different communities, e.g., computer vision, medical science, and geoscience. To handle the additive kernel functions, we further develop novel and effective bound functions to efficiently evaluate the kernel aggregation. Experimental studies on many real datasets reveal that our proposed solution, KARL, achieves at least one order of magnitude speedup over the state-of-the-art for different types of kernel functions.

Index Terms—KARL, Kernel functions, lower and upper bounds

1 INTRODUCTION

Kernel functions are widely used in different machine learning models, including kernel density estimation (for statistical analysis), kernel regression (for prediction and forecasting) and kernel classification (for data mining). These types of models are actively used in many applications, which are summarized in Table 1. For density-estimation-based applications, astronomical scientists [19] utilize kernel density estimation to quantify the galaxy density. For regression-based applications, environmental scientists [45], [55], [42] utilize support vector regression to forecast the wind speed, which helps in predicting the energy generated by wind power. For classification-based applications, network security systems [7], [6] utilize kernel SVM to detect suspicious packets. Due to this wide range of applications, many open-source libraries, e.g., LibSVM [12] and Scikit-learn [40], also support the above machine learning models, which can be combined with different kernel functions.

TABLE 1: Applications of kernel-based machine learning models

Model | Application
Kernel density estimation/classification (KDE/KDC) [21], [20] | Galaxy density quantification [19]; Particle searching [15]
Support vector regression (SVR) [47] | Wind speed forecasting [45], [55], [42]; Ecological modeling [54]; Time series prediction [43], [46]; Image detection [26]
1-class support vector machine (OCSVM) [37] | Suspicious packet detection [7], [6]; Time series anomaly detection [25], [33]
2-class support vector machine (2CSVM) [47] | Suspicious packet detection [7], [6]; Cancer detection [14], [30]; Image classification [36], [13], [16]; Pedestrian detection [5], [3], [4]

In the above applications, a common online operation is to compute the following function:

F_P(q) = Σ_{p_i ∈ P} w_i · K(q, p_i)    (1)

where q is a query point, P is a dataset of points, w_i is a scalar, and K(q, p_i) denotes the kernel function between q and p_i. Table 2 summarizes the kernel functions that are widely used in existing work [12], [40], [36], [4]. The values γ, β and deg are constants, and dist(q, p) denotes the Euclidean distance between q and p. In addition, we denote q_ℓ and p_ℓ as the ℓth dimensional values of the vectors q and p respectively, and d is the dimensionality of the vectors.

TABLE 2: Kernel functions

Kernel name | Function K(q, p)
Gaussian | exp(−γ · dist(q, p)²)
Polynomial | (γ q · p + β)^deg
Sigmoid | tanh(γ q · p + β)
χ² | Σ_{ℓ=1}^{d} 2 q_ℓ p_ℓ / (q_ℓ + p_ℓ)
Intersection | Σ_{ℓ=1}^{d} min(q_ℓ, p_ℓ)
JS | Σ_{ℓ=1}^{d} (1/2) q_ℓ log₂((q_ℓ + p_ℓ)/q_ℓ) + (1/2) p_ℓ log₂((q_ℓ + p_ℓ)/p_ℓ)
Hellinger | Σ_{ℓ=1}^{d} √(q_ℓ p_ℓ)

• T. N. Chan, R. Cheng and S. Mittal are with the Department of Computer Science, The University of Hong Kong, Hong Kong. E-mail: {tnchan, ckcheng}@cs.hku.hk and [email protected]
• L. H. U is with the State Key Laboratory of Internet of Things for Smart City and the Department of Computer and Information Science, University of Macau, Macau. E-mail: [email protected]
• M. L. Yiu is with the Department of Computing, Hong Kong Polytechnic University, Hong Kong. E-mail: [email protected]
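Before turning to the online query types, the following minimal sketch (our own illustration, not code from the paper or its released library) makes Equation 1 concrete for two of the kernels in Table 2; the function names `gaussian_kernel`, `chi2_kernel` and `kernel_aggregation` are ours.

```python
import numpy as np

def gaussian_kernel(q, p, gamma):
    # K(q, p) = exp(-gamma * dist(q, p)^2), where dist is the Euclidean distance
    return np.exp(-gamma * np.sum((q - p) ** 2))

def chi2_kernel(q, p):
    # K(q, p) = sum_l 2 * q_l * p_l / (q_l + p_l); q and p are positive histograms
    return np.sum(2.0 * q * p / (q + p))

def kernel_aggregation(q, P, w, kernel):
    # Equation 1: F_P(q) = sum_i w_i * K(q, p_i) -- the O(nd) sequential-scan baseline
    return sum(w_i * kernel(q, p_i) for w_i, p_i in zip(w, P))

# Example usage: F_P(q) under the Gaussian kernel with gamma = 0.5
# f = kernel_aggregation(q, P, w, lambda q, p: gaussian_kernel(q, p, 0.5))
```

This direct evaluation is exactly the cost that the bounding techniques in the rest of the paper are designed to avoid.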
Two types of online prediction queries, namely KAQ and τKAQ, are adopted in different machine learning models. Figure 1 shows the visualization of the KAQ and τKAQ queries on one attribute (the 5th dimension) of the shuttle sensor dataset [2] with the Gaussian kernel function.

Approximate kernel aggregation query (KAQ): In the regression/density-estimation-based models, for example KDE and SVR, the relative error ϵ is used, such that for every query q, the returned approximate value F_approx is within 1 ± ϵ of F_P(q), i.e.,

(1 − ϵ) F_P(q) ≤ F_approx ≤ (1 + ϵ) F_P(q)    (2)

Thresholded kernel aggregation query (τKAQ): In the classification-based models, for example KDC, OCSVM and 2CSVM, the threshold τ is adopted, such that for every query q, the returned result is either 1 or −1, which indicates whether F_P(q) ≥ τ.

The above queries are expensive, as it takes O(nd) time to compute F_P(q) online, where d is the dimensionality of the data points and n is the cardinality of the dataset P. In the machine learning community, much recent literature [29], [24], [27] also complains about the inefficiency of computing kernel aggregation, for example:

• "Despite their successes, what makes kernel methods difficult to use in many large scale problems is the fact that computing the decision function is typically expensive, especially at prediction time." [29]
• "However, computing the decision function for the new test samples is typically expensive which limits the applicability of kernel methods to real-world applications." [24]
• "..., it has the disadvantage of requiring relatively large computations in the testing phase." [27]

To address the above inefficiency issues, existing solutions are divided into two camps (cf. Table 3). The machine learning community tends to improve the response time by using heuristics [27], [24], [29], [32] (e.g., sampling points in P), which may affect the quality of the model (e.g., classification/prediction accuracy). The other camp, which we are interested in, aims to enhance the efficiency while preserving the quality of the model. The pioneering solutions in this category are [21], [20], albeit they are only applicable to queries with type I weighting (see Table 3). Their idea [21], [20] is to build an index structure on the point set P offline, and then exploit index nodes to derive lower/upper bounds and attempt pruning for online queries.

TABLE 3: Types of weighting in F_P(q)

Type of weighting | Used in model | Techniques
Type I: identical, positive w_i (most specific) | KDE/KDC | Quality-preserving solutions [21], [20]
Type II: positive w_i (subsuming Type I) | OCSVM | Heuristics [32]
Type III: no restriction on w_i (subsuming Types I, II) | SVR, 2CSVM | Heuristics [27], [24], [29]

In our preliminary work [10], we propose the Kernel Aggregation Rapid Library (KARL)¹, which provides a fast solution to support the Gaussian, polynomial and sigmoid kernels (cf. Table 2) with different types of weighting (cf. Table 3). We also compare with two widely-used libraries, namely LibSVM [12] and Scikit-learn [40]. Implementation-wise, LibSVM is based on the sequential scan method, and Scikit-learn is based on the algorithm in [21] for query type I. We compare them with our proposal (KARL) in Table 4. As a remark, since Scikit-learn supports query types II and III via the wrapper of LibSVM [40], we remove those two query types from the row of Scikit-learn in Table 4. The features of KARL are: (i) it supports all three types of weightings as well as both KAQ and τKAQ queries, (ii) it supports index structures, and (iii) it yields much lower response time than the existing libraries.

1. https://github.com/edisonchan2013928/KARL-Fast-Kernel-Aggregation-Queries

TABLE 4: Comparisons of libraries

Library | Supported queries | Supported weightings | Support indexing | Response time
LibSVM [12] | τKAQ | Types II, III | no | high
Scikit-learn [40] | KAQ | Type I | yes | high
KARL (this paper) | KAQ, τKAQ | Types I, II, III | yes | low

However, existing libraries, including LibSVM [12], Scikit-learn [40] and KARL (our preliminary work) [10], only focus on the Gaussian, polynomial and sigmoid kernels. Besides these three famous kernels, another class of kernel functions, called additive kernels, including the χ², intersection, JS and Hellinger kernels, has attracted more attention in many application domains recently, e.g., machine learning [39], [41], [56] and computer vision [53], [36], [49]. In this work, we extend KARL to support all additive kernels.

Compared to our preliminary work [10], there are three new contributions in this work. First, we develop new lower and upper bound functions, which are based on the monotonicity property of additive kernel functions. We further show that these two bound functions can be computed in O(d) time if each q_ℓ is within a given range [q_ℓ^(min), q_ℓ^(max)]. This range can be specified during the offline stage. Second, we extend the linear bound functions in our preliminary work [10] to support those q_ℓ which are not within this range [q_ℓ^(min), q_ℓ^(max)]. Third, we further conduct new experiments for (1) additive kernels and (2) regression models, which are not supported in our preliminary work [10]. Our experimental results show that our method can achieve at least one order of magnitude speedup compared with the state-of-the-art methods.

We first discuss the related work in Section 2. Later, we present the state-of-the-art method for evaluating KAQ and τKAQ, using the Gaussian kernel, in Section 3. Then, we discuss our solution KARL for the Gaussian kernel in Section 4. Next, we extend our method KARL to handle additive kernel functions in Section 5. After that, we present our experiments in Section 6. Lastly, we conclude the paper with future research directions in Section 7.

2 RELATED WORK

The term "kernel aggregation query" abstracts a common operation in several statistical and learning problems such as kernel density estimation [21], [20], 1-class SVM [37], 2-class SVM [47] and support vector regression [47].

Kernel density estimation is a non-parametric statistical method for density estimation. To speed up kernel density estimation, existing work would either compute approximate density values with an accuracy guarantee [38] or test whether density values are above a given threshold [20]. Zheng et al. [58] focus on fast kernel density estimation on low-dimensional data (e.g., 1d, 2d) and propose sampling-based solutions with theoretical guarantees on both efficiency and quality.

Fig. 1: KAQ of the query region [0, 30] in the shuttle sensor dataset [2] (using the 5th dimension): (a) exact KAQ, (b) KAQ with ϵ = 0.2, (c) τKAQ with τ = 0.01. Each panel plots F_P(q) against q.

On the other hand, [38], [20] assume that the point set P is indexed by a k-d tree, and apply filter-and-refinement techniques for kernel density estimation. The library Scikit-learn [40] adopts the implementation in [38]. Our proposal, KARL, adapts the algorithm in [38], [20] to evaluate kernel aggregation queries. The key difference between KARL and [38], [20] lies in the bound functions. As explained in our preliminary work [10], our proposed linear bound functions are tighter than the existing bound functions used in [38], [20]. Furthermore, we extend our linear bound functions to deal with different types of weighting and kernel functions [10], which have not been considered in [38], [20].

SVM is proposed by the machine learning community to classify data objects or detect outliers. SVM has been applied in different application domains, such as document classification [37], network fault detection [57], [7], [6], anomaly/outlier detection [11], [31], novelty detection [25], [33], [48], cancer detection [14], [30], image classification [13], [53], [36], time series classification [28], and pedestrian detection [4]. The typical process is divided into two phases. In the offline phase, training/learning algorithms are used to obtain the point set P, the weighting, and the parameters. Then, in the online phase, the thresholded kernel aggregation query (τKAQ) can be used to support classification or outlier detection. Two approaches have been studied to accelerate the online phase. The library LibSVM [12] assumes a sparse data format and applies an inverted index to speed up exact computation. The machine learning community has proposed heuristics [27], [24], [29], [32] to reduce the size of the point set P in the offline phase, in order to speed up the online phase. However, these heuristics may affect the prediction quality of SVM. Our proposed bound functions have not been studied in the above work.

Most of the existing work, e.g., [47], [10], only focuses on three kernel functions: the Gaussian, polynomial and sigmoid kernels. Recently, additive kernels (e.g., the χ², intersection, JS and Hellinger kernels in Table 2) with SVM are extensively used for different applications, e.g., image classification [53], [36] in computer vision, detection of spatial properties of objects in geoscience [16], colorectal cancer detection in medical science [30], and pedestrian detection systems in transportation [4]. However, it is still time-consuming (still O(nd) time) to evaluate KAQ and τKAQ using additive kernels. To boost the efficiency, many approximation methods have been developed in existing work, which can be divided into two camps.

In the first camp, researchers [34], [50], [49], [53] aim to learn a finite D-dimensional feature representation of each vector, e.g., φ(q) for q and φ(p) for p, such that K(q, p) ≈ φ(q)^T φ(p), which reduces the time complexity of evaluating the kernel aggregation function from O(nd) to O(D) in the online phase. However, this type of work only provides approximate results for both KAQ and τKAQ, without any theoretical guarantee. In the second camp, researchers [23], [35], [36] discover that the kernel aggregation function can be evaluated exactly and efficiently with some additive kernels. For example, Maji et al. [35], [36] show that the kernel aggregation function can be computed in O(d log n) time using the intersection kernel. However, not all additive kernels (e.g., χ², JS and Hellinger) exhibit this property. To speed up the evaluation of the kernel aggregation function for other kernels, some work [36], [5], [4] further prestores the kernel aggregation values in a lookup table, which can be used to evaluate the kernel aggregation function efficiently but approximately. Even though this type of method is similar to our proposal, it does not provide any theoretical guarantee. In our work, we further support these kernel functions for evaluating KAQ and τKAQ, which guarantees that the returned result is within the relative error ϵ and is correctly classified with respect to the threshold τ, respectively.

3 STATE-OF-THE-ART (SOTA) FOR GAUSSIAN KERNEL

In this section, we introduce the state-of-the-art [21], [20] (SOTA), albeit it is only applicable to queries with type I weighting (see Table 3) and the Gaussian kernel function. In this case, we denote the common weight by w and K(q, p_i) = exp(−γ · dist(q, p_i)²) in Equation 1.

F_P(q) = Σ_{p_i ∈ P} w · exp(−γ · dist(q, p_i)²)    (3)

Bounding functions.
We introduce the concept of bounding rectangle [44] below.

Definition 1. Let R be the bounding rectangle for a point set P. We denote its interval in the j-th dimension as [R[j].l, R[j].u], where R[j].l = min_{p∈P} p[j] and R[j].u = max_{p∈P} p[j].

Given a query point q, we can compute the minimum distance mindist(q, R) from q to R, and the maximum distance maxdist(q, R) from q to R, i.e., the following inequality holds for every point p inside R.

mindist(q, R) ≤ dist(q, p) ≤ maxdist(q, R)

With the above notations, the lower bound LB_R(q) and the upper bound UB_R(q) for F_P(q) (Equation 3) are defined as:

LB_R(q) = w · R.count · exp(−γ · maxdist(q, R)²)
UB_R(q) = w · R.count · exp(−γ · mindist(q, R)²)

where R.count denotes the number of points (from P) in R, and w denotes the common weight (for type I weighting). It takes O(d) time to compute the above bounds online.
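As a minimal illustration of Definition 1 and the SOTA bounds (our own NumPy sketch, not the authors' implementation), the bounding rectangle, the min/max distances and LB_R(q)/UB_R(q) can be computed as follows; the function names are ours.

```python
import numpy as np

def bounding_rectangle(P):
    # R[j].l = min_p p[j], R[j].u = max_p p[j]  (Definition 1)
    return P.min(axis=0), P.max(axis=0)

def mindist2(q, low, up):
    # squared minimum distance from q to the rectangle [low, up]
    d = np.maximum(low - q, 0.0) + np.maximum(q - up, 0.0)
    return np.sum(d ** 2)

def maxdist2(q, low, up):
    # squared maximum distance from q to the rectangle [low, up]
    d = np.maximum(np.abs(q - low), np.abs(q - up))
    return np.sum(d ** 2)

def sota_bounds(q, low, up, count, w, gamma):
    # LB_R(q) = w * R.count * exp(-gamma * maxdist^2); UB_R(q) uses mindist
    lb = w * count * np.exp(-gamma * maxdist2(q, low, up))
    ub = w * count * np.exp(-gamma * mindist2(q, low, up))
    return lb, ub
```

Both bounds only touch the d rectangle coordinates, which is why they cost O(d) per node.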

Refinement of bounds.
The state-of-the-art [21], [20] employs a hierarchical index structure (e.g., a k-d tree) to index the point set P. Consider the example index in Figure 2. Each non-leaf entry (e.g., (R5, 9)) stores the bounding rectangle of its subtree (e.g., R5) and the number of points in its subtree (e.g., 9).

Fig. 2: Hierarchical index structure. The root node N_root contains the entries (R5, 9) and (R6, 9); node N5 contains (R1, 5) and (R2, 4); node N6 contains (R3, 4) and (R4, 5); the leaf nodes N1, N2, N3, N4 store the points p1, ..., p5; p6, ..., p9; p10, ..., p13; and p14, ..., p18, respectively.

We illustrate the running steps of the state-of-the-art on the above example index in Table 5. For conciseness, the notations LB_R(q), UB_R(q), F_P(q) are abbreviated as lb_R, ub_R, F_P respectively. The state-of-the-art maintains a lower bound lb̂ and an upper bound ûb for F_P(q). Initially, the bounding rectangle of the root node (say, R_root) is used to compute lb̂ and ûb. It uses a priority queue to manage the index entries that contribute to the computation of those bounds; the priority of an index entry R_i is defined as the difference ub_{R_i} − lb_{R_i}. In each iteration, the algorithm pops an entry R_i from the priority queue, processes the child entries of R_i, then refines the bounds incrementally and updates the priority queue. For example, in step 2, the algorithm pops the entry R5 from the priority queue, inserts its child entries R1, R2 into the priority queue, and refines the bounds incrementally. A similar technique can also be found in the similarity search community (e.g., [8], [9]).

TABLE 5: Running steps for the state-of-the-art

Step | Priority queue | Maintenance of lower bound lb̂ and upper bound ûb
1 | R_root | lb̂ = lb_{R_root}, ûb = ub_{R_root}
2 | R5, R6 | lb̂ = lb_{R5} + lb_{R6}, ûb = ub_{R5} + ub_{R6}
3 | R6, R1, R2 | lb̂ = lb_{R6} + lb_{R1} + lb_{R2}, ûb = ub_{R6} + ub_{R1} + ub_{R2}
4 | R1, R2, R3, R4 | lb̂ = lb_{R1} + lb_{R2} + lb_{R3} + lb_{R4}, ûb = ub_{R1} + ub_{R2} + ub_{R3} + ub_{R4}
5 | R2, R3, R4 | lb̂ = F_{p1···p5} + lb_{R2} + lb_{R3} + lb_{R4}, ûb = F_{p1···p5} + ub_{R2} + ub_{R3} + ub_{R4}

The state-of-the-art terminates upon reaching a termination condition. For τKAQ, the termination condition is: lb̂ ≥ τ or ûb < τ. For KAQ, the termination condition is: ûb < (1 + ϵ) lb̂.

4 KARL FOR GAUSSIAN KERNEL

Our proposed solution, KARL, adopts the state-of-the-art (SOTA) for query processing, except that the existing bound functions (e.g., LB_R(q) and UB_R(q)) are replaced by our bound functions. Our key contribution is to develop tighter bound functions for F_P(q) (cf. Equation 3). In Section 4.1, we propose a novel idea to bound the function exp(−x) and discuss how to compute such bound functions quickly. In Section 4.2, we devise tighter bound functions and show that they are always tighter than the existing bound functions.

In this section, we only focus on the Gaussian kernel in the function F_P(q). As a remark, our preliminary work [10] proposes some advanced techniques, e.g., auto-tuning, to further boost the efficiency. In addition, we also propose a method to support both the polynomial and sigmoid kernels in [10]. To save space, we omit these parts; interested readers can refer to [10].

4.1 Fast Linear Bound Functions

We wish to design bound functions such that (i) they are tighter than the existing bound functions (cf. Section 3), and (ii) they are efficient to compute, i.e., taking only O(d) computation time.

In this section, we assume type I weighting and denote the common weight by w. Consider an example on the dataset P = {p1, p2, p3}. Let x_i be the value γ · dist(q, p_i)². With this notation, the value F_P(q) (cf. Equation 3) can be simplified to:

w (exp(−x_1) + exp(−x_2) + exp(−x_3)).

In Figure 3, we plot the function value exp(−x) for x_1, x_2, x_3 as points.

Fig. 3: Linear bounds. The curve exp(−x) is bounded by a lower and an upper linear function on the interval [x_min, x_max], where x_min = γ · mindist(q, R)² and x_max = γ · maxdist(q, R)².

We first sketch our idea for bounding F_P(q). First, we compute the bounding interval of the x_i, i.e., the interval [x_min, x_max], where x_min = γ · mindist(q, R)², x_max = γ · maxdist(q, R)², and R is the bounding rectangle of P. Within that interval, we employ two functions E^L(x) and E^U(x) as lower and upper bound functions for exp(−x), respectively (cf. Definition 2). We illustrate these two functions by a red line and a blue line in Figure 3.

Definition 2 (Constrained bound functions). Given a query point q and a point set P, we call two functions E^L(x) and E^U(x) lower and upper bound functions for exp(−x), respectively, if

E^L(x) ≤ exp(−x) ≤ E^U(x)

holds for any x ∈ [γ · mindist(q, R)², γ · maxdist(q, R)²], where R is the bounding rectangle of P.

In this paper, we model the bound functions E^L(x) and E^U(x) by two linear functions Lin_{m_l,c_l}(x) = m_l x + c_l and Lin_{m_u,c_u}(x) = m_u x + c_u, respectively. Then, we define the aggregation of a linear function Lin_{m,c}(x) as:

FL_P(q, Lin_{m,c}) = Σ_{p_i ∈ P} w (m (γ · dist(q, p_i)²) + c)    (4)

With this concept, the functions FL_P(q, Lin_{m_l,c_l}) and FL_P(q, Lin_{m_u,c_u}) serve as a lower and an upper bound function for F_P(q) respectively, subject to the condition stated in Lemma 1.

Lemma 1. Suppose that Lin_{m_l,c_l}(x) and Lin_{m_u,c_u}(x) are lower and upper bound functions for exp(−x), respectively, for the query point q and point set P. It holds that:

FL_P(q, Lin_{m_l,c_l}) ≤ F_P(q) ≤ FL_P(q, Lin_{m_u,c_u})    (5)

Observe that the bound functions in Figure 3 are not tight. We will devise tighter bound functions in the next subsection.

Fast computation of bounds.
The following lemma allows FL_P(q, Lin_{m,c}) to be efficiently computed in O(d) time.

Lemma 2. Given two values m and c, FL_P(q, Lin_{m,c}) (Equation 4) can be computed in O(d) time and it holds that:

FL_P(q, Lin_{m,c}) = wmγ (|P| · ||q||² − 2 q · a_P + b_P) + wc|P|

where a_P = Σ_{p_i ∈ P} p_i and b_P = Σ_{p_i ∈ P} ||p_i||².

Proof.

FL_P(q, Lin_{m,c}) = Σ_{p_i ∈ P} w (m (γ · dist(q, p_i)²) + c)
= wmγ Σ_{p_i ∈ P} (||q||² − 2 q · p_i + ||p_i||²) + wc|P|
= wmγ (|P| · ||q||² − 2 q · a_P + b_P) + wc|P|

Observe that both terms a_P and b_P are independent of the query point q. Therefore, with the pre-computed values of a_P and b_P, FL_P(q, Lin_{m,c}) can be computed in O(d) time.

4.2 Tighter Bound Functions

We proceed to devise tighter bound functions by using Lin_{m_l,c_l}(x) and Lin_{m_u,c_u}(x).

Linear function Lin_{m_u,c_u}(x) for modeling E^U(x).
Observe from Figure 4 that the chord-based linear bound function Lin_{m_u,c_u}(x), which passes through the two points (x_min, exp(−x_min)) and (x_max, exp(−x_max)), acts as a proper upper bound function of exp(−x), given that each x_i is in the interval [x_min, x_max].

Fig. 4: Chord-based upper bound function. The chord E^U(x) = m_u x + c_u through (x_min, exp(−x_min)) and (x_max, exp(−x_max)) lies below the existing bound exp(−x_min).

It turns out that Lin_{m_u,c_u}(x) also leads to a tighter upper bound than the existing bound exp(−x_min) (see Section 3), as shown by the green dashed line in Figure 4, which is stated in Lemma 3.

Lemma 3. There exists a linear function Lin_{m_u,c_u}(x) = m_u x + c_u with:

m_u = (exp(−x_max) − exp(−x_min)) / (x_max − x_min)    (6)
c_u = (x_max exp(−x_min) − x_min exp(−x_max)) / (x_max − x_min)    (7)

such that FL_P(q, Lin_{m_u,c_u}) ≤ UB_R(q), where UB_R(q) is the upper bound function used in the state-of-the-art (see Section 3).

Linear function Lin_{m_l,c_l}(x) for modeling E^L(x).
To achieve a (1) correct and (2) tight lower bound for exp(−x), we utilize the tangent line, which passes through a tangent point, as a correct lower bound of exp(−x) (cf. Figure 5), due to the convexity of this function. Observe from Figure 5a that once we choose the tangent point (x_max, exp(−x_max)), the lower bound is already tighter than the existing bound exp(−x_max) (green dashed line in Figure 5a), as stated in Lemma 4.

Fig. 5: Tangent-based lower bound function: (a) tangent line E^L(x) = m_l x + c_l at x_max, compared with the existing bound exp(−x_max); (b) optimized tangent line at t.

Lemma 4. There exists a linear function Lin_{m_l,c_l}(x) such that FL_P(q, Lin_{m_l,c_l}) ≥ LB_R(q), where LB_R(q) is the lower bound function used in the state-of-the-art (see Section 3).

Interestingly, it is possible to devise a tighter bound than the above. Figure 5b depicts the tangent line at the point (t, exp(−t)). This tangent line offers a much tighter bound than the one in Figure 5a.

In the following, we demonstrate how to find the optimal tangent line (i.e., the one leading to the tightest bound). Suppose that the linear lower bound function Lin_{m_l,c_l}(x) is the tangent line at the point (t, exp(−t)). Then, we derive the slope m_l and the intercept c_l as:

m_l = d exp(−x)/dx |_{x=t} = −exp(−t)
c_l = exp(−t) − m_l t = (1 + t) exp(−t)

Theorem 1 establishes the optimal value t_opt that leads to the tightest bound.

Theorem 1. Consider the function FL_P(q, Lin_{m_l,c_l}) as a function of t, where m_l = −exp(−t) and c_l = (1 + t) exp(−t). This function yields the maximum value at:

t_opt = (γ / |P|) · Σ_{p_i ∈ P} dist(q, p_i)²    (8)

Proof. Let H(t) = FL_P(q, Lin_{m_l,c_l}) be a function of t. For the sake of clarity, we define the following two constants that are independent of t:

z_1 = wγ Σ_{p_i ∈ P} dist(q, p_i)²  and  z_2 = w|P|

Together with the given m_l and c_l, we can rewrite H(t) as:

H(t) = −z_1 exp(−t) + z_2 (1 + t) exp(−t)

The remaining proof is to find the maximum value of H(t). We first compute the first derivative of H(t) (in terms of t):

H'(t) = z_1 exp(−t) + z_2 exp(−t) − z_2 (1 + t) exp(−t)
= (z_1 + z_2 − z_2 − z_2 t) exp(−t)
= (z_1 − z_2 t) exp(−t)

Next, we compute the value t_opt such that H'(t_opt) = 0. Since exp(−t_opt) ≠ 0, we get:

z_1 − z_2 t_opt = 0
t_opt = z_1 / z_2 = (γ / |P|) · Σ_{p_i ∈ P} dist(q, p_i)²

Then we further test whether t_opt indeed yields the maximum value. We consider two cases for H'(t). Note that both z_1 and z_2 are positive constants.
1) For the case t < t_opt, we get H'(t) > 0, implying that H(t) is an increasing function.
2) For the case t > t_opt, we get H'(t) < 0, implying that H(t) is a decreasing function.
Thus, we conclude that the function H(t) yields the maximum value at t = t_opt.

The optimal value t_opt involves the term Σ_{p_i ∈ P} dist(q, p_i)². This term can be computed efficiently in O(d) time by the trick in Lemma 2: by applying Lemma 2 and substituting w = m = γ = 1 and c = 0, we can express Σ_{p_i ∈ P} dist(q, p_i)² in the form of FL_P(q, Lin_{m,c}), which can be computed in O(d) time.
xmax − xmin
Case study.
We conduct the following case study on the augmented k-d tree, in order to demonstrate the performance of KARL and the tightness of our bound functions compared to the existing bound functions. First, we pick a random query point from the home dataset [2] (see Section 6.1 for details). Then, we plot the lower/upper bound values of SOTA and KARL versus the number of iterations (Figure 6). Observe that our bounds are much tighter than the existing bounds, and thus KARL terminates sooner than SOTA.

Fig. 6: Bound values of SOTA and KARL vs. the number of iterations, for a type I-τ query on the home dataset. (Legend: LB_SOTA, UB_SOTA, LB_KARL, UB_KARL; x-axis: iteration ×10²; y-axis: bound value; the threshold τ and the stopping points of KARL and SOTA are marked.)

4.3 Other Types of Weighting

The state-of-the-art solution [21], [20] only considers type-I weighting. However, as stated in Table 3, different types of weighting are adopted in different statistical/machine learning models. In this section, we extend our bounding techniques to the following function under the other types of weighting:

F_P(q) = Σ_{p_i ∈ P} w_i · exp(−γ · dist(q, p_i)²)

4.3.1 Type II Weighting

For type II weighting, each w_i takes a positive value. Note that different w_i may take different values.
First, we redefine the aggregation of a linear function Lin_{m,c}(x) as:

FL_P(q, Lin_{m,c}) = Σ_{p_i ∈ P} w_i (m (γ · dist(q, p_i)²) + c)    (9)

This function can also be computed efficiently (i.e., in O(d) time), as shown in Lemma 5.

Lemma 5. Under type II weighting, FL_P(q, Lin_{m,c}) (Equation 9) can be computed in O(d) time, given two values m and c.

Proof.

FL_P(q, Lin_{m,c}) = Σ_{p_i ∈ P} w_i (m (γ · dist(q, p_i)²) + c)
= mγ Σ_{p_i ∈ P} w_i (||q||² − 2 q · p_i + ||p_i||²) + c Σ_{p_i ∈ P} w_i
= mγ (w_P · ||q||² − 2 q · a_P + b_P) + c w_P

where a_P = Σ_{p_i ∈ P} w_i p_i, b_P = Σ_{p_i ∈ P} w_i ||p_i||² and w_P = Σ_{p_i ∈ P} w_i.

The terms a_P, b_P, w_P are independent of q. With their pre-computed values, FL_P(q, Lin_{m,c}) can be computed in O(d) time.
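A small sketch of the type II version (ours, for illustration): the per-node statistics simply become weighted sums, and the O(d) evaluation of Equation 9 then follows Lemma 5.

```python
import numpy as np

def precompute_node_weighted(P, w):
    # per-node statistics for type II weighting:
    # w_P = sum_i w_i, a_P = sum_i w_i * p_i, b_P = sum_i w_i * ||p_i||^2
    w = np.asarray(w, dtype=float)
    return float(w.sum()), (w[:, None] * P).sum(axis=0), float(np.dot(w, np.sum(P ** 2, axis=1)))

def flp_weighted(q, m, c, w_P, a_P, b_P, gamma):
    # Lemma 5: FL_P(q, Lin_{m,c}) = m*gamma*(w_P*||q||^2 - 2 q.a_P + b_P) + c*w_P
    return m * gamma * (w_P * np.dot(q, q) - 2.0 * np.dot(q, a_P) + b_P) + c * w_P
```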

It remains to discuss how to find tight bound functions. For the upper bound function, we adopt the same technique as in Figure 4. For the lower bound function, we use the idea in Figure 5b, except that the optimal value t_opt should now also depend on the weighting (cf. Theorem 2).

Theorem 2. Consider the function FL_P(q, Lin_{m_l,c_l}) as a function of t, where m_l = −exp(−t) and c_l = (1 + t) exp(−t). This function yields the maximum value at:

t_opt = (γ / w_P) · Σ_{p_i ∈ P} w_i · dist(q, p_i)²    (10)

where w_P = Σ_{p_i ∈ P} w_i.

Proof. Following the proof of Theorem 1, we let H(t) = FL_P(q, Lin_{m_l,c_l}) be a function of t and we define the following two constants:

z_1 = γ Σ_{p_i ∈ P} w_i · dist(q, p_i)²  and  z_2 = w_P

Then, we follow exactly the same steps as in Theorem 1 to derive the optimal t_opt (Equation 10).

Again, the value t_opt can be computed efficiently (i.e., in O(d) time).

4.3.2 Type III Weighting

For type III weighting, there is no restriction on w_i. Each w_i takes either a positive value or a negative value.
Our idea is to convert the problem into two subproblems that use type II weighting. First, we partition the point set P into two sets P+ and P− such that: (i) all points in P+ are associated with positive weights, and (ii) all points in P− are associated with negative weights. Then we introduce the following notation:

F̄_{P−}(q) = Σ_{p_i ∈ P−} |w_i| · exp(−γ · dist(q, p_i)²) = −F_{P−}(q)

This enables us to rewrite the function F_P(q) as:

F_P(q) = Σ_{p_i ∈ P+ ∪ P−} w_i · exp(−γ · dist(q, p_i)²)
= F_{P+}(q) + F_{P−}(q)
= F_{P+}(q) − F̄_{P−}(q)

Since the weights in both F_{P+}(q) and F̄_{P−}(q) are positive, the terms F_{P+}(q) and F̄_{P−}(q) can be bounded by using the techniques for type II weighting. The upper bound of F_P(q) can be computed as the upper bound of F_{P+}(q) minus the lower bound of F̄_{P−}(q). The lower bound of F_P(q) can be computed as the lower bound of F_{P+}(q) minus the upper bound of F̄_{P−}(q).

5 KARL FOR ADDITIVE KERNELS

In the previous sections, we mainly focused on the Gaussian kernel function. However, as discussed in the Introduction (cf. Section 1), additive kernel functions (e.g., the χ², intersection, JS and Hellinger kernels in Table 2) have recently attracted attention in both the machine learning [56], [17], [41], [39] and computer vision [35], [49], [36], [53] communities, and are actively used in the following applications. Some pedestrian detection systems [5], [3], [4] utilize additive kernels to detect humans. Medical scientists [30] utilize additive kernels to detect colorectal cancer. Geoscientists [16] utilize additive kernels to characterize spatial properties of objects in a scene. However, as for the Gaussian kernel, evaluating the kernel aggregation function F_P(q) with additive kernels also takes O(nd) time, which is slow in the prediction phase of machine learning models. Therefore, we extend KARL to support this class of kernel functions.

Additive kernel functions exhibit the following additive property (cf. Definition 3).

Definition 3 (Additive property [36]). We call K(q, p) an additive kernel if we can express this function as the sum of d one-dimensional kernel functions (denoted as K_ℓ(q_ℓ, p_ℓ), where 1 ≤ ℓ ≤ d), which correspond to the different dimensions:

K(q, p) = Σ_{ℓ=1}^{d} K_ℓ(q_ℓ, p_ℓ)    (11)

As a remark, additive kernels are mainly used for histograms. Therefore, we also follow existing work [53], [36], [49] and regard q and p as histograms, i.e., q_ℓ and p_ℓ are both positive in this section. Table 6 summarizes the most representative K_ℓ(q_ℓ, p_ℓ) [49], [53] for the different additive kernels.

TABLE 6: K_ℓ(q_ℓ, p_ℓ) for different additive kernel functions

Kernel | K_ℓ(q_ℓ, p_ℓ)
χ² | 2 q_ℓ p_ℓ / (q_ℓ + p_ℓ)
Intersection | min(q_ℓ, p_ℓ)
JS | (1/2) q_ℓ log₂((q_ℓ + p_ℓ)/q_ℓ) + (1/2) p_ℓ log₂((q_ℓ + p_ℓ)/p_ℓ)
Hellinger | √(q_ℓ p_ℓ)

Observe from Equation 11 that an additive kernel function K(q, p) consists of d one-dimensional kernel functions K_ℓ(q_ℓ, p_ℓ). Therefore, we can convert the kernel aggregation function (cf. Equation 1) into the composition of one-dimensional kernel aggregation functions F_{P_ℓ}(q_ℓ) [36] (cf. Lemma 6), where P_ℓ denotes the set of ℓth-dimensional values of the data points in P, i.e., P_ℓ = {p_{1ℓ}, p_{2ℓ}, ..., p_{nℓ}}, and F_{P_ℓ}(q_ℓ) is:

F_{P_ℓ}(q_ℓ) = Σ_{p_{iℓ} ∈ P_ℓ} w_i · K_ℓ(q_ℓ, p_{iℓ})    (12)

Lemma 6. Given an additive kernel K(q, p), the kernel aggregation function F_P(q) (cf. Equation 1) is the sum of the d one-dimensional kernel aggregation functions F_{P_ℓ}(q_ℓ), i.e.,

F_P(q) = Σ_{ℓ=1}^{d} F_{P_ℓ}(q_ℓ)    (13)

Proof.

F_P(q) = Σ_{p_i ∈ P} w_i · K(q, p_i) = Σ_{p_i ∈ P} w_i · Σ_{ℓ=1}^{d} K_ℓ(q_ℓ, p_{iℓ})
= Σ_{ℓ=1}^{d} Σ_{p_{iℓ} ∈ P_ℓ} w_i · K_ℓ(q_ℓ, p_{iℓ}) = Σ_{ℓ=1}^{d} F_{P_ℓ}(q_ℓ)

The above lemma implies that we can focus on developing lower and upper bound functions for the one-dimensional kernel aggregation function F_{P_ℓ}(q_ℓ), and then add them up to obtain the bounds of F_P(q).

In the following, we present two bounding techniques for F_{P_ℓ}(q_ℓ): the monotonicity-based bounds (cf. Section 5.1) and the linear bounds (cf. Section 5.2). Then, we further propose a two-step method to integrate these two bounding techniques for solving KAQ and τKAQ in Section 5.3.

5.1 Monotonicity-based Lower and Upper Bounds for F_{P_ℓ}(q_ℓ) (KARL_mono)

In this section, we develop new lower and upper bound functions for F_{P_ℓ}(q_ℓ), which are based on the monotonicity property. To simplify the discussion, we mainly focus on the χ² kernel function (cf. Table 6) and type-II weighting (i.e., w_i ≥ 0). However, these bound functions can be extended to the other additive kernel functions and the other weightings (using a technique similar to Section 4.3.2).

Observe from Figure 7 that the larger q_ℓ is, the larger the function value 2q_ℓ x / (q_ℓ + x). Therefore, once we let x = p_{iℓ} and have computed the two values F_{P_ℓ}(q̲_ℓ) and F_{P_ℓ}(q̂_ℓ), where q̲_ℓ ≤ q_ℓ ≤ q̂_ℓ, we can conclude F_{P_ℓ}(q̲_ℓ) ≤ F_{P_ℓ}(q_ℓ) ≤ F_{P_ℓ}(q̂_ℓ), i.e., these two values act as the lower and upper bounds of F_{P_ℓ}(q_ℓ) (cf. Lemma 7).

Fig. 7: Lower and upper bounds for 2q_ℓ x / (q_ℓ + x). The curves for q_ℓ = 0.4, 0.5, 0.6 show that the function value increases with q_ℓ.

Lemma 7. If q̲_ℓ ≤ q_ℓ ≤ q̂_ℓ, we have F_{P_ℓ}(q̲_ℓ) ≤ F_{P_ℓ}(q_ℓ) ≤ F_{P_ℓ}(q̂_ℓ), where the kernel functions are from Table 6.

Proof. In this proof, we only focus on the χ² kernel function. However, we can extend this proof to the other kernel functions in Table 6 using a similar idea.
Given q̲_ℓ ≤ q̂_ℓ, we have:

F_{P_ℓ}(q̂_ℓ) − F_{P_ℓ}(q̲_ℓ) = Σ_{p_{iℓ} ∈ P_ℓ} w_i (2 q̂_ℓ p_{iℓ} / (q̂_ℓ + p_{iℓ})) − Σ_{p_{iℓ} ∈ P_ℓ} w_i (2 q̲_ℓ p_{iℓ} / (q̲_ℓ + p_{iℓ}))
= Σ_{p_{iℓ} ∈ P_ℓ} w_i (2 (q̂_ℓ − q̲_ℓ) p_{iℓ}² / ((q̲_ℓ + p_{iℓ})(q̂_ℓ + p_{iℓ}))) ≥ 0

Therefore, we have F_{P_ℓ}(q̲_ℓ) ≤ F_{P_ℓ}(q̂_ℓ). By the same argument, we can also conclude F_{P_ℓ}(q̲_ℓ) ≤ F_{P_ℓ}(q_ℓ) ≤ F_{P_ℓ}(q̂_ℓ).

Even though the bound values F_{P_ℓ}(q̲_ℓ) and F_{P_ℓ}(q̂_ℓ) correctly act as the lower and upper bounds for F_{P_ℓ}(q_ℓ) respectively, it is not feasible to compute these two bounds on-the-fly, since the response time would be the same as the exact evaluation of F_{P_ℓ}(q_ℓ). To avoid exact evaluation in the online phase, we precompute F_{P_ℓ}(q) for several values of q in advance, as shown in Figure 8. In the online phase, we can directly use the precomputed values F_{P_ℓ}(q̲_ℓ) and F_{P_ℓ}(q̂_ℓ) as the lower and upper bounds of F_{P_ℓ}(q_ℓ) respectively, once q̲_ℓ ≤ q_ℓ ≤ q̂_ℓ.

Fig. 8: Precomputation of F_{P_ℓ}(q) in the ℓth dimension (four values of F_{P_ℓ}(q) are precomputed in this example). The samples are spaced on the q-axis with interval size δ between q_ℓ^(min) and q_ℓ^(max), and q_ℓ falls between two consecutive samples q̲_ℓ and q̂_ℓ.

Observe from Figure 8 that there is a trade-off between the bound values and the precomputation cost (space and time). Here, we demonstrate how to sample the q-axis uniformly with interval size δ (cf. Figure 8) in the offline phase, such that the precomputed values achieve the tolerance ε between the lower bound F_{P_ℓ}(q̲_ℓ) and the upper bound F_{P_ℓ}(q̂_ℓ), i.e., F_{P_ℓ}(q̂_ℓ) ≤ (1 + ε) F_{P_ℓ}(q̲_ℓ), which guarantees that these two bounds are not far away from each other (cf. Lemma 8).

Lemma 8. Given δ as the interval size of consecutive samples, i.e., δ = q̂_ℓ − q̲_ℓ, and ε as the tolerance, if δ fulfills the condition in Table 7, we conclude F_{P_ℓ}(q̂_ℓ) ≤ (1 + ε) F_{P_ℓ}(q̲_ℓ).

Proof. In this proof, we focus on the χ² kernel function.

F_{P_ℓ}(q̂_ℓ) = Σ_{p_{iℓ} ∈ P_ℓ} w_i (2 q̂_ℓ p_{iℓ} / (q̂_ℓ + p_{iℓ})) = Σ_{p_{iℓ} ∈ P_ℓ} w_i (2 (q̲_ℓ + δ) p_{iℓ} / (q̲_ℓ + δ + p_{iℓ}))
≤ Σ_{p_{iℓ} ∈ P_ℓ} w_i (2 (q̲_ℓ + δ) p_{iℓ} / (q̲_ℓ + p_{iℓ}))
= F_{P_ℓ}(q̲_ℓ) + δ Σ_{p_{iℓ} ∈ P_ℓ} w_i (2 p_{iℓ} / (q̲_ℓ + p_{iℓ}))

Therefore, we have the following relative error:

(F_{P_ℓ}(q̂_ℓ) − F_{P_ℓ}(q̲_ℓ)) / F_{P_ℓ}(q̲_ℓ) ≤ δ / q̲_ℓ ≤ δ / q_ℓ^(min)

To ensure the tolerance is within ε, we set:

δ / q_ℓ^(min) ≤ ε  ⟹  δ ≤ ε × q_ℓ^(min)

Hence, we prove the above claim.

In the proof of Lemma 8, we only focus on the χ² kernel. For the other kernel functions in Table 6, we have similar results, but with different conditions on δ, which are summarized in Table 7. We denote p_ℓ^(min) = min_{p_{iℓ} ∈ P_ℓ} p_{iℓ} and p_ℓ^(max) = max_{p_{iℓ} ∈ P_ℓ} p_{iℓ}. The detailed proofs for the other kernel functions (intersection, JS and Hellinger) are shown in the Appendix (cf. Section 8).

TABLE 7: Different conditions for the interval size δ

Kernel | Condition for δ
χ² | δ ≤ ε × q_ℓ^(min)
Intersection | δ ≤ ε × min(q_ℓ^(min), p_ℓ^(min))
JS | δ ≤ ε × min(q_ℓ^(min), p_ℓ^(min) × (log₂(q_ℓ^(min)/p_ℓ^(max)) + 1))
Hellinger | δ ≤ ε² × q_ℓ^(min)

If q_ℓ is in the range [q_ℓ^(min), q_ℓ^(max)] for every ℓ = 1, 2, ..., d, we can obtain the lower and upper bounds, denoted as F̲ and F̂ respectively, for F_P(q) in O(d) time based on lookup operations, where:

F̲ = Σ_{ℓ=1}^{d} F_{P_ℓ}(q̲_ℓ)  and  F̂ = Σ_{ℓ=1}^{d} F_{P_ℓ}(q̂_ℓ)
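The following sketch (ours, under the assumptions above: histogram data, q_ℓ restricted to [q_ℓ^(min), q_ℓ^(max)], and the χ² condition δ = ε · q_ℓ^(min)) illustrates the offline grid precomputation and the O(d) lookup bounds F̲ and F̂; all names are ours.

```python
import numpy as np

def build_grid(P, w, q_min=0.01, q_max=1.0, tol=0.05):
    # offline: for each dimension l, tabulate F_{P_l}(q) on a uniform grid whose
    # step is delta = tol * q_min (the chi^2 condition in Table 7)
    delta = tol * q_min
    grid = np.arange(q_min, q_max + delta, delta)
    tables = []
    for l in range(P.shape[1]):
        col = P[:, l]
        tables.append(np.array([np.dot(w, 2.0 * g * col / (g + col)) for g in grid]))
    return grid, tables

def mono_bounds(q, grid, tables):
    # online: look up the two grid samples bracketing each q_l and add them up
    lo_idx = np.clip(np.searchsorted(grid, q, side='right') - 1, 0, len(grid) - 2)
    F_lo = sum(tables[l][lo_idx[l]] for l in range(len(q)))
    F_hi = sum(tables[l][lo_idx[l] + 1] for l in range(len(q)))
    return F_lo, F_hi   # F_lo <= F_P(q) <= F_hi, with F_hi <= (1 + tol) * F_lo
```

With q_min = 0.01 and tol = 0.05 as in Section 6.3, the grid has (1 − 0.01)/0.0005 = 1980 samples per dimension.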

Moreover, these two bound values are also within the tolerance ε.

Theorem 3. If q_ℓ is in the range [q_ℓ^(min), q_ℓ^(max)] and the interval size δ fulfills the condition in Table 7, with ε as the tolerance, for every ℓ = 1, 2, ..., d, then we have F̂ ≤ (1 + ε) F̲.

Proof. If each q_ℓ is in the range [q_ℓ^(min), q_ℓ^(max)], then q_ℓ lies in one interval [q̲_ℓ, q̂_ℓ]. Using the result in Lemma 8, F_{P_ℓ}(q̂_ℓ) ≤ (1 + ε) F_{P_ℓ}(q̲_ℓ) for every q_ℓ ∈ [q_ℓ^(min), q_ℓ^(max)]. Therefore, we have:

F̂ = Σ_{ℓ=1}^{d} F_{P_ℓ}(q̂_ℓ) ≤ (1 + ε) Σ_{ℓ=1}^{d} F_{P_ℓ}(q̲_ℓ) = (1 + ε) F̲

In both Lemma 8 and Theorem 3, there is one assumption: we need to know the range of q_ℓ, i.e., [q_ℓ^(min), q_ℓ^(max)], in advance for each dimension. In practice, we can utilize the interval of each dimension in P to estimate [q_ℓ^(min), q_ℓ^(max)], i.e., we estimate q_ℓ^(min) = p_ℓ^(min) and q_ℓ^(max) = p_ℓ^(max). Normally, [p_ℓ^(min), p_ℓ^(max)] is not large, as we need to normalize each dimension of the dataset to be within [0, 1] before training the regression and classification models, e.g., SVM [12]. Note that the interval size δ depends on the value q_ℓ^(min) (cf. Lemma 8). We cannot set it to a very small value, e.g., near 0, as it incurs a very large precomputation cost, in both space and time. As an example, if q_ℓ^(min) = 0.00001, q_ℓ^(max) = 1 and ε = 0.01, then, for the χ² kernel, δ = 10⁻⁷ and the corresponding precomputation space (or the number of intervals) for each dimension is (1 − 0.00001)/10⁻⁷, which is nearly 10 million. To avoid this huge precomputation cost, we restrict the smallest estimated value of q_ℓ^(min) to be 0.01, even though it is possible that p_ℓ^(min) < 0.01.

5.2 Linear Bounds for Out-of-Range q_ℓ (KARL_linear)

As discussed in Section 5.1, we only utilize the interval of each dimension in P to estimate [q_ℓ^(min), q_ℓ^(max)]. However, it is possible for q_ℓ to be out of range in the online phase. To handle this situation, we extend our linear bounds (cf. Sections 4.1 and 4.2) to this case. As an example, we focus on the χ² kernel in Equation 12 (i.e., Equation 14) in this section. However, our method can also be easily applied to the other kernel functions.

F_{P_ℓ}(q_ℓ) = Σ_{p_{iℓ} ∈ P_ℓ} w_i (2 q_ℓ p_{iℓ} / (q_ℓ + p_{iℓ}))    (14)

In order to extend our linear bounds to Equation 14, we let x_i = p_{iℓ} and define the following aggregation of a linear function, where Lin_{m,c}(x) = mx + c:

FL_{P_ℓ}(q_ℓ, Lin_{m,c}) = Σ_{p_{iℓ} ∈ P_ℓ} w_i · Lin_{m,c}(p_{iℓ})

Based on concepts similar to Definition 2 and Lemma 1, our goal is to find two linear functions Lin_{m_l,c_l}(x) = m_l x + c_l and Lin_{m_u,c_u}(x) = m_u x + c_u such that Lin_{m_l,c_l}(x) ≤ 2q_ℓ x/(q_ℓ + x) ≤ Lin_{m_u,c_u}(x). Once we obtain these two linear functions, we have FL_{P_ℓ}(q_ℓ, Lin_{m_l,c_l}) ≤ F_{P_ℓ}(q_ℓ) ≤ FL_{P_ℓ}(q_ℓ, Lin_{m_u,c_u}) (cf. Lemma 9).

Lemma 9. Given Lin_{m_l,c_l}(x) ≤ 2q_ℓ x/(q_ℓ + x) ≤ Lin_{m_u,c_u}(x), we have: FL_{P_ℓ}(q_ℓ, Lin_{m_l,c_l}) ≤ F_{P_ℓ}(q_ℓ) ≤ FL_{P_ℓ}(q_ℓ, Lin_{m_u,c_u}).

Observe from Figure 9 that, since 2q_ℓ x/(q_ℓ + x) is a concave function, we can simply use the chord and the tangent line as Lin_{m_l,c_l}(x) and Lin_{m_u,c_u}(x), respectively. To find m_l, c_l, m_u and c_u, we can utilize concepts similar to Section 4.2, which are omitted here. The concavity property can also be found for the other additive kernel functions (cf. Table 6). Therefore, we can simply extend Lemma 9 to the other additive kernel functions.

Fig. 9: Linear lower and upper bound functions for the function 2q_ℓ x/(q_ℓ + x) on [x_min, x_max].

After we find suitable Lin_{m,c}(x), we can obtain the O(1)-time bound function FL_{P_ℓ}(q_ℓ, Lin_{m,c}) for F_{P_ℓ}(q_ℓ) (cf. Equation 14), as stated in Lemma 10.

Lemma 10. Given two values m and c, we can evaluate FL_{P_ℓ}(q_ℓ, Lin_{m,c}) in O(1) time.

Proof.

FL_{P_ℓ}(q_ℓ, Lin_{m,c}) = Σ_{p_{iℓ} ∈ P_ℓ} w_i · Lin_{m,c}(p_{iℓ}) = Σ_{p_{iℓ} ∈ P_ℓ} w_i (m p_{iℓ} + c) = m a_{P_ℓ} + c b_P

where a_{P_ℓ} = Σ_{p_{iℓ} ∈ P_ℓ} w_i p_{iℓ} and b_P = Σ_{p_{iℓ} ∈ P_ℓ} w_i can be precomputed.

To efficiently support the lower and upper bound functions in each dimension, we prebuild multiple index structures (e.g., multiple binary trees). Once q_ℓ is out of range, we can use the tree for the ℓth dimension to find an approximate value F̂_ℓ which lies between F_{P_ℓ}(q_ℓ) and (1 + ε) F_{P_ℓ}(q_ℓ).

5.3 Integration of Our Bounds to Solve KAQ and τKAQ (KARL)

In this section, we discuss how to integrate our bound functions (cf. Sections 5.1 and 5.2) to solve KAQ and τKAQ.
Recall from Section 3 that the termination conditions for τKAQ and KAQ are (lb̂ ≥ τ or ûb < τ) and ûb < (1 + ϵ) lb̂, respectively. To solve KAQ and τKAQ with additive kernels, we adopt the following two-step method to maintain the bounds lb̂ and ûb; a code sketch follows the list.

1) We either obtain the monotonicity-based lower and upper bounds, i.e., F_{P_ℓ}(q̲_ℓ) and F_{P_ℓ}(q̂_ℓ) respectively (cf. Section 5.1), or compute the linear bounds once q_ℓ is out of range (cf. Section 5.2), using a kd-tree/ball-tree over the ℓth-dimensional points of P (with a concept similar to Section 3). We denote the lower and upper bounds for the ℓth dimension as l_ℓ and u_ℓ respectively. Then, we update lb̂ and ûb.
2) If lb̂ and ûb do not fulfill the termination condition, we incrementally perform a sequential scan (SCAN) for each dimension (giving higher priority to dimensions with a larger value of u_ℓ − l_ℓ), i.e., we compute F_{P_ℓ}(q_ℓ) exactly, to refine the bounds lb̂ and ûb, until the termination condition is fulfilled.
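The following is a rough sketch (ours, not the released implementation) of that two-step loop, written against a per-dimension bound helper (e.g., the grid lookup of Section 5.1, or linear bounds when q_ℓ is out of range) and an exact per-dimension scan as the refinement step; all names are ours.

```python
def karl_additive(d, dim_bounds, exact_dim, done):
    # dim_bounds(l) -> (l_l, u_l): monotonicity-based or, if q_l is out of range,
    #                  linear bounds for dimension l (Sections 5.1 and 5.2)
    # exact_dim(l)  -> F_{P_l}(q_l): exact per-dimension aggregation (SCAN)
    # done(lb, ub)  -> termination test for KAQ or tau-KAQ (Section 3)
    lo, hi = [], []
    for l in range(d):                       # step 1: cheap bounds for every dimension
        l_l, u_l = dim_bounds(l)
        lo.append(l_l)
        hi.append(u_l)
    lb, ub = sum(lo), sum(hi)
    # step 2: refine dimensions in decreasing order of their bound gap u_l - l_l
    order = sorted(range(d), key=lambda l: hi[l] - lo[l], reverse=True)
    for l in order:
        if done(lb, ub):
            break
        v = exact_dim(l)                     # replace the l-th bounds by the exact value
        lb += v - lo[l]
        ub += v - hi[l]
    return lb, ub
```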

6 EXPERIMENTAL EVALUATION

We introduce the experimental setting in Section 6.1. Next, we demonstrate the performance of the different methods with the Gaussian kernel function in Section 6.2. Lastly, we demonstrate the performance of the different methods with the χ² kernel function in Sections 6.3 and 6.4.

6.1 Experimental Setting

6.1.1 Datasets

For type-I, type-II and type-III weighting, we take the application models to be kernel density estimation, 1-class SVM, and 2-class SVM/SVR, respectively. We use a wide variety of real datasets for these models, as shown in Table 8. The value n_raw denotes the number of points in the raw dataset, and the value d denotes the data dimensionality. These datasets are all from data repository websites [2], [12]. Some datasets have also been used in existing work, e.g., [20], [41].

For type-I weighting, we follow [20] and use Scott's rule to obtain the parameter γ. Type-II and type-III datasets require a training phase. We consider two kernel functions: the Gaussian and χ² kernels. We denote the number of remaining points in the dataset after training as n_model^gauss and n_model^χ², for the Gaussian kernel and the χ² kernel respectively.

The LIBSVM software [12] is used in the training phase. The training parameters are configured as follows. For each type-II dataset, we apply 1-class SVM training, with the default kernel parameter γ = 1/d [12]. Then we vary the training model parameter ν from 0.01 to 0.3 and choose the model which provides the highest accuracy. For each type-III dataset, we apply 2-class SVM/SVR training with the automatic script in [12] to determine suitable values for the training parameters.

TABLE 8: Details of datasets

Model | Raw dataset | n_raw | n_model^gauss | n_model^χ² | d
Type I: kernel density | miniboone [2] | 119596 | n/a | n/a | 50
 | home [2] | 918991 | n/a | n/a | 10
 | susy [2] | 4990000 | n/a | n/a | 18
Type II: 1-class SVM | nsl-kdd [1] | 67343 | 17510 | n/a | 41
 | kdd99 [2] | 972780 | 19461 | n/a | 41
 | covtype [12] | 581012 | 25486 | n/a | 54
Type III: 2-class SVR | cadata [2] | 10640 | 1643 | 1179 | 8
 | wave [2] | 62000 | 3454 | 2331 | 48
 | 3D-road [2] | 424874 | 177585 | 176847 | 3
Type III: 2-class SVM | skin [2] | 235057 | 14628 | 28124 | 3
 | cod-rna [12] | 478565 | 114765 | 62368 | 8
 | home [2] | 918991 | 411461 | 297560 | 10

6.1.2 Methods for Comparisons

In our experimental study, we compare different state-of-the-art methods with our solution. SCAN is the sequential scan method which computes F_P(q) without any pruning. SOTA is the state-of-the-art method developed by [20] for handling the kernel density classification problem, i.e., the I-τ query; we modify and extend their framework to handle the other types of queries. LIBSVM [12] is the well-known library for handling both support-vector-machine-based regression and classification models, i.e., query types II-τ, III-ϵ and III-τ. In our preliminary version [10], we develop KARL for the Gaussian kernel function, which follows the framework of [20], combined with our linear bound functions LB_KARL and UB_KARL. In this extension, we extend KARL to support additive kernels: we denote by KARL_mono the method with the monotonicity-based bound functions (cf. Section 5.1) and by KARL_linear the method with the linear bound functions (cf. Section 5.2) combined with multiple binary trees (one per dimension). We further integrate both methods KARL_mono and KARL_linear (which handles the out-of-range case of q_ℓ) into a unified one (cf. Section 5.3), i.e., KARL, to support the χ² kernel function for both query types III-ϵ and III-τ.

6.2 Efficiency Evaluation for Different Query Types with Gaussian Kernel

We test the performance of the different methods for five types of queries: I-ϵ, I-τ, II-τ, III-ϵ and III-τ. The parameters of these queries are set as follows.

Types I-ϵ and III-ϵ. We set the relative error ϵ = 0.2 for each dataset.

Type I-τ. We fix the mean value µ of F_P(q) over the query set Q, i.e., µ = Σ_{q∈Q} F_P(q)/|Q|, as the threshold τ for each dataset in Table 9.

Types II-τ and III-τ. The threshold τ is obtained during the training phase.

TABLE 9: Throughput (queries/sec) of all methods for different types of queries with the Gaussian kernel

Type | Datasets | SCAN | LIBSVM | Scikit | SOTA | KARL
I-ϵ | miniboone | 36.1 | n/a | 36 | 16.5 | 301
 | home | 15.2 | n/a | 11.9 | 36.2 | 187
 | susy | 2.02 | n/a | 1.17 | 0.77 | 13.2
I-τ | miniboone | 36.1 | n/a | n/a | 102 | 510
 | home | 15.2 | n/a | n/a | 93.2 | 258
 | susy | 2.02 | n/a | n/a | 3.58 | 83.4
II-τ | nsl-kdd | 283 | 481 | n/a | 748 | 20668
 | kdd99 | 260 | 520 | n/a | 1269 | 11324
 | covtype | 158 | 462 | n/a | 448 | 6022
III-ϵ | cadata | 1951 | 2001 | n/a | 1756 | 13539
 | wave | 940 | 1205 | n/a | 442 | 33482
 | 3D-road | 41 | 62 | n/a | 28 | 27334
III-τ | skin | 495 | 504 | n/a | 248 | 4104
 | cod-rna | 55.4 | 68.2 | n/a | 27.6 | 854
 | home | 14.7 | 18.3 | n/a | 8.6 | 231

Table 9 shows the throughput of the different methods for all types of queries. In the results for query type I-ϵ, SCAN is comparable to Scikit and SOTA, since the bounds computed by the basic bound functions are not tight enough; the performance of Scikit and SOTA is affected by the overhead of the loose bound computations. KARL uses our new bound functions, which are shown to provide tighter bounds. These bounds lead to a significant speedup on all evaluated datasets, e.g., KARL is at least 5.16 times faster than the other methods. For query type III-ϵ, our method KARL further improves the efficiency by 7x to 28x compared with the other methods.

For query type I-τ, our method KARL improves the throughput by 2.76x to 21.2x when compared to the runner-up method SOTA. The improvement becomes more obvious for type II-τ and type III-τ queries, where the improvement of KARL can be up to 31x compared to SOTA. KARL achieves a significant performance gain for all these queries due to its tighter bound values compared with SOTA.

Sensitivity of τ. In order to test the sensitivity of the different methods to the threshold τ, we select five thresholds from the range µ − σ to µ + 3σ, where σ = √(Σ_{q∈Q} (F_P(q) − µ)² / |Q|) is the standard deviation.
Figure 10 shows the results on the two largest datasets (home and susy). Due to the superior performance of our bound functions, KARL outperforms SOTA by nearly one order of magnitude on most of the datasets, regardless of the chosen threshold.

Fig. 10: Query throughput (queries/sec) with query type I-τ, varying the threshold τ from µ − σ to µ + 3σ: (a) home, (b) susy. Methods: SCAN, SOTA, KARL.

Sensitivity of ϵ. In the Scikit-learn library [40], we can select different relative errors ϵ in the approximate KDE. To test the sensitivity, we vary the relative error ϵ for the two largest datasets (home and susy) with query type I-ϵ. If the value of ϵ is very small, the room for the bound estimations is very limited, so that neither KARL nor SOTA performs well in a very small ϵ setting (e.g., 0.05). For the other, more typical ϵ settings, our method KARL consistently outperforms the other methods by a visible margin (cf. Figure 11).

Fig. 11: Query throughput (queries/sec) with query type I-ϵ, varying the relative error ϵ from 0.05 to 0.3: (a) home, (b) susy.

Sensitivity of dataset size. In the following experiment, we test how the size of the dataset affects the evaluation performance of the different methods for both query types I-ϵ and I-τ. We choose the largest dataset (susy) and vary its size via sampling. The trend in Figure 12 meets our expectation: a smaller size implies a higher throughput. Our KARL in general outperforms the other methods by one order of magnitude over a wide range of data sizes.

Fig. 12: Query throughput (queries/sec) on the susy dataset, varying the dataset size from 1 × 10⁶ to 5 × 10⁶: (a) type I-τ, fixing τ = µ, (b) type I-ϵ, fixing ϵ = 0.2.

6.3 Efficiency Evaluation for Different Query Types with χ² Kernel

In this section, we test the efficiency for the χ² kernel (cf. Table 10); our techniques can be extended to the other additive kernel functions. Since additive kernels are mainly used in regression (query type III-ϵ) and classification (query type III-τ) models, we only conduct the experiments with query type III. By default, we fix the relative error ϵ = 0.2 for KAQ. In addition, we fix the tolerance parameter ε = 0.05 for the KARL_mono method. Since we fix q_ℓ^(min) = 0.01 (cf. Section 5.1), we can choose the interval size δ = ε × q_ℓ^(min) = 0.0005 (the χ² condition in Table 7). As such, the number of precomputed q_ℓ is (1 − 0.01)/0.0005 = 1980. Observe from Table 10 that both our method KARL_mono and our method KARL are significantly better than the existing methods (e.g., SCAN, LibSVM and SOTA), by at least one order of magnitude. On the other hand, since the monotonicity-based bound functions are tighter than the linear bound functions, KARL_mono also achieves a significant speedup compared with our method KARL_linear [10].

TABLE 10: Throughput (queries/sec) of all methods for different types of queries with the χ² kernel

Type | Datasets | SCAN | LIBSVM | SOTA | KARL_linear | KARL_mono | KARL
III-ϵ | cadata | 941 | 1102 | 366 | 1438 | 13243 | 15684
 | wave | 87 | 104 | 26.8 | 149 | 2863 | 3216
 | 3D-road | 18.1 | 26.3 | 3.7 | 45 | 183 | 201
III-τ | skin | 115 | 180 | 303 | 6741 | 23273 | 31232
 | cod-rna | 24.3 | 32.5 | 52.5 | 701 | 10187 | 14875
 | home | 4.25 | 7.73 | 6.92 | 824 | 1919 | 2457

To test the sensitivity to the relative error ϵ for KAQ with the χ² kernel function, we vary the parameter ϵ from 0.05 to 0.3 and measure the throughput of each method, using the two largest datasets, wave and 3D-road. Observe from Figure 13 that, since our monotonicity-based lower and upper bound functions are effective, both our method KARL_mono and KARL outperform the existing methods by at least one order of magnitude.

Fig. 13: Query throughput (queries/sec) with the χ² kernel, varying the relative error ϵ from 0.05 to 0.3: (a) wave, (b) 3D-road. Methods: SCAN, LIBSVM, SOTA, KARL_linear, KARL_mono, KARL.

6.4 Accuracy Evaluation for Different Methods with χ² Kernel

As discussed in Section 2, there are many approximation methods for supporting fast evaluation of the kernel aggregation function using additive kernels. However, none of these methods provides a theoretical guarantee relating the returned result to the exact solution, for either τKAQ or KAQ.
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE) 12

solution, for both τKAQ and ϵKAQ. In this section, we compare the accuracy of three state-of-the-art approximation methods from the computer vision and machine learning communities, namely NNmap [53], PPLmap [41] and VLFeatmap [49], against the exact methods (e.g., SCAN, KARL) on the classification task (i.e., τKAQ). NNmap, PPLmap and VLFeatmap create high-dimensional feature maps (or feature vectors), which can then be trained and predicted efficiently with the linear SVM [18], in order to approximate the additive kernel functions. However, these approximation methods normally sacrifice accuracy compared with the exact methods (e.g., KARL). Table 11 summarizes the accuracy results for the χ2 kernel function. The accuracies of VLFeatmap, PPLmap and NNmap are 8.6%-14.4%, 7.32%-8.53% and 5.36%-17.1% lower than those of the exact methods, respectively.

TABLE 11: Accuracy result (in %) for different methods with χ2 kernel

Datasets   Exact (e.g., KARL)   VLFeatmap [49]   PPLmap [41]   NNmap [53]
skin       97.57                83.13            90.25         92.21
cod-rna    96.69                86.22            88.35         79.63
home       87.36                78.75            78.83         75.24
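To make the feature-map idea above concrete, the sketch below shows how an additive-kernel decision value collapses to a single dot product once an explicit feature map is available. It is our own minimal illustration, not the actual constructions of NNmap [53], PPLmap [41] or VLFeatmap [49]: we use the Hellinger kernel, whose feature map Ψ(x) = √x is exact, whereas the χ2-style maps used by those methods are finite-dimensional approximations, which is one source of the accuracy gap reported in Table 11.

```python
import numpy as np

# Illustration only: for the Hellinger kernel k(x, y) = sum_d sqrt(x_d * y_d),
# the explicit feature map psi(x) = sqrt(x) is exact, so the kernelized
# decision value sum_i w_i * k(q, p_i) equals a dot product with one
# precomputed weight vector, which a linear SVM can evaluate directly.
def feature_map(X):
    return np.sqrt(X)

rng = np.random.default_rng(0)
P = rng.random((5, 4))        # 5 support vectors with 4 histogram bins each
w = rng.standard_normal(5)    # weights, e.g. alpha_i * y_i from a trained SVM
q = rng.random(4)             # query histogram

kernel_form = sum(w_i * np.sum(np.sqrt(q * p)) for w_i, p in zip(w, P))
linear_form = feature_map(q) @ (w @ feature_map(P))  # single precomputed vector

assert np.isclose(kernel_form, linear_form)
```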
7 CONCLUSION

In this paper, we propose the solution, called KARL, to support kernel aggregation queries, which are used in different types of machine learning models, including kernel density estimation/classification [21], [20], 1-class SVM [37], 2-class SVM and SVR [47].

Our contribution is threefold. First, we extend our work [10] to support other important kernel functions, called additive kernels, which are widely used in different communities, e.g., machine learning, computer vision, Geoscience, etc. Then, we develop the monotonicity-based bound functions for these kernel functions. After that, we further integrate both the monotonicity-based bound functions and the linear bound functions into one unified solution to support these kernel functions. Experimental studies on a wide variety of datasets show that our solution KARL yields higher throughput than the state-of-the-art solution by at least one order of magnitude for kernel aggregation queries with additive kernel functions.

In the future, we will develop efficient algorithms for the prediction phase of other machine learning models, e.g., kernel clustering [52], graph-kernel-based classification [51] and kernelized support tensor machines [22]. On the other hand, we will also extend our work to the training phase of kernel-based machine learning models.

ACKNOWLEDGEMENT

This work was supported by the National Key Research and Development Plan of China (No. 2019YFB2102100). Tsz Nam Chan and Reynold Cheng were supported by the Research Grants Council of Hong Kong (RGC Projects HKU 17229116, 106150091, and 17205115), the University of Hong Kong (Projects 104004572, 102009508, and 104004129), and the Innovation and Technology Commission of Hong Kong (ITF project MRP/029/18). Leong Hou U was supported by the Science and Technology Development Fund, Macau SAR (File no. 0015/2019/AKP and SKL-IOTSC-2018-2020) and University of Macau (File no. MYRG2019-00119-FST). Man Lung Yiu was supported by grant GRF 152050/19E from the Hong Kong RGC.

REFERENCES

[1] NSL-KDD dataset. https://github.com/defcom17/.
[2] UCI machine learning repository. http://archive.ics.uci.edu/ml/index.php.
[3] J. Baek, S. Hong, J. Kim, and E. Kim. Efficient pedestrian detection at nighttime using a thermal camera. Sensors, 17(8):1850, 2017.
[4] J. Baek, J. Hyun, and E. Kim. A pedestrian detection system accelerated by kernelized proposals. IEEE Transactions on Intelligent Transportation Systems, pages 1–13, 2019.
[5] J. Baek, J. Kim, and E. Kim. Fast and efficient pedestrian detection via the cascade implementation of an additive kernel support vector machine. IEEE Trans. Intelligent Transportation Systems, 18(4):902–916, 2017.
[6] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita. Network anomaly detection: Methods, systems and tools. IEEE Communications Surveys and Tutorials, 16(1):303–336, 2014.
[7] A. L. Buczak and E. Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys and Tutorials, 18(2):1153–1176, 2016.
[8] T. N. Chan, M. L. Yiu, and K. A. Hua. A progressive approach for similarity search on matrix. In SSTD, pages 373–390. Springer, 2015.
[9] T. N. Chan, M. L. Yiu, and K. A. Hua. Efficient sub-window nearest neighbor search on matrix. IEEE Trans. Knowl. Data Eng., 29(4):784–797, 2017.
[10] T. N. Chan, M. L. Yiu, and L. H. U. KARL: fast kernel aggregation queries. In ICDE, pages 542–553, 2019.
[11] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, 2009.
[12] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[13] Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and S. Yan. Contextualizing object detection and classification. IEEE Trans. Pattern Anal. Mach. Intell., 37(1):13–27, 2015.
[14] H.-S. Chiu et al. Pan-cancer analysis of lncRNA regulation supports their targeting of cancer genes in each tumor context. Cell Reports, 23(1):297–312, Apr. 2018.
[15] K. Cranmer. Kernel estimation in high-energy physics. 136:198–207, 2001.
[16] B. Demir and L. Bruzzone. Histogram-based attribute profiles for classification of very high resolution remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(4):2096–2107, April 2016.
[17] D. K. Duvenaud, H. Nickisch, and C. E. Rasmussen. Additive Gaussian processes. In NIPS, pages 226–234, 2011.
[18] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[19] B. J. Ferdosi, H. Buddelmeijer, S. C. Trager, M. H. F. Wilkinson, and J. B. T. M. Roerdink. Comparison of density estimation methods for astronomical datasets. Astronomy and Astrophysics, 2011.
[20] E. Gan and P. Bailis. Scalable kernel density classification via threshold-based pruning. In ACM SIGMOD, pages 945–959, 2017.
[21] A. G. Gray and A. W. Moore. Nonparametric density estimation: Toward computational tractability. In SDM, pages 203–211, 2003.
[22] L. He, C. Lu, G. Ma, S. Wang, L. Shen, P. S. Yu, and A. B. Ragin. Kernelized support tensor machines. In ICML, pages 1442–1451, 2017.
[23] M. Herbster. Learning additive models online with fast evaluating kernels. In COLT, pages 444–460, 2001.
[24] C. Hsieh, S. Si, and I. S. Dhillon. Fast prediction for large-scale kernel machines. In NIPS, pages 3689–3697, 2014.
[25] C. Huang, G. Min, Y. Wu, Y. Ying, K. Pei, and Z. Xiang. Time series anomaly detection for trustworthy services in cloud computing systems. IEEE Trans. Big Data, 2017.
[26] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What makes a photograph memorable? IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1469–1482, 2014.
[27] H. G. Jung and G. Kim. Support vector number reduction: Survey and experimental evaluations. IEEE Trans. Intelligent Transportation Systems, 15(2):463–476, 2014.
[28] A. Kampouraki, G. Manis, and C. Nikou. Heartbeat time series classification with support vector machines. IEEE Trans. Information Technology in Biomedicine, 13(4):512–518, 2009.
[29] Q. V. Le, T. Sarlós, and A. J. Smola. Fastfood - computing Hilbert space expansions in loglinear time. In ICML, pages 244–252, 2013.
[30] W. Li, M. Coats, J. Zhang, and S. McKenna. Discriminating dysplasia: Optical tomographic texture analysis of colorectal polyps. Medical Image Analysis, 26:57–69, 2015.
[31] B. Liu, Y. Xiao, P. S. Yu, L. Cao, Y. Zhang, and Z. Hao. Uncertain one-class learning and concept summarization learning on uncertain data streams. IEEE Trans. Knowl. Data Eng., 26(2):468–484, 2014.
[32] Y. Liu, Y. Liu, and Y. Chen. Fast support vector data descriptions for novelty detection. IEEE Trans. Neural Networks, 21(8):1296–1313, 2010.
[33] J. Ma and S. Perkins. Time-series novelty detection using one-class support vector machines. In IJCNN, pages 1741–1745, 2003.
[34] S. Maji and A. C. Berg. Max-margin additive classifiers for detection. In ICCV, pages 40–47, 2009.
[35] S. Maji, A. C. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, 2008.
[36] S. Maji, A. C. Berg, and J. Malik. Efficient classification for additive kernel SVMs. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):66–77, 2013.
[37] L. M. Manevitz and M. Yousef. One-class SVMs for document classification. Journal of Machine Learning Research, 2:139–154, 2001.
[38] A. W. Moore. The anchors hierarchy: Using the triangle inequality to survive high dimensional data. In UAI, pages 397–405, 2000.
[39] M. Mutny and A. Krause. Efficient high dimensional Bayesian optimization with additivity and quadrature Fourier features. In NeurIPS, pages 9019–9030, 2018.
[40] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. VanderPlas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[41] O. Pele, B. Taskar, A. Globerson, and M. Werman. The pairwise piecewise-linear embedding for efficient non-linear classification. In ICML, pages 205–213, 2013.
[42] Y. Ren, P. N. Suganthan, and N. Srikanth. A novel empirical mode decomposition with support vector regression for wind speed forecasting. IEEE Trans. Neural Netw. Learning Syst., 27(8):1793–1798, 2016.
[43] D. Sahoo, S. C. H. Hoi, and B. Li. Online multiple kernel regression. In SIGKDD, pages 293–302, 2014.
[44] H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006.
[45] G. Santamaría-Bonfil, A. Reyes-Ballesteros, and C. Gershenson. Wind speed forecasting for wind farms: A method based on support vector regression. Renewable Energy, 85(C):790–809, 2016.
[46] N. I. Sapankevych and R. Sankar. Time series prediction using support vector machines: A survey. IEEE Comp. Int. Mag., 4(2):24–38, 2009.
[47] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning series. MIT Press, 2002.
[48] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In NIPS, pages 582–588, 1999.
[49] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Trans. Pattern Anal. Mach. Intell., 34(3):480–492, 2012.
[50] A. Vedaldi and A. Zisserman. Sparse kernel approximations for efficient classification and detection. In CVPR, pages 2320–2327, 2012.
[51] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. J. Mach. Learn. Res., 11:1201–1242, 2010.
[52] S. Wang, A. Gittens, and M. W. Mahoney. Scalable kernel k-means clustering with Nyström approximation: Relative-error bounds. J. Mach. Learn. Res., 20:12:1–12:49, 2019.
[53] Z. Wang, X. Yuan, Q. Liu, and S. Yan. Additive nearest neighbor feature maps. In ICCV, pages 2866–2874, 2015.
[54] K. Were, D. T. Bui, Øystein B. Dick, and B. R. Singh. A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an afromontane landscape. Ecological Indicators, 52:394–403, 2015.
[55] B. Wolff, J. Kühnert, E. Lorenz, O. Kramer, and D. Heinemann. Comparing support vector regression for PV power forecasting to a physical modeling approach using measurement, numerical weather prediction, and cloud motion data. Solar Energy, 135(C):197–208, 2016.
[56] J. Wu, W. Tan, and J. M. Rehg. Efficient and effective visual codebook generation using additive kernels. Journal of Machine Learning Research, 12:3097–3118, 2011.
[57] L. Zhang, J. Lin, and R. Karim. Adaptive kernel density-based anomaly detection for nonlinear systems. Knowledge-Based Systems, 139(Supplement C):50–63, 2018.
[58] Y. Zheng, J. Jestes, J. M. Phillips, and F. Li. Quality and efficiency for kernel density estimates in large data. In SIGMOD, pages 433–444, 2013.

Tsz Nam Chan received the bachelor's degree in electronic and information engineering and the PhD degree in computer science from the Hong Kong Polytechnic University in 2014 and 2019, respectively. He is currently a research associate in the University of Hong Kong. His research interests include multidimensional similarity search, pattern matching and kernel methods for machine learning.

Leong Hou U completed his B.Sc. in Computer Science and Information Engineering at Taiwan Chi Nan University, his M.Sc. in E-commerce at University of Macau, and his Ph.D. in Computer Science at University of Hong Kong. He is now an Associate Professor in the State Key Laboratory of Internet of Things for Smart City and the Department of Computer and Information Science, University of Macau. His research interests include spatial and spatio-temporal databases, advanced query processing, crowdsourced query processing, information retrieval, data mining and optimization problems.

Reynold Cheng is an Associate Professor of the Department of Computer Science in the University of Hong Kong (HKU). He obtained his PhD from the Department of Computer Science of Purdue University in 2005. Dr. Cheng was granted an Outstanding Young Researcher Award 2011-12 by HKU.

Man Lung Yiu received the bachelor's degree in computer engineering and the PhD degree in computer science from the University of Hong Kong in 2002 and 2006, respectively. Prior to his current post, he worked at Aalborg University for three years starting in the Fall of 2006. He is now an associate professor in the Department of Computing, The Hong Kong Polytechnic University. His research focuses on the management of complex data, in particular query processing topics on spatiotemporal data and multidimensional data.

Shivansh Mittal is currently a final year undergraduate student, studying computer science major and finance minor, at the University of Hong Kong. His research interests include kernel methods for machine learning and recommender systems.
8 APPENDIX

In this appendix, we provide the detailed proofs of Lemma 8 (with the different conditions of δ in Table 7) for the other kernel functions, namely the Intersection, JS and Hellinger kernels.

8.1 Proof of Lemma 8 with Intersection kernel

Proof.
$$F_{P_\ell}(\hat{q}_\ell) = \sum_{p_{i\ell} \in P_\ell} w_i \min(\hat{q}_\ell, p_{i\ell}) = \sum_{p_{i\ell} \in P_\ell} w_i \min(q_\ell + \delta, p_{i\ell}) \le \sum_{p_{i\ell} \in P_\ell} w_i \big( \min(q_\ell, p_{i\ell}) + \delta \big) = F_{P_\ell}(q_\ell) + \delta \sum_{p_{i\ell} \in P_\ell} w_i$$

Therefore, we have the relative error:
$$\frac{F_{P_\ell}(\hat{q}_\ell) - F_{P_\ell}(q_\ell)}{F_{P_\ell}(q_\ell)} \le \frac{\delta \sum_{p_{i\ell} \in P_\ell} w_i}{\sum_{p_{i\ell} \in P_\ell} w_i \min(q_\ell, p_{i\ell})} \le \frac{\delta}{\min(q_\ell^{(min)}, p_\ell^{(min)})}$$

Once we set $\delta \le \varepsilon \times \min(q_\ell^{(min)}, p_\ell^{(min)})$, we can achieve the tolerance $\varepsilon$, i.e., $F_{P_\ell}(\hat{q}_\ell) \le (1 + \varepsilon) F_{P_\ell}(q_\ell)$.

8.2 Proof of Lemma 8 with JS kernel

Proof. To prove this lemma, we separate $F_{P_\ell}(q_\ell)$ into two parts, $F^{(1)}_{P_\ell}(q_\ell)$ and $F^{(2)}_{P_\ell}(q_\ell)$:
$$F_{P_\ell}(q_\ell) = \sum_{p_{i\ell} \in P_\ell} w_i \left( \frac{1}{2} q_\ell \log_2\frac{q_\ell + p_{i\ell}}{q_\ell} + \frac{1}{2} p_{i\ell} \log_2\frac{q_\ell + p_{i\ell}}{p_{i\ell}} \right) = \underbrace{\sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} q_\ell \log_2\frac{q_\ell + p_{i\ell}}{q_\ell}}_{F^{(1)}_{P_\ell}(q_\ell)} + \underbrace{\sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} p_{i\ell} \log_2\frac{q_\ell + p_{i\ell}}{p_{i\ell}}}_{F^{(2)}_{P_\ell}(q_\ell)}$$

Then, we claim that (1) $F^{(1)}_{P_\ell}(\hat{q}_\ell) \le (1 + \varepsilon) F^{(1)}_{P_\ell}(q_\ell)$ and (2) $F^{(2)}_{P_\ell}(\hat{q}_\ell) \le (1 + \varepsilon) F^{(2)}_{P_\ell}(q_\ell)$, once $\delta$ fulfills the condition (cf. Table 7). After we prove the above claims, we can conclude $F_{P_\ell}(\hat{q}_\ell) \le (1 + \varepsilon) F_{P_\ell}(q_\ell)$, since:
$$\frac{F_{P_\ell}(\hat{q}_\ell) - F_{P_\ell}(q_\ell)}{F_{P_\ell}(q_\ell)} = \frac{F^{(1)}_{P_\ell}(\hat{q}_\ell) + F^{(2)}_{P_\ell}(\hat{q}_\ell) - F^{(1)}_{P_\ell}(q_\ell) - F^{(2)}_{P_\ell}(q_\ell)}{F^{(1)}_{P_\ell}(q_\ell) + F^{(2)}_{P_\ell}(q_\ell)} \le \varepsilon$$

Now, we prove the above claims.

Claim (1):
$$F^{(1)}_{P_\ell}(\hat{q}_\ell) = \sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} \hat{q}_\ell \log_2\frac{\hat{q}_\ell + p_{i\ell}}{\hat{q}_\ell} = \sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} (q_\ell + \delta) \log_2\frac{q_\ell + \delta + p_{i\ell}}{q_\ell + \delta} \le \sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} (q_\ell + \delta) \log_2\left(1 + \frac{p_{i\ell}}{q_\ell}\right) = F^{(1)}_{P_\ell}(q_\ell) + \sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} \delta \log_2\frac{q_\ell + p_{i\ell}}{q_\ell}$$

Therefore, we have:
$$\frac{F^{(1)}_{P_\ell}(\hat{q}_\ell) - F^{(1)}_{P_\ell}(q_\ell)}{F^{(1)}_{P_\ell}(q_\ell)} \le \frac{\sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} \delta \log_2\frac{q_\ell + p_{i\ell}}{q_\ell}}{\sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} q_\ell \log_2\frac{q_\ell + p_{i\ell}}{q_\ell}} \le \frac{\delta}{q_\ell^{(min)}}$$

Therefore, once we set $\delta \le \varepsilon \times q_\ell^{(min)}$, we can achieve the tolerance $\varepsilon$ for the first part, where $F^{(1)}_{P_\ell}(\hat{q}_\ell) \le (1 + \varepsilon) F^{(1)}_{P_\ell}(q_\ell)$.

Claim (2):
$$F^{(2)}_{P_\ell}(\hat{q}_\ell) = \sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} p_{i\ell} \log_2\frac{\hat{q}_\ell + p_{i\ell}}{p_{i\ell}} = \sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} p_{i\ell} \log_2\frac{q_\ell + \delta + p_{i\ell}}{p_{i\ell}} \le \sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} p_{i\ell} \left( \log_2\frac{q_\ell + p_{i\ell}}{p_{i\ell}} + \frac{\delta}{p_{i\ell}} \right) = F^{(2)}_{P_\ell}(q_\ell) + \frac{\delta}{2} \sum_{p_{i\ell} \in P_\ell} w_i$$

Therefore, we have:
$$\frac{F^{(2)}_{P_\ell}(\hat{q}_\ell) - F^{(2)}_{P_\ell}(q_\ell)}{F^{(2)}_{P_\ell}(q_\ell)} \le \frac{\frac{\delta}{2} \sum_{p_{i\ell} \in P_\ell} w_i}{\sum_{p_{i\ell} \in P_\ell} w_i \frac{1}{2} p_{i\ell} \log_2\frac{q_\ell + p_{i\ell}}{p_{i\ell}}} \le \frac{\delta}{p_\ell^{(min)} \log_2\left(\frac{q_\ell^{(min)}}{p_\ell^{(max)}} + 1\right)}$$

As such, once we set $\delta \le \varepsilon \times p_\ell^{(min)} \log_2\left(\frac{q_\ell^{(min)}}{p_\ell^{(max)}} + 1\right)$, $F^{(2)}_{P_\ell}(\hat{q}_\ell)$ achieves the tolerance $\varepsilon$ compared with $F^{(2)}_{P_\ell}(q_\ell)$, i.e., $F^{(2)}_{P_\ell}(\hat{q}_\ell) \le (1 + \varepsilon) F^{(2)}_{P_\ell}(q_\ell)$.

To fulfill both claims (1) and (2), we select the smallest $\delta$, i.e., $\delta = \varepsilon \times \min\left(q_\ell^{(min)},\; p_\ell^{(min)} \log_2\left(\frac{q_\ell^{(min)}}{p_\ell^{(max)}} + 1\right)\right)$.

8.3 Proof of Lemma 8 with Hellinger kernel

Proof.
$$F_{P_\ell}(\hat{q}_\ell) = \sum_{p_{i\ell} \in P_\ell} w_i \sqrt{\hat{q}_\ell\, p_{i\ell}} = \sum_{p_{i\ell} \in P_\ell} w_i \sqrt{(q_\ell + \delta)\, p_{i\ell}} \le \sum_{p_{i\ell} \in P_\ell} w_i \left( \sqrt{q_\ell\, p_{i\ell}} + \sqrt{\delta\, p_{i\ell}} \right) = F_{P_\ell}(q_\ell) + \sum_{p_{i\ell} \in P_\ell} w_i \sqrt{\delta\, p_{i\ell}}$$

Therefore, we have the relative error:
$$\frac{F_{P_\ell}(\hat{q}_\ell) - F_{P_\ell}(q_\ell)}{F_{P_\ell}(q_\ell)} \le \frac{\sum_{p_{i\ell} \in P_\ell} w_i \sqrt{\delta\, p_{i\ell}}}{\sum_{p_{i\ell} \in P_\ell} w_i \sqrt{q_\ell\, p_{i\ell}}} \le \sqrt{\frac{\delta}{q_\ell^{(min)}}}$$

Therefore, we can achieve the tolerance $\varepsilon$ once we set $\sqrt{\delta / q_\ell^{(min)}} \le \varepsilon$ and thus, $\delta \le \varepsilon^2 \times q_\ell^{(min)}$.
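As a quick numerical sanity check of the conditions above (an illustrative sketch with synthetic data and our own variable names, not part of the proofs), one can verify that choosing δ as stated keeps the relative error within the tolerance ε for the Intersection and Hellinger kernels:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.uniform(0.05, 1.0, 1000)   # one coordinate (dimension l) of each data point
w = rng.uniform(0.1, 1.0, 1000)    # positive weights w_i
q, eps = 0.4, 0.1                  # a single query coordinate q_l and tolerance

# Intersection kernel: delta <= eps * min(q_l^(min), p_l^(min))
delta = eps * min(q, p.min())
F_exact   = w @ np.minimum(q, p)
F_shifted = w @ np.minimum(q + delta, p)
assert F_shifted <= (1 + eps) * F_exact

# Hellinger kernel: delta <= eps^2 * q_l^(min)
delta = eps**2 * q
F_exact   = w @ np.sqrt(q * p)
F_shifted = w @ np.sqrt((q + delta) * p)
assert F_shifted <= (1 + eps) * F_exact
```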