Hyperkernels

Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract
1 Introduction
Choosing suitable kernel functions for estimation using Gaussian Processes and Support
Vector Machines is an important step in the inference process. To date, there are few if
any systematic techniques to assist in this choice. Even the restricted problem of choosing
the “width” of a parameterized family of kernels (e.g. Gaussian) has not had a simple and
elegant solution.
A recent development [1] which solves the above problem in a restricted sense involves
the use of semidefinite programming to learn an arbitrary positive semidefinite matrix $K$,
subject to the optimization of criteria such as kernel target alignment [1], the posterior
probability [2], learning-theoretical bounds [3], or cross-validation performance [4]. The
restriction mentioned is that these methods work with the kernel matrix, rather than the
kernel itself. Furthermore, whilst demonstrably improving the performance of estimators
to some degree, they require clever parameterization and design to make the method work
in particular situations. There are still no general principles to
guide the choice of a) which family of kernels to choose, b) efficient parameterizations over
this space, and c) suitable penalty terms to combat overfitting. (The last point is particularly
an issue when we have a very large set of semidefinite matrices at our disposal).
Whilst not yet providing a complete solution to these problems, this paper presents a frame-
work that allows the optimization within a parameterized family relatively simply, and cru-
cially, intrinsically captures the tradeoff between the size of the family of kernels and the
sample size available. Furthermore, the solution presented is for optimizing kernels them-
selves, rather than the kernel matrix as in [1]. Other approaches to learning the kernel
include boosting [5] and bounding the Rademacher complexity [6].
Outline of the Paper We show (Section 2) that for most kernel-based learning methods
there exists a functional, the quality functional, which plays a similar role to the empiri-
cal risk functional, and that subsequently (Section 3) the introduction of a kernel on ker-
nels, a so-called hyperkernel, in conjunction with regularization on the Reproducing Ker-
nel Hilbert Space formed on kernels leads to a systematic way of parameterizing function
classes whilst managing overfitting. We give several examples of hyperkernels (Section 4)
and show (Section 5) how they can be used practically. Due to space constraints we only
consider Support Vector classification.
2 Quality Functionals
Let $X_{\mathrm{train}} = \{x_1, \ldots, x_m\}$ denote the set of training data and $Y_{\mathrm{train}} = \{y_1, \ldots, y_m\}$ the
set of corresponding labels, jointly drawn iid from some probability distribution $\Pr(x, y)$
on $\mathcal{X} \times \mathcal{Y}$. Furthermore, let $X_{\mathrm{test}}$ and $Y_{\mathrm{test}}$ denote the corresponding test sets (drawn from
the same $\Pr$). Let $X := X_{\mathrm{train}} \cup X_{\mathrm{test}}$ and $Y := Y_{\mathrm{train}} \cup Y_{\mathrm{test}}$.
We introduce a new class of functionals $Q$ on data which we call quality functionals. Their
purpose is to indicate, given a kernel $k$ and the training data $(X, Y)$, how suitable
the kernel is for explaining the training data.
Definition 1 (Empirical Quality Functional) Given a kernel $k$ and data $X$, $Y$, define
$Q_{\mathrm{emp}}(k, X, Y)$ to be an empirical quality functional if it depends on $k$ only via
$k(x_i, x_j)$, where $x_i, x_j \in X$; i.e. if there exists a function $f$ such that
$Q_{\mathrm{emp}}(k, X, Y) = f(K, X, Y)$, where $K = [k(x_i, x_j)]_{i,j}$ denotes the kernel matrix on $X$.
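As an interface sketch (our own illustration, not part of the original formulation; the function names are ours), Definition 1 simply says that an empirical quality functional may touch the kernel only through its Gram matrix on the data:

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K with K[i, j] = kernel(x_i, x_j)."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def empirical_quality(kernel, X, Y, f):
    """Q_emp(k, X, Y) = f(K, X, Y): the functional sees the kernel only
    through its values on the training data."""
    return f(gram_matrix(kernel, X), X, Y)
```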
Definition 2 (Expected Quality Functional) Suppose $Q_{\mathrm{emp}}$ is an empirical quality func-
tional. Then
$$Q(k) := \mathbf{E}_{X,Y}\!\left[Q_{\mathrm{emp}}(k, X, Y)\right] \qquad (1)$$
is the expected quality functional, where the expectation is taken over $X, Y$ drawn from $\Pr(x, y)$.
Note the similarity between $Q_{\mathrm{emp}}(k, X, Y)$ and the empirical risk of an estimator,
$R_{\mathrm{emp}}(f, X, Y) = \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, f(x_i))$ (where $l$ is a suitable loss function): in
both cases we compute the value of a functional which depends on some sample $X, Y$ drawn
from $\Pr(x, y)$ and on a function, and in both cases we have
$$Q(k) = \mathbf{E}_{X,Y}\!\left[Q_{\mathrm{emp}}(k, X, Y)\right] \quad \text{and} \quad R(f) = \mathbf{E}_{X,Y}\!\left[R_{\mathrm{emp}}(f, X, Y)\right]. \qquad (2)$$
Here $R(f)$ is known as the expected risk. We now present some examples of quality func-
tionals, and derive their exact minimizers whenever possible.
Example 1 (Kernel Target Alignment) This quality functional was introduced in [7] to
assess the “alignment” of a kernel with training labels. It is defined by
$$Q^{\mathrm{alignment}}_{\mathrm{emp}}(k, X, Y) := 1 - \frac{y^{\top} K y}{\|y\|_2^2\, \|K\|_2}, \qquad (3)$$
where $y$ denotes the vector of labels, $\|K\|_2 = \sqrt{\sum_{i,j} K_{ij}^2}$ denotes the $\ell_2$ (Frobenius) norm of
the kernel matrix $K$, and $\|y\|_2$ the $\ell_2$ norm of $y$. By the Cauchy–Schwarz inequality this
functional is nonnegative, and it attains its minimum of $0$ for $K = y y^{\top}$, in which case the
kernel simply memorizes the labels and is useless for generalization.
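A minimal sketch (our own code, on illustrative toy data) of the alignment quality functional (3), confirming that the degenerate choice $K = y y^{\top}$ attains the minimum value of zero:

```python
import numpy as np

def alignment_quality(K, y):
    """Q_emp^alignment = 1 - y^T K y / (||y||_2^2 ||K||_F), cf. (3)."""
    return 1.0 - (y @ K @ y) / (np.sum(y ** 2) * np.linalg.norm(K, "fro"))

y = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
X = np.arange(5.0)                                   # toy one-dimensional inputs
K_rbf = np.exp(-np.subtract.outer(X, X) ** 2)        # Gaussian RBF Gram matrix
print(alignment_quality(K_rbf, y))                   # some value strictly between 0 and 1
print(alignment_quality(np.outer(y, y), y))          # 0.0: K = y y^T memorizes the labels
```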
Example 2 (Regularized Risk Functional) The regularized risk of an estimator $f \in \mathcal{H}_k$ is
$$R_{\mathrm{reg}}(f, X, Y) := \frac{1}{m}\sum_{i=1}^{m} l\big(x_i, y_i, f(x_i)\big) + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_k}, \qquad (5)$$
and a natural quality functional is its minimum over $\mathcal{H}_k$,
$$Q^{\mathrm{regrisk}}_{\mathrm{emp}}(k, X, Y) := \min_{f \in \mathcal{H}_k} R_{\mathrm{reg}}(f, X, Y). \qquad (6)$$
The minimizer of (6) is more difficult to find, since we have to carry out a double mini-
mization over $K$ and the expansion coefficients $\alpha$ of $f = \sum_i \alpha_i k(x_i, \cdot)$. First, note that for
$K = \beta y y^{\top}$ with $\beta > 0$, choosing each $\alpha_i = y_i / (\beta \|y\|_2^2)$ yields $f(x_j) = y_j$ and
$\|f\|^2_{\mathcal{H}_k} = \alpha^{\top} K \alpha = 1/\beta$, which is the minimum with respect to $\alpha$. Thus
$Q^{\mathrm{regrisk}}_{\mathrm{emp}}(k, X, Y) \leq \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, y_i) + \frac{\lambda}{2\beta}$, and for a loss vanishing at $f(x_i) = y_i$
(such as the soft margin loss) this can be driven arbitrarily close to $0$ by taking $\beta$ sufficiently large. The
proof that this is the global minimizer of this quality functional is omitted for brevity.
Example 3 (Negative Log-Posterior) This quality functional is closely related to the criterion
used in Gaussian Process estimation [2]. Treating the labels as a draw from a zero-mean
Gaussian with covariance $K$, it is given (up to terms independent of $k$) by
$$Q^{\mathrm{logpost}}_{\mathrm{emp}}(k, X, Y) := \tfrac{1}{2}\left(y^{\top} K^{-1} y + \log\det K + m \log 2\pi\right). \qquad (7)$$
Note that any $K$ which does not have full rank will send (7) to $\pm\infty$, and thus such cases
have to be excluded. Setting the derivative of (7) with respect to $K$ to zero leads to $K = y y^{\top}$;
under the assumption that such (near-)degenerate matrices are admissible, the minimum of
$Q^{\mathrm{logpost}}_{\mathrm{emp}}$ is again attained by a kernel that merely memorizes the labels.
Other examples, such as cross-validation, leave-one-out estimators, the Luckiness frame-
work, the Radius-Margin bound also have empirical quality functionals which can be arbi-
trarily minimized.
The above examples illustrate how many existing methods for assessing the quality of a
kernel fit within the quality functional framework. We also saw that, given a rich enough
class of kernels $\mathcal{K}$, optimization of $Q_{\mathrm{emp}}$ over $\mathcal{K}$ would result in a kernel that is
useless for prediction purposes. This is yet another example of the danger of optimizing
too much — there is (still) no free lunch.
3 A Hyper Reproducing Kernel Hilbert Space
We now introduce a method for optimizing quality functionals in an effective way. The
method we propose involves the introduction of a Reproducing Kernel Hilbert Space on
the kernel $k$ itself — a “Hyper”-RKHS. We begin with the basic properties of an RKHS
(see Def. 2.9 and Thm. 4.2 in [8] and citations for more details).

Definition 3 (Reproducing Kernel Hilbert Space) Let $\mathcal{X}$ be a nonempty set and denote by
$\mathcal{H}$ a Hilbert space of functions $f: \mathcal{X} \to \mathbb{R}$, endowed with a dot product $\langle \cdot, \cdot \rangle$ (and the
norm $\|f\| := \sqrt{\langle f, f \rangle}$). Then $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space if there exists a
function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that (a) $k$ has the reproducing property
$\langle f, k(x, \cdot) \rangle = f(x)$ for all $f \in \mathcal{H}$, and (b) $k$ spans $\mathcal{H}$, i.e.
$\mathcal{H} = \overline{\operatorname{span}\{k(x, \cdot) \mid x \in \mathcal{X}\}}$.

The advantage of optimization in an RKHS is that under certain conditions the optimal
solutions can be found as the linear combination of a finite number of basis functions,
regardless of the dimensionality of the space $\mathcal{H}$, as can be seen in the theorem below.

Theorem 4 (Representer Theorem) Denote by $\Omega: [0, \infty) \to \mathbb{R}$ a strictly monotonic increasing
function, by $\mathcal{X}$ a set, and by $l: (\mathcal{X} \times \mathbb{R}^2)^m \to \mathbb{R} \cup \{\infty\}$ an arbitrary loss function. Then each
minimizer $f \in \mathcal{H}$ of the regularized risk
$$ l\big((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\big) + \Omega(\|f\|_{\mathcal{H}}) $$
admits a representation of the form $f(x) = \sum_{i=1}^{m} \alpha_i\, k(x_i, x)$.
The above definition allows us to define an RKHS on kernels $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, simply by
introducing the compounded index set $\underline{\mathcal{X}} := \mathcal{X} \times \mathcal{X}$ and by treating kernels $k$ as functions
$k: \underline{\mathcal{X}} \to \mathbb{R}$:

Definition 5 (Hyper Reproducing Kernel Hilbert Space) Let $\mathcal{X}$ be a nonempty set and
let $\underline{\mathcal{X}} := \mathcal{X} \times \mathcal{X}$ (the compounded index set). Then the Hilbert space $\underline{\mathcal{H}}$ of functions
$k: \underline{\mathcal{X}} \to \mathbb{R}$, endowed with a dot product $\langle \cdot, \cdot \rangle$ (and the norm $\|k\| := \sqrt{\langle k, k \rangle}$), is called
a Hyper Reproducing Kernel Hilbert Space if there exists a hyperkernel $\underline{k}: \underline{\mathcal{X}} \times \underline{\mathcal{X}} \to \mathbb{R}$
with the following properties:
1. $\underline{k}$ has the reproducing property $\langle k, \underline{k}(\underline{x}, \cdot) \rangle = k(\underline{x})$ for all $k \in \underline{\mathcal{H}}$; in particular,
$\langle \underline{k}(\underline{x}, \cdot), \underline{k}(\underline{x}', \cdot) \rangle = \underline{k}(\underline{x}, \underline{x}')$.
2. $\underline{k}$ spans $\underline{\mathcal{H}}$, i.e. $\underline{\mathcal{H}} = \overline{\operatorname{span}\{\underline{k}(\underline{x}, \cdot) \mid \underline{x} \in \underline{\mathcal{X}}\}}$.
3. For any fixed $\underline{x} \in \underline{\mathcal{X}}$, the function $\underline{k}(\underline{x}, \cdot)$ is a kernel in its second argument, i.e. the
function $k(x, x') := \underline{k}(\underline{x}, (x, x'))$ with $x, x' \in \mathcal{X}$ is a kernel.
What distinguishes $\underline{\mathcal{H}}$ from a normal RKHS is the particular form of its index set ($\underline{\mathcal{X}} = \mathcal{X}^2$)
and the additional condition on $\underline{k}$ to be a kernel in its second argument for any fixed first
argument. This condition somewhat limits the choice of possible kernels. On the other
hand, it allows for simple optimization algorithms which consider kernels $k \in \underline{\mathcal{H}}$ that lie
in the convex cone spanned by $\underline{k}$. Analogously to the definition of the regularized risk functional
(5), we define the regularized quality functional:
$$Q_{\mathrm{reg}}(k, X, Y) := Q_{\mathrm{emp}}(k, X, Y) + \frac{\lambda_Q}{2}\|k\|^2_{\underline{\mathcal{H}}}, \qquad (10)$$
where $\lambda_Q > 0$ is a regularization constant and $\|k\|^2_{\underline{\mathcal{H}}}$ denotes the RKHS norm in $\underline{\mathcal{H}}$. Mini-
mization of $Q_{\mathrm{reg}}$ is less prone to overfitting than minimizing $Q_{\mathrm{emp}}$, since the regularization
term $\frac{\lambda_Q}{2}\|k\|^2_{\underline{\mathcal{H}}}$ effectively controls the complexity of the class of kernels under consideration.
Regularizers other than $\frac{\lambda_Q}{2}\|k\|^2_{\underline{\mathcal{H}}}$ are also possible. The question arising immediately from
(10) is how to minimize the regularized quality functional efficiently. In the following we
show that the minimum can be found as a linear combination of hyperkernels.
Corollary 6 (Representer Theorem for Hyper-RKHS) Let $\underline{\mathcal{H}}$ be a hyper-RKHS and de-
note by $\Omega: [0, \infty) \to \mathbb{R}$ a strictly monotonic increasing function, by $X$ a set, and by $Q$
an arbitrary empirical quality functional. Then each minimizer $k \in \underline{\mathcal{H}}$ of the regularized quality
functional
$$Q(k, X, Y) + \Omega(\|k\|_{\underline{\mathcal{H}}}) \qquad (11)$$
admits a representation of the form $k(x, x') = \sum_{i,j=1}^{m} \beta_{ij}\, \underline{k}\big((x_i, x_j), (x, x')\big)$.
Proof All we need to do is rewrite (11) so that it satisfies the conditions of Theorem 4. Let
$\underline{x}_{ij} := (x_i, x_j)$. Then $Q(k, X, Y)$ has the properties of a loss function, as it only depends
on $k$ via its values at the pairs $\underline{x}_{ij}$. Furthermore, $\Omega(\|k\|_{\underline{\mathcal{H}}})$ is an RKHS regularizer, so the representer
theorem applies and the expansion of $k$ follows.
This result shows that even though we are optimizing over an entire (potentially infinite
dimensional) Hilbert space of kernels, we are able to find the optimal solution by choosing
among a finite dimensional subspace. The dimension required ($m^2$) is, not surprisingly, sig-
nificantly larger than the number of kernels required in a kernel function expansion, which
makes a direct approach possible only for small problems. However, sparse expansion
techniques, such as [9, 8], can be used to make the problem tractable in practice.
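To make the finite representation concrete, the following sketch (ours; `hyperkernel` stands for any function satisfying Definition 5, here left abstract) evaluates a kernel of the form given in Corollary 6 and its squared hyper-RKHS norm $\beta^{\top}\underline{K}\beta$, where $\underline{K}$ is the $m^2 \times m^2$ Gram matrix of the hyperkernel on all training pairs:

```python
import numpy as np
from itertools import product

def expanded_kernel(hyperkernel, X, beta):
    """k(x, x') = sum_{i,j} beta[i, j] * kbar((x_i, x_j), (x, x'))."""
    m = len(X)
    def k(x, xp):
        return sum(beta[i, j] * hyperkernel((X[i], X[j]), (x, xp))
                   for i, j in product(range(m), repeat=2))
    return k

def hyper_norm_sq(hyperkernel, X, beta):
    """Squared hyper-RKHS norm of the expansion: beta^T Kbar beta."""
    m = len(X)
    pairs = [(X[i], X[j]) for i, j in product(range(m), repeat=2)]
    Kbar = np.array([[hyperkernel(p, q) for q in pairs] for p in pairs])
    b = np.asarray(beta).reshape(-1)                 # row-major order matches `pairs`
    return float(b @ Kbar @ b)
```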
4 Examples of Hyperkernels
Having introduced the theoretical basis of the Hyper-RKHS, we need to answer the ques-
tion of whether practically useful hyperkernels $\underline{k}$ exist which satisfy the conditions of Definition 5. We
address this question by giving a set of general recipes for building such kernels.
Example 4 (Power Series Construction) Denote by $k$ a positive semidefinite kernel, and
by $G(\xi) = \sum_{i=0}^{\infty} c_i \xi^i$ a function with positive Taylor expansion coefficients $c_i \geq 0$ and
convergence radius $R$. Then for $k(\underline{x})\, k(\underline{x}') < R$ we have that
$$\underline{k}(\underline{x}, \underline{x}') := G\big(k(\underline{x})\, k(\underline{x}')\big) = \sum_{i=0}^{\infty} c_i \big(k(\underline{x})\, k(\underline{x}')\big)^i \qquad (12)$$
is a hyperkernel.
Example 5 (Harmonic Hyperkernel) A special case of (12) is the harmonic hyperkernel:
denote by $k$ a kernel with range $[0, 1]$ (RBF kernels, e.g. $k(x, x') = \exp(-\|x - x'\|^2/\sigma^2)$,
satisfy this property), and set $c_i := (1 - \lambda_h)\lambda_h^i$ for some $0 < \lambda_h < 1$. Then we have
$$\underline{k}(\underline{x}, \underline{x}') = (1 - \lambda_h) \sum_{i=0}^{\infty} \big(\lambda_h\, k(\underline{x})\, k(\underline{x}')\big)^i = \frac{1 - \lambda_h}{1 - \lambda_h\, k(\underline{x})\, k(\underline{x}')}. \qquad (13)$$
For the Gaussian RBF kernel this becomes
$$\underline{k}\big((x, x'), (x'', x''')\big) = \frac{1 - \lambda_h}{1 - \lambda_h \exp\!\big(-(\|x - x'\|^2 + \|x'' - x'''\|^2)/\sigma^2\big)}. \qquad (14)$$
For $\lambda_h \to 1$, $\underline{k}$ converges to the $\delta$ kernel; that is, the expression $\|k\|^2_{\underline{\mathcal{H}}}$ converges to the Frobenius
norm of $k$ on $X \times X$.
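As a concrete instance of (12)–(14), a minimal implementation of the harmonic hyperkernel with a Gaussian RBF base kernel could look as follows (a sketch; the parameter names and default values are ours, not the settings used in the experiments):

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    """Base kernel k(x, x') = exp(-||x - x'||^2 / sigma^2), with range (0, 1]."""
    d = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return float(np.exp(-np.sum(d * d) / sigma ** 2))

def harmonic_hyperkernel(xbar, xbar2, lam_h=0.6, sigma=1.0):
    """kbar((x, x'), (x'', x''')) = (1 - lam_h) / (1 - lam_h k(x, x') k(x'', x''')),
    cf. (13)/(14); requires 0 < lam_h < 1 so the geometric series converges."""
    t = rbf(*xbar, sigma=sigma) * rbf(*xbar2, sigma=sigma)
    return (1.0 - lam_h) / (1.0 - lam_h * t)

# For any fixed first argument, the second argument defines a valid kernel:
xbar = (np.array([0.0]), np.array([0.5]))
k_fixed = lambda x, xp: harmonic_hyperkernel(xbar, (x, xp))
```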
We can find further hyperkernels simply by consulting tables of power series of functions;
Table 1 contains a list of suitable expansions $G(\xi)$ together with their radii of convergence.
Recall that expansions such as (12) were mainly chosen for computational convenience,
in particular whenever it is not clear which particular class of kernels would be useful for
the expansion.

Table 1: Examples of Hyperkernels.
If we have prior knowledge that only a finite number of kernels $k_1, \ldots, k_n$ could be potentially
relevant (e.g., a range of scales of kernel width, polynomial degrees, etc.), we may begin with
this set of candidate kernels and define
$$\underline{k}(\underline{x}, \underline{x}') := \sum_{i=1}^{n} c_i\, k_i(\underline{x})\, k_i(\underline{x}'), \quad \text{with } c_i \geq 0. \qquad (15)$$
Clearly $\underline{k}$ is a hyperkernel, since $\underline{k}(\underline{x}, \cdot) = \sum_i \big(c_i k_i(\underline{x})\big)\, k_i(\cdot)$ is a nonnegative
combination of kernels for pointwise nonnegative $k_i$, and $\underline{k}$ itself is a dot product between
the feature maps $\underline{x} \mapsto \big(\sqrt{c_i}\, k_i(\underline{x})\big)_{i=1}^{n}$.
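A sketch of the construction in (15) (our code; the candidate widths and weights are placeholders), using a few RBF kernels of different widths as the candidate set:

```python
import numpy as np

def finite_hyperkernel(candidate_kernels, c):
    """kbar(xbar, xbar') = sum_i c_i k_i(xbar) k_i(xbar'), cf. (15), with c_i >= 0."""
    def kbar(xbar, xbar2):
        return sum(ci * ki(*xbar) * ki(*xbar2)
                   for ci, ki in zip(c, candidate_kernels))
    return kbar

# Candidate kernels: Gaussian RBF kernels over a range of widths.
widths = [0.1, 1.0, 10.0]
candidates = [
    (lambda x, xp, s=s:
        float(np.exp(-np.sum((np.asarray(x, float) - np.asarray(xp, float)) ** 2) / s ** 2)))
    for s in widths
]
kbar = finite_hyperkernel(candidates, c=[1.0, 1.0, 1.0])
```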
5 Support Vector Classification with Hyperkernels

We now use the regularized risk (5), with the soft margin loss, as the empirical quality
functional. The corresponding regularized quality functional is
$$Q_{\mathrm{reg}}(k, X, Y) := \min_{f \in \mathcal{H}_k}\left[\frac{1}{m}\sum_{i=1}^{m} l\big(x_i, y_i, f(x_i)\big) + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_k}\right] + \frac{\lambda_Q}{2}\|k\|^2_{\underline{\mathcal{H}}}, \qquad (16)$$
and the goal is to find the kernel minimizing it,
$$\min_{k \in \underline{\mathcal{H}}}\; Q_{\mathrm{reg}}(k, X, Y). \qquad (17)$$
For $f = \sum_i \alpha_i k(x_i, \cdot)$, the second term $\|f\|^2_{\mathcal{H}_k} = \alpha^{\top} K \alpha$ is a linear function of $k$ for fixed $\alpha$.
Given a convex loss function $l$, the regularized quality functional (16) is convex in $k$.

For fixed $k$, the inner problem can be formulated as a constrained minimization problem in $f$, and
subsequently expressed in terms of the Lagrange multipliers $\alpha$. However, this minimum
depends on $k$, and for efficient minimization we would like to compute the derivatives with
respect to $k$. The following lemma tells us how (it is an extension of a result in [3]; we
omit the proof for brevity):
Lemma 7 Let $f: \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ and $g: \mathbb{R}^m \to \mathbb{R}^p$ denote convex functions, where $f$ is
parameterized by $\theta \in \mathbb{R}^n$. Let $D(\theta)$ be the minimum of the following optimization problem (and
$\bar{x}(\theta)$ its minimizer):
$$\underset{x \in \mathbb{R}^m}{\text{minimize}}\; f(x, \theta) \quad \text{subject to} \quad g(x) \leq 0.$$
Then $\partial_{\theta} D(\theta) = \partial_2 f(\bar{x}, \theta)$, where $\bar{x} \in \mathbb{R}^m$ and $\partial_2$ denotes the derivative with respect to
the second argument of $f$.
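A toy numeric check of the lemma (ours, unconstrained for simplicity): for $f(x, \theta) = (x - \theta)^2 + x^2$ the minimizer is $\bar{x}(\theta) = \theta/2$, the minimum is $D(\theta) = \theta^2/2$, and indeed $D'(\theta) = \theta = \partial_2 f(\bar{x}(\theta), \theta)$:

```python
f = lambda x, th: (x - th) ** 2 + x ** 2     # convex in x, parameterized by th
x_bar = lambda th: th / 2.0                  # analytic minimizer
D = lambda th: f(x_bar(th), th)              # optimal value, equals th**2 / 2

th, eps = 1.3, 1e-6
num_grad = (D(th + eps) - D(th - eps)) / (2 * eps)   # numeric D'(th)
d2f = -2.0 * (x_bar(th) - th)                        # partial of f w.r.t. its 2nd argument at x_bar
print(num_grad, d2f)                                 # both approximately 1.3
```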
Since the minimizer of (17) can be written as a kernel expansion (by the representer theo-
rem for Hyper-RKHS), the optimal regularized quality functional can be written, using the
soft margin loss $l(x, y, f(x)) = \max(0, 1 - y f(x))$ and
$K_{ij} = k(x_i, x_j) = \sum_{p,q=1}^{m} \beta_{pq}\, \underline{k}(\underline{x}_{pq}, \underline{x}_{ij})$, as
$$Q_{\mathrm{reg}}[\alpha, \beta] = \frac{1}{m}\sum_{i=1}^{m} \max\Big(0,\, 1 - y_i \sum_{j=1}^{m} \alpha_j K_{ij}\Big) + \frac{\lambda}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j K_{ij} + \frac{\lambda_Q}{2}\sum_{i,j,p,q=1}^{m} \beta_{ij}\,\beta_{pq}\, \underline{k}(\underline{x}_{ij}, \underline{x}_{pq}). \qquad (19)$$
Minimization of (19) is achieved by alternating between minimization over $\alpha$ for fixed $\beta$
(this is a quadratic optimization problem), and subsequently minimization over $\beta$ (with
constraints on $\beta$ to ensure positivity of the kernel matrix) for fixed $\alpha$.
Low Rank Approximation While being finite in the number of parameters (despite the
optimization over two possibly infinite dimensional Hilbert spaces $\mathcal{H}$ and $\underline{\mathcal{H}}$), (19) still
presents a formidable optimization problem in practice (we have $m^2$ coefficients for $\beta$).
For an explicit expansion of type (15) we can optimize directly in the expansion coefficients
of the candidate kernels $k_i$, which means that we simply have a quality functional with an $\ell_2$
penalty on the expansion coefficients. Such an approach is recommended if there are few
terms in (15). In the general case (or if the number of terms is large), we resort to a low-rank approximation, as
described in [9, 8]. This means that we pick from the $m^2$ basis functions $\underline{k}(\underline{x}_{ij}, \cdot)$
a small fraction of terms which approximate $k$ on $X \times X$
sufficiently well.
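One possible low-rank scheme in the spirit of [9, 8] (a sketch under our own assumptions, not necessarily the exact procedure used here) is incomplete Cholesky with pivoting on the $m^2 \times m^2$ hyperkernel Gram matrix, which greedily selects a small set of training pairs whose hyperkernel functions approximately span the remaining ones:

```python
import numpy as np

def select_hyperkernel_basis(Kbar, n_max, tol=1e-6):
    """Pivoted incomplete Cholesky of the hyperkernel Gram matrix Kbar.
    Returns the indices of the selected pairs and the low-rank factor L,
    so that Kbar is approximately L @ L.T."""
    n = Kbar.shape[0]
    diag = Kbar.diagonal().astype(float).copy()   # residual diagonal
    L = np.zeros((n, n_max))
    pivots = []
    for t in range(n_max):
        p = int(np.argmax(diag))
        if diag[p] <= tol:                        # remaining error is negligible
            break
        pivots.append(p)
        L[:, t] = (Kbar[:, p] - L @ L[p, :]) / np.sqrt(diag[p])
        diag -= L[:, t] ** 2
    return pivots, L[:, :len(pivots)]
```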
6 Experimental Results and Summary
Experimental Setup To test our claims of kernel adaptation via regularized quality func-
tionals, we performed preliminary tests on datasets from the UCI repository (Pima, Iono-
sphere, Wisconsin diagnostic breast cancer) and on the USPS database of handwritten digits
(’6’ vs. ’9’). The datasets were split into training data and test data, except for
the USPS data, where the provided split was used; the experiments were repeated over
200 random 60%/40% training/test splits. We deliberately did not attempt to tune parameters and instead
made the following choices uniformly for all four sets:
- The kernel width $\sigma$ was set to a fixed multiple of $d$, the dimensionality of the
data. We deliberately chose a value that is too large in comparison with the usual rules
of thumb [8], so as to avoid starting from an already good default kernel.
- The regularization constant $\lambda$ was adjusted to correspond to a standard choice of $C$ in the
Vapnik-style parameterization of SVMs, which has commonly been reported to yield good results.
- $\lambda_h$ for the Gaussian harmonic hyperkernel was kept fixed throughout, giv-
ing adequate coverage over various kernel widths in (13) (a small $\lambda_h$ focuses almost
exclusively on wide kernels, whereas $\lambda_h$ close to $1$ treats all widths equally).
- The hyperkernel regularization constant $\lambda_Q$ was kept at the same fixed value for all datasets.
We compared the results with the performance of a generic Support Vector Machine with
the same values chosen for $\sigma$ and $\lambda$, and with one for which the parameters had been hand-tuned using cross-
validation.
Results Despite the fact that we did not try to tune the parameters, we were able to achieve
highly competitive results, as shown in Table 2. It is also worth noting that the low-rank
decomposition of the hyperkernel matrix typically contained fewer than 10 hyperkernels,
thus rendering the optimization problem not much more costly than a standard Support
Vector Machine (even with a very high quality approximation of $\underline{k}$), and that after the
optimization of (19) only a small number of the $m^2$ possible expansion terms were being
used. Results based on $Q_{\mathrm{reg}}$ are comparable to hand-tuned SVMs
(rightmost column), except for the ionosphere data. We suspect that this is due to the small
training sample.
Summary and Outlook The regularized quality functional allows the systematic solu-
tion of problems associated with the choice of a kernel. Quality criteria that can be used
include target alignment, regularized risk and the log posterior. The regularization implicit
in our approach allows the control of overfitting that occurs if one optimizes over too
large a class of kernels.
A very promising aspect of the current work is that it opens the way to theoretical analyses
of the price one pays by optimizing over a larger set of kernels. Current and future
research is devoted to working through this analysis and subsequently developing methods
for the design of good hyperkernels.
Acknowledgements This work was supported by a grant of the Australian Research
Council. The authors thank Grace Wahba for helpful comments and suggestions.
References
[1] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the
kernel matrix with semidefinite programming. In ICML. Morgan Kaufmann, 2002.
[2] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to
linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in
Graphical Models. Kluwer Academic, 1998.
[3] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing kernel parameters
for support vector machines. Machine Learning, 2002. Forthcoming.
[4] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional
Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
[5] K. Crammer, J. Keshet, and Y. Singer. Kernel design using boosting. In Advances in
Neural Information Processing Systems 15, 2002. In press.
[6] O. Bousquet and D. Herrmann. On the complexity of learning the kernel matrix. In
Advances in Neural Information Processing Systems 15, 2002. In press.
[7] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment.
Technical Report NC2-TR-2001-087, NeuroCOLT, http://www.neurocolt.com, 2001.
[8] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[9] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representa-
tion. Technical report, IBM Watson Research Center, New York, 2000.
[10] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In ICML,
pages 148–156. Morgan Kaufmann Publishers, 1996.
[11] G. Rätsch, T. Onoda, and K. R. Müller. Soft margins for AdaBoost. Machine Learning,
42(3):287–320, 2001.