
Classification with sparse grids using simplicial basis functions

Jochen Garcke and Michael Griebel
Institut für Angewandte Mathematik
Rheinische Friedrich-Wilhelms-Universität Bonn
Wegelerstraße 6
53115 Bonn, Germany
{garckej, griebel}@iam.uni-bonn.de
Abstract
Recently we presented a new approach [20] to the classification problem arising in data mining. It is based on the regularization network approach but, in contrast to other methods, which employ ansatz functions associated to data points, we use a grid in the usually high-dimensional feature space for the minimization process. To cope with the curse of dimensionality, we employ sparse grids [52]. Thus, only $O(h_n^{-1} n^{d-1})$ instead of $O(h_n^{-d})$ grid points and unknowns are involved. Here $d$ denotes the dimension of the feature space and $h_n = 2^{-n}$ gives the mesh size. We use the sparse grid combination technique [30], where the classification problem is discretized and solved on a sequence of conventional grids with uniform mesh sizes in each dimension. The sparse grid solution is then obtained by linear combination.
The method computes a nonlinear classifier but scales only linearly with the number of data points and is therefore well suited for data mining applications where the amount of data is very large, but where the dimension of the feature space is moderately high. In contrast to our former work, where d-linear functions were used, we now apply linear basis functions based on a simplicial discretization. This allows us to handle more dimensions, and the algorithm needs fewer operations per data point. We further extend the method to so-called anisotropic sparse grids, where different a priori chosen mesh sizes can be used for the discretization of each attribute. This can improve the run time of the method and the approximation results in the case of data sets with attributes of different importance.
We describe the sparse grid combination technique for the classification problem, give implementational details and discuss the complexity of the algorithm. It turns out that the method scales linearly with the number of given data points. Finally we report on the quality of the classifier built by our new method on data sets with up to 14 dimensions. We show that our new method achieves correctness rates which are competitive to those of the best existing methods.

Key words. data mining, classification, approximation, sparse grids, combination technique, simplicial discretization
1 Introduction
Data mining is the process of finding patterns, relations and trends in large data sets. Examples range from scientific applications like the post-processing of data in medicine or the evaluation of satellite pictures to financial and commercial applications, e.g. the assessment of credit risks or the selection of customers for advertising campaign letters. For an overview on data mining and its various tasks and approaches see [5, 12].
In this paper we consider the classification problem arising in data mining. Given is a set of data points in a d-dimensional feature space together with a class label. From this data, a classifier must be constructed which allows us to predict the class of any newly given data point for future decision making. Widely used approaches are, among others, decision tree induction, rule learning, adaptive multivariate regression splines, neural networks, and support vector machines. Interestingly, some of these techniques can be interpreted in the framework of regularization networks [23]. This approach allows a direct description of the most important neural networks and it also allows for an equivalent description of support vector machines and n-term approximation schemes [22]. Here, the classification of data is interpreted as a scattered data approximation problem with certain additional regularization terms in high-dimensional spaces.
In [20] we presented a new approach to the classification problem. It is also based on the regularization network approach but, in contrast to the other methods, which employ mostly global ansatz functions associated to data points, we use an independent grid with associated local ansatz functions in the minimization process. This is similar to the numerical treatment of partial differential equations by finite element methods. Here, a uniform grid would result in $O(h_n^{-d})$ grid points, where $d$ denotes the dimension of the feature space and $h_n = 2^{-n}$ gives the mesh size. Therefore the complexity of the problem would grow exponentially with $d$ and we encounter the curse of dimensionality. This is probably the reason why conventional grid-based techniques have not been used in data mining up to now.
However, there is the so-called sparse grid approach which allows one to cope with the complexity of the problem to some extent. This method has been originally developed for the solution of partial differential equations [2, 8, 30, 52] and is now used successfully also for integral equations [15, 29], interpolation and approximation [3, 28, 42, 45], eigenvalue problems [18], and integration problems [21]. In the information based complexity community it is also known as hyperbolic cross points and the idea can even be traced back to [44]. For a d-dimensional problem, the sparse grid approach employs only $O(h_n^{-1} (\log(h_n^{-1}))^{d-1})$ grid points in the discretization. The accuracy of the approximation, however, is nearly as good as for the conventional full grid methods, provided that certain additional smoothness requirements are fulfilled. Thus a sparse grid discretization method can be employed also for higher-dimensional problems. The curse of dimensionality of conventional full grid methods affects sparse grids much less.
In this paper, we apply the sparse grid combination technique [30] to the classification problem. Here, the regularization network problem is discretized and solved on a certain sequence of conventional grids with uniform mesh sizes in each coordinate direction. The sparse grid solution is then obtained from the solutions on the different grids by linear combination. Thus the classifier is built on sparse grid points and not on data points.
In contrast to [20], where d-linear functions stemming from a tensor-product approach were used, we now apply linear basis functions based on a simplicial discretization. In comparison, this approach allows the processing of more dimensions and needs fewer operations per data point. A further extension of the approach to so-called anisotropic sparse grids allows us to treat attributes with a lot of variance differently from attributes with only a few different values.
A discussion of the complexity of the method shows that the method scales linearly with the number of instances, i.e. the amount of data to be classified. Therefore, our method is well suited for realistic data mining applications where the dimension of the feature space is moderately high (e.g. after some preprocessing steps) but the amount of data is very large. Furthermore the quality of the classifier built by our new method seems to be very good. Here we consider standard test problems and problems with huge synthetic data sets in up to 14 dimensions. It turns out that our new method achieves correctness rates which are competitive to those of the best existing methods. Note that the combination method is simple to use and can be parallelized in a natural and straightforward way, see [19].
The remainder of this paper is organized as follows: In Section 2 we describe the classification problem in the framework of regularization networks as the minimization of a (quadratic) functional. We then discretize the feature space and derive the associated linear problem. Here we focus on grid-based discretization techniques. Then, we describe the sparse grid combination technique for the classification problem, discuss its properties and introduce the anisotropic sparse grid combination technique. Furthermore, we present our new approach based on a discretization by simplices and discuss complexity aspects. Section 3 presents the results of numerical experiments conducted with the sparse grid combination method, demonstrates the quality of the classifier built by our new method and compares the results with the ones from [20] and with the ones obtained with different forms of SVMs [11, 49]. Some final remarks conclude the paper.
2 The problem
Classification of data can be interpreted as a traditional scattered data approximation problem with certain additional regularization terms. In contrast to conventional scattered data approximation applications, we now encounter quite high-dimensional spaces. To this end, the approach of regularization networks [23] gives a good framework. This approach allows a direct description of the most important neural networks and it also allows for an equivalent description of support vector machines and n-term approximation schemes [14, 22].
Consider the given set of already classified data (the training set)
$$ S = \{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}\}_{i=1}^{M}. $$
Assume now that these data have been obtained by sampling of an unknown function f which belongs to some function space V defined over $\mathbb{R}^d$. The sampling process was disturbed by noise. The aim is now to recover the function f from the given data as well as possible. This is clearly an ill-posed problem since there are infinitely many possible solutions. To get a well-posed, uniquely solvable problem we have to assume further knowledge on f. To this end, regularization theory [46, 50] imposes an additional smoothness constraint on the solution of the approximation problem and the regularization network approach considers the variational problem
$$ \min_{f \in V} R(f) $$
with
$$ R(f) = \frac{1}{M} \sum_{i=1}^{M} C(f(x_i), y_i) + \lambda \Phi(f). \qquad (1) $$
Here, $C(\cdot,\cdot)$ denotes an error cost function which measures the interpolation error and $\Phi(f)$ is a smoothness functional which must be well defined for $f \in V$. The first term enforces closeness of f to the data, the second term enforces smoothness of f, and the regularization parameter $\lambda$ balances these two terms. Typical examples are
$$ C(x, y) = |x - y| \quad \text{or} \quad C(x, y) = (x - y)^2, $$
and
$$ \Phi(f) = \|Pf\|_2^2 \quad \text{with} \quad Pf = \nabla f \ \text{ or } \ Pf = \Delta f, $$
with $\nabla$ denoting the gradient and $\Delta$ the Laplace operator. The value of $\lambda$ can be chosen according to cross-validation techniques [13, 24, 40, 47] or to some other principle, such as structural risk minimization [48]. Note that we find exactly this type of formulation in the case d = 2, 3 in scattered data approximation methods, see [1, 33], where the regularization term is usually physically motivated.
2.1 Discretization
We now restrict the problem to a finite dimensional subspace $V_N \subset V$. The function f is then replaced by
$$ f_N = \sum_{j=1}^{N} \alpha_j \varphi_j(x). \qquad (2) $$
Here the ansatz functions $\{\varphi_j\}_{j=1}^{N}$ should span $V_N$ and preferably should form a basis for $V_N$. The coefficients $\{\alpha_j\}_{j=1}^{N}$ denote the degrees of freedom. Note that the restriction to a suitably chosen finite-dimensional subspace involves some additional regularization (regularization by discretization) which depends on the choice of $V_N$.
In the remainder of this paper, we restrict ourselves to the choice
$$ C(f_N(x_i), y_i) = (f_N(x_i) - y_i)^2 $$
and
$$ \Phi(f_N) = \|P f_N\|_{L_2}^2 \qquad (3) $$
for some given linear operator P. This way we obtain from the minimization problem a feasible linear system. We thus have to minimize
$$ R(f_N) = \frac{1}{M} \sum_{i=1}^{M} (f_N(x_i) - y_i)^2 + \lambda \|P f_N\|_{L_2}^2, \qquad (4) $$
with $f_N$ in the finite dimensional space $V_N$. We plug (2) into (4) and obtain after differentiation with respect to $\alpha_k$, $k = 1, \ldots, N$,
$$ 0 = \frac{\partial R(f_N)}{\partial \alpha_k} = \frac{2}{M} \sum_{i=1}^{M} \Big( \sum_{j=1}^{N} \alpha_j \varphi_j(x_i) - y_i \Big) \varphi_k(x_i) + 2 \lambda \sum_{j=1}^{N} \alpha_j (P\varphi_j, P\varphi_k)_{L_2}. \qquad (5) $$
This is equivalent to ($k = 1, \ldots, N$)
$$ \sum_{j=1}^{N} \alpha_j \Big[ M \lambda \, (P\varphi_j, P\varphi_k)_{L_2} + \sum_{i=1}^{M} \varphi_j(x_i) \varphi_k(x_i) \Big] = \sum_{i=1}^{M} y_i \varphi_k(x_i). \qquad (6) $$
In matrix notation we end up with the linear system
$$ (\lambda C + B B^T)\, \alpha = B y. \qquad (7) $$
Here C is a square $N \times N$ matrix with entries $C_{j,k} = M \cdot (P\varphi_j, P\varphi_k)_{L_2}$, $j, k = 1, \ldots, N$, and B is a rectangular $N \times M$ matrix with entries $B_{j,i} = \varphi_j(x_i)$, $i = 1, \ldots, M$, $j = 1, \ldots, N$. The vector y contains the data labels $y_i$ and has length M. The unknown vector $\alpha$ contains the degrees of freedom $\alpha_j$ and has length N.
Depending on the regularization operator we obtain different minimization problems in $V_N$. For example, if we use the gradient $\Phi(f_N) = \|\nabla f_N\|_{L_2}^2$ in the regularization expression in (1), we obtain a Poisson problem with an additional term which resembles the interpolation problem. The natural boundary conditions for such a partial differential equation are Neumann conditions. The discretization (2) then gives us the linear system (7) where C corresponds to a discrete Laplacian. To obtain the classifier $f_N$ we now have to solve this system.
2.2 Grid based discrete approximation
Up to now we have not yet been specific about which finite-dimensional subspace $V_N$ and which type of basis functions $\{\varphi_j\}_{j=1}^{N}$ we want to use. In contrast to conventional data mining approaches like radial basis approaches or support vector machines, which work with ansatz functions associated to data points, we now use a certain grid in the attribute space to determine the classifier with the help of these grid points. This is similar to the numerical treatment of partial differential equations.
For reasons of simplicity, here and in the remainder of this paper, we restrict ourselves to the case $x_i \in [0, 1]^d$. This situation can always be reached by a proper rescaling of the data space. A conventional finite element discretization would now employ an equidistant grid $\Omega_n$ with mesh size $h_n = 2^{-n}$ for each coordinate direction, where n is the refinement level. In the following we always use the gradient $P = \nabla$ in the regularization expression (3). Let j denote the multi-index $(j_1, \ldots, j_d) \in \mathbb{N}^d$. We now use piecewise d-linear, i.e. linear in each dimension, so-called hat functions as test and trial functions $\varphi_{n,j}(x)$ on grid $\Omega_n$. Each basis function $\varphi_{n,j}(x)$ is thereby 1 at grid point j and 0 at all other points of grid $\Omega_n$. A finite element method on grid $\Omega_n$ now would give
$$ \big(f_N(x) =\big)\ f_n(x) = \sum_{j_1=0}^{2^n} \cdots \sum_{j_d=0}^{2^n} \alpha_{n,j}\, \varphi_{n,j}(x) $$
and the variational procedure (4)-(6) would result in the discrete linear system
$$ (\lambda C_n + B_n B_n^T)\, \alpha_n = B_n y \qquad (8) $$
of size $(2^n + 1)^d$ and matrix entries corresponding to (7). Note that $f_n$ lives in the space
$$ V_n := \mathrm{span}\{ \varphi_{n,j},\ j_t = 0, \ldots, 2^n,\ t = 1, \ldots, d \}. $$
[Figure 1 shows the grids $\Omega_{4,1}$, $\Omega_{3,2}$, $\Omega_{2,3}$, $\Omega_{1,4}$ and $\Omega_{3,1}$, $\Omega_{2,2}$, $\Omega_{1,3}$, which are combined into the sparse grid $\Omega^c_{4,4}$.]
Figure 1: Combination technique with level n = 4 in two dimensions
The discrete problem (8) might in principle be treated by an appropriate solver like the conjugate gradient method, a multigrid method or some other suitable efficient iterative method. However, this direct application of a finite element discretization and the solution of the resulting linear system by an appropriate solver is clearly not possible for a d-dimensional problem if d is larger than four. The number of grid points is of the order $O(h_n^{-d}) = O(2^{nd})$ and, in the best case, the number of operations is of the same order. Here we encounter the so-called curse of dimensionality: The complexity of the problem grows exponentially with d. At least for d > 4 and a reasonable value of n, the arising system cannot be stored and solved on even the largest parallel computers today.
2.3 The sparse grid combination technique
Therefore we proceed as follows: We discretize and solve the problem on a certain sequence of grids $\Omega_l = \Omega_{l_1, \ldots, l_d}$ with uniform mesh sizes $h_t = 2^{-l_t}$ in the t-th coordinate direction. These grids may possess different mesh sizes for different coordinate directions. To this end, we consider all grids $\Omega_l$ with
$$ l_1 + \ldots + l_d = n + (d - 1) - q, \qquad q = 0, \ldots, d - 1, \quad l_t > 0. \qquad (9) $$
For the two-dimensional case, the grids needed in the combination formula of level 4 are shown in Figure 1. The finite element approach with piecewise d-linear test and trial functions
$$ \varphi_{l,j}(x) := \prod_{t=1}^{d} \varphi_{l_t, j_t}(x_t) \qquad (10) $$
on grid $\Omega_l$ now would give
$$ f_l(x) = \sum_{j_1=0}^{2^{l_1}} \cdots \sum_{j_d=0}^{2^{l_d}} \alpha_{l,j}\, \varphi_{l,j}(x) $$
and the variational procedure (4)-(6) would result in the discrete system
$$ (\lambda C_l + B_l B_l^T)\, \alpha_l = B_l y \qquad (11) $$
with the matrices
$$ (C_l)_{j,k} = M \cdot (\nabla \varphi_{l,j}, \nabla \varphi_{l,k})_{L_2} \quad \text{and} \quad (B_l)_{j,i} = \varphi_{l,j}(x_i), $$
$j_t, k_t = 0, \ldots, 2^{l_t}$, $t = 1, \ldots, d$, $i = 1, \ldots, M$, and the unknown vector $(\alpha_l)_j$, $j_t = 0, \ldots, 2^{l_t}$, $t = 1, \ldots, d$.
We then solve these problems by a feasible method. To this end we use here a diagonally preconditioned conjugate gradient algorithm. But also an appropriate multigrid method with partial semi-coarsening can be applied. The discrete solutions $f_l$ are contained in the spaces
$$ V_l := \mathrm{span}\{ \varphi_{l,j},\ j_t = 0, \ldots, 2^{l_t},\ t = 1, \ldots, d \} \qquad (12) $$
of piecewise d-linear functions on grid $\Omega_l$.
Note that all these problems are substantially reduced in size in comparison to (8). Instead of one problem with size $\dim(V_n) = O(h_n^{-d}) = O(2^{nd})$, we now have to deal with $O(d n^{d-1})$ problems of size $\dim(V_l) = O(h_n^{-1}) = O(2^n)$. Moreover, all these problems can be solved independently, which allows for a straightforward parallelization on a coarse grain level, see [25]. There is also a simple but effective static load balancing strategy available [27].
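As a small illustration (our own sketch, not part of the paper's implementation), the following brute-force enumeration of the grids in (9) makes the size reduction explicit; points shared between the grids are counted once per grid, which is what matters for the work estimate.

```python
from itertools import product
from math import prod

def combination_grids(n, d):
    # grids Omega_l of the combination technique (9):
    # l_1 + ... + l_d = n + (d - 1) - q,  q = 0, ..., d - 1,  l_t >= 1
    for q in range(d):
        target = n + (d - 1) - q
        for l in product(range(1, target + 1), repeat=d):
            if sum(l) == target:
                yield l

def full_grid_size(n, d):
    return (2 ** n + 1) ** d              # size of the full grid problem (8)

def combination_size(n, d):
    # summed sizes of all subproblems (11)
    return sum(prod(2 ** lt + 1 for lt in l) for l in combination_grids(n, d))

print(len(list(combination_grids(4, 2))))   # 7 grids, as in Figure 1
print(full_grid_size(6, 4))                 # 65^4 = 17 850 625 unknowns
print(combination_size(6, 4))               # far fewer unknowns, spread over many small grids
```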
Finally we linearly combine the results $f_l(x) \in V_l$, $f_l = \sum_j \alpha_{l,j} \varphi_{l,j}(x)$, from the different grids $\Omega_l$ as follows:
$$ f_n^{(c)}(x) := \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q} \sum_{|l|_1 = n + (d-1) - q} f_l(x). \qquad (13) $$
The resulting function $f_n^{(c)}$ lives in the sparse grid space
$$ V_n^{(s)} := \sum_{\substack{l_1 + \ldots + l_d = n + (d-1) - q \\ q = 0, \ldots, d-1,\ l_t > 0}} V_l. $$
This space has $\dim(V_n^{(s)}) = O(h_n^{-1} (\log(h_n^{-1}))^{d-1})$. It is spanned by a piecewise d-linear hierarchical tensor product basis, see [8].
Note that the summation of the discrete functions from different spaces $V_l$ in (13) involves d-linear interpolation which resembles just the transformation to a representation in the hierarchical basis. For details see [26, 30, 31]. However, we never explicitly assemble the function $f_n^{(c)}$ but keep instead the solutions $f_l$ on the different grids $\Omega_l$ which arise in the combination formula. Now, any linear operation $F$ on $f_n^{(c)}$ can easily be expressed by means of the combination formula (13) acting directly on the functions $f_l$, i.e.
$$ F(f_n^{(c)}) = \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q} \sum_{l_1 + \ldots + l_d = n + (d-1) - q} F(f_l). \qquad (14) $$
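For orientation, writing out (13) for the two-dimensional case d = 2 with n = 4, i.e. for the grids of Figure 1, gives
$$ f_4^{(c)}(x) = f_{4,1}(x) + f_{3,2}(x) + f_{2,3}(x) + f_{1,4}(x) - f_{3,1}(x) - f_{2,2}(x) - f_{1,3}(x). $$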
Figure 2: Two-dimensional sparse grid (left) and three-dimensional sparse grid (right), n = 5
Therefore, if we now want to evaluate a newly given set of data points $\{\tilde{x}_i\}_{i=1}^{\tilde{M}}$ (the test or evaluation set) by
$$ \tilde{y}_i := f_n^{(c)}(\tilde{x}_i), \qquad i = 1, \ldots, \tilde{M}, $$
we just form the combination of the associated values for $f_l$ according to (13). The evaluation of the different $f_l$ at the test points can be done completely in parallel; their summation needs basically only an all-reduce/gather operation.
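A minimal sketch (ours, not the paper's code) of how this evaluation step can be organized: enumerate the grids together with their combination coefficients $(-1)^q \binom{d-1}{q}$ from (13) and sum the per-grid evaluations as in (14). The function names and the data layout are assumptions made for illustration only.

```python
from itertools import product
from math import comb

def combination_terms(n, d):
    """Grids l and coefficients (-1)^q * binom(d-1, q) of the combination formula (13)."""
    for q in range(d):
        coeff = (-1) ** q * comb(d - 1, q)
        target = n + (d - 1) - q
        for l in product(range(1, target + 1), repeat=d):
            if sum(l) == target:
                yield l, coeff

def combine_evaluations(per_grid_values, n, d):
    """per_grid_values maps each grid l to the vector of values f_l at the test points.
    Returns the combined values of f_n^(c) according to (14)."""
    return sum(coeff * per_grid_values[l] for l, coeff in combination_terms(n, d))

# sanity check: the coefficients reproduce constant functions, i.e. they sum to one
terms = list(combination_terms(4, 2))
print(terms)                        # the 7 grids of Figure 1, with coefficients +1 and -1
print(sum(c for _, c in terms))     # 1
```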
For second order elliptic PDE model problems, it was proven that the combination solution $f_n^{(c)}$ is almost as accurate as the full grid solution $f_n$, i.e. the discretization error satisfies
$$ \|e_n^{(c)}\|_{L_p} := \|f - f_n^{(c)}\|_{L_p} = O\big(h_n^2 \log(h_n^{-1})^{d-1}\big), $$
provided that a slightly stronger smoothness requirement holds on f than for the full grid approach. We need the seminorm
$$ |f|_{\infty} := \Big\| \frac{\partial^{2d} f}{\prod_{j=1}^{d} \partial x_j^2} \Big\|_{\infty} \qquad (15) $$
to be bounded. Furthermore, a series expansion of the error is necessary for the combination technique. Its existence was shown for PDE model problems in [10].
The combination technique is only one of the various methods to solve problems on sparse grids. Note that there exist also finite difference [26, 41] and Galerkin finite element approaches [2, 8, 9] which work directly in the hierarchical product basis on the sparse grid. But the combination technique is conceptually much simpler and easier to implement. Moreover, it allows one to reuse standard solvers for its different subproblems and is straightforwardly parallelizable.
In [20] and in (9) we forced $l_t > 0$, which is necessary for the numerical solution of partial differential equations with Dirichlet boundary conditions. But since equation (6) uses Neumann boundary conditions this is not necessary and we can also use the grids $\Omega_l$ with $l_t \geq 0$, i.e.
$$ l_1 + \ldots + l_d = n - q, \qquad q = 0, \ldots, d - 1, \quad l_t \geq 0, \qquad (16) $$
for the sequences of grids resulting in the sparse grid $\Omega^c_n$ of level n. In Figure 3 we show the case n = 4, d = 2 of this version of the combination technique.
[Figure 3 shows the grids $\Omega_{4,0}$, $\Omega_{3,1}$, $\Omega_{2,2}$, $\Omega_{1,3}$, $\Omega_{0,4}$ and $\Omega_{3,0}$, $\Omega_{2,1}$, $\Omega_{1,2}$, $\Omega_{0,3}$, which are combined into the sparse grid of level 4.]
Figure 3: Combination technique based on (16) with level n = 4, d = 2
Now, for n fixed, the grids in this sequence possess fewer grid points than for (9) and, consequently, less memory is needed. Thus one can handle classification problems with some more attributes than before. Nevertheless a limit on the number of attributes is still imposed due to memory constraints.
2.4 Anisotropic sparse grids
Up to now we treated all attributes of the classification problem the same, i.e. we used the same mesh refinement level for all attributes. Obviously attributes have different properties, different numbers of distinct values, and different variances. For example, to discretize the range of a binary attribute one does not need more than two grid points.
We generalized our approach to account for such situations as well. We use different mesh sizes for each dimension along the lines of [21]. This results in a so-called anisotropic sparse grid. Now a different refinement level $n_j$ for each dimension j, $j = 1, \ldots, d$, can be given instead of only the same refinement level n for all dimensions. This extension of our approach can result in less computing time and better approximation results, depending on the actual data set.
To this end we define the so-called index set $I_n$ for given $n = (n_1, \ldots, n_d)$ of an anisotropic sparse grid. An index set is valid if for all $k \in I_n$ the following holds:
$$ k - e_j \in I_n \quad \text{for } 1 \leq j \leq d,\ k_j > 1, \qquad (17) $$
with $e_j$ denoting the j-th unit vector.
Figure 4: Index sets $I_n$ for $n = (6, 6)$ and $n = (6, 8)$ (left), and anisotropic sparse grid $\Omega^c_n$ with $n = (5, 3)$ based on (9) and (16), respectively (right)
For a given $n = (n_1, \ldots, n_d)$ we now define the index set $I_n$ as all the indices which are below or on the associated hyperplane defined by the indices $(n_1, 0, \ldots, 0), \ldots, (0, \ldots, 0, n_j, 0, \ldots, 0), \ldots, (0, \ldots, 0, n_d)$. Thus, $I_n$ consists of all indices $k = (k_1, \ldots, k_d)$ with
$$ \sum_{j=1}^{d} \frac{k_j}{n_j} \leq 1, \quad \text{with } k_j \geq 0, \qquad (18) $$
where we set $k_j / n_j := 0$ for $k_j = n_j = 0$. The union of all grids associated to the indices in $I_n$ gives an anisotropic sparse grid of level $n$ in the sense of (16). For the corresponding definition in the sense of (9) we have to replace $n$ by $n - 1$ and $k$ by $k - 1$, with $1 = (1, \ldots, 1)$, in (18). Figure 4 (right) shows the anisotropic sparse grid of level (5, 3) for both variants. For example, the index set of the sparse grid $\Omega^c_{5,3}$, in the sense of (9), is characterised by the indices (5, 1), (3, 2), (1, 3), and consists of these 3 indices and all smaller ones.
We define the characteristic function $\chi_{I_n}$ of $I_n$ as
$$ \chi_{I_n}(k) = \begin{cases} 1 & \text{if } k \in I_n, \\ 0 & \text{else.} \end{cases} $$
The generalized combination technique is now defined as follows:
$$ f_{I_n}^{(c)}(x) := \sum_{k \in I_n} \Bigg( \sum_{z = (0, \ldots, 0)}^{(1, \ldots, 1)} (-1)^{|z|_1} \chi_{I_n}(k + z) \Bigg) f_k(x). \qquad (19) $$
Of course, only if the sum over the $\chi_{I_n}$ of the neighbors $k + z$ is non-zero for a $k \in I_n$ does the discrete problem (11) on the corresponding grid $\Omega_k$ have to be dealt with to compute the classifier (19) on the anisotropic sparse grid $\Omega^c_n$. In Figure 4 (left) these non-zero indices are marked by stars, zero indices by crosses.
The difficulty now is to know a priori for which dimensions higher refinement levels $n_j$ are worthwhile. Sometimes further knowledge on the data is available which determines such dimensions but, in general, strategies like cross-validation are necessary to find the best anisotropic sparse grid.
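The index-set bookkeeping of (18) and (19) is straightforward to prototype. The following brute-force sketch (our own illustration, not the paper's implementation) builds $I_n$ in the sense of (16)/(18) and computes, for every $k \in I_n$, the coefficient from (19); only grids with a non-zero coefficient have to be solved.

```python
from itertools import product

def index_set(n):
    """I_n of (18): all k >= 0 with sum_j k_j / n_j <= 1 (k_j / n_j := 0 if k_j = n_j = 0)."""
    candidates = product(*(range(nj + 1) for nj in n))   # (18) implies k_j <= n_j
    return {k for k in candidates
            if sum(kj / nj for kj, nj in zip(k, n) if nj > 0) <= 1.0 + 1e-12}

def generalized_coefficients(n):
    """Coefficient of each grid k in the generalized combination technique (19):
    c_k = sum over z in {0,1}^d of (-1)^|z|_1 * chi_{I_n}(k + z)."""
    I = index_set(n)
    d = len(n)
    coeffs = {}
    for k in I:
        c = sum((-1) ** sum(z) * (tuple(ki + zi for ki, zi in zip(k, z)) in I)
                for z in product((0, 1), repeat=d))
        if c != 0:
            coeffs[k] = c          # only these grids enter the classifier (19)
    return coeffs

# isotropic case n = (4, 4): recovers the usual +1 / -1 pattern of the combination technique
print(sorted(generalized_coefficients((4, 4)).items()))
# anisotropic example of Figure 4: level (5, 3)
print(sorted(generalized_coefficients((5, 3)).items()))
```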
Figure 5: Kuhn's triangulation of a three-dimensional unit cube
2.5 Simplicial basis functions
So far we have only considered d-linear basis functions based on a tensor-product approach; this case was presented in detail in [20]. But, on the grids of the combination technique, linear basis functions based on a simplicial discretization are also possible. Here, we use the so-called Kuhn triangulation [16, 34] for each rectangular block, see Figure 5. Now, the summation of the discrete functions from the different spaces $V_l$ in (13) only involves linear interpolation.
                      d-linear basis functions               linear basis functions
                      C_l          G_l := B_l B_l^T          C_l              G_l := B_l B_l^T
storage               O(3^d N)     O(3^d N)                  O((2d + 1) N)    O(2^d N)
assembly              O(3^d N)     O(d 2^{2d} M)             O((2d + 1) N)    O((d + 1)^2 M)
mv-multiplication     O(3^d N)     O(3^d N)                  O((2d + 1) N)    O(2^d N)

Table 1: Complexities of the storage, the assembly and the matrix-vector multiplication for the different matrices arising in the combination method on one grid $\Omega_l$ for both discretization approaches. Note that $C_l$ and $G_l$ can be stored together in one matrix structure.
The theoretical properties of this variant of the sparse grid technique still have to be investigated in more detail. However, the results which are presented in Section 3 warrant its use. We see, if at all, just slightly worse results with linear basis functions than with d-linear basis functions, and we believe that our new approach results in the same approximation order.
Since in our new variant of the combination technique the overlap of supports, i.e. the regions where two basis functions are both non-zero, is greatly reduced due to the use of a simplicial discretization, the complexities scale significantly better. This concerns both the cost of the assembly and the storage of the non-zero entries of the sparsely populated matrices from (8), see Table 1. Note that for general operators P the complexities for $C_l$ scale with $O(2^d N)$. But for our choice of $P = \nabla$ structural zero-entries arise, which need not be considered and which further reduce the complexities, see Table 1 (right), column $C_l$. The actual iterative solution process (by a diagonally preconditioned conjugate gradient method) scales independently of the number of data points for both approaches.
Note however that both the storage and the run time complexities still depend exponentially on the dimension d. Presently, due to the limitations of the memory of modern workstations (512 MByte - 2 GByte), we can therefore only deal with the case d ≤ 10 for d-linear basis functions and d ≤ 15 for linear basis functions. A decomposition of the matrix entries over several computers in a parallel environment would permit more dimensions.

level   λ         training correctness   testing correctness
5       0.0003    94.87 %                82.99 %
6       0.0006    97.42 %                84.02 %
7       0.00075   100.00 %               88.66 %
8       0.0006    100.00 %               89.18 %
9       0.0006    100.00 %               88.14 %

Table 2: Leave-one-out cross-validation results for the spiral data set
3 Numerical results
We now apply our approach to different test data sets. Here we mainly consider the properties of the method with simplicial basis functions, the main result of this paper, and give only a few results with anisotropic sparse grids. Both synthetic data and real data from practical data mining applications are used. All the data sets are rescaled to $[0, 1]^d$ for ease of computation. For data sets with a small number of data points we map the median of each attribute to 0.5 and scale the other values accordingly. The bigger data sets are just normalized to $[0, 1]^d$. To evaluate our method we give the correctness rates on testing data sets, if available, or the ten-fold cross-validation results otherwise. For further details and a critical discussion on the evaluation of the quality of classification algorithms see [13, 40]. The run times are measured on a Pentium III 700 MHz machine.
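The exact rescaling recipe is not spelled out here, so the following helpers are only one plausible reading of it and should be taken as an assumption on our part: a piecewise linear map that sends the attribute median to 0.5 for the small data sets, and a plain min-max normalization for the bigger ones.

```python
import numpy as np

def rescale_median_to_half(col):
    """Assumed reading of the small-data-set rescaling: median -> 0.5, values below
    the median mapped linearly onto [0, 0.5], values above onto [0.5, 1]."""
    col = np.asarray(col, dtype=float)
    lo, med, hi = col.min(), np.median(col), col.max()
    out = np.full_like(col, 0.5)
    below, above = col < med, col > med
    if med > lo:
        out[below] = 0.5 * (col[below] - lo) / (med - lo)
    if hi > med:
        out[above] = 0.5 + 0.5 * (col[above] - med) / (hi - med)
    return out

def rescale_min_max(col):
    """Plain normalization to [0, 1], as used for the bigger data sets."""
    col = np.asarray(col, dtype=float)
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo) if hi > lo else np.full_like(col, 0.5)

print(rescale_median_to_half([1, 2, 3, 10]))   # -> [0., 0.33, 0.53, 1.] (approximately)
```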
3.1 Two-dimensional problems
We first consider synthetic two-dimensional problems with small sets of data which correspond to certain structures.
3.1.1 Spiral
The first example is the spiral data set proposed by Alexis Wieland of MITRE Corp. [51]. Here, 194 data points describe two intertwined spirals, see Figure 6. This is surely an artificial problem which does not appear in practical applications. However, it serves as a hard test case for new data mining algorithms. It is known that neural networks can have severe problems with this data set and some neural networks cannot separate the two spirals at all [43].
In Table 2 we give the correctness rates achieved with the leave-one-out cross-validation method, i.e. a 194-fold cross-validation. The best testing correctness was achieved on level 8 with 89.18 %, in comparison to 77.20 % in [43].
In Figure 6 we show the corresponding results obtained with our sparse grid combination method for the levels 5 to 8. With level 7 the two spirals are clearly detected and resolved. Note that here 1281 grid points are contained in the sparse grid. For level 8 (2817 sparse grid points) the shape of the two reconstructed spirals gets smoother and the reconstruction gets more precise.

Figure 6: Spiral data set, sparse grid with level 5 (top left) to 8 (bottom right)
3.1.2 Ripley
This data set, taken from [39], consists of 250 training data points and 1000 test points. The data set was generated synthetically and is known to exhibit 8 % error. Thus no better testing correctness than 92 % can be expected.
Since we now have training and testing data, we proceed as follows: First we use the training set to determine the best regularization parameter $\lambda$ per ten-fold cross-validation. The best test correctness rate and the corresponding $\lambda$ are given for different levels n in the first two columns of Table 3. With this $\lambda$ we then compute the sparse grid classifier from the 250 training data. Column three of Table 3 gives the result of this classifier on the (previously unknown) test data set. We see that our method works well. Already level 4 is sufficient to obtain a test correctness rate of 91.4 %. The reason is surely the relative simplicity of the data, see Figure 7. Just a few hyperplanes should be enough to separate the classes quite properly. We also see that there is not much need to use any higher levels; on the contrary, there is even an overfitting effect visible in Figure 7. In column 4 we show the results from [20]. We see that we achieve almost the same results with d-linear functions. To see what kind of results could be possible with a more sophisticated strategy for determining $\lambda$, we give in the last two columns of Table 3 the testing correctness which is achieved for the best possible $\lambda$. To this end we compute for all (discrete) values of $\lambda$ the sparse grid classifiers from the 250 data points and evaluate them on the test set. We then pick the best result. We clearly see that there is not much of a difference. This indicates that the approach to determine the value of $\lambda$ from the training set by cross-validation works well. Again we have almost the same results with linear and d-linear basis functions. Note that a testing correctness of 90.6 % and 91.1 % was achieved with neural networks in [39] and [38], respectively, for this data set.

               linear basis functions                  d-linear basis   best possible %
level   ten-fold test %   λ        test data %         test data %      linear   d-linear
1       85.2              0.0020   89.9                89.8             90.6     90.3
2       85.2              0.0065   90.3                90.4             90.4     90.9
3       88.4              0.0020   90.9                90.6             91.0     91.2
4       87.2              0.0035   91.4                90.6             91.4     91.2
5       88.0              0.0055   91.3                90.9             91.5     91.1
6       86.8              0.0045   90.7                90.8             90.7     90.8
7       86.8              0.0008   89.0                88.8             91.1     91.0
8       87.2              0.0037   91.0                89.7             91.2     91.0
9       87.7              0.0015   90.1                90.9             91.1     91.0
10      89.2              0.0020   91.0                90.6             91.2     91.1

Table 3: Results for the Ripley data set

Figure 7: Ripley data set, combination technique with linear basis functions. Left: level 4, λ = 0.0035. Right: level 8, λ = 0.0037
                  Sparse Grid Combination Technique                               Results with SVM [17]
                  d-linear            linear              lin. dim 3 refined
Level   10-fold   %       time (sec)  %       time (sec)  %       time (sec)      linear     non-linear
1       train     77.7 %  2.2         82.3 %  0.1         78.7 %  0.2             70.2 %     75.8 %
        test      71.8 %              72.4 %              73.9 %                  70.0 %     73.7 %
2       train     84.3 %  27.0        80.0 %  1.0         84.1 %  1.4             time (sec.)
        test      70.4 %              72.5 %              71.6 %                  0.8        20.6
3       train     91.4 %  194.1       86.6 %  9.6         88.0 %  10.8
        test      70.8 %              69.9 %              70.5 %
4       train     92.6 %  1217.6      94.2 %  68.3        93.1 %  75.4
        test      68.8 %              71.4 %              70.5 %

Table 4: Results for the BUPA liver disorders data set
3.2 Small data sets
3.2.1 BUPA Liver
The BUPA Liver Disorders data set from the Irvine Machine Learning Database Repository [6] consists of 345 data points with 6 features and a selector field used to split the data into 2 sets with 145 instances and 200 instances, respectively. Here we have no test data and therefore can only report our ten-fold cross-validation results.
We compare with our d-linear results from [20] and with results using linear and non-linear support vector machines. The results are given in Table 4. Our sparse grid combination approach with linear basis functions performs slightly better than the d-linear approach, the best result being 72.5 % testing correctness on level 2, and needs significantly less computing time. It also performs better than a linear SVM (70.0 %), but worse than a non-linear SVM (73.7 %), see [17].
Refining dimension 3 once, i.e. using the anisotropic grid $\Omega^c_{1,1,2,1,1,1}$, we achieve the best result of 73.9 % using linear basis functions. We picked this attribute for refinement through cross-validation.
3.2.2 Galaxy Dim
The Galaxy Dim data set is a commonly used subset of the data presented in [37]. It consists of 4192 data points with 14 attributes; the two classes have almost the same number of instances. Again no test data is present and therefore we can only report our ten-fold cross-validation results. Since this data set is 14-dimensional, we use the combination technique of equation (16) with simplicial basis functions to be able to fit the matrices into main memory.
With isotropic sparse grids we achieve a testing correctness of 95.6 % for level 0 and a slightly better result of 95.9 % on level 1, see Table 5. If we refine dimension 5 once, we already achieve with level 0 the slightly better result of 96.0 %, and this needs only 66 seconds, far less time than the original level 1 result (965 seconds). If we use level 1 with dimension 5 refined once, the result improves slightly to 96.2 %. Note that with linear SVMs a 95.0 % correctness rate was achieved in 5 seconds in [17].

                  Sparse Grid Combination Technique          Results with
                  normal              dim 5 refined          linear SVM [17]
Level   10-fold   %       time (sec)  %       time (sec)
0       train     97.2 %  58          97.9 %  66             95.0 %
        test      95.6 %              96.0 %                 95.0 %
1       train     97.6 %  965         98.1 %  1185           time (sec.)
        test      95.9 %              96.2 %                 5.2

Table 5: Results for the Galaxy Dim data set
3.3 Big and massive data sets
3.3.1 Synthetic massive data set in 6D
To measure the performance on a massive data set we produced with DatGen [36] a 6-dimensional test case with 5 million training points and 20 000 points for testing. We used the call datgen -r1 -X0/100,R,O:0/100,R,O:0/100,R,O: 0/100,R,O:0/200,R,O:0/200,R,O -R2 -C2/4 -D2/5 -T10/60 -O5020000 -p -e0.15. The results are given in Table 6. Note that already on level 1 a testing correctness of over 90 % was achieved with just λ = 0.01. The main observation on this test case concerns the execution time. Besides the total run time, we also give the CPU time which is needed for the computation of the matrices $G_l = B_l B_l^T$.
We see that with linear basis functions really huge data sets of 5 million points can be processed in reasonable time. Note that more than 50 % of the computation time is spent on the data matrix assembly only and, more importantly, that the execution time scales linearly with the number of data points. The latter is also the case for the d-linear functions but, as mentioned, this approach needs more operations per data point and results in a much longer execution time, compare also Table 6. Especially the assembly of the data matrix needs more than 96 % of the total run time for this variant. For our present example the linear basis approach is about 40 times faster than the d-linear approach on the same refinement level, e.g. for level 2 we need 17 minutes in the linear case and 11 hours in the d-linear case. For higher dimensions this factor will be even larger.

                           training      testing       total        data matrix   # of
          # of points      correctness   correctness   time (sec)   time (sec)    iterations
linear basis functions
level 1   50 000           90.4          90.5          3            1             23
          500 000          90.5          90.5          25           8             25
          5 million        90.5          90.6          242          77            28
level 2   50 000           91.4          91.0          12           5             184
          500 000          91.2          91.1          110          55            204
          5 million        91.1          91.2          1086         546           223
level 3   50 000           92.2          91.4          48           23            869
          500 000          91.7          91.7          417          226           966
          5 million        91.6          91.7          4087         2239          1057
d-linear basis functions
level 1   500 000          90.7          90.8          597          572           91
          5 million        90.7          90.7          5897         5658          102
level 2   500 000          91.5          91.6          4285         4168          656
          5 million        91.4          91.5          42690        41596         742

Table 6: Results for a 6D synthetic massive data set, λ = 0.01

3.3.2 Forest cover type
The forest cover type data set comes from the UCI KDD Archive [4]; it was also used in [32], where an approach similar to ours was followed. It consists of cartographic variables for 30 x 30 meter cells, and a forest cover type is to be predicted. The 12 originally measured attributes resulted in 54 attributes in the data set; besides 10 quantitative variables there are 4 binary wilderness areas and 40 binary soil type variables. We only use the 10 quantitative variables. The class label has 7 values: Spruce/Fir, Lodgepole Pine, Ponderosa Pine, Cottonwood/Willow, Aspen, Douglas-fir and Krummholz. Like [32] we only report results for the classification of Ponderosa Pine, which has 35754 instances out of the total 581012.
Since far less than 10 % of the instances belong to Ponderosa Pine we enforce this class with a factor of 5, i.e. Ponderosa Pine has a class value of 5, all others of -1, and the threshold value for separating the classes is 0. The data set was randomly separated into a training set, a test set, and an evaluation set, all similar in size.
In [32] only results up to 6 dimensions could be reported. In Table 7 we present our results for the 6 dimensions chosen there, i.e. the dimensions 1, 4, 5, 6, 7, and 10, and for all 10 dimensions as well. To give an overview of the behavior over several λ we present for each level n the overall correctness results, the correctness results for Ponderosa Pine and the correctness results for the other class for three values of λ. We then give results on the evaluation set for a chosen λ.
We see in Table 7 that already with level 1 we have a testing correctness of 93.95 % for the Ponderosa Pine in the 6-dimensional version. Higher refinement levels do not give better results. The result of 93.52 % on the evaluation set is almost the same as the corresponding testing correctness. Note that in [32] a correctness rate of 86.97 % was achieved on the evaluation set.
The usage of all 10 dimensions improves the results slightly; we get 93.81 % as our evaluation result on level 1. As before, higher refinement levels do not improve the results for this data set.
                                 λ         testing correctness                      time
                                           overall   Ponderosa Pine   other class   (sec.)
6 dimensions
level 1                          0.0005    92.7 %    93.9 %           92.6 %        3.4
                                 0.0050    92.5 %    94.0 %           92.4 %        3.4
                                 0.0500    92.5 %    93.4 %           92.4 %        3.4
        on evaluation set        0.0050    92.5 %    93.5 %           92.4 %        3.4
level 2                          0.0001    93.4 %    92.1 %           93.4 %        23.1
                                 0.0010    93.2 %    92.3 %           93.3 %        23.1
                                 0.0100    92.3 %    89.0 %           92.5 %        23.1
        on evaluation set        0.0010    93.2 %    91.7 %           93.3 %        23.1
level 3                          0.0010    92.8 %    90.9 %           92.9 %        93.9
                                 0.0100    93.1 %    91.7 %           93.2 %        93.8
                                 0.1000    93.5 %    88.0 %           93.9 %        93.8
        on evaluation set        0.0100    93.0 %    91.4 %           93.1 %        93.8
10 dimensions
level 1                          0.0025    93.6 %    94.0 %           93.6 %        20.1
                                 0.0250    93.6 %    94.2 %           93.5 %        18.2
                                 0.2500    93.6 %    92.3 %           93.7 %        17.8
        on evaluation set        0.0250    93.5 %    93.8 %           93.5 %        18.2
level 2                          0.0050    93.0 %    92.4 %           93.0 %        316.6
                                 0.0500    93.7 %    93.0 %           93.7 %        282.2
                                 0.5000    93.1 %    91.8 %           93.2 %        273.6
        on evaluation set        0.0500    93.7 %    92.9 %           93.8 %        282.2

Table 7: Results for forest cover type data set using 6 and 10 attributes

In [32] it was remarked that the elevation is the dominant attribute of this data set. We therefore choose this dimension for refinement. In Table 8 we present the results. For a conventional isotropic sparse grid a finer resolution for all dimensions does not improve the result, but if we just refine the sparse grid discretization for the elevation attribute only, the results improve slightly for one refinement step. They improve further for two and three refinement steps to the best result of 95.5 % evaluation correctness for Ponderosa Pine. Note that the results did not improve for the anisotropic grids $\Omega^c_{k,2,\ldots,2}$, $k = 3, 4, 5$.
Note that the forest cover example is sound enough to serve as an example of classification, but it might strike forest scientists as being amusingly superficial. It has been known for 30 years that the dynamics of forest growth can have a dominant effect on which species is present at a given location [7], yet there are no dynamic variables in the classifier. This can be seen as a warning never to assume that the available data contains all the relevant information.
3.3.3 ndcHard data set
This 10-dimensional data set of two million instances was synthetically generated and first used in [35]. Like in the synthetic 6-dimensional example of Section 3.3.1, the main observations concern the run time. Besides the total run time, we also give the CPU time which is needed for the computation of the matrices $G_l = B_l B_l^T$. Note that the highest amount of memory needed (for level 2 in the case of 2 million data points) was 350 MBytes, about 250 MBytes for the matrix and about 100 MBytes for keeping the data points in memory.
In Table 9 we give the results using the combination technique of equation (9). More than 50 % of the run time is spent on the assembly of the data matrix. The time needed for the data matrix scales linearly with the number of data points. The total run time seems to scale even better than linearly. Already for level 1 we get a testing correctness of 84.9 %; no improvement is achieved with level 2.

                                 λ          testing correctness                      time
                                            overall   Ponderosa Pine   other class   (sec.)
Ω^c_{2,1,...,1}                  0.0050     93.9 %    94.4 %           93.9 %        29.6
        on evaluation set        0.0050     93.9 %    94.0 %           93.9 %        29.6
Ω^c_{3,1,...,1}                  0.0200     94.2 %    95.2 %           94.1 %        47.7
        on evaluation set        0.0200     94.2 %    95.2 %           94.2 %        47.7
Ω^c_{4,1,...,1}                  0.00150    94.6 %    95.5 %           94.5 %        115.1
        on evaluation set        0.00150    94.6 %    95.5 %           94.6 %        115.4

Table 8: Results for forest cover type data set with anisotropic grids
Since all dimensions are treated equally during the generation of the data set, one should expect no improvement through the use of anisotropic sparse grids. Numerical tests with refined dimensions confirm this expectation. Note that with support vector machines correctness rates of 69.5 % were reported in [17, 35].
                        training      testing       total        data matrix   # of
        # of points     correctness   correctness   time (sec)   time (sec)    iterations
level 1 20 000          86.2 %        84.2 %        6.3          0.9           45
        200 000         85.1 %        84.8 %        16.2         8.7           51
        2 million       84.9 %        84.9 %        114.9        84.9          53
level 2 20 000          85.1 %        83.8 %        134.6        10.3          566
        200 000         84.5 %        84.2 %        252.3        98.2          625
        2 million       84.3 %        84.2 %        1332.2       966.6         668

Table 9: Results for the ndcHard data set
4 Conclusions
We presented the anisotropic sparse grid combination technique with linear basis functions based on simplices for the classification of data in moderate-dimensional spaces. Our new method gave good results for a wide range of problems. It is capable of handling huge data sets with millions of points. The run time scales only linearly with the number of data. This is an important property for many practical applications where often the dimension of the problem can be substantially reduced by certain preprocessing steps but the number of data can be extremely huge. We believe that our sparse grid combination method possesses a great potential in such practical application problems.
For the data sets used here the full effect of an anisotropic sparse grid did not take place, since the best results were often achieved already for level 1 or 2.
Right now the refinement levels for each attribute have to be given a priori. To find good values for these one has to use either trial-and-error approaches with cross-validation or knowledge about the attributes. For numerical integration [21] or the numerical solution of partial differential equations self-adaptive strategies are well known and widely used. These ideas should be transferable to the classification problem. Furthermore, such dimension-adaptive refinement strategies could be extended to a tool for dimension reduction.
A parallel version of the sparse grid combination technique reduces the run time significantly, see [19]. Note that our method is easily parallelizable already on a coarse grain level. A second level of parallelization is possible on each grid of the combination technique with the standard techniques known from the numerical treatment of partial differential equations.
Note furthermore that our approach delivers a continuous classifier function which approximates the data. It therefore can be used without modification for regression problems as well. This is in contrast to many other methods like e.g. decision trees.
Finally, for reasons of simplicity, we used the operator $P = \nabla$. But other differential (e.g. $P = \Delta$) or pseudo-differential operators can be employed here with their associated regular finite element ansatz functions.
5 Acknowledgements
Part of the work was supported by the German Bundesministerium für Bildung und Forschung (BMB+F) within the project 03GRM6BN. This work was carried out in cooperation with Prudential Systems Software GmbH, Chemnitz. The authors thank one of the original referees for his remarks on the forest cover data set.
References
[1] E. Arge, M. Dæhlen, and A. Tveito. Approximation of scattered data using smooth grid functions. J. Comput. Appl. Math., 59:191–205, 1995.
[2] R. Balder. Adaptive Verfahren für elliptische und parabolische Differentialgleichungen auf dünnen Gittern. Dissertation, Technische Universität München, 1994.
[3] G. Baszenski. N-th order polynomial spline blending. In W. Schempp and K. Zeller, editors, Multivariate Approximation Theory III, ISNM 75, pages 35–46. Birkhäuser, Basel, 1985.
[4] S. D. Bay. The UCI KDD archive. http://kdd.ics.uci.edu, 1999.
[5] M. J. A. Berry and G. S. Linoff. Mastering Data Mining. Wiley, 2000.
[6] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[7] D. Botkin, J. Janak, and J. Wallis. Some ecological consequences of a computer model of forest growth. J. Ecology, 60:849–872, 1972.
[8] H.-J. Bungartz. Dünne Gitter und deren Anwendung bei der adaptiven Lösung der dreidimensionalen Poisson-Gleichung. Dissertation, Institut für Informatik, Technische Universität München, 1992.
[9] H.-J. Bungartz, T. Dornseifer, and C. Zenger. Tensor product approximation spaces for the efficient numerical solution of partial differential equations. In Proc. Int. Workshop on Scientific Computations, Konya, 1996. Nova Science Publishers, 1997. To appear.
[10] H.-J. Bungartz, M. Griebel, D. Röschke, and C. Zenger. Pointwise convergence of the combination technique for the Laplace equation. East-West J. Numer. Math., 2:21–45, 1994.
[11] V. Cherkassky and F. Mulier. Learning from Data - Concepts, Theory and Methods. John Wiley & Sons, 1998.
[12] K. Cios, W. Pedrycz, and R. Swiniarski. Data Mining Methods for Knowledge Discovery. Kluwer, 1998.
[13] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1924, 1998.
[14] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1–50, 2000.
[15] K. Frank, S. Heinrich, and S. Pereverzev. Information Complexity of Multivariate Fredholm Integral Equations in Sobolev Classes. J. of Complexity, 12:17–34, 1996.
[16] H. Freudenthal. Simplizialzerlegungen von beschränkter Flachheit. Annals of Mathematics, 43:580–582, 1942.
[17] G. Fung and O. Mangasarian. Proximal support vector machine classifiers. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 77–86, 2001.
[18] J. Garcke and M. Griebel. On the computation of the eigenproblems of hydrogen and helium in strong magnetic and electric fields with the sparse grid combination technique. Journal of Computational Physics, 165(2):694–716, 2000.
[19] J. Garcke and M. Griebel. On the parallelization of the sparse grid approach for data mining. In Large-Scale Scientific Computations, Third International Conference, LSSC 2001, Sozopol, Bulgaria, volume 2179 of Lecture Notes in Computer Science, pages 22–32, 2001.
[20] J. Garcke, M. Griebel, and M. Thess. Data mining with sparse grids. Computing, 67(3):225–253, 2001.
[21] T. Gerstner and M. Griebel. Numerical Integration using Sparse Grids. Numer. Algorithms, 18:209–232, 1998.
[22] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455–1480, 1998.
[23] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219–265, 1995.
[24] G. Golub, M. Heath, and G. Wahba. Generalized cross validation as a method for choosing a good ridge parameter. Technometrics, 21:215–224, 1979.
[25] M. Griebel. The combination technique for the sparse grid solution of PDEs on multiprocessor machines. Parallel Processing Letters, 2(1):61–70, 1992.
[26] M. Griebel. Adaptive sparse grid multilevel methods for elliptic PDEs based on finite differences. Computing, 61(2):151–179, 1998.
[27] M. Griebel, W. Huber, T. Störtkuhl, and C. Zenger. On the parallel solution of 3D PDEs on a network of workstations and on vector computers. In A. Bode and M. D. Cin, editors, Parallel Computer Architectures: Theory, Hardware, Software, Applications, volume 732 of Lecture Notes in Computer Science, pages 276–291. Springer Verlag, 1993.
[28] M. Griebel and S. Knapek. Optimized tensor-product approximation spaces. Constructive Approximation, 16(4):525–540, 2000.
[29] M. Griebel, P. Oswald, and T. Schiekofer. Sparse grids for boundary integral equations. Numer. Mathematik, 83(2):279–312, 1999.
[30] M. Griebel, M. Schneider, and C. Zenger. A combination technique for the solution of sparse grid problems. In P. de Groen and R. Beauwens, editors, Iterative Methods in Linear Algebra, pages 263–281. IMACS, Elsevier, North Holland, 1992.
[31] M. Griebel and V. Thurner. The efficient solution of fluid dynamics problems by the combination technique. Int. J. Num. Meth. for Heat and Fluid Flow, 5(3):251–269, 1995.
[32] M. Hegland, O. M. Nielsen, and Z. Shen. High dimensional smoothing based on multilevel analysis. Technical report, Data Mining Group, The Australian National University, Canberra, November 2000. Submitted to SIAM J. Scientific Computing.
[33] J. Hoschek and D. Lasser. Grundlagen der geometrischen Datenverarbeitung, chapter 9. Teubner, 1992.
[34] H. W. Kuhn. Some combinatorial lemmas in topology. IBM J. Res. Develop., 4:518–524, 1960.
[35] O. L. Mangasarian and D. R. Musicant. Lagrangian support vector machines. Journal of Machine Learning Research, 1:161–177, 2001.
[36] G. Melli. Datgen: A program that creates structured data. Website. http://www.datasetgenerator.com.
[37] S. Odewahn, E. Stockwell, R. Pennington, R. Humphreys, and W. Zumach. Automated star/galaxy discrimination with neural networks. Astronomical Journal, 103(1):318–331, 1992.
[38] W. D. Penny and S. J. Roberts. Bayesian neural networks for classification: how useful is the evidence framework? Neural Networks, 12:877–892, 1999.
[39] B. D. Ripley. Neural networks and related methods for classification. Journal of the Royal Statistical Society B, 56(3):409–456, 1994.
[40] S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–327, 1997.
[41] T. Schiekofer. Die Methode der Finiten Differenzen auf dünnen Gittern zur Lösung elliptischer und parabolischer partieller Differentialgleichungen. Dissertation, Institut für Angewandte Mathematik, Universität Bonn, 1999.
[42] W. Sickel and F. Sprengel. Interpolation on sparse grids and Nikolskij-Besov spaces of dominating mixed smoothness. J. Comput. Anal. Appl., 1:263–288, 1999.
[43] S. Singh. 2D spiral pattern recognition with possibilistic measures. Pattern Recognition Letters, 19(2):141–147, 1998.
[44] S. A. Smolyak. Quadrature and interpolation formulas for tensor products of certain classes of functions. Dokl. Akad. Nauk SSSR, 148:1042–1043, 1963. Russian; Engl. Transl.: Soviet Math. Dokl. 4:240–243, 1963.
[45] V. N. Temlyakov. Approximation of functions with bounded mixed derivative. Proc. Steklov Inst. Math., 1, 1989.
[46] A. N. Tikhonov and V. A. Arsenin. Solutions of ill-posed problems. W. H. Winston, Washington D.C., 1977.
[47] F. Utreras. Cross-validation techniques for smoothing spline functions in one or two dimensions. In T. Gasser and M. Rosenblatt, editors, Smoothing techniques for curve estimation, pages 196–231. Springer-Verlag, Heidelberg, 1979.
[48] V. N. Vapnik. Estimation of dependences based on empirical data. Springer-Verlag, Berlin, 1982.
[49] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[50] G. Wahba. Spline models for observational data, volume 59 of Series in Applied Mathematics. SIAM, Philadelphia, 1990.
[51] A. Wieland. Spiral data set. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/ai-repository/ai/areas/neural/bench/cmu/0.html.
[52] C. Zenger. Sparse grids. In W. Hackbusch, editor, Parallel Algorithms for Partial Differential Equations, Proceedings of the Sixth GAMM-Seminar, Kiel, 1990, volume 31 of Notes on Num. Fluid Mech. Vieweg-Verlag, 1991.