Theoretical Properties of Functional Multi Layer Perceptrons

Fabrice Rossi, Brieuc Conan-Guez and François Fleuret
$$H(a, b, w, g) = \sum_{i=1}^{k} a_i \, T\left(b_i + \int w_i g \, d\mu\right), \qquad (1)$$

where the $w_i$ are functions of the functional weight space.
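To make equation (1) concrete, here is a minimal numerical sketch (not from the paper): the weight functions $w_i$ and the input function $g$ are represented as Python callables, $\mu$ is assumed to be the Lebesgue measure on $[0, 1]$, the integral is approximated by a Riemann sum on a grid, and $T$ is taken to be tanh; all of these choices are illustrative assumptions.

```python
import numpy as np

def functional_mlp_output(a, b, weight_funcs, g, grid, T=np.tanh):
    """Sketch of equation (1): sum_i a_i * T(b_i + integral of w_i * g dmu).

    The integral is approximated by a Riemann sum on `grid`; mu is assumed
    here to be the Lebesgue measure on [0, 1].
    """
    dt = grid[1] - grid[0]
    output = 0.0
    for a_i, b_i, w_i in zip(a, b, weight_funcs):
        integral = np.sum(w_i(grid) * g(grid)) * dt  # approximates int w_i g dmu
        output += a_i * T(b_i + integral)
    return output

# Hypothetical example with two hidden units and sinusoidal weight functions.
grid = np.linspace(0.0, 1.0, 200)
a, b = [0.5, -1.2], [0.1, 0.3]
weight_funcs = [lambda t: np.sin(2 * np.pi * t), lambda t: np.cos(2 * np.pi * t)]
g = lambda t: t ** 2  # the functional input
print(functional_mlp_output(a, b, weight_funcs, g, grid))
```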
3 Universal approximation
The practical usefulness of MLPs is directly related to one of their most important properties: they are universal approximators (e.g. [3]). M. Stinchcombe has demonstrated in [7] that universal approximation is possible even if the input space of the MLP is an (almost) arbitrary vector space. Unfortunately, the proposed theorems use quite complex assumptions on linear form approximations. We propose here simplified versions that are directly applicable to FDA.
3.1 Theoretical results
Following [7], we introduce a definition:

Definition 1. If $X$ is a topological vector space, $A$ a set of functions from $X$ to $\mathbb{R}$ and $T$ a function from $\mathbb{R}$ to $\mathbb{R}$, $S^T_X(A)$ is the set of functions exactly computed by one hidden layer functional perceptrons with input in $X$, one real output, and weight forms in $A$, i.e. the set of functions from $X$ to $\mathbb{R}$ of the form $h(x) = \sum_{i=1}^{p} a_i \, T(l_i(x) + b_i)$ where $p \in \mathbb{N}$, $a_i \in \mathbb{R}$, $b_i \in \mathbb{R}$ and $l_i \in A$.
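To relate Definition 1 to equation (1) (a connection added here for readability): when $X = L^p(\mu)$ and each weight form is the integral against a weight function, $l_i(g) = \int w_i g \, d\mu$, equation (1) reads

$$H(a, b, w, g) = \sum_{i=1}^{k} a_i \, T\bigl(l_i(g) + b_i\bigr),$$

which is precisely a function of the form required by Definition 1, so the functional MLP of equation (1) belongs to $S^{T}_{L^p(\mu)}(A)$ whenever every $l_i$ belongs to $A$.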
We have the following results:
Corollary 1. Let $\mu$ be a finite positive Borel measure on $\mathbb{R}^n$. Let $1 < p \leq +\infty$ be an arbitrary real number and $q$ be the conjugate exponent of $p$. Let $M$ be a dense subset of $L^q(\mu)$. Let $A_M$ be the set of linear forms on $L^p(\mu)$ of the form $l(f) = \int fg \, d\mu$, where $g \in M$. Let $T$ be a measurable function from $\mathbb{R}$ to $\mathbb{R}$ that is non polynomial and Riemann integrable on some compact interval (not reduced to one point) of $\mathbb{R}$. Then $S^T_{L^p(\mu)}(A_M)$ contains a set that is dense for the uniform norm in $C(K, \mathbb{R})$, where $K$ is any compact subset of $L^p(\mu)$ and $C(K, \mathbb{R})$ is the set of continuous functions from $K$ to $\mathbb{R}$.
Proof. We give here a sketch of the proof, which can be found in [5]. For $p < \infty$, $A_M$ is dense in $(L^p(\mu))' = L^q(\mu)$ and therefore corollary 5.1.3 of [7] applies (the hypotheses on $T$ allow us to satisfy the hypotheses of this corollary, thanks to [2]). For $p = +\infty$, we show that the set of functions defined on $L^\infty(\mu)$ by $l(f) = \int fg \, d\mu$, where $g \in M$, separates points in $K$, thanks to the approximation of elements of $L^1(\mu)$. Then, we apply theorem 5.1 of [7].
For $p = 1$, we must add an additional condition:

Corollary 2. Let $\mu$ be a finite positive compactly supported Borel measure on $\mathbb{R}^n$. Let $T$ be a measurable function from $\mathbb{R}$ to $\mathbb{R}$ that is non polynomial and Riemann integrable on some compact interval (not reduced to one point) of $\mathbb{R}$. Let $M$ be a subset of $L^\infty(\mu)$ such that any compactly supported continuous function from $\mathbb{R}^n$ to $\mathbb{R}$ can be arbitrarily well approximated, uniformly on the support of $\mu$, by elements of $M$. Let $A_M$ be the set of linear forms on $L^1(\mu)$ of the form $l(f) = \int fg \, d\mu$, where $g \in M$. Then $S^T_{L^1(\mu)}(A_M)$ contains a set that is dense for the uniform norm in $C(K, \mathbb{R})$, where $K$ is any compact subset of $L^1(\mu)$.

Proof. We approximate an element of $(L^1(\mu))' = L^\infty(\mu)$ by a compactly supported continuous function (thanks to Lusin's theorem, e.g. [6]). Then we approximate this function on the support of the measure thanks to the hypothesis on $M$. The conclusion is obtained thanks to corollary 5.1.3 of [7].
3.2 Practical consequences
Corollaries 1 and 2 show that as long as we can approximate functions in $L^q(\mu)$ or in $C(\mathbb{R}^n, \mathbb{R})$ (with elements of $M$), we can approximate continuous functions from a compact subset of $L^p(\mu)$.
From a practical point of view, $M$ can be implemented by numerical MLPs. Indeed, [1] provides density results in $L^p(\mu)$ spaces ($p < \infty$) for MLP-calculated functions, which is exactly what is needed for corollary 1. For corollary 2 we need universal approximation of continuous functions on compacta, which can again be done by numerical MLPs, according to the results of [2].

Corollaries 1 and 2 therefore allow us to conclude that, given a continuous function from a compact subset of an $L^p(\mu)$ space to $\mathbb{R}$ and a given precision, there is a functional MLP (constructed thanks to numerical MLPs) that approximates the given function to the specified accuracy. The unexpected result is that the approximating MLP uses a finite number of numerical parameters, exactly as in the case of finite dimensional inputs.
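As an illustration of how $M$ can be implemented by numerical MLPs, here is a minimal sketch (not from the paper) of a one-hidden-layer numerical MLP used as a weight function; the parametrisation `(c, d, u, v)` and the tanh activation are assumptions made for the example.

```python
import numpy as np

def numerical_mlp_weight_function(params, t, activation=np.tanh):
    """One-hidden-layer numerical MLP mapping evaluation points t to R,
    used as a candidate element of M (section 3.2).

    params = (c, d, u, v): hidden weights, hidden biases, output weights,
    output bias. This parametrisation is an illustrative assumption.
    """
    c, d, u, v = params
    hidden = activation(np.outer(np.atleast_1d(t), c) + d)  # shape (len(t), n_hidden)
    return hidden @ u + v                                   # shape (len(t),)

# Hypothetical weight function with 3 hidden units on scalar evaluation points.
rng = np.random.default_rng(0)
params = (rng.normal(size=3), rng.normal(size=3), rng.normal(size=3), 0.0)
print(numerical_mlp_weight_function(params, np.linspace(0.0, 1.0, 5)))
```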
4 Consistency
4.1 Probabilistic framework
In practical situations, input functions are not completely known but only
through a nite set of input/output pairs, i.e., (x
i
, g(x
i
)). In general, the x
i
are randomly chosen measurement points. To give a probabilistic meaning to
this model, we assume given a probability space P = (, A, P), on which is
dened a sequence of sequences of independent identically distributed random
variables, (X
j
i
)
iN
with value in Z, a metric space considered with its Borel
sigma algebra. We call P
X
the nite measure induced on Z by X = X
1
1
(this
observation measure plays the role of in the universal approximation results).
We assume dened on P a sequence of independent identically distributed
random elements (G
j
)
jN
with values in L
p
(P
X
) and we denote G = G
1
.
Let us now consider a one hidden layer functional perceptron that theoretically computes $H(a, b, w, g_j) = \sum_{i=1}^{k} a_i \, T\left(b_i + \int w_i g_j \, dP_X\right)$, where $g_j$ is a realization of $G_j$, and each $w_i$ belongs to $L^q(P_X)$. We replace this exact calculation, which is not practically possible, by the following approximation (a random variable):

$$\hat{H}_m(a, b, w, G_j) = \sum_{i=1}^{k} a_i \, T\left(b_i + \frac{1}{m} \sum_{l=1}^{m} w_i(X^j_l) \, G_j(X^j_l)\right) \qquad (2)$$

In practical settings, we will compute a realization of $\hat{H}_m$ for each input function realization, associated with its evaluation points (which are themselves realizations of the evaluation sequences).
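A minimal sketch of equation (2) (illustrative, not from the paper): the integral $\int w_i g \, dP_X$ is replaced by the empirical average over the $m$ observed evaluation points of one input function; the names and the tanh activation are assumptions made for the example.

```python
import numpy as np

def functional_mlp_empirical(a, b, weight_funcs, x_points, g_values, T=np.tanh):
    """Sketch of equation (2): the integral against P_X is replaced by
    (1/m) sum_l w_i(X_l) G(X_l) over the m observed evaluation points.

    x_points: the evaluation points X_1, ..., X_m of one input function.
    g_values: the observed values G(X_1), ..., G(X_m) of that function.
    """
    output = 0.0
    for a_i, b_i, w_i in zip(a, b, weight_funcs):
        empirical_integral = np.mean(w_i(x_points) * g_values)
        output += a_i * T(b_i + empirical_integral)
    return output
```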
4.2 Parametric approach
As explained in section 3.2, we propose to rely on numerical MLPs for the practical representation of weight functions. More generally, we can use parametric regressors, that is, an easily computable function $F$ from $W \times Z$ to $\mathbb{R}$, where $W$ is a finite dimensional weight space (numerical MLPs are obviously a special case of parametric regressors when the number of parameters is fixed). With parametric regressors, equation 2 is replaced by:

$$\hat{H}_m(a, b, w, G_j) = \sum_{i=1}^{k} a_i \, T\left(b_i + \frac{1}{m} \sum_{l=1}^{m} F_i(w_i, X^j_l) \, G_j(X^j_l)\right) \qquad (3)$$
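A corresponding sketch of equation (3) (again illustrative, not from the paper): each weight function is replaced by a parametric regressor $F_i(w_i, \cdot)$ with a finite dimensional parameter vector $w_i$, for instance a numerical MLP of the kind sketched in section 3.2; the regressor signature is an assumption made for the example.

```python
import numpy as np

def functional_mlp_parametric(a, b, regressors, w_params, x_points, g_values,
                              T=np.tanh):
    """Sketch of equation (3): the weight functions of equation (2) are
    replaced by parametric regressors F_i(w_i, .), so the whole model
    depends on a finite number of numerical parameters.

    regressors: list of functions F_i(w_i, x_points) -> values at the points.
    w_params:   list of finite dimensional parameter vectors w_i.
    """
    output = 0.0
    for a_i, b_i, F_i, w_i in zip(a, b, regressors, w_params):
        empirical_integral = np.mean(F_i(w_i, x_points) * g_values)
        output += a_i * T(b_i + empirical_integral)
    return output
```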
4.3 Consistency result
The parametric approach just proposed allows us to tune a given functional MLP for a proposed task. In regression or discrimination problems, each studied function $G_j$ is associated to a real value $Y_j$ ($(Y_j)_{j \in \mathbb{N}}$ is a sequence of independent identically distributed random variables defined on $P$, and we denote $Y = Y_1$). We want the MLP to approximate the mapping from $G_j$ to $Y_j$, and we measure the quality of this approximation thanks to a cost function, $l$. In order to simplify the presentation, we consider as in [8] that $l$ models both the cost function and the calculation done by the functional MLP, so that it can be considered as a function from $L^p(P_X) \times \mathbb{R} \times W$ to $\mathbb{R}$, where $W$ is a compact subset of $\mathbb{R}^q$ which corresponds to all the numerical parameters used in the functional MLP (this includes parameters directly used by the functional MLP as well as parameters used by embedded parametric regressors).
The goal of MLP training is to minimize $\lambda(w) = E(l(G, Y, w))$. Unfortunately, it is not possible to calculate $\lambda$ exactly, and it is therefore replaced by the random variable $\lambda_n(w) = \frac{1}{n} \sum_{j=1}^{n} l(G_j, Y_j, w)$. In [8], H. White shows that for finite dimensional input spaces and under regularity assumptions on $l$, $\lambda_n$ converges almost surely uniformly on $W$ to $\lambda$, which allows one to conclude that estimated optimal parameters (i.e., solutions to $\min_{w \in W} \lambda_n(w)$) converge almost surely to optimal parameters (i.e. to the set $W^*$ of solutions to $\min_{w \in W} \lambda(w)$). We have demonstrated in [5] that this result can be extended to infinite dimensional input spaces, i.e. to $L^p(P_X)$.
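A minimal sketch of the empirical criterion $\lambda_n$ and its minimisation over $W$ (illustrative, not from the paper): the quadratic cost, the box approximation of the compact parameter set $W$, and the optimiser are assumptions made for the example.

```python
import numpy as np

def empirical_risk(w, examples, predict):
    """lambda_n(w) = (1/n) sum_j l(G_j, Y_j, w), here with a quadratic cost.

    examples: list of (g_j, y_j) pairs, where g_j stands for one observed
    input function and y_j for the associated real target.
    predict(w, g_j): output of the (approximate) functional MLP with
    parameter vector w on input function g_j.
    """
    return np.mean([(predict(w, g_j) - y_j) ** 2 for g_j, y_j in examples])

# Hypothetical training step (assuming scipy is available): approximate the
# compact parameter set W by box constraints and minimise the empirical risk.
# from scipy.optimize import minimize
# w_hat = minimize(empirical_risk, w0, args=(examples, predict),
#                  bounds=[(-10.0, 10.0)] * len(w0)).x
```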
Unfortunately, we cannot directly rely on this result in practical situations, because we cannot compute $H$ exactly, but only $\hat{H}_m$. We therefore define $l_m$ by exactly the same rationale as $l$, except that the exact output of the functional MLP is replaced by $\hat{H}_m$. This allows us to define $\hat{\lambda}^m_n(w) = \frac{1}{n} \sum_{j=1}^{n} l_{m_j}(G_j, Y_j, w)$, where $m = \inf_{1 \leq j \leq n} m_j$, and $\hat{w}^m_n$ a solution to $\min_{w \in W} \hat{\lambda}^m_n(w)$.
We show in [5] that, under regularity assumptions on the cost function $l$ and on the parametric regressors, $\lim_{n \to \infty} \lim_{m \to \infty} d(\hat{w}^m_n, W^*) = 0$.
The practical meaning of this result is quite similar to the corresponding result for numerical MLPs: we do not make systematic errors when we estimate optimal parameters for a functional MLP using a finite number of input functions known at a finite number of measurement points, because the estimated optimal parameters converge almost surely to the true optimal parameters.
5 Conclusion
We have proposed in this paper an extension of Multi Layer Perceptrons that allows the processing of functional inputs. We have demonstrated that the proposed model is a universal approximator, as numerical MLPs are. We have also demonstrated that despite very limited knowledge (we have in general a finite number of example functions and a finite number of evaluation points for each function), it is possible to consistently estimate optimal parameters on the available data (as in the case of numerical MLPs).

These theoretical results show that functional MLPs share with numerical MLPs their fundamental properties and that they can therefore be considered as a possible way to introduce nonlinear modeling in Functional Data Analysis. Further work is needed to assess their practical usefulness on real data and to compare them both with linear FDA methods and with the basis expansion representation techniques used in FDA.
References
[1] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[2] Kurt Hornik. Some new results on neural network approximation. Neural Networks, 6(8):1069–1072, 1993.
[3] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.
[4] Jim Ramsay and Bernard Silverman. Functional Data Analysis. Springer Series in Statistics. Springer Verlag, June 1997.
[5] Fabrice Rossi, Brieuc Conan-Guez, and François Fleuret. Functional multi layer perceptrons. Technical report, LISE/CEREMADE & INRIA, 2001.
[6] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 1974.
[7] Maxwell B. Stinchcombe. Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477, 1999.
[8] Halbert White. Learning in Artificial Neural Networks: A Statistical Perspective. Neural Computation, 1(4):425–464, 1989.