
Communicated by Halbert White

Kolmogorov’s Theorem Is Relevant

Věra Kůrková

Institute of Computer Science, Czechoslovak Academy of Sciences,
P. O. Box 5, 182 07 Prague 8, Czechoslovakia

We show that Kolmogorov's theorem on representations of continuous functions of n variables by sums and superpositions of continuous functions of one variable is relevant in the context of neural networks. We give a version of this theorem with all of the one-variable functions approximated arbitrarily well by linear combinations of compositions of affine functions with some given sigmoidal function. We derive an upper estimate of the number of hidden units.

Hecht-Nielsen (1987) suggested that a remarkable mathematical result of Kolmogorov (1957) could provide new insights and tools for understanding multilayer neural networks. There are several theorems in different branches of mathematics named after this great Russian mathematician. The one mentioned by Hecht-Nielsen was a theorem disproving Hilbert's conjecture formulated as the thirteenth of the famous list of 23 open problems that Hilbert supposed to be of the greatest importance for the development of mathematics in this century.
The thirteenth problem, although formulated as a concrete minor hypothesis, is connected with the basic problem of algebra: the solution of polynomial equations. Could roots of a general algebraic equation of higher degree be expressed, analogously to the solution by radicals, by sums and compositions of one-variable functions of some suitable type? Hilbert conjectured that some continuous functions of three variables are not representable by sums and superpositions even of functions of two variables. This was refuted by Arnold (1957). Kolmogorov (1957) even proved a general representation theorem stating that any continuous function f defined on an n-dimensional cube is representable by sums and superpositions of continuous functions of only one variable.
Kolmogorov's formula

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \varphi_q\left( \sum_{p=1}^{n} \psi_{pq}(x_p) \right) \tag{1.1}$$

readily brings to mind perceptron type networks with the qualification that the one-variable functions $\varphi_q$ ($q = 1, \ldots, 2n+1$) and $\psi_{pq}$ ($p = 1, \ldots, n$; $q = 1, \ldots, 2n+1$) are far from being any of the type of functions currently used in neurocomputing. In fact, having even fractal graphs, they are highly nonsmooth.
This was the reason for Girosi and Poggio's (1989) criticism of Hecht-Nielsen's proposal. They formulated two main reservations:

1. The functions $\psi_{pq}$ are highly nonsmooth.
2. The functions $\varphi_q$ depend on the specific function f and hence are not representable in a parameterized form.
We shall show that by replacing the equality in equation 1.1 by only an approximation, we can eliminate both of these difficulties. Highly nonsmooth functions encountered in mathematics are mostly constructed as limits or sums of infinite series of smooth functions. This is the case, e.g., with the classical Weierstrass function with no derivative at any point and many other famous examples of functions with fractal graphs. Since in the context of neural networks we are interested only in approximations of functions, the only problem concerning the possible relevance of Kolmogorov's theorem for neurocomputing is whether Kolmogorov's construction can be modified in such a way that all of the one-variable functions are limits of sequences of smooth functions used in perceptron type networks.
By a perceptron type network we mean a multilayer network where units in each hidden layer sum up weighted inputs from the preceding layer, add to this sum a constant (bias), and then apply a sigmoidal nonlinearity, while units in the output layer sum only weighted inputs. So functions used in perceptron type networks are finite linear combinations of compositions of affine transformations of the real line $E_1$ with some given sigmoidal function (a function $\sigma : E_1 \to [0,1]$ with $\lim_{t \to -\infty} \sigma(t) = 0$ and $\lim_{t \to \infty} \sigma(t) = 1$). We call them staircase-like functions of a sigmoidal type (or of a type $\sigma$).
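To make the definition concrete, here is a minimal numerical sketch in Python. It assumes the logistic function as the given sigmoidal function $\sigma$; all weights, slopes, and shifts below are illustrative choices, not values from any construction in the paper.

```python
import math

def logistic(t: float) -> float:
    """A sigmoidal function: tends to 0 as t -> -infinity and to 1 as t -> +infinity."""
    return 1.0 / (1.0 + math.exp(-t))

def staircase_like(x: float, params) -> float:
    """A staircase-like function of a type sigma: a finite linear combination
    of compositions of affine maps of the real line with the sigmoid,
    sum_i a_i * sigma(b_i * x + c_i)."""
    return sum(a * logistic(b * x + c) for a, b, c in params)

# With steep slopes each sigmoid approximates a step; the sum approximates
# a two-step staircase rising by 0.5 near x = 1 and again near x = 2.
params = [(0.5, 50.0, -50.0), (0.5, 50.0, -100.0)]
for x in (0.5, 1.5, 2.5):
    print(x, staircase_like(x, params))   # ~0.0, ~0.5, ~1.0
```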
Kolmogorov's construction of the functions $\varphi_q$ and $\psi_{pq}$ and their later improvements by Lorentz (1962) and Sprecher (1965) are, in fact, perfectly suited for staircase-like functions of any sigmoidal type. Being very complex, all of these arguments contain a lot of unnecessary assumptions. But the only really relevant property of the functions used in the inductive construction of the one-variable functions $\varphi_q$ and $\psi_{pq}$ is that they have prescribed values on finitely many closed intervals; elsewhere they can be arbitrary, provided they are sufficiently bounded. However, such functions can be approximated arbitrarily well by staircase-like functions of any sigmoidal type (Kůrková 1991).
To illustrate the idea of Kolmogorov's construction of the functions $\psi_{pq}$, recall the classical Devil's staircase (Fig. 1). Kolmogorov, probably inspired by this nineteenth-century construction, developed "the second generation Devil's staircase," something Mandelbrot (1982) would appreciate, by replacing in each induction step the already constructed Devil's staircase's steps (within a very small neighborhood of each) by smaller steps.
Figure 1: Devil's staircase.

The result was a strictly increasing function with, in contrast to the rectifiable classical Devil's staircase, a fractal graph. Nevertheless, both first and second generation Devil's staircases are limits of uniformly converging series of staircase-like functions of any sigmoidal type.
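For readers who want to experiment, the classical staircase can be computed by a simple recursion. A minimal sketch (Python), assuming the identity as the starting function and a finite recursion depth; the iterates converge uniformly, as discussed above.

```python
def cantor(x: float, depth: int = 30) -> float:
    """Approximate the classical Devil's staircase (Cantor function) on [0, 1].
    Each recursion level refines the staircase with smaller steps; the
    sequence of iterates converges uniformly to the limit function."""
    if depth == 0:
        return x  # starting function; any continuous choice works
    if x < 1.0 / 3.0:
        return 0.5 * cantor(3.0 * x, depth - 1)
    if x <= 2.0 / 3.0:
        return 0.5  # the staircase is constant on the middle third
    return 0.5 + 0.5 * cantor(3.0 * x - 2.0, depth - 1)

print(cantor(0.25), cantor(0.5), cantor(0.75))  # ~1/3, 0.5, ~2/3
```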
In contrast to the functions $\psi_{pq}$, being for the given dimension n universal, the functions $\varphi_q$ depend on f. However, they can also be constructed as limits of staircase-like functions of any sigmoidal type. Consider, for staircase-like functions $\psi_p$ of any sigmoidal type, the function $\Psi$ defined on the n-dimensional cube by $\Psi(x_1, \ldots, x_n) = \sum_{p=1}^{n} \psi_p(x_p)$. $\Psi$ defines on the cube a Rubik's cube-like structure with small boxes having edges corresponding to the steps of $\psi_p$ and gaps corresponding to the slopes of $\psi_p$. Suppose that the small boxes are mapped by $\Psi$ into closed mutually disjoint subintervals of the real line. Ascribing to these intervals values of f at chosen points in the small boxes that $\Psi$ maps into these intervals, we define a finite family of steps that can be approximated arbitrarily well by a staircase-like function $\varphi$ of a given sigmoidal type. This function $\varphi$ is representable in a parameterized form with the values of parameters depending on f. The function $\varphi \circ \Psi$ approximates f on the subset of the cube formed by the union of all small boxes. The smaller the steps of $\psi_p$, the better the approximation. However, f is not approximated on the gaps. Now, we come to the reason why there are
2n + 1 terms under the summation in equation 1.1. By suitable shifts of the slopes of the staircase we can gain 2n + 1 Rubik's cube-like structures on the unit cube covering the n-dimensional cube sufficiently well in such a way that for each point there are more structures containing it in a box than structures containing it in a gap. We need 2n + 1 such structures since, at some point of the cube, it may happen that each of its n coordinates is contained in the gaps of a different structure (at most n), leaving at least n + 1 of the 2n + 1 structures containing the point in a box.
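The counting can be checked with a toy model. The sketch below (Python) makes a simplifying assumption that is not Kolmogorov's actual construction: each unit period of the staircase splits into 2n + 1 slots, and the gaps of the q-th structure occupy exactly the q-th slot. Then each coordinate lies in the gap of at most one structure, so every point lies in a box of at least n + 1 of the 2n + 1 structures.

```python
import random

n = 2          # dimension of the cube (illustrative choice)
m = 2 * n + 1  # number of shifted Rubik's cube-like structures

def in_gap(x: float, q: int) -> bool:
    """True if coordinate x falls into the gap of structure q.
    Simplification: each unit period is split into m slots and
    structure q's gaps occupy exactly slot q."""
    return int((x % 1.0) * m) % m == q

for _ in range(10_000):
    point = [random.random() for _ in range(n)]
    # A structure contains the point in a box iff no coordinate is in its gap.
    boxes = sum(1 for q in range(m) if not any(in_gap(x, q) for x in point))
    assert boxes >= n + 1  # a majority of the 2n + 1 structures
print("majority property holds at all sampled points")
```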
These are, roughly speaking, the ideas behind the proofs of the fol-
lowing theorems.

Theorem 1. (Kůrková 1991). Let $n, m$ be natural numbers with $n \geq 2$, $m \geq 2n + 1$, and $\sigma : E_1 \to [0,1]$ be any sigmoidal function. Then there exist real numbers $w_{pq}$ ($p = 1, \ldots, n$; $q = 1, \ldots, m$) and functions $\psi_q$ ($q = 1, \ldots, m$), being limits of uniformly converging sequences of staircase-like functions of a type $\sigma$, such that for every continuous function $f : [0,1]^n \to E_1$ there exists a continuous function $\varphi : E_1 \to E_1$, being a limit of a uniformly converging sequence of staircase-like functions of a type $\sigma$, such that for every $(x_1, \ldots, x_n) \in [0,1]^n$

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{m} \varphi\left( \sum_{p=1}^{n} w_{pq}\, \psi_q(x_p) \right)$$

Theorem 2. (Kůrková 1991). Let $n \geq 2$ be a natural number, $\sigma : E_1 \to [0,1]$ be a sigmoidal function, $f : [0,1]^n \to E_1$ be a continuous function, and $\varepsilon$ be a positive real number. Then there exist a natural number $k$ and staircase-like functions of a type $\sigma$, $\varphi_i, \psi_{pi}$ ($i = 1, \ldots, k$; $p = 1, \ldots, n$), such that for every $(x_1, \ldots, x_n) \in [0,1]^n$

$$\left| f(x_1, \ldots, x_n) - \sum_{i=1}^{k} \varphi_i\left( \sum_{p=1}^{n} \psi_{pi}(x_p) \right) \right| < \varepsilon$$

Theorem 2 implies that any continuous function can be approximated arbitrarily well by a four-layer perceptron type network. However, several recent results (Funahashi 1989; Hecht-Nielsen 1989; Hornik et al. 1989; Cybenko 1989; Carroll and Dickinson 1989; Stinchcombe and White 1989, 1990; Hornik 1991) established that three layers are sufficient for approximations of general continuous functions.
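The structure of the approximant in Theorem 2 translates directly into a perceptron type network with two hidden layers. The following Python sketch shows only the wiring, under the assumption of a logistic sigmoid; the parameters are placeholders, not the values produced by the proof's construction.

```python
import math

def sigma(t: float) -> float:
    """Logistic sigmoid, standing in for the given sigmoidal function."""
    return 1.0 / (1.0 + math.exp(-t))

def staircase(x: float, params) -> float:
    """Staircase-like function: sum_j a * sigma(b * x + c)."""
    return sum(a * sigma(b * x + c) for a, b, c in params)

def four_layer_net(x, psi_params, phi_params) -> float:
    """Network computing sum_{i=1}^{k} phi_i( sum_{p=1}^{n} psi_pi(x_p) ):
    the first hidden layer realizes the staircase-like functions psi_pi,
    the second realizes the phi_i, and the output unit only sums."""
    total = 0.0
    for i, phi_i in enumerate(phi_params):
        inner = sum(staircase(x[p], psi_params[i][p]) for p in range(len(x)))
        total += staircase(inner, phi_i)
    return total

# Illustrative call with n = 2 inputs and k = 2 terms; all parameters arbitrary.
psi = [[[(1.0, 10.0, -5.0)], [(1.0, 10.0, -5.0)]],
       [[(0.5, 20.0, -10.0)], [(0.5, 20.0, -10.0)]]]
phi = [[(1.0, 5.0, -2.0)], [(0.3, 5.0, -4.0)]]
print(four_layer_net([0.4, 0.7], psi, phi))
```

The two hidden layers correspond to the inner functions $\psi_{pi}$ and the outer functions $\varphi_i$; the output unit only sums weighted inputs, matching the definition of a perceptron type network given above.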
Nevertheless, the approach based on the technique developed by Kolmogorov is not without value. The above-mentioned theorems are proved very elegantly using advanced theorems from functional analysis. However, such indirect proofs do not provide clear insight into constructions of approximating functions. The directness of our proofs can be exploited for estimating the number of hidden units and for exploring which properties of a function being approximated are relevant for the growth of this number. The first step in this direction was taken in Kůrková (1991), where the numbers of units in the second and the third layer are estimated by $nm(m+1)$ and $m^2(m+1)^n$ respectively, where $n$ is the dimension of the unit cube $I^n$ and $m$ depends on $\varepsilon/\|f\|$ as well as on the rate with which $f$ increases distances. Hopefully, further analysis could bring finer estimates and more insight into the questions of what properties of the function being implemented play a role in determining the number of hidden units, and whether this number can be sufficiently reduced by using two instead of only one hidden layer.
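As a quick illustration of how these estimates grow, the sketch below evaluates them for the smallest admissible $m = 2n + 1$; the particular values of $n$ are arbitrary examples.

```python
def hidden_unit_bounds(n: int, m: int):
    """Upper estimates of the numbers of units in the second and the
    third layer: n*m*(m+1) and m^2*(m+1)^n respectively."""
    return n * m * (m + 1), m ** 2 * (m + 1) ** n

for n in (2, 3, 4):
    m = 2 * n + 1                  # smallest m allowed by Theorem 1
    second, third = hidden_unit_bounds(n, m)
    print(f"n={n}, m={m}: second layer <= {second}, third layer <= {third}")
# n=2, m=5: second layer <= 60,  third layer <= 900
# n=3, m=7: second layer <= 168, third layer <= 25088
# n=4, m=9: second layer <= 360, third layer <= 810000
```

The third-layer estimate grows exponentially with $n$, which is one reason the finer estimates called for above would be valuable.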

References

Alexandrov, P. S. (ed.) 1983. Die Hilbertschen Probleme. Akademische Verlagsgesellschaft, Leipzig.
Arnold, V. I. 1957. On functions of three variables. Dokl. Akad. Nauk USSR 114, 679-681.
Carroll, S. M., and Dickinson, B. W. 1989. Construction of neural nets using the
Radon transform. In Proceedings of the International Joint Conference on Neural
Networks, pp. I, 607-611. IEEE, New York.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303-314.
Funahashi, K. 1989. On the approximate realization of continuous mappings
by neural networks. Neural Networks 2, 183-192.
Girosi, F., and Poggio, T. 1989. Representation properties of networks: Kol-
mogorov’s theorem is irrelevant. Neural Comp. 1, 465-469.
Hecht-Nielsen, R. 1987. Kolmogorov's mapping neural network existence theorem. In Proceedings of the International Conference on Neural Networks, pp. III, 11-14. IEEE, New York.
Hecht-Nielsen, R. 1989. Theory of the back-propagation neural network. In
Proceedings of the International Joint Conference on Neural Networks, pp. I, 593-
606. IEEE, New York.
Hecht-Nielsen, R. 1990. Neurocomputing. Addison-Wesley, New York.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251-257.
Kolmogorov, A. N. 1957. On the representations of continuous functions of
many variables by superpositions of continuous functions of one variable
and addition. Dokl. Akad. Nauk USSR 114 (5), 953-956.
Kůrková, V. 1991. Kolmogorov's theorem and multilayer neural networks. Neural Networks (in press).
Lorentz, G. G. 1962. Metric entropy, widths, and superpositions of functions.
Am. Math. Monthly 69, 469-485.
Mandelbrot, B. B. 1982. The Fractal Geometry of Nature. Freeman, San Francisco.

Sprecher, D. A. 1965. On the structure of continuous functions of several variables. Trans. Am. Math. Soc. 115, 340-355.
Stinchcombe, M., and White, H. 1989. Universal approximation using feed-
forward networks with non-sigmoid hidden layer activation functions. In
Proceedings of the International Joint Conference on Neural Networks, pp. I, 613-
617. IEEE, New York.
Stinchcombe, M., and White, H. 1990. Approximating and learning unknown
mappings using multilayer feedforward networks with bounded weights.
In Proceedings of the International Joint Conference on Neural Networks, pp. III, 7-16. IEEE, New York.

Received 20 December 1990; accepted 6 June 1991.
