Universal Approximation To Nonlinear Operators by Neural Networks With Arbitrary Activation Functions and Its Application To Dynamical Systems
Abstract - The purpose of this paper is to investigate neural network capability systematically. The main results are: 1) every Tauber-Wiener function is qualified as an activation function in the hidden layer of a three-layered neural network; 2) for a continuous function in S'(R^1) to be a Tauber-Wiener function, the necessary and sufficient condition is that it is not a polynomial; 3) the capability of approximating nonlinear functionals defined on some compact set of a Banach space, and nonlinear operators, is established; and 4) we show the possibility, by neural computation, of approximating the output of a dynamical system as a whole (not at a fixed point), thus identifying the system.

I. INTRODUCTION

Mhaskar and Micchelli [12] showed that under some restriction on the amplitude of a continuous function near infinity, any nonpolynomial function is qualified to be an activation function.

It is clear that all the aforementioned works are concerned with approximation to a continuous function defined on a compact set in R^n (a space of finite dimensions). In engineering problems such as computing the output of dynamic systems or designing neural system identifiers, however, we often encounter the problem of approximating nonlinear functionals defined on some function space, even nonlinear operators from one function space (a space of infinite dimensions) to another
Authorized licensed use limited to: Univ of Calif Merced. Downloaded on February 05,2025 at 19:07:48 UTC from IEEE Xplore. Restrictions apply.
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 6, NO. 4, JULY 1995
as a whole (not merely at a special point, cf. [10], [13]), thus to identify the system?

In this paper, we systematically give strong results regarding these issues.

The paper is organized as follows. In Section II, we review some definitions and notations. In Section III, we show that the necessary and sufficient condition for a continuous function in S'(R^1) (tempered distributions on R^1) to be a Tauber-Wiener function (for definitions, see Section II) is that it is not a polynomial, and that any Tauber-Wiener function can be used as an activation function, i.e., any nonpolynomial continuous function in S'(R^1) is an activation function. What is more interesting is that we show the approximation is equiuniform on any compact set in C(K), which is crucial in discussing approximation to continuous operators by neural networks. In Section IV, we show the capability of neural networks to approximate continuous functionals defined on some compact set in C(K), where K is a compact set in some Banach space, and through this we establish the capability of neural networks to approximate continuous operators from C(K1) to C(K2). The main results in Section IV have potential applications to computing outputs of dynamic systems and identifying the systems. This is an important issue in system identification [17], [18], and we will discuss it in more detail in Section V.

II. NOTATIONS AND DEFINITIONS

Definition 1: A function σ: R^1 → R^1 is called a sigmoidal function if it satisfies

lim_{x→-∞} σ(x) = 0,  lim_{x→+∞} σ(x) = 1.

Definition 2: If a function g: R → R (continuous or discontinuous) satisfies that all the linear combinations Σ_{i=1}^N c_i g(λ_i x + θ_i), λ_i ∈ R, θ_i ∈ R, c_i ∈ R, i = 1, 2, ..., N, are dense in every C[a, b], then g is called a Tauber-Wiener (TW) function.

Definition 3: Suppose that X is a Banach space. V ⊆ X is called a compact set in X if for every sequence {x_n}_{n=1}^∞ with all x_n ∈ V, there is a subsequence {x_{n_k}} which converges to some element x ∈ V.

It is well known that if V ⊆ X is a compact set in X, then for any δ > 0 there is a δ-net N(δ) = {x_1, ..., x_{n(δ)}}, with all x_i ∈ V, i = 1, ..., n(δ), i.e., for every x ∈ V there is some x_i ∈ N(δ) such that ||x_i − x||_X < δ.

In the sequel, we will often use the following notations:

X                some Banach space with norm ||·||_X
R^n              Euclidean space of dimension n
K                some compact set in a Banach space
C(K)             Banach space of all continuous functions defined on K, with norm ||f||_{C(K)} = max_{x∈K} |f(x)|
(TW)             all the Tauber-Wiener functions
S(R^n)           Schwartz functions in tempered distribution theory, i.e., rapidly decreasing and infinitely differentiable functions
S'(R^n)          tempered distributions, i.e., linear continuous functionals defined on S(R^n)
C^∞(R^n)         infinitely differentiable functions
C_0^∞(R^n)       infinitely differentiable functions with compact support in R^n
C_{2π}[-1, 1]^n  all periodic functions with period two with respect to every variable x_i, i = 1, ..., n

III. CHARACTERISTICS OF ACTIVATION FUNCTIONS

In this section, we prove three theorems.

Theorem 1: Suppose that g is a continuous function and g ∈ S'(R^1); then g ∈ (TW) if and only if g is not a polynomial.

Theorem 2: If σ is a bounded sigmoidal function, then σ ∈ (TW).

Theorem 3: Suppose that K is a compact set in R^n, U is a compact set in C(K), and g ∈ (TW); then for any ε > 0, there exist a positive integer N, real numbers θ_i, and vectors w_i ∈ R^n, i = 1, ..., N, which are independent of f ∈ C(K), and constants c_i(f), i = 1, ..., N, depending on f, such that

|f(x) − Σ_{i=1}^N c_i(f) g(w_i · x + θ_i)| < ε

holds for all x ∈ K and f ∈ U. Moreover, each c_i(f) is a linear continuous functional defined on U.

Remark 1: Theorem 3 shows that for a function (continuous or discontinuous) to be qualified as an activation function, a sufficient condition is that it belongs to the (TW) class. Therefore, to prove that a neural network is capable of approximating any continuous function of n variables, all we need to do is to deal with the case n = 1; thus we have reduced the complexity of the problem in terms of its dimensionality. Moreover, by examining the approximated function f(x_1, ..., x_n) = f(x_1, 0, ..., 0) = f*(x_1), where f*(x_1) is a continuous function of one variable, it is straightforward to see that the condition is also a necessary one.

Remark 2: The equiuniform convergence property in Theorem 3 will play a crucial role in approximation to nonlinear operators by neural networks.

Remark 3: When a sigmoidal function is used as an activation in a neural network, Theorem 2 shows that the only condition imposed on it is its boundedness. In contrast, in almost all other papers [1]-[8], sigmoidal functions must be assumed to be either continuous or monotone.

Remark 4: In [12], a result similar to Theorem 1 was obtained under more restrictions imposed on g, i.e., that there are a positive integer N and a constant C_N such that |(1 + |x|)^{-N} g(x)| ≤ C_N for all x ∈ R^1. This restriction is essential for [12], for the proof in [12] depends heavily on a variation of the Paley-Wiener theorem. In Theorem 1, however, we only assume that g ∈ C(R^1) ∩ S'(R^1), which is weaker than the assumptions used in [12].

Proof of Theorem 1: We will prove by contradiction. If the set of all the linear combinations Σ_{i=1}^N c_i g(λ_i x + θ_i) is not dense in C[a, b], then the Hahn-Banach extension theorem and the Riesz representation of linear continuous functionals show
that there is a signed Borel measure dμ with supp(dμ) ⊆ [a, b] and

∫_a^b g(λx + θ) dμ(x) = 0

for all λ ≠ 0 and θ ∈ R^1. Take any ω ∈ S(R^1), let λx + θ = u, and change the order of integration.

Proof of Theorem 2: From the assumption, there exists W > 0 such that if u > W, then |σ(u) − 1| < 1/M²; if u < −W, then |σ(u)| < 1/M². Let K > 0 be such that K · (1/2M) > W, and set t_i = (x_i + x_{i+1})/2, t_{−1} = −1 − 1/(2M). Construct

g(x) = f(−1) σ(K(x − t_{−1})) + Σ_{i=1}^N [f(x_i) − f(x_{i−1})] σ(K(x − t_{i−1}))    (9)

Then

|g(x) − f(x)| < ε  for all x ∈ [−1, 1].    (10)

... for every f ∈ U and x ∈ [−1, 1]^n, provided that α > (n − 1)/2. The proof of Lemma 4 can be found in [14].
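As a numerical illustration (not part of the paper's argument), the construction in (9) can be reproduced directly: a telescoping sum of sharply scaled, shifted sigmoids forms a staircase tracking f. The logistic choice of σ, the partition size M, and the gain are arbitrary assumptions for this demo.

```python
import numpy as np

def sigma(u):
    # A bounded sigmoidal function (logistic); clipping avoids overflow warnings.
    return 1.0 / (1.0 + np.exp(-np.clip(u, -50.0, 50.0)))

def staircase(f, M, gain):
    """Construction (9): steps of shifted sigmoids telescope to f's values
    on a partition of [-1, 1]."""
    xi = np.linspace(-1.0, 1.0, M + 1)      # partition points x_0, ..., x_M
    ti = 0.5 * (xi[:-1] + xi[1:])           # t_i = (x_i + x_{i+1}) / 2
    t_m1 = -1.0 - 1.0 / (2.0 * M)           # t_{-1}
    def g(x):
        out = f(xi[0]) * sigma(gain * (x - t_m1))
        for i in range(1, M + 1):
            out = out + (f(xi[i]) - f(xi[i - 1])) * sigma(gain * (x - ti[i - 1]))
        return out
    return g

f = np.cos
g = staircase(f, M=400, gain=1e4)
xs = np.linspace(-1.0, 1.0, 1001)
sup_err = np.max(np.abs(g(xs) - f(xs)))     # shrinks as M and the gain grow
```

Increasing M refines the staircase step, while a gain with K · (1/2M) > W keeps each sigmoid transition much narrower than a partition cell: exactly the two requirements used in the proof.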
Proof of Theorem 3: Without loss of generality, we can assume that K ⊆ [0, 1]^n. By Lemma 3, we can assume that K = [−1, 1]^n and U ⊆ C_{2π}[−1, 1]^n. By Lemma 4, for any ε > 0, there exists R > 0 such that for any x = (x_1, ..., x_n) ∈ [−1, 1]^n and f ∈ U, there holds

|f(x) − Σ_{i=1}^N c_i(f) g(w_i · x + θ_i)| < ε.

... is a continuous functional defined on U (and also a continuous functional defined on V), and c_i(f), being a finite linear combination of the Fourier coefficients of f*, is surely a continuous functional defined on V. The proof of Theorem 3 is completed.

... |u(x′) − u(x″)| < δ for all u ∈ V, provided that ||x′ − x″||_X < δ.

Now pick a sequence ε_1 > ε_2 > ... > ε_k → 0; then we can find another sequence δ_1 > δ_2 > ... > δ_k → 0 such that |f(u) − f(v)| < ε_k for all u, v ∈ V provided that ||u − v||_{C(K)} < 2δ_k, for f is a continuous functional defined on the compact set V.

By Lemma 6, we can also find η_1 > η_2 > ... > η_k → 0 such that |u(x′) − u(x″)| < δ_k for all u ∈ V, whenever x′, x″ ∈ K and ||x′ − x″||_X < η_k.
... for j = 1, ..., n(η_k). It is easy to verify that {T_{η_k,j}(x)} is a partition of unity, i.e.,

Σ_{j=1}^{n(η_k)} T_{η_k,j}(x) = 1    (26)

and

T_{η_k,j}(x) = 0  if ||x − x_j||_X > η_k.    (27)

For each u ∈ V, define a function

u_{η_k}(x) = Σ_{j=1}^{n(η_k)} u(x_j) T_{η_k,j}(x)    (28)

and let V_{η_k} = {u_{η_k} : u ∈ V} and V* = V ∪ (∪_{k=1}^∞ V_{η_k}). We then have the following result.

Lemma 7:
1) For each fixed k, V_{η_k} is a compact set in a subspace of dimension n(η_k) in C(K).
2) For every u ∈ V, there holds

||u − u_{η_k}||_{C(K)} ≤ δ_k.    (29)

3) V* is a compact set in C(K).

Proof: We will prove the three propositions individually as follows.

1) For a fixed k, let u_{η_k}^{(i)}, i = 1, 2, ..., be a sequence in V_{η_k} and u^{(i)} be a sequence in V such that

u_{η_k}^{(i)}(x) = Σ_{j=1}^{n(η_k)} u^{(i)}(x_j) T_{η_k,j}(x).    (30)

Since V is a compact set in C(K), there is a subsequence u^{(i_l)}(x) which converges to some u ∈ V; then it is obvious that u_{η_k}^{(i_l)}(x) converges to u_{η_k}(x) ∈ V_{η_k}, i.e., V_{η_k} is a compact subset in C(K).

2) By the definition and the property of the partition of unity, we have

u(x) − u_{η_k}(x) = Σ_{j=1}^{n(η_k)} [u(x) − u(x_j)] T_{η_k,j}(x) = Σ_{||x−x_j||_X ≤ η_k} [u(x) − u(x_j)] T_{η_k,j}(x).    (31)

Since T_{η_k,j}(x) vanishes unless ||x − x_j||_X ≤ η_k, in which case |u(x) − u(x_j)| < δ_k by the choice of η_k, and since the T_{η_k,j}(x) sum to one, (29) follows.

3) Suppose {u^i}_{i=1}^∞ is a sequence in V*. If there is a subsequence {u^{i_l}}_{l=1}^∞ of {u^i} with all u^{i_l} ∈ V, l = 1, 2, ..., then by the fact that V is compact, there is a subsequence of {u^{i_l}} which converges to some u ∈ V. Otherwise, to each u^i there corresponds a positive integer k(i) and a v^i ∈ V such that u^i = v^i_{η_{k(i)}}. There are two possibilities: i) we can find infinitely many {i_l}_{l=1}^∞ and a fixed k_0 such that η_{k(i_1)} = η_{k(i_2)} = ... = η_{k_0}, i.e., u^{i_l} ∈ V_{η_{k_0}} for all i_l. By proposition 1) of this lemma, V_{η_{k_0}} is a compact set, so there is a subsequence of {u^{i_l}} which converges to some u ∈ V_{η_{k_0}}, i.e., there is a subsequence of {u^i} converging to u ∈ V_{η_{k_0}}. ii) There are sequences i_1 < i_2 < ... → ∞ and k(i_1) < k(i_2) < ... → ∞ such that u^{i_l} ∈ V_{η_{k(i_l)}}. Let v^{i_l} ∈ V be such that u^{i_l} = v^{i_l}_{η_{k(i_l)}}. Since v^{i_l} ∈ V and V is compact, we see that there is a subsequence of {v^{i_l}}_{l=1}^∞ which converges to some v ∈ V. By proposition 2) of this lemma, the corresponding subsequence of {u^{i_l}} also converges to v. Thus the compactness of V* is proved.

Proof of Theorem 4: By the Tietze extension theorem, we can define a continuous functional f* on V* such that

f*(u) = f(u)  if u ∈ V.    (34)

Because f* is a continuous functional defined on the compact set V*, for any ε > 0 we can find a δ > 0 such that |f*(u) − f*(v)| < ε/2, provided that u, v ∈ V* and ||u − v||_{C(K)} < δ.

Let k be fixed such that δ_k < δ; then by (29), for every u ∈ V,

||u − u_{η_k}||_{C(K)} ≤ δ_k    (35)

which implies

|f*(u) − f*(u_{η_k})| < ε/2    (36)

for all u ∈ V.

By proposition 1) of Lemma 7, we see that f*(u_{η_k}) is a continuous functional defined on the compact set V_{η_k} in R^{n(η_k)}. By Theorem 3, we can find N, c_i, ξ_{ij}, θ_i, i = 1, ..., N, j = 1, ..., n(η_k), such that

|f*(u_{η_k}) − Σ_{i=1}^N c_i g(Σ_{j=1}^m ξ_{ij} u(x_j) + θ_i)| < ε/2.

Combining it with (36), we conclude that

|f(u) − Σ_{i=1}^N c_i g(Σ_{j=1}^m ξ_{ij} u(x_j) + θ_i)| < ε

where m = n(η_k). Thus, Theorem 4 is proved.
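The mechanism of Theorem 4 - reduce a continuous functional f on a compact set V ⊂ C(K) to a continuous function of the finitely many sampled values u(x_1), ..., u(x_m) on an η-net, then apply Theorem 3 - can be sketched numerically. Everything concrete below (the input family, the functional F(u) = ∫_0^1 u(t)² dt, the sensor count m, and the random-feature least-squares fit) is an illustrative assumption, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

m = 20                                      # sensors x_1..x_m: a net of K = [0, 1]
xj = np.linspace(0.0, 1.0, m)

# Compact family V = {u_a(t) = sin(a t) : a in [0.1, 3]} and the nonlinear
# continuous functional F(u) = integral_0^1 u(t)^2 dt = 1/2 - sin(2a)/(4a).
def u_samples(a):
    return np.sin(a * xj)

def F_true(a):
    return 0.5 - np.sin(2.0 * a) / (4.0 * a)

a_train = np.linspace(0.1, 3.0, 200)
U = np.array([u_samples(a) for a in a_train])        # (200, m)
y = np.array([F_true(a) for a in a_train])

# Network form of Theorem 4: sum_i c_i g(sum_j xi_ij u(x_j) + theta_i).
N = 100
Xi = rng.normal(size=(N, m)) / np.sqrt(m)            # xi_ij
theta = rng.uniform(-1.0, 1.0, size=N)
H = np.tanh(U @ Xi.T + theta)                        # (200, N)
c = np.linalg.lstsq(H, y, rcond=None)[0]

a_test = np.linspace(0.15, 2.95, 57)                 # held-out inputs u_a
H_test = np.tanh(np.array([u_samples(a) for a in a_test]) @ Xi.T + theta)
test_err = np.max(np.abs(H_test @ c - np.array([F_true(a) for a in a_test])))
```

The network never sees u itself, only the m sampled values u(x_j), which is exactly the finite-dimensional reduction the η-net and partition of unity provide.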
|G(u)(y) − Σ_{k=1}^N c_k(G(u)) g(w_k · y + ζ_k)| < ε    (39)
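The two-network structure behind (39) - one network acting on sampled input values u(x_j), one acting on the evaluation point y, combined bilinearly - can be sketched as follows. The operator G(u)(y) = ∫_0^y u(t) dt, the input family, the feature widths, and the least-squares fit of the coefficient matrix are all illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(2)

m, M, N = 20, 24, 24          # sensors, branch width, trunk width
xj = np.linspace(0.0, 1.0, m)

# Operator to learn: G(u)(y) = integral_0^y u(t) dt on inputs u_a(t) = cos(a t),
# for which G(u_a)(y) = sin(a y) / a.
def u_samples(a):
    return np.cos(a * xj)

a_grid = np.linspace(0.5, 2.5, 40)
y_grid = np.linspace(0.0, 1.0, 40)

# Branch features g(sum_j xi_ij u(x_j) + theta_i); trunk features g(w_k y + zeta_k).
Xi = rng.normal(size=(M, m)) / np.sqrt(m)
theta = rng.uniform(-1.0, 1.0, M)
w = rng.normal(scale=2.0, size=N)
zeta = rng.uniform(-2.0, 2.0, N)

B = np.tanh(np.array([u_samples(a) for a in a_grid]) @ Xi.T + theta)   # (40, M)
T = np.tanh(np.outer(y_grid, w) + zeta)                                # (40, N)

# Targets G(u_a)(y) and bilinear least-squares fit of the matrix C (M x N):
Gtab = np.sin(np.outer(a_grid, y_grid)) / a_grid[:, None]              # (40, 40)
design = np.einsum('ai,yk->ayik', B, T).reshape(40 * 40, M * N)
C = np.linalg.lstsq(design, Gtab.reshape(-1), rcond=None)[0].reshape(M, N)

# Held-out check at a new (a, y) pair:
a_t, y_t = 1.3, 0.77
b = np.tanh(u_samples(a_t) @ Xi.T + theta)
t = np.tanh(w * y_t + zeta)
err = abs(b @ C @ t - np.sin(a_t * y_t) / a_t)
```

The inner factor plays the role of the functional network of Theorem 4 evaluated on sampled inputs, while the outer factor g(w_k · y + ζ_k) restores the dependence on the evaluation point y.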
CHEN AND CHEN: UNIVERSAL APPROXIMATION TO NONLINEAR OPERATORS
If the system is linear, then G(u)(y) can be simplified as

Σ_{k=1}^N Σ_{i=1}^M Σ_{j=1}^m c_{ijk} u(x_j) g(w_k · y + ζ_k).

The larger the values of N, M, m are, the better accuracy we will obtain for this approximation. Therefore, we have pointed to a way of constructing neural network models for identifying dynamic systems.

VI. CONCLUSION

In this paper, the problem of approximating functions of several variables, functionals, and nonlinear operators is thoroughly studied. The necessary and sufficient condition for a continuous function in S'(R^1) to be qualified as an activation function is given, which is a broad generalization of previous results [1]-[8], especially [12]. It is also pointed out that to prove neural network approximation capability, one needs only to treat the one-dimensional case. As applications, we show how to construct neural networks to approximate the output of a dynamical system as a whole, not merely at a fixed point, and thus show the capability of neural networks in identifying dynamic systems. Moreover, we point out that using existing algorithms in the literature (for example, the backpropagation algorithm), we can determine those parameters in the network, i.e., identify the system.

ACKNOWLEDGMENT

The authors wish to express their gratefulness to the reviewers for their valuable comments and suggestions on revising this paper.

REFERENCES

[1] A. Wieland and R. Leighton, "Geometric analysis of neural network capacity," in Proc. IEEE First ICNN, vol. 1, 1987, pp. 385-392.
[2] B. Irie and S. Miyake, "Capacity of three-layered perceptrons," in Proc. IEEE ICNN, vol. 1, 1988, pp. 641-648.
[3] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Contr., Signals Syst., vol. 2, no. 4, pp. 303-314, 1989.
[4] S. M. Carroll and B. W. Dickinson, "Construction of neural nets using the Radon transform," in Proc. IJCNN, vol. 1, 1989, pp. 607-611.
[5] K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, pp. 183-192, 1989.
[6] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[7] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, pp. 251-257, 1991.
[8] V. Y. Kreinovich, "Arbitrary nonlinearity is sufficient to represent all functions by neural networks: A theorem," Neural Networks, vol. 4, pp. 381-383, 1991.
[9] T. Chen, H. Chen, and R. Liu, "A constructive proof of Cybenko's approximation theorem and its extensions," in Proc. 22nd Symp. Interface, East Lansing, MI, May 1990, pp. 163-168. Also submitted for publication.
[10] I. W. Sandberg, "Approximation theorems for discrete-time systems," IEEE Trans. Circuits Syst., vol. 38, no. 5, pp. 564-566, May 1991.
[11] I. W. Sandberg, "Approximations for nonlinear functionals," IEEE Trans. Circuits Syst., vol. 39, no. 1, pp. 65-67, Jan. 1992.
[12] H. N. Mhaskar and C. A. Micchelli, "Approximation by superposition of sigmoidal and radial basis functions," Advances in Applied Mathematics, vol. 13, pp. 350-373, 1992.
[13] T. Chen and H. Chen, "Approximation to continuous functionals by neural networks with application to dynamical systems," IEEE Trans. Neural Networks, vol. 4, no. 6, Nov. 1993.
[14] E. M. Stein and G. Weiss, Introduction to Fourier Analysis on Euclidean Spaces. Princeton, NJ: Princeton Univ. Press, 1971.
[15] E. M. Stein, Singular Integrals and Differentiability Properties of Functions. Princeton, NJ: Princeton Univ. Press, 1970.
[16] J. Dieudonné, Foundations of Modern Analysis. New York and London: Academic, 1969, p. 142.
[17] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamic systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4-27, 1990.
[18] K. S. Narendra and K. Parthasarathy, "Gradient methods for the optimization of dynamical systems containing neural networks," IEEE Trans. Neural Networks, vol. 2, pp. 252-262, 1991.
[19] T. Chen, H. Chen, and R. Liu, "Approximation capability in C(R^n) by multilayer feedforward networks and related problems," IEEE Trans. Neural Networks, vol. 6, no. 1, Jan. 1995.

Tianping Chen, for photograph and biography, please see this TRANSACTIONS, p. 910.

Hong Chen, for photograph and biography, please see this TRANSACTIONS, p. 910.