0% found this document useful (0 votes)

6 views8 pages

Dasgupta and Schnitger - 1992 - The Power of Approximating A Comparison of Activation Functions Paper

This document compares various activation functions in terms of their approximation power within feedforward neural networks, focusing on both analog and boolean inputs. It discusses the efficiency and quality of approximations, measuring them through network size, layer count, and Lipschitz-bounds, while presenting theorems that establish equivalences among different activation functions. The findings indicate that certain functions, such as the standard sigmoid, possess superior approximation capabilities compared to others like polynomials and linear splines.

Uploaded by

Debtanu Seth

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

6 views8 pages

Dasgupta and Schnitger - 1992 - The Power of Approximating A Comparison of Activation Functions Paper

Uploaded by

Debtanu Seth

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

The Power of Approximating: a

Comparison of Activation Functions

Bhaskar DasGupta Georg Schnitger

Department of Computer Science Department of Computer Science
University of Minnesota The Pennsylvania State University
Minneapolis, MN 55455-0159 University Park, PA 16802
email: dasgupta~cs.umn.edu email: georg~cs.psu.edu

Abstract
We compare activation functions in terms of the approximation
power of their feedforward nets. We consider the case of analog as
well as boolean input.

1 Introduction
We consider efficient approximationsofa given multivariate function I: [-1, l]m-+
n by feedforward neural networks. We first introduce the notion of a feedforward
net.
Let r be a class of real-valued functions, where each function is defined on some
subset of n. A r-net C is an unbounded fan-in circuit whose edges and vertices
are labeled by real numbers. The real number assigned to an edge (resp. vertex)
is called its weight (resp. its threshold). Moreover, to each vertex v an activation
function IV E r is assigned. Finally, we assume that C has a single sink w.
The net C computes a function Ie : [-1,11 m --+ n as follows. The components
of the input vector x = (Xl, . .. , x m ) E [-1, 11 m are assigned to the sources of C.
Let Vl, ••• , Vn be the immediate predecessors of a vertex v. The input for v is then
sv(x) = E~=l WiYi -tv, where Wi is the weight of the edge (Vi, V), tv is the threshold
of v and Yi is the value assigned to Vi. If V is not the sink, then we assign the value
Iv (sv (x)) to v. Otherwise we assign Sv (x) to v.
Then Ie = Sw is the function computed by C where W is the unique sink of C.

615
616 DasGupta and Schnitger

A great deal of work has been done showing that nets of two layers can approximate
(in various norms) large function classes (including continuous functions) arbitrar-
ily well (Arai, 1989; Carrol and Dickinson, 1989; Cybenko, 1989; Funahashi, 1989;
Gallant and White, 1988; Hornik et al. 1989; Irie and Miyake,1988; Lapades and
Farber, 1987; Nielson, 1989; Poggio and Girosi, 1989; Wei et al., 1991). Various
activation functions have been used, among others, the cosine squasher, the stan-
dard sigmoid, radial basis functions, generalized radial basis functions, polynomials,
trigonometric polynomials and binary thresholds. Still, as we will see, these func-
tions differ greatly in terms of their approximation power when we only consider
efficient nets; i.e. nets with few layers and few vertices.
Our goal is to compare activation functions in terms of efficiency and quality of
approximation. We measure efficiency by the size of the net (i.e. the number of
vertices, not counting input units) and by its number of layers. Another resource
of interest is the Lipschitz-bound of the net, which is a measure of the numerical
stability of the net. We say that net C has Lipschitz-bound L if all weights and
thresholds of C are bounded in absolute value by L and for each vertex v of C and
for all inputs x, y E [-1, l]m,
Itv(sv(x)) -tv(sv(y»1 :::; L ·Isv(x) - sv(y)l·
(Thus we do not demand that activation function Iv has Lipschitz-bound L, but
only that Iv has Lipschitz-bound L for the inputs it receives.) We measure the
quality of an approximation of function I by function Ie by the Chebychev norm;
i.e. by the maximum distance between I and Ie over the input domain [-1, l]m.
Let r be a class of activation functions. We are particularly interested in the
following two questions .
• Given a function I : [-1, l]m -+ n, how well can we approximate I by a f-net
with d layers, size s, and Lipschitz-bound L? Thus, we are particularly interested
in the behavior of the approximation error e(s, d) as a function of size and number
of layers. This set-up allows us to investigate how much the approximation error
decreases with increased size and/or number of layers .
• Given two classes of activation functions fl and f2, when do f 1-nets and f 2-
nets have essentially the same "approximation power" with respect to some error
function e(s, d)?
We first formalize the notion of "essentially the same approximation power" .
Definition 1.1 Let e : N2 -+ n+ be a function. fl and f2 are classes of activation
functions.
(a). We say that fl simulates r 2 with respect to e if and only if there is a constant
k such that for all functions I : [-1, l]m -+ n
with Lipschitz-bound l/e(s, d),

if f can be approximated by a r 2 -net with d layers, size s, Lipschitz-

bound 2~ and approximation error e(s, d), then I can also be ap-
proximated with error e( s, d) by a r 1 -net with k( d + 1) layers, size
(s + l)k and Lipschitz-bound 28 " •
(b). We say that f1 and r2 are equivalent with respect to e if and only if f2
simulates f 1 with respect to e and f 1 simulates f2 with respect to e.
The Power of Approximating: a Comparison of Activation Functions 617

In other words, when comparing the approximation power of activation functions,

we allow size to increase polynomially and the number of layers to increase by a
constant factor, but we insist on at least the same approximation error. Observe
that we have linked the approximation error e( s, d) and the Lipschitz-bound of the
function to be approximated. The reason is that approximations of functions with
high Lipschitz-bound "tend" to have an inversely proportional approximation error.
Moreover observe that the Lipschitz-bounds of the involved nets are allowed to be
exponential in the size of the net. We will see in section 3, that for some activation
functions far smaller Lipschitz-bounds suffice.
Below we discuss our results. In section 2 we consider the case of tight approx-
imations, i.e. e(s, d) =
2-'. Then in section 3 the more relaxed error model
e(s, d) = s-d is discussed . In section 4 we consider the computation of boolean
functions and show that sigmoidal nets can be far more efficient than threshold-
nets.

2 Equivalence of Activation Functions for Error e( s, d) = 2- 8

We obtain the following result.

Theorem 2.1 The following activation functions are equivalent with respect to er-
ror e(s, d)=2- 3 •
• the standard sigmoid O'(x) = l+ex~(-r)'
• any rational function which is not a polynomial,
• any root x a , provided Q is not a natural number,
• the logarithm (for any base b > 1),

• the gaussian e- x2 ,
• the radial basis functions (1 + x2)a, < 1,
Q Q #0
Notable exceptions from the list of functions equivalent to the standard sigmoid are
polynomials, trigonometric polynomials and splines. We do obtain an equivalence
to the standard sigmoid by allowing splines of degree s as activation functions for
nets of size s. (We will always assume that splines are continuous with a single knot
only.)

Theorem 2.2 Assume that e(s, d) = 2-'. Then splines (of degree s for nets of size
s) and the standard sigmoid are equivalent with respect to e(s, d).

Remark 2.1
(a) Of course, the equivalence of spline-nets and {O' }-nets also holds for binary
input. Since threshold-nets can add and multiply m m-bit numbers with constantly
many layers and size polynomial in m (Rei/, 1987), threshold-nets can efficiently
approximate polynomials and splines.
618 DasGupta and Schnitger

Thus, we obtain that {u }-nets with d layers, size s and Lipschitz-bound L can
be simulated by nets of binary thresholds. The number of layers of the simulat-
ing threshold-net will increase by a constant factor and its size will increase by a
polynomial in (s + n) log(L), where n is the number of input bits. (The inclusion
of n accounts for the additional increase in size when approximately computing a
weighted sum by a threshold-net.)
(b) If we allow size to increase by a polynomial in s + n, then threshold-nets and
{u }-nets are actually equivalent with respect to error bound 2-". This follows, since
a threshold function can easily be implemented by a sigmoidal gate (Maass et al.,
1991).
Thus, if we allow size to increase polynomially (in s + n) and the number of layers
to increase by a constant factor, then {u }-nets with weights that are at most expo-
nential (in s + n) can be simulated by {u} -nets with weights of size polynomial in
s.

{u }-nets and threshold-nets (respectively nets of linear thresholds) are not equiva-
lent for analog input. The same applies to polynomials, even if we allow polynomials
of degree s as activation function for nets of size s:

Theorem 2.3
(a) Let sq(x) = x 2 • If a net of linear splines (with d layers and size s) approximates
sq( x) over the interval [-1, 1], then its approximation error will be at least s-o( d) .
(b) Let abs(x) =1 x I. If a polynomial net with d layers and size s approximates
abs(x) over the interval [-1,1]' then the approximation error will be at least s-O(d).

We will see in Theorem 2.5 that the standard sigmoid (and hence any activation
function listed in Theorem 2.1) is capable of approximating sq(x) and abs(x) with
error at most 2-" by constant-layer nets of size polynomial in s. Hence the standard
sigmoid is properly stronger than linear splines and polynomials. Finally, we show
that sine and the standard sigmoid are inequivalent with respect to error 2-'.

Theorem 2.4 The function sine(Ax) can be approximated by a {u}-net CA with d

=
layers, size s AO(l/d) and error at most sO( -d). On the other hand, every {u }-net
with d layers which approximates sine(Ax) with error at most ~, has to have size
at least AO(l/d).

Below we sketch the proof of Theorem 2.1. The proof itself will actually be more
instructive than the statement of Theorem 2.1. In particular, we will obtain a
general criterion that allows us to decide whether a given activation function (or
class of activation functions) has at least the approximation power of splines.

2.1 Activation F\lnctions with the Approximation Power of Splines

Obviously, any activation function which can efficiently approximate polynomials

and the binary threshold will be able to efficiently approximate splines. This follows
since a spline can be approximated by the sum p + t . q with polynomials p and q
The Power of Approximating: a Comparison of Activation Functions 619

and a binary threshold t. (Observe that we can approximate a product once we can
approximately square: (x + y)2/2 - x 2/2 - y2/2 = x . y.)
Firstly, we will see that any sufficiently smooth activation function is capable of
approximating polynomials.

Definition 2.1 Ld -y : n ---+ n be a function. We call -y suitable if and only if

there exists real numbers a, f3 (a > 0) and an integer k such that
(a) -y can be represented by the power series 2:~o ai(x - f3)i for all x E [-a, a].
The coefficients are rationals of the form ai = ~ with IPi I, IQil ~ 2ki (for i > 1).

(b) For each i > 2 there exists j with i ~ j ~ i k and aj "# O.

Proposition 2.1 Assume that -y is suitable with parameter k.
Then, over the domain [-D, D], any degree n polynomial p can be approximated
with errore by a {-y}-net Cpo Cp has 2 layers and size 0(n2k); its weights are
rational numbers whose numerator and denominator are bounded in absolute value
by
Pmax(2 + D)PO,y(n)lh(N+l)II[_a,a1;'
Here we have assumed that the coefficients of p are rational numbers with numerator
and denominator bounded in absolute value by Pmax.

Thus, in order to have at least the approximation power of splines, a suitable activa-
tion function has to be able to approximate the binary threshold. This is achieved
by the following function class,

Definition 2.2 Let r be a class of activation functions and let 9 : [1,00] ---+ n be a
function.
(a). We say that 9 is fast converging if and only if
I g(x) - g(x + e) 1= 0(e/X 2 ) for x ~ 1, e ~ 0,

o < roo g( u 2 )du < 00 ~

and I
Jroo g( u )du 1= 0(1/ N) for all N 1.
2
J1 2N

(b). We say that r is powerful if and only if at least one function in r is suitable
and there is a fast converging function g which can be approximated for all s > 1
(over the domain [-2",2"]) with error 2-" by a {r}-net with a constant number of
layers, size polynomial in s and Lipschitz-bound 2".

Fast convergence can be checked easily for differentiable functions by applying the
mean value theorem. Examples are x-a for a ~ 1, exp( -x) and 0'( -x). Moreover,
it is not difficult to show that each function mentioned in Theorem 2.1 is powerful.
Hence Theorem 2.1 is a corollary of

Theorem 2.5 Assume that r is powerful.

(a) r simulates splines with respect to error e(s, d) = 2- 3 •
620 DasGupta and Schnitger

(b) Assume that each activation function in r can be approximated (over the
domain [-2',2']) with error 2-' by a spline-net N, of size s and with constantly
many layers. Then r is equivalent to splines.

Remark 2.2 Obviously, 1/x is po'wer/ul. Therefore Theorem 2.5 implies that
constant-layer {l/x}-nets of size s approximate abs(x) = Ixl with error 2-'. The
degree of the resulting rational function will be polynomial in s. Thus Theorem
2.5 generalizes .N ewman's approximation of the absolute value by rational functions.
(Newman, 1964)

3 Equivalence of Activation Functions for Error s-d

The lower bounds in the previous section suggest that the relaxed error bound
e(s, d) = s-d is of importance. Indeed, it will turn out that many non-trivial smooth
activation functions lead to nets that simulate {(T }-nets, provided the number of
input units is counted when determining the size of the net. (We will see in section
4, that linear splines and the standard sigmoid are not equivalent if the number of
inputs is not counted). The concept of threshold-property will be crucial for us.

Definition 3.1 Let r

be a collection of activation functions. We say that r has
the threshold-property if there is a constant c such that the following two properties
are satisfied for all m > 1.
(a) For each 'Y E there is a threshold-net T-y,m with c layers and size (s + m)C
r
which computes the binary representation of'Y'(x) where h(x)-t'(x)1 ~ 2- m .
The input x of T-y ,m is given in binary and consists of 2m + 1 bits; m bits describe
the integral part of x, m bits describe its fractional part and one bit indicates the
sign . s + m specifies the required number of output bits, i.e. s = rlog2(sup{'Y(x) :
_2m+l < x < 2m+l})1.

(b) There is a r -net

with c layers, size m C and Lipschitz bound 2 mc which approx-
imates the binary threshold over D = [-1,1] - [-11m, 11m] with error 11m.
We can now state the main result of this section.

Theorem 3.1 Assume that e(s, d) = s-d.

(a) Let rbe a class of activation functions and assume that r has the threshold
property. Then, (T and r are equivalent with respect to e . Moreover, {(T} -nets only
require weights and thresholds of absolute value at most s. (Observe that r -nets are
allowed to have weights as large as 2' I)
(b) If rand (T are equivalent with respect to error 2-', then rand (T are equivalent
with respect to error s-d.
(c) Additionally, the following classes are equivalent to {(T }-nets with respect to e.
(We assume throughout that all coefficients, weights and thresholds are bounded by
2 3 for nets of size s) .
• polynomial nets (i. e. polynomials of degree s appear as activation function for
nets of size s),
The Power of Approximating: a Comparison of Activation Functions 621

where ~/ is a suitable function and

• {"y }-nets, "y satisfies part (a) of Definition 3.1.
(This includes the sine-function.)
• nets of linear splines

The equivalence proof involves a first phase of extracting O(dlogs) bits from the
analog input. In a second phase, a binary computation is mimicked. The extraction
process can be carried out with error s-1 (over the domain [-1,1] - [-l/s, l/s])
once the binary threshold is approximated.

4 Computing boolean functions

As we have seen in Remark 2.1, the binary threshold (respectively linear splines)
gains considerable power when computing boolean functions as compared to approx-
imating analog functions. But sigmoidal nets will be far more powerful when only
the number of neurons is counted and the number of input units is disregarded. For
instance, sigmoidal nets are far more efficient for "squaring", i.e when computing:
Mn = {(x, y): x E {O, l}n, y E {O, l}n:l and [xJ2;;::: [y]} (where [z] = L Zi).
i

Theorem 4.1 A threshold-net computing Mn must have size at least n(logn). But
Mn can be computed by a (1'-net with constantly many gates.
The previously best known separation of threshold-nets and sigmoidal-nets is due
to Maass, Schnitger and Sontag (Maass et al., 1991). But their result only applies
to threshold-nets with at most two layers; our result holds without any restriction
on the number oflayers. Theorem 4.1 can be generalized to separate threshold-nets
and 3-times differentiable activation functions, but this smoothness requirement is
more severe than the one assumed in (Maass et al., 1991).

5 Conclusions
Our results show that good approximation performance (for error 2-") hinges on
two properties, namely efficient approximation of polynomials and efficient approx-
imation of the binary threshold. These two properties are shared by a quite large
class of activation functions; i.e. powerful functions. Since (non-polynomial) ratio-
nal functions are powerful, we were able to generalize Newman's approximation of
I x I by rational functions .
On the other hand, for a good approximation performance relative to the relaxed
error bound s-d it is already sufficient to efficiently approximate the binary thresh-
old. Consequently, the class of equivalent activation functions grows considerably
(but only if the number of input units is counted). The standard sigmoid is dis-
tinguished in that its approximation performance scales with the error bound: if
larger error is allowed, then smaller weights suffice.
Moreover, the standard sigmoid is actually more powerful than the binary threshold
even when computing boolean functions. In particular, the standard sigmoid is able
to take advantage of its (non-trivial) smoothness to allow for more efficient nets.
622 DasGupta and Schnitger

Acknowledgements. We wish to thank R. Paturi, K. Y. Siu and V. P. Roy-

chowdhury for helpful discussions. Special thanks go to W. Maass for suggesting
this research, to E. Sontag for continued encouragement and very valuable advice
and to J. Lambert for his never-ending patience.
The second author gratefully acknowledges partial support by NSF-CCR-9114545.

References
Arai, W. (1989), Mapping abilities of three-layer networks, in "Proc. of the Inter-
national Joint Conference on Neural Networks", pp. 419-423.
Carrol, S. M., and Dickinson, B. W. (1989), Construction of neural nets using
the Radon Transform,in "Proc. of the International Joint Conference on Neural
Networks", pp. 607-611.
Cybenko, G. (1989), Approximation by superposition of a sigmoidal function, Math-
ematics of Control, Signals, and System, 2, pp. 303-314.
Funahashi, K. (1989), On the approximate realization of continuous mappings by
neural networks, Neural Networks, 2, pp. 183-192.
Gallant, A. R., and White, H. (1988), There exists a neural network that does not
make avoidable mistakes, in "Proc. of the International Joint Conference on Neural
Networks" , pp. 657-664.
Hornik, K., Stinchcombe, M., and White, H. (1989), Multilayer Feedforward Net-
works are Universal Approximators, Neural Networks, 2, pp. 359-366.
Irie, B., and Miyake, S. (1988), Capabilities of the three-layered perceptrons, in
"Proc. of the International Joint Conference on Neural Networks", pp. 641-648.
Lapades, A., and Farbar, R. (1987), How neural nets work, in "Advances in Neural
Information Processing Systems" , pp. 442-456.
Maass, W., Schnitger, G., and Sontag, E. (1991), On the computational power of
sigmoid versus boolean threshold circuits, in "Proc. of the 32nd Annual Symp. on
Foundations of Computer Science" , pp. 767-776.
Newman, D. J. (1964), Rational approximation to I x I , Michigan Math. Journal,
11, pp. 11-14.
Hecht-Nielson, R. (1989), Theory of backpropagation neural networks, in "Proc. of
the International Joint Conference on Neural Networks", pp. 593-611.
Poggio, T., and Girosi, F. (1989), A theory of networks for Approximation and
learning, Artificial Intelligence Memorandum, no 1140.
Reif, J. H. (1987), On threshold circuits and polynomial computation, in "Proceed-
ings of the 2nd Annual Structure in Complexity theory", pp. 118-123.
Wei, Z., Yinglin, Y., and Qing, J. (1991), Approximation property of multi-layer
neural networks ( MLNN ) and its application in nonlinear simulation, in "Proc. of
the International Joint Conference on Neural Networks", pp. 171-176.

Mathematical Theory of Deep
No ratings yet
Mathematical Theory of Deep
275 pages
DL Unit2 HD
No ratings yet
DL Unit2 HD
141 pages
Deep Neural Networks and Partial Differential Equations - Approximation Theory and Structural Properties) Philipp Petersen, University of Oxford
No ratings yet
Deep Neural Networks and Partial Differential Equations - Approximation Theory and Structural Properties) Philipp Petersen, University of Oxford
49 pages
ActivationFun Survey Arxiv
No ratings yet
ActivationFun Survey Arxiv
49 pages
Neural Networks
No ratings yet
Neural Networks
37 pages
Mathematics Theory of Deep Learning
No ratings yet
Mathematics Theory of Deep Learning
3 pages
CS490 Advanced Topics in Computing (Deep Learning)
No ratings yet
CS490 Advanced Topics in Computing (Deep Learning)
37 pages
Lecture NN Part1
No ratings yet
Lecture NN Part1
62 pages
Neumayer 2023 A
No ratings yet
Neumayer 2023 A
17 pages
Mod 2.3 - Activation Function, Loss Functions
No ratings yet
Mod 2.3 - Activation Function, Loss Functions
12 pages
Montanari
No ratings yet
Montanari
10 pages
NeurIPS 2018 Lipschitz Regularity of Deep Neural Networks Analysis and Efficient Estimation Paper
No ratings yet
NeurIPS 2018 Lipschitz Regularity of Deep Neural Networks Analysis and Efficient Estimation Paper
10 pages
Common Misconceptions About Neural Networks As Approximators
No ratings yet
Common Misconceptions About Neural Networks As Approximators
14 pages
What Kinds of Functions Do Deep Neural Networks Learn - Insights From Variational Spline Theory (2021)
No ratings yet
What Kinds of Functions Do Deep Neural Networks Learn - Insights From Variational Spline Theory (2021)
25 pages
Curs7 PDF
No ratings yet
Curs7 PDF
46 pages
TFM Lichtner Bajjaoui Aisha
No ratings yet
TFM Lichtner Bajjaoui Aisha
18 pages
Deep Neural Network Structures Solving Variational Inequalities
No ratings yet
Deep Neural Network Structures Solving Variational Inequalities
26 pages
Safran 17 A
No ratings yet
Safran 17 A
9 pages
14 Deep
No ratings yet
14 Deep
6 pages
Solution Dseclzg524!01!102020 Ec2r
100% (1)
Solution Dseclzg524!01!102020 Ec2r
6 pages
17 Aap1328
No ratings yet
17 Aap1328
59 pages
(2020) Kidger, Lyons (Proc. Mach. Learn. Res.)
No ratings yet
(2020) Kidger, Lyons (Proc. Mach. Learn. Res.)
22 pages
Introduction To Radial Basis Function Networks
No ratings yet
Introduction To Radial Basis Function Networks
45 pages
Activation Function
No ratings yet
Activation Function
5 pages
Six Lectures On NN - Montanari
No ratings yet
Six Lectures On NN - Montanari
77 pages
(1991) Hornik (Neural Netw.)
No ratings yet
(1991) Hornik (Neural Netw.)
7 pages
Fundamentals of Neural Network
No ratings yet
Fundamentals of Neural Network
84 pages
26 Neural Nets
No ratings yet
26 Neural Nets
77 pages
Accuracy Controlled Iterative Method For Efficient Sigmoid Function Approximation
No ratings yet
Accuracy Controlled Iterative Method For Efficient Sigmoid Function Approximation
3 pages
Universal Approximation To Nonlinear Operators by Neural Networks With Arbitrary Activation Functions and Its Application To Dynamical Systems
No ratings yet
Universal Approximation To Nonlinear Operators by Neural Networks With Arbitrary Activation Functions and Its Application To Dynamical Systems
7 pages
Presentation For Deep Learning
No ratings yet
Presentation For Deep Learning
15 pages
Ad3451 ML Unit 4 Notes
No ratings yet
Ad3451 ML Unit 4 Notes
36 pages
Neural Networks and Principal Component Analysis: Learning From Examples Without Local Minima
No ratings yet
Neural Networks and Principal Component Analysis: Learning From Examples Without Local Minima
6 pages
Approximation Capability To Functions of Several Variables Nonlinear Functionals and Operators by Radial Basis Function Neural Networks
No ratings yet
Approximation Capability To Functions of Several Variables Nonlinear Functionals and Operators by Radial Basis Function Neural Networks
7 pages
16 - The Key To The Most Powerful ML Models
No ratings yet
16 - The Key To The Most Powerful ML Models
25 pages
Activation Function in NN
No ratings yet
Activation Function in NN
29 pages
NeurIPS 2021 Towards Lower Bounds On The Depth of Relu Neural Networks Paper
No ratings yet
NeurIPS 2021 Towards Lower Bounds On The Depth of Relu Neural Networks Paper
13 pages
DL Answers
No ratings yet
DL Answers
24 pages
Performance Analysis of Various Activation Functio
No ratings yet
Performance Analysis of Various Activation Functio
7 pages
DL Assi02
No ratings yet
DL Assi02
9 pages
Performance Analysis of Various Activation Functions in Neural
No ratings yet
Performance Analysis of Various Activation Functions in Neural
20 pages
Unit 2
No ratings yet
Unit 2
35 pages
9 Integrals PDF
No ratings yet
9 Integrals PDF
19 pages
Unit 3 Deep Learning
No ratings yet
Unit 3 Deep Learning
11 pages
CS230 Midterm Solutions Fall 2021
No ratings yet
CS230 Midterm Solutions Fall 2021
14 pages
Activation Functions and Their Characteristics in Deep Neural Networks
No ratings yet
Activation Functions and Their Characteristics in Deep Neural Networks
6 pages
Lecture Five Radial-Basis Function Networks: Associate Professor
No ratings yet
Lecture Five Radial-Basis Function Networks: Associate Professor
64 pages
Perceptron in Machine Learning
No ratings yet
Perceptron in Machine Learning
11 pages
NNN Proect
No ratings yet
NNN Proect
12 pages
Study of Ensemble of Activation Functions in Deep Learning
No ratings yet
Study of Ensemble of Activation Functions in Deep Learning
10 pages
Activation
No ratings yet
Activation
7 pages
Lec 22 Activations Functions Complete
No ratings yet
Lec 22 Activations Functions Complete
33 pages
Activation Functions in Neural Networks - 241102 - 224129
No ratings yet
Activation Functions in Neural Networks - 241102 - 224129
7 pages
ML Mentorship Prahitha Movva V1
No ratings yet
ML Mentorship Prahitha Movva V1
5 pages
Klqgceb Ewvhja SC
No ratings yet
Klqgceb Ewvhja SC
8 pages
Mathematics of Deep Learning: Lecture 2 - Depth Separation
No ratings yet
Mathematics of Deep Learning: Lecture 2 - Depth Separation
13 pages
Final Test Firstsem Genmath
No ratings yet
Final Test Firstsem Genmath
4 pages
Computer Science Annotation Paraphrasing Vihaan Malpani
No ratings yet
Computer Science Annotation Paraphrasing Vihaan Malpani
14 pages
Taylor's Theorem
No ratings yet
Taylor's Theorem
4 pages
Mathematics of Deep Learning: Lecture 1-Introduction and The Universality of Depth 1 Nets
No ratings yet
Mathematics of Deep Learning: Lecture 1-Introduction and The Universality of Depth 1 Nets
12 pages
Ex04 Regression MLP Solution
No ratings yet
Ex04 Regression MLP Solution
4 pages
IX.4 The Hankel Transform
No ratings yet
IX.4 The Hankel Transform
54 pages
2Q-S1 Polynomial Functions
No ratings yet
2Q-S1 Polynomial Functions
18 pages
CG Chapter 5
No ratings yet
CG Chapter 5
61 pages
Chap2 1
No ratings yet
Chap2 1
45 pages
Cayley MemoirTheoryMatrices 1858
No ratings yet
Cayley MemoirTheoryMatrices 1858
22 pages
GDEA-2021-0354 Proof Hi
No ratings yet
GDEA-2021-0354 Proof Hi
20 pages
Continuous Functions
No ratings yet
Continuous Functions
11 pages
PRACTICAL
No ratings yet
PRACTICAL
11 pages
EGM212 Unit2 2-2
No ratings yet
EGM212 Unit2 2-2
31 pages
Geometric Interpration of Derivative
No ratings yet
Geometric Interpration of Derivative
64 pages
Wessam Math 119
No ratings yet
Wessam Math 119
11 pages
Programs in Fortran
100% (1)
Programs in Fortran
12 pages
03 Gradients and Intercepts
No ratings yet
03 Gradients and Intercepts
14 pages
2018 - Differential Calculus
No ratings yet
2018 - Differential Calculus
2 pages
Lecture 11. Higher-Order Linear Equations
No ratings yet
Lecture 11. Higher-Order Linear Equations
5 pages
Ch. 8 Logarithmic Functions Ans
No ratings yet
Ch. 8 Logarithmic Functions Ans
17 pages
CVT WEEK6 - 3A 27102020 124253pm PDF
No ratings yet
CVT WEEK6 - 3A 27102020 124253pm PDF
42 pages
Complex Roots Paper
No ratings yet
Complex Roots Paper
3 pages
Solutions of Practice Questions Matrices and Determinant
No ratings yet
Solutions of Practice Questions Matrices and Determinant
4 pages
Euler Homogeneity
No ratings yet
Euler Homogeneity
4 pages
BC Cram Sheet
No ratings yet
BC Cram Sheet
2 pages
AMP II Information Fact Sheet
No ratings yet
AMP II Information Fact Sheet
2 pages
Systems Analysis and Control: Matthew M. Peet
No ratings yet
Systems Analysis and Control: Matthew M. Peet
23 pages
Row Echelon Form
No ratings yet
Row Echelon Form
2 pages
Generalized Quotient Topologies
No ratings yet
Generalized Quotient Topologies
6 pages
SOLUTIONS - Practice Paper Non Calc
No ratings yet
SOLUTIONS - Practice Paper Non Calc
6 pages
Introduktion Till Högskolematematik
No ratings yet
Introduktion Till Högskolematematik
1 page

Dasgupta and Schnitger - 1992 - The Power of Approximating A Comparison of Activation Functions Paper

Uploaded by

Dasgupta and Schnitger - 1992 - The Power of Approximating A Comparison of Activation Functions Paper

Uploaded by

The Power of Approximating: a

Comparison of Activation Functions

Bhaskar DasGupta Georg Schnitger

if f can be approximated by a r 2 -net with d layers, size s, Lipschitz-

In other words, when comparing the approximation power of activation functions,

2 Equivalence of Activation Functions for Error e( s, d) = 2- 8

Theorem 2.4 The function sine(Ax) can be approximated by a {u}-net CA with d

2.1 Activation F\lnctions with the Approximation Power of Splines

Obviously, any activation function which can efficiently approximate polynomials

Definition 2.1 Ld -y : n ---+ n be a function. We call -y suitable if and only if

(b) For each i > 2 there exists j with i ~ j ~ i k and aj "# O.

o < roo g( u 2 )du < 00 ~

Theorem 2.5 Assume that r is powerful.

3 Equivalence of Activation Functions for Error s-d

Definition 3.1 Let r

(b) There is a r -net

Theorem 3.1 Assume that e(s, d) = s-d.

where ~/ is a suitable function and

4 Computing boolean functions

Acknowledgements. We wish to thank R. Paturi, K. Y. Siu and V. P. Roy-

You might also like