The Psychological Limits of Neural Computation
RAZVAN ANDONIE
Department of Electronics and Computers, Transylvania University, 2200 Brasov, Romania
Abstract: Recent results have essentially changed our view of the generality of neural network models. We now know that such models: i) are more powerful than Turing machines if they have an infinite number of neurons; ii) are universal approximators; iii) can represent any logical function; iv) can efficiently solve instances of NP-complete problems. In a previous paper [1], we discussed the computational capabilities of artificial neural networks vis-à-vis the assumptions of classical computability. We continue that discussion here, concentrating on the worst-case psychological limits of neural computation. In addition, we state some open problems and conjectures concerning the representation of logical functions and circuits by neural networks.
In: Dealing with Complexity: A Neural Network Approach, M. Kárný, K. Warwick, V. Kůrková (eds.), Springer-Verlag, London, 1997, pp. 252-263.
2. Function approximation
The approximation (or prediction) of functions that are known only at a certain number of discrete points is a classical application of multilayered neural networks. Of fundamental importance was the discovery [20] that a classical mathematical result of Kolmogorov (1957) was actually a statement that for any continuous mapping f : [0, 1]^n ⊂ R^n → R^m there must exist a three-layered feedforward neural network of continuous-type neurons (having an input layer with n neurons, a hidden layer with 2n+1 neurons, and an output layer with m neurons) that implements f exactly. This existence result was the first step. Cybenko [10] showed that any continuous function defined on a compact subset of R^n can be approximated to any desired degree of accuracy by a feedforward neural network with one hidden layer using sigmoidal nonlinearities. Many other papers have investigated the approximation capability of three-layered networks in various ways (see [37]). More recently, Chen et al. [8] pointed out that the boundedness of the sigmoidal function plays an essential role in its being an activation function for the hidden layer; that is, instead of continuity or monotonicity, the boundedness of sigmoidal functions ensures the networks' capability of approximating functions defined on compact sets in R^n. In addition to sigmoid functions, many others can be used as activation functions of universal approximator feedforward networks [9]. Girosi and Poggio [17] proved that radial basis function networks also have the universal approximation property. Consequently, a feedforward neural network with a single hidden layer has sufficient flexibility to approximate, within a given error, any continuous function defined on a compact set (these conditions may be relaxed). However, there are at least three important limitations:
a) These existence proofs are rather formal and do not guarantee the existence of a reasonable representation in practice. In contrast, a constructive proof was given [27] for networks with two hidden layers. An explicit numerical implementation of the neural representation of continuous functions was recently discovered by Sprecher [35, 36].
b) The arbitrary accuracy with which such networks are able to approximate quite general functions is based on the assumption that arbitrarily large parameters (weights and biases) and enough hidden units are available. In practical situations, however, both the size of the parameters and the number of hidden neurons are bounded. The problem of how the universal approximation property can be achieved while constraining the size of the parameters and the number of hidden units was recently examined by Kůrková [26].
c) These universal approximation proofs are commonly used to justify the notion that neural networks can do anything (in the domain of function approximation). What is not considered by these proofs is that networks are simulated on computers with finite accuracy. Wray and Green [38] showed that approximation theory results cannot be used blindly without consideration of numerical accuracy limits, and that these limitations constrain the approximation ability of neural networks.
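As an informal illustration of these points (not part of the original discussion), the following Python sketch fits a single-hidden-layer sigmoid network to a continuous function on the compact set [0, 1]. The target function, the number of hidden units H, and the random-feature plus least-squares fitting procedure are arbitrary choices made here for brevity; they merely show that, with enough hidden units and sufficiently large weights, the approximation error on a finite grid typically becomes small, in line with limitation b).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Target: a continuous function on the compact set [0, 1].
def f(x):
    return np.sin(2.0 * np.pi * x)

H = 20                                       # number of sigmoidal hidden units
W = rng.normal(scale=5.0, size=(H, 1))       # input-to-hidden weights
b = rng.normal(scale=5.0, size=H)            # hidden biases

x = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
y = f(x).ravel()

# Hidden activations; the hidden-to-output weights are then fitted by linear
# least squares, which is enough to exhibit the approximation capability.
Phi = sigmoid(x @ W.T + b)                       # shape (200, H)
Phi1 = np.hstack([Phi, np.ones((len(x), 1))])    # append a column for the output bias
coef, *_ = np.linalg.lstsq(Phi1, y, rcond=None)

y_hat = Phi1 @ coef
print("max abs error on the grid:", float(np.max(np.abs(y - y_hat))))
```

With fewer hidden units or a smaller weight scale the achievable error generally grows, which is the practical side of the bounded-parameter issue examined by Kůrková [26].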
The relationship between networks with one hidden layer and networks with several hidden layers is not yet well understood. Although one hidden layer is always enough, in solving particular problems it is often essential to have more hidden layers. This is because for many problems an approximation with one hidden layer would require an impractically large number of hidden neurons, whereas an adequate solution can be obtained with a tractable network size by using more than one hidden layer.
Neural network representations of logical functions can be much more compact than DNF representations [see 15]. For example, consider the majority function on k inputs. Such a function can be represented as a feedforward neural network, described by a list expression using O(k) symbols (a sketch comparing the two representations is given after the observations below). The DNF representation requires, however, C(k, (k+1)/2) terms. Therefore, the number of terms for the DNF representation
of the majority function grows exponentially in k. It is relatively easy to establish an upper bound on the size of a feedforward net for realizing arbitrary, even partially specified, logical functions. For a partially specified logical function, defined on a set of m arbitrary points in {0, 1}^n, with m ≤ 2^n, a single-hidden-layer discrete neural network with m hidden neurons and one output neuron is sufficient to realize the function [19]. It is perhaps more interesting to know the lower bounds on the number of hidden neurons for realizing any logical function by a discrete neural net with a single output neuron. Hassoun [19] proved the existence of the following lower bounds, in the limit of large n:
1. A feedforward net with one hidden layer requires more than m / [n log2(me/n)] hidden neurons to represent any function f : R^n → {0, 1} defined on m arbitrary points in R^n, m ≥ 3n.
2. A two-hidden-layer feedforward net having k/2 (k even) neurons in each of its hidden layers requires k ≥ 2(m / log2 m)^(1/2) to represent any function f : R^n → {0, 1} defined on m arbitrary points in R^n, m >> n^2.
3. An arbitrarily interconnected network with no feedback needs more than (2m / log2 m)^(1/2) neurons to represent any function f : R^n → {0, 1} defined on m arbitrary points in R^n, m >> n^2.
We can conclude that when the training sequence is much larger than n, networks with two or more hidden layers may require substantially fewer units than single-hidden-layer networks. Other bounds on the number of units for computing arbitrary logical functions by multilayer feedforward neural networks may be found in [32]. The fact that neural networks can implement any logical function needs more attention. At first sight, such a representation seems useless in practice, because the number of required neurons may be very large. There are three important observations:
a) The representation of arbitrary logical functions is an NP-complete problem, which here finds its expression in the fact that, in general, an exponentially growing number of hidden neurons might be required.
b) For individual logical functions, the number of hidden neurons can be much smaller. For example, symmetric functions with p inputs are representable by networks with a linear (with respect to p) number of neurons in the hidden layer [15]. Symmetric functions are an important class of logical functions whose outputs do not change if the inputs are permuted. Special cases of symmetric functions are the AND, OR, XOR, and majority functions.
c) A more general neural network, representing an individual function, may require substantially fewer neurons than the straightforward CNN representation of this function.
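To make the contrast above concrete, here is a small Python sketch (added for illustration, not from the original text) that compares the two representations of the majority function for odd k: a single discrete (threshold) neuron with k unit weights and one threshold, versus the C(k, (k+1)/2) terms of the minimal DNF.

```python
from itertools import product
from math import comb

def majority_neuron(x):
    """Majority of k (odd) inputs as a single threshold neuron:
    all weights equal to 1, threshold (k + 1) / 2."""
    k = len(x)
    return int(sum(x) >= (k + 1) // 2)

for k in (3, 5, 7, 9, 11):
    neuron_params = k + 1                    # k unit weights plus one threshold
    dnf_terms = comb(k, (k + 1) // 2)        # C(k, (k+1)/2) terms in the minimal DNF
    # sanity check against the brute-force definition of majority
    assert all(majority_neuron(x) == int(sum(x) > k / 2)
               for x in product((0, 1), repeat=k))
    print(f"k={k:2d}: threshold-neuron parameters = {neuron_params:2d}, DNF terms = {dnf_terms}")
```

Already for k = 11 the threshold neuron needs 12 parameters, while the minimal DNF has C(11, 6) = 462 terms.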
A very practical problem arises: how to minimize the neural network representation of a particular logical function (that is, how to obtain an implementation with a minimum number of neurons). For CNN representations, the problem is not new and is equivalent to the minimization of a DNF logical function. Therefore, one might apply a usual minimization algorithm (Veitch-Karnaugh or Quine-McCluskey [37]). The minimization of a DNF is an NP-complete problem (see [16], problem L09). It is not surprising that the minimization of logical functions having more than two layers is also a hard problem [37]. We have to conclude that the minimization of a neural network representation of a logical function is a hard problem even for one-hidden-layer feedforward architectures. Another important result concerns the complexity theory of decision problems, which are based on the logical operations of conjunction, disjunction, and complement. If we enlarge the family of decision problems to include decision problems based on linear threshold functions, we obtain new complexity classes [31].
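As a reminder of what the Quine-McCluskey procedure mentioned above actually does, the following sketch implements only its polynomial merging phase, which produces the prime implicants; the subsequent minimum-cover selection, the actual source of NP-hardness, is omitted. The function names and the plain string encoding of implicants are choices made here for illustration.

```python
from itertools import combinations

def combine(a, b):
    """Merge two implicants (strings over '0', '1', '-') that differ in exactly
    one position, and only in a concrete bit (never in a dash)."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and a[diff[0]] != '-' and b[diff[0]] != '-':
        i = diff[0]
        return a[:i] + '-' + a[i + 1:]
    return None

def prime_implicants(minterms, n):
    """Merging phase of Quine-McCluskey: repeatedly combine implicants until
    no further merges are possible; implicants never merged are prime."""
    current = {format(m, f'0{n}b') for m in minterms}
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(current, 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        primes |= current - used
        current = merged
    return primes

# Majority of three inputs: minterms 011, 101, 110, 111.
print(sorted(prime_implicants({3, 5, 6, 7}, 3)))   # ['-11', '1-1', '11-']
```

Applied to the majority of three inputs, it returns the three prime implicants x2x3, x1x3, and x1x2, i.e., the minimal DNF with C(3, 2) = 3 terms.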
It has been an empirical observation that some algorithms (notably backpropagation) work well on nets that have only a few hidden layers. One might be tempted to infer that shallow nets would be intrinsically easier to teach. Judd's result shows that this is not the case: learning in a network with one hidden layer is (in general) NP-complete.
To summarize:
a) It seems advantageous to use neural network representations of logical functions.
b) The neural network representation of a logical function might be expensive (in the number of neurons).
c) Learning neural network representations of logical functions may be computationally expensive.
d) A neural network representation of some logical functions may be obtained efficiently by generalization, memorizing examples of the pairs (function argument, function value).
These properties can obviously be extended to the wider class of circuits, which are finite combinations of logical functions. The size of a circuit equals the number of its gates. The depth of a circuit is the number of gates on its longest path. A circuit is efficient if both size and depth are small. Our knowledge about circuits is far from that required to always design efficient circuits. We have no general methods to simultaneously minimize the size and the depth of a circuit as a trade-off result [37]. The independent minimization of the two parameters may lead to expensive solutions. For instance, a minimal DNF representation of a logical function does not necessarily lead to an efficient circuit, in spite of the fact that we first minimized the depth (restricting the problem to the class of two-level circuits) and then the number of gates. This becomes more evident in the case of the n-parity logical function. Another important parameter of a circuit is the fan-in (the number of incoming edges) of each gate. For VLSI engineers, the area (A) and the delay (T) of the circuit are standard optimization costs, and a circuit is considered VLSI-optimal if it minimizes the combined measure AT^2. For practical reasons, it would be interesting to find a wide subclass of circuits that can be efficiently learned and represented by neural networks. This could also be a way of obtaining optimal circuits: one could hope to obtain circuits with small size and depth, or VLSI-optimal circuits. There are two main approaches for building efficient neural network representations: learning techniques (for instance, variants of backpropagation) and constructive (direct design) techniques (see [33]). Research in this area is presently very active. As a first step, the following two experiments were performed by one of my students, A. Cataron.
Experiment A. A two-bit adder with carry (i.e., a full adder) was represented as a combination of two DNF logical functions. It was minimized using the Veitch-Karnaugh method, leading to a rather complicated structure with 12 gates. Implemented as a CNN, the circuit required 3 inputs, 7 hidden neurons, and 2 output neurons. Next, a neural network with a single hidden layer was trained by backpropagation to memorize the circuit. The number of hidden neurons was reduced as much as possible (so that learning by backpropagation remained effective). The resulting network had only 2 hidden neurons and 2 output neurons. Thus, the most compact neural representation of the circuit was obtained. The adder problem is one of the many test problems often used to benchmark neural network architectures and learning techniques, since it is sufficiently complex. Keating and Noonan [25] obtained the neural network representation of a three-bit
adder, using simulated annealing to train the network. The resulting neural representation of the adder did not compare well with the minimum circuit obtained by the Veitch-Karnaugh method. This happened, in our opinion, because their method does not minimize the size of the neural net: the architecture of the neural net remains unchanged during and after the training process. Incremental (architecture-changing) learning techniques seem to be more efficient for this problem.
Experiment B. A more complex circuit, a priority encoder with 29 gates, was too large to be implemented as a CNN. Again, a neural network with a single hidden layer was trained by backpropagation to memorize the circuit. The training sequence was, as before, the entire set of 512 associations. The result was a neural network representation with 9 inputs, 4 hidden neurons, and 5 output neurons. Since the training sequence was very large (and time consuming), several learning tests with incomplete training sequences were performed. The results are encouraging, even if only approximations of the initial functions were obtained (meaning that the generalization process was not perfect).
Complex logical functions are difficult to represent as CNNs because of the constructive procedure we used. Instead, learning neural network representations on multilayered feedforward networks seems to be more practical. Learning and hidden layer minimization were performed in a heuristic way. Besides heuristics, another technique may help make progress: generalization. It would be interesting to find special logical circuits for which approximately correct generalization is satisfactory. Obtaining the neural network representation of a circuit does not mean that we have obtained the circuit itself; a hardware implementation must follow. An immediate application would be to directly map the neural network representation onto FPGAs [4]. It is not straightforward to compare the generally accepted minimum circuits (the adder and the priority encoder) with those obtained neurally. The hardware implementation of a neural network is itself a challenging task with many open problems. For instance, Beiu [3] proved that digital VLSI-optimal implementations of neural networks can be obtained with constant fan-in digital circuits. The standard procedure for learning a logical function by backpropagation is to perform learning on a sigmoid neural network and to replace the continuous neurons with discrete neurons afterwards. Sometimes it might be interesting to implement the resulting continuous neural network directly in hardware. Fan-in dependent depth-size trade-offs were observed when trying to implement sigmoid-type neural networks [5].
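For readers who wish to reproduce the spirit of Experiment A, here is a minimal backpropagation sketch in Python. The architecture (3 inputs, 2 sigmoid hidden neurons, 2 sigmoid outputs), the learning rate, the number of epochs, and the restart strategy are assumptions made here, not the original experimental setup; with only 2 hidden units, convergence depends on the initialization, hence the random restarts.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Truth table of the full adder: inputs (a, b, carry_in), outputs (sum, carry_out).
X = np.array(list(product((0, 1), repeat=3)), dtype=float)
s = X.sum(axis=1)
Y = np.column_stack([s % 2, (s >= 2).astype(float)])

def train(hidden=2, epochs=20000, lr=0.5):
    """Plain batch backpropagation on the 8 associations (squared-error loss)."""
    W1 = rng.normal(scale=1.0, size=(3, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(scale=1.0, size=(hidden, 2)); b2 = np.zeros(2)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)                 # hidden activations
        O = sigmoid(H @ W2 + b2)                 # network outputs
        dO = (O - Y) * O * (1.0 - O)             # output deltas
        dH = (dO @ W2.T) * H * (1.0 - H)         # hidden deltas
        W2 -= lr * (H.T @ dO); b2 -= lr * dO.sum(axis=0)
        W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2

# With only 2 hidden units backpropagation may get stuck, so restart a few times.
for attempt in range(10):
    W1, b1, W2, b2 = train()
    out = (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5).astype(float)
    if np.array_equal(out, Y):
        print(f"full adder memorised with 2 hidden neurons (attempt {attempt + 1})")
        break
else:
    print("no exact fit found; try more epochs or restarts")
```

Thresholding the sigmoid outputs at 0.5 in the final check corresponds to the standard procedure mentioned above of replacing the continuous neurons with discrete ones after training.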
7. Final remarks
Our motivation for writing this paper comes from the new results obtained in neural computation during the last few years. Some of these results make us reflect (and doubt?), once again, on the limits of computability. However, at a closer look, we rediscover some old limits:
i) Infinite neural networks are more powerful than Turing machines, but this result is not a counterexample to the Church-Turing thesis.
ii) Neural networks are universal approximators, but there are serious limitations.
iii) Neural networks can represent any logical function, but it is quite difficult to find efficient neural representations for complex logical functions (or circuits).
iv) Neural networks are frequently used to solve instances of NP-complete optimization problems. This does not prove that P = NP.
The neural learning problem of a logical function gives an insight into how NP-complete problems can be solved by neural computing. The infinite family of circuits represents the advance of technology over time. It is likely that in the next ten years we will produce a family of circuits scaling from current technological limits up to brain-scale computations. Nanotechnology (i.e., molecular scale engineering) will dominate 21st century economics. However, building such Avogadro machines (with a trillion trillion processing elements) will demand an enormous complexity of the circuits. The complexity of such circuits will be so huge that, using present techniques, we will not be able to design them: it will be impossible to predict function from structure. One possible way to design a complex circuit is by learning its neural network representation (i.e., using the concept of evolutionary hardware [12]). Even if the learning procedure risks becoming a hard problem, this method seems to be of interest for the future.
We have discussed here the worst-case psychological limits of neural computation. From this point of view, we have to be rather pessimistic, accepting the later observations of Minsky and Papert [30]. Little good can come from statements like "learning arbitrary logical functions is NP-complete". We quote here Minsky and Papert [30]: "It makes no sense to seek the best network architecture or learning procedure because it makes no sense to say that any network is efficient by itself: that makes sense only in the context of some class of problems to be solved." From the point of view of restricted classes of problems which can be solved efficiently on neural networks, we have to be optimistic. The psychological limits exist only in our mind. In practice, we have to solve particular, not general, problems.
Acknowledgement: I am grateful to Valeriu Beiu for many valuable comments that helped me to complete this paper.
References:
1. Andonie R. The new computational power of neural networks. Neural Network World 1996; 6: 469-475
2. Aizenstein H and Pitt L. On the learnability of disjunctive normal form formulas. Machine Learning 1995; 19: 183-208
3. Beiu V. Constant fan-in discrete neural networks are VLSI-optimal. Submitted to Neural Processing Letters, June 1996
4. Beiu V and Taylor JG. Optimal mapping of neural networks onto FPGAs: a new constructive learning algorithm. In: J. Mira and F. Sandoval (eds). From Natural to Artificial Neural Computation. Springer-Verlag, Berlin, pp 822-829, 1995
5. Beiu V and Taylor JG. On the circuit complexity of sigmoid feedforward neural networks. Neural Networks 1996; 9: 1155-1171
6. Blum AL and Rivest RL. Training a 3-node neural network is NP-complete. Neural Networks 1992; 5: 117-127
7. Carnevali P and Patarnello S. Exhaustive thermodynamical analysis of Boolean learning networks. Europhys Lett 1987; 4: 1199
8. Chen T, Chen H and Liu RW. Approximation capability in C(R^n) by multilayer feedforward networks and related problems. IEEE Trans Neural Networks 1995; 6: 25-30
9. Chen T and Chen H. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Trans Neural Networks 1995; 6: 911-917
10. Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 1989; 2: 303-314
11. DasGupta B, Siegelmann HT and Sontag E. On the complexity of training neural networks with continuous activation functions. IEEE Trans Neural Networks 1995; 6: 1490-1504
12. de Garis H. Evolvable hardware: genetic programming of a Darwin machine. In: R.F. Albrecht, C.R. Reeves and N.C. Steele (eds). Artificial Neural Nets and Genetic Algorithms. Springer-Verlag, New York, pp 441-449, 1993
13. Franklin SP and Garzon M. Neural computability. In: O. Omidvar (ed). Progress in Neural Networks, vol 1, ch 6. Ablex Publishing Co, Norwood, NJ, 1990
14. Franklin SP and Garzon M. Neural computability II. Submitted, 1994. Extended abstract in: Proceedings 3rd Int Joint Conf on Neural Networks, vol 1, Washington DC, pp 631-637, 1989
15. Gallant S. Neural network learning and expert systems. The MIT Press, Cambridge, Mass, second printing, 1994
16. Garey MR and Johnson DS. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co, San Francisco, 1979
17. Girosi F and Poggio T. Networks and the best approximation property. Biological Cybernetics 1990; 63: 169-176
18. Hartley R and Szu H. A comparison of the computational power of neural network models. In: Proceedings IEEE 1st Int Conf on Neural Networks, vol 3, pp 17-22, 1987
19. Hassoun MH. Fundamentals of artificial neural networks. The MIT Press, Cambridge, Mass, 1995
20. Hecht-Nielsen R. Kolmogorov's mapping neural network existence theorem. In: Proceedings Int Conf on Neural Networks, IEEE Press, New York, vol 3, pp 11-13, 1987
21. Ito Y. Finite mapping by neural networks and truth functions. Math Scientist 1992; 17: 69-77
22. Judd JS. Neural network design and the complexity of learning. The MIT Press, Cambridge, Mass, 1990
23. Judd JS. The complexity of learning. In: M.A. Arbib (ed). The Handbook of Brain Theory and Neural Networks. The MIT Press, Cambridge, Mass, pp 984-987, 1995
24. Kearns MJ and Vazirani UV. An introduction to computational learning theory. The MIT Press, Cambridge, Mass, 1994
25. Keating JK and Noonan D. The structure and performance of trained Boolean networks. In: G. Orchard (ed). Neural Computing (Proceedings of the Irish Neural Networks Conference, Belfast). The Irish Neural Networks Association, Belfast, pp 69-76, 1994
26. Kůrková V. Approximation of functions by perceptron networks with bounded number of hidden units. Neural Networks 1995; 8: 745-750
27. Lapedes AS and Farber RM. How neural networks work. In: Y.S. Lee (ed). Evolution, Learning and Cognition. World Scientific, Singapore, 1988
28. McCulloch W and Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 1943; 5: 115-133
29. Mézard M and Nadal JP. Learning in feedforward layered neural networks: the tiling algorithm. J Phys A 1989; 22: 2191-2203
30. Minsky ML and Papert SA. Perceptrons. The MIT Press, Cambridge, Mass, third printing, 1988
31. Parberry I. Circuit complexity and neural networks. The MIT Press, Cambridge, Mass, 1994
32. Paugam-Moisy H. Optimisation des réseaux de neurones artificiels. Thèse de doctorat, École Normale Supérieure de Lyon, LIP-IMAG, URA CNRS no. 1398, 1992
33. Smieja FJ. Neural network constructive algorithms: trading generalization for learning efficiency? Circuits, Systems, and Signal Processing 1993; 12: 331-374
34. Sontag ED. Feedforward nets for interpolation and classification. J Comp Syst Sci 1992; 45: 20-48
35. Sprecher DA. A numerical implementation of Kolmogorov's superpositions. Neural Networks 1995; 8: 1-8
36. Sprecher DA. A universal construction of a universal function for Kolmogorov's superpositions. Neural Network World 1996; 6: 711-718
37. Wegener I. The complexity of Boolean functions. Wiley-Teubner, Chichester, 1987
38. Wray J and Green GGR. Neural networks, approximation theory, and finite precision computation. Neural Networks 1995; 8: 31-37
39. Šíma J. Back-propagation is not efficient. Neural Networks 1996; 9: 1017-1023