
DYNAMIC NEURAL DIVERSIFICATION: PATH TO COMPUTATIONALLY SUSTAINABLE NEURAL NETWORKS

PAPER ACCEPTED TO ICANN 2021.

Alexander Kovalenko
Faculty of Information Technology
Czech Technical University in Prague
Prague, Czech Republic
[email protected]

Pavel Kordík
Faculty of Information Technology
Czech Technical University in Prague
Prague, Czech Republic
[email protected]

Magda Friedjungová
Faculty of Information Technology
Czech Technical University in Prague
Prague, Czech Republic
[email protected]

arXiv:2109.09612v1 [cs.LG] 20 Sep 2021

ABSTRACT
Small neural networks with a constrained number of trainable parameters can be suitable,
resource-efficient candidates for many simple tasks for which excessively large models are currently
used. However, such models face several problems during the learning process, mainly due to the
redundancy of individual neurons, which results in sub-optimal accuracy or the need for additional
training steps. Here, we explore the diversity of the neurons within the hidden layer during the
learning process and analyze how this diversity affects the predictions of the model. We then
introduce several techniques to dynamically reinforce diversity between neurons during training.
These decorrelation techniques improve learning at early stages and occasionally help to overcome
local minima faster. Additionally, we describe a novel weight initialization method that yields
decorrelated yet stochastic weights for fast and efficient neural network training. In our case,
decorrelated weight initialization shows about a 40% relative increase in test accuracy during the
first 5 epochs.

Keywords Diversification · Negative Correlation · Weight Initialization · Computational Sustainability · Neural Networks

1 Introduction

Over the last decade, machine learning algorithms have achieved vast progress in various fields. In
particular, the general approach called deep neural networks (DNN), with multiple hidden layers [1],
has enabled machine learning algorithms to perform at an acceptable level in many areas, in some
cases outperforming human accuracy [2]. Such progress has, in no small measure, become possible
thanks to the computational capabilities of modern hardware, which enable the training of large DNNs
on an immense amount of data.
On the other hand, even though large models perform very well on complex tasks, we cannot rely on an
endless increase in computational resources and dataset sizes. Training large neural networks is an
energy-, time- and memory-demanding task. Recently, researchers have started questioning the energy
consumption of machine learning algorithms and their carbon footprint [3]. It is therefore worthwhile
to develop a strategy for models that have a constrained number of parameters, sufficient for a given
task, and can be trained fast, rather than chasing higher accuracy by enlarging the number of
parameters and using more complex hardware.
Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks A PREPRINT

The universal approximation theorem [4] states that a feed-forward artificial neural network with a
single hidden layer can approximate any continuous, well-behaved function of an arbitrary number of
variables to any accuracy. The conditions are a sufficient number of neurons in the hidden layer and
a correct selection of weights. The theorem was originally proved for the arbitrary-width case by
Cybenko [4] and Hornik [5] and later extended to the arbitrary-depth case (DNN) in [6].
In this paper we take a closer look at the practical application of Cybenko’s theorem in order to
train a neural network in which all hidden neurons are used efficiently. We therefore have to pay
attention to two aspects: the number of neurons and correct weight selection.
The number of neurons in a hidden layer is a fairly straightforward parameter, and increasing it
became popular with the availability of multi-threaded parallel computing on GPUs [7]. Models with a
vast number of trainable parameters are not devoid of logic, as they generalize better and can act as
so-called ‘universal learners’. GPT-3, with 175 billion parameters, is a prominent example of a
universal learner [8]. Thus, the community has been experimenting with model architectures that
increase the width [6] or depth [9] of neural networks. Issues such as the vanishing gradient [10, 11]
were addressed by methods including second-order Hessian-free optimization [12], training schedules
using greedy layer-wise training [13–15], the sparse rectifier activation function, widely known as
ReLU [16], layer-size-dependent initialization such as Xavier [17] and Kaiming [18], and skip
connections [19]. Even though arbitrarily large models can be made to produce good predictions,
expanding the number of trainable parameters without limit is not the best way to achieve
computational sustainability for tasks of lower complexity. The community has already been trying to
address this problem, and several solutions have emerged. For example, the widely used ReLU
activation function saturates in only one direction, which helps with the vanishing gradient problem
but results in so-called ‘dying neurons’ [20]. Modified activation functions such as Leaky ReLU [21],
adaptive convolutional ReLU [22], Swish [23], Antirectifier [24] and many others were proposed to
solve this problem of the ‘neural graveyard’. Resource-efficient solutions such as pooling operations
[25], LightLayers [26] and depth-wise separable convolutions [27] were developed to reduce the
complexity of the models.
Correct weight selection depends, at first sight, on training parameters such as the loss function,
number of epochs, learning rate, etc. However, to train the neural network competently, the weights
have to be initialized stochastically. There are several ways to initialize weights, mainly aimed at
avoiding vanishing gradients. Nevertheless, stochastic weight initialization can result in neuron
redundancy, where different neurons are trained in a similar manner. This is not crucial if the
neural network is excessively large; in computationally sustainable models, however, neuron
redundancy and ‘neural graveyards’ are undesirable. Moreover, there are numerous applications in
which a memory-efficient model is required (e.g. autonomous devices such as sensors, detectors, and
mobile or portable devices). Such devices require memory- and performance-efficient solutions to
learn spontaneously and improve from experience. In this case, adding excessive parameters to the
model can be questionable for its application.
Once we consider each neuron of the model as an individual learner, the neural network can be seen as
an ensemble. It is known that for ensembles, diversity among the learners is desirable to some extent
[28]. Thus, we can assume that diversity between neurons, or reinforced diversification during
training, can be beneficial for the model.
In this paper we first explore how the diversity between neurons evolves during training and then
suggest methods for diversifying the neurons during model training. This is especially relevant for
resource-constrained models, where neuron redundancy means a reduced number of predictors.
Additionally, we show how weight pre-initialization can affect neural network training in the early
steps.

2 Our Approach

Let us start with the term negative correlation (NC) learning [28], a simple yet elegant technique to
diversify the individual base models in an ensemble and reduce their correlations. The ambiguity
decomposition [29] of the loss function makes it possible to control the trade-off between bias,
variance, and covariance [30] using a strength parameter in order to reduce covariance. In turn, the
concept of NC learning originates from the bias-variance decomposition [31, 32] of ensemble learning.
In this case, bias is the shift of the output from the true value, and variance is the measure of
ensemble ambiguity, which simply means the dispersion around the mean output value.
As first demonstrated by Krogh and Vedelsby [33], the quadratic error of the ensemble prediction is
never greater than the weighted average quadratic error of the individual estimators of the ensemble:

$$(f_{ens} - d)^2 = \sum_i w_i (f_i - d)^2 - \sum_i w_i (f_i - f_{ens})^2 \tag{1}$$
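
As a quick numerical illustration of Eq. 1, the following minimal sketch evaluates both sides on made-up numbers. The member outputs, weights, and target are hypothetical values chosen for the example, and the weights are assumed to be convex (non-negative, summing to 1).

```python
# Numerical check of the ambiguity decomposition (Eq. 1) on made-up numbers.
def ambiguity_decomposition(preds, weights, target):
    """Return (ensemble_error, weighted_member_error, ambiguity)."""
    f_ens = sum(w * f for w, f in zip(weights, preds))
    ensemble_error = (f_ens - target) ** 2
    member_error = sum(w * (f - target) ** 2 for w, f in zip(weights, preds))
    ambiguity = sum(w * (f - f_ens) ** 2 for w, f in zip(weights, preds))
    return ensemble_error, member_error, ambiguity

preds = [0.2, 0.9, 0.6]      # hypothetical member outputs f_i
weights = [0.5, 0.3, 0.2]    # assumed convex weights w_i (sum to 1)
target = 0.7                 # desired output d

ens_err, mem_err, amb = ambiguity_decomposition(preds, weights, target)
```

Because the ambiguity term is a weighted sum of squares, it cannot be negative, which is why the ensemble error cannot exceed the weighted member error.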


Later, Brown [28] demonstrated the decomposition of the ensemble error into three components (bias,
variance and covariance) and showed the connection between ambiguity and covariance:
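$$E\left\{ \frac{1}{M} \sum_i (f_i - d)^2 - \frac{1}{M} \sum_i (f_i - \bar{f})^2 \right\} = \mathrm{bias}^2 + \frac{1}{M} \mathrm{var} + \left(1 - \frac{1}{M}\right) \mathrm{covar} \tag{2}$$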
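
The identity in Eq. 2 can be checked exactly on a small table of outputs if the expectation is replaced by an average over trials. The sketch below assumes the usual definitions (bias, var and covar averaged over the M members, with population statistics), and uses the fact that, for uniform weights, the left-hand side of Eq. 2 equals the expected squared error of the mean output by Eq. 1. The trial data are made up.

```python
# Exact check of the bias-variance-covariance decomposition (Eq. 2),
# with expectations replaced by averages over a made-up set of trials.
def mean(xs):
    return sum(xs) / len(xs)

def decomposition(outputs, target):
    """outputs[t][i] is the output of ensemble member i in trial t."""
    M = len(outputs[0])
    cols = [[row[i] for row in outputs] for i in range(M)]
    mu = [mean(c) for c in cols]  # per-member expected output

    def cov(i, j):
        return mean([(a - mu[i]) * (b - mu[j]) for a, b in zip(cols[i], cols[j])])

    bias = mean([m - target for m in mu])
    var = mean([cov(i, i) for i in range(M)])
    covar = sum(cov(i, j) for i in range(M) for j in range(M) if i != j) / (M * (M - 1))
    # Left-hand side: expected squared error of the uniformly averaged ensemble.
    lhs = mean([(mean(row) - target) ** 2 for row in outputs])
    rhs = bias ** 2 + var / M + (1 - 1 / M) * covar
    return lhs, rhs

outputs = [[0.1, 0.4, 0.7], [0.3, 0.2, 0.9], [0.5, 0.6, 0.4], [0.2, 0.8, 0.6]]
lhs, rhs = decomposition(outputs, target=0.5)
```

The covariance term carries the weight (1 - 1/M), so for larger ensembles it dominates the variance term; this is the quantity that decorrelation aims to reduce.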

The ensemble ambiguity is nothing other than the variance of the weighted ensemble around the weighted
mean. Therefore, higher ambiguity, i.e. decorrelation between the ensemble outputs, is desirable up
to a point.
Our first attempt was to decorrelate neurons in the hidden layer by penalizing the difference between
the mean weight of the neurons w̄ and each neuron w_i:

$$NC = \gamma \frac{1}{n} \sum_i (\bar{w} - w_i) \tag{3}$$
where γ is the regularization strength parameter, and n is the number of neurons in a layer.
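
A minimal sketch of the term in Eq. 3 follows, assuming each neuron is summarised by a single scalar w_i (for instance the mean of its incoming weights; this summary is an assumption). Read literally, the signed deviations sum to zero whenever w̄ is the arithmetic mean of the w_i, so the sketch also exposes an absolute-value variant. That variant is purely illustrative and is not taken from the text; how the term enters the training loss is left to the caller.

```python
# Sketch of the term in Eq. 3; each neuron is summarised by one scalar w_i.
def nc_penalty(neuron_weights, gamma, absolute=False):
    n = len(neuron_weights)
    w_bar = sum(neuron_weights) / n
    if absolute:
        # Assumption (not in the text): use |w_bar - w_i| so terms do not cancel.
        total = sum(abs(w_bar - w) for w in neuron_weights)
    else:
        # Literal reading of Eq. 3: signed deviations from the mean.
        total = sum(w_bar - w for w in neuron_weights)
    return gamma * total / n

w = [0.5, -0.25, 0.75, 0.0]          # hypothetical per-neuron summaries
signed = nc_penalty(w, gamma=1e-3)
spread = nc_penalty(w, gamma=1e-3, absolute=True)
```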
However, it is likely more profitable to compare not only single weights but whole weight matrices
or, e.g., kernels in convolutional neural networks (CNN), as trainable kernels represent patterns.
Thus, the second way to define diversity is to compare neurons by cosine similarity:

$$\frac{1}{D} = \gamma \frac{1}{n} \sum_i \sum_j w_i \cdot w_j \tag{4}$$

where w are weights of individual neurons and D is the diversity measure.
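
The pairwise measure can be sketched as below. Eq. 4 is written with the product w_i · w_j, while the text speaks of cosine similarity; the sketch assumes each product is normalised by the vector lengths, and it sums over all ordered pairs including i = j. Both choices are assumptions. The two weight matrices are toy examples.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pairwise_inverse_diversity(W, gamma):
    """1/D as in Eq. 4, over all ordered pairs (i, j): n * n similarity evaluations."""
    n = len(W)
    return gamma * sum(cosine(wi, wj) for wi in W for wj in W) / n

identical = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]   # fully redundant neurons
orthogonal = [[1.0, 0.0], [0.0, 1.0]]              # fully decorrelated neurons

redundant_value = pairwise_inverse_diversity(identical, gamma=1.0)
diverse_value = pairwise_inverse_diversity(orthogonal, gamma=1.0)
```

With identical neurons every pair contributes 1, so 1/D grows with the layer size; with orthogonal neurons only the i = j terms remain.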

In this technique we compare all weights in the layer pairwise and define a diversity measure D.
However, this expression has quadratic complexity, which would oppose the idea of the current work,
as our intent is fast and efficient training of resource-constrained neural networks.
Therefore, combining the first two approaches, we introduce and explore another method to define
diversity in neural networks:

$$\frac{1}{D} = \gamma \frac{1}{n} \sum_i \bar{w} \cdot w_i \tag{5}$$
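
The following sketch contrasts the cost of the two measures under the assumption that the products are plain (unnormalised) dot products. In that case the sum in Eq. 5 equals the double sum in Eq. 4 divided by n, because the mean vector is itself a sum over neurons, so the linear-time form carries the same information. If each product is instead normalised to a cosine, the two measures generally differ. The weight matrix is a made-up example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vector(W):
    """Element-wise mean of the neurons' weight vectors (w_bar)."""
    n = len(W)
    return [sum(col) / n for col in zip(*W)]

def mean_inverse_diversity(W, gamma):
    """1/D as in Eq. 5 with plain dot products: one pass over the n neurons."""
    w_bar = mean_vector(W)
    return gamma * sum(dot(w_bar, wi) for wi in W) / len(W)

def pairwise_inverse_diversity_dot(W, gamma):
    """1/D as in Eq. 4 with plain dot products: n * n terms."""
    return gamma * sum(dot(wi, wj) for wi in W for wj in W) / len(W)

W = [[0.5, -1.0, 0.25], [1.5, 0.5, -0.75], [-0.5, 2.0, 1.0], [0.0, 0.25, 0.5]]
n = len(W)
linear = mean_inverse_diversity(W, gamma=1.0)
quadratic = pairwise_inverse_diversity_dot(W, gamma=1.0)
```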

After observing the training process and the evolution of the diversity measure in the models, we
explored the possibility of weight pre-optimization using diversification. In this case, we used
Kaiming weight initialization with further optimization to enlarge the diversity between the weights
while keeping the mean and standard deviation of the weight matrix close to their initial values:

$$L = \left( |\bar{W} - \bar{w}_k| + |\sigma_W - \sigma_{w_i}| \right) \sum_i \sum_j w_i \cdot w_j \tag{6}$$

where L is the loss, W̄ is the initial weight mean, w̄_k is the weight mean at training step k, σ_W is
the standard deviation of the initial weights array, and σ_{w_i} is the standard deviation of the
weights array at training step k.
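
A minimal sketch that evaluates Eq. 6 as printed is given below. It assumes plain dot products in the double sum and population (not sample) standard deviations over the flattened weight matrix; both are assumptions, and the function name and toy matrices are hypothetical. The optimization loop itself (gradient steps on this loss) is not shown.

```python
import statistics

def flatten(W):
    return [x for row in W for x in row]

def preinit_loss(W_init, W_k):
    """Eq. 6 as printed: (|mean drift| + |std drift|) * sum_i sum_j w_i . w_j."""
    init, cur = flatten(W_init), flatten(W_k)
    mean_drift = abs(statistics.fmean(init) - statistics.fmean(cur))
    std_drift = abs(statistics.pstdev(init) - statistics.pstdev(cur))
    similarity = sum(sum(a * b for a, b in zip(wi, wj)) for wi in W_k for wj in W_k)
    return (mean_drift + std_drift) * similarity

W0 = [[1.0, 0.0], [0.0, 1.0]]                               # toy initial weights
unchanged = preinit_loss(W0, W0)                            # statistics match exactly
drifted = preinit_loss(W0, [[1.0, 1.0], [1.0, 1.0]])        # identical, shifted neurons
```

A direct consequence of the printed product form is that the value is zero whenever the current mean and standard deviation match the initial ones, as in the first call.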

3 Experiments
We performed initial experiments using a DNN in order to study how diversity evolves during model
training and to demonstrate the effectiveness of the proposed diversification mechanisms.

[Figure 1 (plot): test accuracy (left axis) and 1/D (right axis) versus epoch.]
Figure 1: Training curve and diversity measure (Eq. 4) for the first 50 epochs on the Fashion MNIST
dataset. DNN with 1 hidden layer of 32 neurons.

The experiments were performed on the publicly available benchmark dataset Fashion MNIST [34]. This
dataset was chosen because it is suitable for DNN training and has higher variance than the
traditional hand-written digits dataset MNIST [35]. We implemented a one-hidden-layer neural network
with 16, 32, 64, 128, and 256 neurons in the hidden layer (see Table 1) using the PyTorch [36]
library. Otherwise, we used standard training parameters, including the Adam optimizer [37] with a
learning rate of 0.01 and a cross-entropy loss function with penalization terms (Eq. 3-5):

$$H(T, p) = -\sum_{i=1}^{N} \frac{1}{N} \log_2 q(x_i) + \frac{1}{D} \tag{7}$$

where T represents the training set, p is the true distribution, q is the predicted distribution, N is
the number of samples in the training set, q(x) is the probability of event x estimated from the
training set, and D is the diversity measure obtained using Eq. 3, 4, or 5.
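
The penalized loss of Eq. 7 can be sketched as follows, assuming q(x_i) is the predicted probability of the true class for sample i and that the term 1/D has already been computed by one of Eq. 3-5. The function name and the numbers are hypothetical.

```python
import math

def penalized_cross_entropy(q_true_class, inverse_diversity):
    """Eq. 7: mean negative log2-probability of the true class plus 1/D."""
    n = len(q_true_class)
    cross_entropy = -sum(math.log2(q) for q in q_true_class) / n
    return cross_entropy + inverse_diversity

# Two samples whose true classes received probability 0.5 and 0.25.
loss = penalized_cross_entropy([0.5, 0.25], inverse_diversity=0.1)
```

Here the cross-entropy part is 1.5 bits and the penalty adds 0.1, which shows why the penalty must stay small relative to the data term if the data term is to drive training.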

4 Results and Discussion

4.1 Evolving Diversity and Symmetry Breaking

During model training, one can notice the accuracy stagnating at a sub-optimal level for several
epochs, which can be associated with the existence of local minima on the loss function surface
[38, 39]. This in turn can be associated with a symmetry in the neural network layer, which has been
shown to be a critical point, especially for small neural networks [40, 41]. We found that the model
naturally tends to decrease the correlation between the neurons. However, when the model converges to
a local minimum with sub-optimal accuracy, the similarity between the neurons rises until the
optimization process escapes the local minimum and the accuracy increases (see Figure 1). This
correlates with the existence of symmetry in the weights: once the weights are symmetrical
(correlated) and the number of neurons is constrained, the overall output of the model is likely to
be inefficient.
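
The cost of symmetric weights in a small layer can be illustrated on a toy network: two hidden neurons with identical incoming weights compute the same activation, so together they contribute no more than a single neuron with summed outgoing weights. The sketch below assumes a one-hidden-layer ReLU network with a scalar output and no biases; all numbers are made up.

```python
def relu(x):
    return max(0.0, x)

def forward(x, hidden, out):
    """One-hidden-layer network, scalar output; hidden[i] is neuron i's weight vector."""
    return sum(o * relu(sum(w * xi for w, xi in zip(h, x))) for h, o in zip(hidden, out))

x = [1.0, -2.0, 0.5]
w = [0.5, 0.25, 1.0]

# Two perfectly correlated neurons versus one neuron with the summed output weight.
two_copies = forward(x, hidden=[w, w], out=[0.75, -0.25])
one_neuron = forward(x, hidden=[w], out=[0.5])
```

In a layer of 16 or 32 neurons, each such duplicate removes a noticeable share of the available capacity.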

4.2 Negative Correlation Learning

The experiment above inspired us to study ways to decorrelate neurons in the hidden layer and thus
break the symmetry that can appear during the learning process. As discussed earlier, we consider the
output of the neural network as the output of an ensemble. Thus, we first applied simple NC learning
to the individual neurons rather than to an ensemble of classifiers. The logic behind this experiment
is straightforward: once the model has a constrained number of parameters with which to generalize
the data, higher variance should help to eliminate redundant neurons, and the overall prediction
should be more accurate. As can be seen from Figure 2, the decorrelation mechanism helps to avoid
local minima at the early stage of model learning. Nevertheless, decorrelation using NC learning
generally did not result in higher accuracy overall. We attribute this to several factors, such as
Kaiming weight initialization, which helps to avoid vanishing gradients, and the Adam optimizer, an
optimization algorithm that can handle sparse gradients on noisy data and is thus able to overcome
local minima efficiently thanks to adaptive learning rates for each parameter. Even though these
widely used techniques already deal with the above-mentioned problem of neuron redundancy, our
proposed method can help at the early stages of model training.

Table 1: First 10 epochs average of the neural network training for various numbers of neurons,
hidden layer diversified according to Eq. 5. Values are test accuracy, %.
γ            Number of neurons
             16       32       64       128      256
0.0          54.19    58.25    62.46    69.62    72.10
1 · 10^-5    55.17    60.17    62.45    68.64    70.65
1 · 10^-4    56.41    61.25    64.13    70.32    72.27
1 · 10^-3    54.48    60.81    65.04    70.45    72.83
1 · 10^-2    53.54    60.04    63.19    70.26    72.36
1 · 10^-1    54.22    59.46    62.23    70.20    71.64
1.0          50.09    57.49    60.53    69.84    71.65

[Figure 2 (plot): three panels of accuracy versus epoch, each comparing the baseline with one γ value
(1e-05, 0.01, and 1.0, respectively).]
Figure 2: Validation accuracy training curves of the model with various γ values.

Moreover, the influence of decorrelation diminishes with an increasing number of neurons. This can be
explained by the fact that an excessively large NN performs well on low-variance data, and not every
neuron is needed for a good prediction. In the present work, however, we consider computationally
sustainable DNNs, in which all neurons are forced to contribute to the prediction; on the other hand,
for complex data a larger number of neurons would be needed to generalize the dataset. Therefore, for
more sophisticated problems, neuron diversification may be effective for a larger number of neurons.
In the present case, we performed further experiments on the model with 64 neurons in the hidden
layer, which we consider sufficient for the given dataset. All models were trained 10 times to
calculate the mean and standard deviation. Table 2 shows the average test accuracy over the first 10
epochs for the DNN with 64 neurons in the hidden layer trained using negative correlation learning
(Eq. 3).

Table 2: First 10 epochs average of the neural network training, hidden layer diversified according to Eq. 3.
γ            Train Acc., %    Test Acc., %    Test Acc. STD
0.0          61.46            62.46           2.34
1 · 10^-5    62.26            62.45           2.34
1 · 10^-4    63.65            64.13           1.59
1 · 10^-3    63.12            65.04           1.76
1 · 10^-2    62.54            63.19           1.03
1 · 10^-1    64.23            62.23           0.95
1.0          59.56            60.53           1.57


4.3 Pairwise Cosine Similarity Diversification

It should be noted that, unlike in [28], where a universal diversification strength parameter was
found for ensembles of all sizes, in our case the value of γ depends on the size of the hidden layer
and should rather be considered as γ per neuron. On the other hand, it is loss-dependent, which means
that, ideally, it has to be of the same order of magnitude as, or one order smaller than, the output
of the loss function during training; otherwise the reciprocal diversity measure 1/D will be
optimized rather than the model loss (e.g. cross entropy). The reader therefore has to consider
optimizing γ for each particular neural network and loss function. The optimal γ can be estimated
approximately as:

$$\gamma = \frac{0.5 \cdot 10^{b_{loss}}}{n} \tag{8}$$

where n is the number of neurons in the hidden layer and b_loss is the loss function order of magnitude.
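
Eq. 8 can be turned into a small helper, sketched below. The text does not define precisely how the order of magnitude b_loss is obtained; the sketch assumes it is the floor of the base-10 logarithm of a typical loss value, which is one possible reading. The function name and the example loss are hypothetical.

```python
import math

def estimate_gamma(loss_value, n_neurons):
    """Eq. 8, with b_loss taken as floor(log10(loss)); that reading is an assumption."""
    b_loss = math.floor(math.log10(loss_value))
    return 0.5 * 10 ** b_loss / n_neurons

# Hypothetical example: a loss around 0.8 gives b_loss = -1; 64 hidden neurons.
gamma = estimate_gamma(loss_value=0.8, n_neurons=64)
```

Such an estimate is best treated as a starting point for the search over γ rather than a final setting.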
In addition to NC learning, we introduced a diversity measure based on cosine similarity between the
neurons (Eq. 4). Such a technique seems promising for several reasons. First, we compare patterns
rather than mean values, which can be useful for more complex models such as CNNs or transformers.
Moreover, each neuron is compared with every other neuron, so such a model is intended to be more
robust. Nevertheless, at least for the DNN, the results were comparable with NC learning (see Table
3). Additionally, this method has quadratic complexity, which opposes our initial aim of training
small models faster and more efficiently.

Table 3: First 10 epochs average of the neural network training, hidden layer diversified according to Eq. 4.
γ            Train Acc., %    Test Acc., %    Test Acc. STD
0.0          55.94            62.83           1.18
5 · 10^-8    56.52            64.20           1.11
5 · 10^-7    57.98            65.76           0.92
5 · 10^-6    55.48            59.96           0.71
5 · 10^-5    56.47            49.20           1.52
5 · 10^-4    56.47            44.60           1.06
5 · 10^-3    42.66            38.61           1.10

4.4 Reaching Linear Complexity

To enable our diversification method to compare patterns while avoiding quadratic complexity, we
combined the first concept of NC learning with the second one and implemented a diversity measure
based on penalizing the cosine similarity between each neuron in the hidden layer and the mean of the
layer’s neurons (Eq. 5). The overhead of the algorithm (see Table 4) is comparable with L
regularization. Moreover, it has shown the highest accuracy gain among the three methods.

Table 4: First 10 epochs average of the neural network training, hidden layer diversified according to Eq. 5.
γ            Train Acc., %    Test Acc., %    Test Acc. STD
0.0          61.54            63.15           2.08
5 · 10^-7    62.37            63.3            1.63
5 · 10^-6    63.60            64.54           1.25
5 · 10^-5    64.87            64.95           1.66
5 · 10^-4    60.36            62.14           1.66
5 · 10^-3    52.54            55.86           0.45
5 · 10^-2    41.26            42.71           1.32

4.5 Iterative Diversified Weight Initialization

However, it can be noticed that occasionally, during training, the model does not behave exactly as
expected, producing outlying learning curves. This is most likely associated with stochastic weight
initialization. In this case, Kaiming initialization is used [18]. Kaiming initialization is widely
used for neural networks with ReLU activation functions and is related to the nonlinearity of the
ReLU activation function, which makes it non-differentiable at x = 0. The weights in this case are
initialized stochastically with a variance that depends on the number of neurons N:


$$v^2 = \frac{2}{N} \tag{9}$$
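
As a minimal sketch of Eq. 9, the code below draws zero-mean Gaussian weights with variance 2/N and checks the sample variance. It assumes that N is the number of inputs to each neuron (the fan-in, as in [18]) and uses 784 inputs, i.e. flattened 28 x 28 images, with 64 hidden neurons; the function name and the seed are arbitrary choices for the example.

```python
import math
import random

def kaiming_normal(n_out, n_in, rng):
    """Sample an n_out x n_in weight matrix with zero mean and variance 2 / n_in."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)                    # fixed seed for reproducibility
W = kaiming_normal(n_out=64, n_in=784, rng=rng)

flat = [w for row in W for w in row]
sample_var = sum(w * w for w in flat) / len(flat)   # second moment around the known mean 0
target_var = 2.0 / 784
```

Each weight is drawn independently, so nothing in this scheme controls the similarity between the resulting neurons.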

It is fair to suggest that the correlation between the initialized weights can play a significant
role in the model learning process. Indeed, Figure 1 clearly shows that the model gained most of its
accuracy while reducing the correlation between neurons during the first few epochs. However, the aim
of weight initialization is to prevent layer activation outputs from exploding or vanishing during a
forward pass through a deep neural network. Usually, weights are initialized stochastically with
small values to avoid vanishing gradients, especially if tanh or sigmoid activation functions are
used. Thus, to obtain stochastically initialized yet decorrelated weights, we introduced iteratively
diversified weight initialization, using a custom loss function based on Eq. 6. The logic behind such
an initialization is to reduce the similarity between the weights, i.e. to increase their diversity,
while keeping the weight mean w̄ and the weight standard deviation σ_w close to those originally
obtained with Kaiming initialization.

Table 5: First 5 epochs average of the neural network training initialized with decorrelated weights
according to Eq. 6, pre-optimized for 5 epochs.
γ            Train Acc., %    Test Acc., %    Test Acc. STD
0.0          29.54            34.23           2.04
1 · 10^-4    42.43            43.43           1.81
1 · 10^-3    43.92            45.65           1.53
1 · 10^-2    44.65            47.01           1.24
1 · 10^-1    38.32            39.83           1.06
1.0          36.64            38.94           1.37
10.0         32.5             37.57           1.41
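
The relative gain implied by Table 5 can be worked out directly. The short calculation below assumes that the γ = 0.0 row serves as the baseline and compares it with the best row (γ = 1 · 10^-2); it gives roughly 37%, in line with the figure of about 40% quoted in the abstract.

```python
baseline_test_acc = 34.23   # Table 5, gamma = 0.0 (assumed to be the baseline)
best_test_acc = 47.01       # Table 5, gamma = 1e-2

# Relative increase in test accuracy, in percent.
relative_gain = (best_test_acc - baseline_test_acc) / baseline_test_acc * 100
```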

5 Conclusion

In this paper we show how to explore and tame the diversity of neurons in the hidden layer. We
studied how the correlation between the neurons evolves during training and how it affects prediction
accuracy. It appears that once the model has converged to a local minimum on the loss landscape, the
correlation between the neurons increases up to the point at which the optimization process overcomes
the local minimum. We therefore introduced three methods to dynamically reinforce diversification and
thus decorrelate a neural network layer. The concept of negative correlation suggested by Brown [28]
was reviewed and expanded. Instead of decorrelating individual neural networks in an ensemble, we
diversified neurons in the hidden layer using three techniques: negative correlation learning,
pairwise cosine similarity, and cosine similarity around the mean.
The first technique originates from neural network ensembles and shows decent performance in our
example using a DNN. However, for more sophisticated models, such as CNNs and transformers, the
second and third techniques are likely to be more advantageous, since they can compare patterns.
Additionally, to reach correct weight selection, we introduced iterative weight optimization using
weight diversification. We showed that such techniques are suitable for the fast training of small
models and notably affect their accuracy at the early stage. This is a small yet important step
towards a strategy for energy-efficient training of neural networks.
Our future plans for neural network diversification primarily consist of applying the diversification
techniques described above to more sophisticated models in order to explore the possibility of
improving training speed and reducing the number of trainable parameters. Popular architectures such
as transformers can benefit from diversification of the individual heads in the multi-head attention
block, since multiple heads are intended to learn various representations. Furthermore, we plan to
explore more pattern-oriented techniques for defining diversity between neurons to enable efficient
application of diversification in CNNs.

Acknowledgment

This research is supported by the Czech Ministry of Education, Youth and Sports from the Czech
Operational Programme Research, Development, and Education, under grant agreement No.
CZ.02.1.01/0.0/0.0/15003/0000421 and the Czech Science Foundation (GAČR 18-18080S).


References
[1] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks.
Science, 313(5786):504–507, 2006.
[2] Dan Claudiu Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber. Flexible,
high performance convolutional neural networks for image classification. In Twenty-second international joint
conference on artificial intelligence, 2011.
[3] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions
of machine learning. arXiv preprint arXiv:1910.09700, 2019.
[4] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and
systems, 2(4):303–314, 1989.
[5] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257,
1991.
[6] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks:
A view from the width, 2017.
[7] Lodovico Marziale, Golden G. Richard, and Vassil Roussev. Massive threading: Using gpus to increase the
performance of digital forensics tools. Digital Investigation, 4:73 – 81, 2007.
[8] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[9] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions, 2014.
[10] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of
learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent
Neural Networks. IEEE Press, 2001.
[11] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions.
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.
[12] James Martens and Ilya Sutskever. Learning recurrent neural networks with hessian-free optimization. In ICML,
2011.
[13] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks.
Neural Computation, 4(1):131–139, 1992.
[14] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural
computation, 18(7):1527–1554, 2006.
[15] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust
features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning,
pages 1096–1103, 2008.
[16] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the
fourteenth international conference on artificial intelligence and statistics, pages 315–323. JMLR Workshop and
Conference Proceedings, 2011.
[17] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256.
JMLR Workshop and Conference Proceedings, 2010.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision,
pages 1026–1034, 2015.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
[20] Lu Lu. Dying relu and initialization: Theory and numerical examples. Communications in Computational Physics,
28(5):1671–1706, Jun 2020.
[21] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional
network. arXiv preprint arXiv:1505.00853, 2015.


[22] Hongyang Gao, Lei Cai, and Shuiwang Ji. Adaptive convolutional relus. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 34, pages 3914–3921, 2020.
[23] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint
arXiv:1710.05941, 2017.
[24] Ben Luijten, Regev Cohen, Frederik J de Bruijn, Harold AW Schmeitz, Massimo Mischi, Yonina C Eldar, and
Ruud JG van Sloun. Deep learning for fast adaptive beamforming. In ICASSP 2019-2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1333–1337. IEEE, 2019.
[25] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling operations in convolutional
architectures for object recognition. In International conference on artificial neural networks, pages 92–101.
Springer, 2010.
[26] Debesh Jha, Anis Yazidi, Michael A. Riegler, Dag Johansen, Håvard D. Johansen, and Pål Halvorsen. Lightlayers:
Parameter efficient dense and convolutional layers for image classification, 2021.
[27] François Chollet. Xception: Deep learning with depthwise separable convolutions, 2017.
[28] Gavin Brown. Diversity in neural network ensembles. Technical report, 2004.
[29] Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE transactions on pattern analysis and
machine intelligence, 12(10):993–1001, 1990.
[30] Naonori Ueda and Ryohei Nakano. Generalization error of ensemble estimators. In Proceedings of International
Conference on Neural Networks (ICNN’96), volume 1, pages 90–95. IEEE, 1996.
[31] Yijun Bian and Huanhuan Chen. When does diversity help generalization in classification ensembles?, 2021.
[32] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging
weights leads to wider optima and better generalization, 2019.
[33] Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning.
Advances in neural information processing systems 7, 7:231, 1995.
[34] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine
learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[35] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
[36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen,
Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito,
Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala.
Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information
Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
[38] Akarachai Atakulreka and Daricha Sutivong. Avoiding local minima in feedforward neural networks by
simultaneous learning. In Mehmet A. Orgun and John Thornton, editors, AI 2007: Advances in Artificial
Intelligence, pages 100–109, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
[39] Grzegorz Swirszcz, Wojciech Marian Czarnecki, and Razvan Pascanu. Local minima in training of neural
networks, 2017.
[40] Yossi Arjevani and Michael Field. Symmetry and critical points for a model shallow neural network, 2020.
[41] Kshitij Tayal, Chieh-Hsin Lai, Vipin Kumar, and Ju Sun. Inverse problems, deep learning, and symmetry breaking,
2020.

