
Future Generation Computer Systems 118 (2021) 48–55


Effects of hidden layer sizing on CNN fine-tuning



Stefano Marrone, Cristina Papa, Carlo Sansone
DIETI - University of Naples Federico II, Via Claudio 21, 80125 Napoli, Italy

Article info

Article history:
Received 1 February 2020
Received in revised form 5 December 2020
Accepted 24 December 2020
Available online 28 December 2020

Keywords:
CNN
Adversarial perturbation
Fine-tuning
Approximate computing

Abstract

Some applications have the property of being resilient, meaning that they are robust to noise (e.g. due to errors) in the data. This characteristic is very useful in situations where an approximate computation allows the task to be performed in less time, or the algorithm to be deployed on embedded hardware. Deep learning is one of the fields that can benefit from approximate computing to reduce its high number of parameters, thanks to its impressive generalization ability. A common approach is to prune some neurons and perform an iterative re-training, with the aim of both reducing the required memory and speeding up the inference stage. In this work we face CNN size reduction from a different perspective: instead of reducing the network weights or looking for an approximated network very close to the Pareto frontier, we investigate whether it is possible to remove some neurons, only from the fully connected layers and before the network training, without substantially affecting the network performance. As a case study, we focus on "fine-tuning", a branch of transfer learning that has shown its effectiveness especially in domains lacking effective expert-designed features. To further compact the network, we apply weight quantization to the convolutional kernels. Results show that it is possible to tailor some layers to reduce the network size, both in terms of the number of parameters to learn and of the required memory, without statistically affecting the performance and without the need for any additional training. Finally, we investigate to what extent the sizing operation affects the network robustness against adversarial perturbations, a set of approaches aimed at misleading deep neural networks.

© 2020 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (S. Marrone).
https://doi.org/10.1016/j.future.2020.12.020

1. Introduction

One of the common misconceptions in numerical computing is related to the concepts of "correct" and "approximate" computation: the former term is often erroneously (at least in the numerical computing field) used as a synonym for "closed-form solution", while the latter tends to be perceived as "not precise" or "roughly estimated". To avoid falling into this trap, it is important to keep in mind that when a natural signal (e.g. an image, a sound, an earthquake trace, etc.) needs to be processed by means of a computer, the first step is to make it discrete (e.g. by signal sampling). Working on a discrete version of the problem does not necessarily result in a worse solution. For example, considering the case of measuring the area under a curve (the integral of a function), the discrete solution tends to the analytic one as the discretization becomes finer. Thus, part of the solution design is the choice of the discretization level, based on the desired precision. On the other hand, some applications have the intriguing property of being resilient, meaning that they are robust to noise (e.g. due to errors) in the data. This characteristic is very useful in situations where an approximate computation (e.g. representing rational numbers with single-precision floating-point variables instead of double-precision ones) allows the computation to be performed in less time or to be deployed on embedded hardware [1].

Deep Learning (DL) is one of the fields that can benefit from approximate computing since, once trained, deep neural networks show an impressive generalization ability. For example, Convolutional Neural Networks (CNNs) proved to be resilient to approximation in the data [2] or to intrinsic randomness in the training procedure [3]. A common approach to approximate a CNN is to iteratively prune some neurons [4], with the aim of both reducing the required memory and speeding up the inference stage. The flip side of the coin is the need for an iterative re-training procedure to deal with the information loss caused by the removal of trained convolutional weights. Another strategy is to quantize the learnt weights [5] to save bits of entropy used only to process rare samples. In this case, the limitation is that the obtained network is not necessarily the most compact possible one, since quantization operates on all the weights similarly. In a relatively recent work [6], the authors propose an interesting approximate approach exploiting software mutants to explore
the solution space, looking for the solutions lying on the Pareto frontier. Unfortunately, this approach is extremely computationally demanding, making it infeasible in the case of CNNs.

In this work we propose to face CNN size reduction from a different perspective: instead of reducing the weights and re-training the network, or looking for an approximated network very close to the Pareto frontier, we investigate whether it is possible to remove some neurons before the network training without substantially affecting the network performance. Moreover, we also analyse the impact that weight quantization has when applied to the convolutional kernels of a "sized" CNN. As training scenario, we focus on "fine-tuning", a branch of transfer learning that has demonstrated its effectiveness especially in domains lacking effective expert-designed features or huge annotated datasets [7].

The rest of the paper is organized as follows: Section 2 introduces the concepts behind the proposed approach; Section 3 investigates the effectiveness of the proposed approach by evaluating its performance on three different datasets; Section 4 analyses the effects of network sizing on adversarial perturbations; finally, Section 5 draws some conclusions.

2. Hidden layer sizing

Convolutional Neural Networks (CNNs) are very similar to traditional (shallow) Artificial Neural Networks (ANNs): they are both made of neurons, usually organized in layers, connected to form a network in which the output of a neuron is the input of some others. This structure makes it possible to learn how to transform the input data into the desired output (e.g. a class for a classification problem or a value for a regression one). However, while shallow neural networks operate on features designed and extracted by a domain expert (Fig. 1(a)), CNNs use a hierarchy of convolution operations (whose kernel weights are learned in the very same way classical neuron weights are) to autonomously extract the features that better model the problem under analysis (Fig. 1(b)).

Fig. 1. Comparison between a shallow and a convolutional neural network architecture. (a) A representation of the classical approach, in which some features (in the middle) are extracted from the input data (on the left) and used to train a shallow neural network (a multilayer perceptron, on the right); (b) an example of a CNN using a hierarchy of (in the example, 2) convolutional layers (in the middle) to autonomously learn the best representation of the input data (on the left) to train a fully connected neural network (on the right).

All CNNs of the type described in Fig. 1(b) can be seen as a stack of neurons specialized for feature engineering (the ones in the convolutional layers) and neurons intended for the classification task (the ones in the fully connected layers). According to this point of view, the fully connected layers can be seen as a standard multilayer perceptron (MLP) classifier that relies on the features extracted by the convolutional layers (although it is worth noting that all layers are trained together). This distinction is important since convolutional and fully connected layers have a very different number of parameters. Using AlexNet [8] as an example, it has 2,334,080 parameters in the convolutional layers (∼4% of the total) and 58,631,144 parameters in the fully connected ones (∼96% of the total). The reason is that AlexNet was originally intended to handle the 1000 classes of ImageNet; therefore, after feature engineering, a great representational capacity is required.

Supervised learning theory says that, to reduce over-fitting, a huge number of annotated samples is required to estimate millions of parameters [9]. Unfortunately, having access to a huge number of labelled samples is not always viable. To address this problem, an increasingly popular solution is "Transfer Learning", a term referring to transferring the knowledge learnt in one task to another one. In particular, there are two possible ways of leveraging knowledge from a pre-trained network (Fig. 2): fine-tuning the net to adapt it to the new task, or directly using the network as a feature extractor. It is important to highlight that, while the feature extractor approach does not change the network structure (it just exploits the net's inherent hierarchical representation), fine-tuning implies a change in the architecture, without any concern about if and to what extent this alteration can impact the efficiency and effectiveness of the net.

Indeed, fine-tuning is usually applied to problems having a smaller number of classes, causing a bottleneck in the network structure: e.g. in the case reported in Fig. 2, the 4096 neurons in the fc7 layer (originally intended to feed the 1000-class output layer) are connected to a 2-neuron output layer. Besides the waste of representational capacity, this simplistic approach has three main drawbacks:

• it can introduce an unnecessary additional computational and memorization burden (with a direct impact on the network performance and on the required system characteristics);
• it can cause the network to focus more on the noise within the images rather than on other salient aspects;
• it may require more training samples to converge (due to the higher number of parameters to fit).

Thus, in view of network size reduction, in this work we propose to further adapt deep CNNs by performing a sizing of the hidden layers (i.e. those between the input and the output layers) when using the fine-tuning strategy. With sizing, we refer to the use of a suitable strategy to reduce the number of neurons used, without significantly affecting the network performance. This can be achieved in two different ways: by reducing the synapses (i.e. the connections between neurons), or by removing whole neurons. Here we analyse the latter approach, focusing in particular only on neurons in the fully connected layers since, as previously discussed, they are the most numerous ones.
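The imbalance between convolutional and fully connected parameters can be checked directly. The following minimal sketch (an illustration, not code from the paper) counts both groups for the torchvision implementation of AlexNet; the exact totals differ slightly from the figures above because torchvision's variant does not use the original grouped convolutions, but the roughly 4% versus 96% split is the same.

```python
import torchvision

# Build the AlexNet architecture (no pretrained weights are needed just to count parameters).
model = torchvision.models.alexnet()

# In torchvision, `features` holds the convolutional part and `classifier` the fully connected part.
conv_params = sum(p.numel() for p in model.features.parameters())
fc_params = sum(p.numel() for p in model.classifier.parameters())
total = conv_params + fc_params

print(f"convolutional:   {conv_params:,} ({100 * conv_params / total:.1f}% of the total)")
print(f"fully connected: {fc_params:,} ({100 * fc_params / total:.1f}% of the total)")
```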
Fig. 2. A representation of transfer learning using AlexNet. From left to right: the net input (a 227 × 227 pixel, 3-channel RGB image), the convolutional layers, the fully connected layers and the adaptations needed to perform transfer learning. In particular, on the top right side there is an example of how to fine-tune AlexNet for a binary classification task, while on the bottom right side there is an illustration showing the use of the output of the last hidden layer as the feature set for an external classifier (i.e. using AlexNet as a feature extractor).

2.1. Choosing the number of neurons

Designing an artificial neural network is no trivial matter. On one hand, the universal approximation theorem states that a single-hidden-layer (with an infinite number of neurons) feed-forward neural network can approximate any continuous function. On the other, an improper network design can easily result in under/over-fitting (Fig. 3).

Fig. 3. Illustrative representation of how the number of neurons affects the decision boundaries for a two-class problem, in a single-hidden-layer multilayer perceptron. For each box, the image on the left refers to training samples, while the image on the right refers to validation samples. The number in the box represents the number of neurons in the hidden layer: 2 causes underfitting; 100 results in overfitting; finally, 5 is a good value. Source: images taken from Duda et al. [10].

Over the years, authors have proposed different approaches to optimize neural network design [11–13]. Most adopted solutions rely on optimization-inspired meta-heuristics, such as Particle Swarm Optimization [14,15], operational research algorithms [16–18], information theory [19,20], and so on. However, in most cases, these approaches are not general enough or are very computationally demanding, with the latter aspect becoming more critical as the network size increases. Therefore, it is very common to adopt more practical solutions based on some "rules of thumb" intended to guide the ANN design process. Among all, in this work we consider two approaches, chosen for their simplicity and popularity (both are illustrated in the short sketch after the list):

• to use as many neurons as the average between the number of input and output neurons, hereafter referred to as A-Rule [21];
• to use n = 2 × √((m + 2) × i) neurons, where m is the number of classes and i is the number of input neurons, hereafter referred to as Huang [22].
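As a concrete illustration of the two rules, the snippet below (a sketch, not taken from the paper) computes the suggested layer width for a hypothetical fine-tuning case with i = 4096 input neurons and m = 2 classes.

```python
import math

def a_rule(num_inputs: int, num_outputs: int) -> int:
    """A-Rule: average between the number of input and output neurons."""
    return round((num_inputs + num_outputs) / 2)

def huang_rule(num_inputs: int, num_classes: int) -> int:
    """Huang: n = 2 * sqrt((m + 2) * i), with m classes and i input neurons."""
    return round(2 * math.sqrt((num_classes + 2) * num_inputs))

# Example: sizing a layer fed by AlexNet's 4096-neuron fc6 for a binary task.
i, m = 4096, 2
print("A-Rule:", a_rule(i, m))      # 2049 neurons
print("Huang: ", huang_rule(i, m))  # 256 neurons
```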

2.2. Neuron pruning

Artificial neural networks (both shallow and deep) have shown a great generalization ability in a wide variety of tasks. Their representational capacity is closely linked to the number of artificial neurons they contain and to their interconnections. However, artificial neural networks tend to quickly become massive (both in terms of memory and of computational effort) as the number of neurons increases, with deep architectures requiring hardware acceleration even for inference tasks [23]. This limit is becoming even more severe with the spread of edge computing devices and embedded architectures designed to run AI-based applications.

Focusing on the architecture of CNNs, from a computational perspective the convolutional layers have the largest impact. Thus, a possible solution to limit the required computational power is to reduce the number of neurons in the convolutional layers. However, this operation is not straightforward, since the neurons in the convolutional layers are those in charge of learning how to perform feature extraction. To face this problem, an increasingly used approach is "neuron pruning", a strategy that, inspired by tree pruning techniques, aims at removing some neurons after the training stage. Although it is not something new [24], it is only with the spread of deep architectures that it started to be further investigated. One of the first works along this line [25] proposes an unstructured pruning strategy based on the use of a predefined threshold. In the same year, in Han et al. [5] the authors proposed a multi-staged iterative approach leveraging pruning, quantization, Huffman coding and re-training to compress a CNN up to 49x. The year after, in Molchanov et al. [4] the authors introduced an iterative strategy alternating pruning and fine-tuning to optimize the network architecture while limiting the impact on its performance. In the same vein, in Anwar et al. [26] the authors propose a structured pruning approach leveraging information sparsity at different scales (e.g. kernel-wise, feature map-wise, etc.). A more recent work [27] introduces
the interesting idea of training a CNN with multiple pruning ratios at once, to allow for adaptive energy-accuracy trade-offs. With the same aim of adapting the pruning to the task under analysis, in Nikolaos et al. [28] the authors introduce a dynamic pruning approach designed to dynamically remove the redundant capacity of a CNN.

The most common flip side of the presented approaches is that they rely on re-training stages to cope with the performance loss introduced by removing neurons from a trained network. Instead, in this work we focus on the possibility of changing the shape of the fully connected part before the training itself. This means that, while the cited approaches need iterative re-training, our approach does not need any additional training besides the one already needed to train the target model for the task under analysis. Among all the layers, we focus on the fully connected part since, as shown at the beginning of this section, for several deep architectures it is the most massive part in terms of neurons (weights). This difference is the reason why we refer to our approach as "sizing" and not as pruning (since we do not prune a trained network).

3. Sizing as a CNN approximation technique

To measure the suitability of the sizing strategy for approximate computing purposes, it is important to measure the resiliency of the "sized" network against the "original" one, on a given task. Both the network depth and the considered task might affect the approach effectiveness: the former, due to the different ratio between the number of neurons in the convolutional and in the fully connected layers; the latter, due to the different number of classes, directly related to the number of neurons in the classification layer. Therefore, in this work we consider two different networks on three classification problems.

Fig. 4. A simplified illustrative representation and comparison between the AlexNet [8] and Vgg19 [30] CNN architectures. In both cases, activation functions, dropout and other functional layers have not been reported.

3.1. Experimental setup

To take into account the depth of the networks, we consider two different CNNs pre-trained on ImageNet [29]: AlexNet [8] and Vgg19 [30]. The former consists of 5 convolutional and 3 fully connected layers, for a total of 60,965,224 parameters (Fig. 4, left), while the latter consists of 16 convolutional and 3 fully connected layers, for a total of 143,667,240 parameters (Fig. 4, right). For both networks, the sizing approximation procedure is as follows (a minimal sketch is given right after the list):

• replace the last fully connected layer with a new fully connected layer having as many neurons as the number of classes in the considered dataset;
• change the shape of the second-to-last fully connected layer according to one of the sizing rules introduced in Section 2.1;
• freeze the trained weights and biases of all the convolutional layers (by setting their learning rate to 0);
• set a very low learning rate (10^-4) for the fully connected layers;
• train the modified network on the target task.
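The following sketch shows one possible way to implement these steps. It is only an illustration: the paper does not state the framework used, the layer indices refer to torchvision's AlexNet (which requires torchvision ≥ 0.13 for the `weights` argument), freezing is obtained here via `requires_grad` rather than a per-layer zero learning rate, and the momentum value is an arbitrary placeholder. The Huang rule from Section 2.1 is used to size the second-to-last fully connected layer.

```python
import math
import torch
import torch.nn as nn
import torchvision

def sized_alexnet(num_classes: int) -> nn.Module:
    # Start from the ImageNet pre-trained model.
    model = torchvision.models.alexnet(weights="IMAGENET1K_V1")

    # Freeze all convolutional weights and biases (they keep their ImageNet values).
    for p in model.features.parameters():
        p.requires_grad = False

    # Size the second-to-last fully connected layer (fc7 in the paper's notation)
    # with the Huang rule: n = 2 * sqrt((m + 2) * i).
    i = model.classifier[4].in_features          # 4096 inputs coming from fc6
    n = round(2 * math.sqrt((num_classes + 2) * i))
    model.classifier[4] = nn.Linear(i, n)

    # Replace the last fully connected layer with one output neuron per class.
    model.classifier[6] = nn.Linear(n, num_classes)
    return model

model = sized_alexnet(num_classes=2)

# Very low learning rate, applied only to the fully connected part;
# the frozen convolutional parameters are simply not passed to the optimizer.
optimizer = torch.optim.SGD(model.classifier.parameters(), lr=1e-4, momentum=0.9)
```

Note that, as in the bullet list above, fc6 keeps its ImageNet weights but remains trainable, while the two resized layers are randomly re-initialized by construction.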

Since the optimizer used affects the learnt parameters, all the experiments were run twice: the first time using Stochastic Gradient Descent with Momentum (SGDM), as described in Bishop [9], and the second time using ADAM [31]. A 5-fold cross-validation was performed, with 3 folds used as the training set, 1 as the validation set and 1 as the test set. In all the considered configurations, the maximum number of training epochs has been set to a very high value (10^5). However, we adopted an early-stopping strategy to stop the training if there is no improvement in the accuracy values (measured on the validation set at the end of an epoch) for the last 15 epochs. This value has been selected based on the validation curves obtained by fine-tuning the original (i.e. not sized) CNNs on the considered datasets. In particular, since both networks showed to converge quickly, with very little oscillation in the loss values, smaller values would probably have resulted in local-minimum configurations (since the networks might not have enough time to converge), while bigger values would probably have only increased the training time, without a significant impact on the performance (since the curve tends to saturate quickly).

It is worth noting that the designed sizing strategy impacts the network weights differently:

• all the weights in the convolutional layers remain unvaried (the same learnt on ImageNet);
• during fine-tuning, the weights in the third-to-last fully connected layer are initialized with those learnt on ImageNet but are updated during the re-training;
• the weights of the last two fully connected layers are randomly initialized and adapted during the training.

To evaluate the effectiveness of the proposed approach, we considered three datasets differing in terms of number of classes, number of samples and image resolution:

• The Dogs vs Cats dataset¹ [32], consisting of 25000 equally distributed images of cats and dogs;
• The UIUC Sports Event dataset² [33], containing images of 8 different sport activities, distributed from 137 to 250 images per category. All the images are also grouped into "easy" and "medium" according to human subject judgement;
• The Caltech 101 dataset³ [34], collecting pictures of objects belonging to 101 different categories, distributed from 40 to 800 images per category.

¹ https://www.microsoft.com/en-us/download/details.aspx?id=54765
² http://vision.stanford.edu/lijiali/event_dataset/
³ http://www.vision.caltech.edu/Image_Datasets/Caltech101/
3.2. Experimental results

Measuring the resiliency of a deep CNN implies measuring the classification error rates of the approximated networks against those obtained by using the basic fine-tuning procedure.

Tables 1 to 3 report the classification accuracy and the number of iterations needed to converge for AlexNet, varying the training optimizer (Section 3.1) and the sizing approach used (Section 2.1), for each of the considered datasets (Section 3.1). Tables 4 and 5 report, respectively, the number of parameters and the occupied memory (in MB) for the same combinations of CNN, optimization approach, sizing strategy and dataset.

Similarly, Tables 6 to 8 report the classification accuracy and the number of iterations needed to converge for Vgg19 [30], varying the training optimizer (Section 3.1) and the sizing approach used (Section 2.1), for each of the considered datasets (Section 3.1). Tables 9 and 10 report, respectively, the number of parameters and the occupied memory (in MB) for the same combinations of CNN, optimization approach, sizing strategy and dataset.

Table 1
AlexNet [8] 5-fold cross-validation mean results for the Dogs vs Cats [32] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9744 ± 0.0021   0.9738 ± 0.0026    4120 ± 98          3800 ± 95
Base       ADAM       0.9658 ± 0.0025   0.9656 ± 0.0031    7400 ± 112         5000 ± 101
A-Rule     SGDM       0.9729 ± 0.0024   0.9730 ± 0.0030    5360 ± 104         5000 ± 99
A-Rule     ADAM       0.9664 ± 0.0026   0.9660 ± 0.0032    11880 ± 207        12000 ± 234
Huang      SGDM       0.9729 ± 0.0025   0.9722 ± 0.0031    11120 ± 221        13000 ± 256
Huang      ADAM       0.9663 ± 0.0027   0.9660 ± 0.0033    18160 ± 301        17400 ± 279

Table 2
AlexNet [8] 5-fold cross-validation mean results for the UIUC Sports Event [33] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9538 ± 0.0095   0.9587 ± 0.0119    1022 ± 61          792 ± 45
Base       ADAM       0.9601 ± 0.0111   0.9621 ± 0.0138    192 ± 51           156 ± 15
A-Rule     SGDM       0.9531 ± 0.0132   0.9495 ± 0.0165    1190 ± 68          1248 ± 71
A-Rule     ADAM       0.9588 ± 0.0138   0.9495 ± 0.0172    293 ± 53           192 ± 47
Huang      SGDM       0.9493 ± 0.0113   0.9460 ± 0.0141    1534 ± 81          1368 ± 78
Huang      ADAM       0.9563 ± 0.0156   0.9524 ± 0.0195    190 ± 60           168 ± 43

Table 3
AlexNet [8] 5-fold cross-validation mean results for the Caltech-101 [34] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9159 ± 0.0100   0.9155 ± 0.0125    25258 ± 382        24455 ± 326
Base       ADAM       0.9334 ± 0.0067   0.9338 ± 0.0084    9475 ± 202         10512 ± 217
A-Rule     SGDM       0.9229 ± 0.0073   0.9237 ± 0.0091    29667 ± 388        29638 ± 380
A-Rule     ADAM       0.9324 ± 0.0071   0.9335 ± 0.0089    7680 ± 112         7300 ± 109
Huang      SGDM       0.9212 ± 0.0071   0.9204 ± 0.0089    31332 ± 396        29273 ± 392
Huang      ADAM       0.9331 ± 0.0082   0.9329 ± 0.0103    7227 ± 102         6789 ± 100

Table 4
Summary of the number of AlexNet [8] parameters for each sizing technique and considered dataset. The table also reports the numerical difference (∆) and the percentage saving (∆%) obtained by using the sized network w.r.t. the base fine-tuning approach.

Dataset                 Technique  #Parameters  ∆           ∆%
Dogs Vs Cats [32]       Base       56,876,418   –           –
                        A-Rule     48,485,765   8,390,653   14.75%
                        Huang      41,136,258   15,740,160  27.67%
UIUC sports event [33]  Base       56,901,000   –           –
                        A-Rule     48,510,380   8,390,620   14.75%
                        Huang      41,745,340   15,155,660  26.64%
Caltech-101 [34]        Base       57,282,021   –           –
                        A-Rule     48,894,417   8,387,604   14.64%
                        Huang      41,136,258   15,740,160  27.67%

Table 5
Summary of the required AlexNet [8] memory for each sizing technique and considered dataset. The table also reports the numerical difference (∆) and the percentage saving (∆%) obtained by using the sized network w.r.t. the base fine-tuning approach.

Dataset                 Technique  Memory (MB)  ∆ (MB)  ∆%
Dogs Vs Cats [32]       Base       211          –       –
                        A-Rule     177          34      16.11%
                        Huang      124          87      41.23%
UIUC sports event [33]  Base       207          –       –
                        A-Rule     176          31      14.98%
                        Huang      152          55      26.57%
Caltech-101 [34]        Base       209          –       –
                        A-Rule     178          31      14.83%
                        Huang      166          43      20.57%

Table 6
Vgg19 [30] 5-fold cross-validation mean results for the Dogs vs Cats [32] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9887 ± 0.0010   0.9884 ± 0.0013    9160 ± 178         9800 ± 185
Base       ADAM       0.9870 ± 0.0009   0.9870 ± 0.0011    6760 ± 109         2200 ± 79
A-Rule     SGDM       0.9881 ± 0.0006   0.9882 ± 0.0007    10360 ± 204        10200 ± 203
A-Rule     ADAM       0.9864 ± 0.0015   0.9858 ± 0.0019    6200 ± 101         5200 ± 98
Huang      SGDM       0.9884 ± 0.0009   0.9880 ± 0.0011    9200 ± 190         8400 ± 173
Huang      ADAM       0.9873 ± 0.0013   0.9872 ± 0.0016    3280 ± 105         1400 ± 87

Table 7
Vgg19 [30] 5-fold cross-validation mean results for the UIUC Sports Event [33] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9658 ± 0.0097   0.9621 ± 0.0121    377 ± 51           276 ± 48
Base       ADAM       0.9734 ± 0.0026   0.9716 ± 0.0032    84 ± 14            72 ± 12
A-Rule     SGDM       0.9683 ± 0.0066   0.9685 ± 0.0082    386 ± 50           372 ± 51
A-Rule     ADAM       0.9740 ± 0.0025   0.9746 ± 0.0031    245 ± 46           96 ± 19
Huang      SGDM       0.9709 ± 0.0043   0.9714 ± 0.0054    1022 ± 61          1068 ± 58
Huang      ADAM       0.9740 ± 0.0044   0.9746 ± 0.0054    192 ± 48           156 ± 42

Table 8
Vgg19 [30] 5-fold cross-validation mean results for the Caltech-101 [34] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9362 ± 0.0035   0.9377 ± 0.0043    11271 ± 198        11242 ± 191
Base       ADAM       0.9378 ± 0.0063   0.9377 ± 0.0078    569 ± 99           511 ± 107
A-Rule     SGDM       0.9478 ± 0.0075   0.9531 ± 0.0094    11811 ± 221        11023 ± 201
A-Rule     ADAM       0.9415 ± 0.0065   0.9431 ± 0.0082    657 ± 118          584 ± 101
Huang      SGDM       0.9483 ± 0.0061   0.9475 ± 0.0076    9724 ± 201         8103 ± 189
Huang      ADAM       0.9415 ± 0.0065   0.9431 ± 0.0082    657 ± 109          584 ± 103

Table 9
Summary of the number of Vgg19 [30] parameters for each sizing technique and considered dataset. The table also reports the numerical difference (∆) and the percentage saving (∆%) obtained by using the sized network w.r.t. the base fine-tuning approach.

Dataset                 Technique  #Parameters   ∆           ∆%
Dogs Vs Cats [32]       Base       139,578,434   –           –
                        A-Rule     131,187,781   8,390,653   6.01%
                        Huang      123,838,274   15,740,160  11.28%
UIUC sports event [33]  Base       139,603,016   –           –
                        A-Rule     131,212,396   8,390,620   6.01%
                        Huang      124,447,356   15,155,660  10.86%
Caltech-101 [34]        Base       139,984,037   –           –
                        A-Rule     131,596,433   8,387,604   5.99%
                        Huang      123,838,274   15,740,160  11.28%

Table 10
Summary of the required Vgg19 [30] memory for each sizing technique and considered dataset. The table also reports the numerical difference (∆) and the percentage saving (∆%) obtained by using the sized network w.r.t. the base fine-tuning approach.

Dataset                 Technique  Memory (MB)  ∆ (MB)  ∆%
Dogs Vs Cats [32]       Base       508          –       –
                        A-Rule     488          20      3.94%
                        Huang      461          47      9.25%
UIUC sports event [33]  Base       507          –       –
                        A-Rule     477          30      5.92%
                        Huang      452          55      10.85%
Caltech-101 [34]        Base       506          –       –
                        A-Rule     482          24      4.74%
                        Huang      465          41      8.10%
Although this work must be considered just as a proof of concept, results show that it is possible to reduce the number of neurons in the hidden layers without statistically affecting the network performance, but with a significant reduction in the number of parameters (up to ∼27% for AlexNet and ∼11% for Vgg19) and in the required memory (up to ∼41% for AlexNet and ∼10% for Vgg19). This implies that, though fine-tuning can already be effectively used in many different contexts, further investigations are needed to develop improvements that allow unleashing its full potential. Moreover, the possibility of reducing the memory use without a direct impact on the network classification ability opens new scenarios toward the application of powerful CNNs on embedded devices, as already done with Random Forest classifiers [6].

As described in Section 2, our approach focuses on fully connected neurons. Thus, for the sake of completeness, we analysed if and to what extent "compacting" also the neurons in the convolutional part impacts the effectiveness of the proposed approach. To this aim, we focus on weight quantization, as we want to avoid any sort of re-training. The idea is to reduce the memory needed to store the weights by moving from a double-precision to a lower-precision floating-point representation. It is worth noting that the number of bits actually used to store the weights depends on the network itself: this value is calculated as the smallest number of bits needed to preserve the accuracy of the network on the validation set. Under these settings, weight quantization further compacts the networks by just a small margin (on average, ∼2% for AlexNet and ∼3% for Vgg19).
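As an illustration of the kind of storage saving involved, the sketch below simply casts the convolutional kernels of an AlexNet instance from 64-bit to 16-bit floats and compares the resulting sizes. It is only a simplified stand-in for the procedure described above, which instead searches, per network, for the smallest number of bits that preserves the validation accuracy.

```python
import numpy as np
import torchvision

model = torchvision.models.alexnet()

# Collect the convolutional kernels and store them at two different precisions.
kernels = [p.detach().numpy().astype(np.float64) for p in model.features.parameters()]
bytes_fp64 = sum(k.nbytes for k in kernels)
bytes_fp16 = sum(k.astype(np.float16).nbytes for k in kernels)

print(f"float64 storage: {bytes_fp64 / 2**20:.1f} MiB")
print(f"float16 storage: {bytes_fp16 / 2**20:.1f} MiB")
# In the paper the bit width is chosen as the smallest one that keeps the
# validation accuracy unchanged, not fixed a priori as done here.
```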
4. Sizing and adversarial perturbations

Although fine-tuning has been shown to be very effective [35], the use of a pre-trained CNN does not bring only benefits, but also its weaknesses, which could negatively impact the use of a pre-trained network. In particular, it has been shown [36] that (i) it is possible to arbitrarily fool state-of-the-art CNNs with a small (and often imperceptible to human eyes) perturbation and that (ii) regularization techniques are useless in this case (since the phenomenon is not the result of overfitting). More formally, given an image I ∈ R^(w,h,c) of size w × h with c channels, and a classifier mapping function f_C : I → {1..n} that classifies an image I into one of n possible labels, an adversarial perturbation r is defined as:

r ∈ R^(w,h,c) : f_C(I) ≠ f_C(I + r)    (1)

where r usually is the smallest perturbation (in terms of auto-information) able to fool the network (Fig. 5).

Fig. 5. Perturbation attack on an image from the Kaggle Dogs vs Cats competition [32]: (a) clean image, (b) image with the adversarial perturbation applied.

The term adversarial perturbation refers to the whole set of techniques that inject into an image a suitable, hardly perceptible, perturbation (noise) with the aim of misleading a CNN. Although the existence of blind spots in CNNs, and a way to exploit them, dates back to 2013 [36], it is only with the Fast Gradient Sign Method (FGSM) [37] and with its improved iterative version [38] that it started to be effectively possible to attack a CNN. Given a target CNN and a clean input sample, the FGSM multiplies the sign of the prediction gradient (with respect to the class of the input image) by a user-defined standard deviation ϵ (iteratively tuned by the improved variant) to generate an additive perturbation. Since ϵ determines the magnitude of the perturbation, higher values increase the attack success rate at the cost of a more visible alteration. In 2015, DeepFool [39] proposed an efficient iterative approach that exploits the gradient of a locally linearized version of the network loss to generate a series of additive perturbations that move the clean sample onto the edge of the separating classification hyperplane. The generated perturbation is then multiplied by η ≪ 1 in order to make the adversarial sample able to cross the hyperplane so as to be misclassified. From that moment on, several gradient-based approaches have been published, with the aim of producing a more efficient perturbation with a more automatic procedure. Common defence techniques usually try to make CNNs more robust by either working on the data, to detect the adversarial samples [40] or to destroy the injected artefacts [41], or on the way the model learns from the data [42]. But is it possible that the shape of the network itself contributes to the effectiveness of such attacks?
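For reference, the following sketch shows the core of a one-step FGSM attack in PyTorch (an illustration of the method in [37], not the implementation used for the experiments, and assuming inputs scaled to [0, 1]): the input gradient of the loss is computed for a clean image and its sign, scaled by ϵ, is added to the image; the attack succeeds when the condition of Eq. (1) holds.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon):
    """One-step FGSM: perturb `image` (shape 1x3xHxW) by epsilon * sign(grad)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()

    adversarial = (image + epsilon * image.grad.sign()).clamp(0.0, 1.0).detach()

    # Eq. (1): the attack worked if the predicted class changed.
    with torch.no_grad():
        fooled = model(adversarial).argmax(dim=1) != model(image).argmax(dim=1)
    return adversarial, bool(fooled)
```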
Although recently some authors have proposed adversarial attacks able to work in several domains [43,44], in this section we will focus only on CNNs for image processing and on the adversarial attacks applicable to them. This is because (i) CNNs are the most used deep neural networks and (ii) all the adversarial attacks proposed so far in any domain are the same as, or an adaptation of, those intended against images. Therefore, we analyse the impact that the sizing strategy has on the robustness of CNNs against adversarial perturbation approaches. To this aim, Tables 11 and 12 respectively report the robustness of AlexNet [8] and Vgg19 [30] against two adversarial perturbation strategies on the UIUC Sports Event Dataset [33], varying the sizing approach (Section 2.1). The value ρ, introduced in Moosavi-Dezfooli et al. [39], which relates the magnitude of the adversarial noise needed to mislead the CNN to that of the target image, is defined as

ρ = ∥N_a∥₂ / ∥I∥₂    (2)

where N_a is the injected adversarial noise and I is the target image. The column "Time" refers to the time (in seconds) needed to craft the adversarial samples. It is worth noting that, to provide fair results, the CNNs have been trained on the training folds, while the adversarial samples have been crafted only for the images in the test fold. Moreover, both ρ and "Time" values have been measured only for successfully crafted adversarial samples.
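A direct way to read ρ: it is simply the ratio between the L2 norm of the injected noise and the L2 norm of the clean image, so larger values mean that a stronger (more visible) perturbation is needed. A minimal sketch of the measurement, assuming the clean image and its adversarial version are available as tensors:

```python
import torch

def perturbation_ratio(clean: torch.Tensor, adversarial: torch.Tensor) -> float:
    """Eq. (2): rho = ||N_a||_2 / ||I||_2, with N_a = adversarial - clean."""
    noise = adversarial - clean
    return (noise.norm(p=2) / clean.norm(p=2)).item()
```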
Table 11
AlexNet [8] robustness to FGSM [37] and to DeepFool [39] adversarial perturbation attacks on the UIUC Sports Event Dataset [33], varying the sizing approach.

Technique  Attack    ρ (mean)          ρ (median)        Time (s, mean)      Time (s, median)
Base       FGSM      0.0183 ± 0.0004   0.0185 ± 0.0001   2690.45 ± 35.01     2811.46 ± 26.17
Base       DeepFool  0.0407 ± 0.0010   0.0419 ± 0.0007   706.83 ± 20.71      703.33 ± 18.56
A-Rule     FGSM      0.0193 ± 0.0003   0.0194 ± 0.0001   3260.09 ± 123.43    3146.05 ± 122.27
A-Rule     DeepFool  0.0442 ± 0.0007   0.0452 ± 0.0004   722.80 ± 27.10      754.96 ± 22.38
Huang      FGSM      0.0199 ± 0.0002   0.0202 ± 0.0002   4745.09 ± 117.93    4149.13 ± 109.87
Huang      DeepFool  0.0488 ± 0.0012   0.0499 ± 0.0009   558.67 ± 12.17      554.58 ± 11.86

Table 12
Vgg19 [30] robustness to FGSM [37] and to DeepFool [39] adversarial perturbation attacks on the UIUC Sports Event Dataset [33], varying the sizing approach.

Technique  Attack    ρ (mean)          ρ (median)        Time (s, mean)      Time (s, median)
Base       FGSM      0.0177 ± 0.0007   0.0182 ± 0.0003   18386.91 ± 267.25   18634.11 ± 213.77
Base       DeepFool  0.1188 ± 0.0023   0.0754 ± 0.0018   4305.76 ± 105.32    4383.62 ± 99.53
A-Rule     FGSM      0.0192 ± 0.0004   0.0190 ± 0.0002   27851.98 ± 196.11   28652.63 ± 179.32
A-Rule     DeepFool  0.0713 ± 0.0010   0.0730 ± 0.0009   5151.23 ± 101.33    5243.81 ± 98.09
Huang      FGSM      0.0200 ± 0.0003   0.0201 ± 0.0001   22059.96 ± 203.09   22084.38 ± 185.87
Huang      DeepFool  0.1000 ± 0.0011   0.0882 ± 0.0007   4132.82 ± 158.22    4130.29 ± 149.10

Interestingly, results show that, apart from a single combination, sized CNNs need a "stronger" adversarial noise to be misled. This further motivates other investigations in the direction of CNN layer sizing, since the reported analysis seems to suggest that the adversarial perturbation problem can be faced (or at least limited) by reducing the number of neurons/connections that actively take part in the network decision process. These results, although preliminary, represent a novel contribution in the field of adversarial defence strategies, since CNN sizing does not rely on the analysis of the data but could make CNNs intrinsically more robust by changing the neuron connection pattern. This will help users in "sizing" the network according to the desired levels of performance/robustness they need to obtain, on the basis of the risk associated with the task. Finally, it is also worth noting that the approach is totally topic-agnostic, meaning that it is applicable in several contexts and for different tasks (e.g. user recognition, object detection, etc.), helping researchers in choosing the most suitable solution without changes in the procedure.

5. Discussion and conclusions

In this work we focused on the concept of approximate computing, a field involving the study of resilience (i.e. the ability of a system to provide correct results also in the presence of degraded working conditions) to reduce the resources needed by a system. In particular, we introduced the concept of "sizing a CNN", namely a procedure intended to remove some of the fully connected neurons to reduce the number of trainable parameters. The proposed approach differs from classical pruning techniques since, to the best of our knowledge, all the approaches proposed so far rely on re-training stages to cope with the performance loss introduced by removing neurons from a trained network.

Instead, we focus only on removing whole neurons from the fully connected layers since, as shown in Section 2, for several deep architectures they are the most massive part in terms of neurons (weights). Moreover, we do not operate on an already trained network but instead focus on the possibility of "sizing" the shape of the fully connected part before the training itself. The result is that our approach does not need any additional training besides the one already needed to train the target model. Given the set of neurons we focus on, it is clear that fine-tuning is the most effective scenario for the proposed approach. It is worth noting that, although this might appear as a limitation, works leveraging fine-tuning make up the bigger portion of the total number of papers using deep neural networks, mostly due to the lack of large labelled datasets. All these aspects make our approach different from what has been proposed so far, with the main difference lying in the fact that it has been inspired by approximate computing, with the aim of leveraging CNNs' resiliency to reduce the number of needed neurons and try to get closer to the Pareto frontier.

Although preliminary, results not only show that it is possible to reduce the number of parameters and the memory usage without statistically affecting the performance, but also that the obtained network is more robust against adversarial perturbations. Future work will further analyse these aspects, trying (i) to extend the sizing procedure also to convolutional layers and (ii) to provide a systematic analysis of the effects of the sizing strategy against state-of-the-art adversarial perturbation approaches. Finally, since (to some extent) this is similar to what has been proposed for scaling neural networks in the context of increasing the size of layers to adapt them to different tasks [45], we will also analyse whether both aims (network size reduction and task adaptation) can be pursued at the same time.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors gratefully acknowledge the availability of the Calculation Centre SCoPE of the University of Naples Federico II and thank the SCoPE academic staff for the given support.

References

[1] F. Amato, M. Barbareschi, G. Cozzolino, A. Mazzeo, N. Mazzocca, A. Tammaro, Outperforming image segmentation by exploiting approximate K-means algorithms, in: International Conference on Optimization and Decision Science, Springer, 2017, pp. 31–38.
[2] W. Sung, S. Shin, K. Hwang, Resiliency of deep neural networks under quantization, 2015, arXiv preprint arXiv:1511.06488.
[3] S. Marrone, S. Olivieri, G. Piantadosi, C. Sansone, Reproducibility of deep CNN for biomedical image processing across frameworks and architectures, in: 2019 27th European Signal Processing Conference (EUSIPCO), IEEE, 2019, pp. 1–5.
[4] P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural networks for resource efficient inference, 2016, arXiv preprint arXiv:1611.06440.
[5] S. Han, H. Mao, W.J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 2015, arXiv preprint arXiv:1510.00149.
[6] M. Barbareschi, C. Papa, C. Sansone, Approximate decision tree-based multiple classifier systems, in: International Conference on Optimization and Decision Science, Springer, 2017, pp. 39–47.
[7] Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 3361–3368.
[8] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[10] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2012.
[11] G. Panchal, A. Ganatra, Y. Kosta, D. Panchal, Behaviour analysis of multilayer perceptrons with multiple hidden neurons and hidden layers, Int. J. Comput. Theory Eng. 3 (2) (2011) 332–337.
[12] D. Hunter, H. Yu, M.S. Pukish III, J. Kolbusz, B.M. Wilamowski, Selection of proper neural network sizes and architectures — A comparative study, IEEE Trans. Ind. Inf. 8 (2) (2012) 228–240.
[13] K.G. Sheela, S.N. Deepa, Review on methods to fix number of hidden neurons in neural networks, Math. Probl. Eng. 2013 (2013).
[14] F.M. Shah, M.K. Hasan, M.M. Hoque, S. Ahmmed, Architecture and weight optimization of ANN using sensitive analysis and adaptive particle swarm optimization, Int. J. Comput. Sci. Netw. Secur. 10 (8) (2010) 103–111.
[15] N. Mohsenifar, A. Kargar, N. Mohsenifar, Adjusting MLP neural network architecture through PSO algorithm for ECG signal prediction, in: International Conference on Intelligent Computing Electronics Systems and Information Technology, 2015, pp. 13–16.
[16] K.Z. Mao, G.-B. Huang, Neuron selection for RBF neural network classifier based on data structure preserving criterion, IEEE Trans. Neural Netw. 16 (6) (2005) 1531–1540.
[17] C.H. Aladag, E. Egrioglu, S. Gunay, M.A. Basaran, Improving weighted information criterion by using optimization, J. Comput. Appl. Math. 233 (10) (2010) 2683–2687.
[18] C.H. Aladag, A new architecture selection method based on tabu search for artificial neural networks, Expert Syst. Appl. 38 (4) (2011) 3287–3293.
[19] H. Yuan, F. Xiong, X. Huai, A method for estimating the number of hidden neurons in feed-forward neural networks based on information entropy, Comput. Electron. Agric. 40 (1–3) (2003) 57–64.
[20] S. Xu, L. Chen, A novel approach for determining the optimal number of hidden layer neurons for FNN's and its application in data mining, 2008.
[21] I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, fourth ed., Morgan Kaufmann.
[22] G.-B. Huang, Learning capability and storage capacity of two-hidden-layer feedforward networks, IEEE Trans. Neural Netw. 14 (2) (2003) 274–281.
[23] Y.E. Wang, G.-Y. Wei, D. Brooks, Benchmarking TPU, GPU, and CPU platforms for deep learning, 2019, arXiv preprint arXiv:1907.10701.
[24] Y. LeCun, J.S. Denker, S.A. Solla, Optimal brain damage, in: Advances in Neural Information Processing Systems, 1990, pp. 598–605.
[25] S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network, in: Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[26] S. Anwar, K. Hwang, W. Sung, Structured pruning of deep convolutional neural networks, ACM J. Emerg. Technol. Comput. Syst. (JETC) 13 (3) (2017) 1–18.
[27] S. Moon, Y. Byun, J. Park, S. Lee, Y. Lee, Memory-reduced network stacking for edge-level CNN architecture with structured weight pruning, IEEE J. Emerg. Sel. Top. Circuits Syst. 9 (4) (2019) 735–746.
[28] F. Nikolaos, I. Theodorakopoulos, V. Pothos, E. Vassalos, Dynamic pruning of CNN networks, in: 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), IEEE, 2019, pp. 1–5.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. (IJCV) 115 (3) (2015) 211–252, http://dx.doi.org/10.1007/s11263-015-0816-y.
[30] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[31] D. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[32] J. Elson, J.J. Douceur, J. Howell, J. Saul, Asirra: a CAPTCHA that exploits interest-aligned manual image categorization, 2007.
[33] L.-J. Li, L. Fei-Fei, What, where and who? Classifying events by scene and object recognition, in: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, IEEE, 2007, pp. 1–8.
[34] L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories, in: 2004 Conference on Computer Vision and Pattern Recognition Workshop, IEEE, 2004, p. 178.
[35] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks? in: Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[36] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, 2013, arXiv preprint arXiv:1312.6199.
[37] I.J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, 2014, arXiv preprint arXiv:1412.6572.
[38] A. Kurakin, I. Goodfellow, S. Bengio, Adversarial examples in the physical world, 2016, arXiv preprint arXiv:1607.02533.
[39] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, DeepFool: A simple and accurate method to fool deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2574–2582.
[40] W. Xu, D. Evans, Y. Qi, Feature squeezing: Detecting adversarial examples in deep neural networks, 2017, arXiv preprint arXiv:1704.01155.
[41] G.K. Dziugaite, Z. Ghahramani, D.M. Roy, A study of the effect of JPG compression on adversarial images, 2016, arXiv preprint arXiv:1608.00853.
[42] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, S. Ishii, Distributional smoothing with virtual adversarial training, 2015, arXiv preprint arXiv:1507.00677.
[43] N. Carlini, D. Wagner, Audio adversarial examples: Targeted attacks on speech-to-text, in: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, 2018, pp. 1–7.
[44] M. Sato, J. Suzuki, H. Shindo, Y. Matsumoto, Interpretable adversarial perturbation in input embedding space for text, 2018, arXiv preprint arXiv:1805.02917.
[45] M. Tan, Q.V. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, 2019, arXiv preprint arXiv:1905.11946.

Stefano Marrone is a post-doc at the University of Naples Federico II. His research topics are within the sphere of Pattern Recognition and Computer Vision, with applications ranging from biomedical image processing to biometrics and image/video forensics. More recently, he has also been working on ethics and fairness in AI.

Cristina Papa received a master's degree in Computer Science Engineering with laude from the University of Naples Federico II. Her main research interests include embedded systems, model-based design, approximate computing, image processing and artificial intelligence.

Carlo Sansone is Full Professor of Computer Engineering at the Dipartimento di Ingegneria Elettrica e Tecnologie dell'Informazione of the University of Naples Federico II. His basic interests cover the areas of image analysis, pattern recognition and machine and deep learning. From an applicative point of view, his main contributions were in the fields of biomedical image analysis, biometrics and image forensics. He coordinated several projects, mainly in the areas of biomedical image interpretation and network intrusion detection. Prof. Sansone has authored more than 200 papers in international journals and conference proceedings. He was also co-editor of three special issues and of three books.