Image Based Classification
Article info

Article history:
Received 1 February 2020
Received in revised form 5 December 2020
Accepted 24 December 2020
Available online 28 December 2020

Keywords:
CNN
Adversarial perturbation
Fine-tuning
Approximate computing

Abstract

Some applications have the property of being resilient, meaning that they are robust to noise (e.g. due to error) in the data. This characteristic is very useful in situations where an approximate computation allows the task to be performed in less time or the algorithm to be deployed on embedded hardware. Deep learning is one of the fields that can benefit from approximate computing to reduce the high number of involved parameters thanks to its impressive generalization ability. A common approach is to prune some neurons and perform an iterative re-training with the aim of both reducing the required memory and speeding up the inference stage. In this work we propose to face CNN size reduction from a different perspective: instead of reducing the network weights or looking for an approximated network very close to the Pareto frontier, we investigate whether it is possible to remove some neurons only from the fully connected layers before the network training without substantially affecting the network performance. As a case study, we will focus on "fine-tuning", a branch of transfer learning that has shown its effectiveness especially in domains lacking effective expert-designed features. To further compact the network, we apply weight quantization to the convolutional kernels. Results show that it is possible to tailor some layers to reduce the network size, both in terms of the number of parameters to learn and required memory, without statistically affecting the performance and without the need for any additional training. Finally, we investigate to what extent the sizing operation affects the network robustness against adversarial perturbations, a set of approaches aimed at misleading deep neural networks.

https://doi.org/10.1016/j.future.2020.12.020
0167-739X/© 2020 Elsevier B.V. All rights reserved.
S. Marrone, C. Papa and C. Sansone Future Generation Computer Systems 118 (2021) 48–55
the solution space looking for those lying on the Pareto frontier. Unfortunately, this approach is extremely computationally demanding, which makes it infeasible in the case of CNNs.

In this work we propose to face CNN size reduction from a different perspective: instead of reducing the weights and re-training the network or looking for an approximated network very close to the Pareto frontier, we investigate whether it is possible to remove some neurons before the network training without substantially affecting the network performance. Moreover, we also analyse the impact that weight quantization has when applied to the convolutional kernels of a "sized" CNN. As training scenario, we will focus on "fine-tuning", a branch of transfer learning that has demonstrated its effectiveness especially in domains lacking effective expert-designed features or huge annotated datasets [7].

The rest of the paper is organized as follows: Section 2 introduces the concepts behind the proposed approach; Section 3 investigates the effectiveness of the proposed approach by evaluating its performance on three different datasets; Section 4 analyses the effects of network sizing on adversarial perturbation; finally, Section 5 draws some conclusions.

2. Hidden layer sizing

Convolutional Neural Networks (CNNs) are very similar to traditional (shallow) Artificial Neural Networks (ANNs): they are both made of neurons, usually organized in layers, connected to form a network in which the output of a neuron is the input of some others. This structure makes it possible to learn how to transform the input data into the desired output (e.g. a class for a classification problem or a value for a regression one). However, while shallow neural networks operate on the features designed and extracted by a domain expert (Fig. 1(a)), CNNs use a hierarchy of convolution operations (whose kernel weights are learned in the very same way classical neurons are) to autonomously extract the features that better model the problem under analysis (Fig. 1(b)).

Fig. 1. Comparison between a shallow and a convolutional neural network architecture. (a) A representation of the classical approach in which some features (in the middle) are extracted from input data (on the left) and used to train a shallow neural network (a multilayer perceptron, on the right); (b) An example of a CNN using a hierarchy of (in the example 2) convolutional layers (in the middle) to autonomously learn the best representation of input data (on the left) to train a fully connected neural network (on the right).

All CNNs of the type described in Fig. 1(b) can be seen as a stack of neurons specialized for feature engineering (the ones in the convolutional layers) and neurons intended for the classification task (the ones in the fully connected layers). According to this point of view, the fully connected layers can be seen as a standard multilayer perceptron (MLP) classifier that relies on features extracted from the convolutional layers (although it is worth noting that all layers are trained together). This distinction is important since convolutional and fully connected layers have a very different number of parameters. Taking AlexNet [8] as an example, it has 2,334,080 parameters in the convolutional layers (∼4% of the total) and 58,631,144 parameters in the fully connected ones (∼96% of the total). The reason is that AlexNet was originally intended to handle the 1000 classes of ImageNet; therefore, after feature engineering, a great representational capacity is required.

Supervised learning theory says that, to reduce over-fitting, a huge number of annotated samples is required to estimate millions of parameters [9]. Unfortunately, having access to a huge number of labelled samples is not always viable. To address this problem, an increasingly popular solution is "Transfer Learning", a term referring to the transfer of knowledge learnt in one task to another one. In particular, there are two possible ways of leveraging knowledge from a pre-trained network (Fig. 2): fine-tuning the net to adapt it to the new task, or directly using the network as a feature extractor. It is important to highlight that, while the feature extractor approach does not change the network structure (it just exploits the net's inherent hierarchical representation), fine-tuning implies a change in the architecture, without any concern for whether, and to what extent, this alteration can impact the efficiency and effectiveness of the net.

Indeed, fine-tuning is usually applied to problems having a smaller number of classes, causing a bottleneck in the network structure: e.g. in the case reported in Fig. 2, the 4096 neurons in the fc7 layer (originally intended to be the input for the 1000-class output layer) are connected to a 2-neuron output layer. Besides the waste of representational capacity, this simplistic approach has three main drawbacks:

• it can introduce an unnecessary additional computational and memory burden (with a direct impact on the network performance and required system characteristics);
• it can cause the network to focus more on the noise within images rather than on other salient aspects;
• it may require more training samples to converge (due to the higher number of parameters to fit).

Thus, with a view to network size reduction, in this work we propose to further adapt deep CNNs by performing a sizing of hidden layers (i.e. those between the input and the output layers) when using the fine-tuning strategy. With sizing, we refer to the use of a suitable strategy to reduce the number of neurons used, without significantly affecting the network performance. This can be achieved in two different ways: by reducing the synapses (i.e. connections between neurons), or by removing whole neurons. Here we will analyse the latter approach, focusing in particular only on neurons in the fully connected layers since, as previously discussed, they are the most numerous ones.
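As a rough check of the parameter split quoted above for AlexNet [8], the per-layer counts can be recomputed from the layer shapes. The sketch below assumes the standard AlexNet layout (five convolutional layers, with two-group convolutions in conv2, conv4 and conv5, followed by fc6, fc7 and an output layer); the layer shapes and the helper names `conv_params` and `fc_params` are assumptions that are not stated in the text, chosen because they reproduce the reported totals.

```python
# Sketch: recompute AlexNet parameter counts from assumed layer shapes.
# Each conv entry: (filters, kernel_h, kernel_w, input_channels_per_filter).
# conv2, conv4 and conv5 are assumed to use two groups, so each filter
# only sees half of the input channels.
CONV_LAYERS = [
    (96, 11, 11, 3),    # conv1
    (256, 5, 5, 48),    # conv2 (grouped)
    (384, 3, 3, 256),   # conv3
    (384, 3, 3, 192),   # conv4 (grouped)
    (256, 3, 3, 192),   # conv5 (grouped)
]
FC_INPUT = 9216  # 256 feature maps of 6 x 6 after the last pooling (assumed)


def conv_params(layers):
    """Weights plus one bias per filter."""
    return sum(f * kh * kw * c + f for f, kh, kw, c in layers)


def fc_params(n_inputs, widths):
    """Weights plus biases of a chain of fully connected layers."""
    total = 0
    for width in widths:
        total += n_inputs * width + width
        n_inputs = width
    return total


conv = conv_params(CONV_LAYERS)
fc_imagenet = fc_params(FC_INPUT, [4096, 4096, 1000])  # original 1000 classes
fc_binary = fc_params(FC_INPUT, [4096, 4096, 2])       # fine-tuned, 2 classes

print(conv, fc_imagenet)   # convolutional vs fully connected parameters
print(conv + fc_binary)    # total after swapping in a 2-neuron output layer
```

Under these assumptions, replacing the 1000-class output layer with a 2-neuron one leaves fc6 and fc7 untouched, so the total barely changes; this is the bottleneck that the sizing strategy targets.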
Fig. 2. A representation of transfer learning using AlexNet. From left to right: the net input (a 227 × 227 pixels - 3 channels RGB image), convolutional layers,
fully connected layers and adaptations needed to perform transfer learning. In particular, on the top right side there is an example of how to fine-tune AlexNet for
a binary classification task, while on the bottom right side there is an illustration showing the use of the output from the last hidden layer as feature set for an
external classifier (i.e. using AlexNet as feature extractor).
Since the optimizer used affects the learnt parameters, all the experiments were run two times: the first, by using Stochastic Gradient Descent with Momentum (SGDM) as described in Bishop [9], the second by using ADAM [31]. A 5-fold cross-validation was performed, with 3 folds used as the training set, 1 as the validation set and 1 as the test set. In all the considered configurations, the maximum number of training epochs has been set to a very high value (10^5). However, we adopted an early-stopping strategy to stop the training if there is no improvement in the accuracy values (measured on the validation set at the

• The Dogs vs Cats dataset¹ [32], consisting of 25000 equally distributed images of cats and dogs;
• The UIUC Sports Event dataset² [33], containing images of 8 different sport activities, distributed from 137 to 250 images per category. All the images are also grouped into "easy" and "medium" according to the human subject judgement;

¹ https://www.microsoft.com/en-us/download/details.aspx?id=54765.
² http://vision.stanford.edu/lijiali/event_dataset/.
Table 1
AlexNet [8] 5-fold cross validation mean results for the Dogs vs Cats [32] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9744 ± 0.0021   0.9738 ± 0.0026    4120 ± 98          3800 ± 95
Base       ADAM       0.9658 ± 0.0025   0.9656 ± 0.0031    7400 ± 112         5000 ± 101
A-Rule     SGDM       0.9729 ± 0.0024   0.9730 ± 0.0030    5360 ± 104         5000 ± 99
A-Rule     ADAM       0.9664 ± 0.0026   0.9660 ± 0.0032    11880 ± 207        12000 ± 234
Huang      SGDM       0.9729 ± 0.0025   0.9722 ± 0.0031    11120 ± 221        13000 ± 256
Huang      ADAM       0.9663 ± 0.0027   0.9660 ± 0.0033    18160 ± 301        17400 ± 279

Table 2
AlexNet [8] 5-fold cross validation mean results for the UIUC Sports Event [33] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9538 ± 0.0095   0.9587 ± 0.0119    1022 ± 61          792 ± 45
Base       ADAM       0.9601 ± 0.0111   0.9621 ± 0.0138    192 ± 51           156 ± 15
A-Rule     SGDM       0.9531 ± 0.0132   0.9495 ± 0.0165    1190 ± 68          1248 ± 71
A-Rule     ADAM       0.9588 ± 0.0138   0.9495 ± 0.0172    293 ± 53           192 ± 47
Huang      SGDM       0.9493 ± 0.0113   0.9460 ± 0.0141    1534 ± 81          1368 ± 78
Huang      ADAM       0.9563 ± 0.0156   0.9524 ± 0.0195    190 ± 60           168 ± 43

Table 3
AlexNet [8] 5-fold cross validation mean results for the Caltech-101 [34] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9159 ± 0.0100   0.9155 ± 0.0125    25258 ± 382        24455 ± 326
Base       ADAM       0.9334 ± 0.0067   0.9338 ± 0.0084    9475 ± 202         10512 ± 217
A-Rule     SGDM       0.9229 ± 0.0073   0.9237 ± 0.0091    29667 ± 388        29638 ± 380
A-Rule     ADAM       0.9324 ± 0.0071   0.9335 ± 0.0089    7680 ± 112         7300 ± 109
Huang      SGDM       0.9212 ± 0.0071   0.9204 ± 0.0089    31332 ± 396        29273 ± 392
Huang      ADAM       0.9331 ± 0.0082   0.9329 ± 0.0103    7227 ± 102         6789 ± 100

Table 4
Summary of the number of AlexNet [8] parameters for each sizing technique and considered dataset. The table also reports the numerical difference (∆) and the percentage saving (∆%) obtained by using the sized network w.r.t. the base fine-tuning approach.

Dataset                 Technique  #Parameters   ∆            ∆%
Dogs Vs Cats [32]       Base       56,876,418    –            –
                        A-Rule     48,485,765    8,390,653    14.75%
                        Huang      41,136,258    15,740,160   27.67%
UIUC sports event [33]  Base       56,901,000    –            –
                        A-Rule     48,510,380    8,390,620    14.75%
                        Huang      41,745,340    15,155,660   26.64%
Caltech-101 [34]        Base       57,282,021    –            –
                        A-Rule     48,894,417    8,387,604    14.64%
                        Huang      41,136,258    15,740,160   27.67%

Table 5
Summary of the required AlexNet [8] memory for each sizing technique and considered dataset. The table also reports the numerical difference (∆) and the percentage saving (∆%) obtained by using the sized network w.r.t. the base fine-tuning approach.

Dataset                 Technique  Memory (MB)  ∆ (MB)  ∆%
Dogs Vs Cats [32]       Base       211          –       –
                        A-Rule     177          34      16.11%
                        Huang      124          87      41.23%
UIUC sports event [33]  Base       207          –       –
                        A-Rule     176          31      14.98%
                        Huang      152          55      26.57%
Caltech-101 [34]        Base       209          –       –
                        A-Rule     178          31      14.83%
                        Huang      166          43      20.57%

Table 6
Vgg19 [30] 5-fold cross validation mean results for the Dogs vs Cats [32] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9887 ± 0.0010   0.9884 ± 0.0013    9160 ± 178         9800 ± 185
Base       ADAM       0.9870 ± 0.0009   0.9870 ± 0.0011    6760 ± 109         2200 ± 79
A-Rule     SGDM       0.9881 ± 0.0006   0.9882 ± 0.0007    10360 ± 204        10200 ± 203
A-Rule     ADAM       0.9864 ± 0.0015   0.9858 ± 0.0019    6200 ± 101         5200 ± 98
Huang      SGDM       0.9884 ± 0.0009   0.9880 ± 0.0011    9200 ± 190         8400 ± 173
Huang      ADAM       0.9873 ± 0.0013   0.9872 ± 0.0016    3280 ± 105         1400 ± 87

Table 7
Vgg19 [30] 5-fold cross validation mean results for the UIUC Sports Event [33] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9658 ± 0.0097   0.9621 ± 0.0121    377 ± 51           276 ± 48
Base       ADAM       0.9734 ± 0.0026   0.9716 ± 0.0032    84 ± 14            72 ± 12
A-Rule     SGDM       0.9683 ± 0.0066   0.9685 ± 0.0082    386 ± 50           372 ± 51
A-Rule     ADAM       0.9740 ± 0.0025   0.9746 ± 0.0031    245 ± 46           96 ± 19
Huang      SGDM       0.9709 ± 0.0043   0.9714 ± 0.0054    1022 ± 61          1068 ± 58
Huang      ADAM       0.9740 ± 0.0044   0.9746 ± 0.0054    192 ± 48           156 ± 42

Table 8
Vgg19 [30] 5-fold cross validation mean results for the Caltech-101 [34] dataset, with respective 95% confidence values.

Technique  Optimizer  Accuracy (mean)   Accuracy (median)  Iterations (mean)  Iterations (median)
Base       SGDM       0.9362 ± 0.0035   0.9377 ± 0.0043    11271 ± 198        11242 ± 191
Base       ADAM       0.9378 ± 0.0063   0.9377 ± 0.0078    569 ± 99           511 ± 107
A-Rule     SGDM       0.9478 ± 0.0075   0.9531 ± 0.0094    11811 ± 221        11023 ± 201
A-Rule     ADAM       0.9415 ± 0.0065   0.9431 ± 0.0082    657 ± 118          584 ± 101
Huang      SGDM       0.9483 ± 0.0061   0.9475 ± 0.0076    9724 ± 201         8103 ± 189
Huang      ADAM       0.9415 ± 0.0065   0.9431 ± 0.0082    657 ± 109          584 ± 103
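The ∆ and ∆% columns of Table 4 follow directly from the parameter counts, and the Base and A-Rule rows can be reproduced from layer shapes. The sketch below assumes the standard AlexNet fully connected layout (9216 inputs, fc6 and fc7 of 4096 neurons each) and, as an inference from the reported totals rather than a definition taken from the text, that the A-Rule sets the width of fc7 to the integer mean of its input size and the number of classes. The constant and function names are hypothetical.

```python
# Sketch: reproduce the Base and A-Rule rows of Table 4 under two assumptions:
# (1) standard AlexNet shapes; (2) fc7 width = (4096 + n_classes) // 2,
# which is inferred from the reported totals, not stated in the text.
CONV_PARAMS = 2_334_080          # convolutional part, as quoted for AlexNet
FC6_PARAMS = 9216 * 4096 + 4096  # fc6 is assumed to be left untouched


def total_params(fc7_width, n_classes):
    """Total parameters for a given fc7 width and number of output classes."""
    fc7 = 4096 * fc7_width + fc7_width
    output = fc7_width * n_classes + n_classes
    return CONV_PARAMS + FC6_PARAMS + fc7 + output


def saving(base, sized):
    """Return (delta, delta as a percentage of the base count)."""
    delta = base - sized
    return delta, round(100 * delta / base, 2)


for name, n_classes in [("Dogs vs Cats", 2), ("UIUC", 8), ("Caltech-101", 101)]:
    base = total_params(4096, n_classes)
    sized = total_params((4096 + n_classes) // 2, n_classes)
    print(name, base, sized, *saving(base, sized))
```

Because only fc7 and the output layer shrink under this assumption, the saving stays close to 8.39 million parameters regardless of the number of classes, which matches the nearly constant ∆ of the A-Rule rows.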
• The Caltech 101 dataset³ [34], collecting pictures of objects belonging to 101 different categories, distributed from 40 to 800 images per category;

³ http://www.vision.caltech.edu/Image_Datasets/Caltech101/.

3.2. Experimental results

Measuring the resiliency of a deep CNN implies measuring the classification error rates of the approximated networks against those obtained by using the basic fine-tuning procedure.

Tables 1 to 3 report the classification accuracy and the number of iterations needed to converge for AlexNet, varying the training optimizer (Section 3.1) and the sizing approach used (Section 2.1), for each considered dataset respectively (Section 3.1). Tables 4 and 5 respectively report the number of parameters and occupied memory (in MB) for the same set of CNN, optimization approach, sizing strategy and dataset.

Similarly, Tables 6 to 8 report the classification accuracy and the number of iterations needed to converge for Vgg19 [30], varying the training optimizer (Section 3.1) and the sizing approach used (Section 2.1), for each considered dataset respectively (Section 3.1). Tables 9 and 10 respectively report the number of parameters and occupied memory (in MB) for the same set of CNN, optimization approach, sizing strategy and dataset.

Although this work must be considered just as a proof-of-concept, results show that it is possible to reduce the number of neurons in the hidden layers without statistically affecting the network performance, but with a significant reduction of the number of parameters (up to ∼27% for AlexNet and ∼11% for Vgg19) and required memory use (up to ∼41% for AlexNet and ∼10% for Vgg19). This implies that, though fine-tuning can already be effectively used in many different contexts, other investigations are needed to develop improvements that allow unleashing its full potential. Moreover, the possibility of reducing the memory use without a direct impact on the network classification ability opens new scenarios toward the application of powerful
CNNs on embedded devices, as already done with Random Forest classifiers [6].

As described in Section 2, our approach focuses on fully connected neurons. Thus, for the sake of completeness, we analysed if and to what extent also "compacting" the neurons in the convolutional part impacts the effectiveness of the proposed approach. To this aim, we focus on weight quantization as we want to avoid any sort of re-training. The idea is to reduce the memory needed to store the weights by moving from a double-precision to a "lower" precision floating-point representation. It is worth noting that the number of bits actually used to store the weights depends on the network itself. This value is calculated as the smallest number of bits needed to preserve the accuracy of the network on the validation set. Under these settings, weight quantization compacts the networks further by just a small margin (on average, ∼2% for AlexNet and ∼3% for Vgg19).

Table 9
Summary of the number of Vgg19 [30] parameters for each sizing technique and considered dataset. The table also reports the numerical difference (∆) and the percentage saving (∆%) obtained by using the sized network w.r.t. the base fine-tuning approach.

Dataset                 Technique  #Parameters    ∆            ∆%
Dogs Vs Cats [32]       Base       139,578,434    –            –
                        A-Rule     131,187,781    8,390,653    6.01%
                        Huang      123,838,274    15,740,160   11.28%
UIUC sports event [33]  Base       139,603,016    –            –
                        A-Rule     131,212,396    8,390,620    6.01%
                        Huang      124,447,356    15,155,660   10.86%
Caltech-101 [34]        Base       139,984,037    –            –
                        A-Rule     131,596,433    8,387,604    5.99%
                        Huang      123,838,274    15,740,160   11.28%

Table 10
Summary of the required Vgg19 [30] memory for each sizing technique and considered dataset. The table also reports the numerical difference (∆) and the percentage saving (∆%) obtained by using the sized network w.r.t. the base fine-tuning approach.

Dataset                 Technique  Memory (MB)  ∆ (MB)  ∆%
Dogs Vs Cats [32]       Base       508          –       –
                        A-Rule     488          20      3.94%
                        Huang      461          47      9.25%
UIUC sports event [33]  Base       507          –       –
                        A-Rule     477          30      5.92%
                        Huang      452          55      10.85%
Caltech-101 [34]        Base       506          –       –
                        A-Rule     482          24      4.74%
                        Huang      465          41      8.10%

4. Sizing and adversarial perturbations

Although fine-tuning has been shown to be very effective [35], the use of a pre-trained CNN brings not only benefits but also the weaknesses of the original network, which could negatively affect its use. In particular, it has been shown [36] that (i) it is possible to arbitrarily fool state-of-the-art CNNs with a small (and often imperceptible to human eyes) perturbation and that (ii) regularization techniques are useless in this case (since it is not the result of overfitting). More formally, given an image $I \in \mathbb{R}^{(w,h,c)}$ of size $w \times h$ on $c$ channels, and a classifier mapping function $f_C : I \to \{1, \dots, n\}$ that classifies an image $I$ into one of $n$ possible labels, an adversarial perturbation $r$ is defined as:

$$ r \in \mathbb{R}^{(w,h,c)} : f_C(I) \neq f_C(I + r) \qquad (1) $$

where $r$ usually is the smallest perturbation (in terms of autoinformation) able to fool the network (Fig. 5).

Fig. 5. Perturbation attack on an image from the Kaggle Dogs vs Cats competition [32]: (a) clean image, (b) image with the adversarial perturbation applied.

The term adversarial perturbation refers to the set of techniques that inject an image with a suitable, hardly perceptible, perturbation (noise) with the aim of misleading a CNN. Although the existence of blind spots in CNNs and a way to exploit them dates back to 2013 [36], it is only with the Fast Gradient Sign Method (FGSM) [37] and with its improved iterative version [38] that attacking a CNN became practically feasible. Given a target CNN and a clean sample input, the FGSM multiplies the sign of the gradient of the classification loss with respect to the input image (computed for the class of that image) by a user-defined scaling factor $\epsilon$ (iteratively tuned by the improved variant) to generate an additive perturbation. Since $\epsilon$ determines the magnitude of the perturbation, higher values increase the attack success rate at the cost of a more visible alteration. In 2015, DeepFool [39] proposed an efficient iterative approach that exploits the gradient of a locally linearized version of the network loss to generate a series of additive perturbations that move the clean sample onto the edge of the classification separating hyperplane. The generated perturbation is then multiplied by $(1 + \eta)$, with $\eta \ll 1$, in order to make the adversarial sample able to cross the hyperplane so as to be misclassified. From that moment on, several gradient-based approaches have been published, with the aim of producing a more efficient perturbation with a more automatic procedure.

Common defence techniques usually try to make CNNs more robust by working either on the data, to detect the adversarial samples [40] or to destroy the injected artefacts [41], or on the way the model learns from it [42]. But is it possible that the shape of the network itself contributes to the effectiveness of such attacks?

Although some authors have recently proposed adversarial attacks able to work in several domains [43,44], in this section we will focus only on CNNs for image processing and on adversarial attacks applicable to them. This is because (i) CNNs represent the most used deep neural networks and (ii) all the adversarial attacks proposed so far in any domain are the same as, or an adaptation of, those intended against images. Therefore, we analyse the impact that the sizing strategy has on the robustness of CNNs against adversarial perturbation approaches. To this aim, Tables 11 and 12 respectively report the robustness of AlexNet [8] and Vgg19 [30] against two adversarial perturbation strategies on the UIUC Sports Event Dataset [33], varying the sizing approach (Section 2.1). The value $\rho$, introduced in Moosavi-Dezfooli et al. [39], which measures the magnitude of the adversarial noise needed to mislead the CNN, is defined as

$$ \rho = \frac{\| N_a \|_2}{\| I \|_2} \qquad (2) $$

where $N_a$ is the injected adversarial noise and $I$ is the target image. The column "Time" refers to the time (in seconds) needed to craft the adversarial samples. It is worth noting that, to provide
fair results, the CNNs have been trained on the training folds, while the adversarial samples have been crafted only for images in the test fold. Moreover, both ρ and "Time" values have been measured only for successfully crafted adversarial samples.

Table 11
AlexNet [8] robustness to FGSM [37] and to DeepFool [39] adversarial perturbations attacks on the UIUC Sports Event Dataset [33], varying the sizing approach.

Technique  Fool      ρ (mean)          ρ (median)        Time (s) (mean)     Time (s) (median)
Base       FGSM      0.0183 ± 0.0004   0.0185 ± 0.0001   2690.45 ± 35.01     2811.46 ± 26.17
Base       DeepFool  0.0407 ± 0.0010   0.0419 ± 0.0007   706.83 ± 20.71      703.33 ± 18.56
A-Rule     FGSM      0.0193 ± 0.0003   0.0194 ± 0.0001   3260.09 ± 123.43    3146.05 ± 122.27
A-Rule     DeepFool  0.0442 ± 0.0007   0.0452 ± 0.0004   722.80 ± 27.10      754.96 ± 22.38
Huang      FGSM      0.0199 ± 0.0002   0.0202 ± 0.0002   4745.09 ± 117.93    4149.13 ± 109.87
Huang      DeepFool  0.0488 ± 0.0012   0.0499 ± 0.0009   558.67 ± 12.17      554.58 ± 11.86

Table 12
Vgg19 [30] robustness to FGSM [37] and to DeepFool [39] adversarial perturbations attacks on the UIUC Sports Event Dataset [33], varying the sizing approach.

Technique  Fool      ρ (mean)          ρ (median)        Time (s) (mean)     Time (s) (median)
Base       FGSM      0.0177 ± 0.0007   0.0182 ± 0.0003   18386.91 ± 267.25   18634.11 ± 213.77
Base       DeepFool  0.1188 ± 0.0023   0.0754 ± 0.0018   4305.76 ± 105.32    4383.62 ± 99.53
A-Rule     FGSM      0.0192 ± 0.0004   0.0190 ± 0.0002   27851.98 ± 196.11   28652.63 ± 179.32
A-Rule     DeepFool  0.0713 ± 0.0010   0.0730 ± 0.0009   5151.23 ± 101.33    5243.81 ± 98.09
Huang      FGSM      0.0200 ± 0.0003   0.0201 ± 0.0001   22059.96 ± 203.09   22084.38 ± 185.87
Huang      DeepFool  0.1000 ± 0.0011   0.0882 ± 0.0007   4132.82 ± 158.22    4130.29 ± 149.10

Interestingly, results show that, apart from a single combination, sized CNNs need a "stronger" adversarial noise to be misled. This further motivates other investigations in the direction of CNN layer sizing, since the reported analysis seems to suggest that the adversarial perturbation problem can be faced (or at least limited) by reducing the number of neurons/connections that actively take part in the network decision problem. These results, although preliminary, represent a novel contribution in the field of adversarial defence strategies since CNN sizing does not rely on the analysis of the data but could make CNNs intrinsically more robust by changing the neuron connection pattern. This will help users in "sizing" the network according to the desired levels of performance/robustness they need to obtain on the basis of the risk associated with the task. Finally, it is also worth noting that the approach is totally topic-agnostic, meaning that it is applicable in several contexts and for different tasks (e.g. user recognition, object detection, etc.), helping researchers in choosing the most suitable solution without changes in the procedure.

5. Discussion and conclusions

In this work we focused on the concept of approximate computing, a field involving the study of resilience (i.e. the ability of a system to provide correct results also in the presence of degraded working conditions) to reduce the resources needed by a system. In particular, we introduced the concept of "sizing a CNN", namely a procedure that removes some of the fully connected neurons to reduce the number of trainable parameters. The proposed approach differs from classical pruning techniques since, to the best of our knowledge, all the approaches proposed so far rely on re-training stages to cope with the performance loss introduced by removing neurons in a trained network.

Instead, we focus only on removing whole neurons from the fully connected layers since, as shown in Section 2, for several deep architectures they are the largest part in terms of neurons (weights). Moreover, we do not operate on an already trained network but, instead, focus on the possibility of "sizing" the shape of the fully connected part before the training itself. The result is that our approach does not need any additional training besides the one already needed to train the target model. Given the set of neurons we focus on, it is clear that fine-tuning is the most effective scenario for the proposed approach. It is worth noting that, although this might appear as a limitation, works leveraging fine-tuning account for the larger share of the papers using deep neural networks, mostly due to the lack of large labelled datasets. All these aspects make our approach different from what has been proposed so far, with the main difference lying in the fact that it has been inspired by approximate computing with the aim of leveraging CNN's resiliency to reduce the number of neurons needed, so as to get closer to the Pareto frontier.

Although preliminary, results not only show that it is possible to reduce the number of parameters and memory usage without statistically affecting the performance, but also that the obtained network is more robust against adversarial perturbations. Future work will further analyse these aspects, trying (i) to extend the sizing procedure also to convolutional layers and (ii) to provide a systematic analysis of the effects of the sizing strategy against state-of-the-art adversarial perturbation approaches. Finally, since (to some extent) this is similar to what has been proposed for scaling neural networks in the context of increasing the size of layers to adapt them to different tasks [45], we will also analyse whether both aims (network size reduction and task adaptation) can be pursued at the same time.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors gratefully acknowledge the availability of the Calculation Centre SCoPE of the University of Naples Federico II and thank the SCoPE academic staff for the given support.

References

[1] F. Amato, M. Barbareschi, G. Cozzolino, A. Mazzeo, N. Mazzocca, A. Tammaro, Outperforming image segmentation by exploiting approximate K-means algorithms, in: International Conference on Optimization and Decision Science, Springer, 2017, pp. 31–38.
[2] W. Sung, S. Shin, K. Hwang, Resiliency of deep neural networks under quantization, 2015, arXiv preprint arXiv:1511.06488.
[3] S. Marrone, S. Olivieri, G. Piantadosi, C. Sansone, Reproducibility of deep CNN for biomedical image processing across frameworks and architectures, in: 2019 27th European Signal Processing Conference (EUSIPCO), IEEE, 2019, pp. 1–5.
[4] P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural networks for resource efficient inference, 2016, arXiv preprint arXiv:1611.06440.
[5] S. Han, H. Mao, W.J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, 2015, arXiv preprint arXiv:1510.00149.
[6] M. Barbareschi, C. Papa, C. Sansone, Approximate decision tree-based multiple classifier systems, in: International Conference on Optimization and Decision Science, Springer, 2017, pp. 39–47.
[7] Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 3361–3368.
[8] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[9] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[10] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley & Sons, 2012.
[11] G. Panchal, A. Ganatra, Y. Kosta, D. Panchal, Behaviour analysis of multilayer perceptrons with multiple hidden neurons and hidden layers, Int. J. Comput. Theory Eng. 3 (2) (2011) 332–337.
[12] D. Hunter, H. Yu, M.S. Pukish III, J. Kolbusz, B.M. Wilamowski, Selection of proper neural network sizes and architectures — A comparative study, IEEE Trans. Ind. Inf. 8 (2) (2012) 228–240.
[13] K.G. Sheela, S.N. Deepa, Review on methods to fix number of hidden neurons in neural networks, Math. Probl. Eng. 2013 (2013).
[14] F.M. Shah, M.K. Hasan, M.M. Hoque, S. Ahmmed, Architecture and weight optimization of ANN using sensitive analysis and adaptive particle swarm optimization, Int. J. Comput. Sci. Netw. Secur. 10 (8) (2010) 103–111.
[15] N. Mohsenifar, A. Kargar, N. Mohsenifar, Adjusting MLP neural network architecture through PSO algorithm for ECG signal prediction, in: International Conference on Intelligent Computing Electronics Systems and Information Technology, 2015, pp. 13–16.
[16] K.Z. Mao, G.-B. Huang, Neuron selection for RBF neural network classifier based on data structure preserving criterion, IEEE Trans. Neural Netw. 16 (6) (2005) 1531–1540.
[17] C.H. Aladag, E. Egrioglu, S. Gunay, M.A. Basaran, Improving weighted information criterion by using optimization, J. Comput. Appl. Math. 233 (10) (2010) 2683–2687.
[18] C.H. Aladag, A new architecture selection method based on tabu search for artificial neural networks, Expert Syst. Appl. 38 (4) (2011) 3287–3293.
[19] H. Yuan, F. Xiong, X. Huai, A method for estimating the number of hidden neurons in feed-forward neural networks based on information entropy, Comput. Electron. Agric. 40 (1–3) (2003) 57–64.
[20] S. Xu, L. Chen, A novel approach for determining the optimal number of hidden layer neurons for FNN’s and its application in data mining, 2008.
[21] I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, fourth ed., Morgan Kaufmann.
[22] G.-B. Huang, Learning capability and storage capacity of two-hidden-layer feedforward networks, IEEE Trans. Neural Netw. 14 (2) (2003) 274–281.
[23] Y.E. Wang, G.-Y. Wei, D. Brooks, Benchmarking TPU, GPU, and CPU platforms for deep learning, 2019, arXiv preprint arXiv:1907.10701.
[24] Y. LeCun, J.S. Denker, S.A. Solla, Optimal brain damage, in: Advances in Neural Information Processing Systems, 1990, pp. 598–605.
[25] S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network, in: Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
[26] S. Anwar, K. Hwang, W. Sung, Structured pruning of deep convolutional neural networks, ACM J. Emerg. Technol. Comput. Syst. (JETC) 13 (3) (2017) 1–18.
[27] S. Moon, Y. Byun, J. Park, S. Lee, Y. Lee, Memory-reduced network stacking for edge-level CNN architecture with structured weight pruning, IEEE J. Emerg. Sel. Top. Circuits Syst. 9 (4) (2019) 735–746.
[28] F. Nikolaos, I. Theodorakopoulos, V. Pothos, E. Vassalos, Dynamic pruning of CNN networks, in: 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), IEEE, 2019, pp. 1–5.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A.C. Berg, L. Fei-Fei, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. (IJCV) 115 (3) (2015) 211–252, http://dx.doi.org/10.1007/s11263-015-0816-y.
[30] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014, arXiv preprint arXiv:1409.1556.
[31] D. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[32] J. Elson, J.J. Douceur, J. Howell, J. Saul, Asirra: a CAPTCHA that exploits interest-aligned manual image categorization, 2007.
[33] L.-J. Li, L. Fei-Fei, What, where and who? Classifying events by scene and object recognition, in: Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, IEEE, 2007, pp. 1–8.
[34] L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories, in: 2004 Conference on Computer Vision and Pattern Recognition Workshop, IEEE, 2004, p. 178.
[35] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks? in: Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
[36] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, 2013, arXiv preprint arXiv:1312.6199.
[37] I.J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, 2014, arXiv preprint arXiv:1412.6572.
[38] A. Kurakin, I. Goodfellow, S. Bengio, Adversarial examples in the physical world, 2016, arXiv preprint arXiv:1607.02533.
[39] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, Deepfool: A simple and accurate method to fool deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2574–2582.
[40] W. Xu, D. Evans, Y. Qi, Feature squeezing: Detecting adversarial examples in deep neural networks, 2017, arXiv preprint arXiv:1704.01155.
[41] G.K. Dziugaite, Z. Ghahramani, D.M. Roy, A study of the effect of jpg compression on adversarial images, 2016, arXiv preprint arXiv:1608.00853.
[42] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, S. Ishii, Distributional smoothing with virtual adversarial training, 2015, arXiv preprint arXiv:1507.00677.
[43] N. Carlini, D. Wagner, Audio adversarial examples: Targeted attacks on speech-to-text, in: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, 2018, pp. 1–7.
[44] M. Sato, J. Suzuki, H. Shindo, Y. Matsumoto, Interpretable adversarial perturbation in input embedding space for text, 2018, arXiv preprint arXiv:1805.02917.
[45] M. Tan, Q.V. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, 2019, arXiv preprint arXiv:1905.11946.

Stefano Marrone is a post-doc at the University of Naples Federico II. His research topics are within the sphere of Pattern Recognition and Computer Vision, with applications ranging from biomedical image processing to biometrics and image/video forensics. More recently, he has also been working on ethics and fairness in AI.

Cristina Papa received a master’s degree in Computer Science Engineering cum laude from the University of Naples Federico II. Her main research interests include embedded systems, model-based design, approximate computing, image processing and artificial intelligence.

Carlo Sansone is Full Professor of Computer Engineering at the Dipartimento di Ingegneria Elettrica e Tecnologie dell’Informazione of the University of Naples Federico II. His basic interests cover the areas of image analysis, pattern recognition, and machine and deep learning. From an application point of view, his main contributions were in the fields of biomedical image analysis, biometrics and image forensics. He coordinated several projects, mainly in the areas of biomedical image interpretation and network intrusion detection. Prof. Sansone has authored more than 200 papers in international journals and conference proceedings. He was also co-editor of three special issues and of three books.