Article
Improving Network Training on Resource-Constrained Devices
via Habituation Normalization †
Huixia Lai 1 , Lulu Zhang 1 and Shi Zhang 1,2, *
1 The College of Computer and Cyber Security, Fujian Normal University, Fuzhou 350007, China
2 The Digit Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University,
Fuzhou 350007, China
* Correspondence: [email protected]
† This paper is an extended version of our paper published in ICICN 2022.
Abstract: As a technique for accelerating and stabilizing training, batch normalization (BN) is widely used in deep learning. However, BN cannot effectively estimate the mean and the variance of samples when training or fine-tuning with small batches of data on resource-constrained devices, which degrades the accuracy of the deep learning model. In the fruit fly olfactory system, an algorithm based on the “negative image” model of habituation can filter redundant information and improve numerical stability. Inspired by this circuit mechanism, we propose a novel normalization method, habituation normalization (HN). HN first eliminates the “negative image” obtained by habituation and then calculates the statistics for normalization, which solves the accuracy degradation of BN when the batch size is small. The experimental results show that HN can speed up neural network training and improve model accuracy for vanilla LeNet-5, VGG16, and ResNet-50 on the Fashion MNIST and CIFAR10 datasets. Compared with four standard normalization methods, HN maintains stable and high accuracy across different batch sizes, which shows that HN has strong robustness. Finally, applying HN to a deep learning-based EEG signal application system shows that HN is suitable for network fine-tuning and neural network applications under limited computing power and memory.
Keywords: normalization; neural network training; resource-constrained device; habituation; EEG signal application

Citation: Lai, H.; Zhang, L.; Zhang, S. Improving Network Training on Resource-Constrained Devices via Habituation Normalization. Sensors 2022, 22, 9940. https://doi.org/10.3390/s22249940
foundation of many state-of-the-art computer vision algorithms and is applied to the latest
network architectures.
BN, with great success, is not without drawbacks. For example, on a ResNet-50 model trained on CIFAR10, BN performs well with a sufficiently large batch size (e.g., 32 images per worker). However, a small batch size leads to an inaccurate estimation of the mean and variance within a batch, which reduces model accuracy (Figure 1).
In addition, BN cannot be effectively applied to recurrent neural networks (RNNs) [3]. In
response to this problem, some normalization methods have been proposed. For example,
Layer Normalization (LN) [3], Weight Normalization (WN) [4], Instance Normalization
(IN) [5], Group Normalization (GN) [6], Attentive Graph Normalization (AGN) [7], etc.
GN has higher stability than these other normalization methods but lower performance at medium and large batch sizes. As a particular case of BN and LN, IN considers only the elements of one channel of one sample when calculating the statistics, which makes it more suitable for fast stylization. LN is mainly applied to recurrent neural networks (RNNs) and is rarely used in CNNs. Therefore, it is necessary to explore a new normalization method with high stability that suits different network types.
Figure 1. CIFAR10 classification test accuracy vs. batch size. A ResNet-50 model trained on CIFAR10, showing that small batch sizes can dramatically reduce accuracy.
In the remainder of this paper, we first introduce the works related to normalization
and habituation in Section 2. Then the formulation and implementation are discussed in
Section 3. In Section 4, the experimental analyses of HN are performed. Section 5 is a case
study. Section 6 concludes the paper.
2. Related Works
2.1. Normalization
Normalizing hidden features in neural networks can speed up the network train-
ing, so normalization methods are widely used. In recent years, normalization methods,
such as Batch Normalization (BN) [1], Layer Normalization (LN) [3], Weight Normaliza-
tion (WN) [4], Instance Normalization (IN) [5], Batch Renormalization (BRN) [13], Group
Normalization (GN) [6], and Attentive Graph Normalization (AGN) [7], are proposed
successively. We briefly introduce these normalization methods in this subsection.
Ioffe and Szegedy proposed batch normalization (BN) [1] in 2015. BN first normalizes a feature map with the mean and variance calculated along the batch, height, and width dimensions of the feature map, and then re-scales and re-shifts the normalized feature map. BN is widely used in CNNs with significant results [14,15] but is less applicable to RNN and LSTM networks. In addition, BN degrades network accuracy when the batch size is small.
In 2016, Ba, Kiros, and Hinton proposed layer normalization (LN) [3]. LN computes the mean and variance along a feature map's channel, height, and width dimensions and then normalizes it. LN and BN are orthogonal in the dimensions over which they compute the mean and the variance. LN performs the same operation in the training and testing phases. It solves the problem that BN is unsuited for RNNs and, at the same time, achieves good results with small batch sizes. However, LN is still less accurate than BN in many large image recognition tasks.
Salimans and Kingma proposed weight normalization (WN) [4] in 2016. WN decouples the weight vector into a parameter vector v and a parameter scalar g to reparameterize the weights and optimizes these parameters by stochastic gradient descent. Unlike BN and LN, WN normalizes the parameters rather than the activations. WN also accelerates the convergence of stochastic gradient descent optimization.
In 2016, Ulyanov, Vedaldi, and Lempitsky proposed instance normalization (IN) [5]. IN uses all elements of a single channel of a single sample to calculate the mean and variance and then normalizes. IN is mainly applied in style transfer to accelerate model convergence and maintain the independence between image instances.
In 2017, Ioffe proposed batch renormalization (BRN) by adding two non-trainable parameters, r and d, to BN [13]. BRN keeps the training phase and the inference phase equivalent and alleviates the problems of non-independent, non-identically distributed data and small batches. Although BRN mitigates BN's accuracy reduction at small batch sizes, it is still batch dependent; therefore, its accuracy is still affected by the batch size.
Wu and He proposed group normalization (GN) [6] in 2018. GN divides the channels into groups and calculates the mean and the variance over the channel, height, and width dimensions within each group. LN, IN, and GN all perform independent computations along the batch axis. The two extreme cases of GN are equivalent to LN and IN. Although GN is batch-size independent, it requires the channels to be divided into G groups; therefore, its stability lies between those of IN and LN.
Chen et al. proposed the attentive graph normalization (AGN) [7] in 2022. AGN
learns a weighted combination of multiple graph-aware normalization methods, aiming
to automatically select the optimal combination of multiple normalization methods for a
specific task. However, it is limited to graph-based applications.
Figure 2. When a dog sits in the garden and gets used to the smell of flowers, it can perceive any change in the environment (for example, a coyote that appears in the distance) [9]. In this scene, the dog's perception of the flowers' smell gradually fades as it habituates; newly arriving smells are then magnified and easily detected.
The effect of habituation on background elimination has also attracted the attention of computer scientists. Computational models that capture the primary effect of habituation (i.e., background subtraction) have been used in robotics applications [9,10] and in deep learning networks to enhance object recognition [11]. In 2018, Kim et al. applied a background subtraction algorithm [11] to each video frame to find the region of interest (ROI), and then performed CNN classification to label the ROI as one of the predefined categories. In 2020, Shen et al. implemented an unsupervised neural algorithm for odor habituation in fruit fly olfactory circuits [9] and published the work in PNAS. They used background elimination to distinguish between similar odors and to improve foreground detection; the method improves the detection of novel components in odor mixtures.
Studies in [8–11,22] revealed the benefits of using habituation in machine learning or
deep learning and suggested that models that incorporate additional features of habituation
yielded more robust algorithms.
In this paper, based on habituation's ability to filter redundant information and stabilize values, we design a habituation normalization (HN) layer for neural networks. It can improve both the training efficiency of the network and the model accuracy.
3. Method
In this section, we first review existing normalization methods and then propose the
HN method with stimulus memorability.
x̂_nchw = α (x_nchw − µ_i) / √(σ_i² + e) + β    (1)

where α and β are a scale and a shift parameter, respectively, and e is a tiny constant.
Equation (1) summarizes the normalization formulas of BN, LN, and IN; the only difference among them is the set of pixels used to estimate µ and σ. µ and σ can be expressed using (2) and (3).
µ_i = (1/|S_i|) ∑_{(n,c,h,w)∈S_i} x_nchw    (2)

σ_i² = (1/|S_i|) ∑_{(n,c,h,w)∈S_i} (x_nchw − µ_i)²    (3)
where i ∈ {bn, ln, in} distinguishes the different methods, S_i is a set of pixels, and |S_i| is the size of S_i.
BN counts all pixels on a single channel, which can be expressed as Sbn = {(n, h, w) |
n ∈ [1, N ], w ∈ [1, W ], h ∈ [1, H ]}. LN counts all pixels on a single sample, which can be
expressed as Sln = {(c, h, w) | c ∈ [1, C ], w ∈ [1, W ], h ∈ [1, H ]}. IN counts all pixels of a
single channel of a single sample, which can be expressed as Sin = {(h, w) | w ∈ [1, W ],
h ∈ [1, H ]}.
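For concreteness, the following minimal PyTorch sketch shows which dimensions of a 4D feature map each of S_bn, S_ln, and S_in reduces over; the tensor shape and values are illustrative, not taken from the paper.

```python
import torch

# Illustrative 4D feature map of shape [N, C, H, W]
x = torch.randn(8, 16, 32, 32)
eps = 1e-5

# BN: statistics over (N, H, W) for each channel, i.e., over S_bn
mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)                   # shape [1, C, 1, 1]
var_bn = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)

# LN: statistics over (C, H, W) for each sample, i.e., over S_ln
mu_ln = x.mean(dim=(1, 2, 3), keepdim=True)                   # shape [N, 1, 1, 1]
var_ln = x.var(dim=(1, 2, 3), unbiased=False, keepdim=True)

# IN: statistics over (H, W) for each sample and channel, i.e., over S_in
mu_in = x.mean(dim=(2, 3), keepdim=True)                      # shape [N, C, 1, 1]
var_in = x.var(dim=(2, 3), unbiased=False, keepdim=True)

# The normalization itself has the same form in all three cases (Equation (1));
# the learnable scale and shift are omitted here.
x_hat_bn = (x - mu_bn) / torch.sqrt(var_bn + eps)
```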
For GN, G is the number of groups, a predefined hyperparameter (G = 32 by default). In GN, the tensor is reshaped to [N, G, C/G, H, W]. Let x_ncghw and x̂_ncghw be the pixel values before and after normalization, where n ∈ [1, N], g ∈ [1, G], c ∈ [1, ⌊C/G⌋], h ∈ [1, H], w ∈ [1, W]. Then, Equations (1)–(3) become (4)–(6).
x̂_ncghw = α (x_ncghw − µ_gn) / √(σ_gn² + e) + β    (4)

µ_gn = (1/|S_gn|) ∑_{(n,c,g,h,w)∈S_gn} x_ncghw    (5)

σ_gn² = (1/|S_gn|) ∑_{(n,c,g,h,w)∈S_gn} (x_ncghw − µ_gn)²    (6)
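As a complement to Equations (4)–(6), the sketch below computes GN statistics in PyTorch by reshaping the feature map to [N, G, C/G, H, W]; the sizes are illustrative, and the learnable affine parameters are omitted.

```python
import torch

N, C, H, W, G = 8, 16, 32, 32, 4   # illustrative sizes; C must be divisible by G
x = torch.randn(N, C, H, W)
eps = 1e-5

# Reshape to [N, G, C/G, H, W] and compute statistics per (sample, group),
# i.e., over the (C/G, H, W) dimensions, as in Equations (5) and (6)
xg = x.view(N, G, C // G, H, W)
mu = xg.mean(dim=(2, 3, 4), keepdim=True)
var = xg.var(dim=(2, 3, 4), unbiased=False, keepdim=True)

# Equation (4) without the affine part, then restore the original shape
x_hat = ((xg - mu) / torch.sqrt(var + eps)).view(N, C, H, W)
```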
Figure 3. Overview of the fruit fly olfaction circuit. For each input odor, ORNs each fire at some
rate. ORNs send feed-forward excitation to PNs of the same type. A single lateral inhibitory neuron
(LN1) also receives feed-forward inputs from ORNs and then sends inhibition to PNs. The locus of
habituation lies at the LN1 → PN synapse, and the weights of these synapses are shown via red line
thickness. PNs then project to KCs, and each KC samples sparsely from a few PNs [9].
The “negative image” in HN is a weight vector v, which is initially a zero vector and
has a shape of [1, C, 1, 1]. At iteration t, input xnchw is adjusted with (7).
where the weight vector v is updated with (8), and the shape of v will match the shape of
the input data.
v^t_nchw = v^{t−1}_{1c11} + γ x_nchw − ϕ v^{t−1}_{1c11}    (8)
Then, the mean of v is calculated for each channel, so that the shape of v is restored to [1, C, 1, 1] and it can be applied to the next input (9).
v^t_{1c11} = (1/|S|) ∑_{(n,c,h,w)∈S} v^t_nchw    (9)
Finally, the statistics and affine changes are calculated with (10)–(12).
µ_HN = (1/|S_HN|) ∑_{(n,c,h,w)∈S_HN} x_nchw    (10)

σ_HN² = (1/|S_HN|) ∑_{(n,c,h,w)∈S_HN} (x_nchw − µ_HN)²    (11)

x̂_nchw = α (x_nchw − µ_HN) / √(σ_HN² + e) + β    (12)
In Equation (8), the habituation rate γ ∈ (0, 1) and the weight recovery rate ϕ ∈ (0, 1). In Equation (9), S = {(n, h, w) | n ∈ [1, N], w ∈ [1, W], h ∈ [1, H]} and |S| = N · H · W. In Equations (10) and (11), S_HN = {(n, c, h, w) | n ∈ [1, N], c ∈ [1, C], h ∈ [1, H], w ∈ [1, W]}. In Equation (12), α and β are initialized as a one vector and a zero vector, respectively.
In the habituation method, the “negative image” is saved in the vector v. If every batch of data is similar, we expect a “negative image” of the input to form over time. After subtracting the “negative image” from the following input, what remains is the foreground component of the images. With HN, the construction of the “negative image” is gradual: the construction process considers the current batch and the influence of the previous batches simultaneously. Equation (9) removes the batch-size factor after the “negative image” is constructed via (8), which makes HN independent of the batch size. Therefore, it can be applied to different batch sizes.
3.3. Implementation
HN can be implemented in the popular neural network framework PyTorch. Figure 4 shows the PyTorch-based code.
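Since Figure 4 is not reproduced here, the following is a minimal PyTorch sketch of Equations (7)–(12) under our reading of them, not the authors' reference code: the class name HabituationNorm2d, the default γ and ϕ values, and the assumption that Equation (7) subtracts the per-channel “negative image” from the input before the statistics are computed are all ours.

```python
import torch
import torch.nn as nn

class HabituationNorm2d(nn.Module):
    """Sketch of habituation normalization (HN) for 4D inputs [N, C, H, W]."""

    def __init__(self, num_channels, gamma=0.1, phi=0.1, eps=1e-5):
        super().__init__()
        self.gamma, self.phi, self.eps = gamma, phi, eps                 # habituation / recovery rates (illustrative defaults)
        self.alpha = nn.Parameter(torch.ones(1, num_channels, 1, 1))     # scale, initialized to one
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))     # shift, initialized to zero
        self.register_buffer("v", torch.zeros(1, num_channels, 1, 1))    # "negative image", initially zero

    def forward(self, x):
        # Eq. (7), as we read it: remove the current "negative image" from the input
        x_adj = x - self.v
        if self.training:
            with torch.no_grad():
                # Eq. (8): habituate v towards the input (v is broadcast to the input shape)
                v_full = self.v + self.gamma * x - self.phi * self.v
                # Eq. (9): average over batch and spatial dims, back to shape [1, C, 1, 1]
                self.v.copy_(v_full.mean(dim=(0, 2, 3), keepdim=True))
        # Eqs. (10)-(12): global statistics over all pixels of the adjusted input
        mu = x_adj.mean()
        var = x_adj.var(unbiased=False)
        x_hat = (x_adj - mu) / torch.sqrt(var + self.eps)
        return self.alpha * x_hat + self.beta
```

Such a layer can be dropped in wherever a BatchNorm2d layer would normally sit, e.g., `nn.Sequential(nn.Conv2d(1, 6, 5), HabituationNorm2d(6), nn.ReLU())`.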
4. Experiment
In this section, we evaluate the effectiveness of HN on two benchmark datasets and
three different deep learning networks.
The experiments are performed on a machine with an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz (2 processors), 128 GB of RAM, and an NVIDIA Quadro P5000 graphics card, running Windows Server 2016.
Figure 6. The curves of training epochs vs. test accuracy on dataset FASHION-MNIST with (a) vanilla
LeNet-5, (b) BN, (c) HN, and (d) LN.
From the above analysis, we can find that BN still has the problem of accuracy degra-
dation when the batch size is small in the FASHION-MNIST dataset. Our normalization
method HN adapts to a wide range of batch sizes and dramatically improves the conver-
gence speed and accuracy of the vanilla network.
Then, we applied these methods to the color dataset CIFAR10. Compared with gray images, color images have more data features, so we additionally add GroupNorm (GN) and BatchRenorm (BRN) for comparison. Due to the simple network and the limited number of channels, we set G = 2 for GN; BRN keeps its original settings. With 60 training epochs, the average accuracies of the last 5 epochs are shown in Table 1.
Table 1. Experiments on CIFAR10. The average test accuracy of the last 5 epochs for vanilla LeNet-5, BN, GN, LN, BRN, and HN.
Batch Size 2 4 8 16
Vanilla 0.5408 0.5496 0.5702 0.5796
BN 0.2986 0.673 0.6854 0.6564
GN 0.592 0.618 0.6056 0.6096
LN 0.5826 0.5894 0.6178 0.598
BRN 0.3046 0.6328 0.6476 0.6498
HN 0.601 0.6114 0.6232 0.6278
In Table 1, we find that BN and BRN do not work well when batch size = 2: the test accuracy of HN is 6.02%, 30.24%, 0.9%, 1.84%, and 29.64% higher than that of vanilla LeNet-5, BN, GN, LN, and BRN, respectively. When batch size = 4, 8, or 16, BN has the best accuracy. HN improves the test accuracy of vanilla LeNet-5 at all batch sizes.
The experiments on the FASHION-MNIST and CIFAR10 datasets verify that HN can accelerate convergence and improve the test accuracy of convolutional neural networks. Being independent of batch size, HN neither degrades at smaller batch sizes nor saturates at larger ones.
4.2.2. VGG16
Due to the relatively simple structure of the LeNet-5, this section additionally adds
a popular deep convolutional neural network VGG16. We trained, respectively, VGG16
without normalization layer (Vanilla) and VGG16 with BN or HN on the FASHION-MNIST
dataset. As before, we optimized using Adam for 30 epochs, setting the initial learning rate
to 0.001 and the batch sizes to 2, 4, 8, and 16. For each batch size, the curves of accuracy vs.
epoch are shown in Figure 7, and the average accuracies of the last 5 epochs are shown in
Table 2.
Figure 7. The curves of training epochs vs. test accuracy on the FASHION-MNIST dataset with VGG16. (a) batch size = 2. (b) batch size = 4. (c) batch size = 8. (d) batch size = 16.
Table 2. Experiments on FASHION-MNIST. The average test accuracy of the last 5 epochs for vanilla VGG16, BN, and HN.
Batch Size 2 4 8 16
Vanilla 0.100 0.100 0.100 0.100
BN 0.784 0.865 0.923 0.929
HN 0.884 0.924 0.924 0.923
In the VGG16 experiments on the FASHION-MNIST dataset, the test accuracy of Vanilla (VGG16 without a normalization layer) stays at 0.100 for batch sizes 2, 4, 8, and 16, indicating that it cannot be trained properly at these small batch sizes; it only trains properly at much larger batch sizes (e.g., 512). VGG16 with BN or HN can be trained to converge. For batch sizes 8 and 16, HN and BN have approximately the same effect. For batch sizes 2 and 4, the instability of BN shows up, and HN achieves 10% and 5.9% higher accuracy than BN, respectively.
From the above analysis, when the batch size is small, whether or not VGG16 includes a normalization layer makes a huge difference on the FASHION-MNIST dataset. BN still suffers from reduced accuracy at small batch sizes in deep convolutional neural networks, whereas the HN proposed in this paper adapts to all batch sizes tested and to deep convolutional networks. Like BN, HN is added to the original VGG16 network and greatly improves the convergence speed and accuracy.
decay. The original data are read in for network training to ensure that the comparison of different normalization methods is not affected by preprocessing.
The baseline model is ResNet-50, which contains BN in its original design. The datasets used in this subsection are FASHION-MNIST and CIFAR10. In the baseline model, normalization is used after the convolution and before the ReLU; to apply HN, we swap it into the model in place of BN. Adam is used as the training optimizer with a learning rate of 0.001. The number of training epochs is 30, and the mini-batch sizes are 2, 4, 8, 16, and 32. In addition, we add GN and BRN for comparison. For GN, we use the parameter settings recommended in [6], where G = 32. For BRN, we keep the default settings in the source code.
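One way to perform this swap in torchvision's ResNet-50 is to replace every BatchNorm2d module recursively; the helper below is an illustrative sketch (swap_norm is our name, and HabituationNorm2d refers to the sketch given in Section 3.3), not the exact procedure used in the experiments.

```python
import torch.nn as nn
from torchvision.models import resnet50

def swap_norm(module, make_norm):
    """Recursively replace every nn.BatchNorm2d in `module` with make_norm(num_features)."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, make_norm(child.num_features))
        else:
            swap_norm(child, make_norm)
    return module

# HN (HabituationNorm2d, sketched in Section 3.3) or GN with G = 32 in place of BN
model_hn = swap_norm(resnet50(num_classes=10), HabituationNorm2d)
model_gn = swap_norm(resnet50(num_classes=10), lambda c: nn.GroupNorm(32, c))
```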
The experimental results of the tests on the FASHION-MNIST dataset are shown in Table 3. To reduce the effect of random variation, the average test accuracies of the last five epochs are listed. The results show that BN does not work well when batch size = 2; BRN converges but has low test accuracy, while HN and GN both achieve good results at batch size = 2. At the other batch size settings, their test accuracies are very close.
Table 3. Experiments on the FASHION-MNIST dataset with ResNet-50. The average test accuracy of the last 5 epochs for ResNet-50 (BN), BRN, GN@G = 32, and HN.
Batch Size 2 4 8 16 32
BN 0.109 0.7606 0.8618 0.9124 0.9176
BRN 0.479 0.904 0.9132 0.9108 0.9138
GN@G = 32 0.9062 0.9064 0.9038 0.9066 0.9046
HN 0.9024 0.906 0.9018 0.9056 0.9074
For the CIFAR10 dataset, we use the default ResNet-50 settings. Table 4 shows the average test accuracies of the last 5 epochs. The results of GN@G = 32 are not good when batch size = 2, which may be caused by the statistics becoming invalid, so GN@G = 2 is added for comparison. When batch size = 2, HN achieves the highest accuracy of 0.7226, which is 0.0208 higher than GN@G = 2 and 0.5174 higher than BRN. When batch size = 4, HN is 0.1596 higher than BN, 0.0234 higher than GN@G = 32, and 0.0102 lower than BRN. At the other batch size settings, their test accuracies are very close and stable.
Table 4. Experiments on the CIFAR10 dataset with ResNet-50. The average test accuracy of the last 5 epochs for ResNet-50 (BN), BRN, GN@G = 32, GN@G = 2, and HN.
Batch Size 2 4 8 16 32
BN 0.0954 0.5658 0.7354 0.7344 0.7504
BRN 0.2052 0.7356 0.7370 0.7626 0.7552
GN@G = 32 0.1 0.702 0.7016 0.7138 0.714
GN@G = 2 0.7018 0.701 0.7186 0.721 0.7224
HN 0.7226 0.7254 0.7326 0.7218 0.7264
Figure 8. The curves of estimated total size (MB) vs. test accuracy for different normalization layers and batch sizes on the FASHION-MNIST dataset. (a) LeNet-5 @ epoch = 30. (b) VGG16 @ epoch = 30. (c) ResNet-50 @ epoch = 30.
As can be seen from Figure 8a, in LeNet-5, HN reaches an accuracy of 0.898 with an estimated total size of 0.41 MB, whereas BN reaches the same accuracy only when its memory consumption is 0.65 MB. In Figure 8b, for VGG16, the estimated total size of HN is 71.34 MB, whereas BN needs 110.93 MB to reach the same accuracy. In ResNet-50, HN and BN achieve accuracies of 0.902 and 0.912 with estimated total sizes of 99.91 MB and 172.71 MB, respectively (Figure 8c). Across the three models, HN needs only 60.1%, 64.3%, and 57.8% of BN's memory requirements at comparable accuracy.
From the above analysis, we find that models with small batch sizes consume less memory and are therefore better suited to training and deployment on resource-constrained devices. Compared with BN, HN achieves higher accuracy at small batch sizes, so it is more suitable for resource-constrained devices.
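Estimated-total-size figures of this kind can be reproduced with a model-summary tool; the paper does not name the tool it used, so the torchinfo-based snippet below is only one possible way to obtain such an estimate for a given model and batch size.

```python
from torchinfo import summary          # assumption: not necessarily the tool used in the paper
from torchvision.models import resnet50

model = resnet50(num_classes=10)
# Larger batch sizes inflate the activation memory in the estimate; comparing
# input_size=(2, 3, 32, 32) with (32, 3, 32, 32) shows the gap for CIFAR10-sized inputs.
report = summary(model, input_size=(2, 3, 32, 32), verbose=0)
print(report)                          # the report ends with an "Estimated Total Size (MB)" line
```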
5. Case Study
Brain–computer interface (BCI) technology constructs a communication pathway between the human brain and external devices directly, without passing through the muscular system, and is widely used in assisted rehabilitation and brain-sensing games. Due to their low cost and high resolution, electroencephalogram (EEG) signals are widely used in BCI applications. The process of EEG-BCI includes EEG signal acquisition, signal processing, and pattern recognition. Thanks to the advantages of end-to-end neural networks in pattern recognition, EEG-BCI systems are gradually leaving the laboratory and being applied in portable device scenarios, such as embedded systems. As shown in Figure 9, the application of an embedded EEG-BCI system includes the following four steps:
1. Training: in the laboratory setting, collect enough EEG trials to train a deep neural network for pattern recognition.
2. Deploying: deploy the pre-trained deep neural network model to the embedded device.
3. Fine-tuning: fine-tune the deep neural network model while acquiring EEG trials.
4. Applying: apply the fine-tuned and stabilized deep neural network model to control the embedded device.
Figure 9. EEG signal application system based on deep learning and its application process.
A wet-electrode EEG-BCI system requires the subject to wear an EEG cap and apply conductive paste to each electrode, keeping the resistance of each electrode below 10 kΩ. However, subjects cannot guarantee that the electrode cap will be worn in the same position when migrating from the laboratory situation to the embedded device, and it is even harder to keep the resistance of each electrode the same (it is only guaranteed to be <10 kΩ). Because of these situational differences, the EEG trial set collected on the embedded device is not consistent with the EEG trial set used to train the deep neural network, so the data no longer satisfy the assumption of being independent and identically distributed. Because EEG signals are non-linear and non-stationary, the pre-trained deep neural network model needs to be fine-tuned to adapt to the embedded device situation. However, due to the storage and computational bottlenecks of embedded devices, we must set a limited batch size for the fine-tuning process.
In this case study, we use an EEG-based motor imagery BCI (MI-BCI) as an example to verify the effectiveness of HN when fine-tuning a deep neural network model. ShallowFBCSPNet (https://github.com/TNTLFreiburg/braindecode/blob/master/braindecode/models/shallow_fbcsp.py), proposed by Schirrmeister et al. in 2017, is a deep neural network designed for decoding imagined or executed tasks from raw EEG signals and performs well in classifying EEG signals [31]. The BCI Competition IV dataset 2a (BCIC IV 2a) is a classical EEG-based MI-BCI dataset. We take this dataset as an example to analyze and compare the performance of HN and BN. Since BN is built into the original ShallowFBCSPNet, we replace BN with HN and set max_epoch to 1600 and max_increase_epochs to 160. No extra preprocessing is performed on the EEG signals.
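A minimal sketch of this small-batch fine-tuning setup is given below. The network is a simplified stand-in, not the real ShallowFBCSPNet from braindecode, and the trial shapes, class handling, and hyperparameters are illustrative; HabituationNorm2d refers to the sketch in Section 3.3.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Simplified stand-in for a pre-trained ShallowFBCSPNet, with HN in place of the original BN
model = nn.Sequential(
    nn.Conv2d(1, 40, kernel_size=(1, 25)),   # temporal convolution over [trial, 1, channel, time]
    HabituationNorm2d(40),                   # HN replaces nn.BatchNorm2d(40)
    nn.ELU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(40, 4),                        # four motor-imagery classes in BCIC IV 2a
)

# Pretend fine-tuning data: a few EEG trials from the new user (e.g., 20% of subject B)
trials = torch.randn(56, 1, 22, 1000)        # [trial, 1, channel, time], illustrative shape
labels = torch.randint(0, 4, (56,))
loader = DataLoader(TensorDataset(trials, labels), batch_size=2, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
model.train()
for epoch in range(10):                      # short fine-tuning budget on the device
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```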
First, we conducted experiments in two cases, the original batch size (batch size = 60)
and a smaller batch size (batch size = 8), to examine the influence of batch size on the test
accuracy of HN. Table 5 shows the best prediction results in 10 epochs before the end of
training with batch size = 60, while Table 6 shows the best prediction results in 10 epochs
before the end of training with batch size = 8.
Table 5. Test accuracies of EEG signal classification using ShallowFBCSPNet (batch size = 60) on the BCIC IV 2a dataset. Δ is the improvement in test accuracy achieved by HN.
Accuracy 1 2 3 4 5 6 7 8 9 Average
BN 0.840 0.483 0.882 0.740 0.288 0.535 0.924 0.778 0.764 0.693
HN 0.865 0.472 0.896 0.719 0.563 0.569 0.910 0.799 0.788 0.731
Δ 0.025 −0.011 0.014 −0.021 0.275 0.034 −0.014 0.021 0.024 0.038
After replacing BN with HN, the test accuracy was improved on 6 of the 9 subjects, with a maximum improvement of 0.275 (Table 5). There was a slight decrease for the other three subjects, with a maximum reduction of 0.021. Overall, the average accuracy was improved by 0.038, which indicates that HN is more suitable for MI recognition of EEG signals when batch size = 60.
When batch size = 8, Table 6 shows that after replacing BN with HN the test accuracy was improved for 8 out of 9 subjects, by up to 0.198. Overall, the average accuracy was enhanced by 0.054. The experimental results indicate that ShallowFBCSPNet with HN performs better than with BN for MI recognition of EEG signals when the batch size is small.
Table 6. Test accuracies of EEG signal classification using ShallowFBCSPNet (batch size = 8) on the BCIC IV 2a dataset. Δ is the improvement in test accuracy achieved by HN.
Accuracy 1 2 3 4 5 6 7 8 9 Average
BN 0.792 0.431 0.837 0.642 0.368 0.517 0.927 0.740 0.736 0.666
HN 0.840 0.465 0.865 0.708 0.566 0.535 0.906 0.785 0.806 0.720
Δ 0.048 0.034 0.028 0.066 0.198 0.018 −0.021 0.045 0.070 0.046
To demonstrate the real application scenario of the deep learning-based EEG signals
application system, the experiments were conducted with EEG signals from subject A as
training (batch size = 8), followed by fine-tuning (batch size = 2) and testing on subject B
as the user. We train the model with the EEG signals from subjects 2, 4, 6, 8, and 1, and
fine-tune and test the model on subjects 1, 3, 5, 7, and 9 as users. During fine-tuning, we
randomly selected 20% of the EEG signals from subject B. Finally, the best prediction result
of the last 10 epochs of the fine-tuned model is shown in Table 7.
Table 7. Accuracies obtained under different training and test users. The model is trained with EEG from subject i (batch size = 8), fine-tuned with 20% of the EEG from subject j (batch size = 2), and then tested on subject j. i→j represents training on subject i and testing on subject j. Δ is the improvement in test accuracy achieved by HN.
As shown in Table 7, when the training subject and the user are not the same person, the model is fine-tuned with a smaller batch size and less fine-tuning data, and the accuracy of ShallowFBCSPNet decreases greatly, which indicates that embedded-device applications of deep learning-based EEG signal recognition models still have a long way to go. Comparing HN with BN, HN achieves better accuracy in the five pairs of experiments, while BN shows no clear advantage. Overall, the average accuracy of HN is 4.4% higher than that of BN, which indicates that HN is more suitable for deep neural network recognition models on resource-constrained devices.
6. Conclusions
Habituation is a simple memory that changes over time and inhibits neural responses
to repetitive, neutral stimuli. This adaptive mechanism allows organisms to focus their
attention on the most salient signals in the environment. The Drosophila olfactory system
is based on a “negative image” model of habituation that filters redundant information and
enhances olfactory stability. Inspired by the circuit mechanism of the Drosophila olfactory
system, we propose a novel normalization method, habituation normalization (HN), with
three characteristics of habituation in biology: stimulus adaptation, stimulus specificity,
and reversibility. HN first eliminates the “negative image” obtained by habituation and
then calculates the overall statistics to achieve normalization.
We apply HN to LeNet-5, VGG16, and ResNet-50. Experiments on three benchmark
datasets show that HN can effectively accelerate network training and improve the test
accuracy. By comparing with other normalization methods (LN, BN, GN, and BRN), the ex-
perimental results verify that HN can be used in a wide range of batch sizes, and show good
robustness. Finally, we apply HN to a deep learning-based EEG signal application system.
Experimental results in two cases (train on A, test on A; train on A, test on B) show that HN is more suitable for deep learning network applications on resource-constrained devices.
As future work, we will extend HN to other types of deep learning networks, such as
recurrent neural network (RNN/LSTM) or Generative Adversarial Network (GAN).
Author Contributions: H.L. wrote part of the manuscript. L.Z. proposed the study, simulated it and
wrote part of the manuscript. S.Z. made writing suggestions, reviewed and analyzed the proposed
research. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Natural Science Foundation of Fujian Province grant
number 2020J01161, the Project of Fujian Province Science and Technology Plan grant number
2020H6011, the project of Fuzhou City Science and Technology Plan grant number 2021-S-259.
Data Availability Statement: Not applicable.
Acknowledgments: Tianjian Luo is an expert in the field of EEG research and helped solve some problems in the case study.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proc. Int. Conf.
Mach. Learn. 2015, 37, 448–456.
2. Murad, N.; Pan, M.C.; Hsu, Y.F. Reconstruction and Localization of Tumors in Breast Optical Imaging via Convolution Neural
Network Based on Batch Normalization Layers. IEEE Access 2022, 10, 57850–57864. [CrossRef]
3. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
4. Salimans, T.; Kingma, D.P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks.
Adv. Neural Inf. Process. Syst. 2016, 29. [CrossRef]
5. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv 2016,
arXiv:1607.08022.
6. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
7. Chen, Y.; Tang, X.; Qi, X.; Li, C.G.; Xiao, R. Learning graph normalization for graph neural networks. Neurocomputing 2022,
493, 613–625. [CrossRef]
8. Wilson, D.A.; Linster, C. Neurobiology of a simple memory. J. Neurophysiol. 2008, 100, 2–7. [CrossRef] [PubMed]
9. Shen, Y.; Dasgupta, S.; Navlakha, S. Habituation as a neural algorithm for online odor discrimination. Proc. Natl. Acad. Sci. USA
2020, 117, 12402–12410. [CrossRef]
10. Marsland, S.; Nehmzow, U.; Shapiro, J. Novelty detection on a mobile robot using habituation. arXiv 2000, arXiv:cs/0006007.
11. Kim, C.; Lee, J.; Han, T.; Kim, Y.M. A hybrid framework combining background subtraction and deep neural networks for rapid
person detection. J. Big Data 2018, 5, 1–24. [CrossRef]
12. Markou, M.; Singh, S. Novelty detection: A review—part 2: Neural network based approaches. Signal Process. 2003, 83, 2499–2521.
[CrossRef]
13. Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. Adv. Neural Inf. Process.
Syst. 2017, 30. [CrossRef]
14. Daneshmand, H.; Joudaki, A.; Bach, F.R. Batch Normalization Orthogonalizes Representations in Deep Random Networks. In
Proceedings of the Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing
Systems 2021, NeurIPS 2021, Virtual, 6–14 December 2021; Ranzato, M.; Beygelzimer, A.; Dauphin, Y.N., Liang, P., Vaughan, J.W.,
Eds.; 2021; pp. 4896–4906. Available online: https://proceedings.neurips.cc/paper/2021 (accessed on 10 October 2022).
15. Lobacheva, E.; Kodryan, M.; Chirkova, N.; Malinin, A.; Vetrov, D.P. On the Periodic Behavior of Neural Network Training
with Batch Normalization and Weight Decay. In Proceedings of the Advances in Neural Information Processing Systems 34:
Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, Virtual, 6–14 December 2021; Ranzato, M.,
Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W., Eds.; 2021; pp. 21545–21556. Available online: https://proceedings.
neurips.cc/paper/2021 (accessed on 10 October 2022).
16. Bailey, C.H.; Chen, M. Morphological basis of long-term habituation and sensitization in Aplysia. Science 1983, 220, 91–93.
[CrossRef] [PubMed]
17. Greenberg, S.M.; Castellucci, V.F.; Bayley, H.; Schwartz, J.H. A molecular mechanism for long-term sensitization in Aplysia.
Nature 1987, 329, 62–65. [CrossRef] [PubMed]
18. O’keefe, J.; Nadel, L. The Hippocampus as a Cognitive Map; Oxford University Press: Oxford, UK, 1978.
19. Ewert, J.P.; Kehl, W. Configurational prey-selection by individual experience in the toadBufo bufo. J. Comp. Physiol. 1978,
126, 105–114. [CrossRef]
20. Wang, D.; Arbib, M.A. Modeling the dishabituation hierarchy: The role of the primordial hippocampus. Biol. Cybern. 1992,
67, 535–544. [CrossRef]
21. Thompson, R.F. The neurobiology of learning and memory. Science 1986, 233, 941–947. [CrossRef]
22. Dasgupta, S.; Stevens, C.F.; Navlakha, S. A neural algorithm for a fundamental computing problem. Science 2017, 358, 793–796.
[CrossRef]
23. Groves, P.M.; Thompson, R.F. Habituation: A dual-process theory. Psychol. Rev. 1970, 77, 419. [CrossRef]
24. Stanley, J.C. Computer simulation of a model of habituation. Nature 1976, 261, 146–148. [CrossRef]
25. Wang, D.; Hsu, C. SLONN: A simulation language for modeling of neural networks. Simulation 1990, 55, 69–83. [CrossRef]
26. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998,
86, 2278–2324. [CrossRef]
27. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
29. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017,
arXiv:1708.07747.
30. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 1 December 2022).
31. Schirrmeister, R.T.; Springenberg, J.T.; Fiederer, L.D.J.; Glasstetter, M.; Eggensperger, K.; Tangermann, M.; Hutter, F.; Burgard,
W.; Ball, T. Deep learning with convolutional neural networks for EEG decoding and visualization. Hum. Brain Mapp. 2017,
38, 5391–5420. [CrossRef] [PubMed]