0% found this document useful (0 votes)
6 views4 pages

Weight Dropout For Preventing Neural Networks From Overfitting

The paper introduces a novel regularization technique called weight dropout to prevent overfitting in deep neural networks, particularly in convolutional neural networks (CNNs). It compares the performance of this method against traditional dropout techniques in tasks like image classification and segmentation, demonstrating improved accuracy on various datasets. The proposed method effectively enhances the stability and performance of neural networks during training by randomly dropping weights rather than nodes.

Uploaded by

Sanjar Karshiev
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views4 pages

Weight Dropout For Preventing Neural Networks From Overfitting

The paper introduces a novel regularization technique called weight dropout to prevent overfitting in deep neural networks, particularly in convolutional neural networks (CNNs). It compares the performance of this method against traditional dropout techniques in tasks like image classification and segmentation, demonstrating improved accuracy on various datasets. The proposed method effectively enhances the stability and performance of neural networks during training by randomly dropping weights rather than nodes.

Uploaded by

Sanjar Karshiev
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

Weight Dropout for Preventing Neural Networks from Overfitting

Karshiev Sanjar 1 , Abdul Rehman1 , Anand Paul1, Kim JeongHong1 *


1
Department of Computer Science, Kyungpook National University, South Korea
email: [email protected], [email protected], [email protected], * [email protected]

Abstract—This paper briefly introduces an enhanced neural


network regularization method, so called weight dropout, in II. RELATED WORK
order to prevent deep neural networks from overfitting. In Deep neural networks include multiple nonlinear hidden
suggested method, the fully connected layer jointly used with layers, leading to expressive models learning very complex
weight dropout is a collection of layers in which the weights
between nodes are dropped randomly on the process of
relations between inputs and outputs. With limited training
training. To accomplish the desired regularization method, we data, many of these complex relationships refer to noise,
propose a building blocks with our weight dropout mask and which leads to overfitting. Several approaches have been
CNN. The performance of proposed method has been applied to avoid this. Strivastava et al. [1] proposed a
compared with other previous methods in the domain of dropout method to mitigate this problem. The technique
image classification and segmentation for the evaluation randomly drops units from the neural network during
purpose. The results show that the proposed method gives training. This prevents excessively perfect co-adaptations.
successful performance accuracies in several datasets. More technically, individual nodes are either dropped from
the net with probability of 1 or kept with probability of .
Keywords-component; overfitting; weight dropout;
regularization; image classification; semantic segmentation
The reduced network is then left for training while the
inputs and outputs for the dropped-out nodes are eliminated
as illustrated in Figure 1.
I. I NTRODUCTION
Deep Neural network models are suitable for the fields
where large number of labeled data is available. Owing to
their scope they can effortlessly be enlarged by inserting
more layers or more nodes in every layer. Conversely, deep
networks with many parameters can easily overfit despite
the size of dataset is large. Consistently, variety techniques
for regularizing Neural networks (NN) have been
introduced to deep learning community. Applying an
(a) (b)
L2 penalty to the model parameters is an easy yet efficient
method. Other techniques of regularization include Figure 1. Dropout neural network. (a) traditional network with 2 hidden
layers; (b) an example of a thinned network. Crossed nodes have been
simplifying network architecture, data augmentation or dropped
early stopping the training process. Applying these
regularization methods in training stage provides better This compensates for the larger size of the network now
performance on test data set. that no neurons are dropped and can be interpreted as
In most research works dropout technique has shown averaging over the possible networks during training. The
that neural network models can be improved by regularizing
probability can vary for each layer, with the original paper
them. Standard dropout [1] can generalize well to several
types of neural network models, however there is still room recommending p = 0.2 for the input layer and p = 0.5 for
to advance the models in terms of computational cost or hidden layers. Neurons in the output layer are not dropped.
performance. Specifically, dropout slows down the training Statistically, the formulation of dropout is given below:
because it chooses different neurons to drop in each training y = f(Wx)m, m  Bernoulli(1-p) (1)
step. Since new types of neural network architectures are
introduced to deep learning community, opportunities to where y is the result, f() is the non-linearity function, W
enhance these networks with dropout-based models will be is the learnable weight parameters, x is the input data,
in researchers’ observation. and m is dropout mask. The dropout mask gives the
Extensive academic research has been made to explain random zeros and ones with the probability of p.
the reason for why standard dropout performs well. A key Krizhevsky et al. [3] suggested complex CNN
instruction in this field have been inferring dropout as architecture to classify 1.2 M images in ImageNet.
directly generalizing over a group of neural networks [2]. Stochastic pooling [4] is a dropout based regularization
Study of dropout has also revealed experimental technique. Dropout has been widely applied to recurrent
conclusions that have confirmed being valuable in neural networks[5]–[8] meanwhile CNN has been great
recognizing dropout and in enhancing novel dropout success with dropout methods. As a replacement for
approaches. This has directed to research field in applying continuously taking the most difficult activity within every
dropout to sparsity and eventually simplify neural networks.
pooling section as max-pooling does, stochastic pooling
We propose another approach for dropout called Weight
Dropout that enhances the performance of convolutional randomly chooses the activations basedon a multinomial
neural networks. distribution. Other methods have been proposed for
developing dropout-based regularization in CNNs are to
operate dropping out at substitute viewpoints. Max-pooling
dropout [9] is an alternative method, where the elements of
pooling layer is dropped at max-pooling layer.
The max-pooling operator always pools the
largest value in a given pooling window, while the max-
pooling dropout method provides an opportunity for
smaller feature values to affect activations in later layers.
This technique can help the network to avoid overfitting as
saturated activation values have less contribution in the
network loss. At test time, the pooling operation becomes a
linear sum over activations, where each activation is
weighted by the probability that it would be selected as the Figure 4. Freezing the baseline ResNet-50 model
output during training according to this dropout method. B. Weight Dropout Mask
Another efficient regularization method called Cutout
is proposed in [10]. The authors showed that the ordinary We consider a fully connected layer of neural network
with input X = [x1, x2, … xn]T and weight parameters W.
regularization method by randomly cutting out square
The output of this fully connected layer,
regions in the input images. This method generally
improved the performance and robustness of a CNN model. Y = [y1, y2, … yn] is calculated by feeding dot products of
W weight parameters and X input into activation function :
Cutout is easy to implement and can be used jointly with
other techniques like data augmentation and dropout Y = a(WX) (7)
methods. One drawback of Cutout is that generating images
randomly from the given dataset is computationally In the proposed method, the fully connected layer with
expensive which slows down the training time and harmful Weight Dropout is a connected layer in which connections
to the performance. between nodes are selected randomly on the process of
training. For Weight Dropout layer, the output becomes as:
III. PROPOSED METHOD Y = a((M*W)X) (8)
We propose a new neural network model architecture
which consists of four main parts. Overall architectureof the Here, M is weight dropout mask and the result of
model is given in Figure 3. Bernoulli (p). Dropout mask (M) is chosen for each training
step independently, in other words, it gives different
connectivity for each training example. An example of fully
connected layers applied Weight Dropout mask with M = is
demonstrated in Figure 5.

Figure 3. Proposed Neural network architecture; Weight dropout


based convolutional layer composed of four main parts: batch
normalization (BN), activation function, weight dropout mask and CNN

A. Feature Extraction (ResNet-50)


(a) (b)
In case that we include more layers to deep neural Figure 5. Applying Weight Dropout into FCN; (a) FCN without
networks, the execution gets to be idle or begins to corrupt. Weight Dropout, (b) FCN after applying Weight Dropout
This scenario occurs owing to vanishing gradient issue.
When gradients are backpropagated through the deep neural The last part in Weight Dropout based conv layer is a
network and continuously multiplied, this makes weight fully connected convolution layer. After operating Weight
gradients cause vanishing gradient problem. ResNet Dropout, the FC layer output is calculated with Equation 1.
understands the vanishing gradient issue by utilizing
character alternate route or skip connections that skip one or C. Weight Dropout based CNN layer
more layers. We use pre-trained ResNet-50 neural network To reveal the reason behind in effectiveness of current
model for Transfer Learning feature extraction. To dropout methods for CNN layers, we thoroughly analyzed
implement transfer learning, we will eliminate the last the correlation among dropout and other widespread
output layers of the pre-trained ResNet-50 model and
methods related to CNN. We noticed that for the neuron and
connect it to our Weight Dropout based CNN layer. Weights
channel level dropout their ordinary usage is in conflict with
of ResNet-50 model are frozen and are not optimized during
model training step. Figure 4 shows how predicting layer of BN, which is embraced broadly in CNNs to become
the pre-trained model has been frozen. constant the beginning of two instants of its output
distribution, while the irregular passive activation of the
essential elements with drop-neuron and drop-channel
making disturbs such stability. In order not to lose this
stability, we propose a building blocks with our weight Input: Training example x, parameters Wl-1, Wsl-1, from
previous training step, learning rate
dropout mask and CNN, with flowing order (see Fig. 6)
Output: updated parameters W, Ws

Forward Propagation:
Extract features: output of ResNet-50 pre-trained model
Batch Normalization
Non-linearity function: 𝑎 = 𝑔(𝑥)
Random sample M mask tensor: 𝑀 ~ 𝐵𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖 ( 𝑝)
Convolutional layer: 𝑌 = ( 𝑀 ∗ 𝑊) 𝑥
Compute class predictions: 𝑌̂ = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 (𝑌, 𝑊𝑠 )
Backpropagation:
Compute the derivative of loss w.r.t parameters
Update softmax layer: 𝑊𝑠 = 𝑊𝑠 − 𝑎 𝐿𝑜𝑠𝑠′𝑊𝑠
(a) (b) Update Weight Dropout layer: 𝑊 = 𝑊 − 𝑎 𝐿𝑜𝑠𝑠′𝑊
Figure 6. Convolutional building blocks with weight dropout
operation IV. EXPERIMENTAL RESULTS
Previous dropout methods were initially resented
right after the convolutional layer and the BN layer, which We evaluate the proposed method accuracy for two types
directs to intense fluctuations of inputs obtained by BN of Computer Vision tasks:
layer. We investigate the collapse of the standard dropout
to the incorrect placement of the dropout procedures and ▪ Image Classification;
suggest universal convolutional blocks by placing dropout ▪ Skin Lesion Segmentation.
operations before each convolution layer (see Fig. 6b). We obtained successful generalization method on all
Incorporating drop operation before the convolution datasets.
operator precedes to lower gradient variance and earlier
convergence while training. A. MNIST
MNIST handwritten digit classification dataset contains
D. Optimization algorithm
28x28 grayscale images, every image represents one of the
Forward propagation of the model starts with choosing 10 classes, a digit in range zero and nine. Before feeding
a training example from training dataset. By running the this dataset into the neural network, we up-sample the size
training example through ResNet-50 model, we of images by maintaining aspect ratio and normalize each
extract features with frozen weight values. These features pixel value into [0; 1] range. Table 1 illustrates the
are the input to the Weight Dropout based conv layer. In performance of various models in the fully connected
BN and activation function layers, there are not learnable layers. We use an initial learning rate of 0.1 and this value
parameters. The amount produced from activation function will reduce during training with the aspect of callbacks.
is input to the Weight Dropout layer where a mask tensor is
TABLE 1. MNIST ACCURACY RATE FOR DIFFERENT DROPOUT MODELS
created from a distribution. Selecting different mask
tensor for each single training example is the main factor to Training Validation Time (per
Model
efficient training in Weight Dropout layer. After a mask accuracy accuracy (%) epoch)
tensor is chosen, this tensor is implemented to learnable No-dropout 96.02 85.26 6 ms
parameters in order to calculate the input to convolutional Standard
94.57 91.31 14.2 ms
layer. The resulting matrix of the convolutional layer, then, Dropout
merged into one column vector in Dense layer. The Max-pooling
93.44 90.54 319 ms
predicting softmax layer takes this column vector and dropout
generates class predictions before the Cross Entropy Loss Ising-dropout 92.85 86.21 9.7 ms
function computes the error between predicted and target Proposed
96.78 94.51 8.3 ms
values. The trainable parameters in the model, then are method
optimized with batch gradient descent (BGD) by passing If we increase the number of hidden layers No-dropout
backward derivatives of loss function with respect to the overfits while other models improve performance. Weight
parameters. As we mentioned earlier, the mask tensor in Dropout consistently gives higher validation accuracy
Weight Dropout should be different for each training step. comparing to other techniques.
While backpropagation stage, only the active elements of B. CIFAR-10
mask tensor in forward pass are updated. Overall algorithm
CIFAR-10 is a data set of ordinary tiny images
for optimization is given in Algorithm 1.
consisting of 50,000 training images and 10,000 for testing,
both subfolders split into 10 classes. Because CIFAR-10
Algorithm 1. BGD Training with Weight Dropout
consists of tiny images, the accuracy rate on this dataset is
not reasonably higher than the other up-to-date models. We
present the model performance on CIFAR-10 dataset on
Table 2. We used PReLU activation function as we works, we will continue improving dropout technique by
discussed in experimental settings part. choosing the dropped components not randomly but with
some information.
TABLE 2. CIFAR-10 ACCURACY RATE FOR DIFFERENT DROPOUT MODELS
ACKNOWLEDGMENT
Training Validation Time (per
Model
accuracy accuracy epoch) This study was supported by the BK21 Plus Project
No-dropout 90.42 81.62 19 ms (SW Human Resource Development Program for
Standard Supporting Smart Life) funded by the Ministry of
89.21 86.49 43 ms
Dropout Education, School of Computer Science and Engineering,
Max-pooling 91.71 87.98 708 ms Kyungpook National University, Korea
dropout (21A20131600005). This work is also supported
Ising-dropout 88.54 85.43 28 ms by National Research Foundation of Korea. Grant
Proposed 94.31 91.68 21 ms Number: 2020R1A2C1012196.
method
REFERENCE
C. Skin Lesion segmentation [1] N. Srivastava, G. Hinton, K. Alex, I. Sutskever,
By comparing the model results with no-dropout and and S. Ruslan, “Dropout: A Simple Way to
standard dropout methods on dermoscopic skin lesion Prevent Neural Networks from Overfitting,” J.
segmentation dataset, we achieved competitive accuracy Mach. Learn. Res., vol. 299, no. 3–4, pp. 1929–
performance on test data set. No-dropout method is 1958, 2014.
completely example of U-Net model, while Standard- [2] G. E. Hinton, N. Srivastava, A. Krizhevsky, I.
dropout is the proposed method in [11]. Figure 7 Sutskever, and R. R. Salakhutdinov, “Improving
demonstrates the overall accuracies on these 3 methods. neural networks by preventing co-adaptation of
feature detectors,” pp. 1–18, 2012.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton,
120

“Imagenet classification with deep convolutional


94.34 96.27
100 91.12 88.33 88.59

80
78.23
neural networks,” NIPS, pp. 1106–1114, 2012.
60 51.48 [4] M. D. Zeiler and R. Fergus, “Stochastic pooling
for regularization of deep convolutional neural
44.05
39.26
40

20
networks,” 1st Int. Conf. Learn. Represent. ICLR
2013 - Conf. Track Proc., pp. 1–9, 2013.
[5] I. A. Popova and N. G. Stepanova, “Estimation of
0
Pixel accuracy (%) Dice (%) IoU (%)

No-dropout Standard Dropout Weight Dropout inorganic phosphate in presence of


phosphocarbohydrates (Russian),” Vopr.
Figure 7. Skin lesion segmentation accuracy results Meditsinskoj Khimii, vol. 23, no. 1, pp. 135–139,
It can be clearly seen from the figure that using bilinear 1977.
interpolation jointly with PReLU discussed in [12], and [6] S. Semeniuta, A. Severyn, and E. Barth,
weight dropout is an efficient way for skin lesion “Recurrent dropout without memory loss,”
segmentation. COLING 2016 - 26th Int. Conf. Comput. Linguist.
V. CONCLUSION AND FUTURE WORK Proc. COLING 2016 Tech. Pap., pp. 1757–1766,
In this paper, we proposed enhanced version of Dropout 2016.
technique to regularize deep neural networks. By applying [7] D. Krueger et al., “Zoneout: Regularizing rNNs
transfer learning, we extracted features from the images by randomly preserving hidden activations,” 5th
before feeding them into the model. Then, Weighted Int. Conf. Learn. Represent. ICLR 2017 - Conf.
Dropout-based convolutional neural network has been Track Proc., pp. 1–11, 2017.
applied by placing dropout operations before each [8] S. Merity, N. S. Keskar, and R. Socher,
convolution layer. The connections between nodes are “Regularizing and Optimizing LSTM Language
dropped randomly with Bernoulli probability in weight Models,” arXiv, 2017.
dropout method. In addition to image classification task, we [9] H. Wu and X. Gu, “Max-pooling dropout for
also tested the proposed method on the skin lesion semantic regularization of convolutional neural networks,”
segmentation dataset, which was not utilized in previous Lect. Notes Comput. Sci. (including Subser. Lect.
dropout methods before. The proposed method can Notes Artif. Intell. Lect. Notes Bioinformatics),
generalize deep neural networks with lower gradient vol. 9489, pp. 46–54, 2015.
descent and faster convergence while training and achieved [10] T. DeVries and G. W. Taylor, “Improved
promising results on both classification and segmentation Regularization of Convolutional Neural Networks
task. with Cutout,” arXiv Prepr., vol. 1708.14552,
Deactivating basic components of neural network 2017.
models which are discussed in this work is based on [11] K. Sanjar, O. Bekhzod, J. Kim, J. Kim, A. Paul,
randomly setting the particular component into zero. The and J. Kim, “Improved U-net: Fully convolutional
main disadvantage of these techniques is the risk of losing network model for skin-lesion segmentation,”
useful information of the image. In our future research Appl. Sci., vol. 10, no. 10, 2020.

You might also like