Weight Dropout For Preventing Neural Networks From Overfitting
Forward propagation:
  Extract features: output of the pre-trained ResNet-50 model
  Batch normalization
  Non-linearity function: a = g(x)
  Randomly sample a mask tensor: M ~ Bernoulli(p)
  Convolutional layer: Y = (M ∗ W) x
  Compute class predictions: Ŷ = softmax(Y, Ws)
Backpropagation:
  Compute the derivative of the loss w.r.t. the parameters
  Update the softmax layer: Ws = Ws − α · ∂Loss/∂Ws
  Update the Weight Dropout layer: W = W − α · ∂Loss/∂W
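As a minimal sketch of the forward and update steps above (not the authors' implementation), the following NumPy code runs one training step with a fully connected layer standing in for the convolution; the shapes, keep probability p, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_dropout_step(x, y_true, W, Ws, p=0.5, lr=0.1):
    """One batch-gradient-descent step with weight dropout.

    x: input feature vector (d,), y_true: one-hot target (k,),
    W: weights of the dropped layer (h, d), Ws: softmax weights (k, h),
    p: keep probability of the Bernoulli mask, lr: learning rate.
    """
    # Forward pass: sample a fresh Bernoulli mask and apply it to the weights.
    M = rng.binomial(1, p, size=W.shape)    # M ~ Bernoulli(p)
    h = (M * W) @ x                         # Y = (M * W) x, linear stand-in for conv
    z = Ws @ h
    z = z - z.max()                         # numerical stability for softmax
    y_hat = np.exp(z) / np.exp(z).sum()     # class predictions

    # Backpropagation: cross-entropy through softmax gives dLoss/dz = y_hat - y_true.
    dz = y_hat - y_true
    dWs = np.outer(dz, h)                   # dLoss/dWs
    dW = np.outer(Ws.T @ dz, x) * M         # only mask-active weights receive gradient

    # Updates: Ws = Ws - lr * dLoss/dWs, W = W - lr * dLoss/dW.
    return W - lr * dW, Ws - lr * dWs, y_hat, M
```

Because a new mask M is drawn on every call, successive steps update different subsets of W, while weights masked out in a given step are left untouched by that step's update.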
Figure 6. Convolutional building blocks with weight dropout operation

Previous dropout methods were typically inserted right after the convolutional layer and the BN layer, which leads to intense fluctuations of the inputs produced by the BN layer. We attribute the failure of standard dropout to this incorrect placement of the dropout operations, and we suggest universal convolutional blocks that place the dropout operation before each convolution layer (see Fig. 6b). Incorporating the drop operation before the convolution operator leads to lower gradient variance and earlier convergence during training.

D. Optimization algorithm

Forward propagation of the model starts by choosing a training example from the training dataset. By running the training example through the ResNet-50 model, we extract features with frozen weight values. These features are the input to the Weight Dropout based convolutional layer. The BN and activation function layers have no learnable parameters. The output of the activation function is the input to the Weight Dropout layer, where a mask tensor is sampled from a distribution. Selecting a different mask tensor for each training example is the key to efficient training with the Weight Dropout layer. After a mask tensor is sampled, it is applied to the learnable parameters in order to compute the input to the convolutional layer. The resulting matrix of the convolutional layer is then flattened into a single column vector in the Dense layer. The predicting softmax layer takes this column vector and generates class predictions, after which the Cross-Entropy Loss function computes the error between the predicted and target values. The trainable parameters of the model are then optimized with batch gradient descent (BGD) by backpropagating the derivatives of the loss function with respect to the parameters. As mentioned earlier, the mask tensor in Weight Dropout should be different for each training step. During the backpropagation stage, only the elements that were active in the mask tensor during the forward pass are updated. The overall optimization procedure is given in Algorithm 1.

Algorithm 1. BGD Training with Weight Dropout

IV. EXPERIMENTAL RESULTS

We evaluate the accuracy of the proposed method on two types of computer vision tasks:

▪ Image Classification;
▪ Skin Lesion Segmentation.

The proposed method generalized successfully on all datasets.

A. MNIST

The MNIST handwritten digit classification dataset contains 28x28 grayscale images; every image represents one of 10 classes, a digit between zero and nine. Before feeding this dataset into the neural network, we up-sample the images while maintaining the aspect ratio and normalize each pixel value into the [0, 1] range. Table 1 shows the performance of various models in the fully connected layers. We use an initial learning rate of 0.1, which is reduced during training by means of callbacks.

TABLE 1. MNIST ACCURACY RATE FOR DIFFERENT DROPOUT MODELS

Model                 Training accuracy (%)   Validation accuracy (%)   Time (per epoch)
No-dropout            96.02                   85.26                     6 ms
Standard Dropout      94.57                   91.31                     14.2 ms
Max-pooling dropout   93.44                   90.54                     319 ms
Ising-dropout         92.85                   86.21                     9.7 ms
Proposed method       96.78                   94.51                     8.3 ms

If we increase the number of hidden layers, No-dropout overfits while the other models improve performance. Weight Dropout consistently gives higher validation accuracy compared to the other techniques.

B. CIFAR-10

CIFAR-10 is a dataset of ordinary tiny images consisting of 50,000 training images and 10,000 test images, each subset split into 10 classes. Because CIFAR-10 consists of tiny images, the accuracy rate on this dataset is not substantially higher than that of other up-to-date models. We present the model performance on the CIFAR-10 dataset in
Table 2. We used the PReLU activation function, as discussed in the experimental settings section.

TABLE 2. CIFAR-10 ACCURACY RATE FOR DIFFERENT DROPOUT MODELS

Model                 Training accuracy (%)   Validation accuracy (%)   Time (per epoch)
No-dropout            90.42                   81.62                     19 ms
Standard Dropout      89.21                   86.49                     43 ms
Max-pooling dropout   91.71                   87.98                     708 ms
Ising-dropout         88.54                   85.43                     28 ms
Proposed method       94.31                   91.68                     21 ms

C. Skin Lesion Segmentation

By comparing the model results with the no-dropout and standard dropout methods on a dermoscopic skin lesion segmentation dataset, we achieved competitive accuracy on the test set. The no-dropout method is exactly the U-Net model, while standard dropout is the method proposed in [11]. Figure 7 shows the overall pixel accuracy, Dice, and IoU scores of these three methods.

In future works, we will continue improving the dropout technique by choosing the dropped components not randomly but based on some information.

ACKNOWLEDGMENT

This study was supported by the BK21 Plus Project (SW Human Resource Development Program for Supporting Smart Life) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (21A20131600005). This work is also supported by the National Research Foundation of Korea, Grant Number 2020R1A2C1012196.

REFERENCES

[1] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[2] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” pp. 1–18, 2012.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” NIPS, pp. 1106–1114, 2012.
[4] M. D. Zeiler and R. Fergus, “Stochastic pooling for regularization of deep convolutional neural networks,” 1st Int. Conf. Learn. Represent. ICLR 2013 - Conf. Track Proc., pp. 1–9, 2013.
[5] I. A. Popova and N. G. Stepanova, “Estimation of