Optimization of Hyper-Parameter For CNN Model Using Genetic Algorithm
Abstract—Recently, CNNs have been widely used not only in the field of image recognition but also in various other fields, such as classifying vibration data. Therefore, increasing the performance of CNN models is becoming more important. One of the various ways to improve the performance of CNN models is to optimize their hyper-parameters.

This paper presents a method for optimizing the hyper-parameters of a CNN model that classifies MNIST data using a genetic algorithm. Unlike the methods used in previous studies, population-based algorithms can optimize several parameters at once. In addition, different types and ranges of parameters from those in existing genetic algorithms are used. Using this method, the hyper-parameter values that best classify MNIST have been obtained and are presented.

Keywords—Genetic Algorithm, optimization, hyper-parameter

I. INTRODUCTION

Deep learning began with the Perceptron concept in 1958 [1]. The Perceptron can be applied to linearly separable problems, but not to problems such as XOR that are not linearly separable [2]. This limit was overcome in 1986 by the Multi-Layer Perceptron (MLP), which added a hidden layer. With multiple layers, the number of required parameters increased, making it difficult to find the optimal weight and bias values [3], but the optimal weights and biases can be found using the backpropagation algorithm.

In 1989, LeCun et al. combined the backpropagation algorithm with the convolution layer and applied it to MNIST data [4], and later published a structure called LeNet-5, which is the neural network underlying Convolutional Neural Networks (CNNs) [5]. Using a CNN, locally invariant features can be extracted easily, overcoming the problems of existing neural networks and yielding superior performance in the fields of text and voice recognition. The deeper the network, the more the gradient vanishes during backpropagation; however, deep learning is now applicable to various fields because the ReLU activation function solves the vanishing-gradient problem.

However, there is still no established method for optimizing hyper-parameters, and various studies are underway. Traditional optimization methods include "Manual Search". Its outcome depends on the experimenter's experience, so it cannot be scientifically reproduced and requires intuition and a great deal of experience. Recent research suggests that more complex and automated methods are needed to find better parameters [7].

Two other simple methods are "Grid Search" and "Random Search", of which "Random Search" is theoretically and experimentally more efficient than "Grid Search" [6], [7].

"Bayesian Optimization" is also used to optimize hyper-parameters. However, when a large number of hyper-parameters must be optimized at the same time, as in a CNN, population-based algorithms are more appropriate [7].

Recently, reinforcement learning methods such as Q-Learning have also been used for hyper-parameter optimization [7]-[9]. Most reinforcement learning approaches, however, are advantageous only for optimizing structural parameters; many other parameters, such as the learning rate and regularization, still have to be selected by the user [7].

In this paper, a population-based genetic algorithm is used to optimize the hyper-parameters of a CNN that classifies MNIST data [14]. In addition, the accuracy of the trained CNN with the optimized hyper-parameters was calculated on test data that was not used during training, to verify the performance of the optimized parameters.

II. BACKGROUND

Hyper-parameters are not parameters obtained through training, but parameters that the user must set before applying a deep learning model. Even if the deep learning model itself is built correctly, setting the wrong hyper-parameters will prevent training from working properly.

For example, as shown in Table I, two CNN models with different parameters show quite different performance in terms of accuracy. Even though only the learning rate differs between Case 1 and Case 2, the test accuracy remains at around 0.1 (10%) in Case 1 and above 90% in Case 2 (Fig. 1). These are the results of 20 repeated experiments for each case.

This shows that it is difficult to determine optimal hyper-parameter values systematically, because hyper-parameters come in various types and ranges.
TABLE I. HYPER-PARAMETER VALUE
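As a concrete illustration of where such hyper-parameters enter a model, the following is a minimal TensorFlow/Keras sketch of an MNIST classifier whose learning rate, two dropout ratios, and number of convolution layers are supplied by the caller rather than fixed in the code. The specific architecture choices (filter counts, the 128-unit dense layer, the Adam optimizer, the 28x28 input shape) are illustrative assumptions and not the exact model used in this paper.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_cnn(learning_rate, dropout_conv, dropout_fc, n_conv_layers):
        # Small MNIST classifier; every argument is a hyper-parameter the user must choose.
        model = tf.keras.Sequential([tf.keras.Input(shape=(28, 28, 1))])
        for i in range(n_conv_layers):
            model.add(layers.Conv2D(32 * (i + 1), 3, padding="same", activation="relu"))
            model.add(layers.MaxPooling2D())
            model.add(layers.Dropout(dropout_conv))    # dropout 1: convolution layers
        model.add(layers.Flatten())
        model.add(layers.Dense(128, activation="relu"))
        model.add(layers.Dropout(dropout_fc))          # dropout 2: fully-connected layer
        model.add(layers.Dense(10, activation="softmax"))
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

Changing only the learning rate passed to build_cnn is enough to reproduce the kind of gap between Case 1 and Case 2 described above.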
extraction of small features. Therefore, it is important to determine the appropriate number of layers for the data. Fig. 4 shows a convolution layer structure with four layers [4].

Fig. 4. Convolution layer structure with four layers

C. Create initial population

One chromosome consists of five genes: the learning rate, the dropout ratio in the convolution layers, the dropout ratio in the fully-connected layer, the batch size, and the number of convolution layers.

To perform the GA, each gene was encoded into binary code with a certain number of bits. The learning rate was expressed in 8 bits, ranging from 0.0001 to 0.1; this means there are 2^8 (= 256) representable values between 0.0001 and 0.1. Dropout 1 is the dropout ratio in the convolution layers and Dropout 2 is the dropout ratio in the fully-connected layer; each was represented by 4 bits over the range 0 to 0.5, giving 2^4 (= 16) values between 0 and 0.5. For the batch size, the 8 values {50, 100, 200, 250, 400, 500, 1000, 1250} were defined and expressed with 3 bits. Similarly, the number of convolution layers was chosen from {1, 2, 3, 4} and expressed with 2 bits. These five genes were combined to form a 21-bit chromosome. The structure of the chromosome is shown in Fig. 5 and Table II.
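To make this encoding concrete, the sketch below creates random 21-bit chromosomes and decodes them back into the five hyper-parameter values. The bit widths, value ranges, batch-size set, and layer-count set are those stated above; the linear spacing of values within each range and the population size of 10 are assumptions, since the paper only states the ranges and the number of representable values.

    import random

    # Gene widths in bits: learning rate, dropout 1, dropout 2, batch size, conv layers
    GENE_BITS = [8, 4, 4, 3, 2]            # 8 + 4 + 4 + 3 + 2 = 21-bit chromosome
    BATCH_SIZES = [50, 100, 200, 250, 400, 500, 1000, 1250]
    NUM_LAYERS = [1, 2, 3, 4]

    def random_chromosome():
        # One chromosome is a flat list of 21 bits.
        return [random.randint(0, 1) for _ in range(sum(GENE_BITS))]

    def split_genes(chrom):
        # Cut the flat bit string into the five genes.
        genes, start = [], 0
        for width in GENE_BITS:
            genes.append(chrom[start:start + width])
            start += width
        return genes

    def bits_to_int(bits):
        return int("".join(map(str, bits)), 2)

    def decode(chrom):
        # Map the 21-bit chromosome to concrete hyper-parameter values.
        lr_bits, d1_bits, d2_bits, bs_bits, nl_bits = split_genes(chrom)
        # Linear spacing within each range is an assumption, not stated in the paper.
        lr = 0.0001 + bits_to_int(lr_bits) * (0.1 - 0.0001) / (2**8 - 1)
        dropout1 = bits_to_int(d1_bits) * 0.5 / (2**4 - 1)
        dropout2 = bits_to_int(d2_bits) * 0.5 / (2**4 - 1)
        batch_size = BATCH_SIZES[bits_to_int(bs_bits)]
        n_conv_layers = NUM_LAYERS[bits_to_int(nl_bits)]
        return lr, dropout1, dropout2, batch_size, n_conv_layers

    # Initial population; the population size of 10 is illustrative only.
    population = [random_chromosome() for _ in range(10)]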
E. Parent chromosome generation

Based on the calculated fitness values, parent chromosomes are selected to create the next generation. The roulette wheel selection method is used: the higher a chromosome's fitness, the higher its probability of being selected, but even chromosomes with low fitness still have some probability of being selected.
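A minimal roulette-wheel selection sketch, assuming fitness values are non-negative (here they are classification accuracies): each chromosome is drawn with probability proportional to its fitness, so low-fitness chromosomes can still be chosen occasionally.

    import random

    def roulette_select(population, fitnesses):
        # Selection probability of each chromosome is proportional to its fitness.
        pick = random.uniform(0, sum(fitnesses))
        running = 0.0
        for chrom, fit in zip(population, fitnesses):
            running += fit
            if running >= pick:
                return chrom
        return population[-1]   # guard against floating-point round-off

The standard-library call random.choices(population, weights=fitnesses, k=2) performs the same fitness-proportional draw (with replacement) and can be used to pick both parents at once.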
F. Offspring chromosome generation

New offspring chromosomes are generated by crossing over two parent chromosomes selected with the roulette wheel method.

Fig. 6. Chromosome crossover

The center of each gene is selected as a crossover point, as shown in Fig. 6. The probability of crossover was set to 0.6. The mutation process is then applied to the newly generated offspring with a probability of 0.05; it reverses the value of a randomly selected bit.
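A sketch of these crossover and mutation operators, reusing GENE_BITS from the decoding sketch above. Treating "the center of each gene" as a multi-point crossover with one point per gene is our reading of Fig. 6; the 0.6 crossover rate and 0.05 mutation rate come from the text.

    import random

    P_CROSSOVER = 0.6    # crossover probability (from the paper)
    P_MUTATION = 0.05    # mutation probability per offspring (from the paper)

    def gene_centers():
        # Crossover points roughly at the middle of each gene.
        points, start = [], 0
        for width in GENE_BITS:
            points.append(start + width // 2)
            start += width
        return points                    # [4, 10, 14, 17, 20] for widths 8, 4, 4, 3, 2

    def crossover(parent_a, parent_b):
        # Multi-point crossover: alternate the source parent at every crossover point.
        child_a, child_b = parent_a[:], parent_b[:]
        if random.random() < P_CROSSOVER:
            points, swap = set(gene_centers()), False
            for i in range(len(parent_a)):
                if i in points:
                    swap = not swap
                if swap:
                    child_a[i], child_b[i] = parent_b[i], parent_a[i]
        return child_a, child_b

    def mutate(chrom):
        # With probability 0.05, flip the value of one randomly selected bit.
        if random.random() < P_MUTATION:
            i = random.randrange(len(chrom))
            chrom[i] = 1 - chrom[i]
        return chrom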
IV. RESULT

Over 30 generations, the parameters with the highest fitness were obtained in each generation. In addition, a total of four experiments were conducted to verify the performance of this parameter optimization method.
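For reference, the sketch below ties the earlier sketches (decode, build_cnn, roulette_select, crossover, mutate) into the generational loop used in these experiments, assuming the fitness of a chromosome is the test accuracy of the CNN trained with its decoded hyper-parameters. The population size of 10 and the 3 training epochs are illustrative assumptions; only the 30 generations, the 0.6 crossover rate, and the 0.05 mutation rate come from the text.

    def evaluate_fitness(chrom, x_train, y_train, x_test, y_test):
        # Fitness = test accuracy of the CNN trained with the decoded hyper-parameters.
        lr, dropout1, dropout2, batch_size, n_layers = decode(chrom)
        model = build_cnn(lr, dropout1, dropout2, n_layers)
        model.fit(x_train, y_train, batch_size=batch_size, epochs=3, verbose=0)
        return model.evaluate(x_test, y_test, verbose=0)[1]

    def run_ga(x_train, y_train, x_test, y_test, pop_size=10, generations=30):
        population = [random_chromosome() for _ in range(pop_size)]
        best_fit, best_chrom = 0.0, None
        for _ in range(generations):
            fits = [evaluate_fitness(c, x_train, y_train, x_test, y_test)
                    for c in population]
            gen_best = max(range(pop_size), key=lambda i: fits[i])
            if fits[gen_best] > best_fit:
                best_fit, best_chrom = fits[gen_best], population[gen_best]
            children = []
            while len(children) < pop_size:
                parent_a = roulette_select(population, fits)
                parent_b = roulette_select(population, fits)
                child_a, child_b = crossover(parent_a, parent_b)
                children.extend([mutate(child_a), mutate(child_b)])
            population = children[:pop_size]
        return best_fit, best_chrom

    # Usage (inputs must be shaped (N, 28, 28, 1) and scaled to [0, 1]):
    # (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    # x_train = x_train[..., None] / 255.0; x_test = x_test[..., None] / 255.0
    # best_fit, best_chrom = run_ga(x_train, y_train, x_test, y_test)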
TABLE IV.

Experiment   Fitness
1            0.9946
2            0.9947
3            0.9953
4            0.9947
The values of each hyper-parameter used to obtain the fitness in Table IV are shown in Figs. 9-13. As the generations progress, each hyper-parameter value changes and the fitness increases.

As shown in Fig. 9, the range of the learning rate was set between 0.0001 and 0.1, but the optimal learning rate remains very low. After about 15 generations, the learning rate no longer changes and maintains a constant value.

Fig. 11. Dropout 2 by generation

Fig. 11 shows the dropout ratio applied to the fully-connected layer. The range of the optimal dropout differs for each trial.

Fig. 12 shows the change in the batch size. The optimal batch sizes differ between experiments but fall within a relatively small range of values. Training more data at one time did not increase the accuracy on the test data.
Fig. 13. Layer by generation

Fig. 14. Case 1 dropout
The optimal dropout value is not large because simple black-and-white data is used. In addition, training with fewer than 250 samples at a time gives high accuracy; training on a large amount of data at once did not. When the number of layers is one or two, the structure is too simple to train well.

This study was conducted using an Intel Core i5-8500 CPU and an NVIDIA GeForce GTX 1060 6GB GPU. It took about 4110 seconds to evaluate a generation's fitness and find the best chromosome. An approximation of the global optimal solution was obtained, but it took a long time because a large amount of data was used in the experiment. To reduce the time needed for optimization and achieve better results, research is needed on hybrid algorithms that use a GA to find the global optimal region and then a local search algorithm to find the precise optimal point.

ACKNOWLEDGEMENT

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2017R1D1A1B03029991).

REFERENCES

[1] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychol. Rev., vol. 65, no. 6, pp. 386-408, 1958.
[2] M. L. Minsky and S. Papert, Perceptrons (An Introduction to Computational Geometry): Epilogue, 1988.
[3] J. L. McClelland, D. E. Rumelhart, and the PDP Research Group, "Parallel distributed processing," Explorations in the Microstructure of Cognition, pp. 216-271, 1986.
[4] Y. LeCun, B. Boser, et al., "Handwritten digit recognition with a back-propagation network," Advances in Neural Information Processing Systems, 1990.
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2323, 1998.
[6] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," J. Mach. Learn. Res., vol. 13, pp. 281-305, 2012.
[7] T. Hinz, N. Navarro-Guerrero, S. Magg, and S. Wermter, "Speeding up the hyperparameter optimization of deep convolutional neural networks," Int. J. Comput. Intell. Appl., vol. 17, no. 2, p. 1850008, 2018.
[8] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[9] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," arXiv preprint arXiv:1611.02167, 2016.
[10] J. Gu et al., "Recent advances in convolutional neural networks," Pattern Recognit., vol. 77, pp. 354-377, 2018.
[11] H. Wang and B. Raj, "On the origin of deep learning," arXiv preprint arXiv:1702.07800, 2017.
[12] J. McCall, "Genetic algorithms for modelling and optimisation," J. Comput. Appl. Math., vol. 184, no. 1, pp. 205-222, 2005.
[13] J. H. Holland, "Genetic algorithms," Sci. Am., vol. 267, no. 1, pp. 66-73, 1992.
[14] E. Kussul and T. Baidyk, "Improved method of handwritten digit recognition tested on MNIST database," Image and Vision Computing, vol. 22, no. 12, pp. 971-981, 2004.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929-1958, 2014.
[16] Y. Gal and Z. Ghahramani, "Bayesian convolutional neural networks with Bernoulli approximate variational inference," arXiv preprint arXiv:1506.02158, 2015.