
Optimization of hyper-parameter for CNN model using genetic algorithm

Ji-Hyun Yoo, Hyun-il Yoon, Hyeong-Gyun Kim, Hee-Seung Yoon, Seung-Soo Han
Dept. of Information and Communication Engineering, Myongji University, Yongin, Republic of Korea
[email protected], [email protected], [email protected], [email protected], [email protected]

Abstract—Recently, CNNs have been widely used not only in image recognition but also in other fields such as the classification of vibration data. Improving the performance of CNN models has therefore become increasingly important, and one of the ways to do so is to optimize their hyper-parameters.

This paper presents a method for optimizing the hyper-parameters of a CNN model that classifies MNIST data using a genetic algorithm. Unlike previous studies, a population-based algorithm can optimize several parameters at once. In addition, types and ranges of parameters different from those used in existing genetic-algorithm studies are considered. Using this method, the hyper-parameter values that best classify MNIST were obtained and are presented.

Keywords—Genetic Algorithm, optimization, hyper-parameter

I. INTRODUCTION

Deep learning began with the perceptron concept in 1958 [1]. The perceptron can be applied to linearly separable problems, but not to problems such as XOR that are not linearly separable [2]. This limitation was resolved in 1986 by the Multi-Layer Perceptron (MLP), which added a hidden layer. With multiple layers, the number of required parameters increased, making it difficult to find the optimal weights and biases [3]; however, the backpropagation algorithm makes it possible to find them.

In 1989, LeCun et al. combined the backpropagation algorithm with the convolution layer, applied it to MNIST data [4], and published a structure called LeNet-5, the neural network underlying Convolutional Neural Networks (CNNs) [5]. With CNNs, locally invariant features can be extracted easily, overcoming the problems of earlier neural networks and yielding superior performance in text and voice recognition. As layers get deeper, gradient vanishing occurs in the backpropagation process; nevertheless, deep learning is now applicable to various fields because using the ReLU function as the activation function alleviates the vanishing-gradient problem.

However, there is still no established method for optimizing hyper-parameters, and various studies are underway. A traditional approach is "Manual Search". Because the values are determined by the experimenter's experience, the results cannot be scientifically reproduced, and the approach requires intuition and extensive experience. Recent research suggests that more sophisticated, automated methods are needed to find better parameters [7].

Two other simple methods are "Grid Search" and "Random Search", of which "Random Search" is theoretically and experimentally more efficient than "Grid Search" [6, 7]. "Bayesian Optimization" is also used for hyper-parameter optimization. However, when a large number of hyper-parameters must be optimized at the same time, as in a CNN, population-based algorithms are more appropriate [7].

Recently, reinforcement learning such as Q-learning has also been used for hyper-parameter optimization [7-9]. Most reinforcement-learning approaches, however, are advantageous mainly for optimizing structural parameters, while many other parameters, such as the learning rate and regularization, still have to be selected by the user [7].

In this paper, the population-based genetic algorithm is used to optimize the hyper-parameters of a CNN that classifies MNIST data [14]. In addition, the accuracy of the trained CNN with the optimized hyper-parameters was calculated on test data not used during training, in order to verify the performance of the optimized parameters.

II. BACKGROUND

Hyper-parameters are not parameters obtained through training, but parameters that the user must set before applying a deep learning model. Even if the deep learning model itself is built correctly, setting the wrong hyper-parameters will prevent training from working properly.
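Although the paper itself contains no code, the distinction can be illustrated with a short Python sketch. The dictionary layout and key names below are purely illustrative; they simply collect the five hyper-parameters studied in this paper and mirror Case 1 and Case 2 of Table I, which differ only in the learning rate.

# Hyper-parameters are fixed by the user before training starts;
# weights and biases, by contrast, are learned during training.
# The two settings below mirror Case 1 and Case 2 of Table I.
case1 = {"learning_rate": 0.1,    "dropout1": 0.5, "dropout2": 0.5,
         "batch_size": 250, "conv_layers": 4}
case2 = {"learning_rate": 0.0001, "dropout1": 0.5, "dropout2": 0.5,
         "batch_size": 250, "conv_layers": 4}
# With case1 the model stays near 10% test accuracy; with case2 it
# exceeds 90% (Fig. 1), even though only the learning rate differs.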


For example, as shown in Table I, two CNN models with different hyper-parameters show quite different accuracy. Even though only the learning rate differs between Case 1 and Case 2, the test accuracy remains at around 0.1 (10%) in Case 1 but above 90% in Case 2 (Fig. 1). These results come from 20 repeated experiments for each case.

TABLE I. HYPER-PARAMETER VALUES

  Parameter     | Case 1 | Case 2
  Learning rate | 0.1    | 0.0001
  Dropout 1     | 0.5    | 0.5
  Dropout 2     | 0.5    | 0.5
  Batch size    | 250    | 250
  Layer         | 4      | 4

Fig. 1. Accuracy according to learning rate difference

This shows that it is difficult to determine the optimal hyper-parameter values systematically, because hyper-parameters come in various types and ranges.

A CNN is a neural network that includes convolution layers. Its basic components are the convolution layer, the pooling layer, and the fully-connected layer [10]. The convolution layer extracts features from the input data [10]. In the convolution layer the input keeps its spatial dimensions, and the data type of the input/output is maintained in each layer, which is effective for recognizing spatial features when training on images. The pooling layer reduces the sensitivity of the input/output data in order to minimize the effect of movement and distortion [5, 10]; that is, even if an image is slightly distorted or shifted, it can still be recognized as the same image. After these two steps are repeated, the data is flattened to one dimension by the fully-connected layer, and this one-dimensional data is classified in the last layer [10].

The genetic algorithm (GA) is inspired by natural evolution and is the most representative algorithm for finding a global optimal solution [13]. A GA has genes that represent genetic information, and the genes are gathered to form chromosomes [12]. In other words, a chromosome is a collection of genetic information that represents a candidate solution to the specific problem the GA is trying to solve. Once a solution is obtained, its result is evaluated to determine how appropriate it is; this measure is called fitness. Fitness is used to decide whether to continue the GA and to select the chromosomes that form the next generation of the population [12, 13].

The fitness is used to select parent chromosomes, which are then transformed to produce offspring chromosomes. In the crossover process, parts of two parent chromosomes are exchanged according to a crossover probability, and mutation prevents the solution from falling into a local optimum. The process of calculating the fitness of the offspring chromosomes is shown in Fig. 2. The offspring chromosomes become the new parents and generate the next generation of offspring; the GA finds the optimal solution by repeating these steps.

Fig. 2. Genetic Algorithm flow chart

III. OPTIMIZATION PROCESS

A. Data selection
The data applied to CNN modeling in this experiment is the MNIST data set: black-and-white handwritten digit images with 60,000 training samples and 10,000 test samples, one of the most commonly used data sets.

B. Hyper-parameter selection
The hyper-parameters chosen in this experiment are the learning rate, the dropout rates, the batch size, and the number of convolution layers.

The learning rate determines how large each update step is. If the value is too small, training ends before the model has learned enough; if it is too large, the values diverge and the model cannot learn correctly.

Dropout is the random deletion of neuron connections, as shown in Fig. 3 [15]. If dropout is not applied, the network overfits the training data and is less accurate on the test data, so dropout should be applied to prevent overfitting and improve accuracy. However, in the convolution layers the dropout effect is smaller than in the fully-connected layers, because the number of parameters is relatively small [15]. Therefore, the dropout ratio applied to the convolution layers and the ratio applied to the fully-connected layer were set separately. In this paper, the dropout rate of the convolution layers is called Dropout 1, and the dropout rate of the fully-connected layer is called Dropout 2.

Fig. 3. (a) Standard Neural Net (b) After applying dropout [15]

The batch size is the number of samples trained at once. It is used to prevent overloading the computer and to speed up training when a large data set such as MNIST is used. In this experiment it was included to check whether the batch size affects training.

In this paper, the number of layers to be optimized is the number of convolution layers. Convolution layers extract the features of the image, and the deeper the layers, the better small features are extracted. It is therefore important to determine the appropriate number of layers for the data. Fig. 4 shows a convolution-layer structure with four layers.

Fig. 4. Four-convolution-layer structure
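The paper does not specify the deep-learning framework or the internal CNN architecture (filter counts, kernel sizes, pooling, optimizer). As a hedged illustration only, the following Python/Keras-style sketch shows how a model parameterized by the hyper-parameters above might be built; the Adam optimizer, 3x3 kernels, and 128-unit dense layer are assumptions, not details taken from the paper.

import tensorflow as tf

def build_cnn(learning_rate, dropout1, dropout2, num_conv_layers):
    # Build a CNN for 28x28 grayscale MNIST images.  The filter counts,
    # 3x3 kernels, pooling, Adam optimizer and 128-unit dense layer are
    # illustrative assumptions; the paper only fixes the hyper-parameters
    # passed in as arguments.
    inputs = tf.keras.Input(shape=(28, 28, 1))
    x = inputs
    for i in range(num_conv_layers):
        x = tf.keras.layers.Conv2D(32 * (i + 1), 3, padding="same",
                                   activation="relu")(x)
        x = tf.keras.layers.MaxPooling2D(pool_size=2, padding="same")(x)
        x = tf.keras.layers.Dropout(dropout1)(x)    # Dropout 1 (conv layers)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout2)(x)        # Dropout 2 (FC layer)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

A fitness evaluation would then train such a model on the 60,000 training images with the chosen batch size and report its accuracy on the 10,000 test images, which is the fitness defined later in Eq. (1).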

C. Create initial population
One chromosome consists of five genes: the learning rate, the dropout rate in the convolution layers, the dropout rate in the fully-connected layer, the batch size, and the number of convolution layers.

To perform the GA, each gene was encoded as a binary code with a certain number of bits. The learning rate was expressed in 8 bits over the range 0.0001 to 0.1, which means there are 2^8 (= 256) possible values between 0.0001 and 0.1. Dropout 1 is the dropout ratio in the convolution layers and Dropout 2 is the dropout ratio in the fully-connected layer; each was represented by 4 bits over the range 0 to 0.5, giving 2^4 (= 16) possible values. For the batch size, 8 values {50, 100, 200, 250, 400, 500, 1000, 1250} were defined and expressed with 3 bits. Similarly, the number of convolution layers was set to {1, 2, 3, 4} and expressed with 2 bits. These five genes were combined to form a 21-bit chromosome. The structure of the chromosome is shown in Fig. 5 and Table II.

Fig. 5. Chromosome structure

TABLE II. CHROMOSOME STRUCTURE

  Parameter     | Range                                   | Number of bits
  Learning rate | 0.0001 ~ 0.1                            | 8
  Dropout 1     | 0 ~ 0.5                                 | 4
  Dropout 2     | 0 ~ 0.5                                 | 4
  Batch size    | 50, 100, 200, 250, 400, 500, 1000, 1250 | 3
  Layer         | 1, 2, 3, 4                              | 2

Fifty chromosomes were generated with randomly selected values; this set is called the initial population.

D. Fitness evaluation
To evaluate the generated chromosomes, a fitness value must be calculated. In this experiment, fitness is defined as the accuracy on the test data, i.e. the ratio of correctly classified samples among the 10,000 test samples:

  Fitness = (Number of correctly classified data) / (Total number of test data, 10,000)    (1)

E. Parent chromosome generation
Based on the calculated fitness values, parent chromosomes are selected to create the next generation. The roulette-wheel selection method is used: a chromosome with high fitness has a high probability of being selected, but even a chromosome with low fitness still has some probability of being selected.

F. Offspring chromosome generation
New offspring chromosomes are generated by crossing over two parent chromosomes selected by the roulette-wheel method. The centre of each gene is selected as a crossover point, as shown in Fig. 6, and the crossover probability was set to 0.6.

Fig. 6. Chromosome crossover

The mutation process is applied to the newly generated offspring with a probability of 0.05 and flips the value of a randomly selected bit. Fig. 7 shows mutations changing the fourth bit of the learning rate, the first bits of Dropout 1 and Dropout 2, the second bit of the batch size, and the last bit of the layer gene.

Fig. 7. Chromosome mutation

TABLE III. GA PARAMETER VALUES

  Parameter      | Value
  Population     | 50
  Crossover rate | 0.6
  Mutation rate  | 0.05

In this paper, the number of generations was set to 30.
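The paper describes the encoding and the genetic operators but gives no implementation. The sketch below is one possible Python rendering under the settings above: the 21-bit chromosome is decoded into the five hyper-parameters (linear scaling over the stated ranges is assumed, which is consistent with the values later reported in Table V), parents are drawn by roulette-wheel selection, crossover swaps the second half of every gene (one reading of Fig. 6) with probability 0.6, and mutation flips one randomly chosen bit with probability 0.05. The fitness function is left abstract; in the paper it is the test accuracy of the CNN trained with the decoded values.

import random

# Gene layout (name, number of bits), 21 bits in total, as in Table II.
GENES = [("learning_rate", 8), ("dropout1", 4), ("dropout2", 4),
         ("batch_size", 3), ("layers", 2)]
BATCH_SIZES = [50, 100, 200, 250, 400, 500, 1000, 1250]

def to_int(bits):
    # Interpret a list of 0/1 values as an unsigned binary number.
    return int("".join(str(b) for b in bits), 2)

def split_genes(chrom):
    # Split a 21-bit chromosome into its five genes.
    parts, start = [], 0
    for _, bits in GENES:
        parts.append(chrom[start:start + bits])
        start += bits
    return parts

def decode(chrom):
    # Map the binary genes to hyper-parameter values (linear scaling assumed).
    lr_b, d1_b, d2_b, bs_b, ly_b = split_genes(chrom)
    lr = 0.0001 + to_int(lr_b) * (0.1 - 0.0001) / (2**8 - 1)
    d1 = to_int(d1_b) * 0.5 / (2**4 - 1)
    d2 = to_int(d2_b) * 0.5 / (2**4 - 1)
    bs = BATCH_SIZES[to_int(bs_b)]
    layers = to_int(ly_b) + 1                      # {1, 2, 3, 4}
    return lr, d1, d2, bs, layers

def roulette_select(population, fitnesses):
    # Selection probability proportional to fitness (Section III.E).
    pick, acc = random.uniform(0, sum(fitnesses)), 0.0
    for chrom, fit in zip(population, fitnesses):
        acc += fit
        if acc >= pick:
            return chrom
    return population[-1]

def crossover(p1, p2, rate=0.6):
    # Swap the second half of every gene between the parents (one reading
    # of Fig. 6, with the crossover point at the centre of each gene).
    c1, c2 = p1[:], p2[:]
    if random.random() < rate:
        start = 0
        for _, bits in GENES:
            mid = start + bits // 2
            c1[mid:start + bits] = p2[mid:start + bits]
            c2[mid:start + bits] = p1[mid:start + bits]
            start += bits
    return c1, c2

def mutate(chrom, rate=0.05):
    # With probability 0.05, flip one randomly chosen bit (Fig. 7).
    if random.random() < rate:
        i = random.randrange(len(chrom))
        chrom[i] = 1 - chrom[i]
    return chrom

def next_generation(population, fitness_fn):
    # One GA iteration following the flow of Fig. 2.  fitness_fn(chrom)
    # stands in for training a CNN with decode(chrom) and returning its
    # accuracy on the 10,000 test images, Eq. (1).
    fits = [fitness_fn(c) for c in population]
    children = []
    while len(children) < len(population):
        p1 = roulette_select(population, fits)
        p2 = roulette_select(population, fits)
        c1, c2 = crossover(p1, p2)
        children += [mutate(c1), mutate(c2)]
    return children[:len(population)]

# Initial population: 50 random 21-bit chromosomes (Section III.C), e.g.
# population = [[random.randint(0, 1) for _ in range(21)] for _ in range(50)]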

IV. RESULT

Over 30 generations, the parameters with the highest fitness were recorded for each generation. In addition, a total of four experiments were conducted to verify the performance of this parameter-optimization method.

Fig. 8. Fitness by generation (4 experiments)

Fig. 8 shows how the fitness on the test data changes over the 30 generations of the GA. As the generations progress, the fitness increases. The final accuracy of the four experiments is between 0.994 and 0.996.

TABLE IV. FINAL FITNESS VALUES

  Experiment | Fitness
  1          | 0.9946
  2          | 0.9947
  3          | 0.9953
  4          | 0.9947

The hyper-parameter values that produced the fitness in Table IV are shown in Figs. 9-13. As the generations progress, each hyper-parameter value changes and the fitness increases.

Fig. 9. Learning rate by generation

As shown in Fig. 9, the learning rate was allowed to range from 0.0001 to 0.1, but the optimal learning rate remains very low. After about 15 generations, the learning rate stops changing and keeps a constant value.

Fig. 10. Dropout 1 by generation

Fig. 10 shows the change in the dropout ratio applied to the convolution layers. The optimal dropout is around 0.2.

Fig. 11. Dropout 2 by generation

Fig. 11 shows the dropout ratio applied to the fully-connected layer. The optimal dropout falls in a different range in each trial.

Fig. 12. Batch size by generation

Fig. 12 shows the change in the batch size. The optimal batch size differs in each experiment but stays within a relatively small range of values; training more data at one time did not increase the accuracy on the test data.

Fig. 13. Layer by generation

Fig. 13 shows the variation in the number of convolution layers. The optimal fitness is obtained when the network has 3 or 4 convolution layers.

TABLE V. OPTIMIZED HYPER-PARAMETER VALUES

  Experiment | Learning rate | Dropout 1 | Dropout 2 | Batch size | Layer
  1          | 0.0009        | 0.2667    | 0.2       | 50         | 4
  2          | 0.0013        | 0.2       | 0.43333   | 250        | 3
  3          | 0.0005        | 0.2333    | 0.3333    | 50         | 4
  4          | 0.0005        | 0.2333    | 0.1667    | 100        | 4

Table V shows the optimized hyper-parameters found by the GA; the values in the table are rounded to four decimal places. Modeling a CNN with the values given in the table yields over 99.4% accuracy.

Since the value of Dropout 2 was not stable, the experiments were extended by 20 generations, to a total of 50 generations. As a result, the dropout values of Case 2 and Case 3 did not change, but the Dropout 2 values of Case 1 and Case 4 did.

Fig. 14. Case 1 dropout

Fig. 15. Case 4 dropout

As shown in Fig. 14 and Fig. 15, in the added generations the value of Dropout 1 did not change noticeably, but the value of Dropout 2 did.

TABLE VI. OPTIMIZED DROPOUT RATES (50 GENERATIONS)

  Experiment | Dropout 1 | Dropout 2
  1          | 0.2667    | 0.46667
  2          | 0.2       | 0.43333
  3          | 0.2333    | 0.3333
  4          | 0.2333    | 0.43333

As shown in Table VI, after 50 generations the Dropout 2 values fall in a similar range. The accuracy increased by less than 0.01%, which is negligible compared to the additional time spent. Thus, in this experiment, 30 generations were adequate to achieve high accuracy on the MNIST data.

V. CONCLUSION

In this paper, optimal hyper-parameter values of a CNN model for classifying MNIST data were obtained using a genetic algorithm. The optimal learning rate is approximately 0.0004 to 0.0012, smaller than the learning rate typically used for a backpropagation neural network. Dropout 1 converged to similar values near 0.2, while Dropout 2 ranged from 0.16 to 0.43.

The dropout values are not large because simple black-and-white data is used. In addition, training fewer than 250 samples at a time gives high accuracy, whereas training a large amount of data at once did not. When the number of convolution layers is one or two, the structure is too simple to train well.

This study was conducted on an Intel Core i5-8500 CPU and an NVIDIA GeForce GTX 1060 6GB GPU. It took about 4110 seconds to evaluate one generation's fitness and find the best chromosome. An approximation of the global optimal solution was obtained, but the process took a long time because a large amount of data was used in the experiment. To reduce the optimization time and achieve better results, research is needed on hybrid algorithms that use a GA to find the region of the global optimum and then a local search algorithm to find the precise optimal point.

ACKNOWLEDGEMENT

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2017R1D1A1B03029991).

REFERENCES

[1] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychol. Rev., vol. 65, no. 6, pp. 386–408, 1958.
[2] M. L. Minsky and S. Papert, Perceptrons (An Introduction to Computational Geometry): Epilogue, 1988.
[3] J. L. McClelland, D. E. Rumelhart, and the PDP Research Group, "Parallel distributed processing," Explor. Microstruct. Cogn., pp. 216–271, 1986.
[4] Y. LeCun, B. Boser, et al., "Handwritten digit recognition with a back-propagation network," Advances in Neural Information Processing Systems, 1990.
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[6] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization," J. Mach. Learn. Res., vol. 13, pp. 281–305, 2012.
[7] T. Hinz, N. Navarro-Guerrero, S. Magg, and S. Wermter, "Speeding up the hyperparameter optimization of deep convolutional neural networks," Int. J. Comput. Intell. Appl., vol. 17, no. 02, p. 1850008, 2018.
[8] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[9] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing neural network architectures using reinforcement learning," arXiv preprint arXiv:1611.02167, 2016.
[10] J. Gu et al., "Recent advances in convolutional neural networks," Pattern Recognit., vol. 77, pp. 354–377, 2018.
[11] H. Wang and B. Raj, "On the origin of deep learning," arXiv preprint arXiv:1702.07800, 2017.
[12] J. McCall, "Genetic algorithms for modelling and optimisation," J. Comput. Appl. Math., vol. 184, no. 1, pp. 205–222, 2005.
[13] J. H. Holland, "Genetic algorithms," Sci. Am., vol. 267, no. 1, pp. 66–73, 1992.
[14] E. Kussul and T. Baidyk, "Improved method of handwritten digit recognition tested on MNIST database," Image and Vision Computing, vol. 22, no. 12, pp. 971–981, 2004.
[15] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[16] Y. Gal and Z. Ghahramani, "Bayesian convolutional neural networks with Bernoulli approximate variational inference," arXiv preprint arXiv:1506.02158, 2015.

