Parameters Optimization of Deep Learning Models Using Particle Swarm Optimization
Abstract— Deep learning has been successfully applied in several fields such as machine translation, manufacturing, and pattern recognition. However, successful application of deep learning depends upon appropriately setting its parameters to achieve high-quality results. The number of hidden layers and the number of neurons in each layer of a deep machine learning network are two key parameters that have a major influence on the performance of the algorithm. Manual parameter setting and grid search approaches somewhat ease the user's task of setting these important parameters. Nonetheless, both techniques can be very time-consuming. In this paper, we show that the particle swarm optimization (PSO) technique holds great potential for optimizing parameter settings, and thus saves valuable computational resources during the tuning process of deep learning models.

Specifically, we use a dataset collected from a Wi-Fi campus network to train deep learning models to predict the number of occupants and their locations. Our preliminary experiments indicate that, compared to the grid search method, PSO provides an efficient approach for tuning the optimal number of hidden layers and the number of neurons in each layer of the deep learning algorithm. Our experiments illustrate that the exploration of the landscape of configurations required to find the optimal parameters is decreased by 77%–85%. In fact, PSO yields even better accuracy results.

Keywords - smart building services, deep machine learning, parameter optimization, particle swarm optimization.

I. INTRODUCTION

Deep learning is an aspect of artificial neural networks that aims to imitate the complex learning methods human beings use to gain certain types of knowledge. We can think of deep learning as a technique that employs neural networks with multiple hidden layers of abstraction, in contrast to traditional shallow neural networks that employ one hidden layer [1].

Deep learning models are utilized in a wide variety of applications, including the popular iOS Siri and Google voice systems. Recently, deep neural networks have been used to win numerous contests in pattern recognition and machine learning. Leading examples include Microsoft research on a deep learning system that demonstrated the ability to classify 22,000 categories of pictures with 29.8 percent accuracy, as well as real-time speech translation between Mandarin Chinese and English [2]. Deep learning is made available by open source projects as well; commonly used deep learning platforms currently include the H2O platform, Deeplearning4j (DL4j), Theano, Torch, TensorFlow, and Caffe.

One of the challenges in a successful implementation of deep machine learning is setting the values of its many parameters, particularly the topology of its network. Let L be the number of hidden layers, Ni be the number of neurons in layer i, and N = {N1, N2, ..., NL}. The parameters L and N are very important and have a major influence on the performance of deep machine learning. Manually tuning these parameters (essentially through trial and error) to find high-quality settings is a time-consuming process [3]. Besides, the solutions obtained by the manual process are usually not equally distributed in the objective space.

To address this challenge, grid search is a common approach for setting the parameter values of deep learning models. Grid search is more efficient than manual tuning and saves time in setting L and N: a list of discrete values of L and N is prepared in advance, where each entry specifies a number of hidden layers and its corresponding number of neurons. The deep learning algorithm trains multiple different models using all of the list's entries, and the final parameter selection is based on the models' accuracy. However, grid search is still a computationally demanding process, as the number of possible combinations grows exponentially, especially when the number of parameters increases and the interval between discrete values is reduced. In addition, if the list of parameters is poorly chosen, the network may learn slowly, or perhaps not at all [4].

This paper proposes another parameter selection method for deep learning models using PSO. PSO is a popular population-based heuristic algorithm that simulates the social behavior of individuals, such as birds flocking, a school of fish swimming, or a colony of ants moving toward a potential position, to achieve particular objectives in a multidimensional space [5]. PSO is found to have an extensive global optimization capability owing to its simple concept, easy implementation, scalability, robustness, and fast convergence. It employs only simple mathematical operators and is computationally inexpensive in terms of both memory requirements and speed [6].
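To make the combinatorial cost of grid search discussed above concrete, the following minimal sketch enumerates a grid over the number of hidden layers and a per-layer neuron count. It is written in Python purely for illustration (our experiments, described in Section IV, were run in R), and the candidate ranges are hypothetical:

    from itertools import product

    # Hypothetical discrete candidate values for the two topology parameters.
    layer_candidates = range(1, 11)          # L: number of hidden layers
    neuron_candidates = range(10, 201, 10)   # neurons per hidden layer

    # Grid search must train and evaluate one full model per combination.
    grid = list(product(layer_candidates, neuron_candidates))
    print(len(grid))  # 10 * 20 = 200 training runs

    # If each of the L layers may take its own neuron count, the grid
    # grows exponentially in L: 20**L configurations for a fixed L.
    full_grid_size = sum(len(neuron_candidates) ** L for L in layer_candidates)
    print(full_grid_size)  # far too many models to train exhaustively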
III. PSO-BASED PARAMETER OPTIMIZATION MODEL

The PSO algorithm is an iterative optimization method originally proposed in 1995 by Kennedy and Eberhart [5]. PSO was developed to mimic bird and fish swarms: animals that move as a swarm can reach their aims more easily. The basic form of the PSO algorithm is composed of a group of particles that repeatedly communicate with each other; the population is called a swarm. Each particle represents a possible solution to the problem (i.e., the position of one particle represents the values of the attributes of a solution) [20]. Each particle has a position, a velocity, and a fitness value that is determined by an optimization function. The velocity determines the next direction and distance to move, while the fitness value is an assessment of the quality of the particle. The position of each particle in the swarm is tweaked to move closer to the particle that has the best position. Each particle updates its velocity and position by tracking two extremes in each iteration. One is called the personal best (pbest), which is the best solution that the particle has been able to obtain individually so far. The other is called the global best (gbest), which is the best solution that all particles have been able to find collectively so far.

PSO is mathematically modeled as follows [5]:

$v_i^{t+1} = w \cdot v_i^t + c_1 \cdot rand \cdot (pbest_i - x_i^t) + c_2 \cdot rand \cdot (gbest - x_i^t)$   (2)

At each step t, the position $x_i^t$ of particle i is updated based on the particle's velocity $v_i^t$:

$x_i^{t+1} = x_i^t + v_i^{t+1}$   (3)

In Equations (2) and (3) above, $v_i^t$ and $x_i^t$ are the velocity and position components of the i-th particle at step t. $c_1$ and $c_2$ are the acceleration coefficients and represent the weights of approaching the pbest and gbest of a particle. w is the inertia coefficient, as it helps the particles move by inertia toward better positions. rand is a uniform random value between 0 and 1.
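As a concrete illustration of Equations (2) and (3), the following sketch updates a single particle over a continuous search space. Python is used here for readability (our implementation is in R), and the inertia and acceleration values are placeholders, not prescribed settings:

    import random

    def pso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0):
        # One velocity/position update for a single particle, per
        # Equations (2) and (3). All arguments are equal-length lists,
        # one entry per dimension of the search space.
        new_x, new_v = [], []
        for d in range(len(x)):
            vel = (w * v[d]
                   + c1 * random.random() * (pbest[d] - x[d])   # pull toward personal best
                   + c2 * random.random() * (gbest[d] - x[d]))  # pull toward global best
            new_v.append(vel)                                   # Equation (2)
            new_x.append(x[d] + vel)                            # Equation (3)
        return new_x, new_v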
The parameters utilized in our experiments are listed in Table I.

TABLE I: THE PARAMETERS UTILIZED IN OUR EXPERIMENTS

    Parameter                       Value
    Population size                 10, 25, or 50
    Learning coefficients c1, c2    uniformly distributed in [0, 4]
    Maximum number of iterations    10
    Particle dimensions             the number of hidden layers, within the range [1, 200], and the number of neurons in each layer, within the range [1, 10]
    Hidden layers velocity          MinLayerVelocity = -0.1 (MaxLayers - MinLayers); MaxLayerVelocity = +0.1 (MaxLayers - MinLayers)
    Neuron velocity                 MinNeuronVelocity = -0.1 (MaxNeurons - MinNeurons); MaxNeuronVelocity = +0.1 (MaxNeurons - MinNeurons)

Algorithm 1: PSO for Parameter Optimization of Deep Learning Models

Input: Wi-Fi dataset: location, time, and MAC addresses
Output: Optimal configuration, in terms of the number of hidden layers and the number of neurons in each layer, for the deep learning model.
Begin:
1) Initialization
   a. Set the values of the acceleration constants (c1 and c2), the inertia weight w, PopSize, and MaxIt, and specify the range bounds: MinLayer, MaxLayer, MinNeurons, MaxNeurons, MaxLayerVelocity, and MaxNeuronVelocity.
   b. Define the fitness function (i.e., the deep learning model's accuracy).
   c. Establish an initial random population for the number of hidden layers and the number of neurons in each layer.
   d. Calculate the fitness value for each particle and set the personal best (pbest) for each particle and the global best (gbest) for the population.
2) Repeat the following steps until the gbest solution no longer changes or the maximum number of iterations is reached:
   a. Update the number of hidden layers, the number of neurons in each layer, and their velocities for each particle according to Equations (4) through (7).
   b. Calculate the fitness value for each particle. If the fitness value of the new location is better than the fitness value of the personal best, the new location becomes the personal best location.
   c. If the currently best particle in the population is better than the global best, that particle replaces the recorded global best.
3) Return the optimal number of hidden layers and the number of neurons in each layer for the deep learning model.
End

Algorithm 1 above provides the details of our proposed PSO-based parameter selection technique for deep learning models. The algorithm is presented for the campus occupant prediction scenario using collected Wi-Fi data; this scenario is fully explored in the next section (i.e., Section IV).

In our implementation of PSO, the i-th particle's velocity and position are updated as follows:

- Velocity of the number of layers:

$V_{L,i}^{t+1} = w \cdot V_{L,i}^t + c_1 \cdot rand \cdot (L_i^{best} - L_i^t) + c_2 \cdot rand \cdot (G_L^{best} - L_i^t)$   (4)

where $V_L$ is the velocity of the number of hidden layers, $L^{best}$ is the particle's best local value of the number of hidden layers, and $G_L^{best}$ is the best global value of the number of hidden layers.

- Velocity of the number of neurons:

$V_{N,i}^{t+1} = w \cdot V_{N,i}^t + c_1 \cdot rand \cdot (N_i^{best} - N_i^t) + c_2 \cdot rand \cdot (G_N^{best} - N_i^t)$   (5)

where $V_N$ is the velocity of the number of neurons in each hidden layer, $N^{best}$ is the particle's best local value of the number of neurons in each hidden layer, and $G_N^{best}$ is the best global value of the number of neurons in each hidden layer.

- Position for the number of layers:

$L_i^{t+1} = L_i^t + V_{L,i}^{t+1}$   (6)

- Position for the number of neurons:

$N_i^{t+1} = N_i^t + V_{N,i}^{t+1}$   (7)
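The sketch below puts Algorithm 1 and Equations (4) through (7) together. It is written in Python for illustration (our experiments use R); the velocity bounds follow Table I, while the rounding and clamping of the integer-valued positions, the zero initial velocities, the default constants, and stopping only on the iteration budget (rather than also on the no-change test of step 2) are implementation assumptions of this sketch:

    import random

    def clamp(value, lo, hi):
        return max(lo, min(hi, value))

    def pso_tune(fitness, min_layers, max_layers, min_neurons, max_neurons,
                 pop_size=10, max_it=10, w=0.7, c1=2.0, c2=2.0):
        # fitness(layers, neurons) trains a deep learning model with that
        # topology and returns its accuracy (higher is better).
        vmax_l = 0.1 * (max_layers - min_layers)     # layer velocity bound (Table I)
        vmax_n = 0.1 * (max_neurons - min_neurons)   # neuron velocity bound (Table I)

        # Step 1: random initial population; velocities start at zero.
        pos = [(random.randint(min_layers, max_layers),
                random.randint(min_neurons, max_neurons)) for _ in range(pop_size)]
        vel = [(0.0, 0.0) for _ in range(pop_size)]
        pbest = list(pos)
        pbest_fit = [fitness(l, n) for (l, n) in pos]
        g = max(range(pop_size), key=lambda i: pbest_fit[i])
        gbest, gbest_fit = pbest[g], pbest_fit[g]

        # Step 2: iterate until the iteration budget is exhausted.
        for _ in range(max_it):
            for i in range(pop_size):
                (l, n), (vl, vn) = pos[i], vel[i]
                r1, r2 = random.random(), random.random()
                # Equations (4) and (5): independent velocities for the
                # number of layers and the number of neurons.
                vl = clamp(w * vl + c1 * r1 * (pbest[i][0] - l)
                           + c2 * r2 * (gbest[0] - l), -vmax_l, vmax_l)
                vn = clamp(w * vn + c1 * r1 * (pbest[i][1] - n)
                           + c2 * r2 * (gbest[1] - n), -vmax_n, vmax_n)
                # Equations (6) and (7), rounded and clamped to the valid ranges.
                l = clamp(int(round(l + vl)), min_layers, max_layers)
                n = clamp(int(round(n + vn)), min_neurons, max_neurons)
                pos[i], vel[i] = (l, n), (vl, vn)
                f = fitness(l, n)
                if f > pbest_fit[i]:                 # step 2b: update personal best
                    pbest[i], pbest_fit[i] = (l, n), f
                    if f > gbest_fit:                # step 2c: update global best
                        gbest, gbest_fit = (l, n), f

        # Step 3: best topology found and its accuracy.
        return gbest, gbest_fit

A caller supplies the range bounds of Table I together with a fitness callback, e.g., a hypothetical train_and_score(layers, neurons) routine that trains a model with the given topology and returns its test accuracy, mirroring Algorithm 1's use of model accuracy as the fitness function.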
IV. EXPERIMENTAL RESULTS AND LESSONS LEARNED

In our experiments, we select a smart building application to assess the performance of our proposed PSO-based parameter selection technique. We built deep learning models based on six weeks (January 15, 2016 – February 29, 2016) of Wi-Fi access data collected from 14 buildings on the campus of the University of Houston. Our goal is to build a deep learning model that predicts the number of occupants at a given location 15, 30, and 60 minutes from the current time. Awareness of the number of occupants in a building at a given time is crucial for many smart building applications, including energy efficiency and emergency response services [21].

Our experiments were conducted using the R language. We executed our experiments on a 24-core machine with a 2.40 GHz Intel Xeon CPU and 32 GB of RAM. In our scenarios, we split the six-week dataset into seven parts, each corresponding to a day of the week. Each dataset has the following features: Access Point ID (APID), Date, Time, User MAC address, and Building number. The three targets that our deep learning model needs to predict are the counts of MAC addresses 15, 30, and 60 minutes from the current time at a given date, time, and location (i.e., APID and Building number). In the process, we built a deep learning model for each day of the week. Table II summarizes the different parts of the dataset. Further, each dataset for a specific day of the week has been split into training and testing sets: the first five weeks of the dataset were used as the training set, while the data pertaining to the sixth week was used as the testing set. We then set out to address the main goal of this paper, which is to compare our proposed PSO-based parameter selection technique with the grid search technique in terms of finding the best parameters for the seven models that correspond to the days of the week.

To evaluate and compare the grid search and PSO approaches, we measure both the accuracy and the number of configurations that need to be explored to reach the best accuracy. In the case of PSO, the algorithm terminates when the maximum number of iterations is reached or when there is no difference between the accuracies of two consecutive iterations. Since the count of occupants at a given time and location is a continually changing number (i.e., this is a regression problem), it does not make much sense to predict the exact number of occupants N at a given date, time, and location. Rather, it is more practical to allow a small tolerance in the count, for example N ± n. Therefore, we consider clusters with window size ±n (e.g., n = 20) when we evaluate the accuracy of the predicted occupancy for each dataset.
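A prediction is thus counted as correct when it falls within ±n of the actual count. A minimal Python expression of this tolerance-based accuracy follows (the function and variable names are ours, for illustration only; the experimental code itself was written in R):

    def tolerance_accuracy(predicted, actual, n=20):
        # Fraction of predictions within +/- n occupants of the true count.
        hits = sum(abs(p - a) <= n for p, a in zip(predicted, actual))
        return hits / len(actual)

    # Example: tolerance_accuracy([150, 80], [167, 110], n=20) -> 0.5
    # (the first prediction is within 20 of 167; the second misses 110 by 30).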
Three different swarm sizes of 10, 25, and 50 particles are used in our experiments. Figure 2 shows the accuracy (cf. Figure 2a) and the number of configurations that need to be evaluated to achieve that accuracy (cf. Figure 2b) when predicting the occupancy within a 60-minute time window. These two figures jointly illustrate that, by using a small population size (e.g., 10 particles), the PSO-based parameter value selection technique was able to achieve an accuracy almost the same as that achieved with a larger population size (e.g., 25 or 50 particles) for almost all the datasets that we experimented with. Therefore, the PSO-based parameter value selection approach does not require a large number of particles to produce competitive results. Another observation drawn from Figure 2(b) is that the number of configurations the PSO-based technique needs to evaluate to reach the globally best solution is almost one-third and one-fifth of the number of configurations that need to be evaluated by the grid search method when the PSO-based technique employs 25 and 50 particles, respectively. This demonstrates that the PSO-based technique can be computationally efficient for determining the deep learning parameters. Therefore, in the following experiments we simply consider the PSO-based technique with 10 particles and compare our results with the grid search technique.

Figures 3-5 show the number of different configurations that need to be evaluated to reach the globally best solution when predicting the occupancy in the next 60, 30, and 15 minutes, respectively. These figures illustrate that better accuracy can be achieved when using our proposed PSO-based parameter value selection technique while evaluating a significantly lower number of configurations compared to the grid search approach. This clearly exhibits the superiority of the PSO-based technique over grid search; thus, it can serve as a strong candidate for parameter tuning of deep machine learning models. Of course, one needs to carefully analyze the dataset biases or domain-specific properties that give rise to these results, but that is beyond the scope of this paper and is left for future extensions.

TABLE II: TRAINING AND TESTING DATASETS FOR THE DAYS OF THE WEEK

    Dataset   Records in     Records in    Actual occupancy   Actual occupancy   Actual occupancy
              training set   testing set   in next 60 min     in next 30 min     in next 15 min
    Sat.      335137         71551         167                110                93
    Sun.      213434         108597        184                100                80
    Mon.      1686200        795439        715                648                488
    Tue.      2129033        411025        732                628                474
    Wed.      2141754        404023        792                618                481
    Thur.     1986703        269976        794                689                493
    Fri.      1200046        253995        323                262                234
Fig. 2: Comparison between three different swarm sizes (10, 25, and 50). (a) Accuracy vs. model; (b) number of different configurations vs. model.

Fig. 3: Comparison between PSO and grid search for prediction within a 60-minute interval. (a) Accuracy vs. model; (b) number of different configurations vs. model.

Fig. 4: Comparison between PSO and grid search for prediction within a 30-minute interval. (a) Accuracy vs. model; (b) number of different configurations vs. model.

Fig. 5: Comparison between PSO and grid search for prediction within a 15-minute interval. (a) Accuracy vs. model; (b) number of different configurations vs. model.
V. CONCLUSIONS AND FUTURE WORK

Multiple parameters have to be set and tuned for deep learning models, and these parameters can have a significant influence on the results and the computational needs of such models. Optimization methods therefore need to be used to help find optimal parameter settings, so that the user can focus on the results of deep learning rather than spend time and effort deciding on the optimal parameter values. This paper presents a PSO-based parameter value selection technique that optimizes the performance of deep learning models by selecting the number of hidden layers and the number of neurons in each layer. Our results show that the proposed PSO algorithm is useful in the process of training deep learning models. We demonstrated the performance of the proposed technique in a smart building scenario in which the number of occupants needs to be predicted for the next 60, 30, and 15 minutes based on collected Wi-Fi data. The results obtained show that training times decreased by 77%–85% when using the PSO-based approach compared to the grid search method. Our proposed PSO-based technique also gives better classification accuracy than the grid search approach. As a future extension, we intend to explore the use of PSO to tune other deep learning parameters, such as the activation functions and the number of epochs. Note that it is easy to implement parallel versions of PSO on GPUs, which would further reduce training times while letting researchers focus on extracting subject-matter knowledge using deep learning models rather than on the parameter value selection process itself.

ACKNOWLEDGMENT

This article was made possible by NPRP grant #[7-1113-1-199] from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: The MIT Press, 2016.
[2] "Machine Learning and Understanding for Intelligent Extreme Scale Scientific Computing and Discovery," Advanced Scientific Computing Research (ASCR) Division of the Office of Science, U.S. Department of Energy, Workshop Report, Jan. 2015. [Online]. Available: https://www.orau.gov/machinelearning2015/Machine_Learning_DOE_Workshop_Report_6.pdf. [Accessed: Jan. 28, 2017].
[3] Y. Malitsky, D. Mehta, B. O'Sullivan, and H. Simonis, "Tuning parameters of large neighborhood search for the machine reassignment problem," in International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, 2013, pp. 176–192.
[4] Y. Ganjisaffar, T. Debeauvais, S. Javanmardi, R. Caruana, and C. V. Lopes, "Distributed tuning of machine learning algorithms using MapReduce clusters," in Proceedings of the Third Workshop on Large Scale Data Mining: Theory and Applications, 2011, p. 2.
[5] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proceedings of the IEEE International Conference on Neural Networks IV, 1995, pp. 1942–1948.
[6] C. W. de Silva, Mechatronic Systems: Devices, Design, Control, Operation and Monitoring. Boca Raton: CRC Press, 2007.
[7] J. Karwowski, M. Okulewicz, and J. Legierski, "Application of Particle Swarm Optimization Algorithm to Neural Network Training Process in the Localization of the Mobile Terminal," in Engineering Applications of Neural Networks, 2013, pp. 122–131.
[8] Y. M. M. Hassim and R. Ghazali, "Solving a classification task using Functional Link Neural Networks with modified Artificial Bee Colony," in 2013 Ninth International Conference on Natural Computation (ICNC), 2013, pp. 189–193.
[9] H. Shah and R. Ghazali, "Prediction of Earthquake Magnitude by an Improved ABC-MLP," in 2011 Developments in E-systems Engineering, 2011, pp. 312–317.
[10] L. Qiongshuai and W. Shiqing, "A hybrid model of neural network and classification in wine," in 2011 3rd International Conference on Computer Research and Development, 2011, vol. 3, pp. 58–61.
[11] B. A. Garro, H. Sossa, and R. A. Vázquez, "Artificial neural network synthesis by means of artificial bee colony (ABC) algorithm," in 2011 IEEE Congress of Evolutionary Computation (CEC), New Orleans, LA, 2011, pp. 331–338. doi: 10.1109/CEC.2011.5949637.
[12] K. Bovis, S. Singh, J. Fieldsend, and C. Pinder, "Identification of masses in digital mammograms with MLP and RBF nets," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Como, 2000, pp. 342–347.
[13] S. Mirjalili, S. Z. Mohd Hashim, and H. Moradian Sardroudi, "Training feedforward neural networks using hybrid particle swarm optimization and gravitational search algorithm," Applied Mathematics and Computation, vol. 218, no. 22, Jul. 2012, pp. 11125–11137.
[14] S. M. H. Bamakan, H. Wang, and A. Z. Ravasan, "Parameters Optimization for Nonparallel Support Vector Machine by Particle Swarm Optimization," Procedia Computer Science, vol. 91, 2016, pp. 482–491.
[15] A. Blum, Neural Networks in C++: An Object-Oriented Framework for Building Connectionist Systems, 1st ed. New York: Wiley, 1992.
[16] Z. Boger and H. Guterman, "Knowledge extraction from artificial neural network models," in 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, 1997, vol. 4, pp. 3030–3035.
[17] K. Swingler, Applying Neural Networks: A Practical Guide. San Francisco: Morgan Kaufmann, 1996.
[18] G. S. Linoff and M. J. A. Berry, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 3rd ed. Indianapolis, IN: Wiley, 2011.
[19] A. Arora, A. Candel, J. Lanford, E. LeDell, and V. Parmar, Deep Learning with H2O. 2015.
[20] C. J. A. Bastos-Filho, D. F., M. P., P. B. C. Miranda, and E. M. N. Figueiredo, "Multi-Ring Particle Swarm Optimization," in Evolutionary Computation, W. P. dos Santos, Ed. InTech, 2009.
[21] V. L. Erickson, M. Á. Carreira-Perpiñán, and A. E. Cerpa, "Occupancy Modeling and Prediction for Building Energy Management," ACM Trans. Sen. Netw., vol. 10, no. 3, May 2014, pp. 42:1–42:28.