

2017 International Conference on Computer, Electronics and Communication Engineering (CECE 2017)
ISBN: 978-1-60595-476-9

Parameter Tuning in Random Forest Based on Grid Search Method for Gender Classification Based on Voice Frequency
Muhammad Murtadha RAMADHAN*, Imas Sukaesih SITANGGANG,
Fahrendi Rizky NASUTION and Abdullah GHIFARI
Computer Science Department, Faculty of Mathematics and Natural Sciences,
Bogor Agricultural University, Indonesia
*Corresponding author

Keywords: Classification, Parameter tuning, Random Forest.

Abstract. Parameter optimization is one method to improve the accuracy of machine learning algorithms. This study applied the grid search method to tune parameters of the well-known classification algorithm Random Forests. Random Forests was applied to the voice gender dataset to identify gender based on characteristics of the human voice. Two parameters were tuned to obtain optimal values: the number of variables used in building trees and the number of trees in the classifier. Experimental results on the voice gender dataset show that the highest accuracy of Random Forest with parameter tuning is 0.96907, which is higher than the accuracy of the model without parameter tuning (0.9675). The optimal parameters for the best classifier are 'sqrt' for the number of variables, i.e. the square root of the number of variables in the dataset, and 300 for the number of trees. This study shows that parameter tuning yields optimal parameters for developing the best classifier using Random Forests.

Introduction
Parameter tuning in machine learning algorithms is an important task for obtaining optimal parameter values. Several studies have successfully proposed and implemented parameter tuning methods to obtain classification models with higher accuracy. Friedrichs and Igel [3] proposed a general approach for determining the kernel from a parameterized kernel space in SVM based on the evolution strategy. Their study shows that extended Gaussian kernels with scaling and rotation parameters can lead to significantly better performance than standard Gaussian kernels. Bergstra and Bengio [6] studied random search and sequential manual optimization for optimizing the hyperparameters of learning algorithms, including the Deep Belief Network (DBN). Their study shows that random experiments are more efficient than grid experiments for hyperparameter optimization in several learning algorithms because not all hyperparameters are equally important to tune [6]. Wenwen et al. [9] proposed an improved grid search algorithm as an extension of the traditional grid search algorithm for parameter optimization of the support vector machine (SVM). Experimental results on tumor gene datasets show that the proposed grid search algorithm yields higher classification accuracy for principal component analysis (PCA)-SVM and kernel PCA (KPCA)-SVM than the traditional grid search algorithm. Dewancker et al. [5] propose a mechanism, based on nonparametric statistical analysis, for comparing the performance of multiple optimization methods across multiple performance metrics. Klein et al. [1] proposed a generative model for the validation error as a function of training set size in order to accelerate hyperparameter optimization. Their study developed a Bayesian optimization procedure named FABOLAS, which often finds high-quality solutions 10 to 100 times faster than other Bayesian optimization methods.
This study implements the grid search method to obtain optimal parameters for the Random Forest algorithm. Random Forests is one of the most widely used classification algorithms; it creates many classification trees for identifying the class label of new objects, and a majority vote over the trees determines the class label of a new object. The optimal parameters of Random Forests are then used to develop a classification model on the human voice frequency dataset for gender identification.
The paper is organized as follows. The introduction is in Section 1. The research methodology, including Random Forests and the grid search method, is briefly discussed in Section 2. Section 3 presents the results and discussion. Finally, we summarize the conclusions in Section 4.

Methods
Dataset
The dataset used in this work is the voice gender dataset, which is available at https://www.kaggle.com/primaryobjects/voicegender. The dataset consists of 3,168 recorded male and female voice samples and is commonly used for gender recognition by voice and speech analysis. There are 20 predictor variables and one target variable in the dataset. The target variable is gender, with class labels Male and Female; each label has 1,584 objects. Detailed attribute descriptions are provided at the same URL.
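As an illustration (not from the paper), the dataset can be loaded with pandas; a minimal sketch, assuming the Kaggle CSV is saved locally as voice.csv and that the target column is named label (both assumptions about the download):

```python
import pandas as pd

# Load the voice gender dataset (the file name "voice.csv" and the target
# column name "label" are assumptions about the Kaggle download).
df = pd.read_csv("voice.csv")

X = df.drop(columns=["label"])   # the 20 acoustic predictor variables
y = df["label"]                  # target: male / female

print(X.shape)           # expected: (3168, 20)
print(y.value_counts())  # expected: 1584 objects per class
```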
Random Forest Algorithm
Random Forest is categorized as an ensemble learning method that generates many classifiers and aggregates their results for prediction [2]. Breiman [8] explains that random forests perform classification by combining tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. In addition, random forests are constructed differently from standard classification or regression trees. Liaw and Wiener [2] explain that standard trees split each node using the best split among all variables, whereas in random forests each node is split using the best among a subset of predictors randomly chosen at that node.
The random forest algorithm (for both classification and regression) is as follows [2]:
1. Draw ntree bootstrap samples from the original data.
2. For each of the bootstrap samples, grow an unpruned classification or regression tree, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample mtry of the predictors and choose the best split from among those variables. (Bagging can be thought of as the special case of random forests obtained when mtry = p, the number of predictors.)
3. Predict new data by aggregating the predictions of the ntree trees (i.e., majority votes for classification, the average for regression).
An estimate of the error rate can be obtained from the training data as follows:
1. At each bootstrap iteration, predict the data not in the bootstrap sample (what Breiman calls "out-of-bag", or OOB, data) using the tree grown with the bootstrap sample.
2. Aggregate the OOB predictions. (On average, each data point is out-of-bag in around 36% of the bootstrap samples.) Calculate the error rate, and call it the OOB estimate of the error rate.
Liaw and Wiener [2] show in their study that the OOB estimate of the error rate is quite accurate, given that enough trees have been grown.
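As an aside (not part of the paper), scikit-learn exposes this OOB estimate directly; a minimal sketch, assuming X and y hold the predictors and labels loaded earlier:

```python
from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and report the out-of-bag (OOB) accuracy estimate.
# ntree corresponds to n_estimators and mtry to max_features in scikit-learn.
rf = RandomForestClassifier(
    n_estimators=300,     # ntree: number of bootstrap samples / trees
    max_features="sqrt",  # mtry: predictors sampled at each split
    oob_score=True,       # aggregate OOB predictions while fitting
    random_state=0,
)
rf.fit(X, y)

print("OOB accuracy estimate:", rf.oob_score_)
print("OOB error estimate:", 1 - rf.oob_score_)
```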
Grid Search Method
The grid search method is a way to find the best parameters for a model so that the classifier can accurately predict unlabelled data (i.e. testing data) [10]. The method is categorized as exhaustive: every candidate parameter value must be explored, so a set of candidate values is defined in advance, and the method then reports the score for each candidate so that the best one can be chosen. This method is applicable when the required maximum is known to lie within a finite region defined by upper and lower bounds on each of the independent variables [10].
Hsu, Chang, and Lin [4] show that the grid search method is best used together with cross-validation to obtain the best values of two parameters, in their case C and γ. Their study examines various pairs of (C, γ) and picks the pair with the best cross-validation accuracy. The experiments give two motivations for choosing the grid search approach: 1. one may not feel safe using methods that avoid an exhaustive parameter search by approximations or heuristics; 2. the computational time required to find good parameters by grid search is not much more than that of advanced methods, since there are only two parameters.
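To make the exhaustive character of the method concrete, here is an illustrative sketch (ours, not from [4] or [10]) of a grid search over two hypothetical SVM parameters C and γ, scored with 7-fold cross-validation:

```python
from itertools import product

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Candidate values are fixed in advance; every combination is evaluated.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

best_score, best_params = -1.0, None
for C, gamma in product(param_grid["C"], param_grid["gamma"]):
    model = SVC(C=C, gamma=gamma)
    score = cross_val_score(model, X, y, cv=7).mean()  # 7-fold CV accuracy
    if score > best_score:
        best_score, best_params = score, {"C": C, "gamma": gamma}

print(best_params, best_score)
```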

Results and Discussion
This study implemented five classification algorithms on the voice frequency dataset: K-Nearest Neighbor (KNN), Logistic Regression, Random Forest, Decision Trees and Neural Network. The accuracies of the resulting classifiers are provided in Table 1; we applied 7-fold cross-validation to calculate the accuracy of each classifier. The results show that Random Forest performs better than the other classification algorithms, with the highest accuracy of 0.9675 on the dataset both before and after transformation. We performed the data transformation by rescaling the values of the main variables to the interval [0, 1].
Table 1. Accuracy of classification algorithms on the dataset before and after transformation.

Algorithm            Accuracy before data transformation   Accuracy after data transformation
KNN                  0.6826                                0.9505
Logistic Regression  0.8897                                0.9669
Random Forest        0.9675                                0.9675
Decision Trees       0.9527                                0.9527
Neural Network       0.7640                                0.9603
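A sketch of how the comparison in Table 1 could be reproduced; the paper does not report the settings of the individual estimators, so scikit-learn defaults are assumed, and min-max scaling stands in for the [0, 1] transformation described above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Estimator settings are assumptions; the paper only names the algorithms.
models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Trees": DecisionTreeClassifier(random_state=0),
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
}

for name, model in models.items():
    before = cross_val_score(model, X, y, cv=7).mean()
    # Transformation: rescale each variable to [0, 1] before fitting.
    after = cross_val_score(make_pipeline(MinMaxScaler(), model), X, y, cv=7).mean()
    print(f"{name}: before={before:.4f} after={after:.4f}")
```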

In order to improve the accuracy of Random Forests, this study tuned parameters of the Random Forests algorithm using the grid search approach. The Random Forests algorithm has several parameters that can be adjusted to obtain an optimal classifier. Two of those parameters are the maximum number of variables used in an individual tree and the number of trees constructed for classifying new data. This study uses the class GridSearchCV, which is available in Scikit-Learn (http://scikit-learn.org). GridSearchCV considers all parameter combinations to obtain optimal parameter values: all possible combinations of parameter values are evaluated, and the best combination is retained to build the best classifier.
In GridSearchCV, the maximum number of variables used in an individual tree is denoted max_features, whereas the number of trees to be constructed in the model is named n_estimators. Table 2 provides the results of parameter tuning for Random Forest on the voice gender dataset; a code sketch follows this paragraph. The score in Table 2 represents the accuracy of the classifier, calculated using 7-fold cross-validation. We tried two options for max_features, namely sqrt and log2: if max_features is sqrt then max_features equals sqrt(n_features), whereas if max_features is log2 it equals log2(n_features).
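The paper confirms the use of GridSearchCV with these two parameters; the following minimal sketch corresponds to run 1 of Table 2 below, with the cv and scoring settings assumed to match the reported 7-fold cross-validation accuracy:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Parameter grid from run 1 in Table 2.
param_grid = {
    "max_features": ["sqrt", "log2"],
    "n_estimators": [10, 100, 1000, 2000],
}

# Exhaustively evaluate every combination with 7-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=7,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_score_)   # the paper reports 0.96812 for this grid
print(search.best_params_)  # {'max_features': 'sqrt', 'n_estimators': 1000}
```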

Table 2. Best parameters of Random Forests as the results of parameter tuning using GridSearchCV.

No  Tuning Parameters                                            Score (7-fold CV)  Best Parameters
1   max_features=[sqrt, log2]; n_estimators=[10,100,1000,2000]   0.96812            {'max_features': 'sqrt', 'n_estimators': 1000}
2   max_features=[sqrt]; n_estimators=[500,1000,1500]            0.96875            {'max_features': 'sqrt', 'n_estimators': 500}
3   max_features=[sqrt]; n_estimators=[200,300,500,700]          0.96907            {'max_features': 'sqrt', 'n_estimators': 300}
4   max_features=[sqrt]; n_estimators=[250,300,350,400]          0.96907            {'max_features': 'sqrt', 'n_estimators': 300}
5   max_features=[sqrt]; n_estimators=[275,300,325]              0.96907            {'max_features': 'sqrt', 'n_estimators': 300}
6   max_features=[sqrt]; n_estimators=[290,300,310]              0.96907            {'max_features': 'sqrt', 'n_estimators': 300}

According to Table 2, the highest accuracy of the classifier is 0.96907, at the parameters 'max_features' = 'sqrt' and 'n_estimators' = 300; the accuracy of Random Forest without parameter tuning is 0.9675. Therefore the Random Forest model with those two parameter values is used to classify the objects in the voice gender dataset. Figure 1 illustrates the attribute scores produced by Random Forest at the best parameters. As shown in Figure 1, the attribute meanfun, the average fundamental frequency measured across the acoustic signal, has the highest score of 0.3836, meaning that this variable has a strong influence in the model when classifying an object into the class Male or the class Female.

[Figure 1 is a bar chart of attribute scores (y-axis: Score; x-axis: attributes in the voice gender dataset). The highest bar is meanfun at 0.38, followed by bars at 0.18, 0.13 and 0.07; the remaining attributes score between 0.01 and 0.05.]

Figure 1. Attribute scores of the voice gender dataset.
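The scores in Figure 1 are consistent with scikit-learn's impurity-based feature importances; a minimal sketch of how such scores can be listed, assuming the fitted search object from the tuning sketch above (our illustration, not the authors' code):

```python
import pandas as pd

# Feature importances of the best model found by the grid search.
best_rf = search.best_estimator_
scores = pd.Series(best_rf.feature_importances_, index=X.columns)

# Sort descending; in the paper, meanfun ranks first with a score of 0.3836.
print(scores.sort_values(ascending=False).round(4))
```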

Summary
This study has applied the grid search approach, as implemented in GridSearchCV, to find optimal parameters for the Random Forest algorithm. Experimental results on the voice gender dataset show that Random Forest provides the best classifier, with an accuracy of 0.9675, compared to the other classification algorithms, namely K-Nearest Neighbor (KNN), Logistic Regression, Decision Trees and Neural Network. Tuning two parameters of Random Forest yields optimal values of sqrt(n_features) for the maximum number of variables used in an individual tree (max_features) and 300 for the number of trees constructed in the model (n_estimators). With these two optimal parameter values, the accuracy of Random Forest increases to 0.96907. The results show that parameter tuning has successfully generated the best classifier for classifying new data.

Acknowledgement
The authors would like to thank the Directorate of Student Affairs, Bogor Agricultural University, Indonesia for its support during the research process.

References
[1] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, Fast Bayesian optimization of machine learning hyperparameters on large datasets, in: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[2] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2(3) (2002) 18-22.
[3] F. Friedrichs, C. Igel, Evolutionary tuning of multiple SVM parameters, in: Proceedings of the 12th European Symposium on Artificial Neural Networks (ESANN), 2004.
[4] C.W. Hsu, C.C. Chang, C.J. Lin, A practical guide to support vector classification, 2003, pp. 1-16.
[5] I. Dewancker, M. McCourt, S. Clark, P. Hayes, A. Johnson, G. Ke, A strategy for ranking optimization methods using multiple criteria, JMLR: Workshop and Conference Proceedings 64 (2016) 11-20, ICML 2016 AutoML Workshop.
[6] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, Journal of Machine Learning Research 13 (2012) 281-305.
[7] J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques, third ed., Elsevier, Waltham, 2011.
[8] L. Breiman, Random forests, Machine Learning 45(1) (2001) 5-32.
[9] L. Wenwen, X. Xiaoxue, L. Fu, Z. Yu, Application of improved grid search algorithm on SVM for classification of tumor gene, International Journal of Multimedia and Ubiquitous Engineering 9(11) (2014) 181-188.
[10] M. Ataei, M. Osanloo, Using a combination of genetic algorithm and the grid search method to determine optimum cutoff grades of multiple metal deposits, International Journal of Surface Mining, Reclamation and Environment 18(1) (2004) 60-78.

