
2020 4th International Conference on Informatics and Computational Sciences (ICICoS)

Classification of Headache Disorder Using Random Forest Algorithm
Dhiyaussalam, Adi Wibowo, Fajar Agung Nugroho
Department of Informatics, Diponegoro University, Semarang, Indonesia
[email protected], [email protected], [email protected]

Eko Adi Sarwoko
Department of Informatics, Diponegoro University, Semarang, Indonesia
[email protected]

I Made Agus Setiawan
School of Health and Rehabilitation Science, University of Pittsburgh, USA
[email protected]

Abstract— Headache disorder is one of the most common illnesses: at least 50% of the world's population has experienced a headache. Primary headaches have several types: migraine, tension, cluster, and medication overuse. Computer-aided diagnosis could help people identify the headache type without needing to see a doctor. In this study, the Random Forest algorithm was used to produce a reliable model for classifying headache types and to generate feature importance values. The Migbase dataset was used, and several parameters of the algorithm were tuned to produce the best model. Based on the experiment results, the best accuracy reached 99.56%, with the Random Forest parameters 100 for n_estimators, 33 for max_features, and None for max_depth.

Keywords— Primary Headache; Classification; Machine Learning; Random Forest

I. INTRODUCTION

Headache is a symptom of nervous system disorders that occurs in the head. Headaches can be experienced at all ages, across races and economic statuses, and are more common in women. Headaches occur in almost 50% of the population and are the third most common symptom in the world. Primary headaches have four types, namely migraine, tension, cluster, and medication-overuse headache. Statistically, the most common type is tension headache at 60-80%, followed by migraine at 15% and cluster headache at 0.1%. Medication-overuse headache commonly occurs alongside other primary headaches [1].

Lack of knowledge about primary headache symptoms is an obstacle to choosing the right treatment. Worldwide, on average, only 4 hours are dedicated to headache treatment, and only 50% of headaches are diagnosed and treated. The public considers headache symptoms a non-serious problem: half of headache sufferers choose to treat themselves, and few realize that effective treatment exists. If a headache is not treated, it can cause sufferers to experience anxiety and even depression [2].

The role of computers in diagnosing a disease can improve the quality of health services. A diagnosis made as early as possible can facilitate treatment of the disease. Using computers for early diagnosis makes things easier for sufferers because it simplifies the diagnosis process: patients do not need to meet a doctor directly and can simply use a computer, making the diagnosis process more effective and faster [3]. Computers can also be used as a tool for doctors to diagnose primary headaches appropriately. Computer-made diagnosis of primary headaches eases the work of doctors, who then only need to take an anamnesis; sufferers can even diagnose primary headaches independently using a computer in order to obtain appropriate treatment [4].

In the past decade, a number of machine learning classification algorithms have been used to classify the types of headaches. Vandewiele et al. (2018) used several algorithms to produce a model and suggested using the Decision Tree as the classification model [5]. Krawczyk et al. (2013) compared several algorithms together with feature selection algorithms, and the highest accuracy was obtained with the Random Forest algorithm [4]. Aljaaf et al. (2015) conducted a study using a number of algorithms and artificial datasets based on The International Classification of Headache Disorders (ICHD-2) and suggested the Decision Tree as the method with the highest accuracy [6].

One of the best machine learning methods for classifying primary headaches is Random Forest, which performs better than most single classification models. With Random Forest it is possible to produce a system that diagnoses primary headaches automatically with accuracy comparable to, or even better than, doctors [4].

Random Forest is a very successful algorithm for both classification and regression tasks. It is highly efficient for both problem types and performs very well on large amounts of data; setting some of its parameter values further improves its performance [7].

In this research the Random Forest algorithm is used because it is not susceptible to overfitting, which tends to occur on datasets whose features have high variability [8]. The health dataset used here has quite high variation, so the best possible model is needed to classify headache types with high accuracy. To produce the best Random Forest model, the hyper-parameters need to be set up; in this study some parameters are set manually.

978-1-7281-9526-1/20/$31.00 ©2020 IEEE

II. METHOD

A. General Overview

The classification of headache types using the Random Forest algorithm is done in several stages. These stages are illustrated in the problem-solving outline diagram in Fig. 1.

Fig. 1. The problem solving outline diagram.

B. Random Forest

Random Forest is a machine learning algorithm that uses a combination of many Decision Trees to obtain classification and regression results. Random Forest achieves significant gains in classification accuracy thanks to a set of Decision Trees that classify individually and vote for the final classification result [7]. Results from Random Forest are better and less prone to overfitting than those of a single Decision Tree. Overfitting is a common problem in machine learning where performance on test data is not as good as performance during training on the training data. It is caused by high variation in the data and/or in the number of features used during training, resulting in an overly complex model [8].

Random Forest belongs to ensemble learning, which combines several models to create a single model that is better than an ordinary single model. Basically, each of these models is trained on the data and votes on the final result. Random Forest uses the Gini measure of impurity to select the split with the lowest impurity at each node; Gini impurity measures the distribution of class labels across nodes. Formally, for the variable $X = \{x_1, x_2, \dots, x_j\}$ at node $t$, where $j$ is the number of children of node $t$, $N$ is the number of samples, $n_{ci}$ is the number of samples with value $x_i$ belonging to class $c$, and $m_i$ is the number of samples with value $x_i$ at node $t$, the Gini impurity is formulated as:

$$I(t_{x_i}) = 1 - \sum_{c=0}^{C} \left( \frac{n_{ci}}{m_i} \right)^2$$

The split Gini index is the weighted average of the Gini measure over the different values of the variable $X$, formulated as:

$$G(r, X) = \sum_{i=1}^{j} \frac{m_i}{N} \, I(r_{x_i})$$

Feature importance can be calculated from the average impurity reduction over all the Decision Trees in a Random Forest, without assuming whether the data are linearly separable. Let $FI_i$ be the importance of feature $i$ in one Decision Tree, and let $k$ range over all nodes; the feature importance in each Decision Tree is formulated as:

$$FI_i = \frac{\sum_{j} G_{ij}}{\sum_{k} G_{ik}}$$

Then, with $RFFI_i$ the importance of feature $i$ in the Random Forest and $T$ the number of Decision Trees, the feature importance in the Random Forest is formulated as:

$$RFFI_i = \frac{\sum_{j} FI_{ij}}{T}$$

C. Collecting Dataset

The dataset used in this study was sourced from the Migbase dataset, which can be downloaded via http://www.migbase.com/migbase_dataset.xls. The dataset consists of 850 records with 39 features and 3 class labels, namely migraine, tension, and cluster, with proportions of 71.73%, 21.67%, and 6.60% respectively. Each feature has a different data type; the features and the data types for each feature can be seen in Table I [5].

TABLE I. THE TYPES OF DATA FOR EACH FEATURE.

headache_days — <1; 1 – 14; 7 – 365; None
durationGroup — A: 0 – 4 second; B: 5 – 199 second; C: 120 – 239 second; D: 240 – 899 second; E: 900 – 1799 second; F: 1800 – 10799 second; G: 10800 – 14399 second; H: 14400 – 259199 second; I: 259200 – 604799 second; J: 604800+ second
location — Unilateral; Orbital; Bilateral
severity — Mild; Moderate; Severe
characterisation — Pressing; Pulsating; Stabbing
nausea, vomitting, photophobia, aggravation, pericranial, conjunctival_injection, lacrimation, nasal_congestion, rhinorrhoea, eyelid_oedema, sweating, miosis, ptosis, agitation, motor_weakness, speech_disturbance, visual_sympthomps, sensory_simptomps, homonymous_symptomps, dysarthria, vertigo, tinnitus, hypacusia, diplopia, ataxia, decreased_consciousness, nasal_visual_symptomps, paraesthesias, aura_development, headache_with_aura, hemiplegic — Yes; No
aura_duration — None; Hour; Day
previous_attacks — 2 – 4; 5 – 9; 10 – 19; 20+
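As a concrete illustration of the Gini formulas in Section II-B, the node impurity and the split Gini index can be sketched in a few lines of Python; the class labels below are hypothetical and are not taken from the Migbase records:

```python
from collections import Counter

def gini_impurity(labels):
    # I(t) = 1 - sum_c (n_c / m)^2 over the samples reaching one child node
    m = len(labels)
    return 1.0 - sum((n / m) ** 2 for n in Counter(labels).values())

def split_gini(children):
    # G = sum_i (m_i / N) * I(child_i): weighted average over the split
    N = sum(len(child) for child in children)
    return sum(len(child) / N * gini_impurity(child) for child in children)

# Hypothetical split of 10 samples on a binary feature
left = ["migraine", "migraine", "migraine", "tension"]
right = ["tension", "tension", "cluster", "cluster", "cluster", "cluster"]

print(gini_impurity(left))        # 0.375
print(split_gini([left, right]))  # ~0.4167
```

The candidate split with the lowest split Gini index would be chosen at each node; a pure node (all labels equal) has impurity 0.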

D. Data Preprocessing

The stage after collecting the dataset is data preprocessing. At this stage all data are converted to numeric values to facilitate processing. After converting the data, features whose value is identical in every record are dropped, since such features cannot affect the modeling results. For example, the motor_weakness feature has the value "No" (i.e., 0) in every record. The removed features can be seen in Table II.

TABLE II. FEATURES DROPPED MANUALLY.

Feature — Constant Value
motor_weakness — 0
tinnitus — 0
hypacusia — 0
paraesthesias — 0
decreased_consciousness — 0
nasal_visual_symptomps — 0

In addition to the manual check in Table II, features can also be screened using the correlation of each feature with the target label. Features whose correlation value falls between -0.5 and 0.5 are removed, because a correlation value in that range indicates a linear relationship that is not very strong [9]. The features dropped by correlation value can be seen in Table III.

TABLE III. FEATURES DROPPED USING CORRELATION VALUE.

Features — Correlation Value
vomitting, hemiplegic — -0.2
severity, visual_symptomps, sensory_symptomps, aura_development, headache_with_aura, aura_duration — -0.1
speech_disturbance — -0.07
homonymous_symptomps — -0.04
ataxia, previous_attacks — 0.02
dysanthria, vertigo, diplopia — 0.04
headache_days, agitation — 0.1
pericranial — 0.2

After dropping these features, 14 features remain: durationGroup, location, characterisation, nausea, photophobia, phonophobia, aggravation, conjunctival_injection, lacramation, nasal_congestion, rhinorroea, eyelid_oedema, sweeting, and miosis.

E. Model Development

The classification model using the Random Forest algorithm in this study was built based on the Scikit-learn library documentation. The steps in building the Random Forest model can be seen in Fig. 2 [10].

Fig. 2. Flow Chart of Random Forest Algorithm.

F. Evaluation

The models built in the previous stage are evaluated on their accuracy. The model is evaluated using 10 records: 4 of class migraine, 3 of class tension, and 3 of class cluster. The prediction results can be seen in Table IV.

TABLE IV. CONFUSION MATRIX OF THE TEST RESULT ON ONE OF THE MODELS.

                        Actual Class
                   Migraine  Tension  Cluster
Predicted Migraine    4         0        0
Class     Tension     0         3        0
          Cluster     0         0        3

From Table IV we can calculate the accuracy of the tested model; an accuracy of 100% was obtained.

G. Parameter Optimization

To get the best performance from the Random Forest model, some parameters need to be set manually. Manually setting the number of trees, n_estimators, can reduce the error rate of a Random Forest model [11]. Underfitting can occur if the number of trees used is too small; to avoid it, the minimum value of n_estimators used is 10, with max_depth = 4.


The values of these parameters start at n_estimators = 10 and max_depth = 4 and are increased gradually. For n_estimators the values used are 10, 20, 50, and 100, while for max_depth the values used are 4, 5, and none (no limit) [12].

The other parameter set manually is max_features, with 3 choices: 6, 14, or 33. The value max_features = 33 uses all the features in the dataset before any are dropped using the correlation value. The value max_features = 14 uses all features remaining after dropping by correlation value. The 6 features are those used to classify primary headache in McGeeney's (2009) study [13]: durationGroup, location, severity, characterization, nausea, and photophobia.

III. RESULT AND DISCUSSION

A. Best Parameter

Random Forest model testing was conducted to find the best hyper-parameters. Training and testing scenarios were carried out in a Jupyter Notebook using the Migbase dataset containing 850 samples, with model performance assessed using cross-validation.

In the experiment to determine the best hyper-parameters, 36 Random Forest models were generated, trained using combinations of the hyper-parameters n_estimators, max_features, and max_depth. The hyper-parameter values were set manually: n_estimators with values 10, 20, 50, and 100; max_features with values 6, 14, and 33; and max_depth with values 4, 5, and none. The score used to evaluate model performance is the accuracy from 5-fold cross-validation. The scores of the model tests can be seen in Table V.

Based on the scores in Table V, the highest accuracy is 99.56%, obtained with the parameters n_estimators = 100, max_features = 33, and max_depth = none. This accuracy is higher than a recent study's accuracy score of 98.11% [5]. The classification results for each class on 20% random data from the dataset can be seen in Table VI, the confusion matrix of one of the Random Forest models.

From the confusion matrix in Table VI, the accuracy for each class can be obtained: 99.19% for the migraine class, 100% for the tension class, and 92.86% for the cluster class.

B. N_estimators

N_estimators determines the number of trees in the Random Forest model. The more trees in the model, the longer model training takes. The number of trees also affects the performance of the model: improving performance requires a large number of trees, but at some point performance reaches a maximum limit. Tuning n_estimators therefore aims to obtain the best model performance with good efficiency [14].

From Table V we can calculate the average accuracy for each value of n_estimators. The highest average accuracy, 98.74%, is obtained with n_estimators = 100, followed by n_estimators = 50 with 98.72%, then n_estimators = 10 with 98.62%, and finally n_estimators = 20 with 98.56%. So the best n_estimators value for this Random Forest model is 100.

C. Max_features

Max_features determines the maximum number of features used. In this study the values of max_features used are 6, 14, and 33. For max_features = 6, the features are selected manually: durationGroup, location, severity, characterization, nausea, and photophobia, the features used to diagnose types of headaches [13].

From Table V we can calculate the average accuracy for each value of max_features: 97.99% for max_features = 6, 98.78% for max_features = 14, and 99.22% for max_features = 33. Max_features = 33 has higher accuracy because more features are available to build the classification model, while max_features values of 6 and 14 use fewer features, which reduces the complexity of the resulting Random Forest model.

D. Max_depth

Max_depth determines the maximum depth of a tree; once the maximum depth is reached, a node cannot split again. The max_depth values used to find the best hyper-parameters are 4, 5, and none. The max_depth value limits the variation of features used in modeling: the greater the depth of a tree, the more complex the resulting tree. Setting max_depth can affect the performance of the Random Forest as long as the dataset used for each tree is the same [14].

From Table V we can calculate the average accuracy for each max_depth value: 98.65% for max_depth = 4, 98.68% for max_depth = 5, and 98.66% for max_depth = none, with the deepest tree reaching a depth of 17. In this study the value of max_depth has only a slight effect on the performance of the Random Forest model, because the complexity of the dataset used is not too high.

TABLE V. THE MODEL TEST RESULT SCORES USING CROSS-VALIDATION ACCURACY.

max_features        6                          14                         33
max_depth      4       5      None     4       5      None     4       5      None
n_estimators
10          97.94%  97.94%  97.94%  98.97%  98.68%  98.68%  98.97%  99.26%  99.26%
20          97.94%  97.94%  98.09%  98.68%  98.68%  98.97%  98.82%  99.12%  98.97%
50          98.09%  98.09%  97.94%  98.82%  98.97%  98.68%  99.26%  99.41%  99.26%
100         98.24%  97.94%  97.94%  98.82%  98.68%  98.68%  99.41%  99.41%  99.56%

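The manual sweep behind Table V can be sketched with scikit-learn. This is a minimal illustration, not the authors' exact notebook: the synthetic data stands in for the preprocessed Migbase table, and only a subset of the paper's parameter grid is shown to keep it short:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed Migbase data (3 headache classes)
X, y = make_classification(n_samples=120, n_features=14, n_informative=6,
                           n_classes=3, random_state=0)

# Score each parameter combination with 5-fold cross-validation accuracy
results = {}
for n_estimators in (10, 100):
    for max_depth in (4, None):
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth,
                                       random_state=0)
        results[(n_estimators, max_depth)] = cross_val_score(
            model, X, y, cv=5, scoring="accuracy").mean()

best = max(results, key=results.get)
print("best (n_estimators, max_depth):", best, round(results[best], 4))

# Refit the best model and read the feature importances (they sum to 1)
best_model = RandomForestClassifier(n_estimators=best[0], max_depth=best[1],
                                    random_state=0).fit(X, y)
print(best_model.feature_importances_.round(3))
```

That the importances sum to 1 is why the paper's highest per-feature value of 0.25 is plausible for a 14-feature model.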
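The per-class accuracies quoted in Section III-A follow directly from the Table VI confusion matrix: each class's accuracy is the number of correct predictions for that class divided by the number of samples that actually belong to it. A short sketch:

```python
# Rows: predicted class; columns: actual class (layout as in Table VI)
classes = ["migraine", "tension", "cluster"]
cm = [
    [122, 0, 1],   # predicted migraine
    [1,  32, 0],   # predicted tension
    [0,   0, 13],  # predicted cluster
]

for j, name in enumerate(classes):
    actual_total = sum(cm[i][j] for i in range(len(classes)))
    per_class_accuracy = cm[j][j] / actual_total
    print(f"{name}: {per_class_accuracy:.2%}")
# migraine: 99.19%, tension: 100.00%, cluster: 92.86%
```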

TABLE VI. CONFUSION MATRIX OF THE TEST RESULT ON THE BEST MODEL.

                        Actual Class
                   Migraine  Tension  Cluster
Predicted Migraine   122        0        1
Class     Tension      1       32        0
          Cluster      0        0       13

E. Features Importance

One of the outputs of the Random Forest algorithm is feature importance [15]. The feature importance value for each feature can be seen in Fig. 3.

Fig. 3. Feature importance value for each feature.

From the results in Fig. 3 it can be seen that not all features contribute to the model. Features with an importance value of 0 have no effect on the resulting model; conversely, features with high importance values are very influential in the model. The highest importance value, 0.25, is produced by the characterisation feature, and 15 features have an importance value of 0.

F. Model Visualization

The best Random Forest model is used to classify the types of headaches: the model with n_estimators = 100, max_features = 33, and max_depth = None. A piece of one of its Decision Trees can be seen in Fig. 4.

Fig. 4. Visualization of one of the Decision Trees in the best Random Forest model.

The resulting model can be used by doctors to diagnose the type of headache. In clinical practice doctors do not generally need a model for classification; however, the model will be needed when the doctor cannot make a definitive diagnosis [16]. The model can also be used by lay people to find out the type of headache they suffer from, in order to take appropriate treatment.

IV. CONCLUSION

In this research, the Random Forest algorithm was used to produce a reliable model for classifying headache types and to generate feature importance. The best performance produced 99.56% accuracy, and the lowest performance produced an accuracy of 97.94%. The most efficient parameters were 100 for n_estimators, 33 for max_features, and 5 for max_depth. In the best Random Forest model, the highest feature importance value, 0.25, was generated by the characterisation feature, and 15 features produced the lowest importance value of 0.

REFERENCES

[1] F. Ahmed, "Headache disorders: differentiating and managing the common subtypes," Br. J. Pain, vol. 6, no. 3, pp. 124–132, 2012.
[2] WHO, "Headache disorders," WHO, 2016.
[3] P. S. K. Patra, D. P. Sahu, and I. Mandal, "An Expert System for Diagnosis Of Human Diseases," Int. J. Comput. Appl., vol. 1, no. 13, pp. 71–74, 2010.
[4] B. Krawczyk, D. Simić, S. Simić, and M. Woźniak, "Automatic diagnosis of primary headaches by machine learning methods," Cent. Eur. J. Med., vol. 8, no. 2, pp. 157–165, 2013.
[5] G. Vandewiele et al., "A decision support system to follow up and diagnose primary headache patients using semantically enriched data," vol. 6, pp. 1–15, 2018.
[6] A. J. Aljaaf, D. Al-Jumeily, A. J. Hussain, P. Fergus, M. Al-Jumaily, and N. Radi, "A systematic comparison and evaluation of supervised machine learning classifiers using headache dataset," Lect. Notes Comput. Sci., vol. 9227, pp. 101–108, 2015.
[7] L. Breiman, "Random Forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[8] S. Raschka and V. Mirjalili, Python Machine Learning, 2nd ed. Birmingham: Packt Publishing Ltd., 2017.
[9] M. M. Mukaka, "Statistics corner: A guide to appropriate use of correlation coefficient in medical research," Malawi Med. J., vol. 24, no. 3, pp. 69–71, 2012.
[10] F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011.
[11] P. Probst and A.-L. Boulesteix, "To tune or not to tune the number of trees in random forest," J. Mach. Learn. Res., vol. 18, pp. 1–8, 2018.
[12] C. Kertesz, "Rigidity-Based Surface Recognition for a Domestic Legged Robot," IEEE Robot. Autom. Lett., vol. 1, no. 1, pp. 309–315, 2016.
[13] B. E. McGeeney, "An introduction to headache classification," Tech. Reg. Anesth. Pain Manag., vol. 13, no. 1, pp. 2–4, 2009.
[14] J.-F. Coeurjolly and A. Leclercq-Samson, "Tuning parameters in random forests," vol. 60, pp. 144–162, 2018.
[15] S. Ronaghan, "The Mathematics of Decision Trees, Random Forest and Feature Importance in Scikit-learn and Spark," Medium, 2018.
[16] J. Olesen, "Headache Classification Committee of the International Headache Society (IHS): The International Classification of Headache Disorders, 3rd edition," Cephalalgia, vol. 38, no. 1, pp. 1–211, 2018.
