Gini vs Entropy
Abstract—Decision tree is a supervised machine learning algorithm suitable for solving classification and regression problems. Decision trees are built recursively by applying split conditions at each node that divide the training records into subsets whose output variable belongs to the same class. The process starts at the root node of the decision tree and progresses by applying split conditions at each non-leaf node, resulting in homogenous subsets. However, achieving perfectly homogenous subsets is not possible. Therefore, the goal at each node is to identify an attribute and a split condition on that attribute that minimizes the mixing of class labels, thus resulting in nearly pure subsets. Several splitting indices have been proposed to evaluate the goodness of a split, the most common being the GINI index and Information gain. The aim of this study is to conduct an empirical comparison of the GINI index and Information gain. Classification models are built using the decision tree classifier algorithm, applying the GINI index and Information gain individually. The classification accuracy of the models is estimated using different metrics such as the confusion matrix, overall accuracy, per-class accuracy, recall and precision. The results of the study show that, regardless of whether the dataset is balanced or imbalanced, the classification models built by applying the two splitting indices, GINI index and Information gain, give the same accuracy. In other words, the choice of splitting index has no impact on the performance of the decision tree classifier algorithm.

Keywords—Supervised learning; classification; decision tree; information gain; GINI index

I. INTRODUCTION

Machine learning problems can be broadly classified into two categories, viz. supervised learning and unsupervised learning, as shown in Fig. 1. With supervised learning techniques, the training data is labeled: each observation in the data set has both descriptive variables (i.e., independent or decision variables) and a labeled outcome variable. Labels can be either categories or continuous values [1]. With supervised learning, a labeled data set is used to train the model to make predictions. A learning model maps the input variables to the output variable, with the aim of accurately predicting the output for future input values.

Unlike supervised learning, with unsupervised learning the data is not labeled. This means that the training data has descriptive variables only and no outcome variable. The model has to determine the patterns and interesting structures in the data that are not known beforehand [2].

Classification is a supervised learning problem in which the objective is to analyse the training data and develop a model that can predict future behavior; here the training dataset is labeled. The decision tree algorithm is commonly used for classification tasks. Decision trees classify data into a finite number of classes based on the values of the input variables, and are most appropriate for categorical data [3].

A decision tree is a simple flowchart that selects class labels of an output variable using the values of one or more input variables. The classification process starts at the root node of the decision tree and recursively progresses until it reaches a leaf node with a class label. At each node a split condition is applied to decide whether the input value should continue towards the left or right subtree until it reaches a leaf node [4].

The split condition applied at each node should result in homogenous subsets. Homogenous subsets contain records of the same class label. However, it is impossible to achieve perfectly homogenous subsets with real data; some degree of mixing will always remain. Therefore, while building the decision tree, the goal at each node is to select the split condition that best divides the dataset into homogenous subsets. The "goodness of split criterion" was introduced for this purpose, derived from the notion of impurity [5]. Impurity is measured mathematically for each candidate split condition, and the split condition with the lowest impurity value is chosen.

To measure the impurity value of a split condition several indices have been proposed, viz. the GINI index, Information gain, gain ratio and misclassification rate. This paper empirically examines the effect of the GINI index and Information gain on the classification task. The classification accuracy is measured to check the suitability of the models in making good predictions.

The rest of the paper is organised as follows: Section II introduces the theoretical notions of Information gain and the GINI index. Section III is the literature review. Sections IV and V give the details of the data and the experimental procedure used to compare Information gain and the GINI index on balanced and imbalanced data sets, along with the results obtained, and Section VI summarizes the results of the study.
Fig. 1. Categories of Machine Learning: Supervised and Unsupervised Learning.
II. THEORETICAL NOTIONS

This section briefly discusses the theoretical notions of Information gain and the GINI index. Raileanu and Stoffel [6] presented a theoretical comparison of the GINI index and Information gain.

Let L be a learning sample, L = {(x1, c1), (x2, c2), …, (xi, cj)}, where x1, x2, …, xi are measurement vectors and c1, c2, …, cj are class labels. Each xi can be viewed as a vector of input variables, and split conditions are based on one of these variables. If pi is the probability that an arbitrary tuple belongs to class ci, pi can be measured as

$$p_i = \frac{|L_i|}{|L|}$$

where |L_i| is the number of records in L with class label c_i.

A. Entropy

Information gain is based on Entropy. Entropy measures the extent of impurity or randomness in a dataset [7]. If the observations of the subsets of a dataset are homogenous, there is no impurity or randomness in the dataset; if all the observations of a subset belong to one class, the entropy of that subset is 0. Entropy is defined as the negative sum, over the class labels, of the probability of each label times the log probability of that same label:

$$Entropy(L) = -\sum_{i=1}^{j} p_i \log_2(p_i)$$

For a dataset with a single class label, p_i will be 1 and Entropy(L) is 0; hence the Entropy of a homogenous data set is zero [8]. The higher the entropy, the higher the uncertainty/impurity/mixing [9].

B. Information Gain

Information gain is based on Entropy. Information gain is the difference between the Entropy of a class and the conditional entropy of the class given the selected feature. It measures the usefulness of a feature f for classification [10], i.e., the difference in Entropy from before to after the split of set L on feature f. In other words, it measures the reduction of uncertainty after splitting the set on a feature. The larger the information gain, the more useful the feature f is for classification, and the feature with the highest information gain is the best feature to select for the split. Assuming that there are V different values for a feature f, and that |Lv| denotes the size of the subset of L with f = v, the Information gain after splitting L on feature f is measured as [8]

$$Gain(L, f) = Entropy(L) - \sum_{v=1}^{V} \frac{|L_v|}{|L|}\, Entropy(L_v)$$

C. GINI Index

The GINI index determines the purity of a specific class after splitting along a particular attribute. The best split increases the purity of the sets resulting from the split. If L is a dataset with j different class labels, GINI is defined [3] as

$$GINI(L) = 1 - \sum_{i=1}^{j} p_i^{2}$$
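To make these measures concrete, the following short R sketch (illustrative only, not the paper's code; the function names are ours) computes entropy, the GINI index and information gain for a vector of class labels, following the formulas above with base-2 logarithms.

# Impurity measures for a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)      # class proportions p_i
  p <- p[p > 0]
  -sum(p * log2(p))                        # -sum_i p_i log2(p_i)
}

gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)                             # 1 - sum_i p_i^2
}

# Information gain of splitting 'labels' on a categorical feature 'f'
info_gain <- function(labels, f) {
  weighted <- sum(sapply(split(labels, f), function(lv)
    length(lv) / length(labels) * entropy(lv)))
  entropy(labels) - weighted               # Entropy(L) - sum_v |L_v|/|L| * Entropy(L_v)
}

# A homogenous subset has zero entropy and zero GINI impurity:
entropy(c("yes", "yes", "yes"))            # 0
gini(c("no", "no", "no"))                  # 0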
A study empirically compared different feature selection measures and proposed a variant of the GINI index which uses GINI index ratios for feature selection. The study compared the classification accuracy of the modified GINI with other classification algorithms: ID3, C4.5 and GINI. The results show that ID3 and C4.5, which are based on Information gain, have lower classification and prediction accuracy than the GINI index and the modified GINI index; the modified GINI index is reported to obtain the highest accuracy among all the algorithms compared [13]. Adhatrao et al. [14] present experiments comparing the performance of two decision tree algorithms, ID3 and C4.5, in predicting the performance of first-year engineering students based on the performance achieved by older students who are now in their second year of engineering. The results show that both algorithms give the same accuracy. Hssina et al. [15] compared different decision tree algorithms, viz. ID3, C4.5, C5 and CART, and the results reported show that C4.5, which uses Information gain to evaluate the goodness of a split, achieved the highest classification accuracy.

The studies discussed above give varied results on the performance of Information gain and the GINI index. Moreover, these empirical studies compared models that were built using different tree-based algorithms. These algorithms differ in splitting attribute selection, number of splits (binary/ternary), order of splitting attributes (splitting the same attribute only once or multiple times), stopping criteria and pruning technique (pre/post) [14]. All these factors contribute to the performance of the models built with these algorithms.

The present study is unique in that it focuses only on the impact of the GINI index and Information gain on classification. Therefore, unlike other studies, this study develops classification models using a single algorithm, the decision tree classifier.

The attributes of the dataset used in the study are as follows:

Attribute              Description                                                                         Type
job                    Type of job of customer                                                             categorical
marital                Marital status                                                                      categorical
education              Educational qualification                                                           categorical
default                Has credit in default                                                               categorical
housing                Has housing loan                                                                    categorical
loan                   Has personal loan                                                                   categorical
contact                Contact communication type (cell, telephone)                                        categorical
month                  Last contact month of the year                                                      categorical
day_of_week            Last contact day of the week                                                        categorical
campaign               Number of contacts performed during this campaign and for this client               numeric
pdays                  Number of days that passed by after the client was last contacted from a previous campaign   numeric
previous               Number of contacts performed before this campaign and for this client               numeric
poutcome               Outcome of the previous marketing campaign                                          categorical
emp.var.rate           Employment variation rate - quarterly indicator                                     numeric
cons.price.index       Consumer price index - monthly indicator                                            numeric
cons.conf.index        Consumer confidence index - monthly indicator                                       numeric
euribor3m              Euribor 3 month rate - daily indicator                                              numeric
nr.employed            Number of employees - quarterly indicator                                           numeric
y (outcome variable)   Has the client subscribed a term deposit? (binary: 'yes','no') (Yes=1, No=0)        categorical
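As a point of reference only (the paper does not list its preprocessing code), the data described above could be loaded in R roughly as follows; the file name and separator are assumptions based on the publicly available bank marketing data [16] and should be adjusted to the actual copy used.

# Load the bank marketing data (file name and separator are assumptions)
bank <- read.csv("bank-additional-full.csv", sep = ";",
                 stringsAsFactors = TRUE)

# Encode the outcome as in the table above: Yes = 1, No = 0
bank$y <- factor(bank$y, levels = c("yes", "no"), labels = c("1", "0"))

str(bank)   # inspect attribute types (categorical vs numeric)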
When developing a decision tree, the goal at each node is to identify the attribute and a split condition on that attribute that best divides the training set into pure subsets at that node [17]. Given a dataset with input variables and an outcome variable with a class label, the decision tree algorithm recursively divides the training set until each division contains examples of the same class label. If all the observations of a division belong to one class, it is a homogenous subset; if they belong to multiple classes, it is impure or heterogeneous [18]. To evaluate the goodness of the split, two splitting indices, the GINI index and Information gain, are used. Both the GINI index and Information gain are applied with the decision tree classifier algorithm and models are developed.

The dataset is split into two parts, training and test. The general practice is to divide the dataset in an 80:20 ratio: 80% training data and 20% test data (unseen data). Using the decision tree classifier algorithm, a classification model is built recursively from the training data, dividing the data until each division is pure (homogenous class), and its prediction accuracy is then tested on the unseen test data. In this experiment, the classification model is trained to predict whether customers would subscribe to a term deposit (Yes or No) using the 19 input variables.

The k-fold cross-validation method minimizes the bias associated with the random sampling of the training and hold-out data samples while comparing the predictive accuracy of two or more methods [3]. In our experiment the classification model is trained and tested 10 times: the training set is split into 10 mutually exclusive subsets of equal size and, each time, the model is trained on 9 subsets while the remaining subset is used for testing. The overall accuracy is simply the average of the 10 individual accuracies obtained.
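A minimal R sketch of this procedure is shown below; it is not the authors' code, and the data frame name bank (holding the attributes listed earlier) is an assumption. It performs an 80:20 split and then a manual 10-fold cross-validation, averaging the fold accuracies.

library(rpart)                             # decision tree classifier (CART-based)

set.seed(1)
idx   <- sample(nrow(bank), size = floor(0.8 * nrow(bank)))
train <- bank[idx, ]                       # 80% training data
test  <- bank[-idx, ]                      # 20% unseen test data

# 10-fold cross-validation on the training set
folds <- sample(rep(1:10, length.out = nrow(train)))
fold_acc <- sapply(1:10, function(k) {
  fit  <- rpart(y ~ ., data = train[folds != k, ], method = "class")
  pred <- predict(fit, newdata = train[folds == k, ], type = "class")
  mean(pred == train$y[folds == k])
})
mean(fold_acc)                             # overall accuracy = average of the 10 folds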
B. Decision Tree Classifier

Many algorithms have been proposed for creating decision trees. In this experiment the decision tree classifier, a supervised learning algorithm, is used. It is based on CART and can be used for creating both classification and regression trees [19]. rpart is a package in R that implements many of the ideas found in the CART model, and different splitting criteria can be applied while splitting the nodes of the tree using the rpart function [20]. The classification models built by applying Information gain and the GINI index are shown in Fig. 2 and Fig. 3, respectively.

Fig. 2. Decision Tree Visualization using Information Gain.

Fig. 3. Decision Tree Visualization using GINI Index.
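The sketch below (continuing the previous one) shows how the two splitting criteria can be selected through rpart's parms argument; the rpart.plot package is assumed here only for the tree drawings, since the paper does not state how Fig. 2 and Fig. 3 were produced.

library(rpart)
library(rpart.plot)                        # assumed here only for visualization

fit_info <- rpart(y ~ ., data = train, method = "class",
                  parms = list(split = "information"))   # Information gain
fit_gini <- rpart(y ~ ., data = train, method = "class",
                  parms = list(split = "gini"))          # GINI index

rpart.plot(fit_info)   # tree built with Information gain (cf. Fig. 2)
rpart.plot(fit_gini)   # tree built with GINI index (cf. Fig. 3)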
It is noted that both splitting measures select the same feature, 'Number of employees', with the same split condition at the root node: 'Number of employees' (nr.employed), a numeric attribute, is selected with the split condition nr.employed >= 5088.

C. Performance Evaluation Metrics

A classification model is built using a labeled dataset, meaning that each record in the training dataset has a class label associated with it. The model is later used to predict the class labels of new/unseen data. The predictive accuracy of a classification model is its ability to correctly predict the class label of unseen data. The common metrics for measuring the accuracy of classification models are the confusion matrix, overall accuracy, per-class accuracy, recall and precision [3] [21]. First the confusion matrix is created, from which all the other metrics are easily calculated.

Confusion Matrix

The confusion matrix gives a detailed view of the performance, with a breakdown of correct and incorrect predictions for each class. The performance is measured by comparing the predicted outcome values with the actual values. The information is tabulated in the form of a confusion matrix as shown in Table II.

TABLE. II. CONFUSION MATRIX

                       Actual Positive             Actual Negative
Predicted Positive     True positive count (TP)    False positive count (FP)
Predicted Negative     False negative count (FN)   True negative count (TN)
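For illustration, a confusion matrix in the layout of Table II can be obtained in R from one of the fitted models above (a sketch, assuming the objects from the earlier snippets):

pred <- predict(fit_info, newdata = test, type = "class")
cm   <- table(Predicted = pred, Actual = test$y)   # rows: predicted, columns: actual
cm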
In Table II, true positives (TP) is the number of positive examples correctly predicted by the model, false negatives (FN) is the number of positive examples wrongly predicted as negative, false positives (FP) is the number of negative examples wrongly predicted as positive, and true negatives (TN) is the number of negative examples correctly predicted [22].

Overall Accuracy

$$Overall\ Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

D. Performance Evaluation on the Test Set

The test set has a total of 8237 observations. The confusion matrices of the decision tree classifier with Information gain and with the GINI index are shown in Table III and Table IV. The positive/majority class is represented as 0 and the negative/minority class as 1.

TABLE. III. CONFUSION MATRIX OF CLASSIFICATION RESULTS OBTAINED BY DECISION TREE CLASSIFIER WITH INFORMATION GAIN

                Actual 0    Actual 1    Total
Predicted 0     7198        718         7916
Predicted 1     119         202         321
Total           7317        920         8237

Cohen's Kappa (κ) coefficient was originally introduced to measure the level of inter-observer agreement, its value ranging from 0 to 1 [25]. If κ is 0, the agreement between observed and expected is only by chance; if it is 1, the agreement is perfect. A κ value between 0 and 0.2 indicates slight agreement, 0.2 to 0.4 fair agreement, and 0.6 to 0.8 substantial agreement [26]. The Kappa (κ) statistic takes the chance agreement into account and is defined as

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where p_o is the observed agreement (the overall accuracy) and p_e is the agreement expected by chance.
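As a worked illustration (not from the paper), the counts in Table III can be plugged into these definitions, treating class 0 as the positive class:

# Counts from Table III (Information gain model); positive class = 0
TP <- 7198; FP <- 718                     # predicted 0
FN <- 119;  TN <- 202                     # predicted 1
n  <- TP + FP + FN + TN                   # 8237 test observations

overall_acc  <- (TP + TN) / n             # approx. 0.898
majority_acc <- TP / (TP + FN)            # recall / sensitivity, approx. 0.984
minority_acc <- TN / (TN + FP)            # specificity, approx. 0.220
precision    <- TP / (TP + FP)            # approx. 0.909

# Cohen's kappa: observed agreement vs agreement expected by chance
p_o   <- overall_acc
p_e   <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n^2
kappa <- (p_o - p_e) / (1 - p_e)          # approx. 0.28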
The confusion matrices obtained on the test set with the different resampling techniques are shown below (rows: predicted class; columns: actual class; one 2x2 matrix per resampling technique, in the order Over, Under, Both, SMOTE):

                          Over           Under          Both           SMOTE
                          0      1       0      1       0      1       0      1
Information gain     0    5858   311     6122   332     6041   322     6720   459
                     1    1459   609     1195   588     1276   598      597   461
GINI index           0    5858   311     6139   339     6064   330     6720   459
                     1    1459   609     1178   581     1253   590      597   461
Tables VIII and IX summarize the results obtained by the classification models after applying the different resampling techniques. The results in the tables show that balancing the data set decreased the majority class accuracy but improved the minority class accuracy; balancing the data set improves the minority class accuracy by increasing the count of true negatives. As discussed earlier, it is relatively simple to achieve high overall accuracy with imbalanced data sets, but classifying the data reliably is difficult. Thus, after balancing the dataset, the objective of classifying the data reliably is achieved, as the minority class accuracy has improved.
TABLE. VIII. RESULTS OBTAINED WITH DIFFERENT RESAMPLING TECHNIQUES USING INFORMATION GAIN

Resampling technique   Overall Accuracy   Majority class accuracy   Minority class accuracy   Recall   Precision   Kappa
Over                   78.5               80.0                      66.2                      80.0     94.9        29.9
Under                  81.4               83.6                      63.9                      83.6     94.8        33.7
Both                   80.6               82.5                      65.0                      82.5     94.9        32.7
SMOTE                  87.18              91.8                      50.1                      91.8     93.6        39.3

TABLE. IX. RESULTS OBTAINED WITH DIFFERENT RESAMPLING TECHNIQUES USING GINI INDEX

Resampling technique   Overall Accuracy   Majority class accuracy   Minority class accuracy   Recall   Precision   Kappa
Over                   78.5               80.0                      66.2                      80.0     94.9        29.9
Under                  81.5               83.9                      63.1                      83.9     94.7        33.6
Both                   80.7               82.8                      64.1                      82.8     94.8        32.6
SMOTE                  87.18              91.8                      50.1                      91.8     93.6        39.3
Further analysis of the results shows that SMOTE achieved the highest overall accuracy among all the resampling methods. Also, with the SMOTE technique the kappa value is 39%, which indicates that SMOTE is a relatively more reliable technique for balancing the dataset than the other three methods studied.
VI. CONCLUSIONS

The empirical results reported in this paper show that both Information gain and the GINI index produce the same accuracy for classification problems. The experiment was conducted before and after the data set was balanced, and the results obtained show that there is no significant difference in the performance of the models using the GINI index and Information gain in either case. The results are in line with Mingers [11], who stated that splitting indices have no impact on accuracy. In summary, the results obtained in this paper show that the classification accuracy of decision trees, for both balanced and imbalanced data sets, is not sensitive to the choice of the feature selection metrics that were studied.

Another interesting observation is that balancing the dataset lowered the majority class accuracy, with a decrease in the count of true positives, while the minority class accuracy improved, with an increase in the true negative count. In other words, the sensitivity decreased and the specificity improved after the data set was balanced. Despite the decrease in overall accuracy, there is clearly a significant rise in the minority class accuracy. This shows that classification accuracy is sensitive to the number of positive and negative samples in the data set and to the type of data, balanced or imbalanced.

REFERENCES
[1] James, G., Witten, D., Hastie, T., and Tibshirani, R.: 'Tree-based methods', in 'An introduction to statistical learning' (Springer, 2013), pp. 303-335.
[2] Doherty, C., Camina, S., White, K., and Orenstein, G.: 'The path to predictive analytics and machine learning' (O'Reilly Media, 2017).
[3] Turban, E., Sharda, R., and Delen, D.: 'Business intelligence and analytics: systems for decision support' (Pearson Higher Ed, 2014).
[4] Loh, W.-Y., and Shih, Y.-S.: 'Split selection methods for classification trees', Statistica Sinica, 1997, pp. 815-840.
[5] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J.: 'Classification and regression trees' (Wadsworth International Group, Belmont, CA, 1984), pp. 151-166.
[6] Raileanu, L.E., and Stoffel, K.: 'Theoretical comparison between the Gini index and information gain criteria', Annals of Mathematics and Artificial Intelligence, 2004, 41, (1), pp. 77-93.
[7] Shannon, C.E.: 'A note on the concept of entropy', Bell System Tech. J., 1948, 27, (3), pp. 379-423.
[8] Wang, Y., Li, Y., Song, Y., Rong, X., and Zhang, S.: 'Improvement of ID3 algorithm based on simplified information entropy and coordination degree', Algorithms, 2017, 10, (4), p. 124.
[9] Fan, R., Zhong, M., Wang, S., Zhang, Y., Andrew, A., Karagas, M., Chen, H., Amos, C., Xiong, M., and Moore, J.: 'Entropy-based information gain approaches to detect and to characterize gene-gene and gene-environment interactions/correlations of complex diseases', Genetic Epidemiology, 2011, 35, (7), pp. 706-721.
[10] Lefkovits, S., and Lefkovits, L.: 'Gabor feature selection based on information gain', Procedia Engineering, 2017, 181, pp. 892-898.
[11] Mingers, J.: 'An empirical comparison of selection measures for decision-tree induction', Machine Learning, 1989, 3, (4), pp. 319-342.
[12] Patil, N., Lathi, R., and Chitre, V.: 'Comparison of C5.0 & CART classification algorithms using pruning technique', Int. J. Eng. Res. Technol., 2012, 1, (4), pp. 1-5.
[13] Suneetha, N., Hari, V., and Kumar, V.S.: 'Modified Gini index classification: a case study of heart disease dataset', International Journal on Computer Science and Engineering, 2010, 2, (06), pp. 1959-1965.
[14] Adhatrao, K., Gaykar, A., Dhawan, A., Jha, R., and Honrao, V.: 'Predicting students' performance using ID3 and C4.5 classification algorithms', arXiv preprint arXiv:1310.2071, 2013.
[15] Hssina, B., Merbouha, A., Ezzikouri, H., and Erritali, M.: 'A comparative study of decision tree ID3 and C4.5', International Journal of Advanced Computer Science and Applications, 2014, 4, (2), pp. 13-19.
[16] Moro, S., Cortez, P., and Rita, P.: 'A data-driven approach to predict the success of bank telemarketing', Decision Support Systems, 2014, 62, pp. 22-31.
[17] Sharda, R.D.: 'Business intelligence and analytics: systems for decision support' (Prentice Hall, 2016).
[18] https://people.revoledu.com/kardi/tutorial/DecisionTree.
[19] https://dataaspirant.com/2017/02/03/decision-tree-classifier-implementation-in-r/.
[20] Therneau, T., Atkinson, B., Ripley, B., and Ripley, M.B.: 'Package 'rpart'', available online: cran.ma.ic.ac.uk/web/packages/rpart/rpart.pdf (accessed on 20 April 2016), 2015.
[21] Zheng, A.: 'Evaluating machine learning models: a beginner's guide to key concepts and pitfalls', 2015.
[22] Tan, P.-N., Steinbach, M., and Kumar, V.: 'Introduction to data mining' (Pearson Education India, 2016).
[23] Blagus, R., and Lusa, L.: 'Improved shrunken centroid classifiers for high-dimensional class-imbalanced data', BMC Bioinformatics, 2013, 14, pp. 64-64.
[24] McHugh, M.L.: 'Interrater reliability: the kappa statistic', Biochemia Medica, 2012, 22, (3), pp. 276-282.
[25] McGee, S.: 'Evidence-based physical diagnosis e-book' (Elsevier Health Sciences, 2012).
[26] Ensrud, K.E., and Taylor, B.C.: 'Epidemiologic methods in studies of osteoporosis', in 'Osteoporosis' (Elsevier, 2013), pp. 539-561.
[27] Zheng, Z., Cai, Y., and Li, Y.: 'Oversampling method for imbalanced classification', Computing and Informatics, 2016, 34, (5), pp. 1017-1037.
[28] Cordón, I.: 'Working with imbalanced datasets'.
[29] Wasikowski, M.: 'Combating the class imbalance problem in small sample data sets', University of Kansas, 2009.