
(IJACSA) International Journal of Advanced Computer Science and Applications,

Vol. 11, No. 2, 2020

Evaluating the Impact of GINI Index and Information Gain on Classification using Decision Tree Classifier Algorithm
Suryakanthi Tangirala
Faculty of Business, University of Botswana
Gaborone, Botswana

Abstract—Decision tree is a supervised machine learning algorithm suitable for solving classification and regression problems. Decision trees are built recursively by applying split conditions at each node that divide the training records into subsets whose output variable belongs to the same class. The process starts from the root node of the decision tree and progresses by applying split conditions at each non-leaf node, resulting in homogenous subsets. However, achieving perfectly homogenous subsets is usually not possible. Therefore, the goal at each node is to identify an attribute and a split condition on that attribute that minimizes the mixing of class labels, thus resulting in nearly pure subsets. Several splitting indices have been proposed to evaluate the goodness of a split, the common ones being GINI index and Information gain. The aim of this study is to conduct an empirical comparison of GINI index and Information gain. Classification models are built using the decision tree classifier algorithm by applying GINI index and Information gain individually. The classification accuracy of the models is estimated using different metrics such as the confusion matrix, overall accuracy, per-class accuracy, recall and precision. The results of the study show that, regardless of whether the dataset is balanced or imbalanced, the classification models built by applying the two different splitting indices, GINI index and Information gain, give the same accuracy. In other words, the choice of splitting index has no impact on the performance of the decision tree classifier algorithm.

Keywords—Supervised learning; classification; decision tree; information gain; GINI index

I. INTRODUCTION

Machine learning problems can be broadly classified into two categories, viz. supervised learning and unsupervised learning, as shown in Fig. 1. With supervised learning techniques, the training data is labeled: each observation in the data set has both descriptive variables (i.e., independent variables or decision variables) and a labeled outcome variable. Labels can be either categories or continuous values [1]. With supervised learning, a labeled data set is used to train the model in making predictions. A learning model maps the input variables to the output variable, with the aim of accurately predicting the output for future input variables.

Unlike supervised learning, with unsupervised learning the data is not labeled. This means that the training data has descriptive variables only and no outcome variable. The model has to determine the patterns and interesting structures in the data that are not known beforehand [2].

Classification is a supervised learning problem, where the objective is to analyse the training data and develop a model that can predict future behavior; here the training dataset is labeled. The decision tree algorithm is commonly used for classification tasks. Decision trees classify data into a finite number of classes based on the values of the input variables, and are most appropriate for categorical data [3].

A decision tree is a simple flowchart that selects class labels of an output variable using the values of one or more input variables. The classification process starts at the root node of the decision tree and recursively progresses until it reaches a leaf node with class labels. At each node a split condition is applied to decide whether the input value should continue towards the left or right subtree until it reaches the leaf nodes [4]. The split condition applied at each node should result in homogenous subsets, i.e., subsets whose records share the same class label. However, it is impossible to achieve pure homogenous subsets with real data; some mixing will always be there. Therefore, while building the decision tree, the goal at each node is to select the split condition that best divides the dataset into homogenous subsets. The "goodness of split criterion" was introduced for this purpose, derived from the notion of impurity [5]. Impurity is measured mathematically for each candidate split condition, and the split condition with the lowest impurity value is chosen.

To measure the impurity value of a split condition several indices have been proposed, viz. GINI index, Information gain, gain ratio and misclassification rate. This paper empirically examines the effect of GINI index and Information gain on the classification task. The classification accuracy is measured to check the suitability of the models in making good predictions.

The rest of the paper is organised as follows: Section II introduces the theoretical notions of Information gain and GINI index. Section III is the literature review. Sections IV and V give the details of the data and the experimental procedure to compare Information gain and GINI index on balanced and imbalanced data sets, along with the results obtained, and Section VI summarizes the results of the study.


[Fig. 1 is a diagram showing the broad classification of machine learning techniques: supervised learning, divided into classification (predicting a categorical variable, e.g., decision trees, support vector machines, Naïve Bayes) and regression (predicting a numeric variable, e.g., linear regression, decision trees, random forests), and unsupervised learning, divided into clustering (e.g., K-Means, artificial neural networks) and dimension reduction (e.g., principal component analysis).]

Fig. 1. Broad Classification of Machine Learning Techniques.

II. THEORETICAL NOTATION

This section briefly discusses the theoretical notions of Information gain and GINI index. Raileanu and Stoffel [6] presented a theoretical comparison of GINI index and Information gain.

Let L be a learning sample, $L = \{(x_1, c_1), (x_2, c_2), \ldots, (x_i, c_j)\}$, where $x_1, x_2, \ldots, x_i$ are measurement vectors and $c_1, c_2, \ldots, c_j$ are class labels. Each $x_i$ can be viewed as a vector of input variables, and split conditions are based on one of these variables. If $p_i$ is the probability that an arbitrary tuple belongs to class $c_i$, then $p_i$ can be measured as the relative frequency of class $c_i$ in L, i.e., $p_i = |L_i| / |L|$, where $|L_i|$ is the number of tuples of class $c_i$ in L.

A. Entropy

Information gain is based on entropy. Entropy measures the extent of impurity or randomness in a dataset [7]. If the observations in a subset of the dataset are homogenous, there is no impurity or randomness in that subset; if all the observations belong to one class, the entropy of that dataset is 0. Entropy is defined as the negative of the sum, over the class labels, of the probability of each label times the log probability of that same label:

$Entropy(L) = -\sum_{i=1}^{j} p_i \log_2 p_i$

For a dataset with a single class label, $p_i$ is 1 and $\log_2(p_i)$ is 0; hence the entropy of a homogenous dataset is zero [8]. The higher the entropy, the higher the uncertainty/impurity/mixing [9].

B. Information Gain

Information gain is based on entropy: it is the difference between the entropy of the class and the conditional entropy of the class given the selected feature. It measures the usefulness of a feature f for classification [10], i.e., the difference in entropy from before to after splitting the set L on the feature f. In other words, it measures the reduction of uncertainty after splitting the set on a feature. The higher the information gain, the more useful the feature f is for classification, and the feature with the highest information gain is the best feature to select for the split. Assuming that a feature f takes V different values and $L_v$ represents the subset of L with f = v, the information gain after splitting L on the feature f is measured as [8]

$Gain(L, f) = Entropy(L) - \sum_{v \in V} \frac{|L_v|}{|L|} Entropy(L_v)$

C. GINI Index

The GINI index determines the purity of a specific class after splitting along a particular attribute; the best split increases the purity of the sets resulting from the split. If L is a dataset with j different class labels, GINI is defined [3] as

$GINI(L) = 1 - \sum_{i=1}^{j} p_i^2$

where $p_i$ is the relative frequency of class i in L. If the dataset is split on attribute A into two subsets $L_1$ and $L_2$ with sizes $N_1$ and $N_2$ respectively, the GINI of the split is calculated as

$GINI_A(L) = \frac{N_1}{N} GINI(L_1) + \frac{N_2}{N} GINI(L_2)$

The reduction in impurity is calculated as

$\Delta GINI(A) = GINI(L) - GINI_A(L)$
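To make these definitions concrete, the following R sketch (an illustration, not part of the original paper) computes entropy, information gain and the GINI index for a toy vector of class labels and a single categorical feature, following the formulas above.

```r
# Sketch: entropy, information gain and GINI index for one categorical feature.
entropy <- function(labels) {
  p <- table(labels) / length(labels)        # class proportions p_i
  -sum(ifelse(p > 0, p * log2(p), 0))        # -sum p_i * log2(p_i)
}

gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)                               # 1 - sum p_i^2
}

info_gain <- function(feature, labels) {
  # Entropy(L) - sum_v (|L_v| / |L|) * Entropy(L_v)
  weighted <- sum(sapply(split(labels, feature),
                         function(lv) length(lv) / length(labels) * entropy(lv)))
  entropy(labels) - weighted
}

# Toy example: 8 records, binary class, one binary feature
y <- c("yes", "yes", "yes", "no", "no", "no", "no", "no")
f <- c("a", "a", "a", "a", "b", "b", "b", "b")
entropy(y)        # impurity of the parent node
gini(y)
info_gain(f, y)   # reduction in entropy after splitting on f
```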
III. LITERATURE REVIEW

This section briefly presents some of the empirical studies that compared the performance of decision tree algorithms which use different impurity metrics for feature selection at non-leaf nodes. An attempt is made to find out from past studies whether the choice of these feature selection metrics has any impact on the accuracy of the model.

Mingers [11] tested different feature selection measures empirically and reported that the choice of the feature selection measure affects the size of the tree but not its accuracy; the accuracy remained the same even when attributes were randomly selected. Patil et al. [12] studied two decision tree based classification algorithms, C5.0 and CART. C5.0 uses information gain and the CART algorithm uses GINI index to select the features for split conditions. Their study was an experiment comparing the C5.0 and CART classification algorithms in classifying whether a customer qualifies for a membership card or not. The study revealed that C5.0 gives a higher classification accuracy of 99.6% than the CART algorithm with 94.8% accuracy.

Another study empirically compared different feature selection measures and proposed a variant of GINI index which uses GINI index ratios for feature selection. In this study the authors compared the classification accuracy of the modified GINI with other classification algorithms, ID3, C4.5 and GINI. The results show that ID3 and C4.5, which are based on Information gain, have lower classification and prediction accuracy than GINI index and the modified GINI index, with the modified GINI index reported to obtain the highest accuracy among all the algorithms that were compared [13]. Adhatrao et al. [14] present experiments comparing the performance of two decision tree algorithms, ID3 and C4.5, in predicting the performance of first year engineering students based on the performance achieved by students who are now in second year engineering; the results show that both algorithms give the same accuracy. In a study by Hssina et al. [15], different decision tree algorithms, viz. ID3, C4.5, C5 and CART, were compared, and the results reported show that C4.5, which uses information gain to evaluate the goodness of a split, achieved the highest classification accuracy.

The studies discussed above give varied results on the performance of Information gain and GINI index. Moreover, these empirical studies compared models that were built using different tree based algorithms. These algorithms differ in splitting attribute selection, number of splits (binary/ternary), order of splitting attributes (splitting the same attribute only once or multiple times), stopping criteria and pruning technique (pre/post) [14]. All these factors contribute to the performance of the models built using these algorithms.

The present study is unique as it focuses only on finding the impact of GINI index and Information gain on classification. Therefore, unlike other studies, this study develops classification models using a single algorithm, the decision tree classifier, on which GINI index and Information gain are applied individually. This neutralizes the impact of all other factors on the models.

IV. EXPERIMENTAL SETUP

This section gives the details of the data and the experimental procedure.

A. Dataset Description

The experiment is conducted using real data provided by the UCI Machine Learning Repository [16]. The data was collected by a Portuguese banking institution by making phone calls to customers. The dataset is relatively large, with 41187 rows and 21 columns. One input variable, 'duration', is discarded, as it is highly multi-valued and should be avoided for good prediction. Details of the remaining variables are given in Table I. The classification goal is to predict whether a customer will subscribe to a term deposit (y) based on the remaining 19 input variables. The dataset is clean; it doesn't have null values. Term deposit (y) is the outcome variable with two class labels (yes or no); therefore, it is a binary classification problem.

TABLE I. DESCRIPTION OF THE DATASET

Variable | Description | Type
age | Age of the customer | numeric
job | Type of job of the customer | categorical
marital | Marital status | categorical
education | Educational qualification | categorical
default | Has credit in default | categorical
housing | Has housing loan | categorical
loan | Has personal loan | categorical
contact | Contact communication type (cell, telephone) | categorical
month | Last contact month of the year | categorical
day_of_week | Last contact day of the week | categorical
campaign | Number of contacts performed during this campaign and for this client | numeric
pdays | Number of days that passed after the client was last contacted from a previous campaign | numeric
previous | Number of contacts performed before this campaign and for this client | numeric
poutcome | Outcome of the previous marketing campaign | categorical
emp.var.rate | Employment variation rate - quarterly indicator | numeric
cons.price.index | Consumer price index - monthly indicator | numeric
cons.conf.index | Consumer confidence index - monthly indicator | numeric
euribor3m | Euribor 3 month rate - daily indicator | numeric
nr.employed | Number of employees - quarterly indicator | numeric
y (outcome variable) | Has the client subscribed a term deposit? (binary: 'yes', 'no') (Yes=1, No=0) | categorical
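As an illustration, loading the data in R might look like the sketch below; the file name bank-additional-full.csv and the semicolon separator are assumptions about the public UCI file, since the paper does not state them.

```r
# Illustration only: file name, separator and column handling are assumptions
# about the public UCI "Bank Marketing" file, not taken from the paper.
bank <- read.csv("bank-additional-full.csv", sep = ";", stringsAsFactors = TRUE)

bank$duration <- NULL                        # drop 'duration', as described above
bank$y <- factor(bank$y, levels = c("no", "yes"))

dim(bank)                                    # roughly 41k rows, 20 columns after the drop
table(bank$y)                                # heavily imbalanced outcome
```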


When developing a decision tree, the goal at each node is to identify the attribute, and a split condition on that attribute, that best divides the training set into pure subsets at that node [17].

Given a dataset with input variables and an outcome variable with a class label, the decision tree algorithm recursively divides the training set until each division contains examples of the same class label. If all the observations of a division belong to one class, it is a homogenous subset; if they belong to multiple classes it is impure or heterogeneous [18]. To evaluate the goodness of the split, two splitting indices, GINI index and Information gain, are used. Both GINI index and Information gain are applied to the decision tree classifier algorithm and models are developed.

The dataset is split into two parts, training and test. The general practice is to divide the dataset in an 80:20 ratio, 80% training data and 20% test data (unseen data). Using the decision tree classifier algorithm, a classification model is built recursively from the training data, dividing the data until each division is pure (a homogenous class), and its prediction accuracy is then tested on the unseen test data. In this experiment, the classification model is trained to predict whether customers would subscribe to a term deposit (Yes or No) using the 19 input variables.

A k-fold cross validation method minimizes the bias associated with random sampling of the training and hold-out data samples when comparing the predictive accuracy of two or more methods [3]. In our experiment the classification model is trained and tested 10 times: the training set is split into 10 exclusive subsets of equal size, and each time the model is trained on 9 subsets, leaving 1 subset to be used for testing. Overall accuracy is simply the average of the 10 individual accuracies obtained; a sketch of this procedure is given below.
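The following R sketch is an assumed implementation of this procedure, not the author's code; it reuses the hypothetical bank data frame from the earlier loading sketch and the rpart classifier described in the next subsection.

```r
# Assumed implementation: 80:20 hold-out split plus a simple 10-fold
# cross-validation loop around rpart, averaging the fold accuracies.
library(rpart)

set.seed(1)
n         <- nrow(bank)
train_idx <- sample(n, size = round(0.8 * n))
train     <- bank[train_idx, ]               # 80% training data
test      <- bank[-train_idx, ]              # 20% unseen test data

folds <- sample(rep(1:10, length.out = nrow(train)))   # assign each row to a fold
cv_acc <- sapply(1:10, function(k) {
  fit  <- rpart(y ~ ., data = train[folds != k, ], method = "class",
                parms = list(split = "information"))
  pred <- predict(fit, newdata = train[folds == k, ], type = "class")
  mean(pred == train$y[folds == k])
})
mean(cv_acc)   # overall accuracy = average of the 10 fold accuracies
```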
B. Decision Tree Classifier

Many algorithms have been proposed for creating decision trees. In this experiment, the decision tree classifier, a supervised learning algorithm, is used. It is based on CART and can be used for creating both classification and regression trees [19]. rpart is a package in R which implements many of the ideas found in the CART model, and different splitting criteria can be applied while splitting the nodes of the tree using the rpart function [20]. The classification models built by applying Information gain and GINI index are shown in Fig. 2 and Fig. 3, respectively.

Fig. 2. Decision Tree Visualization using Information Gain.

Fig. 3. Decision Tree Visualization using GINI Index.

It is noted that both splitting measures select the same feature, 'Number of employees', with the same split condition at the root node: 'Number of employees', which is a numeric attribute, is selected with the split condition nr.employed >= 5088.
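The split criterion can be selected in rpart through the parms argument, as sketched below; rpart.plot is assumed here for the tree visualizations corresponding to Fig. 2 and Fig. 3, and train is the hypothetical training split from the earlier sketch.

```r
# Sketch: fitting the decision tree classifier with each split criterion.
library(rpart)
library(rpart.plot)

fit_info <- rpart(y ~ ., data = train, method = "class",
                  parms = list(split = "information"))   # Information gain
fit_gini <- rpart(y ~ ., data = train, method = "class",
                  parms = list(split = "gini"))           # GINI index

rpart.plot(fit_info)   # the paper reports the root split nr.employed >= 5088 for both
rpart.plot(fit_gini)
```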
C. Performance Evaluation Metrics

Classification is a technique where the model is developed using a labeled dataset: each record in the training dataset has a class label associated with it. The model is later used to predict the class labels of new/unseen data. The predictive accuracy of a classification model is its ability to correctly predict the class label of unseen data. The common metrics for measuring the accuracy of classification models are the confusion matrix, overall accuracy, per-class accuracy, recall and precision [3][21]. The confusion matrix is created first, and from it all the other metrics are easily calculated.

• Confusion matrix

The confusion matrix gives a detailed view of the performance, with a breakdown of correct and incorrect predictions for each class. The performance is measured by comparing the predicted outcome values with the actual values. The information is tabulated in the form of a confusion matrix as shown in Table II.

TABLE II. CONFUSION MATRIX

Predicted class \ Actual | Positive | Negative
Positive | True positive count (TP) | False positive count (FP)
Negative | False negative count (FN) | True negative count (TN)

615 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 11, No. 2, 2020

where true positives (TP) correspond to the number of positive examples correctly predicted by the model, false negatives (FN) represent the number of positive examples wrongly predicted as negative, false positives (FP) refer to the number of negative examples wrongly predicted as positive, and true negatives (TN) are the number of negative examples correctly predicted [22].

• Overall accuracy

Overall classifier accuracy is the rate at which the model makes accurate predictions. It is the ratio of the number of correct predictions to the total number of predictions made:

$Accuracy = \frac{TP + TN}{TP + FP + FN + TN}$

• Per-class accuracy

Per-class accuracy gives the average of the prediction accuracy of each class. It is particularly useful when the data sets are imbalanced. Overall accuracy is a micro average and per-class accuracy is a macro average; for a binary problem, the accuracy of the positive class is $TP/(TP + FN)$, the accuracy of the negative class is $TN/(TN + FP)$, and per-class accuracy is their average.

• Precision is defined as the ratio of correctly classified majority class values (true positives) to the sum of correctly classified majority class values (true positives) and incorrectly classified majority class values (false positives). It should be high:

$Precision = \frac{TP}{TP + FP}$

• Recall is defined as the ratio of correctly classified majority class values (true positives) to the sum of correctly classified majority class values (true positives) and incorrectly classified minority class values (false negatives). Recall estimates the classifier's accuracy in predicting the majority class. It should be high:

$Recall = \frac{TP}{TP + FN}$
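As a worked illustration of these definitions (not from the paper), the short R sketch below computes the metrics from a single 2x2 confusion matrix; the counts used are those reported later in Table III for the information gain model.

```r
# Worked example of the metric definitions above, using the Table III counts.
TP <- 7198; FP <- 718; FN <- 119; TN <- 202

overall_acc  <- (TP + TN) / (TP + FP + FN + TN)   # ~0.898  overall accuracy
precision    <- TP / (TP + FP)                    # ~0.909
recall       <- TP / (TP + FN)                    # ~0.98   majority class accuracy
minority_acc <- TN / (TN + FP)                    # ~0.22   minority class accuracy
per_class    <- (recall + minority_acc) / 2       # macro-averaged (per-class) accuracy
```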

D. Performance Evaluation on the Test Set

The test set has a total of 8237 observations. The confusion matrices of the decision tree classifier with Information gain and with GINI index are shown in Table III and Table IV. The positive/majority class is represented as 0 and the negative/minority class is represented as 1.

TABLE III. CONFUSION MATRIX OF CLASSIFICATION RESULTS OBTAINED BY DECISION TREE CLASSIFIER WITH INFORMATION GAIN

Predicted \ Actual | 0 | 1 | Total
0 | 7198 | 718 | 7916
1 | 119 | 202 | 321
Total | 7317 | 920 | 8237

TABLE IV. CONFUSION MATRIX OF CLASSIFICATION RESULTS OBTAINED BY DECISION TREE CLASSIFIER WITH GINI INDEX

Predicted \ Actual | 0 | 1 | Total
0 | 7190 | 709 | 7899
1 | 127 | 211 | 338
Total | 7317 | 920 | 8237

The accuracy, recall and precision values are shown in Table V.

TABLE V. RESULTS OF OTHER PERFORMANCE EVALUATION METHODS

Method | Overall classifier accuracy | Majority class accuracy | Minority class accuracy | Recall (sensitivity) | Precision
Information gain | 89.84 | 98.3 | 21.9 | 98.3 | 90.9
GINI index | 89.85 | 98.2 | 22.9 | 98.2 | 91.0

The results in Table V quite clearly show that there is no significant difference between the classification accuracies obtained by the two feature selection measures: overall accuracy as well as per-class accuracy values remain approximately the same. Other observations are in line with the literature, which says that classifiers trained on low-dimensional, imbalanced data classify most of the samples into the majority class [23]. Therefore, it is deceivingly simple to achieve high overall accuracy, although it is difficult to classify the data reliably. This is evident from the results obtained, where the majority class accuracy is very high (98.3%) compared to the minority class accuracy (approx. 22%). With an imbalanced data set, even when the minority class accuracy is very low, the overall accuracy will be high because of the high true positive count, as in our case. Hence, the kappa statistic, which takes chance agreement into account, is also measured.

• Kappa coefficient

The kappa coefficient is an interesting alternative for measuring the accuracy of classifier models. It is particularly useful when the data sets are imbalanced [24], and is used to quantify the reproducibility of a discrete variable.

Originally, Cohen's kappa (κ) coefficient was introduced to measure the level of inter-observer agreement, its value ranging from 0 to 1 [25]. If κ is 0, the agreement between observed and expected is only by chance; if it is 1, it is perfect agreement. A κ value between 0 and 0.2 indicates slight agreement, 0.2 to 0.4 fair agreement, and 0.6 to 0.8 substantial agreement [26]. The kappa (κ) statistic takes chance agreement into account and is defined as


$\kappa = \frac{p_o - p_e}{1 - p_e}$

where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance. The kappa coefficient is used to evaluate the accuracy of the models by measuring the agreement between the predicted values and the true values. Using the confusion matrices in Table III and Table IV, the kappa values for the two classifiers are calculated as follows.

For the classifier model based on Information gain (Table III), $p_o = (7198 + 202)/8237 = 0.8984$ and $p_e = (7916 \times 7317 + 321 \times 920)/8237^2 = 0.8580$, giving

$\kappa = \frac{0.8984 - 0.8580}{1 - 0.8580} = 0.284$

A kappa value of 0.28 indicates that the observed agreement is 28% of the way between chance and perfect agreement.

For the classifier model based on GINI index (Table IV), $p_o = (7190 + 211)/8237 = 0.8985$ and $p_e = (7899 \times 7317 + 338 \times 920)/8237^2 = 0.8564$, giving

$\kappa = \frac{0.8985 - 0.8564}{1 - 0.8564} = 0.293$

A kappa value of 0.29 indicates that the observed agreement is 29% of the way between chance and perfect agreement.

It is clearly evident from the results obtained that both classifier models obtain nearly equal results. In other words, the results clearly show that the classification accuracy of decision trees is not sensitive to the choice of feature selection measure.
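The same calculation can be scripted; the R sketch below (an illustration, not the author's code) reproduces the two kappa values from the confusion matrices in Tables III and IV.

```r
# Cohen's kappa from a 2x2 confusion matrix (rows = predicted, columns = actual).
kappa_from_cm <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                        # observed agreement
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2     # agreement expected by chance
  (po - pe) / (1 - pe)
}

cm_info <- matrix(c(7198, 119, 718, 202), nrow = 2)   # Table III (Information gain)
cm_gini <- matrix(c(7190, 127, 709, 211), nrow = 2)   # Table IV (GINI index)

kappa_from_cm(cm_info)   # ~0.284
kappa_from_cm(cm_gini)   # ~0.293
```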
The high overall accuracy (approx. 89%) and very low minority class accuracy (approx. 22%) show that the data is not classified reliably. This could be because the dataset used in the experiment is highly imbalanced, with 29231 positive (majority) samples and 3719 negative (minority) samples. In the next section we provide the details of methods for balancing the dataset and discuss the results of the experiment conducted after balancing the dataset.

V. BALANCING THE DATASET

Imbalanced datasets have an imbalanced class distribution, whereby more observations belong to one class than to the other. Classification algorithms suffer from the problem of imbalanced datasets, which leads to biases and poor generalization. Sometimes, in real world applications, the minority class is of most interest and classifying it correctly should be given high importance, allowing a small error rate in the classification of the majority class, since the cost of misclassifying the minority class can be relatively very high [27].

For a binary classification problem, if S is the training data and y is the response variable, [28] defines the imbalanced classification problem as follows. Let

S = {(x1, y1), …, (xm, ym)}, where yi ∈ {−1, 1} are the data labels,

S+ = {(x, y) ∈ S : y = 1} be the positive or minority instances, and

S− = {(x, y) ∈ S : y = −1} be the negative or majority instances.

If |S+| ≪ |S−|, the performance of the classification algorithm will be very poor, and the misclassification rate will be high, especially for the minority class. Therefore, to improve the performance, resampling methods are applied on the training dataset to generate a new set E with synthetic instances of the minority class, transforming the training dataset into S = (S+ ∪ E) ∪ S−.

A. Resampling

Imbalanced datasets have an imbalanced class distribution. The dataset used for this study is imbalanced, with 29231 positive samples and 3719 negative samples. In such situations it is difficult to classify the data reliably, although it is simple to attain high accuracy, so it is essential to balance the dataset in order to classify reliably. The distribution of classes can be balanced by randomly oversampling minority class observations, by randomly undersampling majority class observations, or by combining both over- and undersampling in a systematic manner [29]. Random oversampling creates the problem of overfitting the classifiers, while undersampling suffers from the loss of useful observations. Another heuristic method based on oversampling, SMOTE (Synthetic Minority Oversampling Technique), is widely used; it reduces overfitting to a certain extent and performs better than random oversampling. SMOTE generates synthetic observations of the minority class [27][23].

Before applying any of the resampling techniques, the training and test data must be split to avoid overfitting and poor generalization. After resampling, we have a nearly equal ratio of observations for each class in the training set. The number of observations after applying the resampling methods on the training set can be seen in Table VI.

TABLE VI. NUMBER OF OBSERVATIONS AFTER APPLYING RESAMPLING TECHNIQUES

Dataset | Number of features | Training set size | Number of positive samples | Number of negative samples | Imbalance ratio
Original | 20 | 32950 | 29231 | 3719 | 89:11
Over | 20 | 58462 | 29231 | 29231 | Equal
Under | 20 | 7438 | 3719 | 3719 | Equal
Both | 20 | 32950 | 16556 | 16394 | 50.2 : 49.8
SMOTE | 20 | 26033 | 14876 | 11157 | 57 : 43

B. Results: Performance Evaluation after Resampling

After balancing the dataset with the resampling techniques, the experiment described in Section IV is repeated and the accuracy is measured. The confusion matrices created after applying the resampling techniques are shown in Table VII.


TABLE VII. CONFUSION MATRICES WITH DIFFERENT RESAMPLING TECHNIQUES

Classifier | Resampling | Pred. 0 / Actual 0 | Pred. 0 / Actual 1 | Pred. 1 / Actual 0 | Pred. 1 / Actual 1
Information gain | Over | 5858 | 311 | 1459 | 609
Information gain | Under | 6122 | 332 | 1195 | 588
Information gain | Both | 6041 | 322 | 1276 | 598
Information gain | SMOTE | 6720 | 459 | 597 | 461
GINI index | Over | 5858 | 311 | 1459 | 609
GINI index | Under | 6139 | 339 | 1178 | 581
GINI index | Both | 6064 | 330 | 1253 | 590
GINI index | SMOTE | 6720 | 459 | 597 | 461
Tables VIII and IX summarize the results obtained by the classification models after applying the different resampling techniques. The results in the tables show that balancing the data set has decreased the majority class accuracy but improved the minority class accuracy; the minority class accuracy improves through an increase in the count of true negatives. As discussed earlier, it is relatively simple to achieve high overall accuracy with imbalanced data sets, but classifying the data reliably is difficult. Thus, after balancing the dataset, the objective of classifying the data reliably is achieved, as the minority class accuracy has improved.

TABLE VIII. RESULTS OBTAINED WITH DIFFERENT RESAMPLING TECHNIQUES USING INFORMATION GAIN

Metric | Overall accuracy | Majority class accuracy | Minority class accuracy | Recall | Precision | Kappa
Over | 78.5 | 80.0 | 66.2 | 80.0 | 94.9 | 29.9
Under | 81.4 | 83.6 | 63.9 | 83.6 | 94.8 | 33.7
Both | 80.6 | 82.5 | 65.0 | 82.5 | 94.9 | 32.7
SMOTE | 87.18 | 91.8 | 50.1 | 91.8 | 93.6 | 39.3

TABLE IX. RESULTS OBTAINED WITH DIFFERENT RESAMPLING TECHNIQUES USING GINI INDEX

Metric | Overall accuracy | Majority class accuracy | Minority class accuracy | Recall | Precision | Kappa
Over | 78.5 | 80.0 | 66.2 | 80.0 | 94.9 | 29.9
Under | 81.5 | 83.9 | 63.1 | 83.9 | 94.7 | 33.6
Both | 80.7 | 82.8 | 64.1 | 82.8 | 94.8 | 32.6
SMOTE | 87.18 | 91.8 | 50.1 | 91.8 | 93.6 | 39.3

Further analysis of the results shows that SMOTE achieved the highest overall accuracy among all the resampling methods. Also, with the SMOTE technique the kappa value is 39%. This shows that SMOTE is a relatively more reliable technique for balancing the dataset than the other three methods studied.

VI. CONCLUSIONS

The empirical results reported in this paper show that both Information gain and GINI index produce the same accuracy for classification problems. The experiment is conducted before and after the data set is balanced, and the results obtained show that there is no significant difference in the performance of the models using GINI index and Information gain in either case. The results are in line with Mingers [11], who stated that splitting indices have no impact on accuracy. In summary, the results obtained in this paper show that the classification accuracy of decision trees, for both balanced and imbalanced data sets, is not sensitive to the choice of the feature selection metrics that were studied.

Another interesting observation is that balancing the dataset lowered the majority class accuracy, with a decrease in the count of true positives, while the minority class accuracy improved, with an increase in the true negative count. In other words, the sensitivity decreased and the specificity improved after the data set was balanced. Despite the decrease in overall accuracy, there is clearly a significant rise in minority class accuracy. This shows that classification accuracy is sensitive to the number of positive and negative samples in the data set and to the type of data, balanced or imbalanced.

REFERENCES

[1] James, G., Witten, D., Hastie, T., and Tibshirani, R.: 'Tree-based methods', in 'An Introduction to Statistical Learning' (Springer, 2013), pp. 303-335.
[2] Doherty, C., Camina, S., White, K., and Orenstein, G.: 'The Path to Predictive Analytics and Machine Learning' (O'Reilly Media, 2017).
[3] Turban, E., Sharda, R., and Delen, D.: 'Business Intelligence and Analytics: Systems for Decision Support' (Pearson Higher Ed, 2014).
[4] Loh, W.-Y., and Shih, Y.-S.: 'Split selection methods for classification trees', Statistica Sinica, 1997, pp. 815-840.
[5] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J.: 'Classification and Regression Trees' (Wadsworth International Group, Belmont, CA, 1984), pp. 151-166.
[6] Raileanu, L.E., and Stoffel, K.: 'Theoretical comparison between the Gini index and information gain criteria', Annals of Mathematics and Artificial Intelligence, 2004, 41, (1), pp. 77-93.
[7] Shannon, C.E.: 'A note on the concept of entropy', Bell System Tech. J., 1948, 27, (3), pp. 379-423.
[8] Wang, Y., Li, Y., Song, Y., Rong, X., and Zhang, S.: 'Improvement of ID3 algorithm based on simplified information entropy and coordination degree', Algorithms, 2017, 10, (4), p. 124.
[9] Fan, R., Zhong, M., Wang, S., Zhang, Y., Andrew, A., Karagas, M., Chen, H., Amos, C., Xiong, M., and Moore, J.: 'Entropy-based information gain approaches to detect and to characterize gene-gene and gene-environment interactions/correlations of complex diseases', Genetic Epidemiology, 2011, 35, (7), pp. 706-721.
[10] Lefkovits, S., and Lefkovits, L.: 'Gabor feature selection based on information gain', Procedia Engineering, 2017, 181, pp. 892-898.
[11] Mingers, J.: 'An empirical comparison of selection measures for decision-tree induction', Machine Learning, 1989, 3, (4), pp. 319-342.
[12] Patil, N., Lathi, R., and Chitre, V.: 'Comparison of C5.0 & CART classification algorithms using pruning technique', Int. J. Eng. Res. Technol., 2012, 1, (4), pp. 1-5.


[13] Suneetha, N., Hari, V., and Kumar, V.S.: 'Modified Gini index classification: a case study of heart disease dataset', International Journal on Computer Science and Engineering, 2010, 2, (06), pp. 1959-1965.
[14] Adhatrao, K., Gaykar, A., Dhawan, A., Jha, R., and Honrao, V.: 'Predicting students' performance using ID3 and C4.5 classification algorithms', arXiv preprint arXiv:1310.2071, 2013.
[15] Hssina, B., Merbouha, A., Ezzikouri, H., and Erritali, M.: 'A comparative study of decision tree ID3 and C4.5', International Journal of Advanced Computer Science and Applications, 2014, 4, (2), pp. 13-19.
[16] Moro, S., Cortez, P., and Rita, P.: 'A data-driven approach to predict the success of bank telemarketing', Decision Support Systems, 2014, 62, pp. 22-31.
[17] Sharda, R.D.: 'Business Intelligence and Analytics: Systems for Decision Support' (Prentice Hall, 2016).
[18] https://people.revoledu.com/kardi/tutorial/DecisionTree.
[19] https://dataaspirant.com/2017/02/03/decision-tree-classifier-implementation-in-r/.
[20] Therneau, T., Atkinson, B., Ripley, B., and Ripley, M.B.: 'Package "rpart"', available online: cran.ma.ic.ac.uk/web/packages/rpart/rpart.pdf (accessed on 20 April 2016), 2015.
[21] Zheng, A.: 'Evaluating machine learning models: a beginner's guide to key concepts and pitfalls', 2015.
[22] Tan, P.-N., Steinbach, M., and Kumar, V.: 'Introduction to Data Mining' (Pearson Education India, 2016).
[23] Blagus, R., and Lusa, L.: 'Improved shrunken centroid classifiers for high-dimensional class-imbalanced data', BMC Bioinformatics, 2013, 14, p. 64.
[24] McHugh, M.L.: 'Interrater reliability: the kappa statistic', Biochemia Medica, 2012, 22, (3), pp. 276-282.
[25] McGee, S.: 'Evidence-Based Physical Diagnosis' (Elsevier Health Sciences, 2012).
[26] Ensrud, K.E., and Taylor, B.C.: 'Epidemiologic methods in studies of osteoporosis', in 'Osteoporosis' (Elsevier, 2013), pp. 539-561.
[27] Zheng, Z., Cai, Y., and Li, Y.: 'Oversampling method for imbalanced classification', Computing and Informatics, 2016, 34, (5), pp. 1017-1037.
[28] Cordón, I.: 'Working with imbalanced datasets'.
[29] Wasikowski, M.: 'Combating the class imbalance problem in small sample data sets', University of Kansas, 2009.
