
Accepted for publication in IEEE Access. Citation information: DOI 10.1109/ACCESS.2024.3416838.

A Survey of Decision Trees: Concepts, Algorithms, and Applications

IBOMOIYE DOMOR MIENYE (Member, IEEE) and NOBERT JERE
Department of Information Technology, Walter Sisulu University, Buffalo City Campus, East London 5200, South Africa
Corresponding author: Ibomoiye Domor Mienye (e-mail: [email protected]).

ABSTRACT Machine learning (ML) has been instrumental in solving complex problems and in significantly advancing different areas of our lives. Among the diverse range of ML algorithms, decision tree-based methods have gained significant popularity due to their simplicity and interpretability. This paper presents a comprehensive overview of decision trees, covering their core concepts, algorithms, and applications, from their early development to the recent high-performing ensemble algorithms, together with the mathematical formulations and algorithmic representations that are lacking in the literature and will be beneficial to ML researchers and industry experts. The algorithms covered include classification and regression trees (CART), Iterative Dichotomiser 3 (ID3), C4.5, C5.0, Chi-squared Automatic Interaction Detection (CHAID), conditional inference trees, and tree-based ensemble algorithms such as random forest, gradient-boosted decision trees, and rotation forest. Their utilisation in recent literature is also discussed, focusing on applications in medical diagnosis and fraud detection.

INDEX TERMS Algorithms, CART, C4.5, C5.0, Decision tree, Ensemble learning, ID3, Machine learning

I. INTRODUCTION

Machine learning-based applications are revolutionising various industries and sectors, including healthcare, finance, and marketing [1]–[4]. With the advancement of technology and the availability of large datasets, ML algorithms have become increasingly powerful and accurate in making predictions and informed decisions. These applications are transforming how organisations operate and paving the way for a more efficient and data-driven future.

Decision tree-based algorithms have been employed in diverse applications, including but not limited to classification, regression, and feature selection [5]–[7]. The basic idea behind decision tree-based algorithms is that they recursively partition the data into subsets based on the values of different attributes until a stopping criterion is met. This process results in a tree-like structure, where each node represents a decision or a split based on a specific attribute [8]. The algorithm determines the best attribute to use for each split based on certain criteria, such as information gain, gain ratio, and the Gini index.

Furthermore, decision trees are known for their interpretability [9], [10]. The resulting tree structure allows users to understand and interpret the decision-making process easily. This is especially valuable in domains where transparency and explainability are crucial, making it easier for stakeholders to trust and validate the results. Another strength of decision tree-based algorithms is their ability to handle categorical and numerical data. Traditional statistical methods often struggle with categorical variables, requiring them to be converted into numerical values. Decision trees, on the other hand, can directly handle both types of data, eliminating the need for extensive data preprocessing. This makes decision tree-based algorithms more versatile and efficient in a wide range of applications.

There are a few reviews of decision trees in the literature; for example, Che et al. [11] presented a review of decision trees and ensemble classifiers with specific applications to bioinformatics. The review focused on ID3, CART, and ensemble methods such as bagging, boosting, and stacked generalization.


Cañete-Sifuentes et al. [12] reviewed multivariate decision trees (MDT) and compared the performance of several MDT induction classifiers. Anuradha and Gupta [13] presented a review of decision tree classifiers, focusing on a high-level description of key concepts such as node splitting and tree pruning. Meanwhile, Costa and Pedreira [14] reviewed recent advances in decision tree-based classifiers. The paper covered three main issues: how decision trees fit the training data, their generalization, and their interpretability.

However, most of the existing surveys and reviews of decision trees focus on their applications in specific domains or give only a high-level overview of the decision tree concept. The current literature therefore lacks a comprehensive overview of decision tree algorithms, their early developments, succinct mathematical formulations, and algorithmic representations in a single peer-reviewed paper. Hence, it is essential to have a review that fills this gap, given the continued use and prevalence of decision tree-based algorithms in today's technological advancements. In this study, we present a detailed review of decision tree-based algorithms. Specifically, the paper aims to cover the different decision tree algorithms, including ID3, C4.5, C5.0, CART, conditional inference trees, and CHAID, together with tree-based ensemble algorithms such as random forest, rotation forest, and gradient-boosted decision trees. The paper aims to present their mathematical formulations and algorithmic representations clearly and concisely.

The rest of the paper is structured as follows: Section II presents a comprehensive overview of the decision tree, covering key areas such as splitting criteria and tree pruning methods. Section III discusses the different decision tree algorithms, their learning processes, splitting criteria, and mathematical formulations. Section IV reviews decision tree applications in recent literature, including applications in medical diagnosis and fraud detection. Section V discusses key findings and future research directions, and Section VI concludes the paper.

II. OVERVIEW OF DECISION TREE
This section provides a comprehensive overview of decision trees, focusing on the main building blocks and splitting criteria. Decision trees, as a concept in ML, have a history that dates back to the mid-20th century. Initial decision tree studies were started by Charles J. Clopper and Egon S. Pearson in 1934, who introduced the concept of binary decision processes [15], [16]. However, the modern implementation of decision trees in the context of ML started decades later. Breiman et al. [17] developed the CART algorithm in 1984, introducing concepts such as the Gini index and binary splitting, which are now widespread in decision tree designs. Quinlan [18] developed ID3, one of the first notable decision tree algorithms, in 1986. Furthermore, Quinlan [19] enhanced ID3, introducing the C4.5 decision tree in 1993. These developments, together with the integration of decision trees into ensemble methods such as random forests and boosting algorithms, have solidified their place as fundamental algorithms in machine learning.

The learning procedure of decision trees involves a series of steps where the data is split into homogeneous subsets, as shown in Figure 1. The root node, which is the starting point of the tree, represents the entire dataset. The algorithm identifies the feature and the threshold that lead to the best split based on a specific criterion [20]. The process continues recursively, with each subset of the data being further split at each child node. This continues until a stopping criterion is reached, typically when the nodes are pure (i.e., all data points in a node belong to the same class) or when a predefined depth of the tree is reached. The nodes where the tree ends, called leaf nodes or terminal nodes, represent the outcomes or class labels. The decision to split at each node is made using mathematical formulations such as information gain, Gini impurity, or variance reduction.

FIGURE 1. A decision tree example.

Furthermore, the success of decision tree techniques depends mainly on several factors contributing to their performance, interpretability, and applicability to a wide range of problems. These factors include data quality, tree depth, splitting criteria, and the tree pruning method. According to Piramuthu [21], the effectiveness of decision trees is highly dependent on the quality of the training data. Hence, it is necessary to use clean or preprocessed data without missing values and outliers, which can significantly enhance the performance of the resulting models. Additionally, feature selection and feature engineering are necessary because inputting relevant and well-transformed features can lead to more efficient and accurate splits.
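To make the recursive learning procedure concrete, the short sketch below grows a small tree and stops according to a maximum-depth and minimum-leaf-size criterion, then prints the resulting decision rules. The use of scikit-learn, the iris dataset, and the specific parameter values are illustrative assumptions, not part of the survey.

```python
# A minimal sketch (assuming scikit-learn) of the learning procedure described
# above: the tree grows from the root node, splits on the feature/threshold
# chosen by the impurity criterion, and stops when nodes are pure or a
# predefined depth/leaf-size limit is reached.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="gini",      # splitting criterion (see Section II-A)
    max_depth=3,           # predefined depth as a stopping criterion
    min_samples_leaf=5,    # alternative stopping criterion on leaf size
    random_state=0,
)
tree.fit(X, y)

# Each root-to-leaf path prints as a sequence of simple if/else rules.
print(export_text(tree, feature_names=load_iris().feature_names))
```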

A. SPLITTING RULES
The term splitting criteria, or splitting rules, describes the methods used to determine where a tree should make a split in its nodes, effectively deciding how to divide the dataset into subsets based on different conditions [22], [23]. The choice of splitting criterion is crucial as it directly impacts the tree's structure and, ultimately, its performance. Different decision tree algorithms use different criteria for this purpose, including the following:

1) Gini index
The Gini index, also called Gini impurity, is a well-known splitting criterion used in the CART algorithm. It measures the probability of a randomly chosen sample being incorrectly classified if it was randomly labelled [24]. It is used to evaluate the quality of a split in the tree and is calculated for each potential split in the dataset. The Gini index for a set can be represented mathematically as:

Gini(S) = 1 − \sum_{i=1}^{n} p_i^2    (1)

where S, n, and p_i represent a set of samples, the number of unique classes in the set, and the proportion of the samples in the set that belong to class i, respectively. This formula calculates the probability of incorrectly classifying a randomly chosen element from the set S based on the distribution of classes in it. The value of the Gini impurity ranges from 0 (perfect purity) to 1 (maximal impurity) [25]. When the algorithm evaluates where to split the data, it calculates the Gini index for each potential split and typically chooses the split that results in the lowest weighted Gini impurity for the resulting subsets.

2) Information Gain
Information gain (IG), a criterion used in ID3 and C4.5, is based on the notion of entropy in information theory. Entropy measures the unpredictability or randomness in a set of data [26]. The IG technique searches for a split that maximizes the difference in certainty, or decreases uncertainty, before and after the split. It determines the effectiveness of an attribute in splitting the training data into homogeneous sets. The entropy E of a set S is given by the formula:

E(S) = − \sum_{i=1}^{n} p_i \log_2(p_i)    (2)

where n is the number of unique classes in the set, and p_i is the proportion of the samples in the set that belong to class i. Therefore, the IG for a split on a dataset S with an attribute A can be computed as follows:

IG = E(S) − \sum_{v \in Values(A)} (|S_v| / |S|) E(S_v)    (3)

where Values(A) are the different values that attribute A can take, and S_v is the subset of S for which attribute A has the value v [27]. This formula calculates the change in entropy from the original set S to the sets S_v created after the split. A higher IG indicates a more effective attribute for splitting the data, as it results in more homogeneous subsets.

3) Information Gain Ratio
The information gain ratio (IGR), an extension of information gain, is a splitting criterion mainly used in the C4.5 decision tree to overcome the bias of information gain towards features that have several distinct values by considering the number and size of branches when choosing an attribute. The IGR normalises the information gain by dividing it by the intrinsic information, or split information (SplitInfo), of the split. This normalisation reduces the bias towards multi-valued attributes, resulting in more balanced and effective decision trees [26], [27]. The IGR criterion is calculated as:

IGR(S, A) = InformationGain(S, A) / SplitInfo(S, A)    (4)
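As a concrete illustration of Equations (1)–(4), the following sketch computes the Gini index, entropy, information gain, and gain ratio for a categorical attribute; NumPy, the toy labels, and the use of the attribute's value entropy as SplitInfo are assumptions made for the example.

```python
# Illustrative implementations of Equations (1)-(4); NumPy is assumed.
import numpy as np

def gini(labels):
    """Gini index, Eq. (1): 1 - sum_i p_i^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy, Eq. (2): -sum_i p_i log2 p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute):
    """Information gain, Eq. (3): E(S) - sum_v |S_v|/|S| * E(S_v)."""
    weighted = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

def gain_ratio(labels, attribute):
    """Gain ratio, Eq. (4): IG divided by SplitInfo (entropy of the attribute values)."""
    split_info = entropy(attribute)
    ig = information_gain(labels, attribute)
    return ig / split_info if split_info > 0 else 0.0

# Toy example: a binary class label and a three-valued attribute (assumed data).
y = np.array(["yes", "yes", "no", "no", "yes", "no"])
a = np.array(["sunny", "sunny", "rain", "rain", "overcast", "overcast"])
print(gini(y), entropy(y), information_gain(y, a), gain_ratio(y, a))
```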

4) Chi-Square
The chi-square (χ²) splitting criterion measures the independence between an attribute and the class [28]. The χ² test assesses whether the distribution of sample observations across different categories deviates significantly from what would be expected if the categories were independent of the class. Given an attribute A with different categories and a target class C, the χ² statistic can be computed as:

χ² = \sum_{i=1}^{r} \sum_{j=1}^{k} (O_{ij} − E_{ij})² / E_{ij}    (5)

where r is the number of categories of the attribute A, k is the number of classes, O_{ij} is the observed frequency in cell (i, j), i.e., the number of samples in category i of attribute A that belong to class j, and E_{ij} is the expected frequency in cell (i, j) under the null hypothesis of independence, calculated as E_{ij} = (row_total_i × column_total_j) / total_samples. A high χ² value indicates a significant association between the attribute and the class, suggesting that the attribute is a good predictor for splitting the dataset [29], [30]. This criterion is useful for categorical data, and it identifies the most significant splits based on the chi-square test of independence.
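The small sketch below evaluates the χ² statistic of Equation (5) for one candidate split, both directly and with SciPy's contingency-table test; the observed counts are made-up numbers used only for illustration.

```python
# Chi-square splitting statistic, Eq. (5), for a candidate categorical split.
# SciPy is assumed to be available; the counts are purely illustrative.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: categories of attribute A; columns: target classes.
observed = np.array([[30, 10],
                     [20, 25],
                     [ 5, 40]])

# Expected counts under independence: E_ij = row_total_i * column_total_j / total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

chi_square = np.sum((observed - expected) ** 2 / expected)   # Eq. (5)

# chi2_contingency reports the same statistic plus a p-value and degrees of freedom.
stat, p_value, dof, _ = chi2_contingency(observed, correction=False)
print(chi_square, stat, p_value, dof)
```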
B. TREE PRUNING METHODS
1) Pre-Pruning
Pre-pruning or early stopping techniques are used to effectively limit the size of the tree and reduce the possibility of overfitting [31], [32]. The main benefit of pre-pruning is its simplicity and the reduction in computational cost due to the construction of smaller trees. However, setting the pre-pruning parameters too aggressively may lead to underfitting. This strategy halts the tree's growth according to predefined criteria, such as maximum depth, minimum number of instances in a node, minimum information gain, and maximum number of leaf nodes [33].

2) Post-pruning
Post-pruning, also called backward pruning, is a technique used to trim down a fully grown tree to improve its generalization capabilities. Unlike pre-pruning, which stops the tree from fully growing, post-pruning allows the tree to first grow to its full size and then prunes it back [34]. Common post-pruning techniques include reduced error pruning, pessimistic error pruning, error-based pruning, minimum error pruning, and cost complexity pruning [33]. Post-pruning primarily removes sections of the tree that contribute little to predicting the target variable. It often requires a separate validation dataset to assess the impact of pruning [35]. This dataset tests the tree's performance as it undergoes pruning.

C. INTERPRETABILITY OF DECISION TREES
Decision trees are known for their inherent interpretability, making them valuable in various domains where understanding the decision-making process is crucial [14], [36]. Unlike many other ML algorithms that produce black-box models, decision trees offer transparency by representing the decision process as a sequence of simple, intuitive rules. Specifically, each node in a decision tree corresponds to a feature and a decision threshold, and the path from the root to a leaf node represents a series of decisions based on the feature values. This clear structure allows stakeholders to easily comprehend and interpret how the model arrives at its predictions.

Furthermore, while complex models such as deep neural networks and ensemble methods may achieve high accuracy, their black-box nature makes it challenging to understand how they arrive at their predictions [37], [38]. In contrast, decision trees provide a visual representation of the decision-making process, allowing stakeholders to trace each decision back to specific features and thresholds. For instance, in a medical diagnosis application, a decision tree model may reveal which symptoms or risk factors are most influential in predicting a particular disease. This transparency enables domain experts to validate the model's decisions and identify potential biases or errors, thereby improving trust in the model's predictions.

Additionally, decision trees can facilitate feature selection and variable importance analysis, aiding in feature engineering and model refinement [39]–[41]. By examining the splits in the tree and the associated feature importance scores, practitioners can identify the most influential features in the prediction process. This information can guide data preprocessing efforts and inform decisions about feature inclusion or exclusion in the model, leading to more efficient and interpretable models.
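As a brief illustration of the cost-complexity post-pruning named in Section II-B, the sketch below grows a full tree, enumerates the candidate pruning levels, and keeps the one that performs best on a separate validation set; scikit-learn, the dataset, and the selection rule are assumptions made for the example.

```python
# Minimal cost-complexity (post-)pruning sketch: grow a full tree, compute the
# effective alphas along the pruning path, and refit with the alpha that scores
# best on held-out data (scikit-learn assumed; dataset is illustrative).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Post-pruning typically uses a separate validation set (Section II-B) to
# decide how far to prune; here each candidate alpha is scored on X_val.
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score >= best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```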

III. DECISION TREE ALGORITHMS
A. ITERATIVE DICHOTOMISER 3
The ID3 decision tree was first introduced in 1986 by Quinlan [18]. It is particularly noted for its simplicity and effectiveness in solving classification problems. The algorithm follows a top-down, greedy search approach through the given dataset to construct a decision tree. It begins with the entire dataset and divides it into subsets based on the attribute that maximizes the information gain (Equation 3), intending to efficiently classify the instances at each node of the tree. ID3 is described in Algorithm 1.

Algorithm 1 ID3 Decision Tree Algorithm
Require: Training data set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}
Ensure: Decision tree T.
1: function ID3(D)
2:   if D is empty then return a terminal node with default class c_default
3:   end if
4:   if all instances in D have the same class label y then return a terminal node with class y
5:   end if
6:   if the attribute set J is empty then return a terminal node with the prevalent class in D
7:   end if
8:   Select the feature f that best splits the data using information gain.
9:   Create a decision node for f.
10:  for each value b_i of f do
11:    Create a branch for b_i.
12:    Let D_i be the subset of D where f = b_i.
13:    Recursively build the subtree for D_i.
14:    Attach the subtree to the branch for b_i.
15:  end for
16:  return the decision node.
17: end function

The algorithm iterates through every unused attribute and calculates the information gain for a dataset split by the attribute's possible values. The attribute with the highest information gain is chosen to make the decision at the node, and the dataset is partitioned accordingly. This process is repeated recursively for each partitioned subset until one of the stopping criteria is met, such as when no further information can be gained, all instances in a subset belong to the same class, or there are no more attributes left to consider. Lastly, ID3's limitations include its inability to directly handle continuous variables and its tendency to overfit.

B. C4.5 AND C5.0
Quinlan [19] proposed C4.5 in 1993 as an extension of the ID3 algorithm; it is designed to handle both continuous and discrete attributes. It introduces the concept of the information gain ratio, described in Equation 4, to select the best attribute to split the dataset at each node, aiming to overcome the bias towards attributes with more levels found in the original information gain criterion used by ID3. C5.0 is an improvement over C4.5, also proposed by Quinlan [42], designed to be faster and more memory efficient. It introduces several enhancements, such as advanced pruning methods and the ability to handle more complex types of data. C5.0 maintains the use of the information gain ratio for selecting attributes but optimises the algorithm's execution and the resulting decision tree's size.

C. CLASSIFICATION AND REGRESSION TREES
The CART decision tree was proposed in 1984 by Breiman et al. [43]. Unlike C4.5, CART creates binary trees irrespective of the type of target variable. It uses different splitting criteria for classification and regression tasks. For classification tasks, it uses the Gini index (Equation 1) as the measure to create splits [44], [45]. Meanwhile, it employs variance reduction as the splitting criterion in regression tasks [46], [47]. The variance reduction for a set S when split on attribute A is calculated as:

VR = V(S) − [ (|S_left| / |S|) V(S_left) + (|S_right| / |S|) V(S_right) ]    (6)

where V(S) is the variance of the target variable in set S, and S_left and S_right are the subsets of S after the split on attribute A. In both cases, the goal is to choose the split that maximizes the respective measure (Gini impurity reduction for classification and variance reduction for regression), leading to the most homogeneous subsets possible. The CART algorithm is described in Algorithm 2.

Algorithm 2 CART Algorithm
Require: D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}.
Ensure: Decision tree T.
1: function CART(D)
2:   if D is empty then return a terminal node with default value or class c_default
3:   end if
4:   if all instances in D have the same class label y then return a terminal node with class y
5:   end if
6:   if the feature set F is empty then return a leaf node with the average value of y in D
7:   end if
8:   Select the best feature f and split point s that minimize the cost function.
9:   Create a decision node for f and s.
10:  Partition the data set D into two subsets D_1 and D_2 based on the split.
11:  Recursively build the subtrees for D_1 and D_2.
12:  Attach the subtrees to the decision node.
13:  return the decision node.
14: end function
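As a sketch of Equation (6), the function below scores a candidate numeric split for a regression target by the variance it removes and then picks the best threshold by exhaustive search; NumPy and the toy data are assumptions made for the example.

```python
# Illustrative computation of the variance reduction in Equation (6)
# for candidate splits on a single numeric attribute (NumPy assumed).
import numpy as np

def variance_reduction(x, y, threshold):
    """VR = V(S) - [ |S_left|/|S| * V(S_left) + |S_right|/|S| * V(S_right) ]."""
    left, right = y[x <= threshold], y[x > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # degenerate split: nothing is separated
    n = len(y)
    return np.var(y) - (len(left) / n * np.var(left) + len(right) / n * np.var(right))

# Toy regression data (assumed): choose the threshold with the largest reduction.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 4.8, 5.2, 5.0])
candidates = (x[:-1] + x[1:]) / 2          # midpoints between consecutive values
best = max(candidates, key=lambda t: variance_reduction(x, y, t))
print(best, variance_reduction(x, y, best))
```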

D. CHI-SQUARED AUTOMATIC INTERACTION DETECTION
The CHAID algorithm, developed by Kass [48], performs multi-level splits when computing classification trees. It is particularly robust in the detection of interactions between variables. CHAID can handle more than two categories for each variable, and it uses the chi-square (χ²) test of independence as its splitting criterion [49], [50]. This statistical test is applied to assess the relationship between categorical variables. For a given attribute A with different categories and a target class C, the χ² statistic is computed as:

χ² = \sum_{i=1}^{r} \sum_{j=1}^{k} (O_{ij} − E_{ij})² / E_{ij}    (7)

where r is the number of categories of the attribute A, k is the number of different classes in the target variable C, O_{ij} is the observed frequency in the i-th category of attribute A and the j-th class of C, and E_{ij} is the expected frequency in the same cell under the null hypothesis of independence, calculated as E_{ij} = (row_total_i × column_total_j) / total_samples. The attribute with the highest χ² statistic is selected for splitting at each node. A higher χ² value indicates a stronger association between the attribute and the target variable, suggesting that the attribute is a good predictor for splitting the dataset. Algorithm 3 details the working process of the CHAID algorithm.

Algorithm 3 CHAID Algorithm
Require: D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}.
Ensure: Decision tree T.
1: function CHAID(D)
2:   if D is empty then return a terminal node with default class c_default
3:   end if
4:   if all instances in D have the same class label y then return a terminal node with class y
5:   end if
6:   if the feature set F is empty then return a terminal node with the most prevalent class in D
7:   end if
8:   Calculate the chi-squared statistic for each feature and its possible values.
9:   Select the feature and value with the highest chi-squared value.
10:  Create a decision node for the selected feature and value.
11:  Partition the data set D based on the selected feature and value.
12:  for each subset D_i of D do
13:    Recursively build the subtree for D_i.
14:    Attach the subtree to the decision node.
15:  end for
16:  return the decision node.
17: end function

E. CONDITIONAL INFERENCE TREES
Conditional inference trees, developed by Hothorn et al. [51], are a non-parametric class of decision trees that use statistical tests to determine splits, reducing bias and variance and providing a more statistically sound approach. They are mostly useful when modelling complex, non-linear relationships between the predictor variables and the response variable [52], [53]. Assuming S is a node in the tree with m examples and d features, let X_s be the subset of d features at node S, Y_s be the corresponding response values, and X_j be the j-th feature in X_s. The algorithm can then be defined as follows (a simplified sketch of this test-based split search is given after the list):
1) For each feature X_j in X_s, calculate the p-value of a statistical test for the null hypothesis that there is no relationship between X_j and Y_s.
2) Choose the feature X_k and split point t_k that maximize the statistical significance, based on the p-values of the tests.
3) Split the node into two child nodes S_1 and S_2, where S_1 contains examples with X_k ≤ t_k and S_2 contains examples with X_k > t_k.
4) Recursively repeat steps 1–3 for every child node until a stopping criterion is reached.
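The following sketch illustrates the two-stage, test-based split selection described above: it picks the feature with the smallest p-value of an association test and then picks a split point on that feature. The tests used here (Pearson correlation and Welch's t-test via SciPy) and the toy data are simplifying assumptions, not the permutation-test framework of Hothorn et al. [51].

```python
# Rough illustration of conditional-inference-style split selection: first
# select the variable by the smallest association-test p-value, then select
# its split point. SciPy is assumed; the data are synthetic.
import numpy as np
from scipy.stats import pearsonr, ttest_ind

def select_split(X, y):
    # Step 1: p-value of an association test between each feature and y.
    p_values = [pearsonr(X[:, j], y)[1] for j in range(X.shape[1])]
    k = int(np.argmin(p_values))                 # most significant feature

    # Step 2: choose the threshold on feature k that separates y most strongly.
    best_t, best_p = None, 1.0
    for t in np.unique(X[:, k])[:-1]:
        left, right = y[X[:, k] <= t], y[X[:, k] > t]
        if len(left) > 1 and len(right) > 1:
            p = ttest_ind(left, right, equal_var=False).pvalue
            if p < best_p:
                best_t, best_p = t, p
    return k, best_t, p_values[k]

# Toy data (assumed): only the second feature carries the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)
print(select_split(X, y))
```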

F. RANDOM FOREST
The random forest, described in Algorithm 4, is an ensemble of decision trees [54], [55]. It improves upon the basic decision tree algorithm by reducing overfitting. Each tree in the forest is built from a sample drawn with replacement (i.e., a bootstrap sample) from the input data [56]. The basic idea behind this algorithm is to generate a set of trees using different subsets of the input samples and features and then combine their outputs to obtain a final prediction. The random forest algorithm uses two main techniques to reduce overfitting and improve accuracy:
• Bootstrap sampling: By sampling the data with replacement, the algorithm generates multiple training sets that are slightly different from each other. This type of sampling reduces variance and prevents overfitting.
• Feature randomization: Randomly selecting a subset of features for each tree decorrelates the trees and reduces the chance of selecting the same "best" feature for every tree. This improves the diversity and accuracy of the trees.

Algorithm 4 Random Forest Algorithm
1: for t = 1 to T do    ▷ Generate T trees
2:   Randomly sample n instances from D with replacement
3:   Randomly select m attributes from the total p attributes (where m << p)
4:   Build a decision tree h_t based on the sampled instances and attributes
5: end for
6: To make predictions for a new instance x:
7: if classification task then
8:   f(x) = argmax_c (1/T) \sum_{t=1}^{T} I{h_t(x) = c}    ▷ Majority vote across trees
9: else if regression task then
10:  f(x) = (1/T) \sum_{t=1}^{T} h_t(x)    ▷ Average of tree predictions
11: end if

G. GRADIENT BOOSTED DECISION TREES
Gradient-boosted decision trees (GBDT) is an ensemble learning method that combines multiple decision trees to create a powerful predictive model [57]. Unlike random forest, which builds independent trees in parallel, GBDT uses a sequential approach to build trees that correct the errors of the previous trees [58], [59]. It uses gradient descent to minimize errors. Assuming T is the number of trees, h_t(x) is the prediction of the t-th tree, F_{t−1}(x) is the current model's prediction for x, and L(y, F_{t−1}(x)) is the loss function, the GBDT algorithm works as follows (a small sketch of this loop for a regression loss is given after the list):
1) Initialize the model with a constant value (e.g., the mean of the target variable).
2) For t = 1 to T:
   a) Compute the negative gradient of the loss function with respect to the current model's predictions for each instance in the training data.
   b) Fit a decision tree to the negative gradient values, using the input data as features and the negative gradient values as target variables.
   c) Update the model by adding the new tree, weighted by a learning rate η, to the current model.
3) Make a prediction for a new instance by summing the predictions from the various trees:
   a) For a regression task, the final prediction is the sum of the predictions of all the trees, i.e., f(x) is given by:

      f(x) = \sum_{t=1}^{T} η h_t(x)    (8)

      where η is the learning rate.
   b) For a classification task, the final prediction is the probability of the positive class, computed by applying a sigmoid function to the sum of the predictions of all the trees:

      f(x) = 1 / (1 + e^{−\sum_{t=1}^{T} η h_t(x)})    (9)

      where η is the learning rate and e is Euler's number.
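To illustrate steps 1)–3) for a squared-error regression loss, where the negative gradient is simply the residual, the sketch below boosts shallow scikit-learn regression trees; the library, learning rate, tree depth, and synthetic data are assumptions for the example, and the constant initialization is kept explicitly alongside the weighted tree sum of Equation (8).

```python
# Minimal gradient-boosting loop for regression with squared error, where the
# negative gradient equals the residual y - F(x); scikit-learn trees assumed.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

eta, T = 0.1, 100                       # learning rate and number of trees
prediction = np.full_like(y, y.mean())  # step 1: initialize with a constant
trees = []

for t in range(T):                      # step 2: sequentially correct errors
    residual = y - prediction           # negative gradient of the squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=t)
    tree.fit(X, residual)               # fit a tree to the negative gradient
    prediction += eta * tree.predict(X)  # update the model, weighted by eta
    trees.append(tree)

def predict(x_new):
    # Step 3 (regression): constant initialization plus the weighted tree sum.
    return y.mean() + eta * sum(tree.predict(x_new) for tree in trees)

print(predict(np.array([[0.5]])))
```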

H. ROTATION FOREST
Rotation forest is a type of decision tree ensemble where each tree is trained on the principal components of a randomly selected subset of features [60], [61]. The core idea behind this algorithm is to train each classifier in the ensemble on a version of the training data that has been transformed to maintain the correlation between the features and introduce diversity among the classifiers. This is achieved through the following steps:
1) For each classifier to be trained, partition the set of features F into k subsets. The partitioning can be random but is done in such a way that each subset contains a different part of the features.
2) For each subset of features, apply PCA to obtain the principal components. This step transforms the original feature space into a new space that captures the variance in the data more effectively.
3) Combine the principal components from all subsets to form a new set of features for training the classifier. This effectively rotates the axes of the feature space, hence the name rotation forest.
4) Train each base classifier on the transformed dataset. Different classifiers can be used, but decision trees are commonly applied.

Given a dataset D with n features, the algorithm partitions the feature set F into k non-overlapping subsets F_1, F_2, ..., F_k. For each subset F_i, PCA is applied to derive a set of principal components PC_i, capturing the main variance directions of the features in F_i. The transformation for a subset F_i can be represented as:

T_i = PCA(F_i)    (10)

where T_i is the transformation matrix obtained from PCA on subset F_i. The new feature set for training the j-th classifier, D_j, is obtained by applying the transformation T_i to each subset F_i and concatenating the results:

D_j = \bigoplus_{i=1}^{k} T_i(F_i)    (11)

where \bigoplus denotes the concatenation of the transformed feature subsets. The ensemble's final output is typically the majority vote (for classification tasks) of the predictions from all base classifiers.
where denotes the concatenation of the trans-
formed feature subsets. The ensemble’s final out- to obtain the most significant features, which were
put is typically the majority vote (for classification then used to train the models. The study con-
tasks) of the predictions from all base classifiers. cluded that the random forest-SFS and decision tree-
A summary of the different tree-based algorithms SFS achieved the best accuracy. For the Cleveland
is tabulated in Table 1, including their advantages dataset, the random forest and decision tree obtained
and disadvantages. accuracies of 100
In [67], the authors identified the C4.5 and ran-
IV. DECISION TREE APPLICATIONS IN RECENT dom forest as potentially robust algorithms for de-
LITERATURE tecting chronic kidney disease (CKD) stages. The
Decision trees have gained significant attention in study employed a CKD dataset from the UCI ma-
recent literature. This section discusses some pop- chine learning repository, comprising 25 features
ular applications of decision trees in fields such as and 400 samples. The results indicated that the
healthcare and finance. C4.5 achieved an accuracy of 85.5%, outperforming
the random forest, which achieved an accuracy of
A. MEDICAL DIAGNOSIS 78.25%.
Healthcare is one of the prominent areas where Decision tree-based methods have also been em-
decision trees have found extensive use. Researchers ployed to diagnose COVID-19. Yoo et al. [66] pro-
have utilized decision trees to predict disease di- posed a deep learning-based decision tree model to
agnosis, treatment outcomes, and patient progno- detect COVID-19 using chest X-ray images. The
sis. Decision trees are effective in identifying pat- approach consists of three decision trees trained
terns and relationships in medical data, leading to using deep learning architectures, including a con-
more accurate diagnoses and personalized treatment volutional neural network (CNN). One tree classi-
plans. For example, decision trees have been used fies the images as normal or abnormal, another tree

Decision tree-based methods have also been employed to diagnose COVID-19. Yoo et al. [66] proposed a deep learning-based decision tree model to detect COVID-19 from chest X-ray images. The approach consists of three decision trees trained using deep learning architectures, including a convolutional neural network (CNN): one tree classifies the images as normal or abnormal, another detects tuberculosis indicators in the abnormal images, and the last detects COVID-19. The approach achieved an average accuracy of 95%. Ghiasi and Zendehboudi [68] proposed a decision tree-based ensemble classifier for detecting breast cancer. The study used the well-known Wisconsin Breast Cancer dataset and aimed to build a robust breast cancer detection framework using the random forest and extra trees (ET) classifier. The approach resulted in an accuracy of 100%.

Mienye and Sun [69] studied the performance of ML algorithms for heart disease prediction. The study utilized the following algorithms: decision tree, XGBoost, random forest, logistic regression, and naive Bayes. Firstly, the authors employed the Synthetic Minority Oversampling Technique-Edited Nearest Neighbor (SMOTE-ENN) to resample the data and address the class imbalance problem. Also, the recursive feature elimination technique was employed to identify the most significant attributes and further enhance the classification performance of the models. The results showed that the decision tree, random forest, and XGBoost achieved accuracies of 87.7%, 93%, and 95.6%, respectively, with XGBoost obtaining the highest accuracy.

Meanwhile, Adler et al. [70] developed a glaucoma detection method using the random forest ensemble classifier. The study evaluated the performance of ensemble pruning on the imbalanced glaucoma dataset. The ensemble pruning techniques include pruning by prediction accuracy (using the Brier score strategy), pruning by uncertainty-weighted accuracy (UWA), and pruning by diversity (using the double-fault measure). The experimental results indicated that the RF model reached an area under the receiver operating characteristic curve (AUC) of 0.98 for the Brier and double-fault pruning techniques.

Additionally, Mienye et al. [71] employed decision tree, SVM, and logistic regression for CKD detection. The selected algorithms were also used as the base learners in the AdaBoost ensemble. The study reported accuracies of 94% and 100% for the decision tree and the AdaBoost classifier that used a decision tree as a base learner. The study demonstrated the robustness of using a decision tree in AdaBoost over the SVM and logistic regression. Furthermore, Mienye and Sun [72] studied the impact of cost-sensitive ML in medical diagnosis using the following algorithms: decision tree, random forest, and XGBoost. Cost-sensitive learning involves modifying the algorithm to focus on the minority class samples, thereby enhancing the model's performance on the minority class, which in most applications is of higher importance than the majority class.

When applied for detecting cervical cancer, the cost-sensitive random forest obtained the highest classification accuracy of 98.8%, outperforming the other cost-sensitive and standard algorithms.

Furthermore, Khan et al. [73] proposed an ensemble approach called optimal trees ensemble (OTE) and applied it to diverse classification problems, including hepatitis and Parkinson's disease detection, achieving error rates of 0.1230 and 0.0861, respectively. These error rates, which translate to 87.7% and 91.4% accuracy, imply that the proposed OTE outperformed other baseline models, including KNN, LDA, and random forest. Table 2 summarizes the discussed studies on medical diagnosis, indicating how decision trees have been employed in the medical domain and achieved excellent classification performance.

B. FINANCE
Decision trees have also been widely employed in the field of finance. By analysing historical data and identifying relevant variables, decision trees can accurately predict the creditworthiness of individuals. This information is crucial for banks and lending institutions in determining the risk associated with granting loans [74], [75]. Furthermore, decision trees have been used to detect fraudulent activities in financial transactions by examining transactional data and identifying suspicious patterns, helping to prevent financial losses.

Yao et al. [76] studied credit risk within an enterprise setting. The study proposed a decision tree-based ensemble classifier that uses the SMOTE and AdaBoost algorithms. The proposed model was aimed at identifying enterprise credit risk by incorporating supply chain information. Other benchmark models were built using KNN, logistic regression, SVM, and random forest. The study indicated that the proposed decision tree ensemble achieved the best and most stable performance, obtaining an AUC of 0.902.

Liu et al. [77] developed an approach for financial institutions to effectively predict credit risk and enhance profitability. The proposed approach uses the gradient-boosting decision tree. While the GBDT was efficient in predicting credit risk, it lacked sufficient interpretability. Therefore, the study introduced an enhanced method called tree-based augmented GBDT, which uses a step-wise feature augmentation framework. The proposed approach achieved a classification accuracy of 93.78%, outperforming the standard GBDT and displaying robust interpretability.

Alam et al. [78] studied the imbalanced class problem in credit risk prediction. The study employed different credit risk datasets, including the German credit approval dataset, the Taiwan dataset, and the European credit card clients dataset. The gradient-boosted decision tree model combined with the k-means SMOTE technique achieved accuracies of 84.6%, 89%, and 87.1% on the German, Taiwan, and European clients datasets, respectively.

Hancock et al. [79] employed gradient-boosted decision tree-based algorithms for detecting health insurance fraud. This is an important ML application, as healthcare fraud is capable of denying patients needed medical attention. In this study, the authors employed claims data to train the various classifiers, including categorical boosting (CatBoost), achieving an AUC of 0.775 and outperforming other ML algorithms. The study went further to demonstrate the model's performance after introducing a new variable called healthcare provider state, leading to the CatBoost obtaining an AUC of 0.882.

Wong et al. [80] conducted a comparative study of ML algorithms for credit risk prediction. The study focused on decision tree, random forest, KNN, logistic regression, and naive Bayes classifiers. The aim of the study was to assess which classifier would achieve the highest performance in terms of accuracy and other metrics. The experimental results indicated that the decision tree and random forest achieved accuracies of 92.11% and 94.57%, with the random forest outperforming the other classifiers and demonstrating the robustness of tree-based ensemble classifiers.

Seera et al. [81] employed a decision tree for credit card fraud detection, using credit card transaction records in Malaysia and obtaining a classification accuracy of 99.96%. Rawat et al. [82] studied the performance of four classifiers on credit card fraud detection. The classifiers include logistic regression, RF, KNN, and AdaBoost. The various models achieved classification accuracies of 99%. Similarly, Adhegaonkar et al. [83] employed decision tree, random forest, logistic regression, and SVM for credit card fraud detection. The experimental results showed that the decision tree obtained an accuracy of 84.9%. However, the random forest obtained the best performance with an accuracy of 85.2%. A summary of the reviewed papers is tabulated in Table 3.

TABLE 2. Summary of the Medical Diagnosis Studies

Reference | Year | Algorithm | Application | Accuracy (%)
Adler et al. [70] | 2016 | Random forest ensemble pruning | Glaucoma | -
Maji and Arora [65] | 2018 | C4.5 | Heart disease | 76.66
Maji and Arora [65] | 2018 | Hybrid DT of C4.5 and ANN | Heart disease | 78.14
Khan et al. [73] | 2019 | Optimal trees ensemble | Hepatitis | 87.7
Khan et al. [73] | 2019 | Optimal trees ensemble | Parkinson's disease | 91.4
Pathak et al. [64] | 2020 | C4.5 decision tree | Heart disease | 88.0
Yoo et al. [66] | 2020 | Deep learning-based decision tree | COVID-19 | 95.0
Mienye et al. [71] | 2021 | AdaBoost-DT | Chronic kidney disease | 100
Ilyas et al. [67] | 2021 | C4.5 | Chronic kidney disease | 85.5
Ilyas et al. [67] | 2021 | Random forest | Chronic kidney disease | 78.25
Ghiasi and Zendehboudi [68] | 2021 | Random forest and ET | Breast cancer | 100
Mienye and Sun [72] | 2021 | Cost-sensitive random forest | Cervical cancer | 98.8
Mienye and Sun [69] | 2021 | XGBoost | Heart disease | 95.6
Ahmad et al. [66] | 2022 | Random forest | Heart disease | 100
Ahmad et al. [66] | 2022 | CART | Heart disease | 100

TABLE 3. Summary of the Credit Risk and Fraud Detection Studies

Reference | Year | Algorithm | Application | Accuracy (%)
Nadim et al. [84] | 2019 | Random forest | Credit card fraud detection | 98.6
Makki et al. [85] | 2019 | C5.0 | Credit card fraud detection | 96.0
Wong et al. [80] | 2020 | Random forest | Credit risk prediction | 94.6
Wong et al. [80] | 2020 | Decision tree | Credit risk prediction | 92.1
Alam et al. [78] | 2020 | GBDT and k-means SMOTE | Credit risk prediction (German dataset) | 84.6
Alam et al. [78] | 2020 | GBDT and k-means SMOTE | Credit risk prediction (Taiwan dataset) | 89.0
Alam et al. [78] | 2020 | GBDT and k-means SMOTE | Credit card fraud detection | 87.1
Seera et al. [81] | 2021 | CART | Credit card fraud detection | 99.9
Hancock et al. [79] | 2021 | CatBoost | Healthcare insurance fraud | -
Yao et al. [76] | 2022 | Decision tree ensemble with SMOTE | Enterprise credit risk | -
Liu et al. [77] | 2022 | Augmented GBDT | Credit risk prediction | 93.8
Seera et al. [81] | 2024 | AdaBoost | Credit card fraud detection | 99.0
Adhegaonkar et al. [83] | 2024 | Random forest | Credit card fraud detection | 85.2

V. DISCUSSIONS AND FUTURE RESEARCH DIRECTIONS
Decision trees have proven to be effective in various domains, including healthcare and finance. However, like any other algorithm, decision trees have their limitations and areas for improvement. In this section, we explore some potential future research directions for decision trees that can enhance their performance and address their limitations.

Firstly, the handling of missing data is a crucial area of potential improvement for decision trees. Currently, decision trees either ignore instances with missing values or use surrogate splits to make predictions [86], [87]. However, these approaches may not always be optimal and can lead to biased or inaccurate results. Future research could focus on developing more sophisticated methods to handle missing data in decision trees, such as advanced imputation techniques or incorporating uncertainty estimation.

Another future research direction is enhancing the ability of decision trees to handle high-dimensional data [88]–[90]. Decision trees can struggle when faced with datasets that have a large number of features, as the tree structure becomes complex and prone to overfitting. Future research could explore techniques to improve the scalability and efficiency of decision trees in high-dimensional settings, such as feature selection methods or dimensionality reduction techniques.

Furthermore, while decision trees are known for their interpretability compared to other machine learning algorithms, they can still be difficult to understand and explain, especially when they become large and complex. Future research could investigate methods to simplify decision trees and make them more understandable to non-experts, such as rule extraction algorithms or visualisation techniques. Additionally, decision trees are sensitive to outliers and can easily be influenced by noisy data, leading to inaccurate predictions [91]. It might be worth examining the robustness of decision trees to outliers and noisy data and exploring methods to make them more robust, such as outlier detection techniques or robust splitting criteria.

Lastly, the application of decision trees in emerging fields and domains is a potential future research direction.

Lastly, the application of decision trees in emerging fields and domains is a potential future research direction. Decision trees have been extensively studied and applied in traditional domains such as healthcare, finance, and marketing. However, there are numerous emerging fields where decision trees can potentially make a significant impact. For example, decision trees could be applied in the field of autonomous vehicles to aid in decision-making processes, or in natural language processing to improve sentiment analysis and text classification tasks. Future research could explore the potential applications of decision trees in these emerging fields and investigate their effectiveness in solving complex problems.

VI. CONCLUSION
Decision trees have shown great potential and effectiveness in various fields. Their ability to analyse complex data and identify patterns and relationships makes them valuable in machine learning. This paper presented an overview of decision trees, from their early development to the recent high-performing tree-based ensemble methods. The article covered the main decision tree algorithms, such as CART, ID3, C4.5, C5.0, CHAID, and conditional inference trees, and reviewed their applications in medical diagnosis, credit risk, and fraud detection. This study will be beneficial to ML practitioners and researchers seeking to understand decision trees and the widely used tree-based algorithms.

REFERENCES
[1] J. G. Richens, C. M. Lee, and S. Johri, “Improving the accuracy of medical diagnosis with causal machine learning,” Nature Communications, vol. 11, 8 2020.
[2] G. Obaido, F. J. Agbo, C. Alvarado, and S. S. Oyelere, “Analysis of attrition studies within the computer sciences,” IEEE Access, 2023.
[3] S. Ahmed, M. M. Alshater, A. E. Ammari, and H. Hammami, “Artificial intelligence and machine learning in finance: A bibliometric review,” Research in International Business and Finance, vol. 61, p. 101646, 10 2022.
[4] G. Obaido, B. Ogbuokiri, C. W. Chukwu, F. J. Osaye, O. F. Egbelowo, M. I. Uzochukwu, I. D. Mienye, K. Aruleba, M. Primus, and O. Achilonu, “An improved ensemble method for predicting hyperchloremia in adults with diabetic ketoacidosis,” IEEE Access, vol. 12, pp. 9536–9549, 2024.
[5] C. Wang, J. Xu, S. Tan, and L. Yin, “Secure decision tree classification with decentralized authorization and access control,” Computer Standards & Interfaces, vol. 89, p. 103818, 4 2024.
[6] M. M. Rahman and S. A. Nisher, “Predicting average localization error of underwater wireless sensors via decision tree regression and gradient boosted regression,” in Proceedings of International Conference on Information and Communication Technology for Development, pp. 29–41, Springer Nature Singapore, 2023.
[7] T. O’Halloran, G. Obaido, B. Otegbade, and I. D. Mienye, “A deep learning approach for maize lethal necrosis and maize streak virus disease detection,” Machine Learning with Applications, p. 100556, 2024.
[8] R. Rivera-Lopez, J. Canul-Reich, E. Mezura-Montes, and M. A. Cruz-Chávez, “Induction of decision trees as classification models through metaheuristics,” Swarm and Evolutionary Computation, vol. 69, p. 101006, 3 2022.
[9] O. Sagi and L. Rokach, “Explainable decision forest: Transforming a decision forest into an interpretable tree,” Information Fusion, vol. 61, pp. 124–138, 9 2020.
[10] L.-a. Dong, X. Ye, and G. Yang, “Two-stage rule extraction method based on tree ensemble model for interpretable loan evaluation,” Information Sciences, vol. 573, pp. 46–64, 9 2021.
[11] D. Che, Q. Liu, K. Rasheed, and X. Tao, “Decision tree and ensemble learning algorithms with their applications in bioinformatics,” in Advances in Experimental Medicine and Biology, pp. 191–199, Springer New York, 2011.
[12] L. Canete-Sifuentes, R. Monroy, and M. A. Medina-Perez, “A review and experimental comparison of multivariate decision trees,” IEEE Access, vol. 9, pp. 110451–110479, 2021.
[13] Anuradha and G. Gupta, “A self explanatory review of decision tree classifiers,” 2014 Recent Advances and Innovations in Engineering (ICRAIE), IEEE, 5 2014.
[14] V. G. Costa and C. E. Pedreira, “Recent advances in decision trees: an updated survey,” Artificial Intelligence Review, vol. 56, pp. 4765–4800, 10 2022.
[15] C. Gupta and A. Ramdas, “Distribution-free calibration guarantees for histogram binning without sample splitting,” in International Conference on Machine Learning, pp. 3942–3952, PMLR, 2021.
[16] F. Mazurek, A. Tschand, Y. Wang, M. Pajic, and D. Sorin, “Rigorous evaluation of computer processors with statistical model checking,” MICRO ’23: 56th Annual IEEE/ACM International Symposium on Microarchitecture, ACM, 10 2023.
[17] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, pp. 123–140, 8 1996.
[18] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, pp. 81–106, 3 1986.
[19] J. R. Quinlan, C4.5: Programs for Machine Learning. Elsevier, 2014.
[20] I. D. Mienye, Y. Sun, and Z. Wang, “Prediction performance of improved decision tree-based algorithms: a review,” Procedia Manufacturing, vol. 35, pp. 698–703, 2019.
[21] S. Piramuthu, “Input data for decision trees,” Expert Systems with Applications, vol. 34, pp. 1220–1226, 2 2008.
[22] S. Hwang, H. G. Yeo, and J.-S. Hong, “A new splitting criterion for better interpretable trees,” IEEE Access, vol. 8, pp. 62762–62774, 2020.
[23] J.-S. Hong, J. Lee, and M. K. Sim, “Concise rule induction algorithm based on one-sided maximum decision tree approach,” Expert Systems with Applications, vol. 237, p. 121365, 2024.
[24] D. Bertsimas and J. Dunn, “Optimal classification trees,” Machine Learning, vol. 106, pp. 1039–1082, 4 2017.
[25] L. Rutkowski, M. Jaworski, L. Pietruczuk, and P. Duda, “The cart decision tree for mining data streams,” Information Sciences, vol. 266, pp. 1–15, 5 2014.
[26] C. J. Mantas, J. Abellán, and J. G. Castellano, “Analysis of credal-c4.5 for classification in noisy domains,” Expert Systems with Applications, vol. 61, pp. 314–326, 11 2016.
[27] G. S. Reddy and S. Chittineni, “Entropy based c4.5-sho algorithm with information gain optimization in data mining,” PeerJ Computer Science, vol. 7, p. e424, 4 2021.
[28] N. Peker and C. Kubat, “Application of chi-square discretization algorithms to ensemble classification methods,” Expert Systems with Applications, vol. 185, p. 115540, 12 2021.
[29] L. A. Badulescu, “A chi-square based splitting criterion better for the decision tree algorithms,” 2021 25th International Conference on System Theory, Control and Computing (ICSTCC), IEEE, 10 2021.
[30] F. Mahan, M. Mohammadzad, S. M. Rozekhani, and W. Pedrycz, “Chi-mflexdt: chi-square-based multi flexible fuzzy decision tree for data stream classification,” Applied Soft Computing, vol. 105, p. 107301, 7 2021.

[31] F. J. Mehedi Shamrat, S. Chakraborty, M. M. Billah, P. Das, J. N. Muna, and R. Ranjan, “A comprehensive study on pre-pruning and post-pruning methods of decision tree classification algorithm,” in 2021 5th International Conference on Trends in Electronics and Informatics (ICOEI), pp. 1339–1345, 2021.
[32] Y. Manzali and P. M. E. Far, “A new decision tree pre-pruning method based on nodes probabilities,” in 2022 International Conference on Intelligent Systems and Computer Vision (ISCV), pp. 1–5, 2022.
[33] S. Trabelsi, Z. Elouedi, and K. Mellouli, “Pruning belief decision tree methods in averaging and conjunctive approaches,” International Journal of Approximate Reasoning, vol. 46, pp. 568–595, 12 2007.
[34] T. Lazebnik and S. Bunimovich-Mendrazitsky, “Decision tree post-pruning without loss of accuracy using the sat-pp algorithm with an empirical evaluation on clinical data,” Data & Knowledge Engineering, vol. 145, p. 102173, 2023.
[35] E. Frantar and D. Alistarh, “SparseGPT: Massive language models can be accurately pruned in one-shot,” in Proceedings of the 40th International Conference on Machine Learning (A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, eds.), vol. 202 of Proceedings of Machine Learning Research, pp. 10323–10337, PMLR, 23–29 Jul 2023.
[36] B. Mahbooba, M. Timilsina, R. Sahal, and M. Serrano, “Explainable artificial intelligence (xai) to enhance trust management in intrusion detection systems using decision tree model,” Complexity, vol. 2021, pp. 1–11, 2021.
[37] S. J. Oh, B. Schiele, and M. Fritz, “Towards reverse-engineering black-box neural networks,” Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 121–144, 2019.
[38] E. Zihni, V. I. Madai, M. Livne, I. Galinovic, A. A. Khalil, J. B. Fiebach, and D. Frey, “Opening the black box of artificial intelligence for clinical decision support: A study predicting stroke outcome,” PLoS One, vol. 15, no. 4, p. e0231166, 2020.
[39] C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, “Conditional variable importance for random forests,” BMC Bioinformatics, vol. 9, pp. 1–11, 2008.
[40] S. Syed Mustapha, “Predictive analysis of students’ learning performance using data mining techniques: A comparative study of feature selection methods,” Applied System Innovation, vol. 6, no. 5, p. 86, 2023.
[41] S. Ben Jabeur, N. Stef, and P. Carmona, “Bankruptcy prediction using the xgboost algorithm and variable importance feature engineering,” Computational Economics, vol. 61, no. 2, pp. 715–741, 2023.
[42] J. R. Quinlan, “Data mining tools See5 and C5.0,” http://www.rulequest.com/see5-info.html, 2004.
[43] L. Breiman, Classification and Regression Trees. Routledge, 2017.
[44] M.-M. Chen and M.-C. Chen, “Modeling road accident severity with comparisons of logistic regression, decision tree and random forest,” Information, vol. 11, no. 5, 2020.
[45] D.-H. Lee, S.-H. Kim, and K.-J. Kim, “Multistage mr-cart: Multiresponse optimization in a multistage process using a classification and regression tree method,” Computers & Industrial Engineering, vol. 159, p. 107513, 9 2021.
[46] E. Belli and S. Vantini, “Measure inducing classification and regression trees for functional data,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 15, no. 5, pp. 553–569, 2022.
[47] H. Ishwaran, “The effect of splitting on random forests,” Machine Learning, vol. 99, pp. 75–118, 2015.
[48] G. V. Kass, “An exploratory technique for investigating large quantities of categorical data,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 29, no. 2, pp. 119–127, 1980.
[49] S. Kushiro, S. Fukui, A. Inui, D. Kobayashi, M. Saita, and T. Naito, “Clinical prediction rule for bacterial arthritis: Chi-squared automatic interaction detector decision tree analysis model,” SAGE Open Medicine, vol. 11, p. 205031212311609, 1 2023.
[50] H. Prasetyono, A. Abdillah, T. Anita, A. Nurfarkhana, and A. Sefudin, “Identification of the decline in learning outcomes in statistics courses using the chi-squared automatic interaction detection method,” Journal of Physics: Conference Series, vol. 1490, p. 012072, 3 2020.
[51] T. Hothorn, K. Hornik, and A. Zeileis, “Unbiased recursive partitioning: A conditional inference framework,” Journal of Computational and Graphical Statistics, vol. 15, no. 3, pp. 651–674, 2006.
[52] N. Levshina, “Conditional inference trees and random forests,” in A Practical Handbook of Corpus Linguistics, pp. 611–643, Springer International Publishing, 2020.
[53] B. Schivinski, “Eliciting brand-related social media engagement: A conditional inference tree framework,” Journal of Business Research, vol. 130, pp. 594–602, 6 2021.
[54] N. Younas, A. Ali, H. Hina, M. Hamraz, Z. Khan, and S. Aldahmani, “Optimal causal decision trees ensemble for improved prediction and causal inference,” IEEE Access, vol. 10, pp. 13000–13011, 2022.
[55] Z. Khan, A. Gul, O. Mahmoud, M. Miftahuddin, A. Perperoglou, W. Adler, and B. Lausen, “An ensemble of optimal trees for class membership probability estimation,” in Analysis of Large and Complex Data, pp. 395–409, Springer, 2016.
[56] I. D. Mienye and Y. Sun, “A survey of ensemble learning: Concepts, algorithms, applications, and prospects,” IEEE Access, vol. 10, pp. 99129–99149, 2022.
[57] Z. Zhang and C. Jung, “Gbdt-mo: Gradient-boosted decision trees for multiple outputs,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 7, pp. 3156–3167, 2021.
[58] M.-J. Jun, “A comparison of a gradient boosting decision tree, random forests, and artificial neural networks to model urban land use changes: the case of the seoul metropolitan area,” International Journal of Geographical Information Science, vol. 35, pp. 2149–2167, 3 2021.
[59] V. A. Dev and M. R. Eden, “Formation lithology classification using scalable gradient boosted decision trees,” Computers & Chemical Engineering, vol. 128, pp. 392–404, 2019.
[60] S. Demir and E. K. Sahin, “Comparison of tree-based machine learning algorithms for predicting liquefaction potential using canonical correlation forest, rotation forest, and random forest based on cpt data,” Soil Dynamics and Earthquake Engineering, vol. 154, p. 107130, 3 2022.
[61] E. K. Sahin, I. Colkesen, and T. Kavzoglu, “A comparative assessment of canonical correlation forest, random forest, rotation forest and logistic regression methods for landslide susceptibility mapping,” Geocarto International, vol. 35, pp. 341–363, 10 2018.
[62] F. L. Seixas, B. Zadrozny, J. Laks, A. Conci, and D. C. M. Saade, “A Bayesian network decision model for supporting the diagnosis of dementia, Alzheimer’s disease and mild cognitive impairment,” Computers in Biology and Medicine, vol. 51, pp. 140–158, 2014.
[63] G. Obaido, B. Ogbuokiri, I. D. Mienye, and S. M. Kasongo, “A voting classifier for mortality prediction post-thoracic surgery,” in International Conference on Intelligent Systems Design and Applications, pp. 263–272, Springer, 2022.
[64] A. K. Pathak and J. Arul Valan, “A predictive model for heart disease diagnosis using fuzzy logic and decision tree,” in Advances in Intelligent Systems and Computing, pp. 131–140, Springer Singapore, 12 2019.
[65] S. Maji and S. Arora, “Decision tree algorithms for prediction of heart disease,” in Information and Communication Technology for Competitive Strategies, pp. 447–454, Springer Singapore, 8 2018.

[66] G. N. Ahmad, S. Ullah, A. Algethami, H. Fatima, and S. M. H. Akhter, “Comparative study of optimum medical diagnosis of human heart disease using machine learning technique with and without sequential feature selection,” IEEE Access, vol. 10, pp. 23808–23828, 2022.
[67] H. Ilyas, S. Ali, M. Ponum, O. Hasan, M. T. Mahmood, M. Iftikhar, and M. H. Malik, “Chronic kidney disease diagnosis using decision tree algorithms,” BMC Nephrology, vol. 22, 8 2021.
[68] M. M. Ghiasi and S. Zendehboudi, “Application of decision tree-based ensemble learning in the classification of breast cancer,” Computers in Biology and Medicine, vol. 128, p. 104089, 1 2021.
[69] I. D. Mienye and Y. Sun, “Effective feature selection for improved prediction of heart disease,” in Pan-African Artificial Intelligence and Smart Systems (T. M. N. Ngatched and I. Woungang, eds.), (Cham), pp. 94–107, Springer International Publishing, 2022.
[70] W. Adler, O. Gefeller, A. Gul, F. K. Horn, Z. Khan, and B. Lausen, “Ensemble pruning for glaucoma detection in an unbalanced data set,” Methods of Information in Medicine, vol. 55, no. 06, pp. 557–563, 2016.
[71] I. D. Mienye, G. Obaido, K. Aruleba, and O. A. Dada, “Enhanced prediction of chronic kidney disease using feature selection and boosted classifiers,” in International Conference on Intelligent Systems Design and Applications, pp. 527–537, Springer, 2021.
[72] I. D. Mienye and Y. Sun, “Performance analysis of cost-sensitive learning methods with application to imbalanced medical data,” Informatics in Medicine Unlocked, vol. 25, p. 100690, 2021.
[73] Z. Khan, A. Gul, A. Perperoglou, M. Miftahuddin, O. Mahmoud, W. Adler, and B. Lausen, “Ensemble of optimal trees, random forest and random projection ensemble classification,” Advances in Data Analysis and Classification, vol. 14, pp. 97–116, 6 2019.
[74] V. García, A. I. Marqués, and J. S. Sánchez, “Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction,” Information Fusion, vol. 47, pp. 88–101, 2019.
[75] N. Arora and P. D. Kaur, “A bolasso based consistent feature selection enabled random forest classification algorithm: An application to credit risk assessment,” Applied Soft Computing, vol. 86, p. 105936, 2020.
[76] “Enterprise credit risk prediction using supply chain information: A decision tree ensemble model based on the differential sampling rate, synthetic minority oversampling technique and AdaBoost,” Expert Systems, no. 6.
[77] W. Liu, H. Fan, and M. Xia, “Credit scoring based on tree-enhanced gradient boosting decision trees,” Expert Systems with Applications, vol. 189, p. 116034, 2022.
[78] T. M. Alam, K. Shaukat, I. A. Hameed, S. Luo, M. U. Sarwar, S. Shabbir, J. Li, and M. Khushi, “An investigation of credit card default prediction in the imbalanced datasets,” IEEE Access, vol. 8, pp. 201173–201198, 2020.
[79] J. T. Hancock and T. M. Khoshgoftaar, “Gradient boosted decision tree algorithms for medicare fraud detection,” SN Computer Science, vol. 2, 5 2021.
[80] Y. Wang, Y. Zhang, Y. Lu, and X. Yu, “A comparative assessment of credit risk model based on machine learning – a case study of bank loan data,” Procedia Computer Science, vol. 174, pp. 141–149, 2020 (2019 International Conference on Identification, Information and Knowledge in the Internet of Things).
[81] M. Seera, C. P. Lim, A. Kumar, L. Dhamotharan, and K. H. Tan, “An intelligent payment card fraud detection system,” Annals of Operations Research, vol. 334, pp. 445–467, 6 2021.
[82] A. Rawat, S. S. Aswal, S. Gupta, A. P. Singh, S. P. Singh, and K. C. Purohit, “Performance analysis of algorithms for credit card fraud detection,” in 2024 2nd International Conference on Disruptive Technologies (ICDT), pp. 567–570, 2024.
[83] V. R. Adhegaonkar, A. R. Thakur, and N. Varghese, “Advancing credit card fraud detection through explainable machine learning methods,” in 2024 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), pp. 792–796, 2024.
[84] A. H. Nadim, I. M. Sayem, A. Mutsuddy, and M. S. Chowdhury, “Analysis of machine learning techniques for credit card fraud detection,” in 2019 International Conference on Machine Learning and Data Engineering (iCMLDE), pp. 42–47, 2019.
[85] S. Makki, Z. Assaghir, Y. Taher, R. Haque, M.-S. Hacid, and H. Zeineddine, “An experimental study with imbalanced classification approaches for credit card fraud detection,” IEEE Access, vol. 7, pp. 93010–93022, 2019.
[86] S. Nijman, A. Leeuwenberg, I. Beekers, I. Verkouter, J. Jacobs, M. Bots, F. Asselbergs, K. Moons, and T. Debray, “Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review,” Journal of Clinical Epidemiology, vol. 142, pp. 218–229, 2022.
[87] R. V. McCarthy, M. M. McCarthy, W. Ceccucci, and L. Halawi, Predictive Models Using Decision Trees, pp. 123–144. Cham: Springer International Publishing, 2019.
[88] A. Mhasawade, G. Rawal, P. Roje, R. Raut, and A. Devkar, “Comparative study of svm, knn and decision tree for diabetic retinopathy detection,” in 2023 International Conference on Computational Intelligence and Sustainable Engineering Solutions (CISES), pp. 166–170, 2023.
[89] T. Wang, R. Gault, and D. Greer, “Cutting down high dimensional data with fuzzy weighted forests (fwf),” in 2022 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8, 2022.
[90] Z. Azam, M. M. Islam, and M. N. Huda, “Comparative analysis of intrusion detection systems and machine learning-based model analysis through decision tree,” IEEE Access, vol. 11, pp. 80348–80391, 2023.
[91] Y. Xia, “A novel reject inference model using outlier detection and gradient boosting technique in peer-to-peer lending,” IEEE Access, vol. 7, pp. 92893–92907, 2019.

IBOMOIYE DOMOR MIENYE received the B.Eng. degree in electrical and electronic engineering and the M.Sc. degree (cum laude) in computer systems engineering from the University of East London, in 2012 and 2014, respectively, and the PhD degree in electrical and electronic engineering from the University of Johannesburg, South Africa. His research interests include machine learning and deep learning for finance and healthcare applications.

NOBERT JERE received the Ph.D. and M.Sc. degrees in Computer Science from the University of Fort Hare, South Africa, in 2013 and 2009, respectively. He is currently an Associate Professor in the Department of Information Technology at Walter Sisulu University, South Africa. He has authored and co-authored numerous peer-reviewed journal papers and conference proceedings, chaired and co-chaired international conferences, and serves as a reviewer for numerous reputable journals. His main research interests are in ICT for sustainable development.
