A Modified ID3 Decision Tree Algorithm Based On Cumulative Residual Entropy
A R T I C L E  I N F O

Keywords: Cumulative Residual Entropy (CRE), ID3 decision tree algorithm, Information gain, Machine learning

A B S T R A C T

In this paper, we propose a new modification of the traditional ID3 decision tree algorithm through cumulative residual entropy (CRE). We discuss the principles of decision trees, including entropy and information gain, and introduce the concept, properties, and advantages of CRE as an alternative to Shannon entropy in decision trees. By running the proposed decision tree algorithm (named CRDT) and the ID3 decision tree algorithm on ten real datasets, we evaluate and compare the accuracy and efficiency of the two algorithms using appropriate criteria; the results indicate that the performance of CRDT is considerably more accurate and closer to reality than that of ID3. Furthermore, we compare the performance of three decision tree-based algorithms, CRDT, CART, and Random Forest, via the MSE, RMSE, R-squared and training time criteria. The results show the superiority of the newly proposed model compared to the alternatives.
1. Introduction

Over recent years, the progress of technology and the widespread utilization of computers for data recording have significantly augmented the volume of data generated across various disciplines. Accordingly, researchers leverage these recorded data to advance and refine their undertakings. Consequently, there exists a demand for an intelligent tool that can effectively organize the data and extract pertinent information. This concern has instigated the emergence of scientific disciplines such as data mining. It is evident that as the volume of data expands, the significance of these tools proportionally escalates. Generally, data mining methods can be categorized into two distinct groups, each providing a tailored solution based on the nature of the problem. See, e.g., Han et al. (2012) and Coenen (2011).

An exemplary instance of a data mining method is the decision tree. This method is regarded as a valuable and promising tool for categorizing datasets for forecasting (Quinlan, 1986). The decision tree ultimately presents the identified patterns in the form of rules. Generally, the output of a decision tree can be viewed as a tree structure encompassing a collection of nodes and leaves. In a broad sense, the construction of trees, contingent upon whether the hyperplane of the tree division is parallel to the axes or diagonal, encompasses two subjects: univariate decision trees (UDTs) and multivariate decision trees (MDTs). To construct univariate decision trees, numerous division criteria have been put forth. Noteworthy examples of such measures can be found in the well-known algorithms ID3, C4.5, CART, and CHAID. The nodes within a decision tree can be categorized into two distinct types: decision nodes and leaf nodes. At each decision node, the locally optimal feature is chosen to partition the data into child nodes. This process is iteratively carried out until a leaf node is reached, at which point further partitioning is rendered unfeasible. The selection of the best feature is contingent upon a criterion that evaluates the effectiveness of a segmentation. More details can be found in Maimon and Rokach (2014).

One of the most commonly used measures for dividing data into segments is information gain (IG), which is based on impurity. Decision trees based on IG exhibit exceptional performance when handling datasets with a balanced distribution of classes. However, in the case of an imbalanced dataset, IG tends to favor the majority class due to its reliance on the prior class probabilities. Following Drummond and Holte (2000), this phenomenon is commonly referred to as skew sensitivity. To tackle the issue of imbalanced classes and improve the
performance of decision trees, Akash et al. (2019) introduced a partitioning criterion known as the inter-node Hellinger distance (iHD). This criterion assesses the distance between the parent and child nodes using the Hellinger distance measure.

The commonly employed measure of uncertainty utilized in formulating decision trees, such as information gain, is Shannon's entropy, as established by Quinlan (1986). Highlighting the limitations of decision trees constructed using Shannon's entropy, Maszczyk and Duch (2008) proposed the Renyi and Tsallis entropies as alternatives to enhance the overall performance of the C4.5 decision tree. This particular approach can be applied to any decision tree and can subsequently be considered in the information selection algorithm. Additionally, Sharma et al. (2013) conducted a comparative analysis of the C4.5 decision tree algorithm using various entropies to assess their respective performances.

To enhance the classification precision of the ID3 algorithm, Adewole and Udeh (2018) implemented the above-mentioned algorithm using quadratic entropy in place of Shannon entropy. The results reported by them indicated that the utilization of quadratic entropy in the ID3 algorithm led to a noteworthy improvement in its accuracy when compared to Shannon entropy. Wang et al. (2015) and Wang and Xia (2017) demonstrated that the division criteria employed in the ID3, C4.5 and CART decision tree algorithms can be unified within the Tsallis entropy framework. Furthermore, to augment the performance of the decision tree, they introduced a novel decision tree rooted in Tsallis entropy, dubbed UTCDT.

The ID3 algorithm is prone to favoring attributes with numerous values. Jin et al. (2009) addressed this issue by incorporating an association function (AF) to enhance feature selection. Their experimental findings demonstrate the efficacy of this approach in rectifying the limitations of ID3 and generating more rational and impactful rules. Cheng et al. (1988) introduced GID3, a generalized version of the ID3 algorithm, addressing the issue of overspecialization in ID3. By identifying two causes of overspecialization, the authors developed GID3 and applied it to automate the Reactive Ion Etching (RIE) process in semiconductor manufacturing. The empirical results demonstrate GID3's superiority over ID3 across various performance measures, with minimal increase in computational complexity. Xu et al. (2006) introduced ID3+, an enhanced decision tree algorithm designed to overcome limitations of the traditional ID3 approach. By incorporating techniques such as autonomous backtracking and the handling of unknown attribute values, ID3+ demonstrates improved robustness and efficacy in decision tree learning systems, as evidenced by empirical experiments.

Wang et al. (2017) put forth a two-term Tsallis entropy information metric (TEIM) algorithm that incorporates a novel splitting criterion and a new construction method for decision trees. The novel partitioning criterion is rooted in two-term Tsallis conditional entropy, which outperforms the conventional one-term partitioning criterion. Chaji (2023) introduced a partitioning approach founded on the t-entropy measure. The efficacy of the proposed approach was examined on three data sets, and the outcomes demonstrated that this approach exhibits a more precise performance compared to the renowned Gini index, Tsallis, Shannon, and Rényi methods. Singh and Chhabra (2021) devised a novel partitioning approach that combines the Gini index and entropy to generate a decision tree. This innovative approach has been labeled EGIA.

In this paper, we propose a novel modification of the ID3 decision tree algorithm using cumulative residual entropy (CRE). The main goal of this modification is to circumvent the need to discretize a continuous target variable, which reduces the information. In other words, the new model is designed in such a way that the information in all individual observations of the target variable is used to create a decision tree without discretization. It should be noted that in all generalizations and expansions of the ID3 decision tree algorithm, the problem of discretization and information reduction still exists. Our approach therefore increases the efficiency and accuracy of the tree. The rest of the paper is organized as follows. In Section 2, the basic concepts and definitions of the decision tree are reviewed, together with important and recent decision tree algorithms. The main results of the article are presented in Section 3. The evaluation methods for the presented models are described in Section 4, and in Section 5 a series of real datasets is used to show the capability of the new decision tree.

2. Preliminaries

The decision tree possesses several properties; a brief discussion of them is given below. The reader can consult Maimon and Rokach (2014) for more details.

1- The decision tree partitions the data into distinct groups, ensuring no data is eliminated during the classification process. Moreover, the data set in the parent node remains identical to the aggregate data in the child nodes.
2- Due to its graphical depiction, this method facilitates comprehension of the obtained outcomes for individuals of any background, enhancing the popularity of this approach.
3- A decision tree can be effectively employed for both continuous and discrete data.
4- In supervised learning, the presence of the target variable necessitates the identification of explanatory variables that exert a significant influence on the classification of the predictive model. As pointed out in Han et al. (2012), a decision tree can be employed to ascertain the variables that possess a substantial impact on prediction and classification.

Suppose a dataset comprises a collection of characteristics as explanatory variables and a distinctive label as a target variable. Depending on whether the target variable is continuous or discrete, decision trees can be categorized as follows (Maimon & Rokach, 2014):

1- Classification trees: These trees yield a discrete set as the output.
2- Regression trees: These trees yield a real number as the output.

The structure of a decision tree consists of a root node positioned at the top and leaves situated at the bottom of the tree. The initial dataset is placed in the root node, followed by a test to divide the node. Several methods are available to select the initial test, all with the same objective: they strive to choose the most effective way to separate the target classes. This process continues until the sample arrives at a leaf node. All samples grouped in a leaf are regarded as a distinct class. Thus, decision tree-building algorithms utilize the divide-and-conquer approach to construct the decision tree. See, e.g., Han et al. (2012) for more details.

In decision tree algorithms, it is crucial to consider the following questions:

1- How can the most suitable characteristic be chosen to partition the dataset at each node?
2- How does the algorithm for constructing the tree determine when to stop?

The prevalent test design utilized in classification models involves the completely random partitioning of the dataset into two subsets: training data and testing data. The model is trained using the training data, which consist of a set of input features and corresponding class labels. Subsequently, the class label of a sample from the test set is predicted. The division of the existing dataset into training and test groups is carried out before the creation and evaluation of the tree model (Maimon & Rokach, 2014).

Decision tree algorithms constantly endeavor to select the optimal feature from the available features. As discussed in Han et al. (2012), the most commonly used criteria for feature selection include information gain, the Gini index, the gain ratio, and the likelihood ratio.
Information gain is a well-known criterion utilized in the construction of decision trees. It is expressed based on Shannon entropy. Mathematically, we can write:

InformationGain(A) = Entropy(D) − Entropy_A(D).   (1)

Eq. (1) computes the intended measure for a characteristic, denoted by A, while D signifies the target variable found within the training dataset. The first and second components of the expression above are evaluated as follows:

Entropy(D) = − ∑_{i=1}^{c} p_i × log2(p_i),

Entropy_A(D) = ∑_{j=1}^{ν} (|D_j| / |D|) × Entropy(D_j),

where Entropy(D) is Shannon's entropy of the target variable, c is the number of classes in the training data set, p_i is the probability of a data sample belonging to the i-th class, and Entropy_A(D) is the conditional entropy of the target variable given the explanatory variable A. Here, ν is the number of branches and D_j is the part of the primary data whose value of the characteristic A is v_j. Also, |D| denotes the size of D.
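As a concrete illustration of Eq. (1), the short R sketch below computes the Shannon entropy of a categorical target, the conditional entropy given one categorical feature, and the resulting information gain. The vectors target and feature are hypothetical toy data and the helper names are ours; this is only an illustration of the formulas, not code from the paper.

# Sketch: information gain of a categorical feature (Eq. (1)), toy data only
shannon_entropy <- function(y) {
  p <- table(y) / length(y)                # class proportions p_i
  -sum(p * log2(p))                        # Entropy(D) = -sum p_i log2 p_i
}

information_gain <- function(y, x) {
  w <- table(x) / length(x)                # weights |D_j| / |D|
  cond <- sum(sapply(names(w), function(v) w[[v]] * shannon_entropy(y[x == v])))
  shannon_entropy(y) - cond                # Entropy(D) - Entropy_A(D)
}

# hypothetical toy data
target  <- c("pass", "pass", "fail", "pass", "fail", "fail")
feature <- c("M",    "F",    "M",    "F",    "M",    "F")
information_gain(target, feature)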
2.1. Decision tree algorithms

There exist numerous algorithms for constructing decision trees. In this section, we describe some of the most well-known ones.

2.1.1. The ID3 algorithm
This algorithm, initially introduced by Quinlan (1986), is one of the most prevalent and simplest methods for constructing decision trees. The criterion for feature selection in this algorithm is information gain. The ID3 algorithm terminates either when all remaining samples belong to the same class or when the value of the best information gain criterion has been determined. Notably, this algorithm does not employ a pruning technique and cannot handle quantitative features or incomplete data.

2.1.2. The C4.5 algorithm
As a generalization of the ID3 algorithm, the C4.5 algorithm, introduced by Quinlan (1996), utilizes the gain ratio index to select features for dividing and constructing a decision tree. When the number of samples is less than a specified threshold, the C4.5 algorithm provides some sensible solutions. In contrast to the ID3 algorithm, this algorithm incorporates post-pruning methods. Furthermore, similar to the ID3 algorithm, it can accommodate small amounts of data as input, with the ability to adapt to incomplete data with slight modifications.

2.1.3. The CART algorithm
Breiman (2017) introduced the CART algorithm to establish a connection between regression and classification trees. This algorithm produces a binary decision tree, where each node has two branches based on the splitting criterion. A pruning technique is incorporated into this algorithm. Additionally, the CART algorithm can generate regression trees, where the leaf nodes predict a real number as a class label (Coenen, 2011).

2.1.4. The CHAID algorithm
In the 1970s, applied statistics researchers developed several algorithms for generating and constructing decision trees. Notable examples include AID, MAID, THAID, and CHAID. The CHAID algorithm (Kiss, 2003) was initially designed for nominal variables. This algorithm employs various statistical tests depending on the type of class label. The termination condition for the CHAID algorithm is either reaching a predetermined maximum depth or the number of samples in the current node falling below a specified threshold. It is important to note that the CHAID algorithm employs no heuristic method and cannot handle incomplete values (Han et al., 2012).

3. Main results

In various algorithms, such as ID3 and C4.5, discretization becomes necessary when dealing with continuous variables, leading to the loss of certain information. To circumvent the need for discretizing the target variable and enhance the accuracy of the decision tree, we propose a novel algorithm. This algorithm eliminates the requirement of discretizing the target variable and results in a tree that exhibits improved accuracy and efficiency.

Rao et al. (2004) introduced an alternative metric for measuring uncertainty that can deal with both continuous and discrete variables. To give more details, we provide some background material as follows.

Definition 3.1. Suppose X is a non-negative continuous random variable with survival function F̄(x) = P(X > x). The cumulative residual entropy (CRE) is defined as

ε(X) = − ∫_0^∞ F̄(x) log F̄(x) dx.

Also, if X is a discrete non-negative random variable with survival function F̄(x), where x_0 < x_1 < ⋯ < x_b and b ≤ ∞, then the CRE is defined as follows (Baratpour & Bami, 2012):

ε(X) = − ∑_{i=1}^{b} P(X ≥ x_i) log P(X ≥ x_i) (x_i − x_{i−1}).

Properties of the CRE can be found in Rao (2005), Di Crescenzo and Longobardi (2009), and Navarro et al. (2010).

The main idea of this article is to use the cumulative residual entropy (CRE) instead of the Shannon entropy in the ID3 decision tree algorithm, without the need to discretize the target variable (and thereby lose information). Because the CRE calculation is based on all of the data, the new method is closer to reality and makes sense when employed in real applications. The main reasons to consider the CRE in our research are as follows:

• Although Shannon's entropy has also been extended to continuous random variables through H(F) = − ∫ f(x) log f(x) dx, this entropy can be calculated only when the density function of the data has a closed form. Moreover, estimating the density function is complicated and sometimes deriving it is practically impossible. Also, while the Shannon entropy of a discrete distribution is always positive, the differential entropy of a continuous variable may take negative values. In addition, the differential entropy estimator is inconsistent. To summarize some essential properties of the CRE, we can highlight:
• For continuous and discrete variables, the CRE estimator is consistent.
• CRE is always non-negative.
• The CRE can be easily calculated from the sample data, and the calculations converge asymptotically to the actual values.
• There is no need to estimate the density function to calculate the CRE. As a result, the work can be carried out with high accuracy without losing data (Rao et al., 2004).

A relationship can be established between the CRE and Shannon's entropy through the equilibrium distribution of X, with density function f_e(x) = F̄(x)/μ, where μ = E(X). The concept of the equilibrium distribution, which originated in renewal theory, has been pivotal in reliability theory and various other fields. It serves as a fundamental tool for
So, we have the following estimator of the CRE based on the sample data:

ε̂(X) = − ∑_{i=1}^{n} F̄_n(x_(i)) log F̄_n(x_(i)) (x_(i+1) − x_(i)) = − ∑_{i=1}^{n} (1 − i/n) log(1 − i/n) (x_(i+1) − x_(i)),   (2)

where x_(1) ≤ ⋯ ≤ x_(n) are the ordered sample values and F̄_n is the empirical survival function. Following Kaplan and Meier (1958), another estimator of the CRE can be written based on the Kaplan-Meier estimator of the survival function, given as F̄(t) = ∏_{t_i ≤ t} (n_i − d_i)/n_i.

3.1. New modified ID3 decision tree algorithm

Suppose that, in a scientific investigation, the target variable Y and the k features X_1, …, X_k build the training data set of size n, given as follows:

y_1  x_11  x_21  ⋯  x_k1
y_2  x_12  x_22  ⋯  x_k2
 ⋮     ⋮     ⋮    ⋱    ⋮
y_n  x_1n  x_2n  ⋯  x_kn

To make the model construction precise, suppose that Y is a continuous random variable and the X_i's are qualitative (categorical) random variables. In our proposed decision tree algorithm, which we denote by CRDT, the information gain criterion for X_i is calculated based on the CRE as follows:

InformationGain(X_i) = CRE(Y) − CRE_{X_i}(Y),   (3)

where CRE(Y) = − ∑_{i=1}^{n} P(Y ≥ y_i) log P(Y ≥ y_i) (y_i − y_{i−1}) is the cumulative residual entropy of the target variable in the training data set, and CRE_{X_i}(Y) is the conditional cumulative residual entropy, which, in analogy with Entropy_A(D) in Section 2, is calculated as

CRE_{X_i}(Y) = ∑_{j=1}^{ν} (|D_j| / |D|) × CRE(Y_j),

where ν is the number of levels of X_i, D_j is the subset of the training data whose value of X_i equals the j-th level, and Y_j denotes the target values in D_j (see also the R code in Appendix A).

Step 4. The feature with the maximum information gain is selected as the dividing criterion at the root of the tree, the training data set is split according to the number of levels of the root feature, and the same process of Steps 1 to 4 continues until the final tree is obtained.

4. Assessment

One of the critical challenges in the KDD (Knowledge Discovery in Databases) process lies in formulating effective measures to assess the quality of analytical results. In addition, decision tree performance evaluation represents a fundamental aspect of analysis based on machine learning principles. Although there are several criteria for evaluating the predictive performance of decision trees, additional factors such as the computational complexity and comprehensibility of the resulting tree may also be important. In the following, some of these criteria are used to evaluate the performance of the models. Criteria such as accuracy, the confusion matrix, the F-score and the ROC curve are used to check and evaluate the performance of ID3 and CRDT, and the MSE, RMSE, R-squared and training time criteria are used to check and evaluate the performance of Random Forest, CRDT and CART. To read more about these criteria, refer to Han et al. (2012).

5. Comparisons and application to real datasets

In this section, in two separate subsections, the proposed decision tree (CRDT) is compared with ID3, CART and Random Forest based on numerous real datasets and different evaluation indices.

5.1. Comparing CRDT with ID3

In this section, we conducted an in-depth analysis using ten diverse datasets, each exhibiting variations in record count and field composition. To ensure the reliability and integrity of our findings, we carried out a comprehensive data preparation process, meticulously cleaning the datasets to eliminate any inconsistencies or outliers. Additionally, we employed cross-validation for robust model evaluation. By systematically partitioning the data and training our models
Table 1. The first ten observations of the data.
Student Sex Address Medu Fedu Mjob Paid Romantic Goout Score
1 F C 4 4 housewife No No 4 6
2 F C 1 1 housewife No No 3 6
3 F C 1 1 housewife Yes No 2 10
4 F C 4 2 health Yes Yes 2 15
5 F C 3 3 other Yes No 2 10
6 F C 4 3 services Yes No 2 15
7 M C 2 2 other No No 4 11
8 M C 4 4 other No No 4 6
9 F C 3 2 services Yes No 2 19
10 F C 3 4 other Yes No 1 15
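To make the CRDT split criterion concrete, the R sketch below implements the empirical CRE of Eq. (2) and the CRE-based information gain of Eq. (3), and applies them to the Score and Sex columns of the ten observations in Table 1. The helper names cre_hat and cre_gain are ours; this is an illustrative sketch of the criterion, not the authors' full implementation (which is given in Appendix A).

# Sketch: empirical CRE (Eq. (2)) and CRE-based information gain (Eq. (3));
# helper names are ours, data taken from Table 1
cre_hat <- function(y) {
  y <- sort(y)
  n <- length(y)
  if (n < 2) return(0)                      # a single value carries no residual uncertainty
  S <- 1 - (1:(n - 1)) / n                  # empirical survival function at the order statistics
  -sum(S * log(S) * diff(y))                # - sum F_n log F_n (x_(i+1) - x_(i))
}

cre_gain <- function(y, x) {
  w <- table(x) / length(x)                 # relative frequency of each level of x
  cond <- sum(sapply(names(w), function(v) w[[v]] * cre_hat(y[x == v])))
  cre_hat(y) - cond                         # CRE(Y) - CRE_X(Y)
}

score <- c(6, 6, 10, 15, 10, 15, 11, 6, 19, 15)           # Score column of Table 1
sex   <- c("F","F","F","F","F","F","M","M","F","F")       # Sex column of Table 1
cre_gain(score, sex)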
Fig. 3. ROC curve of ID3 (a) and CRDT (b) (First dataset).
Table 2. Comparing and assessment criteria of the ID3 and CRDT decision tree (First dataset): Accuracy, Precision, Recall, F1-score.
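ROC curves such as those compared in Fig. 3 can be reproduced from class labels and predicted scores; the sketch below uses the pROC package and assumes labels and scores are hypothetical vectors produced by a fitted tree, not the paper's actual outputs.

# Sketch: ROC curve and AUC from hypothetical predictions (pROC package assumed available)
library(pROC)

labels <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)                        # true binary classes (hypothetical)
scores <- c(0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.5, 0.65, 0.3)   # predicted scores (hypothetical)

roc_obj <- roc(response = labels, predictor = scores)   # build the ROC object
plot(roc_obj, main = "ROC curve (sketch)")              # curve comparable in form to Fig. 3
auc(roc_obj)                                            # area under the curve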
• The dataset is divided into two separate groups, namely training and testing. The distribution ratio between these groups is determined according to the amount of data in each data set.
• Steps 1 to 4 described in Section 3.1 are performed on the training part of the data from the previous step.
• Ultimately, the target variable is categorized by the algorithm, resulting in the creation of the final leaves of the decision tree.
• Using the testing part of the data, the predictive performance of the decision trees is evaluated based on the criteria studied in Section 4.

To analyze the proposed datasets, both the ID3 and CRDT decision trees were coded and implemented in R (Appendix A presents the R code of CRDT). The results of these implementations are presented below.
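The classification criteria of Section 4 can be computed directly from a confusion matrix; the R sketch below does so for hypothetical predicted and actual class vectors. It only illustrates the formulas and is not the code used to produce the reported tables.

# Sketch: accuracy, precision, recall and F1-score from a confusion matrix (hypothetical data)
actual    <- factor(c("A", "A", "B", "B", "A", "B", "A", "B"))
predicted <- factor(c("A", "B", "B", "B", "A", "A", "A", "B"), levels = levels(actual))

cm <- table(Predicted = predicted, Actual = actual)   # confusion matrix
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- diag(cm) / rowSums(cm)                   # per-class precision
recall    <- diag(cm) / colSums(cm)                   # per-class recall
f1        <- 2 * precision * recall / (precision + recall)

list(confusion = cm, accuracy = accuracy,
     precision = precision, recall = recall, f1 = f1)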
Fig. 6. ROC curve of ID3 (a) and CRDT (b) (Second dataset).
1- Score: The score ranges from 0 to 20, serving as the target variable
for this study.
2- Sex: The sex variable is a categorized variable set as either female (F)
or male (M).
Fig. 8. The ID3 decision tree (Third dataset).
Fig. 9. ROC curve of ID3 (a) and CRDT (b) (Third dataset).
Table 4. Comparing and assessment criteria of the ID3 and CRDT decision tree (Third dataset): Accuracy, Precision, Recall, F1-score.
Fig. 12. ROC curve of ID3 (a) and CRDT (b) (Fourth dataset).
Table 5. Comparing and assessment criteria of the ID3 and CRDT decision tree (Fourth dataset): Accuracy, Precision, Recall, F1-score.
Fig. 15. ROC curve of ID3 (a) and CRDT (b) (Fifth dataset).
Table 6. Comparing and assessment criteria of the ID3 and CRDT decision tree (Fifth dataset): Accuracy, Precision, Recall, F1-score.
Fig. 18. ROC curve of ID3 (a) and CRDT (b) (Sixth dataset).
Table 7. Comparing and assessment criteria of the ID3 and CRDT decision tree (Sixth dataset): Accuracy, Precision, Recall, F1-score.
indicates the better performance and efficiency of the CRDT decision tree (Figs. 16–18 and Table 7).

5.1.7. Seventh data set
The seventh dataset is the "Energy Efficiency" dataset (Tsanas & Xifara, 2012), which is used to assess the heating load and cooling load requirements of buildings (that is, energy efficiency) as a function of building parameters. This dataset contains 768 samples with 8 features. The similar outputs of the previous datasets are presented below, which indicate the better performance and efficiency of the CRDT decision tree.

5.1.9. Ninth data set
The ninth dataset contains information on births and deaths (Stats, 2024) in New Zealand for the year ending December 2022. The dataset contains 897 records with 4 attributes. The similar outputs of the previous datasets are presented below, which indicate the better performance and efficiency of the CRDT decision tree (Figs. 25–27 and Table 10).

5.1.10. Tenth data set
The tenth dataset contains information about the performance of the insurance agency (Moneystore, 2022). The dataset contains details of an
Fig. 21. ROC curve of ID3 (a) and CRDT (b) (Seventh dataset).
Fig. 24. ROC curve of ID3 (a) and CRDT (b) (Eighth dataset).
Fig. 27. ROC curve of ID3 (a) and CRDT (b) (Ninth dataset).
Table 10. Comparing and assessment criteria of the ID3 and CRDT decision tree (Ninth dataset): Accuracy, Precision, Recall, F1-score.
relationship effectively.

Divergent importance ratings: While the top features were similar, the relative importance ratings of the remaining features often differed between the CRDT and Random Forest models. This suggests that the models may focus on different aspects of the data or capture distinct relationships between features and the target variable.

Complementary insight: By examining differences in feature importance ratings, we can identify areas where the models provide complementary insight. For example, in the lastmat dataset, the CRDT model gives more importance to characteristics related to gender and education, while the Random Forest model emphasizes characteristics related to occupation. Combining these perspectives can lead to a more comprehensive understanding of the underlying drivers of the target variable.

Performance implications: Feature importance analysis correlates with observed model performance metrics, such as MSE, RMSE, and R-squared. For example, in the Concrete-Compressive-Strength dataset, the

Fig. 28. The CRDT decision tree (Tenth dataset).
Fig. 30. ROC curve of ID3 (a) and CRDT (b) (Tenth dataset).
Table 12. Comparing CRDT, CART and Random Forest in five real datasets via MSE, RMSE, R-squared and training time; columns: Dataset name, Type of Tree, MSE, RMSE, R-squared, Training time (seconds).
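The MSE, RMSE, R-squared and training-time figures summarized in Table 12 can be obtained as in the hedged sketch below, which uses the rpart and randomForest packages for CART and Random Forest on a hypothetical data frame dat with numeric target y; the CRDT row would come from the authors' own implementation (Appendix A) and is not reproduced here.

# Sketch: MSE, RMSE, R-squared and training time for CART and Random Forest
# (hypothetical data; CRDT itself is the authors' implementation and is omitted)
library(rpart)
library(randomForest)

set.seed(1)
dat   <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- 3 * dat$x1 - 2 * dat$x2 + rnorm(200, sd = 0.3)   # hypothetical regression data

idx   <- sample(nrow(dat), 0.7 * nrow(dat))
train <- dat[idx, ]
test  <- dat[-idx, ]

eval_model <- function(pred, obs) {
  mse <- mean((obs - pred)^2)
  c(MSE = mse, RMSE = sqrt(mse),
    R2 = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2))
}

t_cart <- system.time(fit_cart <- rpart(y ~ ., data = train))          # CART training time
t_rf   <- system.time(fit_rf   <- randomForest(y ~ ., data = train))   # RF training time

rbind(CART = c(eval_model(predict(fit_cart, test), test$y), Time = t_cart[["elapsed"]]),
      RF   = c(eval_model(predict(fit_rf,   test), test$y), Time = t_rf[["elapsed"]]))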
Fig. 31. Feature Importance of CRDT (a) and Random Forest (b) (lastmat dataset).
Fig. 32. Feature Importance of CRDT (a) and Random Forest (b) (Wine-Quality dataset).
Fig. 33. Feature Importance of CRDT (a) and Random Forest (b) (Concrete-Compressive-Strength dataset).
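The Random Forest feature-importance values plotted in Figs. 31–33 can be extracted as in the sketch below (randomForest package, hypothetical data and model); an analogous ranking for CRDT would have to be derived from the authors' own implementation, for example from the total CRE gain contributed by each feature, which we only indicate here as an assumption.

# Sketch: feature importance for the Random Forest side of Figs. 31-33
# (hypothetical data/model; the CRDT importances come from the authors' code)
library(randomForest)

set.seed(2)
df   <- data.frame(x1 = runif(150), x2 = runif(150), x3 = runif(150))
df$y <- 2 * df$x1 + df$x3 + rnorm(150, sd = 0.2)

fit_rf <- randomForest(y ~ ., data = df, importance = TRUE)
importance(fit_rf)       # %IncMSE and IncNodePurity per feature
varImpPlot(fit_rf)       # bar plot comparable in form to panel (b) of Figs. 31-33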
related trees such as CART and Random Forest based on the MSE, RMSE, R-squared and training time criteria in five real datasets, and the results indicated that the performance of CRDT was better than, and sometimes similar to, that of the other trees. Therefore, overall, CRDT is a suitable alternative to ID3 and a strong competitor to other related trees, both in terms of being close to reality and in terms of the various evaluation indicators.

Future research could focus on comparative analyses with other data mining techniques and on further exploration of the scalability and adaptability of the CRDT algorithm to larger datasets. Another future direction is to extend the present model with an algorithm that also avoids the discretization of the feature variables, based on the conditional cumulative residual entropy.

CRediT authorship contribution statement

Somayeh Abolhosseini: Conceptualization, Software, Formal analysis, Validation, Writing – original draft. Mohammad Khorashadizadeh: Conceptualization, Methodology, Formal analysis, Supervision, Validation, Investigation, Writing – review & editing, Project administration. Majid Chahkandi: Conceptualization, Formal analysis, Supervision, Validation. Mousa Golalizadeh: Conceptualization, Formal analysis, Supervision, Validation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment
Appendix A
#R codes for CRDT (continuation of the listing; the function header, the outer
#loop over the features and the helper NCRE() appear in the part of the listing not shown here)
    CREx <- 0
    tn <- table(d[[j + 1]]) / n          # relative frequencies of the levels of feature j
    k  <- length(table(d[[j + 1]]))      # number of levels of feature j (levels coded 1..k)
    for (i in 1:k) {
      # conditional CRE: frequency-weighted CRE of the target y within level i of feature j
      CREx <- CREx + tn[i] * NCRE(y[d[[j + 1]] == i])
    }
    CREz[j] <- CREx                      # CRE_Xj(Y) for feature j
  }
  return(list(c("CREY=", "CREZi's="), c(CREy, CREz)))
}
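The fragment above calls a helper NCRE() whose definition lies in the part of the listing not reproduced here; a minimal sketch of such a helper, written by us from the estimator in Eq. (2), is given below. The authors' actual definition may differ in details (for example, in the handling of ties).

# Hypothetical reconstruction of the missing NCRE() helper, based on Eq. (2)
NCRE <- function(y) {
  y <- sort(y)
  n <- length(y)
  if (n < 2) return(0)
  S <- 1 - (1:(n - 1)) / n              # empirical survival function at the order statistics
  -sum(S * log(S) * diff(y))            # - sum F_n(x_(i)) log F_n(x_(i)) (x_(i+1) - x_(i))
}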
References

Adewole, A. P., & Udeh, S. N. (2018). The quadratic entropy approach to implement the ID3 decision tree algorithm. Journal of Computer Science and Information Technology, 6(2), 23–29. https://doi.org/10.15640/jcsit.v6n2a3
Akash, P. S., Kadir, M. E., Ali, A. A., & Shoyaib, M. (2019, August). Inter-node Hellinger distance based decision tree. In IJCAI-19 (pp. 1967–1973).
Baratpour, S., & Bami, Z. (2012). On the discrete cumulative residual entropy. Journal of the Iranian Statistical Society, 11(2), 203–215. https://jirss.irstat.ir/article_253690.html
Breiman, L. (2017). Classification and regression trees. Routledge.
Chaji, A. (2023). Introducing a new method for the split criteria of decision trees. Journal of Statistical Sciences, 16(2), 331–348. https://doi.org/10.52547/jss.16.2.331
Chatterjee, A., & Mukherjee, S. P. (2001). Equilibrium distribution - its role in reliability theory. Handbook of Statistics, 20.
Cheng, J., Fayyad, U. M., Irani, K. B., & Qian, Z. (1988). Improved decision trees: A generalized version of ID3. In Machine Learning Proceedings (pp. 100–106). Morgan Kaufmann. https://doi.org/10.1016/B978-0-934613-64-4.50016-5
Coenen, F. (2011). Data mining: Past, present and future. The Knowledge Engineering Review, 26(1), 25–29. https://doi.org/10.1017/S0269888910000378
Cortez, P., & Morais, A. D. J. R. (2007). A data mining approach to predict forest fires using meteorological data.
[dataset] Cortez, P., & Silva, A. M. G. (2008). Using data mining to predict secondary school student performance. https://doi.org/10.24432/C5TG7T
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547–553. https://doi.org/10.1016/j.dss.2009.05.016
Di Crescenzo, A., & Longobardi, M. (2009). On cumulative entropies. Journal of Statistical Planning and Inference, 139(12), 4072–4087. https://doi.org/10.1016/j.jspi.2009.05.038
Drummond, C., & Holte, R. C. (2000, June). Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, California, United States.
Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques (3rd ed.). Morgan Kaufmann.
Jin, C., De-Lin, L., & Fen-Xiang, M. (2009, July). An improved ID3 decision tree algorithm (pp. 127–130). IEEE. https://doi.org/10.1109/ICCSE.2009.5228509
Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53, 457–481. http://www.jstor.org/stable/2281868
Kiss, F. (2003). Credit scoring processes from a knowledge management perspective. Periodica Polytechnica Social and Management Sciences, 11(1), 95–110. https://www.pp.bme.hu/so/article/view/1683
Maimon, O. Z., & Rokach, L. (2014). Data mining with decision trees: Theory and applications (Vol. 81). World Scientific. (Chapters 1–6).
Maszczyk, T., & Duch, W. (2008). Comparison of Shannon, Renyi and Tsallis entropy used in decision trees. In Artificial Intelligence and Soft Computing - ICAISC 2008: 9th International Conference, Zakopane, Poland, June 22–26, 2008, Proceedings 9 (pp. 643–651). Springer Berlin Heidelberg.
Moneystore. (2022). Agency Performance. Kaggle. https://www.kaggle.com/datasets/moneystore/agencyperformance
Nash, W., Sellers, T., Talbot, S., Cawthorn, A., & Ford, W. (1995). Abalone. UCI Machine Learning Repository. https://doi.org/10.24432/C55C7W
Navarro, J., del Aguila, Y., & Asadi, M. (2010). Some new results on the cumulative residual entropy. Journal of Statistical Planning and Inference, 140(1), 310–322. https://doi.org/10.1016/j.jspi.2009.07.015
Pace, R. K., & Barry, R. (1997). Sparse spatial autoregressions. Statistics & Probability Letters, 33(3), 291–297. https://doi.org/10.1016/S0167-7152(96)00140-X
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90. https://doi.org/10.1613/jair.279
Rao, M., Chen, Y., Vemuri, B. C., & Wang, F. (2004). Cumulative residual entropy: A new measure of information. IEEE Transactions on Information Theory, 50(6), 1220–1228. https://doi.org/10.1109/TIT.2004.828057
Rao, M. (2005). More on a new concept of entropy and information. Journal of Theoretical Probability, 18, 967–981.
Rathod, V. (2022). Fish Market. Kaggle. https://www.kaggle.com/datasets/vipullrathod/fish-market
Singh, M., & Chhabra, J. K. (2021). EGIA: A new node splitting method for decision tree generation: Special application in software fault prediction. Materials Today: Proceedings. https://doi.org/10.1016/j.matpr.2021.05.325
[dataset] Stats NZ. (2024). Births and deaths: Year ended December 2022 – CSV. https://www.stats.govt.nz/large-datasets/csv-files-for-download
Sharma, S., Agrawal, J., & Sharma, S. (2013). Classification through machine learning technique: C4.5 algorithm based on various entropies. International Journal of Computer Applications, 82(16), 20–27. https://doi.org/10.5120/14249-2444
Tsanas, A., & Xifara, A. (2012). Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings, 49, 560. https://doi.org/10.24432/C51307
Wang, Y., Song, C., & Xia, S. T. (2015). Unifying decision trees split criteria using Tsallis entropy. arXiv preprint arXiv:1511.08136. https://doi.org/10.48550/arXiv.1511.08136
Wang, Y., & Xia, S. T. (2017, March). Unifying attribute splitting criteria of decision trees by Tsallis entropy. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2507–2511). IEEE. https://doi.org/10.1109/ICASSP.2017.7952608
Wang, Y., Xia, S. T., & Wu, J. (2017). A less-greedy two-term Tsallis entropy information metric approach for decision tree classification. Knowledge-Based Systems, 120, 34–42. https://doi.org/10.1016/j.knosys.2016.12.021
Xu, M., Wang, J. L., & Chen, T. (2006). Improved decision tree algorithm: ID3+. In Intelligent Computing in Signal Processing and Pattern Recognition: International Conference on Intelligent Computing, ICIC 2006, Kunming, China, August 16–19, 2006 (pp. 141–149). Springer Berlin Heidelberg.
Yeh, I. C. (1998). Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12), 1797–1808.