Article

Keywords: Classification, Preordonance, Decision tree, Node splitting, Impurity measure, Map-Reduce, Data set, Performance

In supervised classification, decision trees are one of the most popular learning algorithms that are employed
in many practical applications because of their simplicity, adaptability, and other perks. The development
of effective and efficient decision trees remains a major focus in machine learning. Therefore, the scientific
literature provides various node splitting measurements that can be utilized to produce different decision
trees, including Information Gain, Gain Ratio, Average Gain, and Gini Index. This research paper presents a
new node splitting metric that is based on preordonance theory. The primary benefit of the new split criterion
is its ability to deal with categorical or numerical attributes directly without discretization. Consequently, the
‘‘Preordonance-based decision tree’’ (P-Tree) approach, a powerful technique that generates decision trees using
the suggested node splitting measure, is developed. Both multiclass classification problems and imbalanced data
sets can be handled by the P-Tree decision tree strategy. Moreover, the over-partitioning problem is addressed
by the P-Tree methodology, which introduces a threshold 𝜖 as a stopping condition. If the percentage of
instances in a node falls below the predetermined threshold, the expansion of the tree will be halted. The
performance of the P-Tree procedure is evaluated on fourteen benchmark data sets with different sizes and
contrasted with that of five already existing decision tree methods using a variety of evaluation metrics. The
results of the experiments demonstrate that the P-Tree model performs admirably across all of the tested
data sets and that it is comparable to the other five decision tree algorithms overall. On the other hand,
an ensemble technique called ‘‘ensemble P-Tree’’ offers a reliable remedy to mitigate the instability that is
frequently associated with tree-based algorithms. This ensemble method leverages the strengths of the P-Tree
approach to enhance predictive performance through collective decision-making. The ensemble P-Tree strategy
is comprehensively evaluated by comparing its performance to that of two top-performing ensemble decision
tree methodologies. The experimental findings highlight its exceptional performance and competitiveness
against other decision tree procedures. Despite the excellent performance of the P-Tree approach, there are still
some obstacles that prevent it from handling larger data sets, such as memory restrictions, time complexity,
or data complexity. However, parallel computing is effective in resolving this kind of problem. Hence, the
MR-P-Tree decision tree technique, a parallel implementation of the P-Tree strategy in the Map-Reduce
framework, is further designed. The three parallel procedures MR-SA-S, MR-SP-S, and MR-S-DS for choosing
the optimal splitting attributes, choosing the optimal splitting points, and dividing the training data set in
parallel, respectively, are the primary basis of the MR-P-Tree methodology. Furthermore, several experimental
studies are carried out on ten additional data sets to illustrate the viability of the MR-P-Tree technique and
its strong parallel performance.
types of nodes: internal and leaf nodes that are connected via edges. The internal nodes represent testing nodes that can be divided into two or more child nodes depending on some chosen splitting attributes, and the leaf nodes represent decision nodes that are connected to some class labels [2].

Hunt's Concept Learning System (CLS) [3], a top-down partitioning strategy, is employed in almost all of the existing decision tree methodologies. The Hunt's CLS structure is typically composed of two basic phases. The first step aims to identify the most significant attribute using an attribute selection procedure that will be utilized as the node splitting attribute, while the second step aims to split the data set into disjoint descendant subsets based on the preselected splitting attribute. The attribute selection and data splitting processes are repeated iteratively for each non-empty subset until either all instances of each subset belong to the same class or a certain stopping condition is satisfied [4].

Identifying an ‘‘attribute selection technique’’ that will be used to pick the most crucial splitting attribute at each level of the tree presents the principal challenge in creating decision trees. Therefore, the effectiveness of the attribute selection method that is used influences the performance of the generated decision tree.

Researchers have proposed a variety of splitting strategies to produce various decision trees following Hunt's CLS structure. Information Gain (IG) [1], Gain Ratio (GR) [5], Average Gain (AG) [6], and Gini Index (GI) [7] are some popular splitting criteria. The Iterative Dichotomizer 3 (ID3) decision tree method [1] employs the IG metric as an attribute evaluation measure to assess the importance of attributes. The best attribute for splitting is then determined to be the one with the highest IG value. However, the ID3 approach has some restrictions, including the fact that it can only deal with categorical attributes and that it favors attributes with a wide range of values for splitting purposes [8]. Consequently, the C4.5 decision tree strategy [5], which utilizes the GR criterion as an impurity measure, is developed to overcome the limitations of the ID3 algorithm. In various levels of tree construction, the attribute that maximizes the GR metric represents the splitting attribute. Furthermore, the AG measurement, a variant of the GR criterion that addresses the issue that arises when the GR measure is undetermined, i.e., when the split information in the GR measure (its denominator) equals zero, can be employed as a node split criterion for developing decision trees. On the other hand, the Classification And Regression Trees (CART) procedure [7] employs the GI metric as a splitting criterion to create decision trees, and the attribute with the lowest GI value is recognized as the partitioning attribute at each level. The key strength of the CART methodology is its capacity to maintain data sets with mixed-type attributes.

In supervised feature selection, it is demonstrated that one of the most effective relevance measures for assessing the significance of attributes in a data set is the 𝛹 metric [9–14]. Consequently, it can be employed as a node splitting criterion to describe an efficient technique for creating optimal decision trees. In this framework, the Preordonance-based decision Tree algorithm (abbreviated P-Tree), which utilizes the 𝛹 measure as an impurity metric, is fully described. The ability to process data sets that contain heterogeneous attributes without even utilizing discretization techniques is the most significant advantage of the proposed decision tree strategy. Additionally, a dependable solution to mitigate the instability that is often linked to tree-based algorithms is provided by an ensemble approach known as ‘‘ensemble P-Tree’’. On the other side, due to memory constraints and computational complexity, the P-Tree method presents a big challenge when applied to large data sets. Therefore, the MR-P-Tree procedure, a parallelized preordonance based decision tree algorithm that is implemented based on Map-Reduce, is presented to address this kind of data classification problem. The MR-P-Tree methodology contains three distributed computations: MR-SA-S, MR-SP-S, and MR-S-DS. The MR-SA-S and MR-SP-S are parallel methods based on the Map-Reduce framework for selecting the best splitting attribute and the best splitting point, respectively. In contrast, the MR-S-DS is used to split the training data set in a parallel way.

Briefly stated, the major contributions of this manuscript are listed as follows:

• A novel impurity criterion that can directly deal with either categorical or numerical attributes without resorting to discretization strategies has been established based on the preordonance theory.
• Based on the predefined impurity metric, the preordonance based decision tree (P-Tree) algorithm is displayed.
• The ensemble P-Tree approach, which is a dependable method for reducing the instability that is frequently associated with tree-based algorithms, is described.
• The MR-P-Tree decision tree method, a parallelization of the P-Tree algorithm in the Map-Reduce structure, is designed to overcome memory constraints, time complexity, and data complexity problems.
• Many empirical studies are conducted to evaluate the performance of the P-Tree, ensemble P-Tree, and MR-P-Tree strategies on several real-world data sets with various dimensionalities.

The remaining sections of this paper are arranged as follows. First, Section 2 provides a summary of many decision tree methods that can be found in the scholarly literature. In Section 3, some fundamental concepts are presented. Following that, Section 4 provides the proposed approaches, including the non-parallel methodology P-Tree for creating powerful decision trees and the parallelized procedure MR-P-Tree for handling large data sets. Section 5 depicts the various experimental studies and the analysis of the obtained results. Finally, Section 6 concludes the paper and displays the future directions of the work.

2. Background on decision tree algorithms

Despite the existence of numerous well-known decision tree techniques, none of them can produce the best decision trees for various data sets. Consequently, decision tree development remains challenging for researchers, which has prompted the creation of more decision tree algorithms that are based on different attribute selection strategies.

A decision tree methodology was devised by Chandra et al. in 2010 using a node splitting criterion that relies on the concept of distinct classes in a partition. The splitting measure is a combination of two terms and is known as the Distinct Class-based Splitting Measure (DCSM) [15]. The first term of the DCSM metric deals with the number of distinct classes in every child partition, while the second deals with the number of instances that are members of different classes. Twenty-two data sets were used in the experimental investigation to demonstrate that decision trees generated using the DCSM node splitting measurement are compact and more accurate than those created using the GR and GI criteria.

The Fuzzy Rule-based Decision Tree (FRDT) strategy [16] was proposed by Wang et al. in 2015, specifically for data sets that consist of numerical features. It is a fuzzy decision tree methodology that utilizes an architecture based on fuzzy ‘‘if-then’’ principles. For the FRDT algorithm, each node requires many features for partitioning, in contrast to the typical axis-parallel (univariate) decision tree approaches, which entail a single feature per node. The performance of the FRDT technique is evaluated on a few popular data sets and contrasted with that of some conventional decision tree procedures. The experimental results show that the FRDT is an effective feature selection method in terms of both the testing accuracy and the depth of the produced trees.

In 2018, Mu et al. established a strategy for creating decision trees known as the Pearson's Correlation Coefficient-based decision Tree (PCC-Tree) methodology [2]. In the process of building decision trees, it employs Pearson's correlation coefficient to assess the relevance of features and determine the most optimal ones for splitting nodes. The primary benefit of the PCC-Tree algorithm is its capacity to handle
mixed-type data sets. On the other hand, twelve data sets are used to demonstrate the efficiency of the PCC-Tree approach in terms of four different evaluation criteria. The empirical results confirm the outstanding performance of the PCC-Tree technique.

A Fast Rank Mutual Information-based Decision Tree (FRMIDT) method [17] was introduced by Mu et al. in 2018. By applying two strategies, it seeks to expedite the tree node construction in the classic rank mutual information-based decision trees [18]. Initially, the mRMR procedure is utilized to discard redundant features in the instances that are contained in the current node. For additional acceleration, the renowned fuzzy c-means (FCM) algorithm [19] is further performed to validate the partitioning point for the current node. Various experimental studies were conducted to figure out the effectiveness of the FRMIDT technique. The gathered results show that it can accomplish acceptable levels of classification accuracy while effectively reducing processing time.

Roy et al. developed in 2019 the Dispersion Ratio-based Decision Tree (DRDT) technique [20], another successful decision tree process. The DRDT approach includes two successive steps. In the first phase, all of the numerical features in the data set are discretized using the K-means strategy. In the second step, a node impurity measure called the dispersion ratio (DR), a variant of the correlation ratio (CR) metric [21], is employed to assess features and determine which are important enough to split nodes. The performance of the DRDT algorithm is examined across several data sets with varying dimensionalities. The obtained results prove that the DRDT methodology is effective and can rival some existing decision tree models.

In 2019, Karabadji et al. presented the PSO-DT (Particle Swarm Optimization-based Decision Tree) technique [22], a potent algorithm for building decision trees. This method is based on the concept of merging subsets of training instances and features. More specifically, the PSO-DT approach attempts to integrate the processes of feature selection and data sampling using the PSO strategy, as its name suggests. To analyze the performance of the PSO-DT procedure, a comprehensive series of experiments was performed on twenty-two data sets from various disciplines. The results indicate that the PSO-DT methodology can effectively generate decision trees that are more accurate than those created by some conventional decision tree models.

Entropy Gini Integrated Approach (EGIA) [23] is a node splitting criterion that integrates two node impurity measurements. It was created by Singh et al. in 2021 in order to develop a decision tree generation process. The EGIA measure is a combination of the GI and entropy metrics using two parameters called entropy and GI factors that range from 0 to 1, and their total needs to be 1. Using eighteen data sets, the performance of the EGIA-based decision tree generating strategy is evaluated and contrasted with two decision tree algorithms that are based on the GI and IG measures. The results of the experimental study demonstrate that the EGIA metric significantly outperforms the GI and IG measurements.

A Decision Tree algorithm based on Feature Weight (FWDT) [24], which is a two-step strategy, was designed by Zhou et al. in 2021. Initially, the ReliefF technique is employed as a pre-processing stage to eliminate duplicate or unnecessary features from the feature set. The decision tree can then be created in the second step based on the reduced feature space by employing the concept of feature weight in order to identify the ideal splitting features. According to four assessment metrics, the FWDT is an effective decision tree technique, as demonstrated by a series of experimental investigations that were performed on twelve different data sets.

Nevertheless, only small data sets can be handled by all of the aforementioned decision tree techniques. Consequently, with the explosion of data, it is difficult to generate decision trees from massive data sets due to memory capacity, complexity in terms of time and data, and other factors [25]. Therefore, it appears that using parallel computing to process large data sets is a widespread and dependable treatment [26]. Map-Reduce frameworks [27,28], such as Hadoop and Spark, are among the most efficient and productive models for distributed computing, used to implement parallel data management processes through a cluster of standard machines [29]. Several parallel decision tree methodologies are available in the scientific literature.

In 2013, Zhu et al. began by investigating the characteristics of the GI metric, and then they established some techniques that can be applied in order to determine the best splitting points for developing decision trees inside the Map-Reduce computing framework [30]. In fact, the authors presented both precise and approximate strategies. Although the GI measure was the main emphasis of this work, the concepts might easily be extended to other impurity criteria. Thorough studies that were conducted on a real-world data set and other synthetic data sets showed the effectiveness and versatility of these methods.

Massive data sets cannot be directly handled by the traditional C4.5 decision tree technique due to many obstacles, including memory constraints, temporal complexity, data complexity, and other challenges. Therefore, to deal with large data sets, Mu and his co-workers proposed in 2017 the MR-C4.5-Tree strategy [31], which is a parallelized C4.5 decision tree methodology based on the Map-Reduce architecture. Furthermore, the depth of the generated decision tree, the number of instances, and the maximal class probability in each tree node were employed as stopping criteria to address the over-partitioning task. Twenty data sets were used in the experimental investigations, which proved the validity and excellent performance of the MR-C4.5-Tree algorithm. Moreover, the empirical study pertaining to Speedup, Scaleup, and Sizeup measures verified the outstanding parallel performance of the MR-C4.5-Tree approach.

The MR-PCC-Tree algorithm [2], which is a parallel implementation of the PCC-Tree tactic in the Map-Reduce programming technology, was designed by Mu et al. in 2018 to handle large-scale data classification problems. In fact, during the process of building decision trees, the MR-PCC-Tree method applied the Map-Reduce paradigm to every part of the PCC-Tree strategy for parallel computing. The MR-PCC-Tree technique has been evaluated through an extensive experimental study that used eight data sets. The obtained results demonstrated that this approach is an effective procedure that can achieve high parallel performance on reducing the processing time needed to manage large-scale classification challenges.

The MR-FRMIDT technique [17], which is a parallelization of the FRMIDT procedure via the Map-Reduce model, was further established by Mu et al. in 2018 to address medium- or large-scale data classification challenges. Indeed, the MR-FRMIDT approach applied parallel computing technology to each individual component of the unparallel FRMIDT methodology. According to the results of an experimental analysis that was conducted using six different data sets, the MR-FRMIDT strategy is an efficient algorithm with high parallel performance in terms of cutting down execution time and overcoming memory limits.

Mu et al. designed in 2019 a strategy for building fuzzy decision trees using a parallel tree Node Splitting Criterion (MR-NSC) [32], which was based on fuzzy information gain that was guided by the Map-Reduce model. Massive data sets may be effectively processed using the MR-NSC technique, which can solve the memory and temporal constraints that plague the traditional fuzzy decision tree approaches. The results of the experimental investigations that were conducted on twenty-four benchmark data sets proved the viability and strong parallel performance of the MR-NSC algorithm.

In 2020, Mu and his colleagues suggested a parallel Fuzzy Rule-Base based Decision Tree (MR-FRBDT) work in the paradigm of Map-Reduce [33] in order to improve the original FRDT strategy, which is limited to processing small-scale data sets. The primary benefit of the MR-FRBDT approach is its ability to directly address large-scale data classification problems. The performance of the MR-FRBDT method was tested on twenty-three real-world benchmark data sets. The analysis of the results confirmed the efficacy of the MR-FRBDT technique, which also exhibited strong parallel performance in terms of both computation time reduction and memory constraint prevention.
In order to improve the effectiveness of the ID3 decision tree methodology in terms of classification accuracy, execution time, and capacity to address large data sets, Es-Sabery et al. recommended in 2021 an innovative Map-Reduce improved weighted ID3 decision tree classification strategy [34] that was implemented in a parallel manner using the Hadoop big data framework. Multiple experiments were performed to evaluate the efficacy of this approach. The empirical results revealed that the improved ID3 technique performs admirably across several evaluation criteria.

The Map-Reduce-based Heart Disease Prediction System (MRHDP) [35] is an effective decision tree methodology that Fathimabi and his group introduced in 2021 in order to forecast heart disease by utilizing the Map-Reduce programming paradigm on the Hadoop platform. Moreover, the MRHDP algorithm was presented to address the splitting and time-consuming problems that are related to managing large health data sets. On the other hand, the performance of this approach was tested on several massive data sets, and the experimental results showed that the MRHDP strategy may produce more accurate heart disease prediction results.

The P-Tree and MR-P-Tree methodologies represent significant advancements in decision tree algorithms, addressing critical challenges and improving preexisting models. The P-Tree strategy does not require pretreatment procedures such as attribute transformation or discretization, which are common limitations in traditional approaches based on information gain, gain ratio, or Gini index. Instead, it develops a novel impurity metric (𝛹) that is based on preordonance theory, enabling seamless management of heterogeneous data sets. However, large data sets provide scaling challenges for the P-Tree technique. Therefore, by employing the Map-Reduce architecture to parallelize computations, the MR-P-Tree method expands the capabilities of the P-Tree procedure, successfully tackling difficulties with scalability and performance. The innovative aspects of the proposed techniques include their capacity to effectively manage heterogeneous data seamlessly, eliminate preprocessing processes, and improve scalability, thereby filling important gaps in state-of-the-art strategies. Moreover, the theoretical and practical significance of this work is highlighted by developing decision tree techniques that address contemporary data challenges, improving classification performance, efficiency, and applicability across diverse fields such as healthcare, finance, and marketing, where data diversity and volume present significant hurdles.

3. Preliminary concepts

Some fundamental concepts that will be employed to carry out the study are explained in this section, including the preordonance theory, the new splitting criterion and its characteristics, as well as the Map-Reduce framework.

3.1. Preordonance theory

Assume that:
𝐸 = {1, 2, … , 𝑛}: a set of 𝑛 instances.
𝐻 = {(𝑎, 𝑏) ∈ 𝐸² / 𝑎 < 𝑏}: the set of pairs of elements of 𝐸.
𝑀 = |𝐻| = 𝑛(𝑛 − 1)∕2.

Definition 1. A total preorder relation defined on 𝐻 is used to describe a preordonance 𝑃 specified on 𝐸 [37].

The ternary coding 𝑇𝑃 outlined below can be employed to express a preordonance 𝑃 for each 𝑎, 𝑏, 𝑐, 𝑑 ∈ 𝐸:

𝑇𝑃(𝑎, 𝑏, 𝑐, 𝑑) = 1 if (𝑎, 𝑏) >𝑃 (𝑐, 𝑑); 0 if (𝑎, 𝑏) =𝑃 (𝑐, 𝑑); −1 if (𝑎, 𝑏) <𝑃 (𝑐, 𝑑),   (1)

where the symbols >𝑃, =𝑃, and <𝑃 indicate that the pair (𝑎, 𝑏) precedes the pair (𝑐, 𝑑) in the order induced by the preordonance 𝑃 on 𝐻, the two pairs are tied, and the pair (𝑎, 𝑏) is preceded by the pair (𝑐, 𝑑) in the order induced by the preordonance 𝑃 on 𝐻, respectively.

On the other hand, an explanatory variable 𝑉 of any type has the capacity to produce a preordonance, denoted by 𝑃𝑉. Table 1 displays the preordonances and their associated ternary codings induced by numerical, ordinal, and categorical variables.

3.2. New splitting criterion: definition and properties

Consider a data set 𝐷 = (𝑋, 𝑌), where 𝑋 is a data matrix with 𝑛 instances 𝑥1, … , 𝑥𝑛 which are characterized by 𝑝 attributes 𝐴1, … , 𝐴𝑝, and 𝑌 is a class attribute. Table 2 depicts the structure of the data set 𝐷, where 𝐴𝑘(𝑖) represents the value of the instance 𝑥𝑖 on the attribute 𝐴𝑘. Assume that 𝐴𝑠𝑒𝑡 = {𝐴1, 𝐴2, … , 𝐴𝑝} represents the full set of attributes.

Suppose that two attributes 𝐴𝑘 and 𝑌 induce two preordonances 𝑃𝐴𝑘 and 𝑃𝑌, respectively, and that 𝑇𝑃𝐴𝑘 and 𝑇𝑃𝑌 are the ternary codings that are associated with these two preordonances. Assume that 𝑟𝑃𝐴𝑘 and 𝑟𝑃𝑌 are the rank variables that 𝑃𝐴𝑘 and 𝑃𝑌, respectively, induce on 𝐻, and that they have the following definitions for each ((𝑎, 𝑏), (𝑐, 𝑑)) ∈ 𝐻 × 𝐻 ⧵ {(𝑎, 𝑏)}:

𝑟𝑃𝐴𝑘(𝑎, 𝑏) − 𝑟𝑃𝐴𝑘(𝑐, 𝑑) < 0 ⇔ 𝑇𝑃𝐴𝑘(𝑎, 𝑏, 𝑐, 𝑑) = 1; 𝑟𝑃𝐴𝑘(𝑎, 𝑏) − 𝑟𝑃𝐴𝑘(𝑐, 𝑑) = 0 ⇔ 𝑇𝑃𝐴𝑘(𝑎, 𝑏, 𝑐, 𝑑) = 0; 𝑟𝑃𝐴𝑘(𝑎, 𝑏) − 𝑟𝑃𝐴𝑘(𝑐, 𝑑) > 0 ⇔ 𝑇𝑃𝐴𝑘(𝑎, 𝑏, 𝑐, 𝑑) = −1,   (2)

𝑟𝑃𝑌(𝑎, 𝑏) − 𝑟𝑃𝑌(𝑐, 𝑑) < 0 ⇔ 𝑇𝑃𝑌(𝑎, 𝑏, 𝑐, 𝑑) = 1; 𝑟𝑃𝑌(𝑎, 𝑏) − 𝑟𝑃𝑌(𝑐, 𝑑) = 0 ⇔ 𝑇𝑃𝑌(𝑎, 𝑏, 𝑐, 𝑑) = 0; 𝑟𝑃𝑌(𝑎, 𝑏) − 𝑟𝑃𝑌(𝑐, 𝑑) > 0 ⇔ 𝑇𝑃𝑌(𝑎, 𝑏, 𝑐, 𝑑) = −1.   (3)

In the literature, several criteria are employed to assess the importance of attributes. In this research, the relevance measure suggested to evaluate the attributes is based on a non-parametric statistical coefficient. The main benefit of the recommended relevance metric is its flexibility to handle both categorical and numerical attributes.

Definition 2. The relevance of the attribute 𝐴𝑘 with respect to the class attribute 𝑌 is determined by the following equation [12]:

𝛹(𝐴𝑘, 𝑌) = 𝜏1(𝑟𝑃𝐴𝑘, 𝑟𝑃𝑌),   (4)
Table 1
Preordonances and their associated ternary codings induced by numerical, ordinal, and categorical variables.

Numerical variable 𝑉: (𝑎, 𝑏) 𝑃𝑉 (𝑐, 𝑑) ⟺ |𝑉(𝑎) − 𝑉(𝑏)| < |𝑉(𝑐) − 𝑉(𝑑)|;
𝑇𝑃𝑉(𝑎, 𝑏, 𝑐, 𝑑) = 1 if |𝑉(𝑎) − 𝑉(𝑏)| < |𝑉(𝑐) − 𝑉(𝑑)|, 0 if |𝑉(𝑎) − 𝑉(𝑏)| = |𝑉(𝑐) − 𝑉(𝑑)|, −1 if |𝑉(𝑎) − 𝑉(𝑏)| > |𝑉(𝑐) − 𝑉(𝑑)|.

Ordinal variable 𝑉: (𝑎, 𝑏) 𝑃𝑉 (𝑐, 𝑑) ⟺ |𝑟(𝑎) − 𝑟(𝑏)| < |𝑟(𝑐) − 𝑟(𝑑)|, where 𝑟 is the rank induced by 𝑉 on 𝐸;
𝑇𝑃𝑉(𝑎, 𝑏, 𝑐, 𝑑) = 1 if |𝑟(𝑎) − 𝑟(𝑏)| < |𝑟(𝑐) − 𝑟(𝑑)|, 0 if |𝑟(𝑎) − 𝑟(𝑏)| = |𝑟(𝑐) − 𝑟(𝑑)|, −1 if |𝑟(𝑎) − 𝑟(𝑏)| > |𝑟(𝑐) − 𝑟(𝑑)|.

Categorical variable 𝑉: (𝑎, 𝑏) 𝑃𝑉 (𝑐, 𝑑) ⟺ 𝑉(𝑎) = 𝑉(𝑏) & 𝑉(𝑐) ≠ 𝑉(𝑑);
𝑇𝑃𝑉(𝑎, 𝑏, 𝑐, 𝑑) = 1 if 𝑉(𝑎) = 𝑉(𝑏) & 𝑉(𝑐) ≠ 𝑉(𝑑), 0 if (𝑉(𝑎) = 𝑉(𝑏) & 𝑉(𝑐) = 𝑉(𝑑)) or (𝑉(𝑎) ≠ 𝑉(𝑏) & 𝑉(𝑐) ≠ 𝑉(𝑑)), −1 if 𝑉(𝑎) ≠ 𝑉(𝑏) & 𝑉(𝑐) = 𝑉(𝑑).
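As an illustration of Table 1, the following minimal R sketch (R being the computing environment used in Section 5; the function names are illustrative and not taken from the paper) evaluates the ternary coding 𝑇𝑃𝑉(𝑎, 𝑏, 𝑐, 𝑑) for a numerical and for a categorical variable 𝑉 stored as a vector indexed by instance; the ordinal case follows by applying the numerical rule to the ranks of 𝑉.

# Ternary coding of Table 1 for a numerical variable V (a, b, c, d are instance indices).
ternary_numeric <- function(V, a, b, c, d) {
  g1 <- abs(V[a] - V[b]); g2 <- abs(V[c] - V[d])   # within-pair gaps
  if (g1 < g2) 1L else if (g1 == g2) 0L else -1L
}
# Ternary coding of Table 1 for a categorical variable V.
ternary_categorical <- function(V, a, b, c, d) {
  tie1 <- V[a] == V[b]; tie2 <- V[c] == V[d]       # does each pair share a category?
  if (tie1 && !tie2) 1L else if (tie1 == tie2) 0L else -1L
}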
Example 1. Two attributes, 𝐴1 and 𝑌, are displayed in Table 3, where 𝐴1 is a numerical attribute and 𝑌 is a class attribute.

To compute the relevance of the attribute 𝐴1, it is necessary to start by determining the ternary codings 𝑇𝑃𝐴1 and 𝑇𝑃𝑌 of the preordonances induced by the attributes 𝐴1 and 𝑌, respectively. For example:

|𝐴1(1) − 𝐴1(2)| < |𝐴1(1) − 𝐴1(3)| ⇒ 𝑇𝑃𝐴1(1, 2, 1, 3) = 1,
𝑌(1) = 𝑌(2) & 𝑌(1) ≠ 𝑌(3) ⇒ 𝑇𝑃𝑌(1, 2, 1, 3) = 1.

Using the identical process, the values of 𝑇𝑃𝐴1 and 𝑇𝑃𝑌 for all (𝑎, 𝑏, 𝑐, 𝑑) are summarized in Table 4. According to Table 4:

∑ 𝑇𝑃𝐴1(𝑎, 𝑏, 𝑐, 𝑑) 𝑇𝑃𝑌(𝑎, 𝑏, 𝑐, 𝑑) = 4,  ∑ 𝑇²𝑃𝐴1(𝑎, 𝑏, 𝑐, 𝑑) = 6,  ∑ 𝑇²𝑃𝑌(𝑎, 𝑏, 𝑐, 𝑑) = 4,

where all the sums cover all ((𝑎, 𝑏), (𝑐, 𝑑)) ∈ 𝐻 × (𝐻 ⧵ {(𝑎, 𝑏)}). Hence:

𝛹(𝐴1, 𝑌) = 𝑐𝑜𝑟(𝑇𝑃𝐴1, 𝑇𝑃𝑌) = 4 ∕ (√6 × √4) = 0.8164966.
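For illustration only, the brute-force R sketch below reproduces the computation displayed in Example 1: both ternary codings are enumerated over 𝐻 × (𝐻 ⧵ {(𝑎, 𝑏)}) and combined as in the correlation formula above (the helper functions are the illustrative ones given after Table 1; this quadratic enumeration is precisely the cost that later motivates the Map-Reduce treatment of Section 4.2).

# Relevance Psi(A, Y) obtained as in Example 1: sum of products of the two ternary
# codings divided by the square roots of their sums of squares.
psi_relevance <- function(A, Y, ternary_A, ternary_Y) {
  H <- t(combn(length(A), 2))                  # all pairs (a, b) of H with a < b
  M <- nrow(H)
  tA <- numeric(0); tY <- numeric(0)
  for (i in seq_len(M)) for (j in seq_len(M)) {
    if (i == j) next                           # exclude (c, d) = (a, b)
    tA <- c(tA, ternary_A(A, H[i, 1], H[i, 2], H[j, 1], H[j, 2]))
    tY <- c(tY, ternary_Y(Y, H[i, 1], H[i, 2], H[j, 1], H[j, 2]))
  }
  sum(tA * tY) / sqrt(sum(tA^2) * sum(tY^2))
}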
The following properties list some characteristics of the suggested relevance measure that have already been established in our previous research paper [12]. The first property is applicable to a categorical attribute, whereas the second one is applicable to a numerical attribute.

Proposition 1. Assume that 𝐴𝑘 is a categorical attribute.
• The relevance of the attribute 𝐴𝑘 can be determined by Eq. (6).

𝛹(𝐴𝑘, 𝑌) = 𝐿(𝐴𝑘, 𝑌),   (6)

∀(𝑎, 𝑏) ∈ 𝐻, 𝐴𝑘(𝑎) ≠ 𝐴𝑘(𝑏) ⇔ 𝑌(𝑎) = 𝑌(𝑏).   (11)

Proposition 2. The relevance measure reaches its greatest value for a numerical variable 𝐴𝑘 that fulfills the following requirement:

max over 𝑌(𝑎) = 𝑌(𝑏) of |𝐴𝑘(𝑎) − 𝐴𝑘(𝑏)| < min over 𝑌(𝑐) ≠ 𝑌(𝑑) of |𝐴𝑘(𝑐) − 𝐴𝑘(𝑑)|.   (12)

This statement implies that the relevance value of a numerical variable 𝐴𝑘, which perfectly separates instances with different class labels while mixing instances with the same class labels, has reached its maximum value (a value of 1).

It is clear from the above characteristics of the suggested relevance metric that it can be employed as an impurity measure for building decision trees.

3.3. Map-reduce framework

Due to its many benefits, including scalability, flexibility, speed, and simplicity, ‘‘Map-Reduce’’, a crucial programming paradigm, can be used to process massive data sets in parallel [40]. The Map-Reduce framework simplifies the development of a traditional distributed program and provides a simple parallel programming method. The principal mechanism of the Map-Reduce is to split big data analysis into many single and inherently parallel tasks on nodes of the clusters.

The two main processing phases of the Map-Reduce task are Map and Reduce [41]. Initially, a distributed file system such as HDFS (Hadoop Distributed File System) is usually employed to store the input data set. After that, it is separated into smaller, independent data chunks referred to as input splits [42]. Actually, for example, in Apache Hadoop, the input data set is automatically segmented into input splits, which are fixed-size chunks. Each of these subsets comprises a portion
Table 4
Values of the ternary codings 𝑇𝑃𝐴1 and 𝑇𝑃𝑌 for each (𝑎, 𝑏, 𝑐, 𝑑).
of the original data set and is managed by a separate mapper. The input splits are then simultaneously examined using a map function to provide some interim results that will be employed as an input for the second phase. The final results are then exported by combining these intermediate results utilizing a reduce function during the Reduce step. Fig. 1 displays the detailed processing structure of the Map-Reduce framework [31].

As demonstrated in Fig. 1, ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs are the only type of data on which the Map-Reduce architecture can operate. In other words, the map and reduce functions employ the ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs as their inputs and outputs [43].

First, during the Map phase, the map function employs a single ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pair as its input and produces a list of intermediate ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs by some operations that are tailored to the needs of the users. Here is an illustration of the form:

(input) ⟨𝑘𝑒𝑦1, 𝑣𝑎𝑙𝑢𝑒1⟩ → map → list ⟨𝑘𝑒𝑦2, 𝑣𝑎𝑙𝑢𝑒2⟩.   (13)

Then, all the output intermediate ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pairs are merged by 𝑘𝑒𝑦 between the map and reduce functions as follows:

list ⟨𝑘𝑒𝑦2, 𝑣𝑎𝑙𝑢𝑒2⟩ → combine → ⟨𝑘𝑒𝑦2, list(𝑣𝑎𝑙𝑢𝑒2)⟩.   (14)

Finally, in the Reduce step, a new output ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ pair is created based on the collected pairs by 𝑘𝑒𝑦 in the Reduce function. This shows how the form can be pictured:

⟨𝑘𝑒𝑦2, list(𝑣𝑎𝑙𝑢𝑒2)⟩ → reduce → ⟨𝑘𝑒𝑦3, 𝑣𝑎𝑙𝑢𝑒3⟩.   (15)

4. Proposed approaches: P-Tree and MR-P-Tree strategies

The suggested decision tree approaches are covered in this section. These include the non-parallel methodology P-Tree for building powerful decision trees and the parallelized procedure MR-P-Tree for managing large data sets.

4.1. Preordonance-based decision tree (P-Tree) algorithm

Several guiding principles must be established to create decision trees, such as splitting attributes, splitting points, stopping criteria, and a labeling rule. Since the effectiveness of decision trees depends on these guidelines, they must therefore be determined with extreme precision. However, picking the best splitting attributes and the optimal splitting points is the most contentious problem that arises while creating any decision tree. In order to produce effective decision trees, this section introduces an innovative method that employs the previously stated relevance metric as an impurity measure to assess the importance of attributes as well as determine the right splitting points. The proposed decision tree algorithm is known as the Preordonance-based decision Tree (P-Tree) strategy. The splitting features, splitting points, stopping criteria, and labeling rule determination processes used by the P-Tree method are discussed in the following sections:

4.1.1. Identifying splitting attributes and splitting points

Since the relevance metric 𝛹 is a measure based on the correlation coefficient, it can be used to gauge how strongly an attribute is related to the class attribute. Consequently, it is decided that the attribute 𝐴𝑘∗ corresponding to the maximum relevance value is deemed to be the optimal splitting attribute. In other words, 𝐴𝑘∗ is chosen if it fulfills the following condition:

𝐴𝑘∗ = argmax over 𝐴𝑘 ∈ 𝐴𝑠𝑒𝑡 of 𝛹(𝐴𝑘, 𝑌).   (16)
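A direct, non-parallel reading of Eq. (16) can be sketched in R as follows (a simplified sketch under the assumption that the training set is a data.frame whose label column is named class; choose_ternary and the helpers from Section 3 are illustrative, not part of the paper):

# Eq. (16): score every candidate attribute with Psi and keep the argmax.
choose_ternary <- function(x) if (is.numeric(x)) ternary_numeric else ternary_categorical
best_splitting_attribute <- function(D, attributes) {
  scores <- sapply(attributes, function(a)
    psi_relevance(D[[a]], D$class, choose_ternary(D[[a]]), ternary_categorical))
  attributes[which.max(scores)]                # A_k* with the maximum relevance value
}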
After selecting the best splitting attribute for the current node, the ideal splitting point connected to that attribute should then be identified. Depending on the type of splitting attribute that was chosen, which can either be categorical or numerical, two cases must be looked at to implement this step.

First, in the case of a categorical cutting attribute, the set of its distinct values then serves as the set of splitting points. Therefore, based on the total number of the various categories for the dividing attribute, the corresponding node can be divided into several child nodes, one child node per feature category (multiway split strategy). More specifically, assume that the attribute 𝐴𝑘∗ consists of 𝑚 different values, thus 𝑚 branches are created corresponding to each attribute value. Therefore, the set of cut points 𝑐∗ for the categorical cutting attribute 𝐴𝑘∗ is specified as follows:

𝑐∗ = ⋃ 𝑗 = 1..𝑚 {𝑐𝑗},   (17)

where 𝑐𝑗 ∈ {𝐴𝑘∗(1), 𝐴𝑘∗(2), … , 𝐴𝑘∗(𝑛)}. Finally, the data set 𝐷 can be split into 𝑚 subsets defined by 𝐷𝑗 = {𝑥𝑖 ∈ 𝐷 ∕ 𝐴𝑘∗(𝑖) = 𝑐𝑗}.

Otherwise, if the chosen splitting attribute is numerical, the corresponding node is divided using the binary split strategy, depending on a cut point that needs to be determined. To choose the best partitioning point 𝑐∗ for the numerical partitioning attribute 𝐴𝑘∗, the values of this attribute are first sorted in ascending order and recorded as 𝑢1𝑘∗ ≤ 𝑢2𝑘∗ ≤ ⋯ ≤ 𝑢𝑛𝑘∗. Later, for each midpoint 𝑐𝑗 = (𝑢𝑖𝑘∗ + 𝑢𝑖+1𝑘∗)∕2 (𝑖 = 1, 2, … , 𝑛 − 1, where 𝑢𝑖𝑘∗ and 𝑢𝑖+1𝑘∗ are two adjacent sorted values), a novel vector 𝑉(𝐴𝑘∗, 𝑐𝑗) is generated and defined by Eq. (18).

𝑉(𝐴𝑘∗, 𝑐𝑗) = {𝑣(𝐴𝑘∗(𝑖), 𝑐𝑗)}, 𝑖 = 1, … , 𝑛, with 𝑣(𝐴𝑘∗(𝑖), 𝑐𝑗) = 0 if 𝐴𝑘∗(𝑖) ≤ 𝑐𝑗 and 1 if 𝐴𝑘∗(𝑖) > 𝑐𝑗.   (18)

The midpoint that satisfies the following requirement identifies then the partitioning point 𝑐∗ for the cutting attribute 𝐴𝑘∗:

𝑐∗ = argmax over 𝑐𝑗 of 𝛹(𝑉(𝐴𝑘∗, 𝑐𝑗), 𝑌).   (19)

Finally, the data set 𝐷 can be divided into 2 subsets defined by 𝐷1 = {𝑥𝑖 ∈ 𝐷 ∕ 𝐴𝑘∗(𝑖) ≤ 𝑐∗} and 𝐷2 = {𝑥𝑖 ∈ 𝐷 ∕ 𝐴𝑘∗(𝑖) > 𝑐∗}.
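The binary-split search of Eqs. (18) and (19) can be sketched as follows (assumptions: duplicate midpoints are merged, and the 0/1 indicator vector is scored with the categorical coding; psi_relevance is the illustrative helper from Section 3):

# Eqs. (18)-(19): scan the midpoints of adjacent sorted values of a numerical attribute A
# and keep the one whose induced 0/1 indicator is most relevant to the class attribute Y.
best_splitting_point <- function(A, Y) {
  u <- sort(A)
  mids <- unique((head(u, -1) + tail(u, -1)) / 2)     # candidate midpoints c_j
  scores <- sapply(mids, function(cj) {
    V <- ifelse(A <= cj, 0L, 1L)                      # vector V(A, c_j) of Eq. (18)
    psi_relevance(V, Y, ternary_categorical, ternary_categorical)
  })
  mids[which.max(scores)]                             # c* of Eq. (19)
}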
Building decision trees also involves more than just providing the ideal partitioning attributes and the best partitioning points. Furthermore, two supplementary principles should be taken into account when developing decision trees. The first concept discusses the stopping criterion, which determines when to stop the construction of the tree and produce leaf nodes, whereas the other principle is connected to the labeling rule, which assigns a class label to the leaves of the tree.

4.1.2. Identifying stopping conditions and a labeling rule

The construction of a decision tree can typically be stopped by imposing a variety of stalling conditions. For the P-Tree strategy, three stalling conditions are taken into consideration. Evidently, the decision tree should cease expanding for every node that only contains instances of the same class labels (pure node). The second stopping requirement relates to the set of attributes; if there are no more attributes that would allow the instances to be further subdivided, then the development of the tree is stopped. The last criterion for restricting decision tree growth is a predetermined threshold 𝜖 for the proportion of instances in a node; if it is lower than 𝜖, tree advancement is halted. Ultimately, all of these nodes that prevented the decision tree from developing are marked as leaf nodes.

On the other hand, all of the nodes that terminated the splitting procedure were eventually designated as leaf nodes of the decision tree. The problem of labeling the produced leaf nodes then arises after the growth of the decision tree has been stopped. Generally, the standard labeling rule, which is implemented in many well-known decision tree approaches, including the ID3, C4.5, and CART techniques, is to label a leaf node with the most prevalent class of the instances it contains. Therefore, the P-Tree decision tree methodology will categorize the leaves in exactly the same manner.
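These three stopping tests and the majority-class labeling rule can be condensed into a small R sketch (assumptions: D is the data.frame at the current node with a class column, n_total is the size of the full training set, and epsilon is the user-chosen threshold; all names are illustrative):

# The three P-Tree stopping conditions and the standard majority-class labeling rule.
is_leaf_node <- function(D, attributes, n_total, epsilon) {
  length(unique(D$class)) == 1 ||      # pure node
    length(attributes) == 0 ||         # no splitting attributes left
    nrow(D) / n_total < epsilon        # proportion of instances below the threshold
}
majority_label <- function(D) names(which.max(table(D$class)))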
Table 5
Weather data set.
No. Outlook Temperature Humidity Windy Play
1 Sunny 85 85 False No
2 Sunny 80 90 True No
3 Overcast 83 86 False Yes
4 Rainy 70 96 False Yes
5 Rainy 68 80 False Yes
6 Rainy 65 70 True No
7 Overcast 64 65 True Yes
8 Sunny 72 95 False No
9 Sunny 69 70 False Yes
10 Rainy 75 80 False Yes
11 Sunny 75 70 True Yes
12 Overcast 72 90 True Yes
13 Overcast 81 75 False Yes
14 Rainy 71 91 True No

Table 6
Relevance value of each attribute in the weather data set.
𝐴𝑘: Outlook, Temperature, Humidity, Windy
𝛹(𝐴𝑘, 𝑌): 0.0417, −0.0451, 0.0014, 0.0116

Table 7
Midpoints 𝑐𝑗, vectors 𝑉(Humidity, 𝑐𝑗), and 𝛹(𝑉(Humidity, 𝑐𝑗), Y).
𝑐𝑗: 77.50, 87.50, 92.50
𝑉(Humidity, 𝑐𝑗): [2, 2, 2, 1, 1], [1, 2, 2, 1, 1], [1, 1, 2, 1, 1]
𝛹(𝑉(Humidity, 𝑐𝑗), Y): 1, 0.167, −0.167

4.1.3. The P-Tree decision tree algorithm

Based on the aforementioned regulations, which include the attribute selection process, identifying the ideal splitting point for the chosen attribute, the stopping conditions, and the labeling rule, the procedure for creating decision trees by employing the P-Tree approach can be introduced. The recommended algorithm evolves in a top-down recursive way. The P-Tree strategy initially creates the root node by associating the given data set 𝐷. If all the instances in the data set belong to the same class, the attribute set is empty, or the proportion of instances is less than 𝜖, a leaf node is produced and labeled by the majority class of the instances it contains. Otherwise, the proposed relevance metric is used to determine which attribute will split the training data set by utilizing Eq. (16). After choosing the partitioning attribute, two cases are then enforced in accordance with the kind of this attribute in order to pinpoint the best partitioning point. If the splitting attribute is categorical, the root node is then split into a number of child nodes based on the various values of the selected attribute according to Eq. (17). However, if the splitting attribute is numerical, establish a cutting point using Eq. (19) and then divide the root node into two groups according to it. The stated procedure is applied recursively to each non-empty child node. The pseudocode of the P-Tree approach is displayed in Algorithm 1.

The example that follows illustrates how to build a decision tree utilizing the P-Tree approach.

Example 2. Table 5 displays a weather data set that contains fourteen instances, which are identified by two categorical attributes (Outlook and Windy), two numerical attributes (Temperature and Humidity), and one class attribute (Play). Assume that the threshold 𝜖, the stopping criterion for the P-Tree algorithm, is set to be 0. Fig. 2 provides an illustration of the process of building the decision tree for the weather data set using the P-Tree technique.

4.2. Parallelized preordonance-based decision tree algorithm based on Map-Reduce (MR-P-Tree)

The rationale for employing the Map-Reduce framework lies in the quadratic growth of the pair formation as the number of instances 𝑛
expands. In simpler terms, as 𝑛 increases, the size of the set 𝐻 does not merely grow linearly but escalates at a rate proportional to the square of 𝑛. This phenomenon stems from the combinatorial nature of generating pairs: with each additional instance in 𝐸, potentially 𝑛 − 1 new pairs emerge, leading to a quadratic surge in pair count. Consequently, in the realm of feature relevance assessment, where pairwise comparisons are indispensable, the Map-Reduce model offers a scalable architecture to manage the quadratic increase in the number of comparisons as the data set expands. Leveraging the Map-Reduce paradigm enables the partitioning and concurrent execution of computations, thus curtailing processing time and empowering analyses on data sets of unprecedented scale. Therefore, Map-Reduce emerges as an indispensable asset in contemporary data analytics, facilitating the scrutiny of feature relevance amidst the era of big data.

The MR-P-Tree method, a parallel implementation of the P-Tree algorithm in the Map-Reduce programming framework, is covered in this section. The MR-P-Tree technique cannot be presented until two fundamental concepts have been established. The first principle, which is similar to the P-Tree algorithm, determines the best splitting attributes and the ideal splitting points. The two strategies are different in that the P-Tree procedure employs the whole data set, whereas the MR-P-Tree methodology is based on several small subsets that are generated from the full data set. The second rule outlines a strategy for dividing the data set in parallel.

Table 8
Partitioning the data set into multiple subsets for the Map-Reduce model.
𝐷: 𝐴𝑠𝑒𝑡 = {𝐴1, 𝐴2, … , 𝐴𝑝}, 𝑌
⇓
𝐷(1): 𝐴(1)𝑠𝑒𝑡 = {𝐴(1)1, 𝐴(1)2, … , 𝐴(1)𝑝}, 𝑌(1)
𝐷(2): 𝐴(2)𝑠𝑒𝑡 = {𝐴(2)1, 𝐴(2)2, … , 𝐴(2)𝑝}, 𝑌(2)
⋮
𝐷(𝑁): 𝐴(𝑁)𝑠𝑒𝑡 = {𝐴(𝑁)1, 𝐴(𝑁)2, … , 𝐴(𝑁)𝑝}, 𝑌(𝑁)

Assume that in the framework of Map-Reduce, the data set 𝐷 can be randomly partitioned into 𝑁 subsets 𝐷(1), … , 𝐷(𝑁). Each produced subset 𝐷(𝑙) is characterized by its set of attributes 𝐴(𝑙)𝑠𝑒𝑡 and its class attribute 𝑌(𝑙), as demonstrated in Table 8. Obviously, each subset contains 𝑛′ = 𝑛∕𝑁 instances.
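The random partitioning of Table 8 amounts to the following R sketch (an assumption-level illustration: D is a data.frame holding all the attributes and the class column, and N is the number of Map tasks):

# Randomly cut D into N chunks of roughly n/N instances, each keeping every attribute.
partition_dataset <- function(D, N) {
  id <- sample(rep(seq_len(N), length.out = nrow(D)))   # random subset label per instance
  split(D, id)                                          # list D^(1), ..., D^(N)
}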
Breaking down the data set into smaller subsets fulfills several important purposes in the context of distributed computing frameworks such as Map-Reduce. Firstly, it enables parallel processing by distributing computational tasks across several nodes or processors, thereby reducing overall processing time. Furthermore, smaller subsets
Fig. 2. Illustration of the constructing processor of the decision tree for the weather data set using the P-Tree technique (see Tables 6 and 7).
help mitigate memory constraints by limiting the amount of data processed simultaneously, leading to more effective memory utilization. In addition, this segmentation enhances scalability, allowing the algorithm to handle increasingly larger data sets while maintaining high performance levels. Moreover, smaller subsets streamline data management tasks, making it easier to retrieve, manipulate, and analyze data efficiently. Overall, dividing the data set into smaller groups often optimizes resource utilization, scalability, memory efficiency, and data management in distributed computing environments.

Similar to the P-Tree technique, the MR-P-Tree approach employs the relevance metric 𝛹 as an impurity measure to discover the optimal cutting attribute in parallel. First, the relevance value of each attribute in each subset 𝐷(𝑙) is determined in the Map phase, and in the Reduce phase, the mean relevance of each attribute is computed to define its
Fig. 3. Procedure for picking the dividing attribute for the MR-P-Tree algorithm in the Map-Reduce framework.
Algorithm 2 MR-Splitting Attribute-Selection (MR-SA-S) method
Input: A data set 𝐷 described by 𝑝 attributes 𝐴𝑠𝑒𝑡 = {𝐴1, … , 𝐴𝑝} and a class attribute 𝑌.
Output: The splitting attribute 𝐴𝑘∗.
1: Initialize a MapReduce Job SPLITRULEJOB:
2: Set SplitRuleTaskMapper as the Mapper Class.
3: Set SplitRuleTaskReducer as the Reducer Class.
4: Pretend that the set 𝐷 can be divided into 𝑁 subsets 𝐷(1), … , 𝐷(𝑁).
5: In the 𝑙-th SplitRuleTaskMapper:
   Input: A subset 𝐷(𝑙) described by 𝑝 attributes 𝐴(𝑙)𝑠𝑒𝑡 = {𝐴(𝑙)1, … , 𝐴(𝑙)𝑝} and one class attribute 𝑌(𝑙).
   Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴(𝑙)𝑘, 𝛹(𝐴(𝑙)𝑘, 𝑌(𝑙))⟩.
6: for each attribute 𝐴(𝑙)𝑘 ∈ 𝐴(𝑙)𝑠𝑒𝑡 do
7:   Compute 𝛹(𝐴(𝑙)𝑘, 𝑌(𝑙)) according to Eq. (4).
8:   Mapper Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴(𝑙)𝑘, 𝛹(𝐴(𝑙)𝑘, 𝑌(𝑙))⟩.
9: end for
10: In the SplitRuleTaskReducer:
   Input: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴𝑘, [𝛹(𝐴(𝑙)𝑘, 𝑌(𝑙)), 𝑙 = 1, 2, … , 𝑁]⟩.
   Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴𝑘∗, 𝛹(𝐴𝑘∗, 𝑌)⟩.
11: for each attribute 𝐴𝑘 ∈ 𝐴𝑠𝑒𝑡 do
12:   𝛹(𝐴𝑘, 𝑌) = (∑ 𝑙 = 1..𝑁 𝛹(𝐴(𝑙)𝑘, 𝑌(𝑙))) ∕ 𝑁.
13: end for
14: Select the optimal attribute 𝐴𝑘∗ = argmax over 𝐴𝑘 of 𝛹(𝐴𝑘, 𝑌).
15: Reducer Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴𝑘∗, 𝛹(𝐴𝑘∗, 𝑌)⟩.
16: Return the partitioning attribute 𝐴𝑘∗.
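In sequential R, the Reduce step of Algorithm 2 (lines 11–15) boils down to averaging the per-subset relevance values and taking the argmax, as in the sketch below (mapper_psi stands for the mapper outputs, one named numeric vector of 𝛹 values per subset 𝐷(𝑙); the vectors are assumed to share the same attribute order, and the function name is illustrative):

# MR-SA-S Reduce step: mean relevance per attribute across the N subsets, then argmax.
reduce_splitting_attribute <- function(mapper_psi) {
  mean_psi <- Reduce(`+`, mapper_psi) / length(mapper_psi)
  list(attribute = names(which.max(mean_psi)), psi = max(mean_psi))
}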
identified, the MR-P-Tree procedure can be introduced. Like the P-Tree method, the MR-P-Tree strategy builds decision trees in a top-down recursive manner. Along with that, it adheres to the same stalling constraints and follows the identical labeling rule as the P-Tree strategy. Algorithm 5 outlines the pseudocode of the MR-P-Tree technique.

The MR-P-Tree algorithm leverages the Map-Reduce framework to handle the computational complexity and data volume when building decision trees. However, potential scalability limitations may arise, particularly in terms of memory usage and processing time, when dealing with very large data sets.

5. Experimental studies

This section outlines an overview of the various experimental studies that were conducted to evaluate the effectiveness of the P-Tree, ensemble P-Tree, and MR-P-Tree methodologies. First, Section 5.1 delivers a summary of the data sets used for training/testing, validation technique, performance metrics, and the hardware and software specifications utilized to carry out the experiments. For parameter setting, the behavior of the P-Tree algorithm for several values of the stop criterion 𝜖 is investigated in Section 5.2. Moreover, Section 5.3 introduces a comparison analysis of the P-Tree strategy and a few other currently used decision tree techniques and provides a comparative study among the ensemble P-Tree, RF, and XGBoost approaches. Furthermore, Section 5.4 contrasts the performances of the two procedures, P-Tree and MR-P-Tree, provides the examination of the efficacy of the MR-P-Tree method on several data sets of greater size, and evaluates its parallel performance.
Algorithm 3 MR-Splitting Point-Selection (MR-SP-S) method
Input: The data set 𝐷 and the splitting attribute 𝐴𝑘∗.
Output: The splitting point 𝑐∗.
1: Initialize a MapReduce Job SPLITRULEJOB:
2: Set SplitRuleTaskMapper as the Mapper Class.
3: Set SplitRuleTaskReducer as the Reducer Class.
4: Pretend that the data set 𝐷 can be divided into 𝑁 subsets 𝐷(1), … , 𝐷(𝑁).
5: In the 𝑙-th SplitRuleTaskMapper:
   Input: The attribute 𝐴(𝑙)𝑘∗.
   Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴(𝑙)𝑘∗, 𝑐∗(𝑙)⟩.
6: if 𝐴(𝑙)𝑘∗ is a categorical attribute then
7:   Define the set of splitting points 𝑐∗(𝑙) = ⋃ 𝑗 = 1..𝑚 {𝑐𝑗}, such that 𝑐𝑗 ∈ {𝐴(𝑙)𝑘∗(1), 𝐴(𝑙)𝑘∗(2), … , 𝐴(𝑙)𝑘∗(𝑛′)} and 𝑚 is the number of distinct values of the attribute 𝐴(𝑙)𝑘∗ in the subset 𝐷(𝑙).
8: else
9:   Sort the values of 𝐴(𝑙)𝑘∗ in ascending order and then record the sorted values as 𝑢1𝑘∗ ≤ 𝑢2𝑘∗ ≤ … ≤ 𝑢𝑛′𝑘∗.
10:  Find all the midpoints 𝑐𝑗 = (𝑢𝑖𝑘∗ + 𝑢𝑖+1𝑘∗)∕2, 𝑖 = 1, 2, … , 𝑛′ − 1.
11:  for each midpoint 𝑐𝑗 do
12:    Define the vector 𝑉(𝐴(𝑙)𝑘∗, 𝑐𝑗) using Eq. (18).
13:    Compute the relevance of 𝑉(𝐴(𝑙)𝑘∗, 𝑐𝑗) using Eq. (4).
14:  end for
15:  Select the optimal cut point 𝑐∗(𝑙) that verifies the following condition: 𝑐∗(𝑙) = argmax over 𝑐𝑗 of 𝛹(𝑉(𝐴(𝑙)𝑘∗, 𝑐𝑗), 𝑌).
16: end if
17: Mapper Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴(𝑙)𝑘∗, 𝑐∗(𝑙)⟩.
18: In the SplitRuleTaskReducer:
   Input: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴𝑘∗, [𝑐∗(𝑙), 𝑙 = 1, 2, … , 𝑁]⟩.
   Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴𝑘∗, 𝑐∗⟩.
19: if the attribute 𝐴𝑘∗ is categorical then
20:   𝑐∗ = ⋃ 𝑙 = 1..𝑁 𝑐∗(𝑙).
21: else
22:   𝑐∗ = (∑ 𝑙 = 1..𝑁 𝑐∗(𝑙)) ∕ 𝑁.
23: end if
24: Reducer Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝐴𝑘∗, 𝑐∗⟩.
25: Return the partitioning point 𝑐∗.
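The Reduce step of Algorithm 3 (lines 19–24) only has to merge the local cut points, which the following hedged R sketch expresses (local_points is assumed to be the list of the 𝑐∗(𝑙) values emitted by the mappers; the function name is illustrative):

# MR-SP-S Reduce step: union of the local splitting points for a categorical attribute,
# average of the local cut points for a numerical one.
reduce_splitting_point <- function(local_points, is_categorical) {
  if (is_categorical) unique(unlist(local_points)) else mean(unlist(local_points))
}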
Algorithm 4 MR-Splitting-Data Set (MR-S-DS) method
Input: The data set 𝐷, the splitting attribute 𝐴𝑘∗, and the splitting point 𝑐∗.
Output: Subsets 𝛺.
1: Initialize a MapReduce Job SPLITRULEJOB:
2: Set SplitRuleTaskMapper as the Mapper Class.
3: Set SplitRuleTaskReducer as the Reducer Class.
4: Pretend that 𝐷 can be divided into 𝑁 subsets 𝐷(1), … , 𝐷(𝑁).
5: In the 𝑙-th SplitRuleTaskMapper:
   Input: A subset 𝐷(𝑙) = {𝑥1, … , 𝑥𝑛′}.
   Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝑖𝑑, 𝑥𝑖⟩, where 𝑖𝑑 is the label of the output subset.
6: for each instance 𝑥𝑖 ∈ 𝐷(𝑙) do
7:   if the splitting attribute 𝐴(𝑙)𝑘∗ is categorical then
8:     𝑖𝑑(𝑥𝑖) = 𝐴(𝑙)𝑘∗(𝑖).
9:   else
10:    if 𝐴(𝑙)𝑘∗(𝑖) ≤ 𝑐∗ then
11:      𝑖𝑑(𝑥𝑖) = 0.
12:    else
13:      𝑖𝑑(𝑥𝑖) = 1.
14:    end if
15:  end if
16:  Mapper Output: ⟨𝑘𝑒𝑦, 𝑣𝑎𝑙𝑢𝑒⟩ = ⟨𝑖𝑑, 𝑥𝑖⟩.
17: end for
18: Add 𝐷𝑖𝑑 to 𝛺, where 𝐷𝑖𝑑 is the subset with the label 𝑖𝑑.
19: Return 𝛺.

Algorithm 5 MR-Preordonance-based decision Tree (MR-P-Tree) algorithm
Input: 𝐷: the data set, 𝐴𝑠𝑒𝑡: the set of attributes, 𝜖: a stopping criterion.
Output: A MR-P-Tree decision tree.
1: Create initially the root node containing the whole data set 𝐷.
2: Assume 𝑝(𝐷) is the proportion of instances covered by 𝐷.
3: if all the instances in 𝐷 belong to the same class, the set 𝐴𝑠𝑒𝑡 is empty, or 𝑝(𝐷) < 𝜖 then
4:   Mark the root node as a leaf node and assign the majority class as its label.
5:   return
6: else
7:   Get the best splitting attribute 𝐴𝑘∗ using Algorithm 2.
8:   Pinpoint the optimal splitting point 𝑐∗ for the chosen attribute 𝐴𝑘∗ by employing Algorithm 3.
9:   Get the subsets 𝛺 based on 𝐴𝑘∗ and 𝑐∗ by Algorithm 4.
10:  Remove the attribute 𝐴𝑘∗ from the attribute set 𝐴𝑠𝑒𝑡.
11:  Recursively search new tree nodes from 𝛺 by Algorithm 5, respectively.
12: end if
13: Output a MR-P-Tree model.
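The Map step of Algorithm 4 can be pictured with the short R sketch below (a sketch under the assumption that each subset 𝐷(𝑙) is a data.frame; split_attr, c_star, and the 0/1 keys mirror lines 6–16 of the pseudocode, and the function name is illustrative rather than part of the MR-P-Tree implementation):

# MR-S-DS Map step: emit every instance under the key (id) of the child node it falls
# into, so that the subsets Omega can be collected by key in the Reduce step.
map_split_dataset <- function(D_l, split_attr, c_star, is_categorical) {
  key <- if (is_categorical) as.character(D_l[[split_attr]])
         else ifelse(D_l[[split_attr]] <= c_star, "0", "1")
  split(D_l, key)                        # child subsets D_id keyed by id
}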
health monitoring of multipoint tool inserts. It contains a total of 250 instances, each described by thirteen statistical attributes extracted from the signals. These attributes include Kurtosis, Standard Error, Maximum value, Skewness, Minimum value, Range, Count, Summation, Variance, Standard Deviation, Mode, Median [47,48]. The characteristics of the tested data sets are summarized in Tables 9 and 10.

Selecting the benchmark data sets for evaluating the P-Tree, ensemble P-Tree, and MR-P-Tree approaches is not arbitrary; rather, it demands a comprehensive selection process that embraces diverse characteristics. Indeed, the criteria used for choosing the selected data sets include data set sizes, attribute types, class distributions, and application domains. For instance, the data set sizes range from small-scale data sets like Somerville Happiness Survey with 143 instances to larger data sets such as EEG eye with 14980 instances, enabling evaluation across different data volumes. Moreover, attribute types vary from numerical (e.g., Sonar with 60 numerical attributes) to nominal (e.g., Mushroom with 22 nominal attributes), presenting distinct challenges for algorithm assessment. Additionally, the inclusion of data sets with both binary (e.g., Breast Cancer) and multi-class (e.g., Iris) classification tasks ensures a comprehensive evaluation of algorithm effectiveness. Finally, data sets from various domains (e.g., life sciences, social sciences, medical sciences, and machining industry) offer insights into algorithm adaptability across different application areas. By meticulously selecting data sets that embody these characteristics, biases in evaluation can be mitigated, ensuring a thorough and unbiased assessment of the performance of the P-Tree, ensemble P-Tree, and MR-P-Tree decision tree algorithms across diverse scenarios and data types.

An effective performance evaluation of a classification model requires assessing it against some previously unseen data. Consequently, it is necessary to split the available data set into two independent subsets: the training and testing data sets. Typically, the training set is employed to implement the learning model, while the testing set is utilized to validate the trained model [49]. The k-fold cross-validation
Table 9
Detailed information of the small size data sets.
No. Data sets Instances Pairs Numerical attributes Nominal attributes Classes Missing values
1 Bankruptcy 250 31 125 0 6 2 no
2 Breast Cancer 286 40 755 0 9 2 yes
3 Glass 214 22 791 9 0 2 no
4 Hepatitis 155 11 935 6 13 2 yes
5 Iris 150 11 175 4 0 3 no
6 New Thyroid 215 23 005 5 0 3 no
7 Somerville Happiness Survey 143 10 153 0 6 2 no
8 Sonar 208 21 528 60 0 2 no
9 Spect (heart) 267 35 511 0 22 2 no
10 SPECTF Heart 267 35 511 44 0 2 no
11 Statlog 270 36 315 6 7 2 no
12 TAE 151 11 325 1 4 3 no
13 VibeMonitor 250 31 125 13 0 5 no
14 Wine 178 15 753 13 0 3 no
Table 10
Detailed information of the medium- and large-scale data sets.
No. Data sets Instances Pairs Numerical attributes Nominal attributes Classes Missing values
1 Bank Marketing 4521 10 217 460 7 9 2 no
2 Breast Cancer W-D 569 161 596 30 0 2 no
3 Breast Cancer Wisconsin 699 243 951 9 0 2 yes
4 Cmc 1473 1 084 128 2 7 3 no
5 Diabetic Retinopathy 1151 661 825 16 3 2 no
6 EEG eye 14 980 112192710 14 0 2 no
7 Mushroom 8124 32 995 626 0 22 2 yes
8 Pima 768 294 528 8 0 2 no
9 Tic Tac Toe 958 458 403 0 9 2 no
10 Vehicle 846 357 435 18 0 2 no
procedure is among the most well-known validation techniques. The data set is initially partitioned into 𝑘 disjoint groups of nearly equal sizes, and then a process of classifier creation and test data examination is performed. At each iteration, one fold (subset) is used for testing, while the remaining 𝑘 − 1 subsets are combined to be utilized for training. Each fold is subjected to the same procedure in turn. Finally, the overall performance of the classification model can be estimated by averaging its performances across the 𝑘 folds that were utilized as a test subset [50]. In this work, except for the Spect (heart) and SPECTF Heart data sets, which have already been split into separate training and test sets (80 and 187 instances, respectively), all other data sets are handled by employing the five-fold cross-validation technique [49]. Consequently, all the reported results in this framework are the average values of thirty times five-fold cross-validation (one time is represented by a complete five-fold cross-validation).
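This evaluation protocol corresponds to the following R sketch (fit_fun and predict_fun stand in for the training and prediction routines of the compared models, which are not reproduced here; the class column name and the function name are assumptions for illustration):

# Thirty repetitions of five-fold cross-validation, reporting the mean test accuracy.
repeated_cv_accuracy <- function(D, fit_fun, predict_fun, reps = 30, k = 5) {
  accs <- c()
  for (r in seq_len(reps)) {
    fold <- sample(rep(seq_len(k), length.out = nrow(D)))   # random fold assignment
    for (f in seq_len(k)) {
      model <- fit_fun(D[fold != f, , drop = FALSE])
      pred  <- predict_fun(model, D[fold == f, , drop = FALSE])
      accs  <- c(accs, mean(pred == D$class[fold == f]))
    }
  }
  mean(accs)
}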
The performance evaluation metrics are essential for assessing the efficacy and applicability of machine learning algorithms. This paper utilizes various key metrics, including classification accuracy, sensitivity, precision, F-measure, tree size, number of leaf nodes, and construction time, to evaluate the performance of the proposed decision tree approaches. Classification accuracy, widely recognized as the most intuitive performance indicator, measures the percentage of instances that are accurately predicted. Sensitivity assesses the percentage of positive instances that are correctly identified as positive, while precision evaluates the percentage of true positive instances among those that the model classified as positive. The F-measure examines the trade-off between sensitivity and precision, offering a comprehensive evaluation metric. On the other hand, evaluating the size and number of leaf nodes in the decision trees constructed using the suggested strategies provides insights into their complexity and interpretability, with smaller and simpler trees generally preferred. Lastly, considering the time taken to construct the decision trees is essential for assessing computational efficiency.

The experiments were performed on a personal computer, where the hardware configuration is an Intel (R) Core (TM) i7-11850H with a 2.50 GHz processor and 32 Go memory size. Furthermore, the operating system is 64-bit Windows 10, while the computing environment is ‘‘R version 4.2.1’’. Meanwhile, the MR-P-Tree procedure is evaluated in a small cluster operating environment that includes one host computer Intel (R) Core (TM) i7-11850H, 2.50 GHz CPU, 32 Go of RAM, and 64-bit, as well as four servant computers with Intel (R) Core (TM) i7-11850H, 2.50 GHz CPU, 16 Go of RAM, and 64-bit. On the other hand, ‘‘Kendall’’, ‘‘data.tree’’, ‘‘tidyverse’’, and ‘‘caret’’ are some of the libraries that were used.

5.2. Parameter setting

The primary objective of the first experiment is to investigate the impact of different values of the stop criterion 𝜖 on the efficiency of the P-Tree algorithm in terms of four evaluation metrics: classification accuracy, size of the constructed tree (total number of tree nodes), number of leaf nodes, and time required to grow the decision tree. Higher classification accuracy, small tree size, fewer leaf nodes, and a shorter construction time all contribute to the superior performance of the decision tree strategy.

The data sets in Table 9 are subjected to the P-Tree methodology, with many values of the stopping criterion 𝜖 ranging from 0.02 to 0.2 under a step length of 0.01. Figs. 4–6 depict intuitive representations of the influence of the parameter 𝜖 on the performance of the P-Tree decision tree technique on fourteen data sets according to testing accuracy, tree size, leaf count, and time needed to develop the decision tree.

Based on the examination of Figs. 4–6, it can be concluded that all of these evaluation measurements are sensitive to variations in the halting criterion 𝜖. On the one hand, the testing accuracy typically alters as the parameter 𝜖 is raised. In general, when the value of the stopping criterion 𝜖 is smaller than or equal to 0.09, almost all data sets provide the maximum classification accuracy, whereas for the remaining data sets, a value of 𝜖 greater than 0.09 can yield the best prediction accuracy. This finding can be explained by the possibility
Fig. 4. Behavior of the performance of the P-Tree decision tree approach for several values of the stop criterion 𝜖 on Bankruptcy, Breast Cancer, Glass, Hepatitis, and Iris data
sets.
that lower values of 𝜖 could lead to an over-partitioning problem and In summary, according to the experimental results, it can be con-
thus poor classification performance. On the other hand, for all fourteen cluded that the unparallel methodology P-Tree performs differently
data sets, the size as well as the number of leaf nodes of the trees that depending on the value of the stopping condition 𝜖 that is employed.
are produced by the P-Tree technique decrease as the stopping criterion
𝜖 increases. Consequently, it can be argued that the stopping criterion
5.3. Comparative study
𝜖 is a useful tool for fighting the over-partitioning problem (small-scale
decision trees can be generated with slightly high values of the measure
𝜖). Furthermore, it can be observed that the tree construction time This section firstly provides a comparison analysis of the P-Tree
has a downward trend with the increasing of the stopping criterion methodology and a few other currently used decision tree techniques,
𝜖. In other words, when using the P-Tree technique with low values including the ID3, C4.5, CART, PCC-Tree and DRDT approaches. After
for the parameter 𝜖, the time required to construct the decision tree is that, it introduces a comparative study among the ensemble P-Tree, RF,
significantly longer. and XGBoost strategies.
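To make the evaluation protocol used throughout this comparative study concrete, the following R sketch shows one possible implementation of a single round of five-fold cross-validation together with the four headline metrics (accuracy, sensitivity, precision, and F-measure). It is only an illustrative sketch under stated assumptions: the P-Tree implementation itself is not reproduced in the paper, so the CART learner from the rpart package stands in for the classifier being evaluated, and sensitivity and precision are computed for the positive class of a binary problem.

```r
library(rpart)  # stand-in learner; the authors' P-Tree code is not shown here

# One round of k-fold cross-validation; repeating this thirty times and
# averaging reproduces the "thirty times five-fold cross-validation" protocol.
cv_metrics <- function(data, target, positive, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  out <- matrix(NA_real_, nrow = k, ncol = 4,
                dimnames = list(NULL, c("accuracy", "sensitivity",
                                        "precision", "f_measure")))
  form <- stats::as.formula(paste(target, "~ ."))
  for (i in seq_len(k)) {
    train <- data[folds != i, ]
    test  <- data[folds == i, ]
    fit   <- rpart(form, data = train, method = "class")
    pred  <- predict(fit, newdata = test, type = "class")
    truth <- test[[target]]
    tp <- sum(pred == positive & truth == positive)
    fp <- sum(pred == positive & truth != positive)
    fn <- sum(pred != positive & truth == positive)
    acc  <- mean(pred == truth)
    sens <- tp / (tp + fn)          # true positive rate
    prec <- tp / (tp + fp)          # positive predictive value
    out[i, ] <- c(acc, sens, prec, 2 * sens * prec / (sens + prec))
  }
  colMeans(out)  # average over the k held-out folds
}

# Hypothetical usage on a binary data set `d` whose class column is "Class":
# res <- replicate(30, cv_metrics(d, target = "Class", positive = "yes"))
# rowMeans(res)  # averages over the thirty repetitions
```

Tree size, leaf count, and construction time, the remaining criteria of this section, would be read off the fitted model and timed around the training call; they are omitted here to keep the sketch short.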
Fig. 5. Behavior of the performance of the P-Tree decision tree approach for several values of the stop criterion 𝜖 on New Thyroid, Somerville Happiness Survey, Sonar, Spect
(heart), and SPECTF Heart data sets.
5.3.1. The P-Tree algorithm versus the ID3, C4.5, CART, PCC-Tree and DRDT algorithms

To further prove the effectiveness of the newly developed P-Tree technique, its performance is compared with that of five existing decision tree models on a variety of data sets [51,52]. These five strategies include three well-known methods, Iterative Dichotomizer 3 (ID3) [1], C4.5 [5], and Classification And Regression Trees (CART) [7], as well as two more recent decision tree techniques, the Pearson's Correlation Coefficient based decision Tree (PCC-Tree) [2] and the Dispersion Ratio-based Decision Tree (DRDT) construction technique [20]. The 𝛹 measure is therefore contrasted with the information gain, gain ratio, Gini index, Pearson's correlation coefficient, and dispersion ratio splitting measures.

During the comparison, the stopping criterion 𝜖 of the P-Tree method is set, for each data set, at the recommended value, that is, the value for which the highest classification accuracy is reached. Table 11 displays the testing accuracy, sensitivity, precision, and F-measure of the decision trees constructed by the ID3, C4.5, CART, PCC-Tree, DRDT, and P-Tree approaches on the fourteen data sets. For each data set, the best results are highlighted in bold font. Furthermore, using box plots, Fig. 7 graphically shows the classification performances accomplished by the different decision tree techniques on the fourteen data sets.
Fig. 6. Behavior of the performance of the P-Tree decision tree approach for several values of the stop criterion 𝜖 on Statlog, TAE, VibeMonitor, and Wine data sets.
Fig. 7. Comparison among the classification performances of the ID3, C4.5, CART, PCC-Tree, DRDT, and P-Tree methods.
Table 11
Comparison among the classification performances reached using the ID3, C4.5, CART, PCC-Tree, DRDT, and P-Tree decision tree approaches on the fourteen data sets (best values
are in bold font).
Data sets ID3 C4.5 CART PCC-Tree DRDT P-Tree Measures Data sets ID3 C4.5 CART PCC-Tree DRDT P-Tree
Bankruptcy 98.32 99.58 98.30 91.12 97.52 99.53 Accuracy Breast 69.57 66.06 69.94 68.67 67.37 68.66
96.26 99.96 96.26 92.98 98.69 99.91 Sensitivity Cancer 85.84 76.60 86.11 83.53 78.17 87.84
99.84 99.11 99.81 87.76 95.83 99.15 Precision 74.73 75.64 74.98 74.91 76.15 75.01
97.97 99.52 97.95 90.00 97.18 99.46 F-measure 79.74 75.95 80.01 78.85 76.99 79.65
Glass 73.85 68.89 74.10 67.29 66.35 71.40 Accuracy Hepatitis 80.13 79.06 78.68 77.56 77.74 82.82
82.86 77.58 83.22 71.30 85.87 77.36 Sensitivity 43.72 56.43 44.49 49.86 44.22 48.68
78.33 76.01 78.48 76.89 69.37 78.54 Precision 56.67 51.53 50.68 47.70 47.51 64.66
80.20 75.73 80.43 73.56 76.59 77.53 F-measure 46.61 52.16 44.98 47.17 44.12 53.39
Iris 93.27 93.91 93.53 93.49 88.17 93.31 Accuracy New 91.54 90.22 89.97 89.75 88.86 90.91
93.27 93.91 93.53 93.49 88.17 93.31 Sensitivity Thyroid 86.10 83.27 84.70 84.56 79.70 86.64
93.91 94.43 94.13 94.03 89.74 94.15 Precision 92.02 91.37 89.85 89.11 89.40 90.30
93.21 93.86 93.48 93.45 88.20 93.24 F-measure 87.94 85.72 85.95 85.64 82.66 87.50
Somerville 62.03 61.81 62.16 58.68 60.43 66.79 Accuracy Sonar 73.26 64.44 70.65 71.39 70.20 74.14
Happiness 69.83 65.67 69.26 68.38 68.03 72.34 Sensitivity 75.39 64.47 73.90 75.08 71.12 75.92
Survey 63.72 64.53 64.04 60.58 62.27 68.31 Precision 75.46 67.99 72.71 72.81 72.97 76.13
66.04 64.58 65.91 63.51 64.58 69.76 F-measure 74.89 65.56 72.71 73.57 71.58 75.59
Spect 75.94 75.12 75.94 71.05 73.88 80.21 Accuracy SPECTF 70.59 72.40 77.54 65.78 79.35 73.62
(heart) 75.58 74.56 75.58 70.27 74.26 80.81 Sensitivity Heart 71.51 71.67 79.65 65.70 85.67 76.55
97.74 97.89 97.74 97.58 96.56 97.80 Precision 95.35 97.72 95.14 95.76 91.36 93.60
85.25 84.62 85.25 81.69 83.92 88.25 F-measure 81.73 82.68 86.71 77.93 88.34 84.22
Statlog 78.64 74.61 78.68 74.70 76.24 74.47 Accuracy TAE 46.39 52.83 45.55 49.52 57.70 50.65
82.42 76.38 82.19 75.98 78.21 79.16 Sensitivity 46.32 52.75 45.52 49.64 57.65 50.50
80.06 77.88 80.33 78.36 79.31 76.32 Precision 47.61 53.77 46.86 53.87 58.74 51.07
80.97 76.87 80.94 76.84 78.49 77.41 F-measure 45.39 52.13 44.66 47.41 57.02 49.63
VibeMonitor 97.28 98.88 98.92 97.84 62.96 99.44 Accuracy Wine 90.82 92.75 88.74 93.19 82.29 93.27
97.28 98.88 98.92 97.84 62.96 99.44 Sensitivity 90.92 92.98 88.80 93.55 83.05 93.59
97.65 99.00 99.02 98.07 64.08 99.49 Precision 91.75 93.35 89.80 93.73 83.56 94.15
97.25 98.87 98.91 97.82 62.34 99.44 F-measure 90.86 92.85 88.69 93.30 82.65 93.40
Through the analysis of Table 11 and Fig. 7, it can be observed that the average classification accuracy attained by the P-Tree strategy is the best among the six average testing accuracies, reaching 79.95%, which is 1.26%, 2.05%, 1.18%, 3.52%, and 5.01% higher than the averages yielded by the ID3, C4.5, CART, PCC-Tree, and DRDT algorithms, respectively. Additionally, out of the fourteen data sets, the ID3, C4.5, CART, PCC-Tree, and DRDT decision tree methods produced the greatest classification accuracy on only 1, 2, 3, 0, and 2 data sets, respectively, whereas the P-Tree methodology exhibits the highest testing accuracy on 6 data sets out of 14.

Similarly, the P-Tree strategy also achieves the highest average classification sensitivity, precision, and F-measure among all the compared decision tree techniques, reaching 80.15%, 82.76%, and 80.61%, respectively.

Furthermore, a statistical analysis is performed in order to determine whether or not there is a statistically significant difference between the classification performances attained using the suggested P-Tree strategy and those attained employing the other decision tree procedures considered in the comparison. The Friedman test [53], a non-parametric statistical test that allows the comparison of classifier models across various data sets based on average ranks, is used to conduct this analysis. Generally, the null hypothesis of the Friedman test states that the performances of the compared classifiers are not significantly different at the significance level 𝛼 = 5%.

The Friedman statistic 𝜒𝐹² is calculated by utilizing the formula below:

𝜒𝐹² = (12𝑁 / (𝑘(𝑘 + 1))) [∑𝑗=1,…,𝑘 𝑅𝑗² − 𝑘(𝑘 + 1)²/4],   (20)

where 𝑁 denotes the total number of data sets, 𝑘 denotes the number of classifiers, and 𝑅𝑗 denotes the average rank of the 𝑗th classifier. Iman's F statistic can then be computed as follows:

𝐹𝐹 = (𝑁 − 1) 𝜒𝐹² / (𝑁(𝑘 − 1) − 𝜒𝐹²).   (21)

On the other hand, since the test is associated with 𝑁 data sets and 𝑘 classifiers, the statistic 𝐹𝐹, which is distributed according to the F-distribution with 𝑘 − 1 and (𝑘 − 1)(𝑁 − 1) degrees of freedom, has a critical value 𝐹𝛼(𝑘 − 1, (𝑘 − 1)(𝑁 − 1)) for the significance level of 0.05. Finally, the null hypothesis is rejected if 𝐹𝐹 is greater than this critical value, in which case the Friedman test demonstrates that at least two classifiers differ significantly.

In this study, four independent Friedman tests are carried out to check whether there is any meaningful difference between the classification accuracies, sensitivities, precisions, and F-measures achieved by the ID3, C4.5, CART, PCC-Tree, DRDT, and P-Tree techniques.

The average performance ranks for the four classification performances attained by the six decision tree procedures are first shown in Fig. 8. In the figure, the approach that performs most effectively is typically the one with the smallest average rank, whereas the approach with the greatest average ranking typically performs the worst. According to the analysis of Fig. 8, the P-Tree strategy appears to be the best decision tree method among the six approaches for all four classification metrics.

Given that 𝑁 = 14 data sets and 𝑘 = 6 decision tree approaches are compared, the Friedman test results are presented in Table 12.

Table 12
Friedman test results.
Measures 𝜒𝐹² 𝐹𝐹 Hypotheses (𝛼 = 0.05)
Accuracy 16.5408 4.0223 Rejected
Sensitivity 12.6225 2.8599 Rejected
Precision 13.2959 3.0482 Rejected
F-measure 18.0510 4.5172 Rejected

As displayed in Table 12, the four null hypotheses are rejected, since all of Iman's F statistics are greater than the critical value 𝐹0.05(5, 65) = 2.356. Therefore, the Friedman test affirms that there is a significant difference among the classification accuracies, sensitivities, precisions, and F-measures yielded by the ID3, C4.5, CART, PCC-Tree, DRDT, and P-Tree procedures.
Fig. 8. Average performance rankings for classification accuracies, sensitivities, precisions, and F-measures.
Table 13
Pairwise differences of average performance ranks for testing accuracies.
Algorithms ID3 C4.5 CART PCC-Tree DRDT P-Tree
ID3 –
C4.5 0.3929 –
CART 0.2143 0.6071 –
PCC-Tree 1.3929 1.0000 1.6071 –
DRDT 1.5357 1.1429 1.7500 0.1429 –
P-Tree 0.7500 1.1429 0.5357 2.1429 2.2857 –

Table 14
Pairwise differences of average performance ranks for testing sensitivities.
Algorithms ID3 C4.5 CART PCC-Tree DRDT P-Tree
ID3 –
C4.5 0.0357 –
CART 0.5000 0.5357 –
PCC-Tree 0.6786 0.6429 1.1786 –
DRDT 0.6071 0.5714 1.1071 0.0714 –
P-Tree 1.4643 1.5000 0.9643 2.1429 2.0714 –

Table 15
Pairwise differences of average performance ranks for testing precisions.
Algorithms ID3 C4.5 CART PCC-Tree DRDT P-Tree
ID3 –
C4.5 0.3929 –
CART 0.1429 0.5357 –
PCC-Tree 0.8929 1.2857 0.7500 –
DRDT 1.3214 1.7143 1.1786 0.4286 –
P-Tree 0.8929 0.5000 1.0357 1.7857 2.2143 –

In this situation, a post-hoc Nemenyi test [53] is further performed to find out which strategy produces significantly better results than the other methods at the significance level 𝛼 = 5%. According to the Nemenyi test, the performances of two classifiers are significantly different if their corresponding average ranks differ by at least the critical difference (CD), which is determined by Eq. (22):

𝐶𝐷 = 𝑞𝛼 √(𝑘(𝑘 + 1) / (6𝑁)) = 2.0153,   (22)

where 𝑞𝛼 is the critical value for post-hoc tests (𝑞0.05 = 2.850 in this study).

The pairwise differences (in absolute value) of the average performance ranks for the classification accuracies, sensitivities, precisions, and F-measures attained by the ID3, C4.5, CART, PCC-Tree, DRDT, and P-Tree algorithms are summarized in Tables 13–16, respectively. These pairwise differences are compared to the critical difference, and those that are greater than this value are highlighted in blue.

Based on the analysis of Tables 13–16, it can be observed that the P-Tree approach significantly outperforms the PCC-Tree and DRDT methodologies in terms of classification accuracy, sensitivity, and F-measure. Furthermore, it also significantly exceeds the DRDT algorithm in terms of precision. On the other side, although there is not a significant difference between the P-Tree strategy and the conventional ID3, C4.5, and CART techniques, they still perform worse than the P-Tree method on the four tested measurements. Finally, it can be concluded that the Nemenyi test confirms that none of the existing decision tree models (ID3, C4.5, CART, PCC-Tree, and DRDT) is noticeably better than the P-Tree procedure across all four classification performances.

In light of the aforementioned study, the P-Tree algorithm is unquestionably a beneficial method for creating sufficiently effective decision trees.
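For readers who want to re-derive the reported test statistics, the short R sketch below applies Eqs. (21) and (22) to the quantities given in this subsection. It only recomputes the derived values (Iman's F statistic, the critical value of the F-distribution, and the Nemenyi critical difference) from the 𝜒𝐹² values of Table 12 and from 𝑁, 𝑘, and 𝑞0.05; it is an illustrative sketch, not code taken from the authors.

```r
# Reproducing the derived statistics of Section 5.3.1 from Eqs. (20)-(22).
N <- 14   # number of data sets
k <- 6    # number of compared decision tree methods

# Friedman chi-square statistics reported in Table 12
chi2 <- c(Accuracy = 16.5408, Sensitivity = 12.6225,
          Precision = 13.2959, F_measure = 18.0510)

# Eq. (21): Iman's F statistic
FF <- (N - 1) * chi2 / (N * (k - 1) - chi2)
round(FF, 4)        # 4.0223 2.8599 3.0482 4.5172 (matches Table 12)

# Critical value of the F distribution with (k-1, (k-1)(N-1)) degrees of freedom
qf(0.95, df1 = k - 1, df2 = (k - 1) * (N - 1))   # approximately 2.356

# Eq. (22): Nemenyi critical difference with q_0.05 = 2.850
CD <- 2.850 * sqrt(k * (k + 1) / (6 * N))
round(CD, 4)        # 2.0153
```

Any pairwise rank difference in Tables 13–16 larger than this CD (for example, 2.1429 for P-Tree versus PCC-Tree on accuracy) is the basis for the significance statements above.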
Table 16
Pairwise differences of average performance ranks for testing F-measures.
Algorithms ID3 C4.5 CART PCC-Tree DRDT P-Tree
ID3 –
C4.5 0.4643 –
CART 0.0714 0.5357 –
PCC-Tree 1.4643 1.0000 1.5357 –
DRDT 1.4643 1.0000 1.5357 0 –
P-Tree 0.9643 1.4286 0.8929 2.4286 2.4286 –

Table 17
Performance rankings for the classification accuracies achieved by the RF, XGBoost, and ensemble P-Tree approaches.
Data sets RF XGBoost Ensemble P-Tree
Bankruptcy 1 3 2
Breast Cancer 2 3 1
Hepatitis 1 3 2
Iris 2 3 1
Somerville Happiness Survey 3 2 1
Spect (heart) 2 3 1
Average rankings 1.8333 2.8333 1.3333
5.3.2. The ensemble P-Tree technique versus the RF and XGBoost approaches

Decision tree methods are often employed for their simplicity, interpretability, and capacity to deal with both categorical and numerical data sets. Nevertheless, the instability of decision trees is a major drawback: this instability describes how easily the decision tree model is affected by even slight modifications to the training data set. To mitigate the instability often associated with tree-based algorithms, an ensemble method known as ''ensemble P-Tree'' presents a robust solution. This ensemble strategy leverages the strengths of the P-Tree methodology to enhance predictive performance through collective decision-making. Similar to the popular random forest strategy, the ensemble P-Tree algorithm integrates multiple decision trees constructed using the P-Tree algorithm. This technique employs bootstrap sampling to generate a variety of subsets from the original data set, from which individual decision trees are independently built. Each tree is trained on a distinct bootstrap sample, enhancing the diversity and robustness of the model. During the prediction phase, the ensemble aggregates the forecasts from all the individual trees through majority voting to make the final decision. This procedure ensures that the P-Tree approach is evaluated comprehensively and reliably across various subsets of the data set, enhancing its overall performance. In comparison, the Random Forest (RF) [54] approach is an ensemble learning method that builds many decision trees using bootstrap sampling and aggregates their predictions using majority voting. It is known for its resilience and its capacity to manage a high number of input attributes without overfitting. Similarly, the Extreme Gradient Boosting (XGBoost) [55] technique is a potent gradient boosting framework that creates decision trees sequentially, with each tree correcting the mistakes of its predecessors. It is widely recognized for its exceptional performance and efficiency in managing large data sets and intricate models.

In this section, the ensemble P-Tree methodology is thoroughly evaluated by comparing its performance to that of two top-performing ensemble decision tree algorithms: the RF and XGBoost strategies. During this experiment, the stopping criterion 𝜖 was set at 0 for the ensemble P-Tree procedure. Fig. 9 illustrates the classification accuracy achieved by the three strategies on the Bankruptcy, Breast Cancer, Hepatitis, Iris, Somerville Happiness Survey, and Spect (heart) data sets.

Upon analysis of Fig. 9, it becomes evident that the ensemble P-Tree methodology consistently demonstrates competitive or superior performance across a diverse array of data sets compared to the RF and XGBoost algorithms. Notably, in data sets such as Breast Cancer and Spect (heart), the ensemble P-Tree strategy stands out with the highest classification accuracy. Even in cases where the RF technique marginally outperforms it, the ensemble P-Tree procedure remains highly competitive.

To strengthen this assertion, a statistical analysis is conducted using the Friedman test, with the null hypothesis positing equivalence of the classification accuracy levels reached by the RF, XGBoost, and ensemble P-Tree methodologies. First, Table 17 provides the performance rankings for the classification accuracies achieved by the three approaches across the six data sets, with 1 indicating the highest accuracy and 3 indicating the lowest.

It appears from a review of Table 17 that the ensemble P-Tree technique consistently achieves the best performance across most data sets, as indicated by its lowest average ranking of 1.3333. Conversely, the XGBoost model tends to perform relatively poorer, with the highest average ranking of 2.8333, while the RF algorithm falls in between with an average ranking of 1.8333. Overall, these results affirm that the ensemble P-Tree method consistently leads in terms of classification accuracy.

Given that the Friedman test is based on six data sets and three classification strategies, the test statistic 𝐹𝐹 follows an F-distribution with 2 and 10 degrees of freedom, and the critical value for a significance level of 0.05 is 𝐹0.05(2, 10) = 4.103. Since the test statistic 𝐹𝐹, which equals 7, exceeds the critical value, the null hypothesis is rejected. Consequently, the Friedman test indicates that the RF, XGBoost, and ensemble P-Tree techniques are not comparable in terms of classification accuracy. To further identify which specific approach outperforms the others, the Nemenyi test is conducted at a significance level of 𝛼 = 5%. By comparing the pairwise differences between the average rankings (reported in Table 17) with the critical difference CD, which equals 1.3527, it is revealed that the ensemble P-Tree method significantly outperforms the XGBoost model.

In conclusion, the experimental study underscores the effectiveness of the ensemble P-Tree method in generating more accurate decision trees. Moreover, the Friedman and Nemenyi tests validate its superior performance and competitiveness against other ensemble decision tree techniques.

5.4. Large-scale data sets: the MR-P-Tree strategy

The MR-P-Tree methodology excels in reducing the overall processing time without sacrificing performance owing to its effective use of parallel computing within the Map-Reduce architecture. Indeed, this is illustrated in this section by contrasting the effectiveness of the P-Tree and MR-P-Tree methods on six different data sets: Bankruptcy, Breast Cancer, Hepatitis, Iris, Statlog, and TAE. The parallel algorithm MR-P-Tree is performed on a cluster with various numbers of processors (3, 5, and 7). The stopping criterion 𝜖 was set at 0.2 for both procedures during this experiment. Fig. 10 illustrates the results of the P-Tree and MR-P-Tree techniques in terms of testing accuracy, decision tree size, number of leaf nodes, and time spent to build the decision tree. The MR-P-Tree methodology is depicted in the figure as MR-P-Tree-3, MR-P-Tree-5, and MR-P-Tree-7 with 3, 5, and 7 processors, respectively.

It can be seen from Fig. 10 that three important conclusions can be summed up as follows:

• First of all, only a slight difference is observable between the classification accuracy attained by employing the MR-P-Tree process with the various Map numbers and that reached by applying the P-Tree strategy to the six data sets. Consequently, it is claimed that the MR-P-Tree and P-Tree techniques perform very similarly in terms of classification accuracy.
• Secondly, the parallel procedure exhibits smaller decision trees when compared to the non-parallel methodology. Furthermore, as the number of processors increases, the size and the number of leaf nodes of the decision trees produced by the MR-P-Tree strategy decrease.
Fig. 9. Comparison of the performance of the ensemble P-Tree technique to that of the RF and XGBoost methods.
Fig. 10. Comparison among the P-Tree and MR-P-Tree methodologies in terms of classification accuracy, tree size, leaf nodes, and tree construction time.
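For reference, the parallel-performance metrics discussed below are assumed to follow their standard definitions for Map-Reduce algorithms: Speedup(p) = T(D, 1)/T(D, p) for a fixed data set D executed on 1 versus p processors, and Sizeup(m) = T(m · D, p)/T(D, p) for a fixed number of processors p, where T(·, ·) denotes execution time. Under these definitions, a linear Sizeup means T(m · D, p) ≈ m × T(D, p), which is the ideal case described next.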
Similar to the Speedup metric, the best situation is a linear Sizeup, meaning that if 𝑡 denotes the time required to execute a data set on a specific number of processors, then the time needed for executing 𝑚 times the data set by employing the same number of processors should be 𝑚 × 𝑡.

Using all of the tested processor numbers, the Sizeup has a rising trend as the size of the data set grows, as can be seen by looking at Fig. 11(c), which proves the success of the Sizeup of the MR-P-Tree strategy.

5.5. Summary and general discussion
Table 18
Performance of the MR-P-Tree methodology on the various tested data sets.
Data sets No. of processors Measures Data sets No. of processors
three five seven three five seven
Bank 89.20 89.06 89.09 Accuracy Breast 92.39 92.54 92.51
Marketing 97.97 97.56 97.39 Sensitivity Cancer 94.73 94.60 94.00
90.63 90.82 90.98 Precision W-D 93.41 93.72 94.22
94.13 94.04 94.04 F-measure 93.99 94.08 94.02
22.78 22.56 22.58 Size of tree 15.03 12.15 9.44
18.38 18.22 18.16 Leaf nodes 8.01 6.57 5.22
Diabetic 61.52 61.28 61.52 Accuracy EEG eye 66.80 64.39 64.64
Retinopathy 53.08 52.31 52.84 Sensitivity 81.72 79.59 84.07
60.84 60.53 60.73 Precision 66.08 64.29 63.55
55.89 55.38 55.77 F-measure 73.07 71.12 72.39
21.16 19.93 18.75 Size of tree 21 21 19
11.08 10.47 9.87 Leaf nodes 11 11 10
The experimental results hold significant implications within the scope of the objective of this study, namely the construction of accurate decision trees by employing a novel segmentation criterion designed for handling heterogeneous data sets and addressing tasks involving both binary and multi-class classification. Specifically, the results highlight the performance of the P-Tree approach in addressing this research goal. In fact, the P-Tree method tackles these challenges by introducing the 𝛹 measure, rooted in preordonance theory, which accurately evaluates attribute relevance to class labels using correlation principles. The evaluation demonstrates that the P-Tree technique exhibits competitive performance on a variety of data sets, highlighting its effectiveness in producing precise decision trees. Furthermore, comparison of these results with previously published decision tree procedures, such as the ID3, C4.5, CART, PCC-Tree, and DRDT strategies, reveals the innovations and contributions of the P-Tree algorithm. Despite the simplicity and interpretability of decision tree approaches, they are prone to instability. In this paper, the ensemble P-Tree methodology was developed to overcome this issue in the P-Tree technique. It generates several data subsets using bootstrap sampling and then constructs P-Trees on each one separately. By combining forecasts via majority voting, the ensemble P-Tree procedure improves stability, predictive performance, and resilience. The examination of classification accuracy across several data sets indicates that the ensemble P-Tree methodology consistently accomplishes results that are competitive with or better than those of other ensemble algorithms, such as RF and XGBoost. On the other side, the MR-P-Tree approach, which leverages the Map-Reduce framework, was introduced to overcome the challenge of successfully implementing the P-Tree methodology on large-scale data sets. The MR-P-Tree technique preserves strong classification performance by utilizing a meticulous process for choosing the optimal splitting attributes, determining appropriate splitting points, and partitioning the training data set within a distributed computing environment. Experiments conducted on various data sets have shown notable execution time savings without compromising performance metrics, demonstrating the effectiveness of the MR-P-Tree algorithm in managing large volumes of data.

The P-Tree and MR-P-Tree strategies have significant practical implications for various real-world applications. On the one hand, the ability of the P-Tree approach to produce reliable decision trees with smaller sizes and shorter construction times makes it well-suited for applications that require fast and interpretable models, such as medical diagnosis or financial risk evaluation. On the other hand, the effective parallel processing capabilities of the MR-P-Tree strategy enable the handling of large-scale data sets in applications like big data analytics or real-time decision-making systems. Collectively, these methodologies provide practical solutions for tasks involving data from diverse areas, significantly advancing the fields of data mining and machine learning.

The MR-P-Tree approach harnesses the power of the Map-Reduce architecture to effectively handle the computational complexity and large-scale data processing associated with decision tree generation. Nevertheless, scalability limitations can occur, especially with very large data sets. Memory usage becomes a critical problem due to the large amount of intermediate data created during the partitioning, relevance metric computation, and splitting point determination phases. This can result in higher memory overheads and potential obstructions to performance, worsened by data replication and shuffling across nodes. Moreover, processing time is significantly impacted by the initial data partitioning overhead, synchronization points in the Map-Reduce paradigm, and recursive tree building. Several techniques can be employed to mitigate these difficulties, including data compression, effective data segmentation, intermediate result caching, and Map-Reduce parameter optimization. These strategies can increase memory efficiency, reduce processing overheads, and improve overall scalability, enabling the MR-P-Tree method to effectively handle very large data sets in a distributed computing environment.
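As a concrete illustration of the bagging scheme summarized above, the following R sketch builds an ensemble by bootstrap sampling and aggregates predictions by majority voting. It is a minimal sketch under stated assumptions: the authors' P-Tree learner is not distributed as a package, so the functions fit_p_tree() and predict_p_tree() are hypothetical stand-ins for it, and any base classifier with the same fit/predict interface could be substituted.

```r
# Minimal bagging sketch mirroring the ensemble P-Tree idea:
# bootstrap sampling + independently grown trees + majority voting.
# fit_p_tree() / predict_p_tree() are hypothetical placeholders, not a real API.

ensemble_p_tree <- function(data, target, n_trees = 50, epsilon = 0) {
  trees <- vector("list", n_trees)
  for (b in seq_len(n_trees)) {
    idx <- sample(nrow(data), replace = TRUE)   # bootstrap sample of the rows
    trees[[b]] <- fit_p_tree(data[idx, ], target = target, epsilon = epsilon)
  }
  structure(list(trees = trees, target = target), class = "ensemble_p_tree")
}

predict.ensemble_p_tree <- function(object, newdata, ...) {
  # one predicted label per tree (columns) and per instance (rows)
  votes <- sapply(object$trees,
                  function(tr) as.character(predict_p_tree(tr, newdata)))
  votes <- matrix(votes, nrow = nrow(newdata))
  # majority vote across the trees for every instance
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```

With epsilon = 0, as in Section 5.3.2, each tree is grown fully on its own bootstrap sample, so the diversity of the ensemble comes entirely from resampling.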
6. Conclusion and future directions

This study develops the preordonance-based decision tree (P-Tree) algorithm, a novel technique for creating decision trees. The preordonance theory serves as the fundamental guiding principle of the proposed decision tree method, which employs the 𝛹 criterion as an impurity measurement for node splitting. The rationale behind the choice of the 𝛹 metric is its versatility in handling data sets with categorical, numerical, and mixed-type attributes, as well as data sets with two or more labels. Moreover, it is unaffected by unbalanced class distributions.

The key steps of the P-Tree decision tree procedure can be summarized as follows: at each level of the tree construction process, the attribute that maximizes the 𝛹 measure is identified as the best splitting attribute. Following that, the best splitting point is determined based on the nature of the preselected attribute. Finally, the training data set is split into two or more subsets according to the attribute and the point that were chosen for partitioning. For each non-empty subset, this procedure is repeated until all instances in the subset belong to the same class. Meanwhile, the P-Tree methodology establishes a threshold 𝜖 in order to prevent the over-partitioning challenge: if the percentage of instances in a given node is lower than the stopping condition 𝜖, this node is flagged as a leaf node. For the labeling rule, each leaf node is labeled based on the majority class of the instances it contains. Furthermore, an ensemble strategy known as ''ensemble P-Tree'' provides a robust solution to address the instability commonly found in tree-based methodologies. This ensemble technique improves predictive performance by employing the strengths of the P-Tree approach through collective decision-making.

On the other hand, the P-Tree approach can only process classification problems on small data sets. Therefore, the MR-P-Tree strategy, a parallel implementation of the P-Tree model in the Map-Reduce programming framework, is designed to produce efficient decision trees from data sets of bigger size. Three parallelized methods are used by the MR-P-Tree procedure, MR-SA-S, MR-SP-S, and MR-S-DS, which are employed to choose the optimal splitting attributes, choose the optimal splitting points, and divide the training data set in parallel, respectively.

To evaluate the effectiveness of the P-Tree, ensemble P-Tree, and MR-P-Tree algorithms, several experiments are carried out using various well-known real-world data sets from different domains. The experimental results demonstrate the extraordinary performance of the suggested techniques in producing strong and useful decision trees.

Despite the impressive performance demonstrated by the P-Tree and MR-P-Tree algorithms in decision tree generation, several opportunities for future research can further enrich and advance these strategies. Firstly, employing pruning techniques could refine and enhance the P-Tree and MR-P-Tree methods, resulting in simpler tree structures, better interpretability, reduced overfitting, and improved decision-making capabilities. Additionally, enhancing the capacity of the P-Tree and MR-P-Tree approaches to effectively manage missing or incomplete data could further bolster their effectiveness in real-world applications, particularly in scenarios where data quality problems are prevalent. Finally, unlike the univariate decision tree procedures P-Tree and MR-P-Tree, extending the scope to include constructing
multivariate decision trees, which integrate multiple attributes during the splitting process based on the preordonance theory, represents a promising direction. These multivariate methodologies have the potential to enhance classification performance and model interpretability, especially in scenarios where intricate relationships between attributes are crucial. In conclusion, the adoption of pruning techniques, the effective management of missing or incomplete data, and the development of multivariate decision trees represent promising avenues to further enhance the performance and applicability of the P-Tree and MR-P-Tree approaches, advancing their utility across diverse real-world applications.

CRediT authorship contribution statement

Hasna Chamlal: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Fadwa Aaboub: Writing – original draft, Software, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. Tayeb Ouaderhman: Writing – review & editing, Visualization, Validation, Supervision, Software, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

References

[1] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106.
[2] Y. Mu, X. Liu, L. Wang, A Pearson's correlation coefficient based decision tree and its parallel implementation, Inform. Sci. 435 (2018) 40–58.
[3] E. Hunt, J. Marin, P. Stone, Experiments in Induction, Academic Press, New York, 1966.
[4] N.E.I. Karabadji, H. Seridi, I. Khelf, N. Azizi, R. Boulkroune, Improved decision tree construction based on attribute selection and data sampling for fault diagnosis in rotating machines, Eng. Appl. Artif. Intell. 35 (2014) 71–83.
[5] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[6] D. Wang, L. Jiang, An improved attribute selection measure for decision tree induction, in: Fourth International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 4, FSKD 2007, IEEE, 2007, pp. 654–658.
[7] L. Breiman, J. Friedman, C.J. Stone, R. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, FL, USA, 1984.
[8] J. Han, M. Kamber, Data Mining: Concepts and Techniques, second ed., Morgan Kaufmann, 2006.
[9] H. Chamlal, T. Ouaderhman, F. Aaboub, Preordonance correlation filter for feature selection in the high dimensional classification problem, in: 2021 Fifth International Conference on Intelligent Computing in Data Sciences, ICDS, IEEE, 2021, pp. 1–5.
[10] H. Chamlal, T. Ouaderhman, F.E. Rebbah, A hybrid feature selection approach for microarray datasets using graph theoretic-based method, Inform. Sci. 615 (2022) 449–474.
[11] H. Chamlal, T. Ouaderhman, B. El Mourtji, Feature selection in high dimensional data: A specific preordonnances-based memetic algorithm, Knowl.-Based Syst. (2023) 110420.
[12] H. Chamlal, T. Ouaderhman, F. Aaboub, A graph based preordonnances theoretic supervised feature selection in high dimensional data, Knowl.-Based Syst. 257 (2022) 109899.
[13] F.Z. Janane, T. Ouaderhman, H. Chamlal, A filter feature selection for high-dimensional data, J. Algorithms Comput. Technol. 17 (2023) 17483026231184171.
[14] T. Ouaderhman, H. Chamlal, F.Z. Janane, A new filter-based gene selection approach in the DNA microarray domain, Expert Syst. Appl. 240 (2024) 122504.
[15] B. Chandra, R. Kothari, P. Paul, A new node splitting measure for decision tree construction, Pattern Recognit. 43 (8) (2010) 2725–2731.
[16] X. Wang, X. Liu, W. Pedrycz, L. Zhang, Fuzzy rule based decision trees, Pattern Recognit. 48 (1) (2015) 50–59.
[17] Y. Mu, L. Wang, X. Liu, A fast rank mutual information based decision tree and its implementation via map-reduce, Concurr. Comput.: Pract. Exper. 30 (10) (2018) e4387.
[18] Q. Hu, X. Che, L. Zhang, D. Zhang, M. Guo, D. Yu, Rank entropy-based decision trees for monotonic classification, IEEE Trans. Knowl. Data Eng. 24 (11) (2011) 2052–2064.
[19] J.C. Bezdek, R. Ehrlich, W. Full, FCM: The fuzzy c-means clustering algorithm, Comput. Geosci. 10 (2–3) (1984) 191–203.
[20] S. Roy, S. Mondal, A. Ekbal, M.S. Desarkar, Dispersion ratio based decision tree model for classification, Expert Syst. Appl. 116 (2019) 1–9.
[21] S. Roy, S. Mondal, A. Ekbal, M.S. Desarkar, CRDT: correlation ratio based decision tree model for healthcare data mining, in: 2016 IEEE 16th International Conference on Bioinformatics and Bioengineering, BIBE, IEEE, 2016, pp. 36–43.
[22] N.E.I. Karabadji, I. Khelf, H. Seridi, S. Aridhi, D. Remond, W. Dhifli, A data sampling and attribute selection strategy for improving decision tree construction, Expert Syst. Appl. 129 (2019) 84–96.
[23] M. Singh, J.K. Chhabra, WITHDRAWN: EGIA: A New Node Splitting Method for Decision Tree Generation: Special Application in Software Fault Prediction, Elsevier, 2021.
[24] H. Zhou, J. Zhang, Y. Zhou, X. Guo, Y. Ma, A feature selection algorithm of decision tree based on feature weight, Expert Syst. Appl. 164 (2021) 113842.
[25] R. Wang, Y.-L. He, C.-Y. Chow, F.-F. Ou, J. Zhang, Learning ELM-tree from big data based on uncertainty reduction, Fuzzy Sets and Systems 258 (2015) 79–100.
[26] C.P. Chen, C.-Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, Inf. Sci. 275 (2014) 314–347.
[27] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107–113.
[28] N. Tsapanos, A. Tefas, N. Nikolaidis, I. Pitas, A distributed framework for trimmed kernel k-means clustering, Pattern Recognit. 48 (8) (2015) 2685–2698.
[29] F. Li, J. Chen, Z. Wang, Wireless MapReduce distributed computing, IEEE Trans. Inform. Theory 65 (10) (2019) 6101–6114.
[30] M. Zhu, D. Shen, G. Yu, Y. Kou, T. Nie, Computing the split points for learning decision tree in MapReduce, in: Database Systems for Advanced Applications: 18th International Conference, DASFAA 2013, Wuhan, China, April 22-25, 2013, Proceedings, Part II 18, Springer, 2013, pp. 339–353.
[31] Y. Mu, X. Liu, Z. Yang, X. Liu, A parallel C4.5 decision tree algorithm based on MapReduce, Concurr. Comput.: Pract. Exper. 29 (8) (2017) e4015.
[32] Y. Mu, X. Liu, L. Wang, A.B. Asghar, A parallel tree node splitting criterion for fuzzy decision trees, Concurr. Comput.: Pract. Exper. 31 (17) (2019) e5268.
[33] Y. Mu, X. Liu, L. Wang, J. Zhou, A parallel fuzzy rule-base based decision tree in the framework of map-reduce, Pattern Recognit. 103 (2020) 107326.
[34] F. Es-Sabery, K. Es-Sabery, J. Qadir, B. Sainz-De-Abajo, A. Hair, B. Garcia-Zapirain, I. De La Torre-Díez, A MapReduce opinion mining for COVID-19-related tweets classification using enhanced ID3 decision tree classifier, IEEE Access 9 (2021) 58706–58739.
[35] S. Fathimabi, E. Jangam, A. Srisaila, MapReduce based heart disease prediction system, in: 2021 8th International Conference on Computing for Sustainable Global Development, INDIACom, IEEE, 2021, pp. 281–286.
[36] S. Chah, Critères de classification sur des données hétérogènes, Rev. Stat. Appl. 33 (2) (1985) 19–36.
[37] I.C. Lerman, Foundations and Methods in Combinatorial and Statistical Data Analysis and Clustering, Springer, 2016.
[38] M.G. Kendall, Rank Correlation Methods, Griffin, 1948.
[39] I.-C. Lerman, Classification et analyse ordinale des données, Dunod, 1981.
[40] T. White, Hadoop: The Definitive Guide, O'Reilly Media Inc, 2015.
[41] I. Triguero, D. Peralta, J. Bacardit, S. García, F. Herrera, MRPR: a MapReduce solution for prototype reduction in big data classification, Neurocomputing 150 (2015) 331–345.
[42] S.A. Salman, S.A. Dheyab, Q.M. Salih, W.A. Hammood, Parallel machine learning algorithms, Mesop. J. Big Data (2023) 12–15.
[43] I. Triguero, S. Del Río, V. López, J. Bacardit, J.M. Benítez, F. Herrera, ROSEFW-RF: the winner algorithm for the ECBDL'14 big data competition: an extremely imbalanced big data bioinformatics problem, Knowl.-Based Syst. 87 (2015) 69–79.
[44] OpenML: A worldwide machine learning lab, 2022, https://www.openml.org/. (Accessed 01 December 2022).
[45] M. Saar-Tsechansky, F. Provost, Handling missing values when applying classification models, 2007.
[46] A.D. Patange, R. Jegadeeshwaran, Data for vibration-based multipoint tool insert health monitoring, 2022, URL https://www.researchgate.net/publication/358751812_Data_for_Vibration-based_multipoint_tool_insert_health_monitoring.
[47] A.D. Patange, R. Jegadeeshwaran, A machine learning approach for vibration-based multipoint tool insert health prediction on vertical machining centre (VMC), Measurement 173 (2021) 108649.
[48] A.D. Patange, S.S. Pardeshi, R. Jegadeeshwaran, A. Zarkar, K. Verma, Augmentation of decision tree model through hyper-parameters tuning for monitoring of cutting tool faults based on vibration signatures, J. Vibr. Eng. Technol. 11 (8) (2023) 3759–3777.
[49] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: IJCAI, Vol. 14, No. 2, Montreal, Canada, 1995, pp. 1137–1145.
[50] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
[51] F. Aaboub, H. Chamlal, T. Ouaderhman, Statistical analysis of various splitting criteria for decision trees, J. Algorithms Comput. Technol. 17 (2023) 17483026231198181.
[52] F. Aaboub, H. Chamlal, T. Ouaderhman, Analysis of the prediction performance of decision tree-based algorithms, in: 2023 International Conference on Decision Aid Sciences and Applications, DASA, IEEE, 2023, pp. 7–11.
[53] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[54] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[55] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[56] Q. He, T. Shang, F. Zhuang, Z. Shi, Parallel extreme learning machine for regression based on MapReduce, Neurocomputing 102 (2013) 52–58.