SSL - C4.5 Rules
5: Implementation of a Classification Algorithm for Semi-supervised Learning Based on C4.5
1 Introduction
Data classification tasks have become increasingly difficult due to the complexity
and volume of the datasets generated in real scenarios. Among these complexities,
we can mention the large amount of unlabeled data, the presence of data with
multiple labels, and the processing of data streams, among others [2].
Supported by the CAPES and FAPESC organizations.
c Springer Nature Switzerland AG 2020
R. Cerri and R. C. Prati (Eds.): BRACIS 2020, LNAI 12320, pp. 513–525, 2020.
https://doi.org/10.1007/978-3-030-61380-8_35
514 A. A. Ortiz-Dı́az et al.
2 Related Works
There are different techniques applied to semi-supervised learning, among them
Expectation Maximization (EM), Self-training, and Co-training [13]. Some papers
related to these techniques and to the present work are described below.
Impurity_SSL(E) = ω · Impurity_l(E_l, Y) + ((1 − ω) / |X|) · Σ_{i=1}^{|X|} Impurity_u(E, X_i)   (1)
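As a concrete illustration of Eq. 1, the weighted combination of the supervised and unsupervised impurity terms can be sketched as follows (the function name and the list-based interface are our own, not part of the original implementation):

```python
from typing import List

def impurity_ssl(omega: float,
                 impurity_labeled: float,
                 attribute_impurities: List[float]) -> float:
    """Semi-supervised impurity (Eq. 1): a convex combination of the
    supervised impurity and the mean unsupervised impurity over the
    |X| attributes, weighted by omega."""
    unsupervised = sum(attribute_impurities) / len(attribute_impurities)
    return omega * impurity_labeled + (1.0 - omega) * unsupervised
```

With ω = 1.0 the unsupervised term vanishes and the measure reduces to the fully supervised case.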
One of the simplest decision tree induction algorithms for classification is
ID3 [9]. The ID3 algorithm has great importance in the classification area, and
its fundamental ideas serve as the basis for the construction of other algorithms
that use the top-down approach. This algorithm is based on the expected
information gain formula (Eq. 2). The idea is to reduce the impurity value of the
nodes while descending the tree, always trying to make each level of the tree
less impure than the previous one.
Gain(E, X_i) = Impurity(E) − Σ_{E_{X_i,j} ⊂ E} (|E_{X_i,j}| / |E|) · Impurity(E_{X_i,j})   (2)
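A minimal sketch of the expected information gain of Eq. 2, using entropy as the impurity measure (the function names are illustrative; ID3 and C4.5 implement this with additional refinements such as the gain ratio):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity of a set of class labels (Eq. 3)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Expected information gain (Eq. 2): the parent impurity minus the
    size-weighted impurity of the partitions induced by the attribute."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        subset = [y for y, a in zip(labels, attribute_values) if a == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain
```

An attribute that separates the classes perfectly yields a gain equal to the parent entropy; an uninformative attribute yields a gain of zero.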
This subsection presents a short overview of the metrics used as the basis for
the implementation of the proposed algorithm. It serves as a compilation of the
relevant findings taken into account when deciding which metrics to use in this
work.
The value of the impurity equation reflects how disorganized or how random
the dataset is. A high impurity value is typically found in a random dataset.
Entropy. The entropy measure is calculated by Eq. 3. This equation provides
an entropy coefficient between 0 and 1 based on the probability p that an
attribute takes a given nominal class present in the dataset. Given p_i = |X_i|/|X|,
where X is the set of classes or possible values for the nominal attribute of an
instance, X_i is the subset of instances where the attribute has the i-th class of
X, and finally n is given by n = |X| [8].
S(p) = − Σ_{i=1}^{n} p_i · log2(p_i)   (3)
Gini. The Gini metric is represented by Eq. 4, which generates a continuous
value between 0 and 1. The calculated value is based on the probability p that
an attribute takes a given nominal class present in the dataset. Given p_i = |X_i|/|X|,
where X is the set of classes or possible values for the nominal attribute of an
instance, X_i is the subset of instances where the attribute has the i-th class of
X, and finally the value of n is given by n = |X| [15].
G(p) = 1 − Σ_{i=1}^{n} p_i^2   (4)
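The Gini metric of Eq. 4 can be computed directly from the class counts; a minimal sketch (the function name is ours):

```python
from collections import Counter

def gini(labels):
    """Gini impurity (Eq. 4): 1 minus the sum of squared class
    proportions; 0 for a pure set, approaching 1 for a uniform one."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())
```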
4 SSL-C4.5 Algorithm
Many of the supervised classification algorithms used to induce decision trees
focus only on separating the data based on their respective classes. That is,
although the other attributes are used, for example, as a comparison to
determine subtrees, these attributes are not considered in the metric that
determines the quality of a possible division of a tree node.
On the other hand, in unsupervised clustering approaches, all instance
attributes can be used to subdivide the search space by considering each of
them as a dimension. According to [1], for some datasets it can be assumed that
separating the data based on any attribute can be as good as separating them
based on the class attribute of the instances. Therefore, even if a significant
number of instances within a dataset are not previously labeled, it is still possible
to use the information from their other attributes to group them, so that this
information is also exploited during the processing of the data.
Taking these two approaches from different learning paradigms into account,
but with the same objective, inducing a model that represents the dataset,
this paper proposes a model capable of joining the beneficial characteristics of
both to obtain a more promising final result. The SSL-C4.5 algorithm is
developed on top of the J48 algorithm, the Weka [16] implementation of the
C4.5 algorithm [10]. In this new implementation, the impurity metric was
changed to adapt the original C4.5 to semi-supervised learning.
According to Levatić et al. [5], an algorithm that builds a top-down decision
tree is based on the use of a heuristic capable of finding the best attribute-value
pair in a dataset to reduce the value of the impurity of new partitioned datasets.
Different algorithms can use different metrics to achieve this effect and maximize
the quality of the result as detailed in Sect. 3.1.
Algorithm 1 shows the pseudocode of the SSL-C4.5 algorithm. In line 2, the
empty list called Analysis_pending_nodes_list is initialized. This list stores all
the tree nodes that remain to be analyzed. The root node N is added as the
first node to be analyzed (line 8). Then, inside the while loop (line 9), all nodes
obtained by dividing previous nodes (line 14) are added.
In general, Algorithm 1 is a variant of the classic algorithms for inducing
top-down decision trees. However, a notable difference is present in the function
called Attribute_selection_method (line 11). The main objective of this function
is to decide whether or not a tree node is going to be divided. If the function
decides not to divide the node, the node is labeled with the dominant class in
the set of instances associated with it. If the function decides to divide the node,
the attribute that offers the most efficient division is returned. To make this
decision, the Attribute_selection_method function uses the information gain
formula (Eq. 2). It is necessary to emphasize that this formula has been adapted
to the semi-supervised paradigm.
Algorithm SSL-C4.5:
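Based on the prose description of Algorithm 1, its overall control flow can be sketched as follows. This is a simplified reconstruction, not the authors' listing: the callbacks stand in for Attribute_selection_method and Best_splitting_criterion, and the node representation is our own.

```python
from collections import Counter

def induce_tree(instances, attributes, select_attribute, split):
    """Top-down induction with a pending-node list: each node is either
    turned into a leaf labeled with its dominant class, or divided on
    the attribute chosen by `select_attribute`."""
    root = {"instances": instances, "children": {}}
    pending = [root]                      # Analysis_pending_nodes_list
    while pending:                        # the while loop of line 9
        node = pending.pop()
        attr = select_attribute(node["instances"], attributes)
        if attr is None:                  # do not divide: make a leaf
            labeled = [y for _, y in node["instances"] if y is not None]
            node["label"] = Counter(labeled).most_common(1)[0][0]
        else:                             # divide on the chosen attribute
            for value, subset in split(node["instances"], attr).items():
                child = {"instances": subset, "children": {}}
                node["children"][value] = child
                pending.append(child)     # analyze the new nodes later
    return root
```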
Based on the results of a large set of tests, we decided to use the standardized
entropy impurity measure for the supervised learning part of the model. The
entropy measure has been validated by several investigations [8] due to its
robustness, simplicity, and flexibility. In addition, this measure is compatible
with the base supervised implementation, the J48 algorithm. The calculation of
the normalized entropy is represented by Eq. 6, where E_l represents the set of
labeled data analyzed in a given tree node, E_l_full represents the set of all the
labeled data used in the construction of the tree, and S is the entropy impurity
heuristic (h).

Impurity_l(E_l, Y) = S(E_l, Y) / S(E_l_full, Y)   (6)
Similarly, we chose the standardized Gini impurity metric and the standardized
variance impurity for the unsupervised learning part of the model. These metrics
have been adopted by several investigations [15] due to their easy understanding
and implementation, as well as the good results they offer. The standardized
unsupervised impurity calculation is represented by Eqs. 7 and 8, where E
represents the set of all the data analyzed in a given tree node (whether labeled
or not), E_full represents the set of all the data used to build the tree (whether
labeled or not), G is the Gini impurity heuristic, V is the variance impurity
heuristic, and X_i is the attribute being evaluated.
Impurity_u(E, X_i) = G(E, X_i) / G(E_full, X_i),  if X_i is nominal   (7)

Impurity_u(E, X_i) = V(E, X_i) / V(E_full, X_i),  if X_i is numerical   (8)
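A sketch of the normalized unsupervised impurity of Eqs. 7 and 8 (helper names are ours; `nominal` selects between the Gini and variance heuristics):

```python
from collections import Counter

def gini(values):
    """Gini impurity over the observed values of a nominal attribute."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

def variance(values):
    """Population variance of a numerical attribute."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def impurity_u(node_values, full_values, nominal):
    """Normalized unsupervised impurity (Eqs. 7 and 8): the node-level
    impurity divided by the impurity over the full training set."""
    h = gini if nominal else variance
    denom = h(full_values)
    return h(node_values) / denom if denom else 0.0
```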
These two impurity measures are incorporated into the semi-supervised Eq. 1
proposed by Levatić et al. [5]. This creates a single impurity measure, which was
incorporated into the J48 algorithm in place of its original metric.
Finally, the Best_splitting_criterion function (line 13) receives the last node
analyzed (nextNode), the one selected for division. The objective of this function
is to divide this node (nextNode) using the attribute that guarantees the most
efficient split. This attribute has already been selected (line 11). The function
returns a list of new nodes (newNodes), which are added to the
Analysis_pending_nodes_list.
5 Results Assessment
In this section, we evaluate the predictive performance of the SSL-C4.5
algorithm on different datasets, mainly obtained from the UCI repository [7].
We compared our proposed algorithm (SSL-C4.5) with the SSL-PCT algorithm
proposed by Levatić et al. [5], with two variations of each algorithm, and with
the WEKA implementation of the C4.5 algorithm (named J48). All experiments
were performed using the WEKA [16] framework for data mining, which
provides a collection of evaluation tools and a great variety of well-known
algorithms.
5.1 Methodology
We used eight datasets with different features for the experiments to ensure high
diversity in the experimentation scenarios. Table 1 shows the features regard-
ing the selected datasets, which are: number of instances (Instances); number
of discrete attributes (Nominal); number of continuous attributes (Numeric);
and number of classes (Classes). The algorithms and variants of the algorithms
involved in the experiments are described below.
During the experiments, the algorithms were set with their default configu-
ration options. The “CollapseTree” and “SubtreeRaising” options were disabled
in the C4.5 algorithm (J48 implementation) since these features were not imple-
mented in the SSL-C4.5 algorithm. The confidence factor was set as 0.25 in the
C4.5 and SSL-C4.5 algorithms, with a minimum information gain of 0.
All algorithms were evaluated using the 10-fold cross-validation technique.
For each experiment, the part of the dataset selected to train the models was
fully used. That is, the supervised models were trained with all the training
data labeled. However, to assess the learning capability of the semi-supervised
algorithms, the class label of part of the instances from the training set was
removed, while the rest of the attributes remained intact. Therefore, for the
semi-supervised learning, only a certain number of instances were labeled. On the
other hand, during the test phase, the class of all instances (including instances
whose classes had been removed) is compared with the class predicted by the
model, in order to check if the prediction was correct.
For each dataset described in Table 1, the test datasets were created by varying
the amount of labeled instances. That is, each original dataset generated six
derived datasets with exactly the same instances, but leaving only a small
number of them labeled. The following quantities of labeled instances were
used by default: 25, 50, 100, 200, 350, and 500. The remaining instances were
left unlabeled for training in each scenario. The instances that kept their class
labels were randomly selected. The experiments were repeated ten times to
improve the reliability of the results.
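The construction of the derived training sets can be sketched as follows (a hypothetical helper; the instance representation and seeding are our own assumptions):

```python
import random

def remove_labels(instances, n_labeled, seed=0):
    """Keep class labels for only `n_labeled` randomly chosen instances,
    setting the rest to None while leaving all other attributes intact."""
    rng = random.Random(seed)
    keep = set(rng.sample(range(len(instances)), n_labeled))
    return [(x, y if i in keep else None)
            for i, (x, y) in enumerate(instances)]
```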
During the tests of the semi-supervised algorithms, it was necessary to define
the best value of ω in the automatic adaptation tests. The value of ω was
established by performing training/test runs, using the same test datasets,
ranging it from 0.0 (fully unsupervised) to 1.0 (fully supervised) in steps of 0.1.
Therefore, 11 values were tested and the value that generated the best model
was selected. This guarantees that the method never generates a worse model
than its supervised version (ω = 1.0).
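The selection of ω described above amounts to a grid search; a minimal sketch, where `evaluate` is assumed to train a model with the given ω and return its accuracy:

```python
def select_omega(evaluate):
    """Try omega from 0.0 (fully unsupervised) to 1.0 (fully supervised)
    in steps of 0.1 and keep the value with the best evaluation score.
    Since omega = 1.0 is among the candidates, the selected model is
    never worse than the fully supervised one on the validation data."""
    candidates = [i / 10 for i in range(11)]
    return max(candidates, key=evaluate)
```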
For some datasets, the algorithms obtained similar accuracy values. However,
for other datasets, such as Abalone, Adult, Bank, and Eyestate, the new
proposal shows a significant improvement in accuracy. It is also worth
highlighting that, when evaluating the Madelon dataset, the opposite result
occurred.
Fig. 1. Results of the accuracy of the algorithms on the eight datasets. (a) Abalone
dataset, (b) Adult dataset, (c) Bank dataset, (d) Banknote dataset, (e) Biodegradation
dataset, (f) Eyestate dataset, (g) Madelon dataset, (h) Mushroom dataset.
6 Conclusion
In the present work, the main characteristics of the algorithm called SSL-C4.5
were described. This new proposal is an adaptation of the well-known C4.5
algorithm to the semi-supervised learning paradigm. The SSL-C4.5 algorithm
is able to work in scenarios where few instances are labeled because it is also
able to extract significant information from the non-class attributes.
Through an experimental study, we tested its operation in different data
scenarios against other well-known algorithms. As a result, it was shown that
the use of semi-supervised metrics in classification algorithms can contribute
to the induction of more reliable models when datasets have few labeled
instances. Although no statistically significant differences were found compared
to the rest of the algorithms, the SSL-C4.5 algorithm achieved promising
empirical results in terms of accuracy. The new proposal obtained better results
on most of the datasets.
As future work, the authors propose to carry out a more complete study of the
algorithm to achieve possible structural improvements. In addition, we propose
to perform tests with more diverse datasets to make the experimental study
more robust.
References
1. Chapelle, O., Scholkopf, B., Zien, A.: Semi-supervised Learning, vol. 2. MIT Press,
Cambridge (2006)
2. Farid, D., et al.: An adaptive ensemble classifier for mining concept drifting data
streams. Expert Syst. Appl. 15, 5895–5906 (2013)
3. Haque, A., Khan, L., Baron, M.: Sand: semi-supervised adaptive novel class detec-
tion and classification over data stream. In: AAAI, pp. 1652–1658 (2016)
4. Krawczyk, B., Minku, L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble learn-
ing for data stream analysis. Inf. Fusion 37, 132–156 (2017)
5. Levatić, J., Ceci, M., Kocev, D., Džeroski, S.: Semi-supervised classification trees.
J. Intell. Inf. Syst. 49, 461–486 (2017)
6. Li, P., Wu, X., Hu, X.: Mining recurring concept drifts with limited labeled stream-
ing data. In: Sugiyama, M. (ed.) Proceedings of the 2nd Asian Conference on
Machine Learning, vol. 13, pp. 241–252 (2016)
7. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
8. Ortiz-Dı́az, A., Baldo, F., Palomino-Mariño, L., Bayer, F., Verdecia-Cabrera, A.,
Frı́as-Blanco, I.: Fast adaptive stacking of ensembles adaptation for supporting
active learning. A real case application. In: 14th International Conference on Nat-
ural Computation, Fuzzy Systems and Knowledge Discovery. ICNC-FSKD (2018)