
SSL-C4.5: Implementation of a Classification Algorithm for Semi-supervised Learning Based on C4.5

Agustín Alejandro Ortiz-Díaz, Flavio Roberto Bayer, and Fabiano Baldo

Graduate Program in Applied Computing, Santa Catarina State University, Joinville, Brazil
[email protected], [email protected], [email protected]

Abstract. Classification algorithms have been extensively studied in many of the major scientific investigations of recent decades. Many of these algorithms are designed for supervised learning, which requires labeled instances to achieve effective learning models. However, in many real human processes, data labeling is expensive and time-consuming. Because of this, alternative learning paradigms have been proposed to reduce the cost of the labeling process without a significant loss of model performance. This paper presents the Semi-Supervised Learning C4.5 algorithm (SSL-C4.5), designed to work in scenarios where only a small part of the data is labeled. SSL-C4.5 was implemented over the J48 implementation of the C4.5 algorithm available in the WEKA platform. J48 was modified by incorporating a metric for semi-supervised learning. This metric aims at inducing decision tree models able to analyze and extract information from the entire training dataset, including unlabeled instances in scenarios where they are the majority. The assessment performed using eight different benchmark datasets showed that the new proposal achieves promising results compared to the supervised version of C4.5.

Keywords: Classification algorithms · Semi-supervised learning · Decision tree

1 Introduction
Data classification tasks have become increasingly difficult due to the complexity
and volume of datasets generated in real scenarios. Among these complexities,
we can mention a large amount of unlabeled data, the presence of data with
multiple labels, the processing of data streams, among others [2]. Additionally,
in real-world scenarios, a large amount of data is generated without a specialist to label it beforehand. These datasets with unlabeled instances make many conventional algorithms unsuitable for classification. This is because several inducing algorithms are not designed to extract all possible information from the data provided [4].
In this scenario, labeling the data can be an extremely expensive task [3]. For this reason, it is not practical to label beforehand all instances of the data provided to the algorithm, which is usually necessary in the case of supervised learning [6]. Therefore, semi-supervised learning (SSL) methods have among their objectives facing the challenge of inducing suitable predictive models that consider unlabeled instances [1]. SSL techniques allow algorithms to combine labeled and unlabeled data in order to improve the induced models. Thus, the aim of the SSL approach is to increase the capacity of the inducing algorithms by extracting significant information from the entire training dataset.
On the other hand, according to [5], decision tree models have desirable properties that should be exploited in SSL. Among these properties, we can highlight the following: they are non-parametric, efficient, easily interpretable, and suitable for prediction in many domains.
In this work, we propose a method for semi-supervised learning based on classification trees. According to the categorization in [13], our proposal is classified as an inductive learning approach, since it uses all the information, both labeled and unlabeled instances, to build the models. We selected a well-performing supervised learning algorithm that induces decision trees and introduced into it a semi-supervised metric for creating models under the semi-supervised paradigm.
The supervised learning algorithm selected was C4.5 [10]. Two of the main benefits of the C4.5 algorithm are its relatively quick induction and the generally high accuracy values it obtains [14]. In addition, we modified it to add the impurity metric proposed by Levatić et al. [5]. This metric improves the division of the instances in the tree nodes to enable the algorithm to deal with unlabeled data. We took the J48 algorithm as the basis for implementing the adaptations on C4.5 that turn it into the proposed SSL-C4.5. J48 is an implementation of C4.5 in Java available in the WEKA platform [16].
The rest of the paper is organized as follows: Sect. 2 references the most relevant related works. Section 3 describes the main characteristics of the well-known C4.5 algorithm, including a subsection detailing the impurity metrics used. Section 4 presents in detail the proposed SSL-C4.5 algorithm. Section 5 shows the experiments and assessment results. Finally, Sect. 6 highlights the main conclusions and future work.

2 Related Works
There are different techniques applied to semi-supervised learning, among them Expectation Maximization (EM), Self-training, and Co-training [13]. Some works related to these techniques and to the present work are described below.
An interesting idea related to the self-training technique is proposed in [11]. This paper introduces a classification algorithm that combines the characteristics of the well-known methods Naive Bayes and C4.5. The authors propose a novel approach to increase the power of semi-supervised methods using a technique called cascade classification, whose main feature is the use of two levels of classification. At the first level, a base classifier is used to augment the feature space by adding the class it predicts to the class probability distribution of the initial data. At the second level, another classifier is incorporated, which extracts the decision for each analyzed instance from the new dataset. The new proposal was compared to other semi-supervised classification methods using known benchmark datasets, and the authors concluded that the new method has greater precision according to the reported results.
Another proposal for a semi-supervised classifier, also related to the self-training technique, is presented in [12]. The authors start from the premise that standard decision trees, used as base learners, cannot be effective in a self-training algorithm for the semi-supervised learning paradigm. However, through their experiments, the authors showed that improving the probability estimates of the classifiers leads to a better selection metric for the algorithm and produces a better self-training classification model. Furthermore, they concluded that combining the Laplacian correction with the absence of pruning on an NBTree produced better probability estimates in the tree classifiers. They also showed that the Mahalanobis distance method for sampling is effective in selecting a set of high-confidence predictions in decision trees.
A different idea, which is the basis of our work, is the Semi-Supervised Learning Predictive Clustering Trees (SSL-PCT) algorithm described in [5]. This is an adaptation of the Predictive Clustering Trees (PCT) algorithm that allows the use not only of labeled instances but also of unlabeled instances during model training. The PCT algorithm works under the assumption that data labels can be successively separated using clustering techniques, allowing the creation of a tree analogous to the operation of the ID3 algorithm. However, PCT uses the Gini statistical dispersion measure to calculate the impurity metric [5].
In their work, Levatić et al. [5] argue that the grouping behavior of PCT can be applied not only to the class attribute of an instance but also to the rest of the dataset attributes. Therefore, the idea is to perform a separability analysis considering the values of the instance attributes, rather than assessing separability only by the class of the instances. By incorporating an unsupervised component into the impurity metric, it is possible to perform data separation even if the instance label is not previously known.
Levatić et al. [5] propose the impurity formula described by Eq. 1, where a parameter ω is introduced and two impurity metrics are used: one for supervised learning, Impurity_l, based on the Y label of the instances, and another for unsupervised learning, Impurity_u, based on all the attributes X_i of the attribute set X of the dataset.
Impurity_{SSL}(E) = \omega \cdot Impurity_l(E_l, Y) + \frac{1 - \omega}{|X|} \cdot \sum_{i=1}^{|X|} Impurity_u(E, X_i)    (1)

According to Eq. 1, when the value of ω is close to 1 the process tends to be more supervised, while when it is close to 0 the process tends to be more unsupervised. In their work, Levatić et al. [5] used the standardized Gini measure to calculate the impurity for discrete attributes.
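To make Eq. 1 concrete, the following is a minimal Python sketch of the weighted combination; the impurity_l and impurity_u arguments are hypothetical callbacks standing in for the supervised and unsupervised measures described later, and the data layout is an illustrative assumption, not the authors' implementation.

    def ssl_impurity(E, E_labeled, attributes, w, impurity_l, impurity_u):
        """Eq. 1: w weighs the supervised impurity on the labeled subset E_l
        against the average unsupervised impurity over all attributes X_i."""
        supervised = impurity_l(E_labeled)  # Impurity_l(E_l, Y)
        unsupervised = sum(impurity_u(E, x) for x in attributes)  # sum over X_i
        return w * supervised + (1 - w) / len(attributes) * unsupervised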

3 C4.5 Algorithm and Impurity Measures

One of the simplest decision tree induction algorithms for classification is ID3 [9]. The ID3 algorithm has great importance in the classification area, and its fundamental ideas are used as the basis for the construction of other algorithms that use the top-down approach. This algorithm is based on the expected information gain formula (Eq. 2). The idea is to reduce the impurity value of the nodes when going deeper into the tree, always trying to make each level of the tree have lower impurity than the previous one.

Gain(E, X_i) = Impurity(E) - \sum_{E_{X_i,j} \subset E} \frac{|E_{X_i,j}|}{|E|} \cdot Impurity(E_{X_i,j})    (2)

where E is a dataset that has a set X of attributes. The algorithm selects the best attribute X_i of X to divide the set E into subsets E_{X_i,j} in a way that reduces the impurity of the subsets. E_{X_i,j} represents the j-th subset of data into which the dataset E is divided after selecting X_i as the best attribute for the division. The ID3 algorithm has an important drawback: it is only capable of inducing decision trees for attributes with discrete values.
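For illustration, a minimal sketch of Eq. 2 for a discrete attribute might look as follows; representing instances as dictionaries and passing the impurity measure as a callback are assumptions made for the example, not part of the original ID3 implementation.

    def information_gain(rows, attribute, impurity):
        """Eq. 2: impurity of the parent set E minus the size-weighted
        impurity of the subsets E_{X_i,j} induced by splitting on X_i."""
        subsets = {}
        for row in rows:  # partition E by the value of the chosen attribute
            subsets.setdefault(row[attribute], []).append(row)
        n = len(rows)
        weighted = sum(len(s) / n * impurity(s) for s in subsets.values())
        return impurity(rows) - weighted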
The C4.5 algorithm [10] is an improvement of the ID3 algorithm, created by the same author, which uses the same operating principle. That is, for each leaf node of the tree, the C4.5 algorithm chooses the attribute that most effectively divides its instance set into subsets that tend to one category or another. The C4.5 algorithm has among its objectives solving the main limitation of the ID3 algorithm. For this reason, it supports the induction of trees with numerical attributes of continuous values, allowing such attributes to appear at multiple levels in the same branch of the tree being induced. C4.5 uses a disorder measure called entropy to calculate the impurity metric.

3.1 Impurity Measures

This subsection presents a short overview of the metrics used as the basis for the implementation of the proposed algorithm. It serves as a compilation of the relevant findings taken into account when deciding which metrics to use in this work.
The value of the impurity equation reflects how disorganized or how random the dataset is. A high impurity value is typically found in a random dataset, while a low impurity is typically seen in a well-organized dataset. The impurity measure is a metric that abstracts and quantifies the organization of the data into a numerical value. Therefore, various impurity measure formulas can be used according to the characteristics of the data. Some of the most commonly used impurity measures found in the literature are presented below.

Entropy. The entropy measure is calculated by Eq. 3. This equation provides an entropy coefficient between 0 and 1 based on the probability p that an attribute takes a nominal class present in the dataset. Given p_i = |X_i|/|X|, where X is the set of classes or possible values for the nominal attribute of an instance, X_i is the subset of instances where the attribute has the i-th class of X, and finally n is given by n = |X| [8].
S(p) = - \sum_{i=1}^{n} p_i \cdot \log_2(p_i)    (3)
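A direct Python transcription of Eq. 3 might read as follows; taking per-class instance counts as input is an assumption made for the example.

    import math

    def entropy(class_counts):
        """Eq. 3: entropy from per-class instance counts; empty classes are
        skipped since p * log2(p) tends to 0 as p tends to 0."""
        n = sum(class_counts)
        return -sum((c / n) * math.log2(c / n) for c in class_counts if c > 0)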

Gini. The Gini metric is represented by Eq. 4, which generates a continuous value between 0 and 1. The calculated value is based on the probability p that an attribute takes a nominal class present in the dataset. Given p_i = |X_i|/|X|, where X is the set of classes or possible values for the nominal attribute of an instance, X_i is the subset of instances where the attribute has the i-th class of X, and finally the value of n is given by n = |X| [15].
G(p) = 1 - \sum_{i=1}^{n} p_i^2    (4)
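Analogously, a minimal sketch of Eq. 4 under the same per-class count representation:

    def gini(class_counts):
        """Eq. 4: Gini index from per-class instance counts."""
        n = sum(class_counts)
        return 1 - sum((c / n) ** 2 for c in class_counts)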

Variance. To calculate the impurity for a regression model, or when evaluating numerical attributes, the standard deviation is commonly used to estimate the dispersion of the data in a set x, where x_i represents a data point in x. To calculate the variance incrementally, it is possible to use Eq. 5, so that the sum of the x_i, the sum of the x_i^2, and n can be easily maintained incrementally. Thus, the method gains efficiency by reducing the runtime.
V(x) = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^2    (5)
It is important to keep in mind that the variance of numerical values generates a non-negative number, but it is not a normalized measure. Taking this issue into account, the variance value of a subset of data is often divided by the variance value of the complete dataset, generating a normalized coefficient between 0 and 1 [5].
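A sketch of the incremental computation behind Eq. 5 might look as follows; note that this textbook formulation, while O(1) per update as the text describes, is known to be numerically sensitive when the mean is large relative to the variance.

    class RunningVariance:
        """Eq. 5: only n, sum(x_i) and sum(x_i^2) are stored, so adding a
        value and querying the variance are both constant-time operations."""

        def __init__(self):
            self.n = 0
            self.sum_x = 0.0
            self.sum_x2 = 0.0

        def add(self, x):
            self.n += 1
            self.sum_x += x
            self.sum_x2 += x * x

        def variance(self):
            mean = self.sum_x / self.n
            return self.sum_x2 / self.n - mean * mean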
4 SSL-C4.5 Algorithm
Many of the supervised classification algorithms used to induce decision trees focus only on separating the data based on their respective classes. That is, although the other attributes are used, for example as comparisons to determine subtrees, these attributes are not considered in the metric that determines the quality of a possible division of a tree node.
On the other hand, in unsupervised clustering approaches, all instance attributes can be used to subdivide the search space by considering each of them as a dimension. According to [1], for some datasets it can be assumed that separating the data based on any attribute can be as good as separating them based on the class attribute of the instances. Therefore, even if a significant number of instances within a dataset are not previously labeled, it is still possible to use the information from their other attributes to group them, so that this information is also exploited during the processing of the data.
Taking into account these two approaches, which belong to different learning paradigms but share the same objective, to induce a model that represents the dataset, this paper proposes a model capable of joining the beneficial characteristics of both to obtain a more promising final result. The SSL-C4.5 algorithm is developed over the J48 algorithm, the WEKA [16] implementation of the C4.5 algorithm [10]. In this new implementation, changes have been made to the impurity metric in order to adapt the original C4.5 to support semi-supervised learning.
According to Levatić et al. [5], an algorithm that builds a top-down decision
tree is based on the use of a heuristic capable of finding the best attribute-value
pair in a dataset to reduce the value of the impurity of new partitioned datasets.
Different algorithms can use different metrics to achieve this effect and maximize
the quality of the result as detailed in Sect. 3.1.
Algorithm 1 shows the pseudocode of the SSL-C4.5 algorithm. In line 2, the
empty list called Analysis pending nodes list is initialized. In this list, all the
nodes of the tree that will be analyzed will be stored. The root node N is added
as the first to be analyzed (line 8). Then, inside the while loop (line 9), all nodes
obtained as a result of dividing previous nodes (line 14) will be added.
In general, Algorithm 1 is a variant of a classic algorithm for inducing top-down decision trees. However, a notable difference is present in the function called Attribute selection method (line 11). The main objective of this function is to decide whether a node of the tree is going to be divided or not. If this function decides not to divide the node, then the node is labeled with the dominant class in the set of instances associated with it. However, if this function decides to divide the node, then the attribute that offers the most efficient division is returned. To make this decision, the Attribute selection method function uses the information gain formula (Eq. 2). It is necessary to emphasize that this formula has been adapted to the semi-supervised paradigm.
Algorithm SSL-C4.5:

Input:
  S: set of labeled and unlabeled examples.
  ω ∈ [0, 1]: weighting parameter to balance the supervised and unsupervised paradigms.
  D: attribute list.
  Attribute selection method: uses the SSL-C4.5 impurity measure, which calculates the similarity between the examples on the basis of both the class labels and the descriptive attributes.

Result: SSL-C4.5-Tree: a decision tree for semi-supervised learning induced according to the training dataset S.

 1  begin
 2    Analysis pending nodes list ← Ø
 3    Create SSL-C4.5-Tree, a tree with a single node N (the root)
 4    if the tuples in S are all labeled with the same class C then
 5      Mark N as a leaf node labeled with the class C
 6      return SSL-C4.5-Tree
 7    else
 8      Analysis pending nodes list ← Analysis pending nodes list ∪ {N}
 9      while Analysis pending nodes list ≠ Ø do
10        Extract nextNode from Analysis pending nodes list
11        Attribute selection method(nextNode, D)  // finds the best splitting criterion
12        if splitting criterion = true then
13          newNodes ← Best splitting criterion(nextNode)
14          Analysis pending nodes list ← Analysis pending nodes list ∪ newNodes
15        else
16          Mark nextNode as a leaf node labeled with the majority class
17    return SSL-C4.5-Tree

Algorithm 1: Semi-Supervised Learning C4.5 Algorithm (SSL-C4.5)

Based on the results of a large set of tests, we decided to use the standardized entropy impurity measure for the supervised learning part of the model. The entropy measure has been tested by several investigations [8] due to its robustness, simplicity, and flexibility. In addition, this measure is compatible with the base supervised implementation, the J48 algorithm. The calculation of the normalized entropy is represented by Eq. 6, where E_l represents the set of labeled data analyzed in a given tree node, E_l^{full} represents the set of all the labeled data used in the construction of the tree, and S is the entropy impurity heuristic (Eq. 3).

Impurity_l(E_l, Y) = \frac{S(E_l, Y)}{S(E_l^{full}, Y)}    (6)

Similarly, we chose to use the standardized Gini impurity metric and the
standardized variance impurity for the unsupervised learning part of the model.
These metrics have been tested by several investigations [15] due to their ease of understanding and implementation, as well as the good results they offer. The standardized calculation of the unsupervised impurities is represented by Eqs. 7 and 8, where E represents the set of all the data analyzed in a given tree node (whether labeled or not), E^{full} represents the set of all the data used to build the tree (whether labeled or not), G is the Gini impurity heuristic, V is the variance impurity heuristic, and X_i is the attribute being evaluated.
Impurity_u(E, X_i) = \frac{G(E, X_i)}{G(E^{full}, X_i)}, if X_i is nominal    (7)

Impurity_u(E, X_i) = \frac{V(E, X_i)}{V(E^{full}, X_i)}, if X_i is numerical    (8)
These two impurity measures are incorporated into Eq. 1 of the semi-supervised learning approach proposed by Levatić et al. [5]. This creates a single impurity measure, which was incorporated into the J48 algorithm, replacing its original metric.
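A hedged sketch of how Eqs. 6-8 could be computed is shown below; entropy and gini refer to the helper functions sketched in Sect. 3.1 above, and the argument layout plus the inline variance helper are assumptions made for illustration, not the actual J48 modification.

    from collections import Counter

    def impurity_l(node_labels, full_labels):
        """Eq. 6: entropy of the labeled data at the node, normalized by
        the entropy of the full labeled training set."""
        return entropy(list(Counter(node_labels).values())) / \
               entropy(list(Counter(full_labels).values()))

    def impurity_u(node_values, full_values, nominal):
        """Eqs. 7-8: normalized Gini for a nominal attribute X_i,
        normalized variance for a numerical one."""
        def var(xs):
            return sum(x * x for x in xs) / len(xs) - (sum(xs) / len(xs)) ** 2
        if nominal:
            return gini(list(Counter(node_values).values())) / \
                   gini(list(Counter(full_values).values()))
        return var(node_values) / var(full_values)

These two callbacks are exactly what the ssl_impurity sketch of Sect. 2 combines through the ω weighting of Eq. 1.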
Finally, the Best splitting criterion function (line 13) receives the last node analyzed (nextNode), the one that was selected to be divided. The objective of this function is to divide this node using the attribute that guarantees the most efficient division, which was already selected before (line 11). The function returns a list of new nodes (newNodes), which are added to the Analysis pending nodes list.

5 Results Assessment
In this section, we evaluate the predictive performance of the SSL-C4.5 algorithm using different datasets. The datasets were mainly obtained from the UCI repository [7]. We compared our proposed algorithm (SSL-C4.5) with the SSL-PCT algorithm proposed by Levatić et al. [5], considering two variants of each algorithm, and with the WEKA implementation of the C4.5 algorithm (named J48). All the experiments were performed on the WEKA [16] framework for data mining. WEKA provides a collection of evaluation tools and a great variety of well-known algorithms.

5.1 Methodology
We used eight datasets with different features for the experiments to ensure high
diversity in the experimentation scenarios. Table 1 shows the features regard-
ing the selected datasets, which are: number of instances (Instances); number
of discrete attributes (Nominal); number of continuous attributes (Numeric);
and number of classes (Classes). The algorithms and variants of the algorithms
involved in the experiments are described below.

– C4.5 Algorithm: Supervised algorithm C4.5 [10] implemented in WEKA, named J48.
– SSL-C4.5-W1.0 Algorithm: Semi-supervised implementation of SSL-C4.5 over J48 with parameter ω = 1, equivalent to the C4.5 algorithm. Tested to ensure that the semi-supervised version can be equivalent to the supervised version.
– SSL-C4.5-WA: Semi-supervised implementation of SSL-C4.5 over J48 with automatic adaptation of the ω parameter.
– SSL-PCT-W1.0: Semi-supervised PCT algorithm with parameter ω = 1, equivalent to the PCT algorithm [5].
– SSL-PCT-WA: Semi-supervised PCT algorithm with automatic adaptation of the ω parameter.

Table 1. General characteristics of the used datasets

Dataset         Instances  Nominal  Numeric  Classes  Missing values
Abalone         4177       0        8        3        No
Adult           32561      9        5        2        Yes
Bank            45211      10       6        2        Yes
Banknote        1372       0        4        2        No
Biodegradation  1055       0        40       2        No
Eyestate        14980      0        14       2        No
Madelon         2000       0        500      2        No
Mushroom        8124       22       0        2        No

During the experiments, the algorithms were set with their default configuration options. The "CollapseTree" and "SubtreeRaising" options were disabled in the C4.5 algorithm (J48 implementation), since these features were not implemented in the SSL-C4.5 algorithm. The confidence factor was set to 0.25 in the C4.5 and SSL-C4.5 algorithms, with a minimum information gain of 0.
All algorithms were evaluated using the 10-fold cross-validation technique.
For each experiment, the part of the dataset selected to train the models was
fully used. That is, the supervised models were trained with all the training
data labeled. However, to assess the learning capability of the semi-supervised
algorithms, the class label of part of the instances from the training set was
removed, while the rest of the attributes remained intact. Therefore, for the
semi-supervised learning, only a certain number of instances were labeled. On the
other hand, during the test phase, the class of all instances (including instances
whose classes had been removed) is compared with the class predicted by the
model, in order to check if the prediction was correct.
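As an illustration of this evaluation protocol, a minimal sketch of the label-removal step might look as follows; the list-of-dicts dataset representation, the function name, and the None marker for unlabeled instances are assumptions made for the example.

    import random

    def mask_labels(train_rows, n_labeled, seed=0):
        """Keep the class label on exactly n_labeled randomly chosen
        training instances and remove it from the rest; all other
        attributes remain intact."""
        rng = random.Random(seed)
        rows = [dict(r) for r in train_rows]  # copy; originals untouched
        keep = set(rng.sample(range(len(rows)), n_labeled))
        for i, row in enumerate(rows):
            if i not in keep:
                row["class"] = None  # marker for an unlabeled instance
        return rows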
For each dataset described in Table 1, the test datasets were created by varying the amount of labeled instances. That is, each original dataset generated six derived datasets with exactly the same instances but leaving only a small number of them with their labels. The following quantities of labeled instances were used by default: 25, 50, 100, 200, 350, and 500. The remaining instances were left unlabeled for training in each scenario. The instances that kept their class labels were randomly selected. The experiments were repeated ten times to improve the reliability of the results.
During the tests of the semi-supervised algorithms, it was necessary to define the best value of ω for the automatic adaptation variants. The value of ω was established by performing training/test runs on the same datasets, ranging it from 0.0 (fully unsupervised) to 1.0 (fully supervised) in steps of 0.1. Therefore, 11 values were tested and the value that generated the best model was selected. This guarantees that the method never generates a worse model than its supervised version (ω = 1.0).
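This selection procedure amounts to a small grid search; a sketch under the assumption of a hypothetical train_eval(w) callback that trains a model with weight w and returns its evaluation accuracy:

    def select_best_w(train_eval):
        """Try w in {0.0, 0.1, ..., 1.0} and keep the best-scoring value.
        Since w = 1.0 (fully supervised) is among the candidates, the
        result is never worse than the supervised version."""
        candidates = [round(i / 10, 1) for i in range(11)]
        return max(candidates, key=train_eval)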

5.2 Analysis of the Results

All algorithms were evaluated taking into account two fundamental criteria: accuracy and runtime. Figure 1 shows the accuracy values of the execution of all the algorithms on each dataset. Analyzing the eight plots, it is possible to observe the behavior of all the algorithms studied during the experiments, including the new proposal.
With a quick analysis, we can see that the C4.5 algorithm and the SSL-C4.5-W1.0 algorithm exhibit identical behaviors. In this way, we can confirm that the semi-supervised version implemented (our proposal) is an extension of the original C4.5 algorithm (J48 implementation), whose additional functionality becomes inert when ω = 1.0. In addition, we can see that the algorithm versions with the automatically adjusted parameter ω (SSL-C4.5-WA and SSL-PCT-WA) never produce accuracies lower than their supervised versions (SSL-C4.5-W1.0 and SSL-PCT-W1.0). In other words, the semi-supervised models always match or improve on the supervised models.
By comparing the accuracy values of the C4.5 and PCT (SSL-PCT-W1.0) algorithms, it is not possible to affirm that one algorithm is statistically superior to the other. We can observe that for some datasets, such as Abalone, Banknote, Eyestate, and Mushroom, the accuracy values show alternating behavior. Nevertheless, on other datasets, such as Adult, Bank, and Biodegradation, the C4.5 algorithm has superior results. According to our empirical analysis, the C4.5 algorithm is capable of producing more promising results.
On the other hand, we can also observe that for all the algorithms there
were improvements in the accuracy values when the semi-supervised component
was added to the metrics. That is, when we compare the two variants of the
same algorithm, in general, the semi-supervised variant obtained better results
(SSL-C4.5-WA compared to SSL-C4.5-W1.0, and SSL-PCT-WA compared to
SSL-PCT-W1.0).
Finally, we could observe that between the two algorithms in their semi-supervised variants (SSL-C4.5-WA compared to SSL-PCT-WA), the new proposal obtained more promising results, although a statistically significant difference could not be established. When evaluating some datasets, such as Banknote, Biodegradation, and Mushroom, it is not possible to notice a marked difference in the accuracy values. However, for other datasets, such as Abalone, Adult, Bank, and Eyestate, the new proposal shows a significant improvement in accuracy values. It is also worth highlighting that, when evaluating the Madelon dataset, the opposite result occurred.

Fig. 1. Results of the accuracy of the algorithms on the eight datasets. (a) Abalone dataset, (b) Adult dataset, (c) Bank dataset, (d) Banknote dataset, (e) Biodegradation dataset, (f) Eyestate dataset, (g) Madelon dataset, (h) Mushroom dataset.

In general, it can be said that the SSL-C4.5-WA algorithm offers promising results compared to the other algorithms and variants; the accuracy values of the new model are higher in most experiments.

Although runtime values have not been included in this paper, it is necessary to highlight the following fact. In terms of runtime, semi-supervised algorithms reach higher values than supervised algorithms when there is a significant amount of unlabeled data. This is mainly because supervised algorithms ignore these instances, while semi-supervised algorithms still need to process them.

6 Conclusion
In the present work, the main characteristics of the algorithm called SSL-C4.5 were described. This new proposal is an adaptation of the well-known C4.5 algorithm to the semi-supervised learning paradigm. The SSL-C4.5 algorithm is able to work in scenarios where few instances are labeled because it is also able to extract significant information from the attributes that are not classes.
Through an experimental study, we tested its operation in different data scenarios in comparison with other known algorithms. As a result, it was shown that the use of semi-supervised metrics in classification algorithms can contribute to the induction of more reliable models when datasets have few labeled instances. Although no statistically significant differences were found compared to the rest of the algorithms, the SSL-C4.5 algorithm achieved promising empirical results in terms of accuracy values. The new proposal obtained better results on most of the datasets.
As future work, we propose to carry out a more complete study of the algorithm to achieve possible structural improvements. In addition, we propose to perform tests with more diverse databases to make the experimental study more robust.

Acknowledgment. This work was partially funded by the Coordination of Improvement of Higher Level Personnel (CAPES) and the Foundation of Support for Research and Innovation of Santa Catarina State (FAPESC).

References
1. Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised Learning, vol. 2. MIT Press, Cambridge (2006)
2. Farid, D., et al.: An adaptive ensemble classifier for mining concept drifting data
streams. Expert Syst. Appl. 15, 5895–5906 (2013)
3. Haque, A., Khan, L., Baron, M.: Sand: semi-supervised adaptive novel class detec-
tion and classification over data stream. In: AAAI, pp. 1652–1658 (2016)
4. Krawczyk, B., Minku, L., Gama, J., Stefanowski, J., Woźniak, M.: Ensemble learning for data stream analysis. Inf. Fusion 37, 132–156 (2017)
5. Levatić, J., Ceci, M., Kocev, D., Džeroski, S.: Semi-supervised classification trees. J. Intell. Inf. Syst. 49, 461–486 (2017)
6. Li, P., Wu, X., Hu, X.: Mining recurring concept drifts with limited labeled streaming data. In: Proceedings of the 2nd Asian Conference on Machine Learning, vol. 13, pp. 241–252 (2016)
7. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
8. Ortiz-Díaz, A., Baldo, F., Palomino-Mariño, L., Bayer, F., Verdecia-Cabrera, A., Frías-Blanco, I.: Fast adaptive stacking of ensembles adaptation for supporting active learning. A real case application. In: 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) (2018)
9. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
10. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Francisco (1993)
11. Karlos, S., Fazakis, N., Kotsiantis, S., Sgarbas, K.: A semisupervised cascade classification algorithm. Appl. Comput. Intell. Soft Comput. 2016, Article ID 5919717 (2016)
12. Tanha, J., van Someren, M., Afsarmanesh, H.: Semi-supervised self-training for decision tree classifiers. Int. J. Mach. Learn. Cybern. 8(1), 355–370 (2015). https://doi.org/10.1007/s13042-015-0328-7
13. Triguero, I., García, S., Herrera, F.: Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl. Inf. Syst. 42, 245–284 (2013). https://doi.org/10.1007/s10115-013-0706-y
14. Wagh, S., Khati, A., Irani, A., Inamdar, N., Soni, R.: Effective framework of j48
algorithm using semi-supervised approach for intrusion detection. Int. J. Comput.
Appl. 94(12), 23–27 (2014)
15. Wei, Z., Jia, K., Sun, Z.: An automatic detection method for morse signal based on
machine learning. In: IIH-MSP-2017: Advances in Intelligent Information Hiding
and Multimedia Signal Processing, pp. 185–191 (2017)
16. Witten, I.H., Frank, E., Trigg, L., Hall, M., Holmes, G., Cunningham, S.J.: Weka: practical machine learning tools and techniques with Java implementations. In: Proceedings of the ICONIP/ANZIIS/ANNES 1999 Workshop on Emerging Knowledge Engineering and Connectionist-Based Information Systems, pp. 192–196 (1999)
