4.1 Introduction
Tree induction learning models have received a great deal of attention over recent
years in the fields of machine learning and data mining because of their simplicity
and effectiveness. Among them, the Iterative Dichotomiser 3 (ID3) [1] algorithm for decision tree induction has proved to be an effective and popular algorithm for building decision trees from discrete-valued data sets. The C4.5 [2] algorithm was proposed as a successor to ID3, in which an entropy-based approach to crisp partitioning of continuous universes was adopted.
Decision tree induction is one of the simplest and yet most successful learning
algorithms. A decision tree (DT) consists of internal and external nodes and the
interconnections between nodes are called branches of the tree. An internal node
is a decision-making unit to decide which child nodes to visit next depending on
different possible values of associated variables. In contrast, an external node, also known as a leaf node, is the terminal node of a branch. It has no child nodes and is associated with a class label that describes the given data. A decision tree can thus be viewed as a set of rules in a tree structure: each branch can be interpreted as a decision rule associated
with nodes visited along this branch. For example, Fig. 4.2 is a decision tree which
is generated from the “play-tennis” problem [3] . The database for this problem is
shown in Fig. 4.1 .
Fig. 4.1 Database for the “play-tennis” problem [3]. Each instance has 4 attributes and one class label of either Yes or No; all attributes take discrete values
Decision trees classify instances by sorting instances down the tree from root to
leaf nodes. This tree-structured classifier partitions the input space of the data set
recursively into mutually exclusive subspaces. Following this structure, each training example is identified as belonging to a certain subspace, which is assigned a label, a value, or an action to characterize its data points. The decision tree mechanism
has good transparency in that we can follow a tree structure easily in order to
explain how a decision is made. Thus interpretability is enhanced when we clarify
the conditional rules characterizing the tree.
4.2 Tree Induction
Fig. 4.2 A decision tree built from the play-tennis problem [3]
4.2.1 Entropy
Entropy of a random variable is the average amount of information generated by observing its value. Consider the random experiment of tossing a coin with probability of heads equal to 0.9, i.e., a random variable x with P(heads) = 0.9 and P(tails) = 0.1. The average information obtained by observing x is

E(x) = ∑_x P(x)H(x) = ∑_x P(x) log(1/P(x)) = −∑_x P(x) log P(x)    (4.4)

Because computing and communication are fundamentally based on binary code, it is convenient to choose base 2 for the logarithm. The entropy can therefore be formally defined by

Entropy(x) = −∑_i p_i log2 p_i    (4.5)
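To make Eq. (4.5) concrete, the following short Python sketch (an illustration added here, not code from the book) computes the entropy of a discrete distribution; for the biased coin above it gives roughly 0.469 bits, compared with 1 bit for a fair coin.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution (Eq. (4.5))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.9, 0.1]))  # biased coin: ~0.469 bits
print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit, the maximum for two outcomes
```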
We have introduced DTs and how they make decisions. The harder problem is how to build a DT from training data. The most popular decision tree induction algorithm is called ID3 and was introduced by Quinlan in 1986 [1]. It has proved to be an effective and popular algorithm for building decision trees from discrete-valued data sets. The construction of the tree is guided heuristically according to the
information content of each attribute. In classification problems, we can also say
that entropy is a measurement of the impurity in a collection of training examples:
the larger the entropy is, the more random is the data. If the entropy equals 1, this
means that the data is distributed uniformly across the classes. Fig. 4.3 shows the
entropy function when the proportion of positive examples varies from 0 to 1 in a
binary classification problem.
Fig. 4.3 The entropy function plotted against the proportion of positive examples in a binary classification problem
The information gain of an attribute A relative to a collection of examples S is defined as

IG(S, A) = Entropy(S) − ∑_{v∈Values(A)} (|Sv|/|S|) Entropy(Sv)    (4.6)

where Sv is the subset of S for which attribute A takes the value v.
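As an added illustration (not code from the book), the sketch below evaluates Eq. (4.6) for a nominal attribute; the labels and attribute values are made up in the spirit of the play-tennis data rather than copied from Fig. 4.1.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (Eq. (4.5))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Information gain IG(S, A) of splitting `labels` on an attribute (Eq. (4.6))."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        subset = [c for c, a in zip(labels, attribute_values) if a == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Illustrative values only (not the actual play-tennis table).
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
wind = ['Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong']
print(information_gain(play, wind))  # 1.0: here Wind separates the classes perfectly
```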
Also, it is important to realize that the ID3 algorithm is not suitable for all
learning and classification problems. Typically, problems that have the following
main characteristics can be modeled as a decision tree.
(1) Instances are represented by attribute-value pairs.
(2) The target function has discrete values.
These two characteristics also act as restrictions. In dealing with the first, techniques of “learning from structured data” have been developed as a part of Inductive Logic Programming (ILP). Regarding the second, much real-world data, including scientific, engineering, medical and financial data, is continuous. In order to learn from continuous data, we need to partition the continuous universe using some type of discretization algorithm, as will be discussed in the following chapters.
Fig. 4.4 Comparison of crisp discretization and fuzzy discretization for decision tree models. The decision tree in (a) has crisp boundaries that divide the data into 3 non-overlapping areas A1, A2 and A3. In (b), x1, x2, y1 and y2 are defined by fuzzy membership functions, which give more robustness through blurred boundaries
Table 4.1 Important notations for the linguistic decision tree model

DB   a database of size |DB|: {x1, . . . , x|DB|}
xi   an n-dimensional instance: xi ∈ DB for i = 1, . . . , |DB|
Lj   a set of linguistic labels defined on attribute j, for j = 1, . . . , n
Fj   the focal set on attribute j given Lj, for j = 1, . . . , n;
     |Fj| = 2|Lj| − 1 if the labels Lk ∈ Lj (k = 1, . . . , |Lj|) have 50% overlap
C    a set of classes of size |C|: {C1, . . . , C|C|}
T    a linguistic decision tree containing |T| branches: {B1, . . . , B|T|}
B    a set of branches: B = {B1, . . . , BM}; T ≡ B iff M = |T|
B    a branch of an LDT with |B| focal elements: B = ⟨F1, . . . , F|B|⟩, where the
     focal elements Fi, Fj ∈ B are defined on different attributes
We use fuzzy sets with 50% overlap to discretize each continuous attribute universe and obtain a corresponding linguistic data set by applying linguistic translation (Definition 3.7). A linguistic decision tree is a decision tree in which the nodes are random set label descriptions and the branches correspond to particular focal elements based on DB.
Definition 4.1 (Linguistic decision tree) A linguistic decision tree is a set of branches with associated class probabilities, each branch taking the form

⟨B, P(C1|B), . . . , P(C|C| |B)⟩  with  B = ⟨F1, . . . , F|B|⟩

(cf. the listing of branches in Example 4.2).
Fig. 4.5 An example of a linguistic decision tree in a binary classification problem, where
each attribute is discretized by two fuzzy labels: small and large. The tree has 7 branches and
each leaf from LF1 to LF7 is labeled with a class distribution
Definition 4.2 (Free attributes) The set of attributes free to use for expanding a
given branch B is defined by
ATT_B = {xj | ∀F ∈ Fj : F ∉ B}
In an LDT, the length of the longest branch Dep(T ) is called the depth of the LDT,
which is also less than or equal to n:
Dep(T ) ≤ n (4.7)
Each branch has an associated probability distribution over the classes. For example, an LDT such as the one shown in Fig. 4.5 might be obtained from training, where the branch LF6 means that the probability of class C1 is 0.3 and of C2 is 0.7 given that attribute 1 can only be described as large and attribute 2 can be described as small and large. We need to be aware that linguistic expressions such as small, medium or large are not necessarily the same for each attribute, since they are defined independently on each attribute; for example, large2 denotes the fuzzy label large defined on attribute 2.
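The masses used below are obtained from the appropriateness degrees of the fuzzy labels by the linguistic translation of Definition 3.7, which is not reproduced in this chapter. For the two-label case with 50% overlap, the consonant mass assignment can be sketched as follows (the helper name and data layout are illustrative; the printed values are the ones that appear later in Example 4.2).

```python
def two_label_masses(mu_small, mu_large):
    """Consonant mass assignment over {small}, {small, large} and {large} for two
    fuzzy labels with 50% overlap: the overlapping focal set receives the smaller
    of the two appropriateness degrees and each singleton receives the excess of
    its own degree (a sketch of Definition 3.7, assuming the larger degree is 1)."""
    overlap = min(mu_small, mu_large)
    return {
        ('small',): mu_small - overlap,
        ('small', 'large'): overlap,
        ('large',): mu_large - overlap,
    }

print(two_label_masses(1.0, 0.4))  # {('small',): 0.6, ('small','large'): 0.4, ('large',): 0.0}
print(two_label_masses(0.2, 1.0))  # {('small',): 0.0, ('small','large'): 0.2, ('large',): 0.8}
```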
Given a data element x = ⟨x1, . . . , xn⟩, the probability of a branch B = ⟨F1, . . . , F|B|⟩ is evaluated by

P(B|x) = ∏_{r=1}^{|B|} mxr(Fr)    (4.8)

where mxr(Fr) are the associated masses of data element xr for r = 1, . . . , |B|. Basically, the above equation can be justified as follows: the label descriptions Dx1 = F1, . . . , Dx|B| = F|B| are assumed to be conditionally independent given x, so that we can obtain

P(Dx1 = F1, . . . , Dx|B| = F|B| | x1, . . . , xn) = ∏_{r=1}^{|B|} P(Dxr = Fr | xr)    (4.10)
                                                 = ∏_{r=1}^{|B|} mxr(Fr)    (4.11)
Based on Eq. (4.8), the probability of class Ct given B can then be evaluated by
P(Ct|B) = ∑_{i∈DBt} P(B|xi) / ∑_{i∈DB} P(B|xi)    (4.12)
where DBt is the subset consisting of instances which belong to class Ct .
Consider the case where the denominator equals zero (i.e., ∑_{i∈DB} P(B|xi) = 0), which can occur when the training database for the LDT is so small that no linguistic data with non-zero mass is covered by the branch. In this case we obtain no information from the database, so equal probabilities are assigned to each class:

P(Ct|B) = 1/|C|  for t = 1, . . . , |C|    (4.13)
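Eqs. (4.8), (4.12) and (4.13) translate directly into a few lines of code. The sketch below is illustrative only; in particular, the data layout (one dictionary of focal-element masses per attribute) is an assumption of this example rather than the book's implementation.

```python
def branch_prob(branch, x):
    """P(B|x) as the product of the masses of the branch's focal elements (Eq. (4.8))."""
    p = 1.0
    for attr, focal in branch:
        p *= x[attr].get(focal, 0.0)
    return p

def class_probs(branch, data, classes):
    """P(Ct|B) by Eq. (4.12), falling back to equal probabilities (Eq. (4.13))
    when the branch covers no data."""
    denom = sum(branch_prob(branch, x) for x, _ in data)
    if denom == 0:
        return {c: 1.0 / len(classes) for c in classes}
    return {c: sum(branch_prob(branch, x) for x, y in data if y == c) / denom
            for c in classes}

# A tiny made-up linguistic data set: per attribute, masses on focal elements.
data = [
    ({'x1': {('s1', 'l1'): 0.4, ('l1',): 0.6}, 'x2': {('s2', 'l2'): 0.7, ('l2',): 0.3}}, '+'),
    ({'x1': {('s1',): 0.2, ('s1', 'l1'): 0.8}, 'x2': {('s2',): 0.5, ('s2', 'l2'): 0.5}}, '-'),
]
branch = [('x1', ('s1', 'l1')), ('x2', ('s2', 'l2'))]
print(class_probs(branch, data, ['+', '-']))  # {'+': ~0.41, '-': ~0.59}
```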
In the process of building a linguistic decision tree, if one of the class probabilities
reaches a certain threshold at a particular depth, for example 0.9, then we might take
the view that this branch is sufficiently discriminating and that further expansion
may lead to overfitting. In this case terminating the tree expansion at the current
depth will probably help maintain accuracy on the test set. To this end, we employ
a threshold probability to determine whether or not a particular branch should
terminate.
Consider the data set shown in Table 4.2, where the 4th instance has a missing value in Attribute 1. Instead of using some ad hoc pre-processing technique (some techniques, for example, treat missing values as a new value “missing” for nominal attributes [13]), we simply assign equal probabilities to the focal elements of this missing value.
Table 4.2 An example of a small-scale artificial dataset after linguistic translation. Each
attribute has 2 independently defined fuzzy labels: small (s) and large (l)
Consider the two branches B1 = ⟨{s1}, {s2}⟩ and B2 = ⟨{s1, l1}, {s2, l2}⟩. These branches are evaluated according to Eqs. (4.8) and (4.12) (or Eq. (4.13)):

P(+|B1) = ∑_{i=1,4,5} P(B1|xi) / ∑_{i=1}^{5} P(B1|xi) = ∑_{i=1,4,5} mx1(i)({s1}) · mx2(i)({s2}) / ∑_{i=1}^{5} mx1(i)({s1}) · mx2(i)({s2})
        = (0 × 0 + 0.333 × 0 + 0 × 0.3) / (0 × 0 + 0.2 × 0.5 + 0 × 1 + 0.333 × 0 + 0 × 0.3) = 0

P(−|B1) = ∑_{i=2,3} P(B1|xi) / ∑_{i=1}^{5} P(B1|xi) = (0.2 × 0.5 + 0 × 1) / (0 × 0 + 0.2 × 0.5 + 0 × 1 + 0.333 × 0 + 0 × 0.3) = 0.1/0.1 = 1

P(+|B2) = ∑_{i=1,4,5} P(B2|xi) / ∑_{i=1}^{5} P(B2|xi) = ∑_{i=1,4,5} mx1(i)({s1, l1}) · mx2(i)({s2, l2}) / ∑_{i=1}^{5} mx1(i)({s1, l1}) · mx2(i)({s2, l2})
        = (0.4 × 0.7 + 0.333 × 1 + 1 × 0.7) / (0.4 × 0.7 + 0.8 × 0.5 + 0.9 × 0 + 0.333 × 1 + 1 × 0.7) = 0.767

P(−|B2) = ∑_{i=2,3} P(B2|xi) / ∑_{i=1}^{5} P(B2|xi) = (0.8 × 0.5 + 0.9 × 0) / (0.4 × 0.7 + 0.8 × 0.5 + 0.9 × 0 + 0.333 × 1 + 1 × 0.7) = 0.233
Jeffrey's rule

P(a) = P(a|b)P(b) + P(a|¬b)P(¬b)

is used for classifying a new data element, where P(b) and P(¬b) are considered as the beliefs in b and not b, respectively [11]. This can be generalized to the case where we are given a new condition c:

P(a|c) = P(a|b)P(b|c) + P(a|¬b)P(¬b|c)

Applying this across the branches of a linguistic decision tree, the probability of class Ct given a data element x = ⟨x1, . . . , xn⟩ is evaluated by

P(Ct|x) = ∑_{v=1}^{|T|} P(Ct|Bv) P(Bv|x)    (4.16)
where P(Bv |x) and P(Ct |Bv ) are evaluated according to Eqs. (4.8) and (4.12) (or Eq.
(4.13)), respectively. In classical decision trees, classification is made according to
the class label of the branch in which the data falls. In our approach, the data for
classification partially satisfies the constraints represented by a number of branches
and the probability estimates across the whole decision tree are then used to obtain
an overall classification.
Example 4.2 Suppose we are given the linguistic decision tree shown in Fig. 4.5 for a two-class problem with F1 = {{small1}, {small1, large1}, {large1}} and F2 = {{small2}, {small2, large2}, {large2}}. A data element y = ⟨y1, y2⟩ is given for classification such that μsmall1(y1) = 1, μlarge1(y1) = 0.4 and μsmall2(y2) = 0.2, μlarge2(y2) = 1, so that my1({small1}) = 0.6, my1({small1, large1}) = 0.4, my2({small2, large2}) = 0.2 and my2({large2}) = 0.8. The tree can be written as the set of branches
LDT = {B1, B2, B3, B4, B5, B6, B7} = {
    ⟨⟨{small1}, {small2}⟩, 0.3, 0.7⟩,
    ⟨⟨{small1}, {small2, large2}⟩, 0.5, 0.5⟩,
    ⟨⟨{small1}, {large2}⟩, 0.6, 0.4⟩,
    ⟨⟨{small1, large1}⟩, 0.1, 0.9⟩,
    ⟨⟨{large1}, {small2}⟩, 0.6, 0.4⟩,
    ⟨⟨{large1}, {small2, large2}⟩, 0.7, 0.3⟩,
    ⟨⟨{large1}, {large2}⟩, 0.2, 0.8⟩ }
P(B2 |y) = my1 ({small1 }) × my2 ({small2 , large2 }) = 0.6 × 0.2 = 0.12
P(B3 |y) = my1 ({small1 }) × my2 ({large2 }) = 0.6 × 0.8 = 0.48
P(B4 |y) = my1 ({small1 , large1 }) = 0.4
Hence, based on Jeffrey's rule (Eq. (4.16)), we can obtain

P(C1|y) = ∑_{v=1}^{7} P(C1|Bv)P(Bv|y) = ∑_{v=2,3,4} P(C1|Bv)P(Bv|y) = 0.12 × 0.5 + 0.48 × 0.6 + 0.4 × 0.1 = 0.388
P(C2|y) = ∑_{v=1}^{7} P(C2|Bv)P(Bv|y) = ∑_{v=2,3,4} P(C2|Bv)P(Bv|y) = 0.12 × 0.5 + 0.48 × 0.4 + 0.4 × 0.9 = 0.612
Usually, the decision threshold for a probabilistic classifier is 0.5 without assuming
any other prior information. Therefore, in this example, y is classified as C2
because P(C2 |y) > 0.5. However, in cost-sensitive learning, the decision threshold
is not necessarily 0.5 when considering the misclassification cost and prior class
distribution.
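As a check on the arithmetic, the following sketch (illustrative code, using the masses and tree of Example 4.2) reproduces the class probabilities above via Jeffrey's rule (Eq. (4.16)).

```python
# Mass assignments for the data element y of Example 4.2.
m_y1 = {('small1',): 0.6, ('small1', 'large1'): 0.4, ('large1',): 0.0}
m_y2 = {('small2',): 0.0, ('small2', 'large2'): 0.2, ('large2',): 0.8}

# The LDT of Fig. 4.5: (focal elements of the branch, (P(C1|B), P(C2|B))).
ldt = [
    ((('small1',), ('small2',)),          (0.3, 0.7)),
    ((('small1',), ('small2', 'large2')), (0.5, 0.5)),
    ((('small1',), ('large2',)),          (0.6, 0.4)),
    ((('small1', 'large1'),),             (0.1, 0.9)),
    ((('large1',), ('small2',)),          (0.6, 0.4)),
    ((('large1',), ('small2', 'large2')), (0.7, 0.3)),
    ((('large1',), ('large2',)),          (0.2, 0.8)),
]

def p_branch(focals):
    """P(B|y): product of the masses of the branch's focal elements (Eq. (4.8))."""
    p = m_y1[focals[0]]
    if len(focals) > 1:            # the middle branch only constrains attribute 1
        p *= m_y2[focals[1]]
    return p

p_c1 = sum(pc1 * p_branch(focals) for focals, (pc1, pc2) in ldt)
p_c2 = sum(pc2 * p_branch(focals) for focals, (pc1, pc2) in ldt)
print(round(p_c1, 3), round(p_c2, 3))  # 0.388 0.612, as in the worked example
```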
The above examples show how to calculate the class probabilities and how to use them for classification. In the next section, we introduce the algorithm for building a linguistic decision tree.
The entropy of a branch B is given by

E(B) = −∑_{t=1}^{|C|} P(Ct|B) log2 P(Ct|B)    (4.17)
Now, given a particular branch B, suppose we want to expand it with the attribute x j .
The evaluation of this attribute will be given based on the expected entropy defined
as follows:
Definition 4.5 (Expected entropy)

EE(B, xj) = ∑_{F∈Fj} P(F|B) · E(B ∪ F)

where B ∪ F represents the new branch obtained by appending the focal element F ∈ Fj to the end of branch B. The probability of F given B can be calculated as follows:

P(F|B) = ∑_{i∈DB} P(B ∪ F|xi) / ∑_{i∈DB} P(B|xi)
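Assuming the form of the expected entropy given above, its computation reduces to a weighted sum once the probabilities P(F|B) and the entropies of the extended branches are available. The helper below and its inputs are purely illustrative.

```python
def expected_entropy(p_focal_given_branch, entropy_of_extended):
    """EE(B, xj) = sum over focal elements F of P(F|B) * E(B u F)."""
    return sum(p_focal_given_branch[f] * entropy_of_extended[f]
               for f in p_focal_given_branch)

# e.g. expanding a branch with an attribute whose focal set has three elements:
p_f = {'small': 0.5, 'small_large': 0.3, 'large': 0.2}    # P(F|B), illustrative
e_f = {'small': 0.72, 'small_large': 0.97, 'large': 0.35}  # E(B u F), illustrative
print(expected_entropy(p_f, e_f))  # ~0.721
```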
The overall LID3 procedure consists of three steps:
(1) Linguistic Translation: translate real-valued data into linguistic form (see Definition 3.7). We first discretize the continuous universe of each attribute with fuzzy sets, uniformly or non-uniformly (see Section 3.3). For each data element, we then compute the appropriateness degrees, which are used in the subsequent calculation of mass assignments. The resulting linguistic database is saved for use in the second step.
(2) Decision Tree Learning: a linguistic decision tree is developed level by level according to the information heuristics. At each level, we examine each branch, calculate its class probabilities and compare them with the given threshold probability to determine whether the branch should be terminated, until the maximum depth has been reached or all branches are terminated. The pseudo-code of the tree learning process is shown in Algorithm 3 (a simplified sketch of this learning loop is also given after this list).
(3) Classification: given an LDT, we classify a data element from the class probabilities given the branches of the tree, according to Eq. (4.16).
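A compact sketch of this level-by-level learning loop is given below. It is an illustration under the reconstructions above, not the book's Algorithm 3: the helper names, the data layout and the tiny data set are all made up for the example. Expanding with the attribute of minimum expected entropy is equivalent to maximising information gain, since E(B) does not depend on the chosen attribute.

```python
import math

def branch_prob(branch, x):
    """P(B|x) (Eq. (4.8)): product of the masses of the branch's focal elements."""
    p = 1.0
    for attr, focal in branch:
        p *= x[attr].get(focal, 0.0)
    return p

def class_probs(branch, data, classes):
    """P(Ct|B) (Eq. (4.12)) with the uniform fallback of Eq. (4.13)."""
    denom = sum(branch_prob(branch, x) for x, _ in data)
    if denom == 0:
        return {c: 1.0 / len(classes) for c in classes}
    return {c: sum(branch_prob(branch, x) for x, y in data if y == c) / denom
            for c in classes}

def branch_entropy(branch, data, classes):
    """E(B) (Eq. (4.17))."""
    return -sum(p * math.log2(p)
                for p in class_probs(branch, data, classes).values() if p > 0)

def expected_entropy(branch, attr, focal_sets, data, classes):
    """Expected entropy of expanding `branch` with attribute `attr`."""
    denom = sum(branch_prob(branch, x) for x, _ in data)
    ee = 0.0
    for focal in focal_sets[attr]:
        ext = branch + [(attr, focal)]
        weight = (sum(branch_prob(ext, x) for x, _ in data) / denom) if denom else 0.0
        ee += weight * branch_entropy(ext, data, classes)
    return ee

def lid3(data, focal_sets, classes, max_depth, threshold=1.0):
    """Grow a tree level by level in the spirit of LID3: expand every live branch
    with the attribute of minimum expected entropy; a branch terminates when one
    class probability reaches `threshold`, when no free attribute remains, or when
    the maximum depth is reached."""
    leaves, live = [], [[]]                    # start with the empty (root) branch
    for _ in range(max_depth):
        next_level = []
        for branch in live:
            probs = class_probs(branch, data, classes)
            used = {attr for attr, _ in branch}
            free = [a for a in focal_sets if a not in used]
            if max(probs.values()) >= threshold or not free:
                leaves.append((branch, probs))
                continue
            best = min(free, key=lambda a: expected_entropy(branch, a, focal_sets,
                                                            data, classes))
            next_level.extend(branch + [(best, f)] for f in focal_sets[best])
        live = next_level
        if not live:
            break
    leaves.extend((b, class_probs(b, data, classes)) for b in live)
    return leaves

# A tiny, hand-made linguistic data set (all masses are purely illustrative).
focal_sets = {'x1': [('s1',), ('s1', 'l1'), ('l1',)],
              'x2': [('s2',), ('s2', 'l2'), ('l2',)]}
data = [
    ({'x1': {('s1',): 0.7, ('s1', 'l1'): 0.3}, 'x2': {('l2',): 1.0}}, '+'),
    ({'x1': {('s1', 'l1'): 0.5, ('l1',): 0.5}, 'x2': {('s2',): 0.6, ('s2', 'l2'): 0.4}}, '-'),
    ({'x1': {('l1',): 1.0}, 'x2': {('s2', 'l2'): 0.8, ('l2',): 0.2}}, '+'),
]
for branch, probs in lid3(data, focal_sets, ['+', '-'], max_depth=2, threshold=0.9):
    print(branch, {c: round(p, 3) for c, p in probs.items()})
```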
We evaluated the LID3 algorithm by using 14 datasets taken from the UCI Machine
Learning repository [9] . These datasets have representative properties of real-world
data, such as missing values, multiple classes, mixed-type (numerical and nominal) attributes and unbalanced class distributions. Table 4.3 shows, for each dataset, the number of classes, the number of instances, the number of numerical (Num.) and nominal
(Nom.) attributes and whether or not the database contains missing values.
Table 4.3 Descriptions of datasets from UCI machine learning repository. Other details
about these data sets are available in Reference [9]
Each dataset was randomly split into two sub-datasets, one half for training and the other half for testing. This is referred to as a 50-50 split experiment. The maximal depth is set manually and the results show
the best performance of LID3 across a range of depth settings. We also test the LID3
algorithm with different threshold probabilities T ranging from 0.6 to 1.0 in steps
of 0.1 and for the different fuzzy discretization methods: uniform, entropy-based
and percentile-based (see Section 3.3). For each dataset, we ran the 50-50 random split experiment 10 times. The average test accuracy with standard deviation is shown
on the right-hand side of Table 4.5 and the probability and the depth at which we
obtain this accuracy are listed in Table 4.4 .
Table 4.4 Summary of the threshold probabilities and depths for obtaining the best accuracy
with different discretization methods in the given datasets
Table 4.5 Accuracy (with standard deviation) of LID3 based on different discretization
methods and three other well-known machine learning algorithms
Research on PETs (Probability Estimation Trees) [12] also suggests that fully expanded estimation trees give better performance than pruned trees. In classical decision tree learning, such as ID3 and C4.5, pruning can reduce overfitting, so that pruned trees generalize better and perform better than unpruned trees; this is not the case for probability estimation trees, however, because the heuristics used to generate small and compact trees by pruning tend to reduce the quality of the probability estimates [12]. In this
context linguistic decision trees can be thought of as a type of probability estimation
tree but where the branches correspond to linguistic descriptions of objects. Strictly
speaking, our linguistic decision tree model is a probability estimation tree though
we employ the name of “linguistic decision tree”. The key difference between these
two types of trees is that PETs give probability estimation according to which we
can rank the examples given a target class [12]. Different decision thresholds, related to the class distribution or to the cost matrix in cost-sensitive problems, may therefore lead to different classification results, whereas classical decision trees only give a crisp classification of the examples. The difference in accuracy resulting from varying the
threshold probability T is quite data dependent. Figs. 4.6 to 4.8 show the results of
the datasets given in Table 4.3 . We will consider the results for 4 typical datasets:
Breast-w, Heart-statlog, Glass and Breast-Cancer. In Breast-w, the accuracy curves
are nested relative to increasing values of T . The models with high T values
outperform those with lower T values at all depths. The datasets Iris, Balance, Sonar, Wine and Ecoli also behave in this way. On the other hand, for the datasets Heart-Statlog,
Pima, Liver, Heart-C and Heptitis, the accuracy curve of T = 0.9 is better than all
other T values at certain depths. In addition, for the datasets Glass and Ecoli the accuracy curves for different T values are very close to each other and even identical in some trials. For
the Breast-Cancer dataset the accuracy actually decreases with increasing T . All of
the datasets we tested have almost the same trends.
Fig. 4.6 Comparison of accuracy at different depths with threshold probabilities ranging from 0.6 to 1.0 on data sets #1 and #2 in Table 4.3
PT = a/2m    (4.21)

If there is 50% overlap between two neighbouring fuzzy labels, a equals m, so that PT = 0.5; if there is no overlap at all, a = 0, so that PT = 0. Fig. 4.10 shows examples of fuzzy discretization with different PT values on a continuous universe.
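The sketch below illustrates Eq. (4.21) together with one possible way of constructing a pair of neighbouring labels with a prescribed overlap. Both the geometric reading of a and m and the particular membership construction are assumptions made for this example, since Fig. 4.9 is not reproduced here.

```python
def overlap_parameter(a, m):
    """PT = a / (2m), Eq. (4.21): `a` is the length of the overlap between two
    neighbouring fuzzy labels and `m` is read here as the half-width of a label."""
    return a / (2.0 * m)

def neighbouring_labels(boundary, a):
    """One possible pair of neighbouring labels F (left) and G (right) overlapping
    on an interval of length `a` centred on `boundary`; inside the overlap the
    memberships vary linearly and sum to 1 (illustrative construction only)."""
    def mu_F(x):
        if a == 0:
            return 1.0 if x <= boundary else 0.0
        return max(0.0, min(1.0, (boundary + a / 2.0 - x) / a))
    def mu_G(x):
        return 1.0 - mu_F(x)
    return mu_F, mu_G

print(overlap_parameter(a=1.0, m=1.0))  # 50% overlap -> PT = 0.5
print(overlap_parameter(a=0.0, m=1.0))  # no overlap  -> PT = 0.0
mu_F, mu_G = neighbouring_labels(boundary=0.5, a=0.2)
print([round(mu_F(x), 2) for x in (0.3, 0.45, 0.5, 0.55, 0.7)])  # [1.0, 0.75, 0.5, 0.25, 0.0]
```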
We tested 10 datasets taken from the UCI repository [9], and the average results with standard deviation based on 10 runs of 50-50 split experiments are shown in Figs. 4.11 and 4.12. As we can see from these figures, accuracy generally increases with increasing PT values.
Fig. 4.7 Comparison of accuracy at different depths with threshold probabilities ranging from 0.6 to 1.0 on data sets #3 to #8 in Table 4.3
Fig. 4.8 Comparison of accuracy at different depths with threshold probabilities ranging from 0.6 to 1.0 on data sets #9 to #14 in Table 4.3
Fig. 4.9 A schematic illustration of calculating the overlap parameter PT given different degrees of overlap
Only the Ionosphere dataset is an exception. In order to simplify the label semantics, it is not necessary to use fuzzy labels with more than 50% overlap. Hence, we will use fuzzy labels with 50% overlap for discretization in the following experiments.
Fig. 4.10 Fuzzy discretization with focal elements with different overlapping degrees. The
degrees of overlapping are PT = 0.0, 0.3, 0.5, 0.8, 1.0 from subfigure (a) to (e)
Fig. 4.11 Average accuracy with standard deviation for different PT values on the Balance, Ecoli, Glass, Heptitis, Iris and Liver datasets
Fig. 4.12 Average accuracy with standard deviation for different PT values on the Pima, Ionosphere, Heart-C, Breast-Cancer, Breast-w and Wine datasets
Naive Bayes is a simple but effective probability estimation method, and neural networks are black-box models well known for their high predictive accuracy. We then carried out paired t-tests with a confidence level of 90% to compare LID3-Uniform, LID3-Entropy and LID3-Percentile with each of the three algorithms [3]. A summary of the results is shown in Table 4.6.
Table 4.6 Summary of comparisons of LID3 based on different discretization methods with
three other well-known machine learning algorithms
Across the data sets, all LID3 algorithms (with different discretization
techniques) outperform C4.5, with LID3-Percentile achieving the best results with
10 wins, 4 ties and no losses. The performance of the Naive Bayes algorithm and
LID3-Uniform is roughly equivalent although LID3-Entropy and LID3-Percentile
outperform Naive Bayes. From Table 4.5 , we can see that the datasets on which
Naive Bayes outperforms LID3 are those with a mixture of continuous and discrete
attributes, namely Heart-C, Heart-Statlog and Heptitis. Most of the comparisons
with the Neural Network result in ties rather than wins or losses, especially
for LID3-Entropy and LID3-Percentile. Due to the limited number and type of datasets used for evaluation, we cannot draw the strong conclusion that LID3 outperforms all three other algorithms. However, we can at least conclude that, based on our experiments, LID3 outperforms C4.5 and has performance equivalent to Naive Bayes and the neural networks. For the datasets with only numerical values,
LID3 outperforms both C4.5 and Naive Bayes. Between different discretization
methods, percentile-based and entropy-based approaches achieve better results than
uniform discretization.
In the previous section, we showed that LID3 performs at least as well as and often
better than, three well-known classification algorithms across a range of datasets.
However, even with only 2 fuzzy sets for discretization, the number of branches
increases exponentially with the depth of the tree. Unfortunately, the transparency
of the LDT decreases with the increasing number of branches. To help maintain transparency by generating more compact trees, a forward merging algorithm based
on the LDT model is introduced in this section and experimental results are given to
support the validity of our approach.
If all the class probabilities of two adjacent branches B1 and B2 differ by no more than a given merging threshold Tm, i.e.,

|P(Ct|B1) − P(Ct|B2)| ≤ Tm  for t = 1, . . . , |C|

where C = {C1, . . . , C|C|} is the set of classes, then B1 and B2 can be merged into one branch MB. A merged linguistic decision tree MT can be represented by a set of merged branches MB, or formally

MT = {MB1, . . . , MB|MT|}

A merged branch is written as

MB = ⟨Mj1, . . . , Mj|MB|⟩

where |MB| is the number of nodes of the branch MB and each node is a set of focal elements

Mj = {Fj1, . . . , Fj|Mj|}

such that Fji is adjacent to Fji+1 for i = 1, . . . , |Mj| − 1. If |Mj| > 1, the node is called a compound node, which means it is composed of more than one focal element as a result of merging (e.g., see Fig. 4.13). The associated mass for Mj is given by

mx(Mj) = ∑_{i=1}^{|Mj|} mx(Fji)    (4.23)
Fig. 4.13 An illustration of the forward branch merging process. Given the merging threshold 0.1, branches L2 and L3 are first merged into a new branch L2 and branches L5 and L6 are merged into a new branch L4. The new L4 and L5 are then merged in a further step, because the differences between their class probabilities are less than or equal to the merging threshold 0.1. Finally, the original LDT with 7 branches is merged into a new LDT with only 4 branches
Therefore, based on Eqs. (4.12), (4.13) and (4.23), we use the following formula to calculate the class probabilities given a merged branch:

P(Ct|MB) = ∑_{i∈DBt} P(MB|xi) / ∑_{i∈DB} P(MB|xi)
Different from the normal LDTs, each branch of the dual-branch LDT has the
summed masses of the neighboring focal elements.
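The forward-merging test itself can be sketched as follows. This is an illustration only: it merges adjacent leaves of one level whenever all of their class probabilities differ by at most Tm, and it simply averages the class probabilities of merged leaves, whereas the formula above recomputes them from the summed masses of Eq. (4.23).

```python
def can_merge(probs1, probs2, t_m):
    """Adjacent branches may be merged if every class probability differs by at
    most the merging threshold Tm (a sketch of the merging criterion)."""
    return all(abs(probs1[c] - probs2[c]) <= t_m for c in probs1)

def forward_merge(leaves, t_m):
    """Greedily merge adjacent leaves of one tree level.  Each leaf is a
    (label, class_probability_dict) pair; merged probabilities are averaged here
    purely for illustration."""
    merged = [leaves[0]]
    for label, probs in leaves[1:]:
        last_label, last_probs = merged[-1]
        if can_merge(last_probs, probs, t_m):
            new_probs = {c: (last_probs[c] + probs[c]) / 2 for c in probs}
            merged[-1] = (last_label + '+' + label, new_probs)
        else:
            merged.append((label, probs))
    return merged

leaves = [('L1', {'C1': 0.3, 'C2': 0.7}), ('L2', {'C1': 0.5, 'C2': 0.5}),
          ('L3', {'C1': 0.6, 'C2': 0.4}), ('L4', {'C1': 0.9, 'C2': 0.1})]
for label, probs in forward_merge(leaves, t_m=0.1):
    print(label, probs)   # L2 and L3 are merged; L1 and L4 are left unchanged
```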
Table 4.7 Comparisons of accuracy and the number of leaves (rules) |T | with different
merging thresholds Tm across a set of UCI datasets [9] . The results for Tm = 0 are obtained
with NF = 2 and results for other Tm values are obtained with NF values listed in the second
column of the table
The number of fuzzy sets NF in the merging algorithm is also a key parameter.
Compared to NF = 3, setting NF = 2 can achieve better transparency, but for some
datasets, with NF = 2, the accuracy is greatly reduced although the resulting trees
have significantly fewer branches. For example, Figs. 4.14 and 4.15 show the
change in test accuracy and the number of leaves (or the number of rules interpreted
from an LDT) for different Tm on the Breast-w dataset, with NF = 2 in Fig. 4.14 and NF = 3 in Fig. 4.15. Fig. 4.14 shows that the accuracy is not greatly influenced by merging, but the number of branches is greatly reduced. This is especially true for the curve marked by ‘+’, corresponding to Tm = 0.3: with forward merging, the best accuracy (at depth 4) is only reduced by approximately 1%, whereas the number of branches is reduced by roughly 84%. In Fig. 4.15, however, at depth 4 with Tm = 0.3 the accuracy is also reduced by about 1% but the number of branches is only reduced by 55%. So, for this dataset, we should choose NF = 2 rather than NF = 3.
However, this is not always the case. For the dataset Iris, the change in accuracy
and the number of branches against depth with NF = 2 and NF = 3 is shown in
Figs. 4.16 and 4.17 , respectively. As we can see from Fig. 4.16 , by applying the
forward merging algorithm, the accuracy is greatly changed. The best accuracy with
Fig. 4.14 The change in accuracy and number of leaves as Tm varies in the Breast-w dataset
with NF = 2
Fig. 4.15 The change in accuracy and number of leaves as Tm varies in the Breast-w dataset with NF = 3; the dotted curve for Tm = 0 is obtained with NF = 2
Fig. 4.16 The change in accuracy and number of leaves as Tm varies in the Iris dataset with
NF = 2
Fig. 4.17 The change in accuracy and number of leaves as Tm varies in the Iris dataset with NF = 3; the dotted curve for Tm = 0 is obtained with NF = 2
merging is roughly 10% worse than with the non-merging algorithm. For NF = 3, however, as we can see from Fig. 4.17, the accuracy is not reduced as greatly (compared to the accuracy for Tm = 0 obtained with NF = 2) as it is with NF = 2, and we still obtain a reduced number of branches. In this case we should prefer NF = 3.
Table 4.7 shows the results with optimal NF and different Tm ranging from 0
to 0.4, where Tm = 0 represents no merging. Acc represents the average accuracy
from 10 runs of experiments and |T | is the average number of rules (leaves).
Unless otherwise stated, the results obtained in this section are with the threshold
probability set to T = 1. The results for Tm from 0.1 to 0.4 are obtained at
the depth where the optimal accuracy is found when Tm = 0. As we can see
from the table, for most cases, the accuracy before and after merging is not
significantly different but the number of branches is dramatically reduced. In some
cases, the merging algorithm even outperforms LID3 without merging. A possible reason for this is that the merging algorithm generates self-adapting granularities based on class probabilities. Compared to other methods that discretize
attributes independently, merging may generate a more reasonable tree with more
appropriate information granules.
Table 4.8 Mean AUC values with standard deviation of six data sets with different merging
thresholds
Fig. 4.18 Comparison between non-merged trees and merged trees with Tm ranging from 0.1 to 0.3 on the given test data in terms of average AUC values
Fig. 4.19 Comparison between non-merged trees and merged trees with Tm ranging from
0.1 to 0.3 on the given test data in terms of the number of branches
The α sentence for a focal set such as {s, m} states that the appropriate labels include s and m and exclude all other labels that occur in focal sets that are supersets of {s, m}. Given a set of focal sets {{s, m}, {m}}, this provides the information that the set of labels is either {s, m} or {m}, and hence the sentence providing the same information should be the disjunction of the α sentences for both focal sets (see Section 3.4.3).
Fig. 4.20 A merged linguistic decision tree for the Iris problem
As discussed in the last section, a merged LDT was obtained from the real-world Iris dataset at depth 2 with Tm = 0.3, where Lj = {smallj (sj), mediumj (mj), largej (lj)} for j = 1, . . . , 4 (see Fig. 4.21).
We can then translate this tree into a set of linguistic expressions as follows:

MTiris = {
    ⟨s3 ∧ ¬(m3 ∨ l3), (1.0000, 0.0000, 0.0000)⟩,
    ⟨m3 ∧ ¬l3, s4 ∧ ¬(m4 ∨ l4), (1.0000, 0.0000, 0.0000)⟩,
    ⟨m3 ∧ ¬l3, m4 ∧ ¬l4, (0.0008, 0.9992, 0.0000)⟩,
    ⟨m3 ∧ ¬l3, ¬s4 ∧ m4 ∧ l4, (0.0000, 0.5106, 0.4894)⟩,
    ⟨m3 ∧ ¬l3, ¬(s4 ∨ m4) ∧ l4, (0.0000, 0.0556, 0.9444)⟩
Furthermore, the tree itself can be rewritten as a set of fuzzy rules. For example, branch 2 corresponds to the rule:
IF Attribute 3 is medium but not large and Attribute 4 is only small, THEN the class probabilities given this branch are (1.0000, 0.0000, 0.0000).
Fig. 4.21 A merged linguistic decision tree in logical expressions for the LDT shown in Fig.
4.20
More details of the calculation of mass assignments given a linguistic constraint are given in Example 4.3. For a branch B the probability of B given θ is evaluated by

P(B|θ) = ∏_{r=1}^{|B|} mθjr(Fjr)    (4.30)

where |B| is the number of nodes in branch B. By Jeffrey's rule [11], we can obtain

P(Ct|θ) = ∑_{v=1}^{|T|} P(Ct|Bv) P(Bv|θ)    (4.31)
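Eqs. (4.30) and (4.31) can be sketched in code as follows. The constraint masses m_theta used here are illustrative values only; in the book they are derived from the λ-sets of the linguistic constraint, as outlined in Example 4.3 below.

```python
def classify_with_constraint(ldt, m_theta):
    """P(B|theta) is the product of the constraint masses of the branch's focal
    elements (Eq. (4.30)); the class probabilities then follow from Jeffrey's
    rule (Eq. (4.31))."""
    class_names = list(ldt[0][1].keys())
    result = {c: 0.0 for c in class_names}
    for branch, probs in ldt:
        p_branch = 1.0
        for attr, focal in branch:
            p_branch *= m_theta[attr].get(focal, 0.0)
        for c in class_names:
            result[c] += probs[c] * p_branch
    return result

# The two-attribute LDT of Fig. 4.5, written as (branch, class probabilities).
ldt = [
    ([('x1', ('s1',)), ('x2', ('s2',))],      {'C1': 0.3, 'C2': 0.7}),
    ([('x1', ('s1',)), ('x2', ('s2', 'l2'))], {'C1': 0.5, 'C2': 0.5}),
    ([('x1', ('s1',)), ('x2', ('l2',))],      {'C1': 0.6, 'C2': 0.4}),
    ([('x1', ('s1', 'l1'))],                  {'C1': 0.1, 'C2': 0.9}),
    ([('x1', ('l1',)), ('x2', ('s2',))],      {'C1': 0.6, 'C2': 0.4}),
    ([('x1', ('l1',)), ('x2', ('s2', 'l2'))], {'C1': 0.7, 'C2': 0.3}),
    ([('x1', ('l1',)), ('x2', ('l2',))],      {'C1': 0.2, 'C2': 0.8}),
]
m_theta = {'x1': {('s1',): 1.0},                       # "x1 is not large"
           'x2': {('s2',): 0.5, ('s2', 'l2'): 0.5}}    # "x2 is small" (illustrative masses)
print(classify_with_constraint(ldt, m_theta))          # approximately {'C1': 0.4, 'C2': 0.6}
```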
Example 4.3 Given the LDT in Example 4.2, suppose we know that for a particular data element “x1 is not large and x2 is small”. We can then translate this knowledge into the following linguistic constraint vector:

θ = ⟨θ1, θ2⟩ = ⟨¬large1, small2⟩

By applying the λ-function (Definition 3.10), we can generate the associated label sets, so that:
λ (¬large1 ) = {{small1 }}
λ (small2 ) = {{small2 }, {small2 , large2 }}
This gives
Currently, there are very few benchmark problems of this kind with fuzzy attribute
values. This is because, traditionally, only crisp data values are recorded even in
cases where this is inappropriate. Hence, we have generated a fuzzy database from
a toy problem where the aim is to identify the interior of a figure of eight shape.
Specifically, a figure of eight shape was generated according to the equation
where t ∈ [0, 2π] (See Fig. 4.23 ). Points in [−1.6, 1.6]2 are classified as legal if they
lie within the ‘eight’ shape (marked with ×) and illegal if they lie outside (marked
with points).
To form the fuzzy database we first generated a crisp database by uniformly
sampling 961 points across [−1.6, 1.6]2 . Then each data vector x1 , x2 was
converted to a vector of linguistic expressions ⟨θ1, θ2⟩ as follows: θj = θRj, where Rj = {F ∈ Fj : mxj(F) > 0}.
An LDT was then learnt by applying the LID3 algorithm to the crisp database. This
tree was then used to classify both the crisp and fuzzy data. The results are shown
in Table 4.9 and the results with NF = 7 are shown in Fig. 4.22 .
Fig. 4.22 Classification of crisp dataset (a) and fuzzy data without masses (b), where each
attribute is discretized uniformly by 7 fuzzy sets
Table 4.9 Classification accuracy based on crisp data and fuzzy data of the “eight” problem
NF = 3 NF = 4 NF = 5 NF = 6 NF = 7
Crisp Data 87.72% 94.17% 95.94% 97.29% 98.54%
Fuzzy Data 79.29% 85.85% 89.39% 94.17% 95.01%
As we can see from Table 4.9 , our model gives a reasonable approximation of
the legal data area, though it is not as accurate as testing on crisp data. The accuracy
increases with NF the number of fuzzy sets used for discretization. These results
show that the LDT model can perform well in dealing with fuzzy and ambiguous
data. Here the “eight” problem is also used for testing classification with linguistic
constraints in the following example.
Example 4.4 Suppose an LDT is trained on the “eight” database where each
attribute is discretized uniformly by five fuzzy sets: very small (vs), small (s), medium (m), large (l) and very large (vl). Further, suppose we are given the following
description of data points:
Experimental results obtained based on the approach introduced in Section 4.7 are
as follows:
As we can see from Fig. 4.23 , the above 3 linguistic constraints roughly
correspond to the areas 1, 2 and 3, respectively. By considering the occurrence of
legal and illegal examples within these areas, we can verify the correctness of our
approach.
Fig. 4.23 Testing on the “eight” problem with linguistic constraints θ , where each attribute
is discretized by 5 trapezoidal fuzzy sets: very small, small, medium, large and very large
4.8 Summary
In this chapter, a decision tree learning algorithm is proposed based on label
semantics. Unlike classical decision trees, the new algorithm uses probability
estimation based on linguistic labels. The linguistic labels are based on fuzzy
discretization using a number of different methods including uniform partitioning,
a percentile-based partitioning and an entropy-based partitioning. We found that percentile-based and entropy-based discretization outperform uniform discretization, although the differences are not statistically significant. By testing our new model on real-world datasets and comparing it with three well-known machine learning algorithms, we found that LID3 outperformed C4.5 on all given datasets and outperformed Naive Bayes on datasets with only numerical attributes. It also has equivalent classification accuracy and better transparency when compared to back-propagation neural networks.
In order to obtain compact trees, a forward merging algorithm was proposed, and the experimental results show that the number of branches can be greatly reduced without a significant loss in accuracy. Finally, we introduced a method for interpreting a linguistic decision tree as a set of linguistic rules joined by logical connectives. Methods for classification with linguistic constraints and for fuzzy data classification were discussed, supported by tests on a toy problem. In the subsequent chapter, we will focus on extending the LDT model from classification problems to prediction problems.
References
[1] Quinlan J. R.: Induction of decision trees, Machine Learning, 1: pp. 81-106.
(1986).
[2] Quinlan J. R.: C4.5: Programs for Machine Learning, San Mateo: Morgan Kaufmann. (1993).
[3] Mitchell T.: Machine Learning, McGraw-Hill, New York. (1997).
[4] Berthold M., Hand D. J.: Ed. , Intelligent Data Analysis, Springer-Verlag,
Berlin Heidelberg. (1999).
[5] Peng Y., Flach P. A.: Soft discretization to enhance the continuous decision
trees, Integrating Aspects of Data Mining, Decision Support and Meta-
Learning, C. Giraud-Carrier, N. Lavrac and S. Moyle, editors, pp. 109-118,
ECML/PKDD’01 workshop. (2001).
[6] Baldwin J. F., Lawry J., Martin T. P.: Mass assignment fuzzy ID3 with
applications. Proceedings of the Unicom Workshop on Fuzzy Logic:
Applications and Future Directions, pp. 278-294, London. (1997).
[7] Janikow C. Z.: Fuzzy decision trees: issues and methods, IEEE Trans. on
Systems, Man, and Cybernetics-Part B: Cybernetics, 28/1: pp. 1-14. (1998).
[8] Olaru C., Wehenkel L.: A complete fuzzy decision tree technique, Fuzzy Sets and Systems, 138: pp. 221-254. (2003).
[9] Blake C., Merz C. J.: UCI machine learning repository.
[10] Qin Z., Lawry J.: A tree-structured classification model based on label
semantics, Proceedings of the 10th International Conference on Information
Processing and Management of Uncertainty in Knowledge-based Systems
(IPMU-04), pp. 261-268, Perugia, Italy. (2004).
[11] Jeffrey R. C.: The Logic of Decision, Gordon & Breach Inc., New York.
(1965).
[12] Provost F., Domingos P.: Tree induction for probability-based ranking,
Machine Learning, 52, pp. 199-215. (2003).
[13] Witten I. H., Frank E.: Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations, Morgan Kaufmann. (1999).