
4 Linguistic Decision Trees for Classification

Science is the systematic classification of experience.


— George Henry Lewes (1817-1878)

4.1 Introduction

In this chapter, label semantics theory is applied to designing transparent data mining models. A label semantics based decision tree model is proposed in which nodes are linguistic descriptions of variables and leaves are sets of appropriate labels. For each branch, instead of labeling it with a certain class, the probability of a particular class given the branch is computed from the given training dataset. This new model is referred to as a linguistic decision tree (LDT).
A new algorithm for building such a tree model, guided by information based heuristics, is proposed by modifying the classical ID3 algorithm in accordance with label semantics theory. Empirical experiments on real-world datasets verify that LDTs have better or equivalent classification accuracy compared to three well-known machine learning algorithms: C4.5, Naive Bayes (NB) and Back Propagation (BP) Neural Networks. Each LDT can be interpreted as a set of linguistic rules, which gives this model good transparency compared to black-box data mining models. By applying a newly proposed forward branch merging algorithm, the complexity of the tree can be greatly reduced without significant loss of accuracy. Finally, a method for linguistic query evaluation is discussed and supported with an example at the end of this chapter. This methodology can be extended to learning from fuzzy data.

4.2 Tree Induction

Tree induction learning models have received a great deal of attention over recent
years in the fields of machine learning and data mining because of their simplicity


and effectiveness. Among them, the Iterative Dichotomiser 3 (ID3) [1] algorithm for decision tree induction has proved to be an effective and popular algorithm for building decision trees from discrete valued datasets. The C4.5 algorithm [2] was proposed as a successor to ID3, in which an entropy based approach to crisp partitioning of continuous universes was adopted.
Decision tree induction is one of the simplest and yet most successful learning
algorithms. A decision tree (DT) consists of internal and external nodes and the
interconnections between nodes are called branches of the tree. An internal node
is a decision-making unit to decide which child nodes to visit next depending on
different possible values of associated variables. In contrast, an external node, also
known as leaf node, is the terminated node of a branch. It has no child nodes and is
associated with a class label that describes the given data. A decision tree is a set of
rules in a tree structure. Each branch can be interpreted as a decision rule associated
with nodes visited along this branch. For example, Fig. 4.2 is a decision tree which
is generated from the “play-tennis” problem [3] . The database for this problem is
shown in Fig. 4.1 .

Day Outlook Temperature Humidity Wind Play-tennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Weak Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Fig. 4.1 Database for the "play-tennis" problem [3]. Each instance has 4 attributes and one class label of either Yes or No; all attributes take discrete values

Decision trees classify instances by sorting instances down the tree from root to
leaf nodes. This tree-structured classifier partitions the input space of the data set
recursively into mutually exclusive subspaces. Following this structure, each training instance is identified as belonging to a certain subspace, which is assigned a label, a value, or an action to characterize its data points. The decision tree mechanism
has good transparency in that we can follow a tree structure easily in order to
explain how a decision is made. Thus interpretability is enhanced when we clarify
the conditional rules characterizing the tree.

[Decision tree structure: Outlook = Sunny → Humidity (High → No, Normal → Yes); Outlook = Overcast → Yes; Outlook = Rain → Wind (Strong → No, Weak → Yes)]

Fig. 4.2 A decision tree built from the play-tennis problem [3]

4.2.1 Entropy
The entropy of a random variable is the average amount of information generated by observing its value. Consider the random experiment of tossing a coin with probability of heads equal to 0.9, i.e., a random variable x with

P(x = Head) = 0.9, P(x = Tail) = 0.1

The amount of information generated by observing a head is less than that generated by observing a tail. Intuitively, observing the outcome "heads" provides little information, since the probability of heads is 0.9, i.e., heads is almost certain to come up. Observing "tails", on the other hand, provides much information, since its probability is low [4]. In statistical physics, entropy is used to evaluate the randomness of a state: a large entropy value indicates that more randomness is involved. If an event x happens with probability P(x), the function H(x) measuring its information content should be inversely proportional to P(x); it should satisfy

H(x) = 1/P(x) (4.1)
If two independent events x1 and x2 both happen, the probability of the joint event is P(x1)P(x2); the information content, however, should be additive:

H(x1 , x2 ) = H(x1 ) + H(x2 ) (4.2)

If we use a logarithmic function

H(x) = log(1/P(x)) (4.3)
both of the above conditions can be satisfied. Given a random variable x with probability P(x), the entropy is the expectation of the information content function H(x):

E(x) = ∑x P(x)H(x) = ∑x P(x) log(1/P(x)) = − ∑x P(x) log P(x) (4.4)

Because computers and communication systems fundamentally use binary codes, we choose base 2 for convenience. The entropy can therefore be formally defined by

Entropy(x) = − ∑i pi log2 pi (4.5)
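As a quick illustration of Eq. (4.5), the following Python sketch computes the entropy of a discrete distribution; the function name and example values are illustrative only.

import math

def entropy(probs):
    # Entropy (base 2) of a discrete distribution; zero-probability terms are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin gives the maximum entropy of 1 bit; the biased coin discussed above
# (P(Head) = 0.9, P(Tail) = 0.1) gives a much smaller value.
print(entropy([0.5, 0.5]))   # 1.0
print(entropy([0.9, 0.1]))   # approximately 0.469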

We have introduced DTs and how they make decisions; the harder problem is how to build a DT from training data. The most popular decision tree induction algorithm, ID3, was introduced by Quinlan in 1986 [1]. It has proved to be an effective and popular algorithm for building decision trees from discrete valued datasets. The growth of the decision tree is guided heuristically according to the information content of each attribute. In classification problems, we can also say that entropy is a measure of the impurity in a collection of training examples: the larger the entropy, the more random the data. If the entropy equals 1, the data is distributed uniformly across the classes. Fig. 4.3 shows the
entropy function when the proportion of positive examples varies from 0 to 1 in a
binary classification problem.


Fig. 4.3 The entropy for binary classification problems

Based on entropy, Information Gain (IG) is used to measure the effectiveness of


an attribute as a means of discriminating between classes.

IG(S, A) = Entropy(S) − ∑v∈Values(A) (|Sv|/|S|) Entropy(Sv) (4.6)

It is simply the expected reduction of entropy caused by partitioning the examples


according to this attribute. More details regarding the ID3 algorithm are given in [3] .
The ID3 algorithm is a hill-climbing algorithm through the hypothesis space, guided by information gain. This gives the following approximation of its inductive bias: shorter trees are preferred over longer trees, and trees that place high information gain attributes close to the root are preferred over those that do not. The pseudo-code is given in Algorithm 2.

Algorithm 2: ID3 algorithm for decision tree learning

ID3 (Examples, Target_Attribute, Attributes)
• Create a root node for the tree.
• If all examples are positive, return the single-node tree Root with label = +.
• If all examples are negative, return the single-node tree Root with label = −.
• If the set of predicting attributes is empty, return the single-node tree Root with label = the most common value of the target attribute in the examples.
• Otherwise begin:
  – A = the attribute that best classifies the examples (i.e., with maximum information gain).
  – Set the decision attribute for Root to A.
  – For each possible value vi of A:
    · Add a new branch below Root, corresponding to the test A = vi.
    · Let Examples(vi) be the subset of examples that have the value vi for A.
    · If Examples(vi) is empty, then below this new branch add a leaf node with label = the most common target value in the examples.
    · Else below this new branch add the subtree ID3(Examples(vi), Target_Attribute, Attributes − {A}).
• End
• Return Root
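To illustrate the information gain heuristic that drives Algorithm 2, the sketch below computes IG(S, Outlook) for the play-tennis data of Fig. 4.1; the data encoding and function names are assumptions made for this example.

import math
from collections import Counter

# (Outlook, Play-tennis) pairs taken from Fig. 4.1.
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
        ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
        ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rain", "No")]

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(pairs):
    # Eq. (4.6): entropy of the whole set minus the weighted entropy of each partition.
    labels = [c for _, c in pairs]
    gain = entropy(labels)
    for value in set(v for v, _ in pairs):
        subset = [c for v, c in pairs if v == value]
        gain -= len(subset) / len(pairs) * entropy(subset)
    return gain

print(round(information_gain(data), 3))   # about 0.247 bits for Outlook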

It is also important to realize that the ID3 algorithm is not suitable for all learning and classification problems. Typically, problems with the following main characteristics can be modeled by a decision tree:
(1) Instances are represented by attribute-value pairs.
(2) The target function has discrete values.
When the first requirement is not met, techniques for "learning from structured data" have been developed as a part of Inductive Logic Programming (ILP). Regarding the second requirement, much real-world data, including scientific, engineering, medical and financial data, is continuous. In order to learn from continuous data, we need to partition the continuous universe using some type of discretization algorithm, as will be discussed in the following chapters.

4.2.2 Soft Decision Trees


One inherent disadvantage of crisp partitioning is that it tends to make the induced
decision trees sensitive to noise. This noise is not only due to the lack of precision
or errors in measured features but is often present in the model itself, since the
available features may not be sufficient to provide a complete model of the system.
For each attribute, disjoint classes are separated with clearly defined boundaries.
These boundaries are “critical” since a small change close to these points will
probably cause a complete change in classification.
Due to the existence of uncertainty and imprecise information in real-world
problems, the class boundaries may not be defined clearly. In this case, decision
trees may produce high misclassification rates in testing, even if they perform well
in training [3,5] .
This fact can be illustrated as follows: Fig. 4.4 shows a decision tree in a two-
class problem, in which there are two continuous attributes x and y. Using crisp
discretization, the decision space is partitioned into a set of non-overlapping sub-
regions A1 , A2 and A3 , which have clear boundaries with each other. The object
for classification will definitely fall into one of these areas. For example, the given
object (x = 13.5, y = 46.0) will be classified as A3. However, if this object is distorted by noise so that (x = 12.9, y = 46.2), then the object will be misclassified as A1 (see Fig. 4.4 (a)). In contrast, consider the use of fuzzy discretization (Fig. 4.4
(b)), where the continuous universe is partitioned by overlapped trapezoidal fuzzy
sets {x1 , x2 } and {y1 , y2 }. As shown in the right-hand figures of Fig. 4.4 , A1 , A2
and A3 generated from fuzzy discretization appear as overlapping subregions with
blurred boundaries. The possibility degree of an object belonging to each region
will be given by the membership of pre-defined fuzzy sets. The object may fall in
the overlapping area. These results can then aid the human users to make their final
decisions or suggest further investigation.
Many fuzzy approaches to decision tree learning have been proposed to overcome the weaknesses described above [5-8]. In particular, [8] gives a comprehensive overview of the fuzzy decision tree literature. The algorithm we will present can also be considered a soft decision tree. It is based on label semantics theory (see Chapter 3) and provides a good interpretation of decision rules by using linguistic labels.

4.3 Linguistic Decision for Classification


To avoid confusion over the notation used in label semantics and the LDT model, we highlight the most important notation in Table 4.1. Generally, a tree is considered as a set of branches, and each branch is a set of non-zero focal elements with a probability distribution over the classes. Consider a database DB = {x1, . . . , x|DB|} where each instance has n attributes (i.e., xi = ⟨x1(i), . . . , xn(i)⟩) and each instance is categorized as belonging to one of the classes C = {C1, . . . , C|C|}. Unless otherwise stated, we use uniformly distributed fuzzy sets with 50% overlap to discretize each continuous attribute universe, and we obtain a corresponding linguistic dataset by applying linguistic translation (Definition 3.7).


Fig. 4.4 Comparisons of crisp discretization and fuzzy discretization for decision tree
models. The decision tree in (a) has crisp boundaries that divide the data into 3 non-
overlapping areas A1 , A2 and A3 . In (b), x1 , x2 , y1 and y2 are defined by fuzzy functions
that embed more robustness with blurred boundaries

Table 4.1 Important notation for the linguistic decision tree model

DB: a database of size |DB|, {x1, . . . , x|DB|}
xi: an n-dimensional instance, xi ∈ DB for i = 1, . . . , |DB|
Lj: a set of linguistic labels defined on attribute j, j = 1, . . . , n
Fj: the focal set on attribute j given Lj, j = 1, . . . , n; |Fj| = 2|Lj| − 1 if the labels Lk ∈ Lj (k = 1, . . . , |Lj|) have 50% overlap
C: a set of classes of size |C|, {C1, . . . , C|C|}
T: a linguistic decision tree containing |T| branches, {B1, . . . , B|T|}
B (set of branches): B = {B1, . . . , BM}; T ≡ B iff M = |T|
B (single branch): B = ⟨F1, . . . , F|B|⟩, a branch of an LDT with |B| focal elements, where the focal elements Fi, Fj ∈ B are defined on different attributes

A linguistic decision tree is a decision tree in which the nodes are random set label descriptions of attributes and the branches correspond to particular focal elements, built from DB.
Definition 4.1 (Linguistic decision tree) A linguistic decision tree is a set of branches with associated class probabilities of the following form:

T = {⟨B1, P(C1|B1), . . . , P(C|C||B1)⟩, . . . , ⟨B|T|, P(C1|B|T|), . . . , P(C|C||B|T|)⟩}

where |T| is the number of branches of the linguistic decision tree T.


A branch B is defined as a set of focal elements

B = ⟨F1, . . . , F|B|⟩

where Fj ∈ Fj, and Fj is the focal set for attribute j (Definition 3.6). |B|, the length of the branch, corresponds to the number of component nodes (attributes) and is less than or equal to n, the number of attributes.
Within an LDT (see Fig. 4.5), each node is an attribute that can be split into branches according to the focal elements of this node (attribute). An attribute is not allowed to appear more than once in a branch, and an attribute which is not currently part of a branch is referred to as a free attribute.


Fig. 4.5 An example of a linguistic decision tree in a binary classification problem, where
each attribute is discretized by two fuzzy labels: small and large. The tree has 7 branches and
each leaf from LF1 to LF7 is labeled with a class distribution

Definition 4.2 (Free attributes) The set of attributes free to use for expanding a
given branch B is defined by

ATTB = {xj | ∀F ∈ Fj : F ∉ B}

In an LDT, the length of the longest branch Dep(T ) is called the depth of the LDT,
which is also less than or equal to n:

Dep(T ) ≤ n (4.7)

Each branch has an associated probability distribution on the classes. For example, the LDT shown in Fig. 4.5 might be obtained from training, where the branch LF6,

⟨{large1}, {small2, large2}⟩, 0.7, 0.3

means that the probability of class C1 is 0.7 and of C2 is 0.3, given that attribute 1 can only be described as large and attribute 2 can be described as small and large. We need to be aware that linguistic expressions such as small, medium or large are not necessarily the same for each attribute, since they are defined independently on each attribute; e.g., large2 means the fuzzy label large defined on attribute 2.

4.3.1 Branch Probability


According to the definition of an LDT (Definition 4.1), given a branch of an LDT in the form B = ⟨F1, . . . , F|B|⟩, the probability of class Ct (t = 1, . . . , |C|) given B can be evaluated from a training set DB. First, we consider the probability of a branch B given a particular example x ∈ DB, where x = ⟨x1, . . . , xn⟩ ∈ Ω1 × · · · × Ωn. This probability is evaluated by

P(B|x) = ∏r=1,...,|B| mxr(Fr) (4.8)

where mxr (Fr ) are the associated masses of data element xr for r = 1, . . . , |B|.
Basically, the above equation can be justified as follows:

P(B|x) = P(Dx1 = F1 , . . . , Dx|B| = F|B| |x1 , . . . , xn ) (4.9)

where Dx1 = F1 , . . . , Dx|B| = F|B| are conditionally independent, so that we can obtain

P(Dx1 = F1, . . . , Dx|B| = F|B| | x1, . . . , xn) = ∏r=1,...,|B| P(Dxr = Fr|xr) (4.10)
= ∏r=1,...,|B| mxr(Fr) (4.11)

Based on Eq. (4.8), the probability of class Ct given B can then be evaluated by

P(Ct|B) = ∑i∈DBt P(B|xi) / ∑i∈DB P(B|xi) (4.12)
where DBt is the subset consisting of instances which belong to class Ct .

DBt = {xi |xi → Ct } : i = 1, . . . , |DB|.

The denominator may equal zero (i.e., ∑i∈DB P(B|xi) = 0); this can occur when the training database for the LDT is small, so that no linguistic data with non-zero mass is covered by the branch. In this case we obtain no information from the database, and equal probabilities are assigned to each class:

P(Ct|B) = 1/|C| for t = 1, . . . , |C| (4.13)
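A minimal sketch of Eqs. (4.8), (4.12) and (4.13) is given below. The data structures (a branch as a list of (attribute, focal element) pairs, and an instance as a per-attribute dictionary of masses) are assumptions chosen for illustration rather than the book's implementation.

def branch_prob(branch, masses):
    # P(B|x), Eq. (4.8): product of the masses of the branch's focal elements.
    p = 1.0
    for attr, focal in branch:
        p *= masses[attr].get(focal, 0.0)
    return p

def class_prob(branch, dataset, classes):
    # P(Ct|B), Eqs. (4.12)/(4.13): class-conditional sums of branch probabilities.
    totals = {c: 0.0 for c in classes}
    for masses, label in dataset:          # dataset: list of (mass dict, class label)
        totals[label] += branch_prob(branch, masses)
    denom = sum(totals.values())
    if denom == 0.0:                       # branch covers no data: fall back to Eq. (4.13)
        return {c: 1.0 / len(classes) for c in classes}
    return {c: totals[c] / denom for c in classes}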

In the process of building a linguistic decision tree, if one of the class probabilities
reaches a certain threshold at a particular depth, for example 0.9, then we might take
the view that this branch is sufficiently discriminating and that further expansion
may lead to overfitting. In this case terminating the tree expansion at the current
depth will probably help maintain accuracy on the test set. To this end, we employ
a threshold probability to determine whether or not a particular branch should
terminate.

Definition 4.3 (Threshold probability of a LDT) In the process of building a


linguistic decision tree, if the maximum class probability, given a particular branch,
is greater than, or equal to a given threshold probability T , then the branch will be
terminated at the current depth.

Obviously, when using this probability-based thresholding, the branches of a


tree may have different lengths. For example, see Fig. 4.5 , where the threshold
probability T = 0.9, so that the 4th branch {small1 , large1 } is terminated at the
depth 1 while the other branches expand to the next depth.
In the above discussions we have been concerned about continuous (or
numerical) attributes, but can we learn with discrete (or nominal) attributes? One
problem is that the values of discrete attributes may not have a natural ordering
like continuous ones. For example, values for a person’s age can be sorted in
an increasing manner so that the labels young, middle-aged and old can be
meaningfully defined by fuzzy sets. However, if we consider the gender of a person,
there are only two possible values male or female, which are unordered. Hence,
partitioning discrete attribute domains using fuzzy labels is problematic. Instead, we do not attempt to group discrete values, but treat them as distinct labels which do not overlap with each other. Hence, the focal elements for the attribute "gender" are {male} and {female}. In this representation the associated mass for each focal element is binary, i.e., either zero or one. For instance,

mgender({male}) = 1 if gender = male, and 0 otherwise.

Missing values can be handled simply by assigning equal masses to the corresponding focal elements. For example, in the database shown in Table 4.2, the 4th instance has a missing value for Attribute 1. Instead of using ad-hoc pre-processing techniques (for instance, some pre-processing techniques treat missing values as a new value, "missing", for nominal attributes [13]), we simply assign equal masses to the focal elements of this missing value.

Table 4.2 An example of a small-scale artificial dataset after linguistic translation. Each attribute has 2 independently defined fuzzy labels: small (s) and large (l)

# Instance | Attribute 1 (x1): {s1} / {s1, l1} / {l1} | Attribute 2 (x2): {s2} / {s2, l2} / {l2} | Class
1 | 0 / 0.4 / 0.6 | 0 / 0.7 / 0.3 | +
2 | 0.2 / 0.8 / 0 | 0.5 / 0.5 / 0 | −
3 | 0 / 0.9 / 0.1 | 1 / 0 / 0 | −
4 | 0.333 / 0.333 / 0.333 | 0 / 1 / 0 | +
5 | 0 / 1 / 0 | 0.3 / 0.7 / 0 | +

Example 4.1 Consider a two-class problem with 2 attributes, where L1 = {small1


(s1 ), large1 (l1 )} and L2 ={small2 (s2 ), large2 (l2 )}. We assume the focal set F1 =
{{s1 }, {s1 , l1 }, {l1 }} and F2 = {{s2 }, {s2 , l2 }, {l2 }}. Suppose that the database
generated from the linguistic translation from the original training database is given
in Table 4.2 , and it has two target classes, positive (+) and negative (−) where

DB+ = {x1 , x4 , x5 }, DB− = {x2 , x3 }

Now suppose we are given two branches of the form:

B1 = ⟨{small1}, {small2}⟩, P(+|B1), P(−|B1)

B2 = ⟨{small1, large1}, {small2, large2}⟩, P(+|B2), P(−|B2)

These two branches are evaluated according to Eqs. (4.8) and (4.12) (or Eq. (4.13)):

P(+|B1) = ∑i=1,4,5 P(B1|xi) / ∑i=1,...,5 P(B1|xi)
= ∑i=1,4,5 mx1(i)({s1}) × mx2(i)({s2}) / ∑i=1,...,5 mx1(i)({s1}) × mx2(i)({s2})
= (0×0 + 0.333×0 + 0×0.3) / (0×0 + 0.2×0.5 + 0×1 + 0.333×0 + 0×0.3) = 0

P(−|B1) = ∑i=2,3 mx1(i)({s1}) × mx2(i)({s2}) / ∑i=1,...,5 mx1(i)({s1}) × mx2(i)({s2})
= (0.2×0.5 + 0×1) / (0×0 + 0.2×0.5 + 0×1 + 0.333×0 + 0×0.3) = 0.1/0.1 = 1

P(+|B2) = ∑i=1,4,5 mx1(i)({s1, l1}) × mx2(i)({s2, l2}) / ∑i=1,...,5 mx1(i)({s1, l1}) × mx2(i)({s2, l2})
= (0.4×0.7 + 0.333×1 + 1×0.7) / (0.4×0.7 + 0.8×0.5 + 0.9×0 + 0.333×1 + 1×0.7) = 0.767

P(−|B2) = ∑i=2,3 mx1(i)({s1, l1}) × mx2(i)({s2, l2}) / ∑i=1,...,5 mx1(i)({s1, l1}) × mx2(i)({s2, l2})
= (0.8×0.5 + 0.9×0) / (0.4×0.7 + 0.8×0.5 + 0.9×0 + 0.333×1 + 1×0.7) = 0.233
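The calculations of Example 4.1 can be verified with a few lines of code. The sketch below encodes the mass assignments of Table 4.2 (with shortened, hypothetical key names) and applies Eqs. (4.8) and (4.12).

# Mass assignments from Table 4.2: one dictionary per attribute, plus the class label.
data = [
    ({"x1": {"s1": 0.0, "s1l1": 0.4, "l1": 0.6}, "x2": {"s2": 0.0, "s2l2": 0.7, "l2": 0.3}}, "+"),
    ({"x1": {"s1": 0.2, "s1l1": 0.8, "l1": 0.0}, "x2": {"s2": 0.5, "s2l2": 0.5, "l2": 0.0}}, "-"),
    ({"x1": {"s1": 0.0, "s1l1": 0.9, "l1": 0.1}, "x2": {"s2": 1.0, "s2l2": 0.0, "l2": 0.0}}, "-"),
    ({"x1": {"s1": 0.333, "s1l1": 0.333, "l1": 0.333}, "x2": {"s2": 0.0, "s2l2": 1.0, "l2": 0.0}}, "+"),
    ({"x1": {"s1": 0.0, "s1l1": 1.0, "l1": 0.0}, "x2": {"s2": 0.3, "s2l2": 0.7, "l2": 0.0}}, "+"),
]

def class_probs(branch):
    # P(+|B) and P(-|B) via Eqs. (4.8) and (4.12).
    totals = {"+": 0.0, "-": 0.0}
    for masses, label in data:
        p = 1.0
        for attr, focal in branch:
            p *= masses[attr][focal]
        totals[label] += p
    denom = sum(totals.values())
    return {c: t / denom for c, t in totals.items()} if denom else {"+": 0.5, "-": 0.5}

print(class_probs([("x1", "s1"), ("x2", "s2")]))       # {'+': 0.0, '-': 1.0}
print(class_probs([("x1", "s1l1"), ("x2", "s2l2")]))   # approximately {'+': 0.767, '-': 0.233}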

4.3.2 Classification by LDT


Now consider classifying an unlabeled instance of the form x = ⟨x1, . . . , xn⟩, which may not be contained in the training dataset DB. We apply a linguistic translation to x based on the fuzzy covering of the training data DB. In the case that a data element falls beyond the range [Rmin, Rmax] of the training dataset for a particular attribute, we assign the appropriateness degrees of Rmin or Rmax to the element, depending on which side of the range it falls:

μFi(x) = μFi(Rmin) if x < Rmin, and μFi(x) = μFi(Rmax) if x > Rmax,

where i = 1, . . . , |F|. Jeffrey's rule,

P(a) = P(a|b)P(b) + P(a|¬b)P(¬b) (4.14)

is used for classifying a new data element, where P(b) and P(¬b) are considered as the beliefs in b and not b, respectively [11]. This can be generalized to the case where a new condition c is given (with the implicit assumption that a is conditionally independent of c given b):

P(a|c) = P(a|b)P(b|c) + P(a|¬b)P(¬b|c) (4.15)

Hence, we can evaluate the probability of class Ct based on a given LDT, T, using Jeffrey's rule as follows:

P(Ct|x) = ∑v=1,...,|T| P(Ct|Bv)P(Bv|x) (4.16)



where P(Bv |x) and P(Ct |Bv ) are evaluated according to Eqs. (4.8) and (4.12) (or Eq.
(4.13)), respectively. In classical decision trees, classification is made according to
the class label of the branch in which the data falls. In our approach, the data for
classification partially satisfies the constraints represented by a number of branches
and the probability estimates across the whole decision tree are then used to obtain
an overall classification.
Example 4.2 Suppose we are given the linguistic decision tree shown in Fig. 4.5 for a two-class problem with F1 = {{small1}, {small1, large1}, {large1}} and F2 = {{small2}, {small2, large2}, {large2}}. A data element y = ⟨y1, y2⟩ for classification is given such that μsmall1(y1) = 1, μlarge1(y1) = 0.4 and μsmall2(y2) = 0.2, μlarge2(y2) = 1.

The LDT given in Fig. 4.5 can be written as

LDT = {B1, B2, B3, B4, B5, B6, B7} = {
⟨{small1}, {small2}⟩, 0.3, 0.7,
⟨{small1}, {small2, large2}⟩, 0.5, 0.5,
⟨{small1}, {large2}⟩, 0.6, 0.4,
⟨{small1, large1}⟩, 0.1, 0.9,
⟨{large1}, {small2}⟩, 0.6, 0.4,
⟨{large1}, {small2, large2}⟩, 0.7, 0.3,
⟨{large1}, {large2}⟩, 0.2, 0.8 }

The mass assignments on y are

my1 = {small1 , large1 } : 0.4, {small1 } : 0.6


my2 = {small2 , large2 } : 0.2, {large2 } : 0.8
According to Eq. (4.8), we obtain

P(B1 |y) = P(B5 |y) = P(B6 |y) = P(B7 |y) = 0

P(B2 |y) = my1 ({small1 }) × my2 ({small2 , large2 }) = 0.6 × 0.2 = 0.12
P(B3 |y) = my1 ({small1 }) × my2 ({large2 }) = 0.6 × 0.8 = 0.48
P(B4 |y) = my1 ({small1 , large1 }) = 0.4
Hence, based on Jeffrey's rule (Eq. (4.16)), we can obtain

P(C1|y) = ∑v=1,...,7 P(C1|Bv)P(Bv|y) = ∑v=2,3,4 P(C1|Bv)P(Bv|y) = 0.12×0.5 + 0.48×0.6 + 0.4×0.1 = 0.388

P(C2|y) = ∑v=1,...,7 P(C2|Bv)P(Bv|y) = ∑v=2,3,4 P(C2|Bv)P(Bv|y) = 0.12×0.5 + 0.48×0.4 + 0.4×0.9 = 0.612

Usually, the decision threshold for a probabilistic classifier is 0.5 without assuming
any other prior information. Therefore, in this example, y is classified as C2
because P(C2 |y) > 0.5. However, in cost-sensitive learning, the decision threshold
is not necessarily 0.5 when considering the misclassification cost and prior class
distribution.
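Example 4.2 can likewise be reproduced in code. The sketch below first converts the appropriateness degrees of y into masses using a consonant mass assignment (an assumed construction that reproduces the masses quoted above) and then applies Eq. (4.8) and Jeffrey's rule, Eq. (4.16).

def masses_from_memberships(mu_small, mu_large):
    # Assumed consonant mass assignment for two labels with 50% overlap: the overlap
    # set receives min(mu_small, mu_large) and the remainder goes to the dominant
    # single label.  This reproduces the masses used in Example 4.2.
    both = min(mu_small, mu_large)
    return {"small": max(mu_small - mu_large, 0.0),
            "both": both,
            "large": max(mu_large - mu_small, 0.0)}

m1 = masses_from_memberships(1.0, 0.4)   # attribute 1: small 0.6, both 0.4, large 0.0
m2 = masses_from_memberships(0.2, 1.0)   # attribute 2: small 0.0, both 0.2, large 0.8

# The LDT of Fig. 4.5: each branch is (focal elements per attribute, P(C1|B), P(C2|B)).
# The branch of length 1 (B4) only constrains attribute 1.
ldt = [((("x1", "small"), ("x2", "small")), 0.3, 0.7),
       ((("x1", "small"), ("x2", "both")), 0.5, 0.5),
       ((("x1", "small"), ("x2", "large")), 0.6, 0.4),
       ((("x1", "both"),), 0.1, 0.9),
       ((("x1", "large"), ("x2", "small")), 0.6, 0.4),
       ((("x1", "large"), ("x2", "both")), 0.7, 0.3),
       ((("x1", "large"), ("x2", "large")), 0.2, 0.8)]

masses = {"x1": m1, "x2": m2}
p1 = p2 = 0.0
for branch, pc1, pc2 in ldt:
    pb = 1.0
    for attr, focal in branch:     # Eq. (4.8)
        pb *= masses[attr][focal]
    p1 += pc1 * pb                 # Eq. (4.16), Jeffrey's rule
    p2 += pc2 * pb

print(round(p1, 3), round(p2, 3))  # 0.388 0.612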
The above examples have shown how to calculate the class probabilities and how to use them in classification. In the next section, we introduce the algorithm for building a linguistic decision tree.

4.3.3 Linguistic ID3 Algorithm


Linguistic ID3 (LID3) is the learning algorithm we propose for building the
linguistic decision tree based on a given linguistic database. Similar to the ID3
algorithm [1] , search is guided by an information based heuristic, but the information
measurements of an LDT are modified in accordance with label semantics. The
measure of information defined for a branch B can be viewed as an extension of the
entropy measure used in ID3.
Definition 4.4 (Branch entropy) The entropy of branch B given a set of classes
C = {C1 , . . . ,C|C| } is

E(B) = − ∑t=1,...,|C| P(Ct|B) log2 P(Ct|B) (4.17)

Now, given a particular branch B, suppose we want to expand it with the attribute x j .
The evaluation of this attribute will be given based on the expected entropy defined
as follows:
Definition 4.5 (Expected entropy)

EE(B, xj) = ∑Fj∈Fj E(B ∪ Fj) · P(Fj|B) (4.18)

where B ∪ Fj represents the new branch obtained by appending the focal element Fj
to the end of branch B. The probability of Fj given B can be calculated as follows:

P(Fj|B) = ∑i∈DB P(B ∪ Fj|xi) / ∑i∈DB P(B|xi) (4.19)
We can now define the Information Gain (IG) obtained by expanding branch B with
attribute x j as:
IG(B, x j ) = E(B) − EE(B, x j ) (4.20)
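The quantities in Definitions 4.4 and 4.5 and Eq. (4.20) translate directly into code. In the sketch below, the helper callables prob_branch and prob_class are assumed to be supplied by the surrounding implementation (for example, built from Eqs. (4.8) and (4.12)); the function names and conventions are illustrative.

import math

def branch_entropy(class_probs):
    # E(B), Eq. (4.17): entropy of the class distribution given a branch.
    return -sum(p * math.log2(p) for p in class_probs.values() if p > 0)

def expected_entropy(branch, attr, focal_set, prob_branch, prob_class):
    # EE(B, x_j), Eqs. (4.18)-(4.19).  prob_branch(B) returns the sum over the
    # training set of P(B|x) (the quantities appearing in Eq. (4.19)); prob_class(B)
    # returns the dictionary {Ct: P(Ct|B)} of Eq. (4.12).
    total = prob_branch(branch)
    ee = 0.0
    for focal in focal_set:
        extended = branch + [(attr, focal)]
        p_f_given_b = prob_branch(extended) / total if total > 0 else 0.0
        ee += branch_entropy(prob_class(extended)) * p_f_given_b
    return ee

def information_gain(branch, attr, focal_set, prob_branch, prob_class):
    # IG(B, x_j) = E(B) - EE(B, x_j), Eq. (4.20).
    return branch_entropy(prob_class(branch)) - expected_entropy(
        branch, attr, focal_set, prob_branch, prob_class)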

The goal of tree-structured learning models is to make the subregions partitioned by branches less "impure", in terms of the mixture of class labels, than the unpartitioned dataset. For a particular branch, the most suitable free attribute for further expansion (or partitioning) is the one for which the "pureness" is maximally increased by expanding, which corresponds to selecting the attribute
with maximum information gain. As with ID3 learning, the most informative
attribute will form the root of a linguistic decision tree, and the tree will expand
into branches associated with all possible focal elements of this attribute. For each
branch, the free attribute with maximum information gain will be the next node,
from level to level, until the tree reaches the maximum specified depth or the
maximum class probability reaches the given threshold probability.

Algorithm 3: Linguistic decision tree learning

input: LD: linguistic dataset obtained from Algorithm 1.
output: T: linguistic decision tree.
Set a maximum depth Mdep and a threshold probability T.
for l ← 0 to Mdep do
  if l = 0 then B ← ∅
  Let Bl = {B1, . . . , B|Bl|} be the set of branches of the LDT at depth l
  for v ← 1 to |Bl| do
    for t ← 1 to |C| do
      Calculate the conditional probability P(Ct|Bv) = ∑i∈DBt P(Bv|xi) / ∑i∈DB P(Bv|xi)
      if P(Ct|Bv) ≥ T then break (terminate this branch)
    if there exists a free attribute xj then
      for each free attribute xj do calculate IG(Bv, xj) = E(Bv) − EE(Bv, xj)
      IGmax(Bv) = maxxj IG(Bv, xj)
      Expand Bv with xmax, the free attribute giving the maximum value IGmax:
      Bv ← ∪Fj∈Fj {Bv ∪ Fj}
    else
      exit
  Bl+1 ← the union of the branch sets obtained by expanding the branches at depth l
T ← B

(1) Linguistic translation: translate real-valued data into linguistic form (see Definition 3.7). We first discretize the continuous universe of each attribute with fuzzy sets, uniformly or non-uniformly (see Section 3.3). For each data element, we find the appropriateness degrees, which are used in the subsequent calculation of mass assignments; the resulting linguistic database is saved for use in the second step (a sketch follows this list).
(2) Decision tree learning: a linguistic decision tree is developed level by level according to the information heuristics. At each level, we examine each branch, calculating its class probabilities and comparing them with the given threshold probability to determine whether it should be terminated, until the maximum depth has been reached or all branches are terminated. The pseudo-code of the tree learning process is shown in Algorithm 3.
(3) Classification: given an LDT, we classify a data element according to the class probabilities given the branches of the tree, using Eq. (4.16).
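As a rough illustration of the linguistic translation in step (1), the sketch below covers a single numeric attribute with two overlapping labels and converts each value into a mass assignment over the focal set. The simple linear membership functions and the consonant mass assignment are illustrative assumptions, not the exact construction of Definition 3.7 and Section 3.3.

def small_large_memberships(x, lo, hi):
    # Assumed, simplified covering of [lo, hi] by two overlapping labels:
    # 'small' decreases linearly over the range and 'large' increases linearly.
    t = min(max((x - lo) / (hi - lo), 0.0), 1.0)
    return {"small": 1.0 - t, "large": t}

def consonant_masses(mu):
    # Assumed consonant mass assignment over the focal set {{small}, {small, large}, {large}}.
    both = min(mu["small"], mu["large"])
    return {"{small}": max(mu["small"] - mu["large"], 0.0),
            "{small,large}": both,
            "{large}": max(mu["large"] - mu["small"], 0.0)}

# Linguistic translation of a small numeric column.
column = [0.1, 0.35, 0.8]
lo, hi = min(column), max(column)
for x in column:
    print(x, consonant_masses(small_large_memberships(x, lo, hi)))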

4.4 Experimental Studies

We evaluated the LID3 algorithm using 14 datasets taken from the UCI Machine Learning Repository [9]. These datasets have representative properties of real-world data, such as missing values, multiple classes, mixed-type data (numerical and nominal) and unbalanced class distributions. Table 4.3 shows, for each dataset, the number of classes, the number of instances, the numbers of numerical (Num.) and nominal (Nom.) attributes, and whether or not the dataset contains missing values.

Table 4.3 Descriptions of datasets from the UCI machine learning repository. Other details about these datasets are available in Reference [9]

# | Dataset | Classes | Size (instances) | Missing Values | Num. | Nom.
1 | Balance | 3 | 625 | no | 4 | 0
2 | Breast-Cancer | 2 | 286 | yes | 3 | 6
3 | Breast-w | 2 | 699 | no | 9 | 0
4 | Ecoli | 8 | 336 | no | 7 | 1
5 | Glass | 6 | 214 | no | 9 | 0
6 | Heart-c | 2 | 303 | yes | 6 | 7
7 | Heart-Statlog | 2 | 270 | no | 7 | 6
8 | Heptitis | 2 | 155 | yes | 6 | 13
9 | Ionosphere | 2 | 351 | no | 34 | 0
10 | Iris | 3 | 150 | no | 4 | 0
11 | Liver | 2 | 345 | no | 6 | 0
12 | Pima | 2 | 768 | no | 8 | 0
13 | Sonar | 2 | 208 | no | 60 | 0
14 | Wine | 3 | 178 | no | 14 | 0

In the following experiments, unless otherwise stated, attributes are discretized by 2 trapezoidal fuzzy sets with 50% overlap, and the examples of each class are evenly split into two sub-datasets, one half for training and the other half for testing. This is referred to as a 50-50 split experiment. The maximum depth is set manually and the results show the best performance of LID3 across a range of depth settings. We also test the LID3 algorithm with different threshold probabilities T ranging from 0.6 to 1.0 in steps of 0.1, and for three different fuzzy discretization methods: uniform, entropy-based and percentile-based (see Section 3.3). For each dataset, we ran the 50-50 random split experiment 10 times. The average test accuracy with standard deviation is shown on the right-hand side of Table 4.5, and the threshold probability and the depth at which this accuracy is obtained are listed in Table 4.4.

Table 4.4 Summary of the threshold probabilities and depths giving the best accuracy for the different discretization methods on the given datasets

# | LID3-Uniform: Threshold / Best Depth | LID3-Entropy: Threshold / Best Depth | LID3-Percentile: Threshold / Best Depth
1 | 1.0 / 4 | 1.0 / 4 | 1.0 / 4
2 | 0.7 / 2 | 0.7 / 2 | 0.7 / 2
3 | 1.0 / 4 | 1.0 / 3 | 1.0 / 3
4 | 1.0 / 7 | 1.0 / 7 | 1.0 / 7
5 | 0.9 / 9 | 0.8 / 9 | 0.8 / 8
6 | 0.9 / 3 | 0.9 / 4 | 0.9 / 3
7 | 0.9 / 3 | 0.9 / 3 | 0.9 / 4
8 | 0.9 / 4 | 0.9 / 4 | 0.9 / 3
9 | 0.9 / 6 | 0.9 / 6 | 0.9 / 6
10 | 1.0 / 3 | 1.0 / 3 | 1.0 / 3
11 | 0.9 / 5 | 1.0 / 5 | 1.0 / 5
12 | 0.9 / 5 | 0.9 / 4 | 0.9 / 3
13 | 1.0 / 8 | 1.0 / 8 | 1.0 / 8
14 | 1.0 / 4 | 1.0 / 5 | 1.0 / 5

4.4.1 Influence of the Threshold


As can be seen from the results, the best accuracy is usually obtained with high threshold probabilities, T = 0.9 or T = 1.0, especially for datasets with only numerical attributes (such as Breast-w, Iris, Balance and Wine) or where numerical attributes play important roles in learning (Ecoli, Heptitis).

Table 4.5 Accuracy (with standard deviation) of LID3 based on different discretization
methods and three other well-known machine learning algorithms

# C4.5 Naive Bayes Neural Network LID3-Uniform LID3-Entropy LID3-Percentile


1 79.20±1.53 89.46±2.09 90.38±1.18 83.80±1.19 83.07±3.22 86.23±0.97
2 69.16±4.14 71.26±2.96 66.50±3.48 73.06±3.05 73.47±2.66 73.06±3.05
3 94.38±1.42 96.28±0.73 94.96±0.80 96.43±0.70 96.11±0.78 96.11±0.89
4 78.99±2.23 85.36±2.42 82.62±3.18 85.41±1.94 86.53±1.28 85.59±2.19
5 64.77±5.10 45.99±7.00 64.30±3.38 65.96±2.31 65.60±2.57 65.87±2.32
6 75.50±3.79 84.24±2.09 79.93±3.99 76.71±3.81 78.09±3.58 77.96±2.88
7 75.78±3.16 84.00±1.68 78.89±3.05 76.52±3.63 78.07±3.63 79.04±2.94
8 76.75±4.68 83.25±3.99 81.69±2.48 82.95±2.42 83.08±2.82 83.08±1.32
9 89.60±2.13 82.97±2.51 87.77±2.88 88.98±2.23 89.11±2.30 88.01±1.83
10 93.47±3.23 94.53±2.63 95.87±2.70 96.00±1.26 96.13±1.60 96.40±1.89
11 65.23±3.86 55.41±5.39 66.74±4.89 58.73±1.99 64.62±2.80 69.25±2.84
12 72.16±2.80 75.05±2.37 74.64±1.41 76.22±1.81 76.22±1.85 76.54±1.34
13 70.48±0.00 70.19±0.00 81.05±0.00 86.54±0.00 87.50±0.00 89.42±0.00
14 88.09±4.14 96.29±2.12 96.85±1.57 95.33±1.80 95.78±1.80 95.89±1.96

Recent work on PETs (Probability Estimation Trees) [12] also suggests that fully expanded estimation trees give better performance than pruned trees. (In classical decision tree learning, such as ID3 and C4.5, pruning can reduce overfitting, so that pruned trees generalize better and perform better than unpruned trees; however, this is not the case for probability estimation trees [12].) The reason for this is that the heuristics used to generate small and compact trees by pruning tend to reduce the quality of the probability estimates [12]. In this context, linguistic decision trees can be thought of as a type of probability estimation tree, but one in which the branches correspond to linguistic descriptions of objects. Strictly speaking, our linguistic decision tree model is a probability estimation tree, although we use the name "linguistic decision tree". The key difference between the two types of trees is that PETs give probability estimates according to which we can rank the examples given a target class [12]; we may obtain different classification results depending on a given decision threshold, which is related to the class distribution or cost matrix in cost-sensitive problems, whereas classical decision trees only give a crisp classification of examples.
The difference in accuracy resulting from varying the threshold probability T is quite data dependent. Figs. 4.6 to 4.8 show the results for the datasets given in Table 4.3. We consider the results for four typical datasets: Breast-w, Heart-Statlog, Glass and Breast-Cancer. In Breast-w, the accuracy curves are nested relative to increasing values of T: the models with high T values outperform those with lower T values at all depths. The datasets Iris, Balance, Sonar, Wine and Ecoli also behave in this way. On the other hand, for the datasets Heart-Statlog, Pima, Liver, Heart-C and Heptitis, the accuracy curve for T = 0.9 is better than those for all other T values at certain depths. In addition, for the datasets Glass and Ecoli, the accuracy curves for different T values are very close to each other and are even identical on some trials. For the Breast-Cancer dataset the accuracy actually decreases with increasing T. Overall, each of the datasets we tested follows one of these general trends.


Fig. 4.6 Comparisons of accuracy at different depths with threshold probability ranges from
0.6 to 1.0 on data set # 1 and # 2 in Table 4.3

4.4.2 Overlapping Between Fuzzy Labels


As we saw in Chapter 3, we usually use trapezoidal fuzzy sets with 50% overlap in linguistic translation. In this section, we investigate the influence of the degree of overlap on accuracy through some empirical studies.
First, we introduce a new parameter, PT, which measures the degree of overlap between fuzzy labels. PT = 0.5 represents 50% overlap between two neighboring fuzzy labels (e.g., see Fig. 4.9 (a)), and PT = 0 represents no overlap at all (see Fig. 4.9 (c)). The relation between different degrees of overlap and PT values is schematically shown in Fig. 4.9. Given fuzzy labels F and G, m is the distance from the center of a fuzzy label to the meeting point of the two fuzzy labels, and a is the length of the overlapping area. PT is then defined as follows:

PT = a/2m (4.21)

If there is 50% overlap between two neighboring fuzzy labels, a equals m, so that PT = 0.5. If there is no overlap at all, a = 0, so that PT = 0. Fig. 4.10 shows an example of fuzzy discretization with different PT values on a continuous universe.
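To make Eq. (4.21) concrete, the following sketch generates a uniform family of trapezoidal labels on [0, 1] for a given overlap parameter PT. The construction (uniform centres, with the trapezoid shoulders widened in proportion to PT) is an assumption chosen so that PT = 0 yields a crisp partition and PT = 0.5 yields 50% overlap; it is not necessarily the exact discretization used to produce Fig. 4.10.

def trapezoid_labels(n_labels, pt):
    # Membership functions for n_labels uniform trapezoidal labels on [0, 1].
    # pt is the overlap parameter of Eq. (4.21): PT = a/2m, where m is the distance
    # from a label centre to the meeting point with its neighbour and a is the
    # length of the overlap region.
    w = 1.0 / n_labels              # cell width, so m = w / 2
    m = w / 2.0
    a = 2.0 * m * pt                # overlap length, from PT = a/2m
    half_core, half_support = m - a / 2.0, m + a / 2.0

    def make(centre):
        def mu(x):
            d = abs(x - centre)
            if d <= half_core:
                return 1.0
            if d >= half_support:
                return 0.0
            return (half_support - d) / (half_support - half_core)
        return mu

    return [make((i + 0.5) * w) for i in range(n_labels)]

labels = trapezoid_labels(3, 0.5)                 # three labels with 50% overlap
print([round(mu(0.35), 2) for mu in labels])      # memberships sum to 1 in the overlap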
We tested 10 datasets taken from the UCI repository [9], and the average results with standard deviations, based on 10 runs of 50-50 split experiments, are shown in Figs. 4.11 and 4.12. As we can see from these figures, accuracy generally increases with

Fig. 4.7 Comparisons of accuracy at different depths with threshold probability ranges from
0.6 to 1.0 on data set #3 to #8 in Table 4.3

Fig. 4.8 Comparisons of accuracy at different depths with threshold probability ranges from
0.6 to 1.0 on data set from #9 to #14 in Table 4.3

Fig. 4.9 A schematic illustration of calculating the overlap parameter PT given different
degrees of overlaps

increasing PT values. Only the Ionosphere dataset is an exception. In order to keep the label semantics simple, it is not necessary to use fuzzy labels with more than 50% overlap. Hence, we use fuzzy labels with 50% overlap for discretization in the following experiments.

4.5 Comparison Studies


From the Table 4.4 , we also see that the optimal values of T and depth are
relatively invariant across the discretization techniques. Overall the entropy-based
and percentile-based discretization methods performed better than the uniform
discretization although no statistically significant difference was found between
the three methods. In order to conduct the comparison studies, we first start by
introducing the neural networks.
We now compare LID3 with different discretization with C4.5, Naive Bayes
Learning and Neural Networks  using 10 50-50 splits on each dataset and the
average accuracy and standard deviation for each test are shown in Table 4.5 .
The reason for choosing these three particular learning algorithms is as follows:
C4.5 is the most well-known tree induction algorithm, Naive Bayes is a simple
 WEKA is used to generate the results of J48 (C4.5 in WEKA) unpruned tree, Naive Bayes
and Neural Networks with default parameter settings. [13]

Fig. 4.10 Fuzzy discretization with focal elements with different overlapping degrees. The
degrees of overlapping are PT = 0.0, 0.3, 0.5, 0.8, 1.0 from subfigure (a) to (e)
Fig. 4.11 Average accuracy with standard deviation for different PT values (datasets: Balance, Ecoli, Glass, Heptitis, Iris and Liver)
Fig. 4.12 Average accuracy with standard deviation for different PT values (datasets: Pima, Ionosphere, Heart-C, Breast-Cancer, Breast-w and Wine)
We then carried out paired t-tests with a confidence level of 90% to compare LID3-Uniform, LID3-Entropy and LID3-Percentile with each of the three algorithms [3]. A summary of the results is shown in Table 4.6.

Table 4.6 Summary of comparisons of LID3 based on different discretization methods with three other well-known machine learning algorithms

| LID3-Uniform | LID3-Entropy | LID3-Percentile
vs. Decision Tree (C4.5) | 9 wins, 4 ties, 1 loss | 9 wins, 5 ties, 0 losses | 10 wins, 4 ties, 0 losses
vs. Naive Bayes | 3 wins, 8 ties, 3 losses | 7 wins, 4 ties, 3 losses | 7 wins, 4 ties, 3 losses
vs. Neural Network | 5 wins, 6 ties, 3 losses | 5 wins, 8 ties, 1 loss | 5 wins, 8 ties, 1 loss

Across the data sets, all LID3 algorithms (with different discretization
techniques) outperform C4.5, with LID3-Percentile achieving the best results with
10 wins, 4 ties and no losses. The performance of the Naive Bayes algorithm and
LID3-Uniform is roughly equivalent although LID3-Entropy and LID3-Percentile
outperform Naive Bayes. From Table 4.5 , we can see that the datasets on which
Naive Bayes outperforms LID3 are those with a mixture of continuous and discrete
attributes, namely Heart-C, Heart-Statlog and Heptitis. Most of the comparisons
with the Neural Network result in ties rather than wins or losses, especially
for LID3-Entropy and LID3-Percentile. Due to the limited number and type of datasets used for evaluation, we cannot draw the strong conclusion that LID3 outperforms all three of the other algorithms. However, we can at least conclude that, based on our experiments, LID3 outperforms C4.5 and has performance equivalent to Naive Bayes and Neural Networks. For the datasets with only numerical values, LID3 outperforms both C4.5 and Naive Bayes. Among the different discretization methods, the percentile-based and entropy-based approaches achieve better results than uniform discretization.

4.6 Merging of Branches

In the previous section, we showed that LID3 performs at least as well as, and often better than, three well-known classification algorithms across a range of datasets. However, even with only 2 fuzzy sets for discretization, the number of branches increases exponentially with the depth of the tree, and unfortunately the transparency of the LDT decreases as the number of branches increases. To help maintain transparency by generating more compact trees, a forward merging algorithm based on the LDT model is introduced in this section, and experimental results are given to support the validity of the approach.

4.6.1 Forward Merging Algorithm


If two adjacent branches have sufficiently similar class probabilities according to some criterion, they give similar classification results and can therefore be merged into one branch in order to obtain a more compact tree. See Fig. 4.13 for instance: in the first round of merging, the adjacent branches L2 and L3, and L5 and L6, are merged into two new branches. We employ a merging threshold to determine whether or not two adjacent branches can be merged.
Definition 4.6 (Merging threshold) In a linguistic decision tree, if the maximum
difference between class probabilities of two adjacent branches B1 and B2 is less
than or equal to a given merging threshold Tm , then the two branches can be merged
into one branch. Formally, if

Tm ≥ maxc∈C |P(c|B1) − P(c|B2)| (4.22)

where C = {C1, . . . , C|C|} is the set of classes, then B1 and B2 can be merged into one branch MB. A merged linguistic decision tree MT can be represented by a set of merged branches MB, or formally

MT = {MB1, . . . , MB|MT|}

where each merged branch MBj has a class distribution

⟨P(C1|MBj), . . . , P(C|C||MBj)⟩

Definition 4.7 (Merged branch) A merged branch MB with |MB| nodes is defined as

MB = ⟨Mj1, . . . , Mj|MB|⟩

where each node is a set of focal elements

Mj = {Fj1, . . . , Fj|Mj|}

such that Fji is adjacent to Fji+1 for i = 1, . . . , |Mj| − 1. If |Mj| > 1, the node is called a compound node, meaning that it is a compound of more than one focal element as a result of merging (e.g., see Fig. 4.13). The associated mass for Mj is given by

mx(Mj) = ∑i=1,...,|Mj| mx(Fji) (4.23)

where |Mj| is the number of merged adjacent focal elements for attribute j.


Based on Eq. (4.8), we can obtain

P(MB|x) = ∏r=1,...,|MB| mxr(Mr) (4.24)

Fig. 4.13 An illustration of the forward branch merging process. Given a merging threshold of 0.1, branches L2 and L3 are first merged into a new branch L2, and branches L5 and L6 are merged into a new branch L4. A further merging step combines the new L4 and L5, because the difference between their class probabilities is less than or equal to the merging threshold. Finally, the original LDT with 7 branches is merged into a new LDT with only 4 branches

Therefore, based on Eqs. (4.12), (4.13) and (4.23), we use the following formula to calculate the class probabilities given a merged branch:

P(Ct|MB) = ∑i∈DBt P(MB|xi) / ∑i∈DB P(MB|xi) (4.25)
When the merging algorithm is applied in learning a linguistic decision tree, adjacent branches meeting the merging criterion are merged and re-evaluated according to Eq. (4.25). The adjacent branches remaining after the first round of merging are then examined in further rounds, until no adjacent branches can be merged any further; we then proceed to the next depth. In Fig. 4.13, leaves L2 and L3,
L5 and L6 are merged in the first round of merging, and the new leaves L4 and L5 are
then merged further because they still meet the merging criteria in the second round
of merging. The merged branches can be represented by compound expressions that
will be described in the subsequent sections. The merging is applied as the tree
develops from the root to the maximum depth and hence is referred to as forward
merging.
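A sketch of one merging pass is shown below. Branches are represented only by their class-probability dictionaries, and the merged probabilities are averaged purely as a placeholder; in the actual algorithm they would be re-evaluated on the training data via Eq. (4.25).

def can_merge(p1, p2, t_m):
    # Merging condition of Eq. (4.22): maximum class-probability difference <= Tm.
    return max(abs(p1[c] - p2[c]) for c in p1) <= t_m

def merge_round(branches, t_m):
    # One left-to-right pass merging adjacent branches whose class probabilities
    # are within Tm of each other.  The averaging below is only a stand-in for
    # re-evaluating the merged branch with Eq. (4.25).
    merged = [branches[0]]
    for nxt in branches[1:]:
        if can_merge(merged[-1], nxt, t_m):
            merged[-1] = {c: (merged[-1][c] + nxt[c]) / 2 for c in nxt}
        else:
            merged.append(nxt)
    return merged

# Leaf probabilities of the first tree in Fig. 4.13 (branches L1 to L7).
leaves = [{"C1": 0.3, "C2": 0.7}, {"C1": 0.5, "C2": 0.5}, {"C1": 0.6, "C2": 0.4},
          {"C1": 0.1, "C2": 0.9}, {"C1": 0.6, "C2": 0.4}, {"C1": 0.7, "C2": 0.3},
          {"C1": 0.55, "C2": 0.45}]
print(len(merge_round(leaves, 0.1)))   # fewer branches after one pass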

4.6.2 Dual-Branch LDTs


Dual-Branch LDTs are a special LDT where each branch grain is two neighboring
focal elements. It is a special case of the merged tree introduced in the previous
section. Give a focal set with size l, from each node there are l − 1 branches.
For example, the focal set for a particular attribute x is F = {F1 , . . . , F7 } {{tiny},
{tiny, small}, {small}, {small, normal}, {normal}, {normal, large}, {large}}.
Then we have the branches from the node x: Bx = {B1 , . . . , B6 } = {{F1 , F2 },
{F2 , F3 }, {F3 , F4 }, {F4 , F5 }, {F5 , F6 }, {F6 , F7 }}. The LDT with such nodes are
referred to as a dual-branch LDT. The revised condition of probability of a focal
element Fy ∈ Fy that is appropriate to describe a goal given the branch B, can be
evaluated from a training set DB according to

∑i∈DBy ∏Nr=1 (mxir (Fj ) + mxir (Fj+1 ))


P(Fy |B) = (4.26)
∑i∈DB ∏Nr=1 (mxir (Fj ) + mxir (Fj+1 ))

Different from the normal LDTs, each branch of the dual-branch LDT has the
summed masses of the neighboring focal elements.
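The summed masses that characterize a dual-branch node can be computed directly; the list-based representation of the focal set below is an illustrative assumption.

def dual_branch_masses(focal_masses):
    # Given masses over an ordered focal set [F1, ..., Fl], return the masses of
    # the l-1 dual branches {F1, F2}, {F2, F3}, ..., i.e. the sums of the masses
    # of neighbouring focal elements.
    return [focal_masses[i] + focal_masses[i + 1] for i in range(len(focal_masses) - 1)]

# Example with the 7-element focal set ({tiny}, {tiny, small}, ..., {large}).
print(dual_branch_masses([0.0, 0.2, 0.8, 0.0, 0.0, 0.0, 0.0]))   # [0.2, 1.0, 0.8, 0.0, 0.0, 0.0]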

4.6.3 Experimental Studies for Forward Merging


We tested the forward merging algorithm on the UCI datasets listed in Table 4.3
with 10 50-50 split experiments and the results are shown in Table 4.7 . Obviously,
there is a tradeoff between the algorithm accuracy and the algorithm transparency in
terms of the number of leaves. The merging threshold Tm plays an important role in
the accuracy-transparency tradeoff problem. Algorithm accuracy tends to increase
while algorithm transparency decreases with decreasing Tm and vice versa.

Table 4.7 Comparisons of accuracy and the number of leaves (rules) |T| with different merging thresholds Tm across a set of UCI datasets [9]. The results for Tm = 0 are obtained with NF = 2, and the results for the other Tm values are obtained with the NF values listed in the second column of the table

# | NF | Tm = 0: Acc. / |T| | Tm = 0.1: Acc. / |T| | Tm = 0.2: Acc. / |T| | Tm = 0.3: Acc. / |T| | Tm = 0.4: Acc. / |T|
1 | 2 | 83.80 / 77 | 84.19 / 51 | 81.09 / 25 | 75.08 / 10 | 47.03 / 1
2 | 2 | 73.06 / 17 | 71.67 / 12 | 71.11 / 9 | 59.65 / 4 | 61.25 / 2
3 | 2 | 96.43 / 57 | 95.80 / 29 | 95.74 / 16 | 95.63 / 9 | 95.49 / 4
4 | 3 | 85.41 / 345 | 85.29 / 445 | 84.24 / 203 | 83.88 / 104 | 82.65 / 57
5 | 3 | 65.69 / 329 | 62.84 / 322 | 64.04 / 190 | 44.31 / 86 | 35.41 / 49
6 | 2 | 76.71 / 37 | 78.68 / 31 | 78.55 / 22 | 78.42 / 18 | 68.49 / 11
7 | 3 | 76.52 / 31 | 78.37 / 35 | 78.44 / 23 | 77.85 / 12 | 72.22 / 7
8 | 3 | 82.95 / 11 | 81.28 / 24 | 80.77 / 18 | 80.64 / 15 | 80.77 / 13
9 | 3 | 88.98 / 45 | 87.90 / 78 | 88.47 / 41 | 89.20 / 30 | 89.20 / 26
10 | 3 | 96.00 / 21 | 95.47 / 23 | 95.20 / 18 | 95.20 / 14 | 94.27 / 10
11 | 2 | 58.73 / 83 | 56.30 / 43 | 55.90 / 11 | 57.34 / 4 | 57.92 / 3
12 | 2 | 76.12 / 27 | 75.31 / 20 | 74.45 / 5 | 73.85 / 3 | 65.10 / 1
13 | 2 | 86.54 / 615 | 88.46 / 516 | 85.58 / 337 | 81.73 / 93 | 49.04 / 6
14 | 3 | 95.33 / 67 | 93.78 / 80 | 94.11 / 50 | 93.56 / 36 | 89.67 / 24

The number of fuzzy sets NF in the merging algorithm is also a key parameter.
Compared to NF = 3, setting NF = 2 can achieve better transparency, but for some
datasets, with NF = 2, the accuracy is greatly reduced although the resulting trees
have significantly fewer branches. For example, Figs. 4.14 and 4.15 show the
change in test accuracy and the number of leaves (or the number of rules interpreted
from a LDT) for different Tm on the Breast-w dataset. Fig. 4.14 is with NF = 2 and
Fig. 4.15 with NF = 3. Fig. 4.14 shows that the accuracy is not greatly influenced
by merging, but the number of branches is greatly reduced. This is especially true for the curve marked by '+', corresponding to Tm = 0.3: with forward merging, the best accuracy (at depth 4) is reduced by only approximately 1%, whereas the number of branches is reduced by roughly 84%. However, in Fig. 4.15, at depth 4 with Tm = 0.3, the accuracy is also reduced by about 1% but the number of branches is only reduced by 55%. So, for this dataset, we should choose NF = 2 rather than NF = 3.
However, this is not always the case. For the dataset Iris, the change in accuracy and the number of branches against depth with NF = 2 and NF = 3 is shown in Figs. 4.16 and 4.17, respectively. As we can see from Fig. 4.16, applying the forward merging algorithm changes the accuracy greatly.

Fig. 4.14 The change in accuracy and number of leaves as Tm varies in the Breast-w dataset
with NF = 2


Fig. 4.15 The change in accuracy and number of leaves as Tm varies in the Breast-w dataset with NF = 3; the curve for Tm = 0 is obtained with NF = 2

Fig. 4.16 The change in accuracy and number of leaves as Tm varies in the Iris dataset with
NF = 2


Fig. 4.17 The change in accuracy and number of leaves as Tm varies in the Iris dataset with NF = 3. The Tm = 0 curve (dotted) is obtained with NF = 2

The best accuracy with merging is roughly 10% worse than with the non-merging algorithm. For NF = 3, however, as we can see from Fig. 4.17, the accuracy is not reduced as much as with NF = 2, while the number of branches is still reduced relative to the Tm = 0 result obtained with NF = 2. In this case we should prefer NF = 3.
Table 4.7 shows the results with the optimal NF and with Tm ranging from 0 to 0.4, where Tm = 0 represents no merging. Acc. denotes the average accuracy over 10 runs of experiments and |T| is the average number of rules (leaves). Unless otherwise stated, the results in this section are obtained with the probability threshold set to T = 1. The results for Tm from 0.1 to 0.4 are obtained at the depth at which the optimal accuracy is found when Tm = 0. As we can see from the table, in most cases the accuracy before and after merging is not significantly different, but the number of branches is dramatically reduced. In some cases, the merging algorithm even outperforms LID3 without merging. A possible reason is that the merging algorithm generates self-adapting granularities based on class probabilities. Compared to other methods that discretize attributes independently, merging may generate a more reasonable tree with more appropriate information granules.

4.6.4 ROC Analysis for Forward Merging


In this section, we use ROC analysis to study the forward merging algorithm for LDTs. Six binary datasets from the UCI machine learning repository are tested here: Breast-Cancer, Wisconsin-Cancer, Heart-C, Heart-S, Hepatitis and Pima Indian [9]. Descriptions of these datasets, including the number of examples, the number of attributes (features) and whether the attributes are a mixture of numerical and nominal values, are given in Table 4.3. The LDT parameters are set individually for each dataset: the number of fuzzy sets used for discretization (NF) is shown in Table 4.8. The maximum depth for the Breast-Cancer dataset is 2 and for the other five datasets is 3. These parameter settings are based on a few trial-and-error experiments [10]. For each dataset, the examples were divided equally into two subsets, one for training and the other for testing. This is referred to as a 50-50 split experiment.
Table 4.8 shows the average AUC and standard deviation from 10 runs of 50-50 split experiments with the merging threshold Tm ranging from 0 (no merging) to 0.3. The average size of the trees |T| over the 10 runs is also shown, where the size of a tree is measured by the number of branches (this also corresponds to the number of rules that can be extracted from an LDT). According to a t-test with confidence level 0.9, the AUC values for the merged LDTs are not reduced significantly compared to the non-merging case. Indeed, for some datasets (e.g., Breast-Cancer) the merged trees even perform a little better than the non-merged trees, although no statistically significant differences are found. On the other hand, the tree sizes are reduced significantly. These facts can be seen from Figs. 4.18 and 4.19: Fig. 4.18 compares the average AUC values and Fig. 4.19 compares the number of branches. A possible reason for this is that the merging algorithm generates self-adapting granularities based on class probabilities. Compared to other methods that discretize attributes independently, merging may generate more reasonable trees with more appropriate information granules. However, this claim still needs more investigation.

Table 4.8 Mean AUC values with standard deviation of six data sets with different merging
thresholds

Data               NF   Tm = 0             Tm = 0.1           Tm = 0.2           Tm = 0.3
                        AUC          |T|   AUC          |T|   AUC          |T|   AUC          |T|
Breast-Cancer       3   73.69±7.73    13   71.45±7.12    11   74.29±8.44     9   81.11±3.25     4
Wisconsin-Cancer    3   98.76±0.72    44   99.02±0.54    12   98.69±0.59     9   98.99±0.60     7
Heart-C             2   85.36±2.58    35   84.02±3.46    32   84.79±3.39    25   84.32±4.38    17
Heart-S             2   84.41±3.64    29   85.16±2.90    25   81.12±15.05   19   82.36±5.71    11
Hepatitis           2   73.26±6.89    19   73.99±5.36    11   74.80±4.83     9   74.25±5.99     7
Pima Indian         2   81.08±0.97    27   81.92±1.79    14   74.74±15.13    5   81.90±4.84     2


Fig. 4.18 Comparison between non-merged trees and merged trees with Tm ranging from 0.1 to 0.3 on the given test data in terms of average AUC

Fig. 4.19 Comparison between non-merged trees and merged trees with Tm ranging from
0.1 to 0.3 on the given test data in terms of the number of branches

4.7 Linguistic Reasoning


As we saw in Chapter 3, the main logical connectives for fuzzy labels are interpreted as follows: ¬L means that L is not an appropriate label, L1 ∧ L2 means that both L1 and L2 are appropriate labels, L1 ∨ L2 means that either L1 or L2 is an appropriate label, and L1 → L2 means that L2 is an appropriate label whenever L1 is. In this section we use label semantics to provide a linguistic interpretation for LDTs and merged LDTs. We also use this framework to show how LDTs can be used to classify data with linguistic constraints on attributes. In addition, a method for classification of fuzzy data is proposed and supported by empirical studies on a toy problem.

4.7.1 Linguistic Interpretation of an LDT


Based on the α-function (Definition 3.12), a branch of a linguistic decision tree given in random set form (e.g., {small}, {small, medium}, {medium}) can be represented by a linguistic rule joined by logical connectives (in this case ¬large; see Section 3.4.3). The motivation for this mapping is shown in Fig. 4.20. Given a focal set {s, m}, this states that the labels appropriate to describe the attribute are exactly small and medium. Hence, the corresponding sentence includes s and m and excludes all other labels that occur in focal sets that are supersets of {s, m}. Given a set of focal sets {{s, m}, {m}}, this provides the information that the set of appropriate labels is either {s, m} or {m}, and hence the sentence providing the same information is the disjunction of the α sentences for both focal sets (see Section 3.4.3).
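To make this mapping concrete, the sketch below builds the α sentence for a single focal set and the disjunction for a set of focal sets, directly following the description above. This is a minimal illustration only: the Python helper names (alpha_sentence, node_sentence) and the string-based output are ours, and no Boolean simplification of the resulting expression is attempted.

```python
from itertools import chain

def alpha_sentence(focal, focal_sets):
    """Alpha sentence for a focal set F: assert every label in F and negate every
    other label occurring in focal sets that are supersets of F."""
    supersets = [G for G in focal_sets if focal <= G]
    excluded = set(chain.from_iterable(supersets)) - focal
    literals = sorted(focal) + ["~" + lab for lab in sorted(excluded)]
    return " & ".join(literals)

def node_sentence(focals, focal_sets):
    """Sentence for a set of focal sets: the disjunction of their alpha sentences."""
    return " | ".join("(" + alpha_sentence(f, focal_sets) + ")" for f in focals)

# Focal sets on one attribute described by labels {small, medium, large}
F = [frozenset({"small"}), frozenset({"small", "medium"}), frozenset({"medium"}),
     frozenset({"medium", "large"}), frozenset({"large"})]

# The node {{small, medium}, {medium}} discussed in the text
print(node_sentence([frozenset({"small", "medium"}), frozenset({"medium"})], F))
# prints: (medium & small) | (medium & ~large & ~small)
# which, within label semantics, is equivalent to medium & ~large
# (cf. the expression m3 ∧ ¬l3 in the Iris tree below)
```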


Fig. 4.20 A merged linguistic decision tree for the Iris problem

As discussed in the last section, a merged LDT was obtained from the real-world Iris dataset at depth 2 with Tm = 0.3, where Lj = {smallj (sj), mediumj (mj), largej (lj) | j = 1, . . . , 4} (see Fig. 4.21).

MTiris = {MB1, MB2, MB3, MB4, MB5, MB6, MB7, MB8} =

{⟨{s3}, 1.0000, 0.0000, 0.0000⟩
⟨{{s3, m3}, {m3}}, {s4}, 1.0000, 0.0000, 0.0000⟩
⟨{{s3, m3}, {m3}}, {{s4, m4}, {m4}}, 0.0008, 0.9992, 0.0000⟩
⟨{{s3, m3}, {m3}}, {m4, l4}, 0.0000, 0.5106, 0.4894⟩
⟨{{s3, m3}, {m3}}, {l4}, 0.0000, 0.0556, 0.9444⟩
⟨{{m3, l3}, {l3}}, {s4}, 0.3333, 0.3333, 0.3333⟩
⟨{{m3, l3}, {l3}}, {{s4, m4}, {m4}}, 0.0000, 0.8423, 0.1577⟩
⟨{{m3, l3}, {l3}}, {{m4, l4}, {l4}}, 0.0000, 0.0913, 0.9087⟩}

We can then translate this tree into a set of linguistic expressions as follows:

MTiris = {
⟨s3 ∧ ¬(m3 ∨ l3), 1.0000, 0.0000, 0.0000⟩
⟨m3 ∧ ¬l3, s4 ∧ ¬(m4 ∨ l4), 1.0000, 0.0000, 0.0000⟩
⟨m3 ∧ ¬l3, m4 ∧ ¬l4, 0.0008, 0.9992, 0.0000⟩
⟨m3 ∧ ¬l3, ¬s4 ∧ m4 ∧ l4, 0.0000, 0.5106, 0.4894⟩
⟨m3 ∧ ¬l3, ¬(s4 ∨ m4) ∧ l4, 0.0000, 0.0556, 0.9444⟩
⟨l3, s4 ∧ ¬(m4 ∨ l4), 0.3333, 0.3333, 0.3333⟩
⟨l3, m4 ∧ ¬l4, 0.0000, 0.8423, 0.1577⟩
⟨l3, l4, 0.0000, 0.0913, 0.9087⟩}

Furthermore, the tree itself can be rewritten as a set of fuzzy rules. For example, branch 2 corresponds to the rule:

IF Attribute 3 is medium but not large and Attribute 4 is only small, THEN the class probabilities given this branch are (1.0000, 0.0000, 0.0000).


Fig. 4.21 A merged linguistic decision tree in logical expressions for the LDT shown in Fig.
4.20

4.7.2 Linguistic Constraints


Here we consider linguistic constraints of the form θ = ⟨x1 is θ1, . . . , xn is θn⟩, where θj represents a label expression based on Lj for j = 1, . . . , n. Consider the vector of linguistic constraints θ = ⟨θ1, . . . , θn⟩, where θj is the linguistic constraint on attribute j. We can evaluate a probability value for class Ct conditional on this information using a given linguistic decision tree as follows. The mass assignment given a linguistic constraint θj is evaluated by

\[
\forall F_j \in \mathcal{F}_j \quad
m_{\theta_j}(F_j) =
\begin{cases}
\dfrac{pm(F_j)}{\sum_{F \in \lambda(\theta_j)} pm(F)} & \text{if } F_j \in \lambda(\theta_j) \\[2mm]
0 & \text{otherwise}
\end{cases}
\tag{4.27}
\]
where pm(Fj) is the prior mass for focal elements Fj ∈ 𝓕j derived from the prior distribution p(xj) on Ωj as follows:

\[
pm(F_j) = \int_{\Omega_j} m_{x_j}(F_j)\, p(x_j)\, dx_j
\tag{4.28}
\]

Usually, we assume that p(xj) is the uniform distribution over Ωj, so that

\[
pm(F_j) \propto \int_{\Omega_j} m_{x_j}(F_j)\, dx_j
\tag{4.29}
\]
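As a simple illustration of Eq. (4.29), the prior masses can be approximated numerically by averaging the mass assignment over a grid on Ωj. The sketch below assumes a hypothetical function mass_assignment(x) (standing in for the mass assignment mx of Chapter 3) that returns a dictionary mapping focal elements to their masses; it is only meant to show the computation, not the implementation used in the experiments.

```python
import numpy as np

def prior_masses(mass_assignment, omega, n_points=1000):
    """Approximate pm(F) ∝ ∫_Ω m_x(F) dx under a uniform prior p(x) by
    averaging the mass assignment over an evenly spaced grid on Ω = [lo, hi]."""
    lo, hi = omega
    totals = {}
    for x in np.linspace(lo, hi, n_points):
        for focal, m in mass_assignment(x).items():
            totals[focal] = totals.get(focal, 0.0) + m
    norm = sum(totals.values())  # normalise so the prior masses sum to one
    return {focal: m / norm for focal, m in totals.items()}
```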

More details of the calculation of mass assignment given a linguistic constraint are
given in Example 4.3. For branch B the probability of B given θ is evaluated by
\[
P(B \mid \theta) = \prod_{r=1}^{|B|} m_{\theta_{j_r}}(F_{j_r})
\tag{4.30}
\]

where |B| is the number of nodes in branch B. By Jeffrey's rule [11], we can obtain

\[
P(C_t \mid \theta) = \sum_{v=1}^{|T|} P(C_t \mid B_v)\, P(B_v \mid \theta)
\tag{4.31}
\]

Example 4.3 Given the LDT in Example 4.2, suppose we know that for a particular data element "x1 is not large and x2 is small". We can then translate this knowledge into the following linguistic constraint vector:

θ = ⟨θ1, θ2⟩ = ⟨¬large1, small2⟩

By applying the λ -function (Definition 3.10), we can generate the associated label
sets, so that:

λ (¬large1 ) = {{small1 }}
λ (small2 ) = {{small2 }, {small2 , large2 }}

Suppose the prior mass assignments are

pm1 = {small1 } : 0.4, {small1 , large1 } : 0.3, {large1 } : 0.3


pm2 = {small2 } : 0.3, {small2 , large2 } : 0.2, {large2 } : 0.5

From this, according to Eq. (4.27), we obtain

mθ1 = {small1 } : 0.4/0.4 = {small1 } : 1


mθ2 = {small2 } : 0.3/(0.3 + 0.2), {small2 , large2 } : 0.2/(0.2 + 0.3)
= {small2 } : 0.6, {small2 , large2 } : 0.4

This gives

P(B1 |θ ) = mθ1 ({small1 }) × mθ2 ({small2 }) = 1 × 0.6 = 0.6


P(B2 |θ ) = mθ1 ({small1 }) × mθ2 ({small2 , large2 }) = 1 × 0.4 = 0.4
P(B3 |θ ) = P(B4 |θ ) = P(B5 |θ ) = P(B6 |θ ) = P(B7 |θ ) = 0

Hence, according to Jeffrey’s rule


\[
P(C_1 \mid \theta) = \sum_{v=1}^{7} P(B_v \mid \theta) P(C_1 \mid B_v) = \sum_{v=1,2} P(B_v \mid \theta) P(C_1 \mid B_v) = 0.6 \times 0.3 + 0.4 \times 0.5 = 0.38
\]
\[
P(C_2 \mid \theta) = \sum_{v=1}^{7} P(B_v \mid \theta) P(C_2 \mid B_v) = \sum_{v=1,2} P(B_v \mid \theta) P(C_2 \mid B_v) = 0.6 \times 0.7 + 0.4 \times 0.5 = 0.62
\]
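The full chain of Eqs. (4.27)–(4.31) can be expressed compactly in code. The sketch below reproduces the numbers of Example 4.3; the data structures and helper names are ours (hypothetical), and the class distributions for branches B1 and B2 are those of the LDT in Example 4.2 (the remaining five branches receive zero probability under this constraint and are therefore omitted).

```python
def constraint_mass(prior_mass, admissible):
    """Eq. (4.27): renormalise the prior masses over the focal sets consistent
    with the constraint, i.e. those in lambda(theta_j); all others get mass 0."""
    z = sum(prior_mass[f] for f in admissible)
    return {f: (m / z if f in admissible else 0.0) for f, m in prior_mass.items()}

def branch_prob(branch, masses):
    """Eq. (4.30): P(B | theta) is the product of the node masses along the branch."""
    p = 1.0
    for attr, focal in branch:
        p *= masses[attr].get(focal, 0.0)
    return p

def class_probs(branches, masses, n_classes=2):
    """Eq. (4.31), Jeffrey's rule: P(C_t | theta) = sum_v P(C_t | B_v) P(B_v | theta)."""
    totals = [0.0] * n_classes
    for branch, dist in branches:
        pb = branch_prob(branch, masses)
        for t in range(n_classes):
            totals[t] += dist[t] * pb
    return totals

# Focal elements for the two attributes of Example 4.3
s1, sl1, l1 = frozenset({"s1"}), frozenset({"s1", "l1"}), frozenset({"l1"})
s2, sl2, l2 = frozenset({"s2"}), frozenset({"s2", "l2"}), frozenset({"l2"})

pm1 = {s1: 0.4, sl1: 0.3, l1: 0.3}   # prior masses on attribute 1
pm2 = {s2: 0.3, sl2: 0.2, l2: 0.5}   # prior masses on attribute 2

# lambda(not large1) = {{small1}};  lambda(small2) = {{small2}, {small2, large2}}
masses = {1: constraint_mass(pm1, {s1}), 2: constraint_mass(pm2, {s2, sl2})}

# Branches B1 and B2 with class distributions (P(C1|B), P(C2|B)) from Example 4.2
branches = [([(1, s1), (2, s2)], (0.3, 0.7)),
            ([(1, s1), (2, sl2)], (0.5, 0.5))]

print(class_probs(branches, masses))   # approximately [0.38, 0.62]
```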

The methodology for classification under linguistic constraints allows us to fuse background knowledge expressed in linguistic form into the classification process. This is one of the advantages of using a high-level knowledge representation language such as label semantics.

4.7.3 Classification of Fuzzy Data


In previous discussions LDTs have only been used to classify crisp data where
objects are described in terms of precise attribute values. However, in many real-
world applications limitations of measurement accuracy mean that only imprecise
values can be realistically obtained. In this section we introduce the idea of fuzzy
data and show how LDTs can be used for classification in this context.
Formally, a fuzzy database is defined to be a set of elements or objects each
described by linguistic expressions rather than crisp values. In other words

FD = {⟨θ1(i), . . . , θn(i)⟩ : i = 1, . . . , N}

Currently, there are very few benchmark problems of this kind with fuzzy attribute
values. This is because, traditionally, only crisp data values are recorded even in
cases where this is inappropriate. Hence, we have generated a fuzzy database from
a toy problem where the aim is to identify the interior of a figure of eight shape.
Specifically, a figure-of-eight shape was generated according to the equations

\[
x = 2^{-0.5}\,(\sin(2t) - \sin(t))
\tag{4.32}
\]
\[
y = 2^{-0.5}\,(\sin(2t) + \sin(t))
\tag{4.33}
\]

where t ∈ [0, 2π] (see Fig. 4.23). Points in [−1.6, 1.6]^2 are classified as legal if they lie within the 'eight' shape (marked with ×) and illegal if they lie outside (marked with points).
To form the fuzzy database we first generated a crisp database by uniformly sampling 961 points across [−1.6, 1.6]^2. Each data vector ⟨x1, x2⟩ was then converted to a vector of linguistic expressions ⟨θ1, θ2⟩ as follows: θj = θ_{R_j}, where

\[
R_j = \{F \in \mathcal{F}_j : m_{x_j}(F) > 0\}
\]
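A minimal sketch of this conversion is given below, assuming a 31 × 31 regular grid (961 points) over [−1.6, 1.6]^2 and a hypothetical per-attribute function mass_assignment(j, x) returning the mass assignment m_{x_j} over focal elements (as in Chapter 3); both are assumptions made for illustration only and not the original experimental code.

```python
import numpy as np

def to_fuzzy_database(points, mass_assignment):
    """Convert each crisp vector <x1, x2> into a vector of linguistic expressions
    <theta_1, theta_2>, where theta_j is the expression theta_{R_j} determined by
    R_j = {F : m_{x_j}(F) > 0}, the focal elements receiving non-zero mass."""
    fuzzy_db = []
    for x in points:
        record = tuple(
            frozenset(F for F, m in mass_assignment(j, xj).items() if m > 0)
            for j, xj in enumerate(x)
        )
        fuzzy_db.append(record)
    return fuzzy_db

# Crisp database: 961 points sampled uniformly over [-1.6, 1.6]^2 (a 31 x 31 grid)
grid = np.linspace(-1.6, 1.6, 31)
points = [(x1, x2) for x1 in grid for x2 in grid]
# fuzzy_db = to_fuzzy_database(points, mass_assignment)  # mass_assignment assumed given
```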

An LDT was then learnt by applying the LID3 algorithm to the crisp database. This
tree was then used to classify both the crisp and fuzzy data. The results are shown
in Table 4.9 and the results with NF = 7 are shown in Fig. 4.22 .


Fig. 4.22 Classification of crisp dataset (a) and fuzzy data without masses (b), where each
attribute is discretized uniformly by 7 fuzzy sets

Table 4.9 Classification accuracy based on crisp data and fuzzy data of the “eight” problem

NF = 3 NF = 4 NF = 5 NF = 6 NF = 7
Crisp Data 87.72% 94.17% 95.94% 97.29% 98.54%
Fuzzy Data 79.29% 85.85% 89.39% 94.17% 95.01%

As we can see from Table 4.9, our model gives a reasonable approximation of the legal data area, though it is not as accurate as testing on crisp data. The accuracy increases with NF, the number of fuzzy sets used for discretization. These results show that the LDT model can perform well in dealing with fuzzy and ambiguous data. The 'eight' problem is also used to test classification with linguistic constraints in the following example.
Example 4.4 Suppose an LDT is trained on the 'eight' database where each attribute is discretized uniformly by five fuzzy sets: very small (vs), small (s), medium (m), large (l) and very large (vl). Further, suppose we are given the following descriptions of data points:

θ1 = ⟨x is vs ∨ s ∧ ¬m, y is vs ∨ s ∧ ¬m⟩
θ2 = ⟨x is m ∧ l, y is s ∧ m⟩
θ3 = ⟨x is s ∧ m, y is l ∨ vl⟩

Experimental results obtained using the approach introduced in Section 4.7 are as follows:

Pr(C1 |θ 1 ) = 1.000 Pr(C2 |θ 1 ) = 0.000


Pr(C1 |θ 2 ) = 0.000 Pr(C2 |θ 2 ) = 1.000
Pr(C1 |θ 3 ) = 0.428 Pr(C2 |θ 3 ) = 0.572

As we can see from Fig. 4.23 , the above 3 linguistic constraints roughly
correspond to the areas 1, 2 and 3, respectively. By considering the occurrence of
legal and illegal examples within these areas, we can verify the correctness of our
approach.

Fig. 4.23 Testing on the “eight” problem with linguistic constraints θ , where each attribute
is discretized by 5 trapezoidal fuzzy sets: very small, small, medium, large and very large

4.8 Summary
In this chapter, a decision tree learning algorithm based on label semantics is proposed. Unlike classical decision trees, the new algorithm uses probability estimation based on linguistic labels. The linguistic labels are based on fuzzy discretization obtained by a number of different methods, including uniform partitioning, percentile-based partitioning and entropy-based partitioning. We found that the percentile-based and entropy-based discretizations outperform uniform discretization, although the differences are not statistically significant. By testing the new model on real-world datasets and comparing it with three well-known machine learning algorithms, we found that LID3 outperforms C4.5 on all the given datasets and outperforms Naive Bayes on the datasets with numerical attributes only. It also has equivalent classification accuracy and better transparency compared to Back Propagation Neural Networks.
In order to obtain compact trees, a forward merging algorithm was proposed, and the experimental results show that the number of branches can be greatly reduced without a significant loss in accuracy. Finally, we introduced a method for interpreting a linguistic decision tree as a set of linguistic rules joined by logical connectives. Methods for classification with linguistic constraints and for classification of fuzzy data were also discussed, supported by tests on a toy problem. In the subsequent chapter, we will focus on extending the LDT model from classification problems to prediction problems.

References
[1] Quinlan J. R.: Induction of decision trees, Machine Learning, 1: pp. 81-106.
(1986).
[2] Quinlan J. R.: C4.5: Programs for Machine Learning, San Mateo: Morgan Kaufmann. (1993).
[3] Mitchell T.: Machine Learning, McGraw-Hill, New York. (1997).
[4] Berthold M., Hand D. J. (Eds.): Intelligent Data Analysis, Springer-Verlag, Berlin Heidelberg. (1999).
[5] Peng Y., Flach P. A.: Soft discretization to enhance the continuous decision
trees, Integrating Aspects of Data Mining, Decision Support and Meta-
Learning, C. Giraud-Carrier, N. Lavrac and S. Moyle, editors, pp. 109-118,
ECML/PKDD’01 workshop. (2001).
[6] Baldwin J. F., Lawry J., Martin T. P.: Mass assignment fuzzy ID3 with
applications. Proceedings of the Unicom Workshop on Fuzzy Logic:
Applications and Future Directions, pp. 278-294, London. (1997).
[7] Janikow C. Z.: Fuzzy decision trees: issues and methods, IEEE Trans. on
Systems, Man, and Cybernetics-Part B: Cybernetics, 28/1: pp. 1-14. (1998).
[8] Olaru C., Wehenkel L.: A complete fuzzy decision tree technique, Fuzzy Sets and Systems, 138: pp. 221-254. (2003).
[9] Blake C., Merz C. J.: UCI machine learning repository.
[10] Qin Z., Lawry J.: A tree-structured classification model based on label semantics, Proceedings of the 10th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU-04), pp. 261-268, Perugia, Italy. (2004).
[11] Jeffrey R. C.: The Logic of Decision, Gordon & Breach Inc., New York.
(1965).
[12] Provost F., Domingos P.: Tree induction for probability-based ranking,
Machine Learning, 52, pp. 199-215. (2003).

[13] Witten I. H., Frank E.: Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations, Morgan Kaufmann. (1999).
