Fuzzy Decision Tree
Keywords: fuzzy rules, fuzzy decision trees, classification, fuzzy logic, decision
making, machine learning
1 INTRODUCTION
Data repositories are growing quickly nowadays and contain huge amounts of data from commercial, scientific, and other domains. According to some estimates, the amount of data in the world is doubling every twenty months [7]. For this reason, information and database systems are widely used. The current generation of database systems is based mainly on a small number of primitives of the Structured Query Language (SQL). Database systems usually use relational databases, which consist of relations (tables) [11]. The tables have attributes in columns and instances in rows. As the data grow, these tables contain more and more hidden information, which can no longer be discovered and transformed into knowledge by classical SQL queries. One approach to extracting knowledge from the tables is to search for dependences in the data.
There are many ways to describe the dependences among data: statistical methods (the Naïve Bayes classifier, the k-nearest neighbour classifier, etc.), neural networks, decision tables, decision trees, and classification rules.
The classical ID3 algorithm uses minimal entropy as a criterion at each node. So-called Fuzzy ID3 is a direct generalization of ID3 to the fuzzy case. Such an algorithm uses minimal fuzzy entropy as a criterion, see e.g. [19, 17]. Notice that there are several definitions of fuzzy entropy [1]. A variant of Fuzzy ID3 is [22], which uses minimal classification ambiguity instead of minimal fuzzy entropy. An interesting extension of ID3-based algorithms, which uses cumulative information estimations, was proposed in [10]. These estimations allow us to build FDTs with different properties (unordered, ordered, stable, etc.). There are also some algorithms which differ considerably from ID3. For example, Wang et al. [18] present optimization principles of fuzzy decision trees based on minimizing the total number and average depth of leaves. Their algorithm minimizes fuzzy entropy and classification ambiguity at the node level and uses fuzzy clustering to merge branches. Another non-Fuzzy-ID3 algorithm is discussed in [4]. That algorithm uses a look-ahead method; its goal is to evaluate the so-called classifiability of the instances that are split along the branches of a given node.
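To make the criterion concrete, the following sketch computes one common variant of fuzzy entropy at a node: the class memberships of the examples reaching the node are summed per class, normalized into relative fuzzy frequencies, and plugged into the Shannon formula. This is a minimal illustration under that assumption (the cited papers differ in details), with all names hypothetical.

// Hypothetical sketch: one common fuzzy-entropy criterion used by Fuzzy ID3 variants.
public class FuzzyEntropy {
    // mu[i][k]: membership of the i-th example in the k-th class at the current node.
    static double fuzzyEntropy(double[][] mu) {
        int classes = mu[0].length;
        double[] m = new double[classes];
        double total = 0.0;
        for (double[] row : mu)
            for (int k = 0; k < classes; k++) { m[k] += row[k]; total += row[k]; }
        double h = 0.0;
        for (int k = 0; k < classes; k++) {
            double p = m[k] / total;                 // relative fuzzy frequency of class k
            if (p > 0) h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    public static void main(String[] args) {
        // Three examples with memberships in two classes.
        double[][] mu = {{0.9, 0.1}, {0.6, 0.4}, {0.2, 0.8}};
        System.out.printf("H = %.3f%n", fuzzyEntropy(mu));  // ~0.987
    }
}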
One way to obtain fuzzy rules is a transformation of fuzzy decision trees. In the crisp case, each leaf of the tree corresponds to one rule. The condition of the rule is a group of terms of the form "attribute is attribute's value" connected with the operator AND. The attributes in the condition are the attributes associated with the nodes on the path from the root to the considered leaf, and the attribute values are the values associated with the outgoing branches of the particular attribute nodes on that path. The conclusion of the rule has the form "class attribute is class"; it is the output for the condition, in other words, the classification of a known instance whose values correspond to the condition. This notion is based on rules made from crisp decision trees [16]. In Section 3 of this paper, we describe a generalization to the fuzzy case, where the class values for a new instance (or, more exactly, the membership function values for the particular classes) are described by several rules which belong to one or more leaves.
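As an illustration of the crisp case, the following sketch (all names hypothetical; the paper prescribes no implementation) enumerates the root-to-leaf paths of a small decision tree and prints one IF-THEN rule per leaf.

// Hypothetical sketch: one rule per leaf of a crisp decision tree.
import java.util.ArrayList;
import java.util.List;

public class CrispRules {
    // A node is either a leaf (class label) or an attribute test with branches.
    record Node(String attribute, String[] values, Node[] children, String classLabel) {
        static Node leaf(String c) { return new Node(null, null, null, c); }
        static Node test(String a, String[] v, Node[] ch) { return new Node(a, v, ch, null); }
    }

    // Walk the tree; the path of "attribute is value" terms becomes the rule condition.
    static void rules(Node n, List<String> path) {
        if (n.classLabel() != null) {
            System.out.println("IF " + String.join(" AND ", path)
                    + " THEN Class is " + n.classLabel());
            return;
        }
        for (int j = 0; j < n.values().length; j++) {
            path.add(n.attribute() + " is " + n.values()[j]);
            rules(n.children()[j], path);
            path.remove(path.size() - 1);
        }
    }

    public static void main(String[] args) {
        Node tree = Node.test("Outlook", new String[]{"sunny", "rain"},
                new Node[]{Node.leaf("volleyball"), Node.leaf("weight lifting")});
        rules(tree, new ArrayList<>());
    }
}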
3 FUZZY CLASSIFICATION RULES INDUCTION
Let the class attribute have mc possible values (classes), denoted by C = {c1, ..., ck, ..., cmc}. Let the FDT have R leaves L = {l1, ..., lr, ..., lR}. It is also assumed that there is a value F_k^r for each r-th leaf lr and each k-th class ck. This value F_k^r is the certainty degree of the class ck attached to the leaf node lr.
Now, the transformation of this FDT into fuzzy rules is described. In the fuzzy case, a new instance e may be classified into different classes with different degrees; in other words, the instance e is classified into each possible class ck with some degree. For that reason, a separate group of classification rules can be considered for each class ck, 1 ≤ k ≤ mc. Let us consider the group of classification rules for ck. Each leaf lr ∈ L corresponds to one r-th classification rule. The condition of the r-th classification rule is a group of terms of the form "attribute is attribute's value" connected with the operator AND. The attributes in the condition of the r-th rule are the attributes associated with the nodes on the path from the root to the r-th leaf, and the attribute values are the values associated with the respective outgoing branches of the nodes on that path. The conclusion of the r-th rule is "C is ck". Let us assume that on the path from the root to the r-th leaf there are q nodes associated with attributes Ai1, Ai2, ..., Aiq and, respectively, their q outgoing branches associated with the values ai1,j1, ai2,j2, ..., aiq,jq. Then the r-th rule has the following form:
IF Ai1 is ai1,j1 AND Ai2 is ai2,j2 AND ... AND Aiq is aiq,jq THEN C is ck (with truthfulness F_k^r).
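A rule of this form can be represented by a simple data type. The following sketch (hypothetical names, not the authors' implementation) stores the condition as the list of attribute values on the path, together with the conclusion and the truthfulness F_k^r.

// Hypothetical sketch: a fuzzy classification rule extracted from an FDT leaf.
import java.util.List;

public class FuzzyRule {
    final List<String> condition;   // attribute values ai1,j1, ..., aiq,jq on the path
    final String classValue;        // ck in the conclusion "C is ck"
    final double truthfulness;      // F_k^r, the certainty degree of ck at leaf lr

    FuzzyRule(List<String> condition, String classValue, double truthfulness) {
        this.condition = condition;
        this.classValue = classValue;
        this.truthfulness = truthfulness;
    }

    public static void main(String[] args) {
        // Rule r = 2 of the c1 group from Example 1 below.
        FuzzyRule r2 = new FuzzyRule(List.of("hot", "sunny", "not windy"), "volleyball", 0.332);
        System.out.println("IF Temp is hot AND Outlook is sunny AND Wind is not windy"
                + " THEN Plan is " + r2.classValue + " (truthfulness " + r2.truthfulness + ")");
    }
}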
In the two following paragraphs, two ways of using these fuzzy classification rules for the classification of a new instance e into a class ck, 1 ≤ k ≤ mc, are described. The first way uses only one classification rule, whereas the second uses one or more rules for the classification. The degree to which the instance e is classified into the class ck is denoted by the membership function μ_ck(e); μ_ck(e) takes a value in the continuous interval [0, 1]. It is also supposed that the membership function values of all attribute values for the new instance e, μ_ai,j(e), ai,j ∈ Ai, Ai ∈ A, are known.
The first way is based on [16] and was initially meant for classification in the crisp case. In the crisp case, an attribute's possible value either is the attribute's value or is not; in other words, either μ_ai,j(e) = 1 or μ_ai,j(e) = 0 for ai,j ∈ Ai, Ai ∈ A. Also, exactly one μ_ai,j(e) is equal to 1 for ai,j ∈ Ai; all the others are equal to 0. In the fuzzy case, there can be more than one μ_ai,j(e) > 0 for ai,j ∈ Ai. To be able to use this first way, the value ai,j whose μ_ai,j(e) is maximal among all ai,j ∈ Ai is set equal to 1 and all the others to 0. The instance with such membership function values is called the rounded instance. Then one rule is chosen from the constructed group of rules: the r-th rule whose attribute values ai1,j1, ai2,j2, ..., aiq,jq match the values ai,j with μ_ai,j(e) = 1 of the rounded instance. After that, μ_ck(e) is set equal to F_k^r.
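A minimal sketch of this first way, assuming attribute values are identified by strings and using the membership values of Tab.1 from Example 1 below (all names hypothetical):

// Hypothetical sketch: rounding an instance for the one-rule classification.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FirstWay {
    // Round the instance: per attribute, the value with maximal membership gets 1, the rest 0.
    static List<String> round(Map<String, Map<String, Double>> mu) {
        return mu.values().stream()
                .map(byValue -> byValue.entrySet().stream()
                        .max(Map.Entry.comparingByValue()).orElseThrow().getKey())
                .toList();
    }

    public static void main(String[] args) {
        // Membership values of the new instance e (the values of Tab.1 below).
        Map<String, Map<String, Double>> mu = new LinkedHashMap<>();
        mu.put("Outlook", Map.of("sunny", 0.9, "cloudy", 0.1, "rain", 0.0));
        mu.put("Temp", Map.of("hot", 1.0, "mild", 0.0, "cool", 0.0));
        mu.put("Humidity", Map.of("humid", 0.8, "normal", 0.2));
        mu.put("Wind", Map.of("windy", 0.4, "not windy", 0.6));
        System.out.println(round(mu)); // [sunny, hot, humid, not windy]
        // The rule whose condition matches these values is selected,
        // and mu_ck(e) is set to its truthfulness F_k^r.
    }
}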
The second way uses one or more classification rules from ck's group for the classification of e into the class ck. The reason is that there may be several μ_ai,j(e) > 0 for a node in the FDT. Therefore, there may be several paths in which all outgoing branches of the nodes are associated with values ai,j for which μ_ai,j(e) > 0, and each such path corresponds to one classification rule. But because the values μ_ai,j(e) are not all equal to 1, each such rule should be included in the final value of μ_ck(e) with a certain weight. The weight for the instance e and the r-th rule in ck's group is given by the following formula:

W_r(e) = \prod_{a \in path_r} \mu_a(e),

where μ_a(e) is the value of the membership function of an attribute's value a and path_r is the set of all attribute values in the condition of the r-th classification rule. The weight W_r(e) is equal to 0 if there is an attribute value a with μ_a(e) = 0 in the condition of the r-th rule. The membership function value for the new instance e and the class ck is then:

\mu_{c_k}(e) = \sum_{r=1}^{R} F_k^r \cdot W_r(e),

where F_k^r is the truthfulness of the r-th rule, or in other words, the certainty degree of the class ck attached to the leaf node lr.
If, in the first or the second way, classification into only one class is needed, the instance e is classified into the class ck whose μ_ck(e), k = 1, 2, ..., mc, is maximal.
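The following sketch summarizes the second way and the final crisp decision. It is an illustration under the assumption that rules are stored as lists of path values (hypothetical names), not the authors' implementation.

// Hypothetical sketch: weighted classification with several fuzzy rules.
import java.util.List;
import java.util.Map;

public class SecondWay {
    record Rule(List<String> path, double truthfulness) {}   // path_r and F_k^r

    // W_r(e): product of the memberships mu_a(e) over all attribute values a in path_r.
    static double weight(Rule r, Map<String, Double> mu) {
        double w = 1.0;
        for (String a : r.path()) w *= mu.getOrDefault(a, 0.0);
        return w;
    }

    // mu_ck(e): sum over the rules of the class's group of F_k^r * W_r(e).
    static double classMembership(List<Rule> group, Map<String, Double> mu) {
        return group.stream().mapToDouble(r -> r.truthfulness() * weight(r, mu)).sum();
    }

    // For a crisp decision, the class with maximal mu_ck(e) is chosen.
    static String decide(Map<String, List<Rule>> groups, Map<String, Double> mu) {
        String best = null;
        double bestValue = -1.0;
        for (Map.Entry<String, List<Rule>> g : groups.entrySet()) {
            double v = classMembership(g.getValue(), mu);
            if (v > bestValue) { bestValue = v; best = g.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy memberships and a toy two-rule group, just to exercise the formulas.
        Map<String, Double> mu = Map.of("hot", 1.0, "sunny", 0.9, "cloudy", 0.1);
        List<Rule> c1 = List.of(new Rule(List.of("hot", "sunny"), 0.3),
                                new Rule(List.of("hot", "cloudy"), 0.7));
        System.out.printf("mu_c1(e) = %.2f%n", classMembership(c1, mu)); // 0.3*0.9 + 0.7*0.1 = 0.34
    }
}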
Let us illustrate the transformation of the FDT into fuzzy rules and the two mentioned ways of using them for classification with the following examples.
Example 1. An instance is described by N = 4 attributes A = {A1, A2, A3, A4} = {Outlook, Temp(erature), Humidity, Wind} and the instances are classified with one class attribute C. Each attribute Ai = {ai,1, ..., ai,j, ..., ai,mi} is defined as follows: A1 = {a1,1, a1,2, a1,3} = {sunny, cloudy, rain}, A2 = {a2,1, a2,2, a2,3} = {hot, mild, cool}, A3 = {a3,1, a3,2} = {humid, normal}, A4 = {a4,1, a4,2} = {windy, not windy}. The class attribute C = {c1, ..., ck, ..., cmc} is defined as follows: C = Plan = {c1, c2, c3} = {volleyball, swimming, weight lifting}. An FDT is shown in Fig.1; it was made from a database by using cumulative information. The database and the mechanism for building the tree are described in [9]. The FDT has R = 9 leaves, and the leaves contain the values F_k^r for the respective classes ck, k = 1, 2, 3. Our goal is to determine μ_ck(e), k = 1, 2, 3, for a new instance e, described in Tab.1, on the basis of the FDT.
Fig.1 The fuzzy decision tree of Example 1: the root node tests Temp (A2) with branches hot (a2,1), mild (a2,2), and cool (a2,3); the inner nodes test Outlook (A1) with branches sunny (a1,1), cloudy (a1,2), and rain (a1,3) and Wind (A4) with branches windy (a4,1) and not windy (a4,2). Each of the R = 9 leaves l1, ..., l9 lists its certainty degrees F_k^r for the classes c1, c2, c3, e.g. F_1^1 = 14.2%, F_2^1 = 67.8%, F_3^1 = 18.0% at leaf l1.
Tab.1 The membership function values μ_ai,j(e) of the new instance e.

Outlook (A1):  sunny 0.9,  cloudy 0.1,  rain 0.0
Temp (A2):     hot 1.0,    mild 0.0,    cool 0.0
Humidity (A3): humid 0.8,  normal 0.2
Wind (A4):     windy 0.4,  not windy 0.6
The classification rules for class c1 (c1's group of rules) made from the FDT in Fig.1 have the following form:
r=1: IF Temp is hot AND Outlook is sunny AND Wind is windy THEN Plan is volleyball (truthfulness F_1^1 = 0.142)
r=2: IF Temp is hot AND Outlook is sunny AND Wind is not windy THEN Plan is volleyball (truthfulness F_1^2 = 0.332)
r=3: IF Temp is hot AND Outlook is cloudy THEN Plan is volleyball (truthfulness F_1^3 = 0.376)
The remaining rules r = 4, 5, ..., 9 are built analogously from the paths to the leaves l4, ..., l9.
Example 2 (classification with one fuzzy rule). First, the rounded instance is made from the instance e described in Tab.1.
It means that, for each Ai (i = 1, 2, 3, 4), the value ai,j whose μ_ai,j(e) (j = 1, ..., mi) is maximal among all ai,j ∈ Ai is set equal to 1 and all the others to 0. When the values ai,j with μ_ai,j(e) = 1 are chosen, the following set is obtained: {a1,1, a2,1, a3,1, a4,2} = {sunny, hot, humid, not windy}. The elements of this set match the attribute values of the fuzzy classification rule r = 2 in Example 1. Therefore, μ_c1(e) for class c1 is computed as μ_c1(e) = F_1^2 = 0.332. Similarly, μ_c2(e) = F_2^2 = 0.623 on the basis of c2's group of rules and μ_c3(e) = F_3^2 = 0.045 on the basis of c3's group of rules. μ_c2(e) has the maximum value; therefore, if classification into only one class is needed, the instance e is classified into class c2.
Example 3 (classification with several fuzzy rules). The paths in which all outgoing branches of the nodes are associated with values ai,j for which μ_ai,j(e) > 0 are shown as bold branches in Fig.1. In this situation, the rules corresponding to the leaves l1, l2, l3 must be used. The respective classification rules r = 1, 2, 3 for class c1 are given in Example 1. For the rule r = 1, path_1 = {a2,1, a1,1, a4,1}, and so, for the instance e described in Tab.1, W1(e) = 1.0*0.9*0.4 = 0.36. Similarly, W2(e) = 1.0*0.9*0.6 = 0.54 and W3(e) = 1.0*0.1 = 0.1. All the other Wr(e) are equal to 0, r = 4, 5, ..., 9.
Then, μ_c1(e) = 0.142*0.36 + 0.332*0.54 + 0.376*0.1 + 0.110*0 + ... + 0.163*0 = 0.268. When the same is done for c2 and c3, μ_c2(e) = 0.631 and μ_c3(e) = 0.101 are obtained. μ_c2(e) has the maximum value, and so, if classification into only one class is needed, the instance e is classified into class c2.
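The arithmetic of Examples 2 and 3 can be rechecked with a few lines; the truthfulness values are the leaf certainty degrees of the FDT in Fig.1. This is an illustrative check, not part of the original experiments.

// Recomputing Example 3 with the Tab.1 memberships.
public class Example3Check {
    public static void main(String[] args) {
        // Rule weights for the c1 group (memberships from Tab.1, paths from Example 1).
        double w1 = 1.0 * 0.9 * 0.4;  // hot, sunny, windy      -> 0.36
        double w2 = 1.0 * 0.9 * 0.6;  // hot, sunny, not windy  -> 0.54
        double w3 = 1.0 * 0.1;        // hot, cloudy            -> 0.10
        // Truthfulness-weighted sums over the rules (all other weights are 0).
        double muC1 = 0.142 * w1 + 0.332 * w2 + 0.376 * w3;
        double muC2 = 0.678 * w1 + 0.623 * w2 + 0.504 * w3;
        double muC3 = 0.180 * w1 + 0.045 * w2 + 0.120 * w3;
        System.out.printf("mu_c1=%.3f mu_c2=%.3f mu_c3=%.3f%n", muC1, muC2, muC3);
        // Prints mu_c1=0.268 mu_c2=0.631 mu_c3=0.101; the maximum is mu_c2, class c2.
    }
}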
4 EXPERIMENTAL RESULTS
The main purpose of our experimental study is to compare the two mentioned ways of using the fuzzy classification rules made from FDTs based on cumulative information. Their classification accuracy is also compared with that of other methods. All methods have been coded in Java, and the experimental results were obtained on an AMD Turion64 1.6 GHz with 1024 MB of RAM.
The experiments were carried out on selected Machine Learning databases [6]. First, where needed, the databases were fuzzified with an algorithm introduced in [8]. Then we separated each database into two parts: the first part (70% of the database) was used for building the classification models, and the second part (30% of the initial database) was used for verifying these models. This process of separation and verification was repeated 100 times to obtain the models' error rates for the respective databases. The error rate is calculated as the ratio of the number of misclassified instances to the total number of classified instances.
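A sketch of this evaluation protocol, under the assumption of a generic classifier interface (all names hypothetical), could look as follows.

// Hypothetical sketch: repeated 70/30 holdout evaluation.
import java.util.Collections;
import java.util.List;

public class HoldoutEvaluation {
    // Hypothetical classifier interface; any of the compared methods would fit here.
    interface Classifier { void train(List<double[]> data); boolean misclassifies(double[] x); }

    // Random 70/30 split, repeated `runs` times; returns the mean error rate.
    static double errorRate(List<double[]> db, Classifier c, int runs) {
        double total = 0.0;
        for (int i = 0; i < runs; i++) {
            Collections.shuffle(db);
            int cut = (int) (0.7 * db.size());
            c.train(db.subList(0, cut));                      // 70% for building the model
            List<double[]> test = db.subList(cut, db.size()); // 30% for verification
            long wrong = test.stream().filter(c::misclassifies).count();
            total += (double) wrong / test.size();
        }
        return total / runs;
    }

    public static void main(String[] args) {
        // Dummy data and classifier, just to exercise the loop.
        List<double[]> db = new java.util.ArrayList<>();
        for (int i = 0; i < 100; i++) db.add(new double[]{i});
        Classifier guess = new Classifier() {
            public void train(List<double[]> data) {}
            public boolean misclassifies(double[] x) { return x[0] % 2 == 0; }
        };
        System.out.printf("error = %.3f%n", errorRate(db, guess, 100));
    }
}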
The results of our experiments are given in Tab.3, where CA-MM denotes the FDT-based fuzzy classification rules algorithm proposed in [22], CA-SP denotes a modification of CA-MM introduced in [3], CI-RM-M denotes the algorithm [10] for making FDTs with cumulative information combined with the second way of fuzzy classification rules induction from Section 3, CI-RM-0 denotes the same algorithm combined with the first way of fuzzy classification rules induction from Section 3, NBC denotes the Naïve Bayes Classifier, and k-NN denotes the k-Nearest Neighbour Classifier. The numbers in round brackets denote the rank of the method for the respective database; number 1 denotes the lowest, and thus best, error rate. The last row contains the average error rate over all databases for the respective methods.
Table 3 The error rates comparison of the respective methods.

Database    CA-MM   CA-SP   CI-RM-M   CI-RM-0   NBC   k-NN
BUPA
Ecoli
Glass
Haberman
Iris
Pima
Wine
Average
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16] QUINLAN, J. R.: Induction of decision trees. Machine Learning 1 (1986), 81-106.
[17] UMANO, M., OKAMOTO, H., HATONO, I., TAMURA, H., KAWACHI, F.,
UMEDZU, U., KINOSHITA J.: Fuzzy decision trees by fuzzy ID3 algorithm
and its application to diagnosis system. IEEE Int. Conf. on Fuzzy Systems
(1994), 2113-2118.
[18] WANG, X. Z., CHEN, B., QIAN, G., YE, F.: On the optimization of fuzzy
decision trees. Fuzzy Sets and Systems 2, Vol. 112 (2000), 117-125.
[19] WEBER, R.: Fuzzy ID3: A class of methods for automatic knowledge
acquisition. 2nd Int. Conf. on Fuzzy Logic and Neural Networks (1992), 265-268.
[20] WITTEN, I. H., FRANK, E.: Data Mining: Practical Machine Learning Tools
and Techniques with Java Implementations. Morgan Kaufmann, 1999.
[21] YEN, J.: Fuzzy Logic - A modern perspective. IEEE Transactions on
Knowledge and Data Engineering 1, Vol. 11 (1999), 153-165.
[22] YUAN, Y., SHAW, M. J.: Induction of fuzzy decision trees. Fuzzy Sets and
Systems 69 (1995), 125-139.
[23] ZADEH, L. A.: Fuzzy sets. Information and Control 8 (1965), 338-353.