Learning Predictive Clustering Rules
Abstract. The two most commonly addressed data mining tasks are
predictive modelling and clustering. Here we address the task of predic-
tive clustering, which contains elements of both and generalizes them
to some extent. We propose a novel approach to predictive clustering, called predictive clustering rules, and present an initial implementation together with its preliminary experimental evaluation.
1 Introduction
closely to the predictive modelling variants thereof). We propose a novel ap-
proach to predictive clustering, namely, predictive clustering rules (PCRs) and
present a learning method for them.
The task of learning PCRs generalizes the task of rule induction, on the one hand, and clustering, in particular itemset constrained clustering, on the other. It is thus important for constraint-based data mining and inductive databases (IDBs). As IDBs are the most promising approach to a general framework for data mining, bringing the two most common data mining tasks closer together is a step in this direction. Constraint-based clustering is, moreover, an under-researched topic within constraint-based data mining.
In the next section, we discuss in more detail prediction and clustering, then
define the task of predictive clustering and in particular the task of learning
PCRs. We then describe an algorithm for learning PCRs, followed by a discussion
of related work. Before concluding and outlining directions for further work, we
present a preliminary experimental evaluation.
2 Prediction, clustering, and predictive clustering
The tasks of predictive modelling and clustering are two of the oldest and most commonly addressed tasks of data analysis (and data mining). Here we briefly introduce each of them, as well as predictive clustering, a task that combines elements of both prediction and clustering. We also discuss in more detail the more specific task of learning predictive clustering rules, which is the focus of this paper.
2.1 Predictive modelling
Predictive modelling aims at constructing models that can predict a target property of an object from a description of the object. Predictive models are learned from sets of examples, where each example has the form (A, T), with A being an object description and T a target property value. While a variety of languages, ranging from propositional to first-order logic, have been used for A, T is almost always assumed to consist of a single dependent variable, called the class: if this variable is discrete we are dealing with a classification problem, and if it is continuous, with a regression problem.
In practice, A is most commonly a vector of independent variables, called
attributes (attribute-value representation). In the remainder of the paper, we
will consider both A and T to be vectors of attributes (discrete or real-valued).
When T is a vector of several target variables, we call the corresponding predictive modelling task multi-objective prediction. If T contains only discrete variables, we speak of multi-objective classification; if T contains only continuous variables, we speak of multi-objective regression.
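To make the representation concrete, the following is a minimal, hypothetical sketch (the type names are ours, not the paper's) of an example (A, T) whose attribute and target vectors may mix discrete and continuous values.

```python
# Hypothetical sketch (names are ours): one way to represent an example (A, T)
# for multi-objective prediction, with mixed discrete and continuous variables.
from typing import NamedTuple, Tuple, Union

Value = Union[float, str]  # continuous value or discrete (nominal) value

class Example(NamedTuple):
    attributes: Tuple[Value, ...]  # object description A
    targets: Tuple[Value, ...]     # target vector T

# A sample described by two continuous attributes, with two discrete targets:
sample = Example(attributes=(7.4, 2.1), targets=("present", "absent"))
```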
Predictive models (learned from data) can take many different forms, ranging from linear equations to logic programs. The two most commonly used
types of models are decision trees [13] and rules [7]. Unlike (regression) equations, which provide a single predictive model for the entire example space, trees and rules partition the space of examples into subspaces and provide a simple prediction or predictive model for each subspace. While the subspaces defined by rules can overlap, those defined by trees (i.e., their leaves) do not.
2.3 Predictive clustering
The task of predictive clustering combines elements from both prediction and clustering. As is common in clustering, we seek clusters of examples that are similar to each other (and dissimilar to examples in other clusters). In addition, we seek a predictive model associated with each cluster; the model gives a prediction of the target variables T in terms of the attributes A for all examples that belong to that cluster.
In the simplest and most common case, the predictive model associated with a cluster would be the projection onto T of the prototype of the examples that belong to that cluster. This would be a simple average when T is a single continuous variable. In the discrete case, it would be a probability distribution over the possible values, or the mode thereof. When T is a vector, the prototype would be a vector of averages and distributions/modes.
We will be interested in the case where each cluster has both a symbolic description (in terms of a language bias over A) and a predictive model (a prototype in T) associated with it. The corresponding tree-based and rule-based representations are called predictive clustering trees [2] and predictive clustering rules (this paper). Unlike clustering, which only considers homogeneity along A, and predictive modelling, which only considers homogeneity along T, predictive clustering in general takes both A and T into account.
The task of predictive clustering is thus defined as follows: given a set of examples (pairs of attribute and target vectors), a declarative language bias specifying the possible cluster descriptions, a distance measure on examples, and a prototype function, find a set of clusters, each with a description in the given language and an associated prototype (predictive model), such that the examples within each cluster are close to each other (and to the cluster prototype) according to the distance measure.
2.4 Learning predictive clustering rules
In this paper, we will address the task of predictive clustering and focus on learning rules for multi-objective classification. Extending the proposed approach to multi-objective regression, and to multi-objective prediction in general, will be the subject of immediate further work. The examples will be vectors, divided into attributes and target variables, where the attributes can be either discrete or continuous. The declarative bias will restrict our hypothesis language to rules consisting of conjunctions of attribute-value conditions over the attributes A. (Additional language constraints are planned for consideration in the near future.)
The distance measure d would typically only consider T, since we are dealing with prediction. However, we will consider a distance measure that consists of two components, one over the attributes and the other over the targets:

d = (1 − τ) · d_A + τ · d_T.

This measure will be used as a heuristic in the search for rules. The distance d (as well as d_A and d_T) is computed as a weighted sum of distances along each (normalized) dimension. For continuous variables, the absolute difference is taken. Values of discrete variables are treated as probability distributions over the possible values, the distance between two such distributions being the sum of absolute differences of the probabilities of all possible values.
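As an illustration, the following is a minimal sketch of this distance computation. The function names, the equal per-dimension weights, and the representation of discrete values as value-to-probability dictionaries are our assumptions, and normalization details are glossed over.

```python
# A minimal sketch (assumed names) of the combined distance
# d = (1 - tau) * d_A + tau * d_T described above.

def dim_distance(x, y):
    """Distance along one (normalized) dimension."""
    if isinstance(x, dict):  # discrete: sum of absolute probability differences
        keys = set(x) | set(y)
        return sum(abs(x.get(k, 0.0) - y.get(k, 0.0)) for k in keys)
    return abs(x - y)        # continuous: absolute difference

def vector_distance(u, v):
    """Average of per-dimension distances (equal per-dimension weights assumed)."""
    return sum(dim_distance(a, b) for a, b in zip(u, v)) / len(u)

def distance(example, prototype, tau=0.9):
    """d = (1 - tau) * d_A + tau * d_T for (attributes, targets) pairs."""
    (a, t), (pa, pt) = example, prototype
    return (1 - tau) * vector_distance(a, pa) + tau * vector_distance(t, pt)
```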
Given a (partial) rule that covers a set of examples S, its quality is estimated as the average distance of an example in S to the prototype of S. We will call this the "compactness" of the rule. For evaluating predictive performance (accuracy), only the d_T component is relevant. The d_A component additionally takes into account the compactness of (i.e., the average distance between) the examples covered by the rule in the attribute space.
The prototype function takes a set of examples as input and returns a prototype that is computed per dimension: for continuous attributes the average is taken; for discrete variables, the probability distribution over the possible values.
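Continuing the previous sketch, the following illustrates one possible reading of the prototype function and of the compactness of a set of covered examples as the average distance to this prototype. The names are ours, single discrete values are compared to the prototype as one-hot distributions, and the exact normalization (e.g., the factor relating this distance to the (1 − f_k) form used in Section 3.2) is glossed over.

```python
# A sketch (illustrative names) of the prototype function and of the
# compactness of a set of covered (attributes, targets) examples.
from collections import Counter

def dim_prototype(values):
    """Mean for continuous values; relative value frequencies for discrete ones."""
    if isinstance(values[0], str):
        return {v: c / len(values) for v, c in Counter(values).items()}
    return sum(values) / len(values)

def dim_dist(value, proto):
    """Distance of a single value to the per-dimension prototype."""
    if isinstance(value, str):  # compare the value, as a one-hot distribution, to the frequencies
        return sum(abs((1.0 if k == value else 0.0) - p) for k, p in proto.items())
    return abs(value - proto)

def compactness(examples, tau=0.9):
    """Average distance of the covered examples to their prototype (lower = more compact)."""
    attrs, targets = zip(*examples)
    proto_a = [dim_prototype(col) for col in zip(*attrs)]
    proto_t = [dim_prototype(col) for col in zip(*targets)]
    total = 0.0
    for a, t in examples:
        d_a = sum(dim_dist(v, p) for v, p in zip(a, proto_a)) / len(a)
        d_t = sum(dim_dist(v, p) for v, p in zip(t, proto_t)) / len(t)
        total += (1 - tau) * d_a + tau * d_t
    return total / len(examples)
```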
3 An algorithm for learning PCRs
This section describes our system for learning PCRs. The majority of rule induction methods are based on the covering algorithm; among these, the CN2 algorithm [4] is well known. Our system is based on this algorithm, but several important parts are modified. We first briefly describe the original CN2 algorithm and then present our modifications.
3.1 The CN2 algorithm
The CN2 algorithm iteratively constructs rules that cover examples with homogeneous target variable values. The heuristic used to guide the search is simply the accuracy of the rule under construction. After a rule has been constructed, the examples covered by this rule are removed from the training set, and the procedure is repeated on the reduced data set until the data set is empty or no new rules are found. The rules constructed in this way are ordered, meaning that they can be used for prediction as a decision list: we test the rules on a new example one by one, and the first rule that fires is used to predict the target value of this example. Alternatively, CN2 can construct unordered rules by removing only correctly classified examples from the training set after finding each rule and by building rules for each possible class value separately. When unordered rules are used for prediction, several rules can fire on a given example, in which case weighted voting is used to obtain the final prediction.
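For orientation, the following is a schematic sketch of the standard covering loop and of decision-list classification just described. The helper find_best_rule stands in for CN2's beam search over rule conditions, and the Rule interface (covers, prediction) is an assumption of ours.

```python
# A schematic sketch of the covering algorithm and decision-list prediction;
# find_best_rule and the Rule interface are assumed placeholders.

def learn_decision_list(examples, find_best_rule):
    """Standard covering: repeatedly find a rule and remove the examples it covers."""
    rules, remaining = [], list(examples)
    while remaining:
        rule = find_best_rule(remaining)
        if rule is None:        # no acceptable rule found on the remaining examples
            break
        rules.append(rule)
        remaining = [e for e in remaining if not rule.covers(e)]
    return rules

def classify(rules, example, default):
    """Ordered rules: the first rule that fires makes the prediction."""
    for rule in rules:
        if rule.covers(example):
            return rule.prediction
    return default
```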
3.2 The search heuristic: compactness
The main difference between standard rule induction with the CN2 covering algorithm and the approach presented in this paper is the heuristic used for guiding the search for rules. The purpose of the heuristic is to evaluate candidate rules, so it should measure a property that is important for the quality of each rule and/or of the whole rule set.
One of the most important properties of rules (and other models) is their accuracy, and standard CN2 simply uses this as its heuristic. Accuracy, however, depends only on the target attributes. Our goal when developing predictive clustering rules was that, besides being accurate, the induced rules should cover compact subsets of examples, just as clustering does. For this purpose we need a heuristic that takes into account the target variables as well as the attributes.
As explained above, we will use the compactness, i.e., the average distance of the examples covered by a rule to the prototype of this set of examples. The compactness takes into account both the attribute and the target variable dimensions and is a weighted sum of the compactness along each of the dimensions (the latter are normalized to be between 0 and 1). At present, a single weight τ is used to put the emphasis on the targets (τ = 1) or the attributes (τ = 0); the target attributes should in general have higher weights in order to guide the search towards accurate rules. Because the attributes can in general be nominal or numeric, a different measure is needed for each type; these are then combined (summed) into a single measure.
The compactness of a set of examples along one nominal attribute is calculated as the average distance of all examples in the set from its prototype. The prototype is simply a tuple with the relative frequencies of each of the possible attribute values in the given set of examples. For an attribute with K possible values (v_1 to v_K), the prototype is of the form (f_1, f_2, ..., f_K). The distance of an example with value v_k from this prototype is equal to (1 − f_k). The compactness along one numeric attribute is equal to the mean absolute deviation of the attribute's values from their mean. The values of numeric attributes are normalized in advance. Note that our compactness heuristic is actually an incompactness heuristic, since smaller values mean more compact sets of examples.
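The following small sketch instantiates the two per-attribute measures just described: the average (1 − f_k) distance to the frequency prototype for a nominal attribute, and the mean absolute deviation from the mean for a numeric attribute (assumed to be normalized in advance). Function names are illustrative.

```python
# Per-attribute (in)compactness measures; illustrative names.
from collections import Counter

def nominal_compactness(values):
    n = len(values)
    freqs = {v: c / n for v, c in Counter(values).items()}  # prototype (f_1, ..., f_K)
    return sum(1.0 - freqs[v] for v in values) / n

def numeric_compactness(values):
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

# A pure set of values is maximally compact along that attribute:
assert nominal_compactness(["a", "a", "a"]) == 0.0
```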
3.3 Weighted covering
The standard covering algorithm removes the examples covered by a rule from the training set in each iteration. As a consequence, subsequent rules are constructed on ever smaller subsets of examples, which can be improperly biased and can lead to rules with small coverage. To overcome these shortcomings, we employ the weighted covering algorithm [11]. The difference is that once an example is covered by a new rule, it is not removed from the training set; instead, its weight is decreased. As a result, an already covered example is less likely to be covered again in subsequent iterations. We use the additive weighting scheme, which means that the weight of an example that has been covered m times is equal to 1/(1 + m). Finally, when an example has been covered more than a predefined number of times (five times in our experiments), it is completely removed from the training set.
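A sketch of one weighted covering step, under assumed data structures (cover counts and weights kept in dictionaries, a Rule object with a covers method), is given below.

```python
# One weighted covering step: increment the cover count m of each covered
# example, set its weight to 1 / (1 + m), and drop examples covered more than
# max_covered times. Data structures are assumptions of this sketch.

def weighted_covering_update(examples, cover_counts, weights, rule, max_covered=5):
    """cover_counts and weights are dicts keyed by example identity."""
    kept = []
    for e in examples:
        if rule.covers(e):
            m = cover_counts.get(id(e), 0) + 1
            cover_counts[id(e)] = m
            if m > max_covered:
                continue                      # remove the example entirely
            weights[id(e)] = 1.0 / (1 + m)    # additive weighting scheme
        kept.append(e)
    return kept
```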
3.4 Probabilistic classification
As already mentioned, the original CN2 algorithm can induce ordered or unordered rules. In the case of ordered rules (a decision list), classification is straightforward: we scan the rules one by one, and the first rule that fires on a given example is used for prediction. If no rule fires, the default rule is used. When classifying with unordered rules, CN2 collects the class distributions of all rules that fire on an example and uses them for weighted voting. We use the same probabilistic classification scheme, even though our unordered rules are not induced for each possible class value separately.
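The sketch below illustrates this probabilistic classification for a single target: the class distributions of all firing rules are combined by weighted voting and the most probable class is returned. Weighting the votes by the number of examples each rule covers is an assumption of this sketch, as is the Rule interface.

```python
# Probabilistic classification with unordered rules (single target); the
# coverage-based vote weighting and the Rule interface are assumptions.
from collections import defaultdict

def predict(rules, example, default):
    votes = defaultdict(float)
    for rule in rules:
        if rule.covers(example):
            for cls, p in rule.class_distribution.items():
                votes[cls] += rule.coverage * p   # weighted voting
    return max(votes, key=votes.get) if votes else default
```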
4 Related work
Predictive modelling and clustering are usually regarded as quite different tasks. While there are many approaches addressing each of predictive modelling and clustering, few approaches look at both or try to relate them. A different viewpoint is taken by Langley [10], who argues that predictive modelling and clustering have many similarities; this has motivated some recent research on combining prediction and clustering.
The approach presented in this paper is closely related to clustering trees [2], which also address the task of predictive clustering. The systems TILDE [2] and CLUS [3] use a modified top-down decision tree induction algorithm to construct clustering trees (which can predict the values of more than one target variable simultaneously). So far, however, the distances used in the TILDE and CLUS systems have considered attributes or classes separately, but not both together, even though the idea was presented in [1].
Our approach uses a rule-based representation for predictive clustering. As
such, it is closely related to approaches for rule induction, and among these
in particular CN2 [4]. However, it extends rule induction to the more general
task of multi-objective prediction. While some work exists on multi-objective classification with decision trees (e.g., [16]), the authors are not aware of any work on rule induction for multi-objective classification. Also, little work exists on rule-based regression (some recent examples come from the area of ILP, e.g., FORS [8]), let alone rule-based multi-objective regression (or multi-objective prediction in general, with mixed continuous and discrete targets).
Related to rule induction is subgroup discovery [11], which tries to find and
describe interesting groups of examples. While subgroup discovery algorithms
are similar to rule induction ones, they have introduced interesting innovations,
including the weighted covering approach used in our system.
Another related approach to combining clustering and classification is itemset constrained clustering [14, 15]. Here the attributes describing each example are separated into two groups, called feature items and objective attributes. Clustering is done on the objective attributes, but only clusters that can be described in terms of frequent itemsets (over the feature items) are constructed. As a result, each cluster can be described by a corresponding frequent itemset.
As in our approach, itemset constrained clustering tries to find groups of examples with small variance of the objective attributes. Compared to itemset constrained clustering, our approach allows both discrete (not only binary) and continuous variables on the feature/attribute side as well as on the objective/target side. Itemset constrained clustering is also related to subgroup discovery, as it tries to find interesting groups of examples, rather than a set of (overlapping) clusters that cover all examples.
Table 1. The attributes of the river water quality data set.
5 Experiments
The current implementation of predictive clustering rules has been tested on a water quality domain with multiple target attributes. Two sets of experiments have been performed. First, we compared the performance of our method when predicting multiple target attributes at once to its performance on the corresponding single target prediction tasks. In the second set of experiments we investigated the influence of the target weighting parameter τ on the accuracy and compactness of the induced rules.
5.1 The data set
The data set used in our experiments comprises biological and chemical data collected through regular monitoring of rivers in Slovenia. The data come from the Environmental Agency of the Republic of Slovenia, which performs water quality monitoring for most Slovenian rivers and maintains a database of water quality samples. The data cover a six-year period, from 1990 to 1995, and have been previously used in [5].
Table 2. The default accuracy and the accuracies of predictive clustering rules (PCR) used for multiple target prediction of all taxa together, for predicting plant and animal taxa separately, and for single target prediction of each taxon separately. The last two columns present the accuracies of predictive clustering trees (PCT) as taken from [6].
                                    PCR                          PCT
Target attribute        Default   All   Plants/Animals   Indiv.   Plants/Animals   Indiv.
Cladophora sp. 0.570 0.599 0.613 0.608 0.624 0.656
Gongrosira incrustans 0.733 0.733 0.733 0.733 0.725 0.732
Oedogonium sp. 0.712 0.713 0.709 0.713 0.710 0.713
Stigeoclonium tenue 0.794 0.793 0.794 0.795 0.800 0.814
Melosira varians 0.528 0.581 0.574 0.562 0.620 0.601
Nitzschia palea 0.558 0.675 0.649 0.691 0.723 0.684
Audouinella chalybea 0.753 0.753 0.753 0.753 0.752 0.752
Erpobdella octoculata 0.718 0.733 0.725 0.725 0.724 0.716
Gammarus fossarum 0.585 0.626 0.672 0.654 0.669 0.702
Baetis rhodani 0.676 0.676 0.676 0.683 0.680 0.676
Hydropsyche sp. 0.561 0.568 0.572 0.574 0.660 0.545
Rhyacophila sp. 0.685 0.685 0.681 0.702 0.730 0.673
Simulium sp. 0.633 0.638 0.633 0.638 0.618 0.625
Tubifex sp. 0.707 0.722 0.733 0.731 0.652 0.625
Biological samples are taken twice a year, once in summer and once in winter,
while physical and chemical analyses are performed several times a year for each
sampling site. The physical and chemical samples include the measured values
of 15 different parameters. The biological samples include a list of all taxa (plant
and animal species) present at the sampling site. All the attributes of the data
set are listed in Table 1. In total, 1060 water samples are available in the data
set. In our experiments we have considered the physical and chemical properties as independent attributes, and the presence/absence of each taxon as a target attribute.
5.2 Results
The first set of experiments was performed in order to test the appropriateness of predictive clustering rules for multiple target prediction. In all experiments the minimal number of examples covered by a rule was 20, and the weight of the target attributes (τ) was set to 0.9. The results of 10-fold cross-validation are given in Table 2. The first column gives the default accuracy for each target attribute (taxon); the second gives the accuracies of the multiple target model predicting all 14 class values together; and the third column gives the accuracies of two multiple target models, one predicting the plant taxa together (first seven classes) and another predicting the animal taxa together (last seven classes).
Table 3. The accuracies of predictive clustering rules used for multiple target prediction of all taxa together, with different target attribute weights (τ).
Target attribute        τ = 1.0   0.9   0.8   0.7   0.6   0.5
Cladophora sp. 0.594 0.599 0.608 0.602 0.587 0.585
Gongrosira incrustans 0.733 0.733 0.733 0.733 0.733 0.733
Oedogonium sp. 0.713 0.713 0.713 0.711 0.709 0.709
Stigeoclonium tenue 0.795 0.793 0.793 0.795 0.793 0.793
Melosira varians 0.568 0.581 0.581 0.571 0.562 0.552
Nitzschia palea 0.685 0.675 0.664 0.676 0.669 0.660
Audouinella chalybea 0.751 0.753 0.753 0.753 0.753 0.753
Erpobdella octoculata 0.721 0.733 0.735 0.738 0.736 0.742
Gammarus fossarum 0.628 0.626 0.641 0.639 0.650 0.658
Baetis rhodani 0.676 0.676 0.676 0.676 0.676 0.676
Hydropsyche sp. 0.586 0.568 0.562 0.567 0.561 0.563
Rhyacophila sp. 0.686 0.685 0.687 0.685 0.685 0.685
Simulium sp. 0.633 0.638 0.633 0.632 0.633 0.633
Tubifex sp. 0.728 0.722 0.728 0.737 0.736 0.738
The fourth column presents the accuracies of single target attribute models; one model was built for each taxon separately. The last two columns present the accuracies of predictive clustering trees (PCT) for multiple and single target prediction, as taken from [6]. The last row in the table gives average accuracies across all 14 target attributes. Looking at these average accuracies, we can see that the performance of a model predicting all classes together is only slightly worse than the performance of the single target models, while the performance of the models predicting plants and animals together lies somewhere in between. When comparing predictive clustering rules to predictive clustering trees, the performance of the latter is somewhat better, but the results are not directly comparable because of the different experimental setup used when evaluating PCTs.
The aim of the second set of experiments was to evaluate the influence of the target weighting parameter τ on the accuracy and compactness of the induced rules (Table 3). Rules were induced for predicting all taxa together with six different values of the τ parameter. At the bottom of the table we report the average accuracies of 10-fold cross-validation, the average compactness of the subsets of examples covered by the rules in each model, and the average coverage of the rules in each model. The rules induced with a larger weighting of the non-target attributes (smaller τ) are on average more compact (a smaller compactness value means more compact subsets) and on average cover more examples. This trend is quite clear for values of τ from 1.0 down to 0.8 or 0.7. Small weights for non-targets slightly
increase both overall compactness and accuracy, while large weights increase
overall compactness at the expense of accuracy.
6 Conclusions and further work
In this paper, we have considered the data mining task of predictive clustering. This is a very general task that contains many features of (and thus to a large extent generalizes) the tasks of predictive modelling and clustering. While this task has been considered before, we have defined it both more precisely and in a more general form (i.e., considering distances on both target and attribute variables, and considering clustering rules in addition to trees).
We have introduced the notion of clustering rules and focused on the task
of learning predictive clustering rules for multi-objective prediction. The task
of inducing PCRs generalizes the task of rule induction, extending it to multi-objective classification, regression, and prediction in general. It also generalizes some forms of distance-based clustering, in particular itemset constrained
clustering. Given this, we believe this task is important for constraint-based data
mining and inductive databases. Namely, IDBs are the most promising approach
to finding a general framework for data mining: bringing the two most common
data mining tasks closer together is a step in this direction. Also, constraint-based clustering is certainly an under-researched topic in constraint-based data mining.
We have implemented a preliminary version of a system for learning PCRs
for multi-objective classification. We have also performed some preliminary ex-
periments on a real-world data set. The results show that a single rule set for multi-objective classification can be as accurate as a collection of rule sets for the individual prediction of each target. The accuracies are also comparable to those of predictive clustering trees. Experiments varying the weight of target vs. non-target attributes in the compactness heuristic used in the search for rules show that small weights for the non-targets slightly increase both overall compactness and accuracy, while large weights increase overall compactness at the expense of accuracy.
Note, however, that many more experiments are necessary to evaluate the proposed paradigm and implementation. These should include experiments on additional data sets, covering both single-objective and multi-objective prediction; in the multi-objective case, classification, regression and a mixture thereof should be considered. Also, a comparison to other approaches to constrained clustering would be in order.
Other directions for further work concern the development of the PCR paradigm and its implementation. At present, our implementation only considers multi-objective classification, but it can easily be extended to regression problems, and also to mixed classification/regression problems. Currently, the heuristic guiding the search for rules does not take the number of covered examples into account. Consequently, the construction of overly specific rules can only be prevented by setting a minimum number of examples to be covered by a rule. Adding a coverage-dependent term to the heuristic would enable the
induction of compact rules with sufficient coverage. Another possibility is the use of some form of significance testing, analogous to the significance testing of the target variable distribution employed by CN2.
Finally, the selection of the weights used in the distance measure (and the compactness heuristic) is an open issue. One aspect is the weighting of target vs. non-target variables. Another is the assignment of relevance-based weights to the attributes: while this has been considered for single-objective classification, it needs to be extended to multi-objective prediction.
References
1. Blockeel, H. (1998): Top-down induction of first order logical decision trees. PhD
thesis, Department of Computer Science, Katholieke Universiteit, Leuven.
2. Blockeel, H., De Raedt, L., and Ramon, J. (1998): Top-down induction of clustering
trees. Proceedings of the 15th International Conference on Machine Learning, pages
55–63, Morgan Kaufmann.
3. Blockeel, H. and Struyf, J. (2002): Efficient algorithms for decision tree cross-
validation, Journal of Machine Learning Research, 3(Dec):621–650, Microtome Pub-
lishing.
4. Clark, P. and Niblett, T. (1989): The CN2 Induction Algorithm, Machine Learning,
3:261–283, Kluwer.
5. Džeroski, S., Demšar, D., and Grbović, J. (2000): Predicting chemical parameters
of river water quality from bioindicator data. Applied Intelligence, 13(1): 7–17.
6. Džeroski, S., Blockeel, H., and Grbović, J. (2001): Predicting river water communities
with logical decision trees. Presented at the Third European Ecological Modelling
Conference, Zagreb, Croatia.
7. Flach, P. and Lavrač, N. (1999): Rule induction. In Intelligent Data Analysis, eds.
Berthold, M. and Hand, D. J., pages 229–267, Springer.
8. Karalič, A. and Bratko, I. (1997): First Order Regression. Machine Learning, 26:147–
176, Kluwer.
9. Kaufman, L. and Rousseeuw, P. J. (1990): Finding groups in data: An introduction
to cluster analysis, John Wiley & Sons.
10. Langley, P. (1996): Elements of Machine Learning. Morgan Kaufmann.
11. Lavrač, N., Kavšek, B., Flach, P., and Todorovski, L. (2004): Subgroup discovery
with CN2-SD, Journal of Machine Learning Research, 5(Feb):153–188, Microtome
Publishing.
12. Michalski, R. S. (1980): Knowledge acquisition through conceptual clustering: A
theoretical framework and algorithm for partitioning data into conjunctive concepts.
International Journal of Policy Analysis and Information Systems, 4:219–243.
13. Quinlan, J. R. (1993): C4.5: Programs for Machine Learning. Morgan Kaufmann.
14. Sese, J. and Morishita, S. (2004): Itemset Classified Clustering. Proceedings of the
Eighth European Conference on Principles and Practice of Knowledge Discovery in
Databases (PKDD’04), pages 398–409, Springer.
15. Sese, J., Kurokawa, Y., Kato, K., Monden, M., and Morishita, S. (2004): Con-
strained clusters of gene expression profiles with pathological features. Bioinfor-
matics.
16. Suzuki, E., Gotoh, M., and Choki, Y. (2001): Bloomy Decision Tree for Multi-
objective Classification. Proceedings of the Fifth European Conference on Princi-
ples and Practice of Knowledge Discovery in Databases (PKDD'01), pages 436–447,
Springer.