A Generic Framework For Rule-Based Classification
1 Introduction
Many pattern extraction methods have been proposed in the field of constraint-based
data mining. On the one hand, individual patterns are declaratively specified
under a local model using various classes of constraints. Local pattern mining
algorithms, e.g. typical association rule mining algorithms, extract individual
patterns, each of which describes some portion of the underlying database from
which it has been generated. The descriptive properties of the local patterns are
independent of each other. On the other hand, any arbitrary combination of
individual patterns can be regarded as a global pattern [14]. In a classification
task, one must also describe how the global pattern is applied to comply with
the user's request, i.e. class label prediction for unclassified data objects.
Classification is therefore a predictive global modeling task among data mining
problems. Given a set of (labeled) training examples, i.e. a training database,
the classification task constructs a classifier. A classifier is a global model that
not only describes the whole training database with some level of accuracy, but
is also used to predict the class label of unlabeled data objects. Intuitively, the
problem of finding a (minimal) set of patterns that describes the whole training
database while achieving maximum accuracy is intractable; it remains intractable
even for the binary-class classification problem [10] w.r.t. the domain size of the
training data set. Moreover, although local pattern extraction methods can be
adapted and exploited for classifier construction, the problem remains intractable
for general measure functions: when the pattern evaluation measure is arbitrary,
it lacks the useful properties that could be exploited to reduce the search space
of patterns.
2 Formal Framework
In this section, we give the formal definitions of the notions used throughout the
paper to describe the rule-based classification process. This process consists of
using a training set of labeled objects (called a dataset of examples in what follows,
see Section 2.1), from which classification rules are extracted (see Section 2.2) by
means of basic operations (see Section 2.3) in order to build a classifier (see
Section 2.4), which is then used to predict the class label of a given unlabeled
object (see Section 2.5).
Data examples Let Class be a set of class labels. A data example (example for
short) over a given schema A is a pair ⟨o, c⟩ where o ∈ O, A is the schema of o
and c ∈ Class. Given an example e = ⟨o, c⟩, we note cl(e) = c. In what follows,
we consider examples over a fixed schema A, and we denote by E the set of all
examples over A.
A data set E is a subset of E, for which |{c | ∃o ∈ O, ⟨o, c⟩ ∈ E}| is denoted
by NbClass. Given a class label cj ∈ Class and a data set E ⊆ E, we denote by
Ej the set of examples in E of class cj, i.e., Ej = {e ∈ E | cl(e) = cj}.
Rules Given the set of examples E over a schema A, a rule is a pair ⟨o, c⟩, where
o ∈ O, the schema of o is a schema {A1, . . . , An} ⊆ A and c ∈ Class. A rule is
noted o → c, and we use |r| to denote |sch(o)|. We denote by R the set of rules.
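To make these notions concrete, the following is a minimal sketch, in Python, of one possible encoding of examples and rules. The attribute/value representation, the covers test and all names are illustrative assumptions made for the sketch, not part of the framework.

```python
# Illustrative encoding (an assumption, not the paper's): an object over a schema
# is a set of attribute/value pairs; a rule's body ranges over a sub-schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    obj: frozenset   # attribute/value pairs of the object o
    cls: str         # class label cl(e)

@dataclass(frozen=True)
class Rule:
    body: frozenset  # attribute/value pairs over a sub-schema of A
    cls: str         # predicted class label (the rule reads body -> cls)

def covers(rule: Rule, example: Example) -> bool:
    """A rule covers an example when its body is contained in the object."""
    return rule.body <= example.obj

def size(rule: Rule) -> int:
    """|r|: the number of attributes in the rule's schema."""
    return len(rule.body)

# Example: the rule {outlook=sunny} -> play covers the object below.
e = Example(frozenset({("outlook", "sunny"), ("windy", "no")}), "play")
r = Rule(frozenset({("outlook", "sunny")}), "play")
assert covers(r, e) and size(r) == 1
```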
We now introduce the two basic operations used in the description of the classification
process. Note that we give a generic definition of these operations, in the sense
that they are given for any language L.
The first operation is the theory computation operation Th, which extracts
from a data set the elements of a given language L satisfying a given selection
predicate.
The second operation, Topk, extracts the k best elements of a language L w.r.t.
a given order on L. This order may depend on a given data set; to this end we
first define the notion of a data dependent order.
Data dependent order (E-order) A data dependent order (or E-order) on a given
set L is a relation α ⊆ L × L × 2^E such that, for a given E ∈ 2^E, the relation
α[E] = {⟨ϕ, ϕ′⟩ | ⟨ϕ, ϕ′, E⟩ ∈ α} is an order on L. In what follows, we note
α(ϕ, ϕ′, E) = true if ⟨ϕ, ϕ′, E⟩ ∈ α.
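As an illustration, the two operations and an E-order can be sketched as follows in Python. The concrete signatures assumed here (a predicate q(phi, E) returning a boolean, and a comparison order(phi1, phi2, E) that holds when the first element is strictly better) are choices made for the sketch only.

```python
# Minimal sketch of Th and Top_k for an arbitrary language L given as an iterable.
from functools import cmp_to_key

def theory(language, q, E):
    """Th(L, q, E): the elements of L satisfying the selection predicate q on E."""
    return [phi for phi in language if q(phi, E)]

def topk(language, order, E, k):
    """Top_k(L, order, E): the k best elements of L w.r.t. the E-order `order`."""
    def cmp(a, b):
        if order(a, b, E):      # a strictly better than b for this data set
            return -1
        if order(b, a, E):
            return 1
        return 0
    return sorted(language, key=cmp_to_key(cmp))[:k]

# Example E-order on integers: larger values come first (E is simply ignored here).
best_two = topk([3, 9, 1, 7], lambda a, b, E: a > b, E=None, k=2)
assert best_two == [9, 7]
```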
Classifier inclusion Let C1 = ⟨R1, <R1⟩ and C2 = ⟨R2, <R2⟩ be two classifiers.
We say that C1 is included in C2, noted C1 ⊑ C2, if R1 ⊆ R2 and <R1 ⊆ <R2.
Concatenation Let C1 = ⟨R1, <R1⟩ and C2 = ⟨R2, <R2⟩ be two classifiers such
that R1 ∩ R2 = ∅. The concatenation of the two classifiers is defined as C1 C2 =
⟨R1 ∪ R2, <⟩, where < = <R1 ∪ <R2 ∪ (R1 × R2). Note that the concatenation
operator does not commute, since × does not commute.
Union Let C1 = ⟨R1, <R1⟩ and C2 = ⟨R2, <R2⟩ be two classifiers such that
R1 ∩ R2 = ∅. The union of the two classifiers is defined as C1 ∪ C2 = ⟨R1 ∪ R2, <∪⟩,
where <∪ = <R1 ∪ <R2.
Difference Let C1 = ⟨R1, <R1⟩ and C2 = ⟨R2, <R2⟩ be two classifiers. The
difference C1 \ C2 is the classifier ⟨R1 \ R2, <R1 \ {⟨r, r′⟩ ∈ <R1 | r ∈ R2 ∨ r′ ∈ R2}⟩.
By abuse of notation we sometimes write C1 \ R2 to denote C1 \ ⟨R2, ∅⟩ and
R2 \ C1 to denote R2 \ R1.
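Under an illustrative encoding of a classifier as a pair of a rule set and a set of order pairs (an assumption made only for the sketch), the three operators can be written directly:

```python
# A classifier is sketched as (rules, order) where `order` is the set of pairs
# (r, r') such that r <_R r'.  All encodings here are illustrative assumptions.

def concat(c1, c2):
    """Concatenation: disjoint rule sets; every rule of C1 precedes every rule of C2."""
    (r1, o1), (r2, o2) = c1, c2
    assert not (r1 & r2), "concatenation is only defined for disjoint rule sets"
    return (r1 | r2, o1 | o2 | {(a, b) for a in r1 for b in r2})

def union(c1, c2):
    """Union: disjoint rule sets; only the two original orders are kept."""
    (r1, o1), (r2, o2) = c1, c2
    assert not (r1 & r2)
    return (r1 | r2, o1 | o2)

def difference(c1, c2):
    """Difference: drop the rules of C2 and every order pair involving them."""
    (r1, o1), (r2, _) = c1, c2
    return (r1 - r2, {(a, b) for (a, b) in o1 if a not in r2 and b not in r2})
```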
2.5 Prediction
Finally, we give the definitions used for describing the prediction process. A
class prediction operator is a mapping from O × C to E. Below are two examples
of class prediction operators.
Best rule prediction Let C = ⟨R, <R⟩ be a classifier such that <R is a total order,
and let o be an object. BestRule(o, C) = ⟨o, cl(rp)⟩ where rp = max<R {r ∈ R}.
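A sketch of best-rule prediction follows, reusing the Rule encoding sketched earlier. It additionally assumes, as in CBA-style classifiers, that only rules covering the object are candidates and that the total order is given as a comparison function; both are assumptions for the sketch.

```python
def best_rule_predict(obj, classifier, covers, better):
    """Return the class of the <_R-maximal rule covering obj (None if no rule covers it).
    `better(r1, r2)` is assumed to return True when r1 is above r2 in the total order."""
    rules, _ = classifier
    candidates = [r for r in rules if covers(r, obj)]
    if not candidates:
        return None     # typically handled by a default rule appended to the classifier
    best = candidates[0]
    for r in candidates[1:]:
        if better(r, best):
            best = r
    return best.cls
```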
– R is a rule language.
– E is a training database.
– K is the number of rules that should be extracted at each iteration of the
algorithm.
– OrdGen is a function used to generate an E-order on R from a classifier
(i.e., mainly a set of rules). For instance, if C is a classifier, and E is the
set of examples not yet covered by the rules of C, OrdGen(C) = α where α
is such that, for every r, r′ ∈ R, α(r, r′, E) = true iff the confidence of r is
greater than the confidence of r′ or, if they are equal, the size of r is smaller
than the size of r′.
– PredGen is a function used to describe a selection predicate on R under
which the extraction of a set of rules takes place. For example, if C is
a classifier and E is the set of examples not yet covered by the rules of C,
PredGen(C) = q where q is such that, for every r ∈ R, we have q(r, E) = true
iff the support of r is greater than a given threshold.
– V is a function used to describe a selection predicate on C, in order to
evaluate the quality (or accuracy) of a classifier w.r.t. the whole training
dataset. For example, if C is a classifier and E a dataset, V(C, E) can be
such that it outputs true iff C covers all examples of E.
– O is an E-order on C. For example, if C1 and C2 are two classifiers and E
is a dataset, O(C1, C2, E) = true iff the number of examples covered by C1
is greater than the number covered by C2.
At each iteration, the classifier built during the previous steps is extended with the
newly extracted best rules along with the order relation on them (Line 7). Finally,
out of all the classifiers constructed during the loop, the best one w.r.t. O is
returned (Lines 9 and 10).
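Since the pseudo-code of ICCA is referred to above only by line numbers, the following Python sketch merely illustrates the overall loop, reusing the helpers sketched earlier (theory, topk, concat); the stopping policy, the empty-order extension and the classifier encoding are assumptions made for the sketch.

```python
def icca(R, E, K, OrdGen, PredGen, V, O, theory, topk, concat):
    """Illustrative loop: repeatedly extract the K best admissible rules, extend the
    classifier, and finally return the best classifier seen w.r.t. the E-order O."""
    current = best = (frozenset(), frozenset())        # empty classifier <{}, {}>
    while not V(current, E):                           # stop once current satisfies V
        q = PredGen(current)                           # selection predicate for this step
        alpha = OrdGen(current)                        # E-order for this step
        candidates = [r for r in theory(R, q, E) if r not in current[0]]
        new_rules = topk(candidates, alpha, E, K)      # the K best rules w.r.t. alpha
        if not new_rules:
            break                                      # nothing admissible is left
        # the order among the newly added rules is omitted here for brevity
        current = concat(current, (frozenset(new_rules), frozenset()))
        if O(current, best, E):                        # keep the best classifier so far
            best = current
    return best
```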
The proposed operators and ICCA uniformly describe, in as declarative a fashion
as possible, various classification approaches and algorithms, each having its
own requirements and properties. To the best of our knowledge, this is the first
time that all the related definitions are formally specified in the classification
context. In order to illustrate the generality of our framework, the next section
shows how our approach integrates and represents the requirements of different
classification methods using ICCA. Optimization aspects are also discussed in
Section 5.
4.1 AQ
4.2 CN2
By comparison with AQ, the CN2 algorithm directly builds one classifier for
multi-class problems. Formally, the CN2 classifier, denoted by CCN2, is defined
by:

CCN2 = C′CN2 Cdefault(E \ covered(E, C′CN2)) with
C′CN2 = ICCA(R, E, 1, OrdGenCN2, PredGenCN2, VCN2, OCN2)
where the functions OrdGenCN2, PredGenCN2, VCN2 and OCN2 are defined for
every classifier C, C1, C2 ∈ C as follows (a sketch in code is given after the list):
– OrdGenCN2(C) = αCN2 where for every r, r′ ∈ R and E ⊆ E, we have
αCN2(r, r′, E) = true iff:
• entropy(r, E′) < entropy(r′, E′), or
• entropy(r, E′) = entropy(r′, E′) and |r| < |r′|
where E′ = E \ covered(E, C).
– PredGenCN2(C) = qCN2 where for every r ∈ R and E ⊆ E, qCN2(r, E) =
true iff F(r, E′) ≥ τ, where F is a significance measure such as χ2 (or the
likelihood ratio), τ is a minimum threshold and E′ = E \ covered(E, C).
– For every E ⊆ E, VCN2(C, E) = true iff E = covered(E, C).
– For every E ⊆ E, OCN2(C1, C2, E) = true iff |C1| > |C2|.
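The two generator functions can be sketched as closures over the current classifier; the helpers entropy(rule, examples), covered(examples, classifier) and the significance measure F are assumed to be provided, and all names are illustrative.

```python
def ordgen_cn2(C, covered, entropy):
    """OrdGen_CN2(C): order rules by increasing entropy on the uncovered examples,
    breaking ties by smaller rule size."""
    def alpha(r1, r2, E):
        uncovered = [e for e in E if e not in covered(E, C)]
        h1, h2 = entropy(r1, uncovered), entropy(r2, uncovered)
        return h1 < h2 or (h1 == h2 and len(r1.body) < len(r2.body))
    return alpha

def predgen_cn2(C, covered, F, tau):
    """PredGen_CN2(C): keep rules whose significance on the uncovered examples
    reaches the threshold tau."""
    def q(r, E):
        uncovered = [e for e in E if e not in covered(E, C)]
        return F(r, uncovered) >= tau
    return q
```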
4.3 CBA
where the functions OrdGenCBA, PredGenCBA, VCBA, and OCBA are defined
for every classifier C, C1, C2 ∈ C by:
As for CN2, since the resulting classifier CCBA is totally ordered, CBA can use
the BestRule prediction operator. For every object o ∈ O, we have:
PredCBA(o, CCBA) = BestRule(o, CCBA).
CMAR and CorClass In this paper, we do not describe CMAR in detail
since CMAR is mainly an extension of CBA. In comparison with CBA, CMAR
selects only positively correlated rules (by a χ2 test). Moreover, instead of
removing an example as soon as it is covered by a rule, it only removes examples
that are covered by more than δ rules, where δ is a parameter of CMAR. Finally,
in the prediction operator used by CMAR, w(r, E) is the weighted χ2 instead of
the confidence; this measure is used to overcome the minority-class favoring
problem.
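The δ-coverage rule selection described above can be sketched as follows; the ordering of the rules by precedence and the covers test are assumptions carried over from the earlier sketches, not a transcription of CMAR itself.

```python
def delta_coverage(rules_by_precedence, examples, covers, delta):
    """Keep a rule if it still covers some remaining example; an example is only
    discarded once it has been covered by more than delta kept rules."""
    counts = {i: 0 for i in range(len(examples))}
    remaining = dict(enumerate(examples))
    kept = []
    for r in rules_by_precedence:
        hit = [i for i, e in remaining.items() if covers(r, e)]
        if hit:
            kept.append(r)
            for i in hit:
                counts[i] += 1
                if counts[i] > delta:
                    del remaining[i]
    return kept
```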
In comparison with CBA and CMAR, CorClass directly extracts the k rules
with the highest significance measure (χ2, information gain, etc.) on the data
set, meaning that ICCA iterates only once. On the other hand, the classifiers
built by CorClass are evaluated using different prediction operators (BestRule
or aggregate prediction with different weighted combinations of rules).
4.4 FOIL
Like the AQ algorithm, the FOIL algorithm has to be applied to each class for
multi-class problems. Moreover, in order to compare rules, FOIL uses a specific
gain measure, defined as follows.
Definition 2. (Foil Gain). Given two rules r1 and r2 such that r2 ≺R r1, the
gain to specialize r2 to r1 w.r.t. a set of examples E is defined by:

gain(r1, r2, E) = |P1| (log(|P1| / (|P1| + |N1|)) − log(|P2| / (|P2| + |N2|)))
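A direct transcription of this formula is sketched below; it assumes that |Pi| and |Ni| count the positive and negative examples of E covered by ri, since their definition is not restated here.

```python
import math

def foil_gain(p1, n1, p2, n2):
    """gain(r1, r2, E) with p_i = |P_i| and n_i = |N_i| (assumed counts of
    positive/negative examples covered by r_i)."""
    if p1 == 0 or p2 == 0:
        return 0.0            # guard: the logarithms are undefined without positives
    return p1 * (math.log(p1 / (p1 + n1)) - math.log(p2 / (p2 + n2)))

# Example: specializing from a rule covering 10+/10- to one covering 6+/1-.
assert foil_gain(6, 1, 10, 10) > 0
```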
Formally, the FOIL classifier, denoted by CFOIL, is specified by
CFOIL = ⋃_{j=1}^{NbClass} CFOIL^j, where for every j ∈ {1, . . . , NbClass}:

CFOIL^j = ICCA(R, E, 1, OrdGenFOIL^j, PredGenFOIL^j, VFOIL^j, OFOIL^j)
where the functions OrdGenFOIL^j, PredGenFOIL^j, VFOIL^j, and OFOIL^j are
defined for any classifiers C, C1, C2 by:
where r1 ∧ r2 is the most specific rule that is more general than r1 and r2
(r1 ∧ r2 = min≺R {r ∈ R | r ≺R r1, r ≺R r2}) and E′ = Ej \ covered(Ej, C).
– PredGenFOIL^j(C) = qFOIL^j where for every r ∈ R and E ⊆ E, we have
qFOIL^j(r, E) = true iff cl(r) = cj and |r| ≤ L where L is a parameter of
FOIL.
– For every E ⊆ E, VFOIL^j(C, E) = true iff Ej = covered(Ej, C).
– For every E ⊆ E, OFOIL^j(C1, C2, E) = true iff |C1| > |C2|.
Given a classifier CFOIL, FOIL can use the prediction operator defined for
every object o ∈ O by: PredictFOIL(o, CFOIL) = AggPred+∞,w,agg(o, CFOIL),
where for every rule r ∈ R and data set E ⊆ E, w(r, E) is the confidence of rule
r in E, and agg is the sum aggregation function.
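A sketch of this aggregated prediction follows. It assumes that AggPred sums, per class, the weight w(r, E) (here the confidence) of every rule of the classifier covering the object and predicts the class with the largest total; the function names are illustrative.

```python
from collections import defaultdict

def agg_predict(obj, rules, covers, confidence, E):
    """Sum the weight w(r, E) (here: confidence) of every covering rule per class
    and return the class with the highest aggregated weight."""
    totals = defaultdict(float)
    for r in rules:
        if covers(r, obj):
            totals[r.cls] += confidence(r, E)
    return max(totals, key=totals.get) if totals else None
```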
4.5 HARMONY
6 Conclusion
References
1. Peter Clark and Tim Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–
283, 1989.
2. William W. Cohen. Fast effective rule induction. In Armand Prieditis and Stuart
Russell, editors, Proc. of the 12th International Conference on Machine Learning,
pages 115–123, Tahoe City, CA, July 1995. Morgan Kaufmann.
3. Guozhu Dong, Xiuzhen Zhang, Limsoon Wong, and Jinyan Li. CAEP: Classification
by aggregating emerging patterns. In Discovery Science, pages 30–42, 1999.
4. Arno J. Knobbe and Eric K. Y. Ho. Pattern teams. In PKDD, pages 577–584,
2006.
5. Jinyan Li, Guozhu Dong, and Kotagiri Ramamohanarao. Instance-based classifi-
cation by emerging patterns. In PKDD ’00: Proceedings of the 4th European Con-
ference on Principles of Data Mining and Knowledge Discovery, pages 191–200,
London, UK, 2000. Springer-Verlag.
6. Jinyan Li, Guozhu Dong, and Kotagiri Ramamohanarao. Making use of the most
expressive jumping emerging patterns for classification. In Pacific-Asia Conference
on Knowledge Discovery and Data Mining, pages 220–232, 2000.
7. Wenmin Li, Jiawei Han, and Jian Pei. CMAR: Accurate and efficient classification
based on multiple class-association rules. In ICDM '01: Proceedings of the 2001
IEEE International Conference on Data Mining, pages 369–376, Washington, DC,
USA, 2001. IEEE Computer Society.
8. Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association
rule mining. In Knowledge Discovery and Data Mining, pages 80–86, 1998.
9. Ryszard S. Michalski. On the quasi-minimal solution of the general covering prob-
lem. In Proceedings of the V International Symposium on Information Processing
(FCIP 69), Switching Circuits, volume A3, pages 125–128, 1969.
10. Yasuhiko Morimoto, Takeshi Fukuda, Hirofumi Matsuzawa, Takeshi Tokuyama,
and Kunikazu Yoda. Algorithms for mining association rules for binary segmen-
tations of huge categorical databases. In VLDB '98: Proceedings of the 24th In-
ternational Conference on Very Large Data Bases, pages 380–391, San Francisco,
CA, USA, 1998. Morgan Kaufmann Publishers Inc.
11. J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
12. J. Ross Quinlan. C4.5: programs for machine learning. Morgan Kaufmann Pub-
lishers Inc., San Francisco, CA, USA, 1993.
13. J. Ross Quinlan and R. Mike Cameron-Jones. FOIL: A midterm report. In Machine
Learning: ECML-93, European Conference on Machine Learning, Proceedings, vol-
ume 667, pages 3–20. Springer-Verlag, 1993.
14. Luc De Raedt and Albrecht Zimmermann. Constraint-based pattern set mining.
In SDM, 2007.
15. Ryszard S. Michalski, Igor Mozetic, Jiarong Hong, and Nada Lavrac. The AQ15 in-
ductive learning system: An overview and experiments. In Reports of the Intelligent
Systems Group, ISG 86-20, UIUCDCS-R-86-1260, 1986.
16. Jianyong Wang and George Karypis. HARMONY: Efficiently mining the best rules
for classification. In SDM, 2005.
17. Xiaoxin Yin and Jiawei Han. CPAR: Classification based on predictive association
rules. In SDM, 2003.
18. Albrecht Zimmermann and Luc De Raedt. CorClass: Correlated association rule
mining for classification. In Discovery Science, pages 60–72, 2004.