Discriminative Frequent Pattern Analysis for Effective Classification∗

Hong Cheng†   Xifeng Yan‡   Jiawei Han†   Chih-Wei Hsu†

† University of Illinois at Urbana-Champaign
‡ IBM T. J. Watson Research Center
{hcheng3, hanj, chsu}@cs.uiuc.edu, [email protected]

∗ The work was supported in part by the U.S. National Science Foundation NSF IIS-05-13678/06-42771 and NSF BDI-05-15813.

Abstract

The application of frequent patterns in classification appeared in sporadic studies and achieved initial success in the classification of relational data, text documents and graphs. In this paper, we conduct a systematic exploration of frequent pattern-based classification and provide solid reasons supporting this methodology. It is well known that feature combinations (patterns) can capture more underlying semantics than single features. However, inclusion of infrequent patterns may not significantly improve the accuracy, due to their limited predictive power. By building a connection between pattern frequency and discriminative measures such as information gain and Fisher score, we develop a strategy to set the minimum support in frequent pattern mining for generating useful patterns. Based on this strategy, coupled with a proposed feature selection algorithm, discriminative frequent patterns can be generated for building high quality classifiers. We demonstrate that the frequent pattern-based classification framework can achieve good scalability and high accuracy in classifying large datasets. Empirical studies indicate that significant improvement in classification accuracy is achieved (up to 12% on UCI datasets) using the so-selected discriminative frequent patterns.

1. Introduction

Frequent pattern mining has been a focused theme in data mining research, with a large number of scalable methods proposed for mining various kinds of patterns, including itemsets [2, 10, 27], sequences [3, 16, 26] and graphs [11, 22]. Frequent patterns have found broad applications in areas like association rule mining, indexing, and clustering [1, 23, 20]. The application of frequent patterns in classification has also achieved some success in the classification of relational data [14, 13, 25, 6, 19], text [15], and graphs [7].

Frequent patterns reflect strong associations between items and carry the underlying semantics of the data. They are potentially useful features for classification. In this paper, we systematically investigate the framework of frequent pattern-based classification, where a classification model is built in the feature space of single features as well as frequent patterns. The idea of frequent pattern-based classification has been exploited by previous studies in different domains, including: (1) associative classification [14, 13, 25, 6, 19], where association rules are generated and analyzed for classification; and (2) graph classification [7], text categorization [15] and protein classification [12], where subgraphs, phrases, or substrings are used as features.

All these related studies demonstrate, to some extent, the usefulness of frequent patterns in classification. Although it is known that frequent patterns are useful, there is a lack of theoretical analysis of their principles in classification. The following critical questions remain unexplored.

• Why are frequent patterns useful for classification? Why do frequent patterns provide a good substitute for the complete pattern set?

• How does frequent pattern-based classification achieve both high scalability and accuracy for the classification of large datasets?

• What is the strategy for setting the minimum support threshold?

• Given a set of frequent patterns, how should we select high quality ones for effective classification?

In this paper, we will systematically answer the above questions.
Feature combinations are shown to be useful for classification by mapping data to a higher dimensional space. For example, word phrases can improve the accuracy of document classification. Given a categorical dataset D with n features, we can explicitly enumerate all 2^n feature combinations and use them in classification. However, there are two significant drawbacks to this approach. First, since the number of feature combinations is exponential in the number of single features, in many cases it is computationally intractable to enumerate them when the number of single features is large (the scalability issue). Second, inclusion of combined features that appear rarely could decrease the classification accuracy due to the "overfitting" issue: such features are not representative. The first problem can be partially solved by kernel tricks, which derive a subset of combined features based on parameter tuning. However, the kernel approach requires an intensive search for good parameters to avoid overfitting.

Through analysis, we found that the discriminative power of a low-support feature is bounded by a low value due to its limited coverage of the dataset; hence the contribution of low-support features to classification is limited, which justifies the usage of frequent patterns in classification. Furthermore, existing frequent pattern mining algorithms can facilitate the pattern generation, thus solving the scalability issue in the classification of large datasets.

As to the minimum support (denoted as min sup) threshold setting in frequent pattern mining, a mapping is built between the support threshold and discriminative measures such as information gain and Fisher score, so that features filtered by an information gain threshold cannot exceed the corresponding min sup threshold either. This result can be used to set min sup for generating useful patterns.

Since frequent patterns are generated solely based on frequency, without considering their predictive power, the use of frequent patterns without feature selection will still result in a huge feature space. This might not only slow down the model learning process but, even worse, deteriorate the classification accuracy (another kind of overfitting issue: there are too many features). In this paper, we demonstrate that feature selection is necessary to single out a small set of discriminative frequent patterns, which is essential for high quality classifiers. Coupled with feature selection, frequent pattern-based classification is able to solve the scalability issue and the overfitting issue smoothly and achieve excellent classification accuracy.

In summary, our contributions include:

• We propose a framework of frequent pattern-based classification. By analyzing the relationship between pattern frequency and its predictive power, we demonstrate that frequent patterns provide high quality features for classification.

• Frequent pattern-based classification can exploit state-of-the-art frequent pattern mining algorithms for feature generation, thus achieving much better scalability than the method of enumerating all feature combinations.

• We establish a formal connection between our framework and an information gain-based feature selection approach, and show that the min sup threshold is equivalent to an information gain threshold for filtering low quality features. Such an analysis suggests a strategy for setting min sup.

• An effective and efficient feature selection algorithm is proposed to select a set of frequent and discriminative patterns for classification.

The rest of the paper is organized as follows. Section 2 gives the problem formulation. In Section 3, we provide a framework for frequent pattern-based classification. We study the usefulness of frequent patterns, establish a connection between support and feature filtering measures, discuss the minimum support setting strategy, and propose a feature selection algorithm. Extensive experimental results are presented in Section 4, and related work is discussed in Section 5, followed by conclusions in Section 6.

2 Problem Formulation

Assume a dataset has k categorical attributes, where each attribute has a set of values, and m classes C = {c1, ..., cm}. Each (attribute, value) pair is mapped to a distinct item in I = {o1, ..., od}. Assume a pair (att, val) → oi, where att is an attribute and val is a value. Let x be the feature vector of a data point s. Then xi = 1 if att(s) = val, and xi = 0 if att(s) ≠ val. For numerical attributes, the continuous values are discretized first. Following the mapping, the dataset is represented in B^d as D = {x_i, y_i}_{i=1}^{n}, where x_i ∈ B^d and y_i ∈ C, with x_ij ∈ B = {0, 1}, ∀i ∈ [1, n], j ∈ [1, d].

Definition 1 (Combined Feature) A combined feature α = {o_{α1}, ..., o_{αk}} is a subset of I, where o_{αi} ∈ {o1, ..., od}, ∀1 ≤ i ≤ k. o_i ∈ I is a single feature. Given a dataset D = {x_i}, the set of data that contains α is denoted as D_α = {x_i | x_{iα_j} = 1, ∀o_{α_j} ∈ α}.
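To make the item mapping and the support of a combined feature concrete, here is a small Python sketch; the attribute names and toy records are invented for illustration and are not from the paper's datasets.

    # Toy categorical records (hypothetical attributes and values).
    records = [
        {"outlook": "sunny", "wind": "weak"},
        {"outlook": "sunny", "wind": "strong"},
        {"outlook": "rain",  "wind": "weak"},
        {"outlook": "rain",  "wind": "strong"},
    ]

    # Map every (attribute, value) pair to a distinct item o_i in I.
    items = {}
    for r in records:
        for att, val in r.items():
            items.setdefault((att, val), len(items))

    # Represent each record as a binary vector x in B^d.
    d = len(items)
    X = [[0] * d for _ in records]
    for x, r in zip(X, records):
        for att, val in r.items():
            x[items[(att, val)]] = 1

    def relative_support(alpha, X):
        # Fraction of records containing every item of the combined feature alpha.
        return sum(all(x[i] for i in alpha) for x in X) / len(X)

    alpha = {items[("outlook", "sunny")], items[("wind", "weak")]}
    print(relative_support(alpha, X))   # 0.25 for the toy records above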
(a) Austral (b) Breast (c) Sonar

Figure 1. Information Gain vs. Pattern Length on UCI data

Definition 2 (Frequent Combined Feature) For a dataset D, a combined feature α is frequent if θ = |D_α|/|D| ≥ θ0, where θ = |D_α|/|D| is the relative support of α, and θ0 is the min sup threshold, 0 ≤ θ0 ≤ 1. The set of frequent combined features is denoted as F.

Given a dataset D = {x_i, y_i}_{i=1}^{n} and a set of frequent patterns F, D is mapped into a higher dimensional space B^{d'} with d' features in I ∪ F. The data is denoted as D' = {x'_i, y_i}_{i=1}^{n}, where x'_i ∈ B^{d'}. Notice that F is parameterized with min sup θ0.

Frequent Pattern-Based Classification is learning a classification model in the feature space of single features as well as frequent patterns, where the frequent patterns are generated w.r.t. min sup.

3 The Framework of Frequent Pattern-based Classification

In this section, we examine the framework of frequent pattern-based classification, which includes three steps: (1) feature generation, (2) feature selection, and (3) model learning.

In the feature generation step, frequent patterns are generated with a user-specified min sup. The data is partitioned according to the class label, and frequent patterns are discovered in each partition with min sup. The collection of frequent patterns F forms the set of feature candidates. In the second step, feature selection is applied on F. The set of selected features is Fs. Given Fs, the dataset D is transformed to D' in B^{d'}. The feature space includes all the single features as well as the selected frequent patterns. Finally, a classification model is built on the dataset D'.
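The three steps can be pictured with the following minimal Python sketch. The naive miner, the no-op selection step and the toy data are placeholders chosen for illustration only; the actual study uses a scalable closed-pattern miner (FPClose) and the MMRFS selector described in Section 3.3.

    from itertools import combinations

    def mine_frequent_patterns(X, min_sup, max_len=3):
        # Naive miner used only for illustration: enumerate item combinations up
        # to max_len and keep those whose relative support reaches min_sup.
        n = len(X)
        patterns = []
        for k in range(2, max_len + 1):
            for combo in combinations(range(len(X[0])), k):
                if sum(all(x[i] for i in combo) for x in X) / n >= min_sup:
                    patterns.append(frozenset(combo))
        return patterns

    def transform(X, patterns):
        # Map each record from B^d to B^{d'}: single items plus selected patterns.
        return [x + [int(all(x[i] for i in p)) for p in patterns] for x in X]

    def frequent_pattern_features(X, y, min_sup):
        # Step 1: feature generation, mining each class partition separately.
        F = set()
        for c in set(y):
            F.update(mine_frequent_patterns([x for x, yi in zip(X, y) if yi == c], min_sup))
        # Step 2: feature selection (placeholder; Section 3.3 introduces MMRFS).
        F_selected = list(F)
        # Step 3: model learning would train any classifier on the transformed D'.
        return transform(X, F_selected), F_selected

    # Toy usage with d = 4 single features.
    X = [[1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1]]
    y = ["a", "a", "b", "b"]
    X_prime, F = frequent_pattern_features(X, y, min_sup=0.5)
    print(len(F), "patterns mined; new dimensionality:", len(X_prime[0]))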
3.1 Why Are Frequent Patterns Good Features?

Frequent patterns have two properties: (1) each pattern is a combination of single features, and (2) they are frequent. We will analyze these properties and explain why frequent patterns are useful for classification.

3.1.1 The Usefulness of Combined Features

A frequent pattern is a form of non-linear feature combination over the set of single features. With the inclusion of non-linear feature combinations, the expressive power of the new feature space increases. The "Exclusive OR" function is an example: the data is linearly separable in B^3 = (x, y, xy), but not in the original space B^2 = (x, y). Non-linear mappings are widely used, e.g., string kernels [15, 12] for text or biosequence classification. In frequent pattern-based classification, the single feature vector x is explicitly transformed from the space B^d, where d = |I|, to a larger space B^{d'}, where d' = |I ∪ F|. This will likely increase the chance of including important features.

In addition, the discriminative power of some frequent patterns is higher than that of single features because they capture more underlying semantics of the data. We retrieved three UCI datasets and plotted the information gain [17] of both single features and frequent patterns in Figure 1. It is clear that some frequent patterns have higher information gain than single features.

3.1.2 Discriminative Power versus Pattern Frequency

In this subsection, we study the relationship between the discriminative power of a feature and its support, and demonstrate that the discriminative power of low-support features is limited. In addition, they could harm the classification accuracy due to overfitting.
First, a classification model which uses frequent features for induction has statistical significance and thus generalizes well to the test data. If an infrequent feature is used, the model cannot generalize well to the test data since it is built on statistically minor observations. This is referred to as overfitting.

Second, the discriminative power of a pattern is closely related to its support. Take information gain as an example. For a pattern α represented by a random variable X, the information gain is

    IG(C|X) = H(C) − H(C|X)    (1)

where H(C) is the entropy and H(C|X) is the conditional entropy. Given a dataset with a fixed class distribution, H(C) is a constant. The upper bound of the information gain, IG_ub, is

    IG_ub(C|X) = H(C) − H_lb(C|X)    (2)

where H_lb(C|X) is the lower bound of H(C|X). Assuming the support of α is θ, we will show in the following that IG_ub(C|X) is closely related to θ. When θ is small, IG_ub(C|X) is low. That is, infrequent features have a very low information gain upper bound.

To simplify the analysis, assume X ∈ {0, 1} and C = {0, 1}. Let P(x = 1) = θ, P(c = 1) = p and P(c = 1|x = 1) = q. Then

    H(C|X) = − Σ_{x∈{0,1}} P(x) Σ_{c∈{0,1}} P(c|x) log P(c|x)
           = −θq log q − θ(1 − q) log(1 − q)
             + (θq − p) log((p − θq)/(1 − θ))
             + (θ(1 − q) − (1 − p)) log(((1 − p) − θ(1 − q))/(1 − θ))

H(C|X) is a function of p, q and θ. Given a dataset, p is a fixed value. As H(C|X) is a concave function, it reaches its lower bound w.r.t. q, for fixed p and θ, under the following conditions. If θ ≤ p, H(C|X) reaches its lower bound when q = 0 or 1. If θ > p, H(C|X) reaches its lower bound when q = p/θ or 1 − (1 − p)/θ. The cases of θ ≤ p and θ ≥ p are symmetric. Due to space limits, we only discuss the case θ ≤ p; the analysis for the other case is similar.

Since q = 0 and q = 1 are symmetric for the case θ ≤ p, we only discuss the case q = 1. In that case, the lower bound H_lb(C|X) is

    H_lb(C|X)|_{q=1} = (θ − 1) ( ((p − θ)/(1 − θ)) log((p − θ)/(1 − θ)) + ((1 − p)/(1 − θ)) log((1 − p)/(1 − θ)) )    (3)

The partial derivative of H_lb(C|X)|_{q=1} w.r.t. θ is

    ∂H_lb(C|X)|_{q=1} / ∂θ = log((p − θ)/(1 − θ)) − (p − 1)/(1 − θ) − (1 − p)/(1 − θ)
                           = log((p − θ)/(1 − θ))
                           ≤ log 1
                           ≤ 0

The above analysis demonstrates that the information gain upper bound IG_ub(C|X) is a function of the support θ. H_lb(C|X)|_{q=1} is monotonically decreasing with θ, i.e., the smaller θ is, the larger H_lb(C|X), and the smaller IG_ub(C|X). When θ is small, IG_ub(C|X) is small. Therefore, the discriminative power of low-frequency patterns is bounded by a small value. For the symmetric case θ ≥ p, a similar conclusion can be drawn: the discriminative power of very high-frequency patterns is also bounded by a small value, following the same rationale.

To support the analysis above, we depict empirical results on three UCI datasets in Figure 2. The x axis represents the (absolute) support of a pattern and the y axis represents the information gain. We can clearly see that the information gain of a low-support pattern is bounded by a small value. In addition, for each absolute support, we also plot the theoretical upper bound, IG_ub(C|X)|_{q=1} if θ ≤ p or IG_ub(C|X)|_{q=p/θ} if θ > p, given the fixed p = P(c = 1) from the real dataset. We can see that the upper bound of information gain at very low support (and very high support) is small, which confirms our analysis. For example, for a support count of 31 (i.e., θ = 5%) in Figure 2 (a), the information gain upper bound is as low as 0.06.

Another interesting observation is that, at a medium-to-large support (e.g., support = 300 in Figure 2 (a)) where the upper bound reaches the maximum possible value IG_ub = H(C), there is a big margin between the information gain of frequent patterns and the upper bound. However, this does not necessarily demonstrate that frequent patterns cannot have very high discriminative power. As a matter of fact, the set of available frequent patterns and their predictive power is closely related to the dataset and the class distribution.
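The bound derived above is straightforward to compute. The sketch below covers the binary-class, binary-feature case of this subsection, rewriting Eq. (3) equivalently as (1 − θ)·H_b((p − θ)/(1 − θ)) for θ ≤ p, and using H_lb = θ·H_b(p/θ) for θ > p (the q = p/θ case), where H_b is the binary entropy; the example values of p and θ are arbitrary choices for illustration.

    from math import log2

    def binary_entropy(r):
        # H_b(r) in bits, with the convention 0 * log(0) = 0.
        if r <= 0.0 or r >= 1.0:
            return 0.0
        return -r * log2(r) - (1 - r) * log2(1 - r)

    def ig_upper_bound(theta, p):
        # IG_ub(C|X) of Eq. (2) for a binary class with P(c=1) = p and a
        # pattern of relative support theta.
        if theta <= p:                   # lower bound of H(C|X) attained at q = 1
            h_lb = (1 - theta) * binary_entropy((p - theta) / (1 - theta))
        else:                            # attained at q = p / theta
            h_lb = theta * binary_entropy(p / theta)
        return binary_entropy(p) - h_lb

    # The bound shrinks at both very low and very high support (illustrative p).
    p = 0.3
    for theta in (0.01, 0.05, 0.3, 0.6, 0.95):
        print(theta, round(ig_upper_bound(theta, p), 3))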
Besides information gain, the Fisher score [8] is also popularly used to measure the discriminative power of a feature. We analyze the relationship between Fisher score and pattern support. The Fisher score is defined as

    Fr = ( Σ_{i=1}^{c} n_i (μ_i − μ)^2 ) / ( Σ_{i=1}^{c} n_i σ_i^2 )    (4)

where n_i is the number of data samples in class i, μ_i is the average feature value in class i, σ_i is the standard deviation of the feature value in class i, and μ is the average feature value in the whole dataset.

0.9 InfoGain
IG_UpperBnd
0.8

0.7

Information Gain
0.6

0.5

0.4

0.3

0.2

0.1

0
0 100 200 300 400 500 600 700
Support

(a) Austral (b) Breast (c) Sonar

Figure 2. Information Gain and the Theoretical Upper Bound vs. Support on UCI data

[Plots of FisherScore and FS_UpperBnd versus absolute support]

(a) Austral (b) Breast (c) Sonar

Figure 3. Fisher Score and the Theoretical Upper Bound vs. Support on UCI data

We use the notation of p, q and θ as defined before and assume we only have two classes. Assume θ ≤ p (the analysis for θ > p is symmetric); then Fr is

    Fr = θ(p − q)^2 / ( p(1 − p)(1 − θ) − θ(p − q)^2 )    (5)

In Eq. (5), let Y = p(1 − p)(1 − θ) and Z = θ(p − q)^2. Then Y ≥ 0 and Z ≥ 0. If Y = 0, we can verify that Z = 0 too; then Fr is undefined in Eq. (5), and in this case Fr = 0 according to Eq. (4). For the case Y > 0 and Z ≥ 0, Eq. (5) is equivalent to

    Fr = Z / (Y − Z)

For fixed p and θ, Y is a positive constant. Then Fr monotonically increases with Z = θ(p − q)^2. Assume p ∈ (0, 0.5] (p ∈ [0.5, 1) is symmetric); then when q = 1, Fr reaches its maximum value w.r.t. q, for fixed p and θ. We denote this maximum value as Fr_ub. Putting q = 1 into Eq. (5), we have

    Fr_ub|_{q=1} = θ(1 − p) / (p − θ)    (6)

According to Eq. (6), as θ increases, Fr_ub|_{q=1} increases monotonically, for a fixed p. For θ ≤ p, the Fisher score upper bound of a low-frequency pattern is smaller than that of a high-frequency one. Note that, as θ increases, Fr_ub|_{q=1} takes very large values; when θ → p, Fr_ub|_{q=1} → ∞.

Further evidence of the relationship between Fr and θ is the sign of ∂Fr/∂θ. For Eq. (5), the partial derivative of Fr w.r.t. θ is

    ∂Fr/∂θ = (p − q)^2 p(1 − p) / (p − p^2 − θq^2 − θp + 2θpq)^2 ≥ 0    (7)

The inequality holds because p ∈ [0, 1]. Therefore, when θ ≤ p, Fr monotonically increases with θ, for fixed p and q. The result shows that the Fisher score of a high-frequency feature is larger than that of a low-frequency one, if p and q are fixed.

Figure 3 shows the Fisher score of each pattern vs. its (absolute) support. We also plot the Fisher score upper bound Fr_ub w.r.t. support. As mentioned above, for θ ≤ p, as θ increases, Fr_ub takes very large values, and Fr_ub → ∞ as θ approaches p. Hence, we only plot a portion of the curve, which shows the trend very clearly. The result is similar to Figure 2. These empirical results demonstrate that features of low support have very limited discriminative power, which is due to their limited coverage of the dataset. Features of very high support have very limited discriminative power too, which is due to their commonness in the data.
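Eqs. (5) and (6) are equally easy to evaluate. The following sketch handles the two-class, binary-feature case; the treatment of a vanishing denominator and the example numbers are illustrative choices, not part of the paper's analysis.

    def fisher_score(theta, p, q):
        # Eq. (5): Fisher score of a binary feature with support theta,
        # class prior p = P(c=1) and q = P(c=1 | x=1), two-class case.
        num = theta * (p - q) ** 2
        den = p * (1 - p) * (1 - theta) - num
        if den <= 0:
            # Degenerate cases: Eq. (4) gives 0 when both numerator and
            # denominator vanish; otherwise the score is unbounded.
            return float("inf") if num > 0 else 0.0
        return num / den

    def fisher_upper_bound(theta, p):
        # Eq. (6): Fr_ub attained at q = 1, valid for theta < p.
        return theta * (1 - p) / (p - theta)

    # The bound grows with theta and diverges as theta approaches p.
    p = 0.4
    for theta in (0.01, 0.1, 0.2, 0.3, 0.39):
        print(theta, round(fisher_upper_bound(theta, p), 3))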
3.1.3 The Justification of Frequent Pattern-Based Classification

Based on the above analysis, we now demonstrate that frequent pattern-based classification is a scalable and effective methodology. The justification is done by building a connection between a well-established information gain-based feature selection approach and our frequent pattern-based method.

Assume the problem context is using combined features for classification. In a commonly used feature selection approach, assume all feature combinations are generated as feature candidates. A subset of high quality features is selected for classification, with an information gain threshold IG0 (or a Fisher score threshold). According to the analysis in Section 3.1.2, one can always find a min sup threshold θ* which satisfies:

    θ* = arg max_θ (IG_ub(θ) ≤ IG0)    (8)

where IG_ub(θ) is the information gain upper bound at support θ. That is, θ* is the maximum support threshold at which the information gain upper bound is no greater than IG0.

The feature selection approach filters all the combined features whose information gain is less than IG0; accordingly, in the frequent pattern-based method, features with support θ ≤ θ* can be safely skipped because IG(θ) ≤ IG_ub(θ) ≤ IG_ub(θ*) ≤ IG0. Compared with the information gain-based approach, it is equivalent to generate the features with min sup = θ* and then apply feature selection on the frequent patterns only. The latter is our frequent pattern-based approach. Since the number of all feature combinations is usually very large, enumeration and feature selection over such a huge feature space is computationally intractable. In contrast, the frequent pattern-based method achieves the same result in a much more efficient way. Obviously, it can benefit from state-of-the-art frequent pattern mining algorithms. The choice of the information gain threshold IG0 in the first approach corresponds to the setting of the min sup parameter in our framework. If IG0 is large, the corresponding θ* is large, and vice versa. As it is important to determine the information gain threshold in most feature selection algorithms, the strategy of setting an appropriate min sup is equally crucial. We discuss this issue in Section 3.2.

3.2 The Minimum Support Effect

Since the set of frequent patterns F is generated according to min sup, we study the impact of min sup on the classification accuracy and propose a strategy to set min sup.

If min sup is set to a large value, the patterns in F correspond to very frequent ones. In the context of classification, they may not be the best feature candidates, since they appear in a large portion of the dataset, across different classes. We can clearly observe in Figures 2 and 3 that at a very large min sup value, the theoretical upper bound decreases, due to the "overwhelming" occurrences of the high-support patterns. This is analogous to stop words in text retrieval, where highly frequent words are removed before document retrieval or text categorization.

As min sup is lowered, the classification accuracy is expected to increase, as more discriminative patterns with medium frequency are discovered. However, as min sup decreases to a very low value, the classification accuracy stops increasing, or even starts dropping due to overfitting. As we analyzed in Section 3.1, features with low support have low discriminative power. They could even harm the classification accuracy if they are included for classification, due to the overfitting effect. In addition, the costs of time and space at both the frequent pattern mining and the feature selection steps become very high with a low min sup.

We propose a strategy to set min sup, the major steps of which are outlined below.

• Compute the theoretical information gain (or Fisher score) upper bound as a function of support θ;

• Choose an information gain threshold IG0 for feature filtering purposes;

• Find θ* = arg max_θ (IG_ub(θ) ≤ IG0);

• Mine frequent patterns with min sup = θ*.

First, compute the theoretical information gain upper bound as a function of support θ. This involves only the class distribution p, without generating frequent patterns. Then decide an information gain threshold IG0 and find the corresponding θ*. Then, for θ ≤ θ*, IG_ub(θ) ≤ IG_ub(θ*) ≤ IG0. In this way, frequent patterns are generated efficiently without missing any feature candidates w.r.t. IG0. As there are more mature studies on how to set the information gain threshold in feature selection methods [24], we can borrow their strategy and map the selected information gain threshold to a min sup threshold in our method.
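Putting the strategy into code for the binary-class case gives a sketch like the one below. It repeats the upper-bound helper from the earlier sketch so it stands alone, and it scans only the low-support side (θ ≤ p), where the bound increases with θ; the class prior, dataset size and IG0 used in the example are made-up numbers.

    from math import log2

    def binary_entropy(r):
        if r <= 0.0 or r >= 1.0:
            return 0.0
        return -r * log2(r) - (1 - r) * log2(1 - r)

    def ig_upper_bound(theta, p):
        # Same bound as in the earlier sketch (binary class, binary feature).
        if theta <= p:
            h_lb = (1 - theta) * binary_entropy((p - theta) / (1 - theta))
        else:
            h_lb = theta * binary_entropy(p / theta)
        return binary_entropy(p) - h_lb

    def min_sup_for_threshold(ig0, p, n):
        # Largest relative support theta* = s/n on the low-support side with
        # IG_ub(theta*) <= IG0, following Eq. (8).
        theta_star = 0.0
        s = 1
        while s / n <= p and ig_upper_bound(s / n, p) <= ig0:
            theta_star = s / n
            s += 1
        return theta_star

    # Example with invented numbers: a 30/70 class split over 1000 records.
    print("mine frequent patterns with min_sup =", min_sup_for_threshold(0.05, p=0.3, n=1000))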
3.3 Feature Selection Algorithm MMRFS

Although frequent patterns are shown to be useful for classification, not every frequent pattern is equally useful. It is necessary to perform feature selection to single out a subset of discriminative features and remove non-discriminative ones. In this section, we propose an algorithm, MMRFS. The notion is borrowed from the Maximal Marginal Relevance (MMR) [4] heuristic in information retrieval, where a document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents. We first define the relevance and redundancy of a frequent pattern in the context of classification.

Definition 3 (Relevance) A relevance measure S is a function mapping a pattern α to a real value such that S(α) is the relevance w.r.t. the class label.

Relevance models the discriminative power of a frequent pattern w.r.t. the class label. Measures like information gain and Fisher score can be used as a relevance measure.

Definition 4 (Redundancy) A redundancy measure R is a function mapping two patterns α and β to a real value such that R(α, β) is the redundancy between them.

Redundancy measures the extent to which two patterns are similar. In this paper, we use a variant of the Jaccard measure [18] to measure the redundancy between different features:

    R(α, β) = ( P(α, β) / (P(α) + P(β) − P(α, β)) ) × min(S(α), S(β))    (9)

According to the redundancy definition, we use closed frequent patterns [27] as features instead of frequent ones in our framework, since for a closed pattern α and its non-closed sub-pattern β, β is completely redundant w.r.t. α.

The MMRFS algorithm searches over the feature space in a heuristic way. A feature is selected if it is relevant to the class label and has very low redundancy with the features already selected. Initially, the feature with the highest relevance measure is selected. Then the algorithm incrementally selects more patterns from F with an estimated gain g. A pattern is selected if it has the maximum gain among the remaining patterns. The gain of a pattern α, given a set of already selected patterns Fs, is

    g(α) = S(α) − max_{β∈Fs} R(α, β)    (10)

An interesting question arises: how many frequent patterns should be selected for effective classification? A promising method is to add a database coverage constraint δ, as in [13]. The coverage parameter δ is set to ensure that each training instance is covered at least δ times by the selected features. In this way, the number of features selected is automatically determined, given a user-specified parameter δ. The algorithm is described in Algorithm 1.

Algorithm 1 Feature Selection Algorithm MMRFS
Input: Frequent patterns F, coverage threshold δ, relevance S, redundancy R
Output: A selected pattern set Fs
 1: Let α be the most relevant pattern;
 2: Fs = {α};
 3: while (true)
 4:     Find a pattern β such that the gain g(β) is the maximum among the set of patterns in F − Fs;
 5:     If β can correctly cover at least one instance
 6:         Fs = Fs ∪ {β};
 7:     F = F − {β};
 8:     If all instances are covered δ times or F = ∅
 9:         break;
10: return Fs
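A compact Python rendering of Algorithm 1 under simplifying assumptions: patterns are frozensets of item indices over binary records, relevance S is passed in as a dictionary, redundancy follows Eq. (9), and the caller supplies what "correctly cover" means. This is an illustrative sketch, not the authors' implementation, and the toy data at the end is invented.

    def redundancy(a, b, X, S):
        # Eq. (9): Jaccard overlap of the records covered by a and b,
        # scaled by the smaller of the two relevance values.
        cover_a = {i for i, x in enumerate(X) if all(x[j] for j in a)}
        cover_b = {i for i, x in enumerate(X) if all(x[j] for j in b)}
        union = len(cover_a | cover_b)
        if union == 0:
            return 0.0
        return (len(cover_a & cover_b) / union) * min(S[a], S[b])

    def mmrfs(patterns, S, X, y, delta, covers_correctly):
        # Algorithm 1: greedily pick the pattern with the largest gain
        # g(beta) = S(beta) - max_{alpha in Fs} R(alpha, beta), keep it only if
        # it correctly covers at least one instance, and stop once every
        # instance is covered delta times or the candidate pool is exhausted.
        remaining = set(patterns)
        first = max(remaining, key=lambda a: S[a])
        selected = [first]
        remaining.discard(first)
        coverage = [int(covers_correctly(first, X[i], y[i])) for i in range(len(X))]
        while remaining and any(c < delta for c in coverage):
            best = max(remaining,
                       key=lambda b: S[b] - max(redundancy(a, b, X, S) for a in selected))
            remaining.discard(best)
            hits = [i for i in range(len(X)) if covers_correctly(best, X[i], y[i])]
            if hits:
                selected.append(best)
                for i in hits:
                    coverage[i] += 1
        return selected

    # Toy usage: relevance is just support here; a real run would use
    # information gain or Fisher score (Definition 3) and a class-aware
    # notion of coverage.
    X = [[1, 1, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
    y = [1, 1, 0, 0]
    patterns = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2})]
    S = {p: sum(all(x[j] for j in p) for x in X) / len(X) for p in patterns}
    covers = lambda p, x, label: all(x[j] for j in p)
    print(mmrfs(patterns, S, X, y, delta=1, covers_correctly=covers))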
4 Experimental Results

In this section, we report a systematic experimental study for the evaluation of our frequent pattern-based classification framework and our proposed feature selection algorithm, MMRFS.

A series of datasets from the UCI Machine Learning Repository are tested. Continuous attributes are discretized. We use FPClose [9] to generate closed patterns and the MMRFS algorithm to do the feature selection. LIBSVM [5] and C4.5 in Weka [21] are chosen as the two classification models. Each dataset is partitioned evenly into ten parts. Each time, one part is used for test and the other nine are used for training. We did 10-fold cross validation on each training set and picked the best model for test. The classification accuracies on the ten test partitions are averaged and reported.

4.1 Frequent Pattern-based Classification

We test the performance of frequent pattern-based classification. For each dataset, a set of frequent patterns F is generated. A classification model is built using features in I ∪ F, denoted as Pat_All. MMRFS is then applied on F and a classifier is built using features in I ∪ Fs, denoted as Pat_FS.
Table 1. Accuracy by SVM on Frequent Combined Features vs. Single Features

Data       Item_All  Item_FS  Item_RBF  Pat_All  Pat_FS
anneal     99.78     99.78    99.11     99.33    99.67
austral    85.01     85.50    85.01     81.79    91.14
auto       83.25     84.21    78.80     74.97    90.79
breast     97.46     97.46    96.98     96.83    97.78
cleve      84.81     84.81    85.80     78.55    95.04
diabetes   74.41     74.41    74.55     77.73    78.31
glass      75.19     75.19    74.78     79.91    81.32
heart      84.81     84.81    84.07     82.22    88.15
hepatic    84.50     89.04    85.83     81.29    96.83
horse      83.70     84.79    82.36     82.35    92.39
iono       93.15     94.30    92.61     89.17    95.44
iris       94.00     96.00    94.00     95.33    96.00
labor      89.99     91.67    91.67     94.99    95.00
lymph      81.00     81.62    84.29     83.67    96.67
pima       74.56     74.56    76.15     76.43    77.16
sonar      82.71     86.55    82.71     84.60    90.86
vehicle    70.43     72.93    72.14     73.33    76.34
wine       98.33     99.44    98.33     98.30    100
zoo        97.09     97.09    95.09     94.18    99.00

Table 2. Accuracy by C4.5 on Frequent Combined Features vs. Single Features

Data       Item_All  Item_FS  Pat_All  Pat_FS
anneal     98.33     98.33    97.22    98.44
austral    84.53     84.53    84.21    88.24
auto       71.70     77.63    71.14    78.77
breast     95.56     95.56    95.40    96.35
cleve      80.87     80.87    80.84    91.42
diabetes   77.02     77.02    76.00    76.58
glass      75.24     75.24    76.62    79.89
heart      81.85     81.85    80.00    86.30
hepatic    78.79     85.21    80.71    93.04
horse      83.71     83.71    84.50    87.77
iono       92.30     92.30    92.89    94.87
iris       94.00     94.00    93.33    93.33
labor      86.67     86.67    95.00    91.67
lymph      76.95     77.62    74.90    83.67
pima       75.86     75.86    76.28    76.72
sonar      80.83     81.19    83.67    83.67
vehicle    70.70     71.49    74.24    73.06
wine       95.52     93.82    96.63    99.44
zoo        91.18     91.18    95.09    97.09

(The Item_* columns use single features only; the Pat_* columns use single features plus frequent patterns.)

For comparison, we test classifiers built on single features, denoted as Item_All (using all single features) and Item_FS (using selected single features), respectively. Table 1 shows the results by SVM and Table 2 shows the results by C4.5. In LIBSVM, all of the above four models use a linear kernel. In addition, an SVM model is built using an RBF kernel on single features, denoted as Item_RBF.

From Table 1, it is clear that Pat_FS achieves the best classification accuracy in most cases. It shows significant improvement over Item_All and Item_FS. This result is consistent with our theoretical analysis that (1) frequent patterns are useful by mapping the data to a higher dimensional space, and (2) the discriminative power of some frequent patterns is higher than that of single features.

Another interesting observation is that the performance of Item_RBF is inferior to that of Pat_FS. The reason is that the RBF kernel has a different mechanism for feature generation from our approach. In our approach, min sup is used to filter out low-frequency features and MMRFS is applied to select highly discriminative features. In contrast, the RBF kernel maps the original feature vector to a possibly infinite dimension. The degree (i.e., the maximum length) of combined features depends on the value of γ, where γ is the factor in K(x, y) = exp(−γ‖x − y‖^2); i.e., the degree increases as γ grows. Given a particular γ, the combined features F^p of length ≤ p are used without discriminating their frequency or predictive power, while the combined features of length > p are filtered out.

We also observe that the performance of Pat_All is much worse than that of Pat_FS, which confirms our reasoning that redundant and non-discriminative patterns often overfit the model and deteriorate the classification accuracy. In addition, MMRFS is shown to be effective. Generally, any effective feature selection algorithm can be used in our framework. The emphasis is that feature selection is an important step in frequent pattern-based classification.

The above results are also observed in Table 2 for decision tree-based classification.

4.2 Scalability Tests

Scalability tests are performed to show that our frequent pattern-based framework is very scalable with good classification accuracy. Three dense datasets, Chess,
Waveform and Letter Recognition¹ from the UCI repository are used. On each dataset, min sup = 1 is used to enumerate all feature combinations and feature selection is applied over them. In comparison, the frequent pattern-based classification method is tested with various support threshold settings.

¹ The discretized Letter Recognition data is obtained from www.csc.liv.ac.uk/∼frans/KDD/Software/LUCS-KDD-DN/DataSets

Table 3. Accuracy & Time on Chess Data

min sup  #Patterns  Time (s)  SVM (%)  C4.5 (%)
1        N/A        N/A       N/A      N/A
2000     68,967     44.703    92.52    97.59
2200     28,358     19.938    91.68    97.84
2500     6,837      2.906     91.68    97.62
2800     1,031      0.469     91.84    97.37
3000     136        0.063     91.90    97.06

Table 4. Accuracy & Time on Waveform Data

min sup  #Patterns  Time (s)  SVM (%)  C4.5 (%)
1        9,468,109  N/A       N/A      N/A
80       26,576     176.485   92.40    88.35
100      15,316     90.406    92.19    87.29
150      5,408      23.610    91.53    88.80
200      2,481      8.234     91.22    87.32

Table 5. Accuracy & Time on Letter Recognition Data

min sup  #Patterns  Time (s)  SVM (%)  C4.5 (%)
1        5,147,030  N/A       N/A      N/A
3000     3,246      200.406   79.86    77.08
3500     2,078      103.797   80.21    77.28
4000     1,429      61.047    79.57    77.32
4500     962        35.235    79.51    77.42

In Table 3, we show the result of varying min sup on the Chess data, which contains 3,196 instances, 2 classes and 73 items. #Patterns gives the number of closed patterns. Time gives the sum of pattern mining and feature selection time. We do not include the classification time in the table because our goal is to show that the proposed framework has good scalability in feature generation and selection. The last two columns give the classification accuracy by SVM and C4.5. When min sup = 1, the enumeration of all the patterns cannot complete in days, thus blocking model construction. Our framework, benefiting from a higher support threshold, can accomplish the mining of frequent patterns in seconds and achieve satisfactory classification accuracy.

Tables 4 and 5 show similar results on the other two datasets. When min sup = 1, millions of patterns are enumerated. Feature selection fails with such a large number of patterns. In contrast, our frequent pattern-based method is very efficient and achieves good accuracy within a wide range of minimum support thresholds.

5 Related Work

Frequent pattern-based classification is related to associative classification. In associative classification, a classifier is built based on high-confidence, high-support association rules [14, 13, 25, 6, 19]. The association between frequent patterns and class labels is used for prediction.

A recent work on top-k rule mining [6] discovers top-k covering rule groups for each row of gene expression profiles. Prediction is then performed based on a classification score which combines the support and confidence measures of the rules.

HARMONY [19] is another rule-based classifier which directly mines classification rules. It uses an instance-centric rule-generation approach and ensures, for each training instance, that one of the highest-confidence rules covering the instance is included in the rule set. HARMONY is shown to be more efficient and scalable than previous rule-based classifiers. On several datasets that were tested by both our method and HARMONY, our classification accuracy is significantly higher, e.g., the improvement is up to 11.94% on Waveform and 3.40% on Letter Recognition.

Our work differs from associative classification in the following aspects: (1) we use frequent patterns to represent the data in a different feature space, in which any learning algorithm can be used, whereas associative classification builds a classification model using rules only; (2) in associative classification, the prediction process is to find one or several top-ranked rules for prediction, whereas in our case, the prediction is made by the classification model; and (3) more importantly, we provide in-depth analysis of why frequent patterns provide a good solution for classification, by studying the relationship between discriminative power and pattern support. By establishing a connection with an information gain-based feature selection approach, we propose a strategy for setting min sup as well. In addition, we demonstrate the importance of feature selection on the frequent pattern features and propose a feature selection algorithm.
Other related work includes classification which uses string kernels [15, 12], word combinations in NLP, or structural features in graph classification [7]. In all these studies, frequent patterns are generated and the data is mapped to a higher dimensional feature space. Data which are not linearly separable in the original space become linearly separable in the mapped space.

6 Conclusions

In this paper, we propose a systematic framework for frequent pattern-based classification and give theoretical answers to several critical questions raised by this framework. Our study shows that frequent patterns are high quality features and have good model generalization ability. Connected with a commonly used feature selection approach, our method is able to overcome two kinds of overfitting problems and is shown to be scalable. A strategy for setting min sup is also suggested. In addition, we propose a feature selection algorithm to select discriminative frequent patterns. Experimental studies demonstrate that significant improvement is achieved in classification accuracy using the frequent pattern-based classification framework.

The framework is also applicable to more complex patterns, including sequences and graphs. In the future, we will conduct research in this direction.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of SIGMOD, pages 207–216, 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of VLDB, pages 487–499, 1994.
[3] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of ICDE, pages 3–14, 1995.
[4] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. of SIGIR, pages 335–336, 1998.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm.
[6] G. Cong, K. Tan, A. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. In Proc. of SIGMOD, pages 670–681, 2005.
[7] M. Deshpande, M. Kuramochi, and G. Karypis. Frequent sub-structure-based approaches for classifying chemical compounds. In Proc. of ICDM, pages 35–42, 2003.
[8] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley Interscience, 2nd edition, 2000.
[9] G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. In ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), 2003.
[10] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of SIGMOD, pages 1–12, 2000.
[11] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. of ICDM, pages 313–320, 2001.
[12] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proc. of PSB, pages 564–575, 2002.
[13] W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. of ICDM, pages 369–376, 2001.
[14] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proc. of KDD, pages 80–86, 1998.
[15] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
[16] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proc. of ICDE, pages 215–226, 2001.
[17] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[18] P. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In Proc. of KDD, pages 32–41, 2002.
[19] J. Wang and G. Karypis. HARMONY: Efficiently mining the best rules for classification. In Proc. of SDM, pages 205–216, 2005.
[20] K. Wang, C. Xu, and B. Liu. Clustering transactions using large items. In Proc. of CIKM, pages 483–490, 1999.
[21] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition, 2005.
[22] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. of ICDM, pages 721–724, 2002.
[23] X. Yan, P. S. Yu, and J. Han. Graph indexing: A frequent structure-based approach. In Proc. of SIGMOD, pages 335–346, 2004.
[24] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proc. of ICML, pages 412–420, 1997.
[25] X. Yin and J. Han. CPAR: Classification based on predictive association rules. In Proc. of SDM, pages 331–335, 2003.
[26] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1/2):31–60, 2001.
[27] M. J. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In Proc. of SDM, pages 457–473, 2002.
