Definition 2 (Frequent Combined Feature) For a dataset D, a combined feature α is frequent if θ = |Dα|/|D| ≥ θ0, where θ is the relative support of α, and θ0 is the min_sup threshold, 0 ≤ θ0 ≤ 1. The set of frequent combined features is denoted as F.

3.1 Why Are Frequent Patterns Good Features?

Frequent patterns have two properties: (1) each pattern is a combination of single features, and (2) they are frequent. We will analyze these properties and explain why frequent patterns are useful for classification.
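To make Definition 2 concrete, the following minimal Python sketch (not from the paper; the toy dataset, pattern, and threshold are hypothetical) computes the relative support θ of a combined feature and tests it against min_sup.

```python
# Minimal sketch: relative support of a combined feature (Definition 2).
# A combined feature alpha is a set of single features; it matches a row
# if every feature in alpha is present in that row.

def relative_support(dataset, alpha):
    """theta = |D_alpha| / |D|, the fraction of rows that contain pattern alpha."""
    matched = sum(1 for row in dataset if alpha <= row)
    return matched / len(dataset)

# Hypothetical toy data: each row is the set of single features it contains.
D = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "d"}]
alpha = {"a", "b"}
theta = relative_support(D, alpha)      # 3/5 = 0.6
min_sup = 0.5
print(theta, theta >= min_sup)          # alpha is frequent w.r.t. min_sup = 0.5
```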
Figure 2. Information Gain and the Theoretical Upper Bound vs. Support on UCI data (curves: InfoGain, IG_UpperBnd; plot omitted)
Figure 3. Fisher Score and the Theoretical Upper Bound vs. Support on UCI data (curves: FisherScore, FS_UpperBnd; plot omitted)
deviation of the feature value in class i, and µ is the average feature value in the whole dataset.
We use the notation of p, q and θ as defined before and assume we only have two classes. Assume θ ≤ p (the analysis for θ > p is symmetric); then Fr is

    Fr = θ(p − q)² / [p(1 − p)(1 − θ) − θ(p − q)²]    (5)

In Eq. (5), let Y = p(1 − p)(1 − θ) and Z = θ(p − q)². Then Y ≥ 0 and Z ≥ 0. If Y = 0, we can verify that Z = 0 too; then Fr is undefined in Eq. (5), and in this case Fr = 0 according to Eq. (4). For the case when Y > 0 and Z ≥ 0, Eq. (5) is equivalent to

    Fr = Z / (Y − Z)

For fixed p and θ, Y is a positive constant. Then Fr monotonically increases with Z = θ(p − q)². Assume p ∈ (0, 0.5] (p ∈ [0.5, 1) is symmetric); then when q = 1, Fr reaches its maximum value w.r.t. q, for fixed p and θ. We denote this maximum value as Frub. Putting q = 1 into Eq. (5), we have

    Frub|q=1 = θ(1 − p) / (p − θ)    (6)

According to Eq. (6), as θ increases, Frub|q=1 increases monotonically, for a fixed p. For θ ≤ p, the Fisher score upper bound of a low-frequency pattern is smaller than that of a high-frequency one. Note that as θ increases, Frub|q=1 takes very large values; when θ → p, Frub|q=1 → ∞.
Another interesting piece of evidence for the relationship between Fr and θ is the sign of ∂Fr/∂θ. For Eq. (5), the partial derivative of Fr w.r.t. θ is

    ∂Fr/∂θ = (p − q)² p(1 − p) / (p − p² − θq² − θp + 2θpq)²  ≥  0    (7)

The inequality holds because p ∈ [0, 1]. Therefore, when θ ≤ p, Fr monotonically increases with θ, for fixed p and q. The result shows that the Fisher score of a high-frequency feature is larger than that of a low-frequency one, if p and q are fixed.
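As a quick numerical illustration (a minimal sketch, not from the paper; the class proportion p and the support values below are hypothetical), Eq. (5) and the q = 1 bound of Eq. (6) can be evaluated directly to observe the monotone growth of the bound as θ approaches p.

```python
# Sketch: evaluate Eq. (5) and the q = 1 upper bound of Eq. (6) for a fixed
# class distribution p and increasing support theta (two-class case, theta <= p).

def fisher_score(theta, p, q):
    """Eq. (5): Fisher score of a pattern with relative support theta and purity q."""
    num = theta * (p - q) ** 2
    den = p * (1 - p) * (1 - theta) - theta * (p - q) ** 2
    return 0.0 if den <= 0 else num / den  # degenerate case treated as Fr = 0, cf. Y = 0 above

def fisher_upper_bound(theta, p):
    """Eq. (6): Frub|q=1 = theta(1 - p) / (p - theta), valid for theta < p."""
    return theta * (1 - p) / (p - theta)

p = 0.4                                    # hypothetical positive-class proportion
for theta in (0.05, 0.10, 0.20, 0.30, 0.39):
    print(theta,
          round(fisher_score(theta, p, 1.0), 3),
          round(fisher_upper_bound(theta, p), 3))
# Both columns coincide and increase with theta, blowing up as theta -> p,
# which is the trend plotted as FS_UpperBnd in Figure 3.
```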
Figure 3 shows the Fisher score of each pattern vs. its (absolute) support. We also plot the Fisher score upper bound Frub w.r.t. support. As mentioned above, for θ ≤ p, Frub grows very large as θ increases, and Frub → ∞ as θ approaches p; hence, we only plot the portion of the curve which shows the trend clearly. The result is similar to Figure 2. These empirical results demonstrate that features of low support have very limited discriminative power, which is due to their limited coverage of the dataset. Features of very high support have very limited discriminative power too, which is due to their commonness in the data.

3.1.3 The Justification of Frequent Pattern-Based Classification

Based on the above analysis, we will demonstrate that frequent pattern-based classification is a scalable and effective methodology. The justification is done by building a connection between a well-established information gain-based feature selection approach and our frequent pattern-based method.
Assume the problem context is using combined features for classification. In a commonly used feature selection approach, assume all feature combinations are generated as feature candidates. A subset of high-quality features is selected for classification, with an information gain threshold IG0 (or a Fisher score threshold). According to the analysis in Section 3.1.2, one can always find a min_sup threshold θ* which satisfies

    θ* = arg max_θ (IGub(θ) ≤ IG0)    (8)

where IGub(θ) is the information gain upper bound at support θ. That is, θ* is the maximum support threshold at which the information gain upper bound is no greater than IG0.
The feature selection approach filters out all the combined features whose information gain is less than IG0; accordingly, in the frequent pattern-based method, features with support θ ≤ θ* can be safely skipped because IG(θ) ≤ IGub(θ) ≤ IGub(θ*) ≤ IG0. Compared with the information gain-based approach, it is equivalent to generating the features with min_sup = θ* and then applying feature selection on the frequent patterns only. The latter is our frequent pattern-based approach. Since the number of all feature combinations is usually very large, enumeration and feature selection over such a huge feature space is computationally intractable. In contrast, the frequent pattern-based method achieves the same result in a much more efficient way. Obviously it can benefit from state-of-the-art frequent pattern mining algorithms. The choice of the information gain threshold IG0 in the first approach corresponds to the setting of the min_sup parameter in our framework. If IG0 is large, the corresponding θ* is large, and vice versa. As it is important to determine the information gain threshold in most feature selection algorithms, the strategy of setting an appropriate min_sup is equally crucial. We will discuss this issue in Section 3.2.

3.2 The Minimum Support Effect

Since the set of frequent patterns F is generated according to min_sup, we study the impact of min_sup on the classification accuracy and propose a strategy to set min_sup.
If min_sup is set to a large value, the patterns in F correspond to very frequent ones. In the context of classification, they may not be the best feature candidates, since they appear in a large portion of the dataset, in different classes. We can clearly observe in Figures 2 and 3 that at a very large min_sup value, the theoretical upper bound decreases, due to the "overwhelming" occurrences of the high-support patterns. This is analogous to stop words in text retrieval, where highly frequent words are removed before document retrieval or text categorization.
As min_sup is lowered, the classification accuracy is expected to increase, as more discriminative patterns with medium frequency are discovered. However, as min_sup decreases to a very low value, the classification accuracy stops increasing, or even starts dropping due to overfitting. As we analyzed in Section 3.1, features with low support have low discriminative power. They could even harm the classification accuracy if they are included for classification, due to the overfitting effect. In addition, the costs of time and space at both the frequent pattern mining and the feature selection steps become very high with a low min_sup.
We propose a strategy to set min_sup, the major steps of which are outlined below.

• Compute the theoretical information gain (or Fisher score) upper bound as a function of support θ;
• Choose an information gain threshold IG0 for feature filtering purposes;
• Find θ* = arg max_θ (IGub(θ) ≤ IG0);
• Mine frequent patterns with min_sup = θ*.

First, compute the theoretical information gain upper bound as a function of support θ. This only involves the class distribution p, without generating frequent patterns. Then decide an information gain threshold IG0 and find the corresponding θ*. Then for θ ≤ θ*, IGub(θ) ≤ IGub(θ*) ≤ IG0. In this way, frequent patterns are generated efficiently without missing any feature candidates w.r.t. IG0. As there are more mature studies on how to set the information gain threshold in feature selection methods [24], we can borrow their strategy and map the selected information gain threshold to a min_sup threshold in our method.
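The strategy above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code: it assumes two classes with positive-class proportion p, takes the bound over the feasible within-pattern purities at each support, and scans a discrete support grid; the values of p and IG0 below are hypothetical.

```python
# Sketch of the min_sup-setting strategy: compute IG_ub(theta) from the class
# distribution alone, then take theta* as the largest support below which the
# bound stays <= IG0, so all lower-support patterns can be skipped safely.
from math import log2

def H(x):
    """Binary entropy."""
    return 0.0 if x <= 0.0 or x >= 1.0 else -x * log2(x) - (1 - x) * log2(1 - x)

def info_gain(theta, p, q):
    """IG of a pattern feature: support theta, within-pattern purity q, class prior p."""
    q_rest = min(1.0, max(0.0, (p - theta * q) / (1 - theta)))  # purity outside the pattern
    return H(p) - theta * H(q) - (1 - theta) * H(q_rest)

def ig_upper_bound(theta, p):
    """Max IG over feasible purities q; the maximum sits at a boundary of the feasible range."""
    q_lo = max(0.0, (p - (1 - theta)) / theta)
    q_hi = min(1.0, p / theta)
    return max(info_gain(theta, p, q_lo), info_gain(theta, p, q_hi))

def choose_min_sup(p, ig0, grid=1000):
    """Largest theta* such that IG_ub(t) <= IG0 for every support t <= theta* (cf. Eq. (8))."""
    theta_star = None
    for i in range(1, grid):
        t = i / grid
        if ig_upper_bound(t, p) <= ig0:
            theta_star = t
        else:
            break                       # the bound grows on the low-support side
    return theta_star

print(choose_min_sup(p=0.5, ig0=0.1))   # hypothetical IG0; mine patterns with min_sup = theta*
```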
3.3 Feature Selection Algorithm MMRFS

Although frequent patterns are shown to be useful for classification, not every frequent pattern is equally useful. It is necessary to perform feature selection to single out a subset of discriminative features and remove non-discriminative ones. In this section, we propose an algorithm MMRFS. The notion is borrowed from the Maximal Marginal Relevance (MMR) [4] heuristic in information retrieval, where a document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents. We first define relevance and redundancy of a frequent pattern in the context of classification.

Definition 3 (Relevance) A relevance measure S is a function mapping a pattern α to a real value such that S(α) is the relevance w.r.t. the class label.

Relevance models the discriminative power of a frequent pattern w.r.t. the class label. Measures like information gain and Fisher score can be used as a relevance measure.

Definition 4 (Redundancy) A redundancy measure R is a function mapping two patterns α and β to a real value such that R(α, β) is the redundancy between them.

Algorithm 1 Feature Selection Algorithm MMRFS
Input: Frequent patterns F, coverage threshold δ, relevance measure S, redundancy measure R
Output: A selected pattern set Fs
1:  Let α be the most relevant pattern;
2:  Fs = {α};
3:  while (true)
4:      Find a pattern β such that the gain g(β) is the maximum among the patterns in F − Fs;
5:      If β can correctly cover at least one instance
6:          Fs = Fs ∪ {β};
7:      F = F − {β};
8:      If all instances are covered δ times or F = ∅
9:          break;
10: return Fs

An interesting question arises: how many frequent patterns should be selected for effective classification? A promising method is to add a database coverage constraint δ, as in [13]. The coverage parameter δ is set to ensure that each training instance is covered at least δ times by the selected features. In this way, the number of features selected is automatically determined, given a user-specified parameter δ. The algorithm is described in Algorithm 1.
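For experimentation, here is a small runnable Python sketch of Algorithm 1. It is an interpretation, not the paper's implementation: the gain g(β) is not reproduced in this excerpt, so the sketch assumes the MMR-style combination g(β) = S(β) − max over selected α of R(α, β), and "correctly covers at least one instance" is read as covering at least one instance that is not yet covered δ times, with coverage sets supplied by the caller.

```python
# Runnable sketch of Algorithm 1 (MMRFS). Assumptions beyond the excerpt:
# g(beta) = S(beta) - max_{alpha in Fs} R(alpha, beta)  (MMR-style gain), and
# covers[p] is the set of training-instance indices pattern p correctly covers.

def mmrfs(patterns, relevance, redundancy, covers, n_instances, delta):
    """patterns: iterable of pattern ids; relevance[p] = S(p);
    redundancy(a, b) = R(a, b); delta: coverage threshold."""
    remaining = set(patterns)
    coverage = [0] * n_instances
    # 1-2: seed the selection with the single most relevant pattern
    first = max(remaining, key=lambda p: relevance[p])
    selected = [first]
    remaining.discard(first)
    for i in covers[first]:
        coverage[i] += 1
    # 3-9: greedily add the pattern with the largest marginal gain
    while remaining:
        beta = max(remaining, key=lambda p: relevance[p] -
                   max(redundancy(a, p) for a in selected))
        remaining.discard(beta)                               # 7: remove beta from F
        if any(coverage[i] < delta for i in covers[beta]):    # 5: beta still adds coverage
            selected.append(beta)                             # 6
            for i in covers[beta]:
                coverage[i] += 1
        if all(c >= delta for c in coverage):                 # 8: all covered delta times
            break
    return selected                                           # 10: the selected set Fs
```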
Table 1. Accuracy (%) by SVM: single-feature models (Item All, Item FS, Item RBF) and frequent-pattern models (Pat All, Pat FS)

Data      Item All   Item FS   Item RBF   Pat All   Pat FS
anneal    99.78      99.78     99.11      99.33     99.67
austral   85.01      85.50     85.01      81.79     91.14
auto      83.25      84.21     78.80      74.97     90.79
breast    97.46      97.46     96.98      96.83     97.78
cleve     84.81      84.81     85.80      78.55     95.04
diabetes  74.41      74.41     74.55      77.73     78.31
glass     75.19      75.19     74.78      79.91     81.32
heart     84.81      84.81     84.07      82.22     88.15
hepatic   84.50      89.04     85.83      81.29     96.83
horse     83.70      84.79     82.36      82.35     92.39
iono      93.15      94.30     92.61      89.17     95.44
iris      94.00      96.00     94.00      95.33     96.00
labor     89.99      91.67     91.67      94.99     95.00
lymph     81.00      81.62     84.29      83.67     96.67
pima      74.56      74.56     76.15      76.43     77.16
sonar     82.71      86.55     82.71      84.60     90.86
vehicle   70.43      72.93     72.14      73.33     76.34
wine      98.33      99.44     98.33      98.30     100
zoo       97.09      97.09     95.09      94.18     99.00

Table 2. Accuracy (%) by C4.5: single-feature models (Item All, Item FS) and frequent-pattern models (Pat All, Pat FS)

Data      Item All   Item FS   Pat All   Pat FS
anneal    98.33      98.33     97.22     98.44
austral   84.53      84.53     84.21     88.24
auto      71.70      77.63     71.14     78.77
breast    95.56      95.56     95.40     96.35
cleve     80.87      80.87     80.84     91.42
diabetes  77.02      77.02     76.00     76.58
glass     75.24      75.24     76.62     79.89
heart     81.85      81.85     80.00     86.30
hepatic   78.79      85.21     80.71     93.04
horse     83.71      83.71     84.50     87.77
iono      92.30      92.30     92.89     94.87
iris      94.00      94.00     93.33     93.33
labor     86.67      86.67     95.00     91.67
lymph     76.95      77.62     74.90     83.67
pima      75.86      75.86     76.28     76.72
sonar     80.83      81.19     83.67     83.67
vehicle   70.70      71.49     74.24     73.06
wine      95.52      93.82     96.63     99.44
zoo       91.18      91.18     95.09     97.09
applied on F, and a classifier is built using the features in I ∪ Fs, denoted as Pat FS. For comparison, we test the classifiers built on single features, denoted as Item All (using all single features) and Item FS (selected single features), respectively. Table 1 shows the results by SVM and Table 2 shows the results by C4.5. In LIBSVM, all the above four models use a linear kernel. In addition, an SVM model is built using the RBF kernel on single features, denoted as Item RBF.
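As a concrete illustration of this setup (a sketch under assumptions, not the paper's experimental code, which uses LIBSVM and C4.5): each instance is mapped to a binary vector over the single items I plus the selected patterns Fs, and any classifier can then be trained on that representation. Here scikit-learn's LinearSVC stands in for the linear-kernel SVM, and the items, patterns, and labels are hypothetical.

```python
# Sketch only: map instances into the feature space I ∪ Fs (single items plus
# selected frequent patterns) and train a linear-kernel SVM, mirroring Pat FS.
from sklearn.svm import LinearSVC

def to_vector(instance, single_items, patterns):
    """instance: set of single items; a pattern fires when all of its items are present."""
    item_part = [1 if it in instance else 0 for it in single_items]
    pattern_part = [1 if pat <= instance else 0 for pat in patterns]
    return item_part + pattern_part

single_items = ["a", "b", "c", "d"]
selected_patterns = [frozenset({"a", "b"}), frozenset({"c", "d"})]   # e.g. output of MMRFS
train = [({"a", "b"}, 1), ({"a", "b", "c"}, 1), ({"c", "d"}, 0), ({"b", "d"}, 0)]

X = [to_vector(inst, single_items, selected_patterns) for inst, _ in train]
y = [label for _, label in train]
clf = LinearSVC().fit(X, y)             # Pat FS-style model: items + selected patterns
print(clf.predict([to_vector({"a", "b", "d"}, single_items, selected_patterns)]))
```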
From Table 1, it is clear that Pat FS achieves the best classification accuracy in most cases. It has a significant improvement over Item All and Item FS. This result is consistent with our theoretical analysis that (1) frequent patterns are useful by mapping the data to a higher-dimensional space; and (2) the discriminative power of some frequent patterns is higher than that of single features.
Another interesting observation is that the performance of Item RBF is inferior to that of Pat FS. The reason is that the RBF kernel has a different mechanism for feature generation than our approach. In our approach, min_sup is used to filter out low-frequency features and MMRFS is applied to select highly discriminative features. On the other hand, the RBF kernel maps the original feature vector to a possibly infinite dimension. The degree (i.e., the maximum length) of combined features depends on the value of γ, where γ is the factor in K(x, y) = e^(−γ‖x−y‖²), i.e., the degree increases as γ grows. Given a particular γ, the combined features of length ≤ p are used without discriminating their frequency or predictive power, while the combined features of length > p are filtered out.
We also observe that the performance of Pat All is much worse than that of Pat FS, which confirms our reasoning that redundant and non-discriminative patterns often overfit the model and deteriorate the classification accuracy. In addition, MMRFS is shown to be effective. Generally, any effective feature selection algorithm can be used in our framework. The emphasis is that feature selection is an important step in frequent pattern-based classification.
The above results are also observed in Table 2 for decision tree-based classification.

4.2 Scalability Tests

Scalability tests are performed to show that our frequent pattern-based framework is very scalable with good classification accuracy.
Three dense datasets, Chess, Waveform, and Letter Recognition¹, from the UCI repository are used. On each dataset, min_sup = 1 is used to enumerate all feature combinations, and feature selection is applied over them. In comparison, the frequent pattern-based classification method is tested with varying support threshold settings.

Table 3. Accuracy & Time on Chess Data

min_sup   #Patterns   Time (s)   SVM (%)   C4.5 (%)
1         N/A         N/A        N/A       N/A
2000      68,967      44.703     92.52     97.59
2200      28,358      19.938     91.68     97.84
2500      6,837       2.906      91.68     97.62
2800      1,031       0.469      91.84     97.37
3000      136         0.063      91.90     97.06

Table 4. Accuracy & Time on Waveform Data

min_sup   #Patterns   Time (s)   SVM (%)   C4.5 (%)
1         9,468,109   N/A        N/A       N/A
80        26,576      176.485    92.40     88.35
100       15,316      90.406     92.19     87.29
150       5,408       23.610     91.53     88.80
200       2,481       8.234      91.22     87.32

Table 5. Accuracy & Time on Letter Recognition Data

min_sup   #Patterns   Time (s)   SVM (%)   C4.5 (%)
1         5,147,030   N/A        N/A       N/A
3000      3,246       200.406    79.86     77.08
3500      2,078       103.797    80.21     77.28
4000      1,429       61.047     79.57     77.32
4500      962         35.235     79.51     77.42

In Table 3, we show the results of varying min_sup on the Chess data, which contains 3,196 instances, 2 classes and 73 items. #Patterns gives the number of closed patterns. Time gives the sum of the pattern mining and feature selection time. We do not include the classification time in the table because our goal is to show that the proposed framework has good scalability in feature generation and selection. The last two columns give the classification accuracy by SVM and C4.5. When min_sup = 1, the enumeration of all the patterns cannot complete in days, thus blocking model construction. Our framework, benefiting from a higher support threshold, can accomplish the mining of frequent patterns in seconds and achieve satisfactory classification accuracy.
Tables 4 and 5 show similar results on the other two datasets. When min_sup = 1, millions of patterns are enumerated. Feature selection fails with such a large number of patterns. In contrast, our frequent pattern-based method is very efficient and achieves good accuracy within a wide range of minimum support thresholds.

¹ The discretized Letter Recognition data is obtained from www.csc.liv.ac.uk/∼frans/KDD/Software/LUCS-KDD-DN/DataSets

5 Related Work

The frequent pattern-based classification is related to associative classification. In associative classification, a classifier is built based on high-confidence, high-support association rules [14, 13, 25, 6, 19]. The association between frequent patterns and class labels is used for prediction.
A recent work on top-k rule mining [6] discovers top-k covering rule groups for each row of gene expression profiles. Prediction is then performed based on a classification score which combines the support and confidence measures of the rules.
HARMONY [19] is another rule-based classifier which directly mines classification rules. It uses an instance-centric rule-generation approach and ensures, for each training instance, that one of the highest-confidence rules covering the instance is included in the rule set. HARMONY is shown to be more efficient and scalable than previous rule-based classifiers. On several datasets that were tested by both our method and HARMONY, our classification accuracy is significantly higher; e.g., the improvement is up to 11.94% on Waveform and 3.40% on Letter Recognition.
Our work differs from associative classification in the following aspects: (1) We use frequent patterns to represent the data in a different feature space, in which any learning algorithm can be used, whereas associative classification builds a classification model using rules only; (2) in associative classification, the prediction process is to find one or several top-ranked rule(s) for prediction, whereas in our case, the prediction is made by the classification model; and (3) more importantly, we provide an in-depth analysis of why frequent patterns provide a good solution for classification, by studying the relationship between discriminative power and pattern support. By establishing a connection with an information gain-based feature selection approach, we propose a strategy for setting min_sup as well.
In addition, we demonstrate the importance of feature selection on the frequent pattern features and propose a feature selection algorithm.
Other related work includes classification using string kernels [15, 12], word combinations in NLP, and structural features in graph classification [7]. In all these studies, frequent patterns are generated and the data is mapped to a higher-dimensional feature space. Data which are not linearly separable in the original space become linearly separable in the mapped space.

6 Conclusions

In this paper, we propose a systematic framework for frequent pattern-based classification and give theoretical answers to several critical questions raised by this framework. Our study shows that frequent patterns are high-quality features and have good model generalization ability. Connected with a commonly used feature selection approach, our method is able to overcome two kinds of overfitting problems and is shown to be scalable. A strategy for setting min_sup is also suggested. In addition, we propose a feature selection algorithm to select discriminative frequent patterns. Experimental studies demonstrate that a significant improvement in classification accuracy is achieved using the frequent pattern-based classification framework.
The framework is also applicable to more complex patterns, including sequences and graphs. In the future, we will conduct research in this direction.

References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of SIGMOD, pages 207–216, 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of VLDB, pages 487–499, 1994.
[3] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of ICDE, pages 3–14, 1995.
[4] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proc. of SIGIR, pages 335–336, 1998.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm.
[6] G. Cong, K. Tan, A. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. In Proc. of SIGMOD, pages 670–681, 2005.
[7] M. Deshpande, M. Kuramochi, and G. Karypis. Frequent sub-structure-based approaches for classifying chemical compounds. In Proc. of ICDM, pages 35–42, 2003.
[8] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley Interscience, 2nd edition, 2000.
[9] G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. In ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), 2003.
[10] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. of SIGMOD, pages 1–12, 2000.
[11] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. of ICDM, pages 313–320, 2001.
[12] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proc. of PSB, pages 564–575, 2002.
[13] W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In Proc. of ICDM, pages 369–376, 2001.
[14] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proc. of KDD, pages 80–86, 1998.
[15] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
[16] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proc. of ICDE, pages 215–226, 2001.
[17] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[18] P. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In Proc. of KDD, pages 32–41, 2002.
[19] J. Wang and G. Karypis. HARMONY: Efficiently mining the best rules for classification. In Proc. of SDM, pages 205–216, 2005.
[20] K. Wang, C. Xu, and B. Liu. Clustering transactions using large items. In Proc. of CIKM, pages 483–490, 1999.
[21] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition, 2005.
[22] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. of ICDM, pages 721–724, 2002.
[23] X. Yan, P. S. Yu, and J. Han. Graph Indexing: A frequent structure-based approach. In Proc. of SIGMOD, pages 335–346, 2004.
[24] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proc. of ICML, pages 412–420, 1997.
[25] X. Yin and J. Han. CPAR: Classification based on predictive association rules. In Proc. of SDM, pages 331–335, 2003.
[26] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1/2):31–60, 2001.
[27] M. J. Zaki and C. Hsiao. CHARM: An efficient algorithm for closed itemset mining. In Proc. of SDM, pages 457–473, 2002.