Integration of Classification and Pattern Mining: A Discriminative and Frequent Pattern-Based Approach
Classification Overview
Associative Classification
Pattern-growth approach
• Divide-and-conquer, depth-first search
• Representative algorithms: FP-Growth, PrefixSpan (itemsets and sequences); MoFa, gSpan, Gaston (graphs)
Pattern Growth Approach
• Depth-first search: grow a size-k pattern to a size-(k+1) one by adding one element
Vertical Data Approach
• Major operation: transaction (tid) list intersection
t(AB) = t(A) ∩ t(B)

Item | Transaction ids
A    | t1, t2, t3, …
B    | t2, t3, t4, …
C    | t1, t3, t4, …
…    | …
Mining High Dimensional Data
Mining Colossal Patterns
[Zhu et al., ICDE’07]
• Mining colossal patterns: challenges
– A small number of colossal (i.e., very large) patterns, but a very large number of mid-sized patterns
– If the mining of mid-sized patterns is explosive in size, there is no hope of finding colossal patterns efficiently while insisting on the “complete set” mining philosophy
• A pattern-fusion approach
– Jump out of the swamp of mid-sized results and quickly reach colossal patterns
– Fuse small patterns into large ones directly
Impact on Other Data Analysis Tasks
• Association and correlation analysis
– Association: support and confidence
– Correlation: lift, chi-square, cosine, all_confidence, coherence
– A comparative study [Tan, Kumar and Srivastava, KDD’02]
Classification Overview
[Figure: training instances → model learning → model; test instances → model → prediction: positive or negative. The example model is a network over the variables LungCancer, Emphysema, PositiveXRay, and Dyspnea with yes/no values.]
[Figure: frequent pattern analysis combined with classification yields frequent pattern-based classification. Example applications: text categorization, drug design, spam detection, disambiguation.]
Sequences vs. single commands:
… login, changeDir, delFile, appendFile, logout …
… login, setFileType, storeFile, logout …
[Figure: with a predefined feature vector, training instances feed a classification model directly for prediction; with NO predefined feature vector, how to build the classification model is the open question.]
Discriminative Pattern-Based Framework
[Figure: training instances → frequent pattern mining → discriminative feature construction → model learning → model; test instances → feature space transformation → model → prediction: positive or negative. Each instance is transformed into a binary feature vector indicating which patterns (e.g., {B, C}) it contains.]
Frequent Graphs
[Figure: training chemical compounds labeled Active / Inactive → frequent subgraph mining with min_sup = 2 → frequent graphs g1, g2 → each compound transformed into a binary vector over g1, g2 (descriptor-space representation) → classifier model predicts the class of a new compound. Courtesy of Nikil Wale.]
Applications: Bug Localization
[Figure: a program calling graph.]
Classification Overview
Associative Classification
Representative work
CBA [Liu, Hsu and Ma, KDD’98]
Emerging patterns [Dong and Li, KDD’99]
CMAR [Li, Han and Pei, ICDM’01]
CPAR [Yin and Han, SDM’03]
RCBT [Cong et al., SIGMOD’05]
Lazy classifier [Veloso, Meira and Zaki, ICDM’06]
Integrated with classification models [Cheng et al., ICDE’07]
CBA [Liu, Hsu and Ma, KDD’98]
• Rule mining
– Mine the set of association rules w.r.t. min_sup and min_conf
– Rank rules in descending order of confidence and support
– Select rules to ensure training instance coverage
• Prediction
– Apply the first rule that matches a test case
– Otherwise, apply the default rule
CMAR [Li, Han and Pei, ICDM’01]
• Basic idea
– Mining: build a class distribution-associated FP-tree
– Prediction: combine the strength of multiple rules
• Rule mining
– Mine association rules from a class distribution-associated FP-tree
– Store and retrieve association rules in a CR-tree
– Prune rules based on confidence, correlation and database coverage
Class Distribution-Associated FP-tree
CR-tree: A Prefix-tree to Store and Index Rules
Prediction Based on Multiple Rules
• CMAR combines the rules matching a test case with a weighted chi-square measure:
weighted χ² = Σ (χ² · χ² / maxχ²)
where maxχ² is the upper bound of the chi-square of a rule.
CPAR [Yin and Han, SDM’03]
• Basic idea
– Combine associative classification and FOIL-based rule generation
– FOIL gain: criterion for selecting a literal
• Prediction
– Collect all rules matching a test case
– Select the best k rules for each class
– Choose the class with the highest expected accuracy for prediction
Performance Comparison
[Yin and Han, SDM’03]
Data C4.5 Ripper CBA CMAR CPAR
anneal 94.8 95.8 97.9 97.3 98.4
austral 84.7 87.3 84.9 86.1 86.2
auto 80.1 72.8 78.3 78.1 82.0
breast 95.0 95.1 96.3 96.4 96.0
cleve 78.2 82.2 82.8 82.2 81.5
crx 84.9 84.9 84.7 84.9 85.7
diabetes 74.2 74.7 74.5 75.8 75.1
german 72.3 69.8 73.4 74.9 73.4
glass 68.7 69.1 73.9 70.1 74.4
heart 80.8 80.7 81.9 82.2 82.6
hepatic 80.6 76.7 81.8 80.5 79.4
horse 82.6 84.8 82.1 82.6 84.2
hypo 99.2 98.9 98.9 98.4 98.1
iono 90.0 91.2 92.3 91.5 92.6
iris 95.3 94.0 94.7 94.0 94.7
labor 79.3 84.0 86.3 89.7 84.7
… … … … … …
Average 83.34 82.93 84.69 85.22 85.17
Emerging Patterns
[Dong and Li, KDD’99]
• Emerging Patterns (EPs) are contrast patterns between two classes of data whose support changes significantly between the two classes.
• Each data tuple has several features such as: odor, ring-number, stalk-surface-below-ring, etc.
• Given a test T and a set E(Ci) of EPs for class Ci, the aggregate score of T for Ci is
score(T, Ci) = Σ strength(X), summed over the EPs X ∈ E(Ci) that match T
• For each class, one may use the median (or 85%) aggregated value to normalize, to avoid bias towards the class with more EPs
Courtesy of Bailey and Dong
Top-k Covering Rule Groups for Gene Expression Data [Cong et al., SIGMOD’05]
• Problem
– Mine strong association rules to reveal the correlation between gene expression patterns and disease outcomes
– Example rule: gene1[a1, b1], …, genen[an, bn] → class
– Build a rule-based classifier for prediction
• Solution
– Mine top-k covering rule groups with row enumeration
– Build the classifier RCBT on top-k covering rule groups
[Figure: row (tid) vs. item enumeration; top-k rule groups.]
Lazy Associative Classification [Veloso, Meira and Zaki, ICDM’06]
– Advantages
• Search space is reduced/focused: covers small disjuncts (support can be lowered)
• Only applicable rules are generated: a much smaller number of CARs are induced
– Disadvantages
• Several models are generated, one for each test instance
• Potentially high computational cost
• Cache infrastructure
– All CARs are stored in main memory
– Each CAR has only one entry in the cache
– Replacement policy: LFU heuristic
Courtesy of Mohammed Zaki
• Feature selection
– Select discriminative features
– Remove redundancy and correlation
• Model learning
– A general classifier based on SVM, C4.5, or another classification model
Information Gain vs. Frequency?
[Plot: information gain (y-axis) vs. pattern frequency/support (x-axis); curves: InfoGain and its upper bound IG_UpperBnd.]
IG(C|X) = H(C) − H(C|X)
Fisher Score vs. Frequency?
[Plot: Fisher score (y-axis) vs. pattern frequency/support (x-axis); curves: FisherScore and its upper bound FS_UpperBnd.]
Fr = Σ_{i=1}^{c} n_i (μ_i − μ)² / Σ_{i=1}^{c} n_i σ_i²
Analytical Study on Information Gain
IG(C|X) = H(C) − H(C|X)
H(C) = −Σ_{i=1}^{m} p_i log2(p_i)
H(C|X) = Σ_j P(X = x_j) H(C|X = x_j)
With the pattern frequency θ = P(x = 1), the probability of the positive class p = P(c = 1), and q = P(c = 1 | x = 1), the entropy when the feature does not appear (x = 0) is
H(C|X=0) = −((p − θq)/(1 − θ)) log((p − θq)/(1 − θ)) − (((1 − p) − θ(1 − q))/(1 − θ)) log(((1 − p) − θ(1 − q))/(1 − θ))
Conditional Entropy in a Pure Case
• When q = 1 (or q = 0), the pattern appears in only one class. For q = 1:
H(C|X)|_{q=1} = (θ − 1) ( ((p − θ)/(1 − θ)) log((p − θ)/(1 − θ)) + ((1 − p)/(1 − θ)) log((1 − p)/(1 − θ)) )
• When θ = p, this reduces to H(C|X)|_{q=1} = −(1 − p) log 1 = 0 (since p < 1), so the information gain is maximal.
[Plot: information gain (y-axis, 0–0.9) vs. pattern frequency (x-axis, 0–700); curves: InfoGain and IG_UpperBnd.]
min_sup | # Patterns | Time  | SVM (%) | Decision Tree (%)
1       | N/A        | N/A   | N/A     | N/A
2000    | 68,967     | 44.70 | 92.52   | 97.59
2200    | 28,358     | 19.94 | 91.68   | 97.84
2500    | 6,837      | 2.91  | 91.68   | 97.62
2800    | 1,031      | 0.47  | 91.84   | 97.37
3000    | 136        | 0.06  | 91.90   | 97.06
Classification Overview
Associative Classification
Basic idea
• Extract graph substructures F = {g1, …, gn}
• Represent a graph with a feature vector x = {x1, …, xn}, where xi is the frequency of gi in that graph
• Build a classification model
[Figure: two example molecules and their n-dimensional substructure feature vectors. Courtesy of Nikil Wale.]
Maccs Keys (MK)
• Each fragment forms a fixed dimension in the descriptor-space
• A domain expert identifies “important” fragments for bioactivity
[Figure: example fragments containing NH2, OH, and C=O groups.]

Deleting bi-connected components from the compound leaves the left-over trees
[Figure: a compound decomposed into bi-connected components and left-over trees.]
[Figure: frequent subgraph discovery with a minimum support threshold yields fragments such as one with support +ve: 40%, −ve: 0% and another with support +ve: 1%, −ve: 30%.]
• Feature generation
– Frequent topological subgraphs by FSG
– Frequent geometric subgraphs with 3D shape information
• Feature selection
– Sequential covering paradigm
• Classification
– Use SVM to learn a classifier based on feature vectors
– Assign different misclassification costs for different classes to address skewed class distribution
Classification Overview
Associative Classification
[Figure: the two-step framework — training instances → mine frequent patterns (10^4–10^6) → filter → discriminative patterns → feature construction → model learning — contrasted with direct mining of discriminative patterns from the data with an FP-tree. The discriminative measure is non-monotonic, while subgraphs are enumerated from small size to large size using the anti-monotonic support.]
• Extensions
– Mining top-k discriminative patterns
– Mining approximate/weighted discriminative patterns
Harmony [Wang and Karypis, SDM’05]
• Prediction
– For a test case, partition the rules into k groups based on class labels
– Compute the score for each rule group
– Predict based on the rule group with the highest score
Accuracy of Harmony
Runtime of Harmony
DDPMine [Cheng et al., ICDE’08]
• Basic idea
– Integration of branch-and-bound search with FP-growth mining
– Iteratively eliminate training instances and progressively shrink the FP-tree
• Performance
– Maintains high accuracy
– Improves mining efficiency
[Figure: a pattern enumeration tree over itemsets such as a, ab, ac, bc, bd, cd, ce, cef, ceg.]
Branch-and-Bound Search
[Figure: training examples are progressively covered — the 1st branch-and-bound search selects feature 1 and removes the examples it covers, the 2nd selects feature 2, the 3rd selects feature 3.]
1. Branch-and-Bound Search
• Instance elimination: |D_i| = |D_{i−1}| − |T(α_i)| ≤ (1 − θ0)|D_{i−1}| ≤ … ≤ (1 − θ0)^i |D_0|
• Number of iterations: n ≤ log_{1/(1−θ0)} |D_0|
Accuracy Comparison
[Figure: classification accuracy of Harmony vs. DDPMine on benchmark datasets.]
Objective functions
• Rule of thumb: if the frequency difference of a graph pattern between the positive dataset and the negative dataset increases, the pattern becomes more interesting
[Figure: sibling patterns in the pattern search tree — a size-5 graph g and a size-6 graph; g’ is a sibling of g.]
LEAP Algorithm
3. Branch-and-Bound Search with F(g*)
Branch-and-Bound vs. LEAP
[Table: Branch-and-Bound vs. LEAP, comparing optimality (guaranteed vs. near optimal), feature and data description, AUC, and runtime.]
• Upper bound:
• Original set:
• Subset:
4. Non-overfitting
[Chart: log number of patterns (0–4 scale) on Adult, Chess, Hypo, Sick, Sonar.]

Datasets | MbT #Pat | #Pat using MbT sup | Ratio (MbT #Pat / #Pat using MbT sup)
Adult    | 1039.2   | 252809             | 0.41%
Chess    | 46.8     | +∞                 | ~0%
Hypo     | 14.8     | 423439             | 0.0035%
Sick     | 15.4     | 4818391            | 0.00032%
Sonar    | 7.4      | 95507              | 0.00775%

[Chart: accuracy (70–100%) on the five datasets — 4 wins, 1 loss.]
[Chart: Log(DT #Pat) vs. Log(MbT #Pat) on the five datasets — MbT uses a much smaller number of patterns.]
Classification Overview
Associative Classification
– Hybrid weighting
[Figure: the negative examples are partitioned into k samples; each sample is combined with the positive examples, and an FS-based classification model is trained on each, yielding classifiers C1, C2, C3, …, Ck.]
f_E(x) = (1/k) Σ_{i=1}^{k} f_i(x)
The error of each classifier is independent and can be reduced through the ensemble.
ROC Curve
Sampling and ensemble
Classification Overview
Associative Classification
References (2)
G. Cong, K. Tan, A. Tung, and X. Xu. Mining Top-k Covering Rule Groups for
Gene Expression Data, SIGMOD’05.
M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent
Substructure-based Approaches for Classifying Chemical Compounds,
TKDE’05.
G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering
Trends and Differences, KDD’99.
G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating
Emerging Patterns, DS’99.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.), John
Wiley & Sons, 2001.
W. Fan, K. Zhang, H. Cheng, J. Gao, X. Yan, J. Han, P. S. Yu, and O.
Verscheure. Direct Mining of Discriminative and Essential Graphical and
Itemset Features via Model-based Search Tree, KDD’08.
J. Han and M. Kamber. Data Mining: Concepts and Techniques (2nd ed.),
Morgan Kaufmann, 2006.
J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate
Generation, SIGMOD’00.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical
Learning, Springer, 2001.
D. Heckerman, D. Geiger and D. M. Chickering. Learning Bayesian Networks:
The Combination of Knowledge and Statistical Data, Machine Learning,
1995.
References (3)
T. Horvath, T. Gartner, and S. Wrobel. Cyclic Pattern Kernels for
Predictive Graph Mining, KDD’04.
J. Huan, W. Wang, and J. Prins. Efficient Mining of Frequent Subgraph
in the Presence of Isomorphism, ICDM’03.
A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based Algorithm for
Mining Frequent Substructures from Graph Data, PKDD’00.
T. Kudo, E. Maeda, and Y. Matsumoto. An Application of Boosting to
Graph Classification, NIPS’04.
M. Kuramochi and G. Karypis. Frequent Subgraph Discovery, ICDM’01.
W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification
based on Multiple Class-association Rules, ICDM’01.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association
Rule Mining, KDD’98.
H. Liu, J. Han, D. Xin, and Z. Shao. Mining Frequent Patterns on Very
High Dimensional Data: A Top-down Row Enumeration Approach,
SDM’06.
S. Nijssen, and J. Kok. A Quickstart in Frequent Structure Mining Can
Make a Difference, KDD’04.
F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki. CARPENTER: Finding
Closed Patterns in Long Biological Datasets, KDD’03.
References (4)
F. Pan, A. Tung, G. Cong, and X. Xu. COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery, SSDBM’04.
J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M-C. Hsu.
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-projected
Pattern Growth, ICDE’01.
R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and
Performance Improvements, EDBT’96.
Y. Sun, Y. Wang, and A. K. C. Wong. Boosting an Associative Classifier,
TKDE’06.
P-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness
Measure for Association Patterns, KDD’02.
R. Ting and J. Bailey. Mining Minimal Contrast Subgraph Patterns, SDM’06.
N. Wale and G. Karypis. Comparison of Descriptor Spaces for Chemical
Compound Retrieval and Classification, ICDM’06.
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by Pattern Similarity in
Large Data Sets, SIGMOD’02.
J. Wang and G. Karypis. HARMONY: Efficiently Mining the Best Rules for
Classification, SDM’05.
X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining Significant Graph Patterns
by Scalable Leap Search, SIGMOD’08.
X. Yan and J. Han. gSpan: Graph-based Substructure Pattern Mining,
ICDM’02.
References (5)
X. Yan, P.S. Yu, and J. Han. Graph Indexing: A Frequent Structure-
based Approach, SIGMOD’04.
X. Yin and J. Han. CPAR: Classification Based on Predictive
Association Rules, SDM’03.
M.J. Zaki. Scalable Algorithms for Association Mining, TKDE’00.
M.J. Zaki. SPADE: An Efficient Algorithm for Mining Frequent
Sequences, Machine Learning’01.
M.J. Zaki and C.J. Hsiao. CHARM: An Efficient Algorithm for Closed
Itemset Mining, SDM’02.
F. Zhu, X. Yan, J. Han, P.S. Yu, and H. Cheng. Mining Colossal Frequent
Patterns by Core Pattern Fusion, ICDE’07.
Questions?
[email protected]
http://www.se.cuhk.edu.hk/~hcheng