Data Mining: Concepts and Techniques
— Chapter 5 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.
April 16, 2018
Chapter 5: Mining Frequent Patterns,
Association and Correlations
■ Max-patterns: i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
■ Frequent pattern growth approach (FPgrowth—Han, Pei & Yin @SIGMOD’00)
■ Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
[Figure: counting supports of candidates using a hash tree. The subset function hashes items into three buckets—1,4,7 / 2,5,8 / 3,6,9—and transaction {1, 2, 3, 5, 6} is recursively matched against the candidate 3-itemsets stored at the leaves (124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689).]
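A minimal sketch of what the hash-tree pass accomplishes: counting the support of candidate k-itemsets against a transaction. A real Apriori implementation stores candidates in the hash tree shown above so each transaction only probes matching leaves; this illustrative helper (`count_supports` is not from the book) enumerates every k-subset of the transaction instead, which is equivalent but slower.

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count how many transactions contain each candidate k-itemset.

    A hash tree would hash items into buckets such as {1,4,7}, {2,5,8},
    {3,6,9} and probe only matching leaves; here we simply enumerate
    every k-subset of each transaction.
    """
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

# The slide's transaction {1, 2, 3, 5, 6}, probed against some 3-itemsets
cands = [(1, 2, 3), (1, 2, 5), (1, 3, 5), (2, 3, 5), (3, 5, 6), (1, 5, 9)]
print(count_supports([[1, 2, 3, 5, 6]], cands, 3))
```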
Efficient Implementation of Apriori in SQL
Further Improvement of the Apriori Method
■ Challenges
■ Multiple scans of transaction database
■ Huge number of candidates
■ Tedious workload of support counting for candidates
■ Improving Apriori: general ideas
■ Reduce passes of transaction database scans
■ Shrink number of candidates
■ Facilitate support counting of candidates
Partition: Scan Database Only Twice
■ Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
■ Scan 1: partition database and find local frequent
patterns
■ Scan 2: consolidate global frequent patterns
■ A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB’95
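The two-scan idea above can be sketched in a few lines. This is an illustrative toy (brute-force local mining, names like `partition_mine` are mine, not the paper's): scan 1 mines each partition for locally frequent itemsets, whose union is a complete candidate set because any globally frequent itemset must be locally frequent somewhere; scan 2 counts those candidates once over the whole database.

```python
from itertools import combinations

def local_frequent(partition, min_sup_ratio):
    """Scan 1 helper: all itemsets locally frequent in one partition (brute force)."""
    n = len(partition)
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for s in combinations(sorted(t), k):
                counts[s] = counts.get(s, 0) + 1
    return {s for s, c in counts.items() if c / n >= min_sup_ratio}

def partition_mine(db, num_parts, min_sup_ratio):
    """Partition algorithm sketch (Savasere et al., VLDB'95): two DB scans total."""
    size = -(-len(db) // num_parts)            # ceiling division
    candidates = set()
    for i in range(0, len(db), size):          # scan 1: local frequent patterns
        candidates |= local_frequent(db[i:i + size], min_sup_ratio)
    counts = {c: 0 for c in candidates}
    for t in db:                               # scan 2: consolidate globally
        items = set(t)
        for c in candidates:
            if set(c) <= items:
                counts[c] += 1
    return {c for c, n in counts.items() if n / len(db) >= min_sup_ratio}
```

Note the union of local results may contain false candidates (locally but not globally frequent); scan 2 filters them out, so no third pass is needed.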
DIC: Reduce Number of Scans
■ Once both A and D are determined frequent, the counting of AD begins
■ Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

[Figure: itemset lattice {} → {A, B, C, D} → {AB, AC, AD, BC, BD, CD} → {ABC, ABD, ACD, BCD} → ABCD, beside the transaction stream—Apriori counts 1-itemsets, 2-itemsets, … in separate passes, while DIC begins counting 2-itemsets and 3-itemsets partway through earlier scans.]

■ S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
Bottleneck of Frequent-pattern Mining
Construct FP-tree from a Transaction Database
■ Completeness
■ Preserve complete information for frequent pattern
mining
■ Never break a long pattern of any transaction
■ Compactness
■ Frequent patterns can be partitioned by the f-list f-c-a-b-m-p:
■ Patterns containing p
■ Patterns having m but no p
■ …
■ Pattern f
Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

[Figure: FP-tree rooted at {}—path f:4 → c:3 → a:3 with children m:2 → p:2 and b:1 → m:1; a second child b:1 under f:4; and a separate path c:1 → b:1 → p:1. Header-table entries link to the nodes of each item.]

Conditional pattern bases
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3

Cond. pattern base of “am”: (fc:3)   →  am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3)    →  cm-conditional FP-tree: {} → f:3
Cond. pattern base of “cam”: (f:3)   →  cam-conditional FP-tree: {} → f:3
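The recursion above can be sketched directly on pattern bases, represented as (prefix-path, count) pairs. This is an illustrative helper (`fp_growth` here is my simplification, not the book's pseudocode): for each frequent item in the base, emit the extended pattern and recurse on that item's conditional pattern base.

```python
from collections import Counter

def fp_growth(pattern_bases, min_sup, suffix=()):
    """Recursively mine conditional pattern bases given as (path, count) pairs.

    Each frequent item i in the base yields pattern suffix + {i}, then we
    recurse on the i-conditional base (path prefixes ending before i)."""
    counts = Counter()
    for path, n in pattern_bases:
        for item in path:
            counts[item] += n
    results = {}
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        new_suffix = (item,) + suffix
        results[new_suffix] = sup
        cond_base = [(path[:path.index(item)], n)
                     for path, n in pattern_bases if item in path]
        results.update(fp_growth(cond_base, min_sup, new_suffix))
    return results

# m's conditional pattern base from the slide, min_sup = 3:
out = fp_growth([(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)], 3, ("m",))
print(sorted(out))   # fm, cm, am, fcm, fam, cam, fcam — all with support 3
```

On this input the recursion reproduces the slide's result: all seven frequent patterns ending in m (fm, cm, am, fcm, fam, cam, fcam), while b is pruned for falling below min_sup.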
A Special Case: Single Prefix Path in FP-tree
■ Suppose a (conditional) FP-tree T has a shared single prefix-path P
■ Mining can be decomposed into two parts
■ Reduction of the single prefix path into one node
■ Concatenation of the mining results of the two parts

[Figure: prefix path {} → a1:n1 → a2:n2 → a3:n3 reduced to a single node r1, with the remaining branches mined as a separate database partition.]

■ Method
■ For each frequent item, construct its conditional pattern-base and then its conditional FP-tree
■ Repeat the process on each newly created conditional FP-tree
■ Until the resulting FP-tree is empty, or it contains only a single path (the path generates all combinations of its sub-items, each a frequent pattern)
Partition-based Projection
■ Parallel projection needs a lot of disk space
■ Partition projection saves it

[Figure: Tran. DB {fcamp, fcabm, fb, cbp, fcamp} projected partition by partition; e.g., am-proj DB = {fc, fc, fc}, cm-proj DB = {f, f, f}, …]
FP-Growth vs. Apriori: Scalability With the Support
Threshold
FP-Growth vs. Tree-Projection: Scalability with
the Support Threshold
Why Is FP-Growth the Winner?
■ Divide-and-conquer:
■ decompose both the mining task and DB according to
the frequent patterns obtained so far
■ leads to focused search of smaller databases
■ Other factors
■ no candidate generation, no candidate test
■ compressed database: FP-tree structure
■ no repeated scan of entire database
■ basic ops: counting local frequent items and building sub-FP-trees; no pattern search and matching
Implications of the Methodology
■ Mining closed frequent itemsets and max-patterns: CLOSET (DMKD’00)
■ Mining sequential patterns
MaxMiner: Mining Max-patterns
■ 1st scan: find frequent items: A, B, C, D, E

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

■ 2nd scan: find support for
■ AB, AC, AD, AE, ABCDE
■ BC, BD, BE, BCDE
■ CD, CE, DE, CDE        (potential max-patterns)
■ Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in a later scan
■ R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD’98
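The pruning argument above is simple to state in code. This illustrative helper (`needs_counting` is mine, not Bayardo's algorithm) captures the key fact: a candidate whose items all lie inside an already-verified frequent itemset is itself frequent, so its support scan can be skipped.

```python
def needs_counting(candidate, known_frequent):
    """MaxMiner-style superset pruning sketch: skip counting any candidate
    that is a subset of a known frequent itemset (every subset of a
    frequent itemset is frequent, by the Apriori property)."""
    return not any(set(candidate) <= set(m) for m in known_frequent)

# BCDE was found frequent in the 2nd scan, so BCD, BDE, CDE need no later check
print(needs_counting("BCD", ["BCDE"]))   # False: covered by BCDE
print(needs_counting("ACD", ["BCDE"]))   # True: still needs a scan
```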
Mining Frequent Closed Patterns: CLOSET
Visualization of Association Rules: Plane Graph
Visualization of Association Rules: Rule Graph
Visualization of Association Rules
(SGI/MineSet 3.0)
Chapter 5: Mining Frequent Patterns,
Association and Correlations
■ Basic concepts and a road map
■ Efficient and scalable frequent itemset mining
methods
■ Mining various kinds of association rules
■ From association mining to correlation analysis
■ Constraint-based association mining
■ Summary
Mining Various Kinds of Association Rules
Mining Multiple-Level Association Rules
Mining Multi-Dimensional Association
■ Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
■ Multi-dimensional rules: ≥ 2 dimensions or predicates
■ Inter-dimension assoc. rules (no repeated predicates)
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
■ Hybrid-dimension assoc. rules (repeated predicates)
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
■ Categorical Attributes: finite number of possible values, no
ordering among values—data cube approach
■ Quantitative Attributes: numeric, implicit ordering among
values—discretization, clustering, and gradient approaches
Mining Quantitative Associations
Interestingness Measure: Correlations (Lift)
■ play basketball ⇒ eat cereal [40%, 66.7%] is misleading
■ The overall % of students eating cereal is 75% > 66.7%.
■ play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
■ Measure of dependent/correlated events: lift(A, B) = P(A ∧ B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal          2000           1750          3750
Not cereal      1000            250          1250
Sum (col.)      3000           2000          5000

lift(Basketball, Cereal) = (2000/5000) / ((3000/5000)(3750/5000)) = 0.89
lift(Basketball, ¬Cereal) = (1000/5000) / ((3000/5000)(1250/5000)) = 1.33
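The lift computation can be checked numerically, using the counts implied by the slide's percentages for a population of 5000 students (40% support, 66.7% confidence, 75% overall cereal eaters).

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)): > 1 means positive
    correlation, < 1 negative correlation, = 1 independence."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

# 5000 students: 3000 play basketball, 3750 eat cereal, 2000 do both,
# and 1000 play basketball but eat no cereal (1250 eat no cereal)
print(round(lift(2000, 3000, 3750, 5000), 2))  # 0.89 -> negatively correlated
print(round(lift(1000, 3000, 1250, 5000), 2))  # 1.33 -> positively correlated
```

This confirms the slide's point: the rule with higher support and confidence (basketball ⇒ cereal) is actually the negatively correlated one.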
Are lift and χ2 Good Measures of Correlation?
■ “Buy walnuts ⇒ buy milk [1%, 80%]” is misleading
■ if 85% of customers buy milk
■ Support and confidence are not good to represent correlations
■ So many interestingness measures? (Tan, Kumar, Srivastava @KDD’02)
             Milk     No Milk   Sum (row)
Coffee       m, c     ~m, c     c
No Coffee    m, ~c    ~m, ~c    ~c
Sum (col.)   m        ~m        Σ
Constraint-based Association Mining
■ Rule constraint: e.g., small sales (price < $10) triggers big sales (sum > $200)
■ Interestingness constraint: e.g., strong rules (min_support, min_confidence thresholds)
■ Constrained mining vs. query processing in DBMS: analogous to pushing selections deeply in query processing
Anti-Monotonicity in Constraint Pushing
TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

■ Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
■ Succinctness:
■ Given A1, the set of items satisfying a succinctness
constraint C, then any set S satisfying C is based on
A1 , i.e., S contains a subset belonging to A1
■ Idea: Without looking at the transaction database,
whether an itemset S satisfies constraint C can be
determined based on the selection of items
■ min(S.Price) ≤ v is succinct
■ sum(S.Price) ≥ v is not succinct
■ Optimization: If C is succinct, C is pre-counting pushable
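Pushing an anti-monotone constraint deep can be sketched on a toy level-wise miner. This is an illustrative simplification (names like `apriori_with_antimonotone`, and the hypothetical price table, are mine): candidates violating the constraint are discarded at generation time, since every superset would violate it too.

```python
from itertools import combinations

def apriori_with_antimonotone(db, min_sup, constraint):
    """Level-wise mining with an anti-monotone constraint pushed into
    candidate generation: a violating itemset is dropped immediately,
    because all of its supersets must also violate the constraint."""
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items if constraint(frozenset([i]))]
    frequent, k = {}, 1
    while level:
        counts = {c: sum(1 for t in db if c <= set(t)) for c in level}
        survivors = [c for c, n in counts.items() if n >= min_sup]
        frequent.update((c, counts[c]) for c in survivors)
        k += 1
        level = [a | b for a, b in combinations(survivors, 2)
                 if len(a | b) == k and constraint(a | b)]  # prune here
    return frequent

price = {"a": 1, "b": 2, "c": 3, "d": 4}       # hypothetical price table
db = [["a", "b", "c"], ["a", "b"], ["a", "c", "d"], ["b", "c"]]
# Anti-monotone constraint sum(S.price) < 5, checked before any counting
res = apriori_with_antimonotone(db, 2, lambda s: sum(price[i] for i in s) < 5)
```

Note that {b, c} is frequent in this toy DB but is never counted: its price sum (5) already violates the constraint, so it is pruned before the scan.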
The Apriori Algorithm — Example
[Figure: Apriori trace on database D—Scan D for C1 → L1, generate and scan for C2 → L2, then C3 → L3.]
Naïve Algorithm: Apriori + Constraint
[Figure: the same Apriori trace on database D, with the constraint sum(S.price) < 5 checked only after mining—the naïve approach.]
The Constrained Apriori Algorithm: Push
an Anti-monotone Constraint Deep
[Figure: the Apriori trace on database D with the anti-monotone constraint sum(S.price) < 5 pushed into candidate generation, pruning candidates before counting.]
The Constrained Apriori Algorithm: Push a
Succinct Constraint Deep
[Figure: the Apriori trace on database D with the succinct constraint min(S.price) ≤ 1 pushed into the selection of items—constraint-satisfying candidates are enumerated directly, though not all are immediately used.]
Converting “Tough” Constraints
■ Convert tough constraints into anti-monotone or monotone by properly ordering items
■ Examine C: avg(S.profit) ≥ 25
■ Order items in value-descending order: <a, f, g, d, b, h, c, e>
■ If an itemset afb violates C, so does afbh and afb*
■ It becomes anti-monotone!

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
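The conversion argument can be verified directly with the slide's profit table: under value-descending order, any item appended to a set has a value no larger than the current minimum, so once avg(S.profit) drops below 25 it can never recover. The helper name `violates_forever` is illustrative.

```python
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
order = sorted(profit, key=profit.get, reverse=True)  # <a, f, g, d, b, h, c, e>

def violates_forever(itemset):
    """With items explored in value-descending order, avg(S.profit) >= 25
    behaves anti-monotonically: a set whose average is already below 25
    can only extend with smaller-valued items, so the average never rises
    back above the threshold."""
    return sum(profit[i] for i in itemset) / len(itemset) < 25

print(order)                               # the slide's ordering
print(violates_forever(["a", "f", "b"]))   # True: afb violates, so afb* pruned
print(violates_forever(["a", "f"]))        # False: avg = 35 >= 25
```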
Strongly Convertible Constraints
■ A strongly convertible constraint (e.g., avg(X) ≥ 25) can be pushed deep into the frequent-pattern growth framework!

[Figure: item-value table fragment—f: 30, g: 20, h: -10.]
Mining With Convertible Constraints
■ C: avg(X) ≥ 25, min_sup = 2

[Figure: item-value table beginning a: 40, with mining proceeding over value-ordered transactions.]
Handling Multiple Constraints
Constraint                      Antimonotone   Monotone   Succinct
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      yes            no         no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      no             yes        no
range(S) ≤ v                    yes            no         no
range(S) ≥ v                    no             yes        no
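The table's first two rows can be checked empirically on a small universe. This brute-force checker (an illustrative helper, not from the book) tests the defining property of anti-monotonicity: whenever a superset satisfies the constraint, every subset must too.

```python
from itertools import combinations

def is_antimonotone(constraint, items):
    """Brute-force check over all subsets of a small item universe:
    anti-monotone means C(superset) implies C(subset), i.e. satisfaction
    never reappears above a violating set."""
    sets = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)]
    return all((not constraint(sup)) or constraint(sub)
               for sub in sets for sup in sets if sub < sup)

price = {"a": 1, "b": 2, "c": 3}               # hypothetical non-negative prices
total = lambda s: sum(price[i] for i in s)
print(is_antimonotone(lambda s: total(s) <= 4, price))  # True: matches the table
print(is_antimonotone(lambda s: total(s) >= 4, price))  # False: sum(S) >= v is not
```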
A Classification of Constraints
[Figure: Venn-style classification—the Antimonotone region contains Convertible anti-monotone, the Monotone region contains Convertible monotone, their overlap is Strongly convertible, and Succinct overlaps both classes.]
Ref: Apriori and Its Improvements
Ref: Mining Multi-Level and Quantitative Rules
Ref: Mining Frequent Patterns in Time-Series Data