DM 2
Unit - II
Association Rule Mining
❑ Mining Frequent Patterns
❑ Associations and correlations
❑ Mining Methods
❑ Mining Various kinds of Association Rules
❑ Correlation Analysis
❑ Constraint based Association mining
❑ Graph Pattern Mining and Sequential Pattern Mining (SPM)
2
Pattern Discovery: Basic Concepts
3
What Is Pattern Discovery?
❑ What are patterns?
❑ Patterns: A set of items, subsequences, or substructures that occur
frequently together (or are strongly correlated) in a data set
❑ Patterns represent intrinsic and important properties of datasets
❑ Pattern discovery: Uncovering patterns from massive data sets
❑ Motivation examples:
❑ What products were often purchased together?
❑ What are the subsequent purchases after buying an iPad?
❑ What code segments likely contain copy-and-paste bugs?
❑ What word sequences likely form phrases in this corpus?
4
Basic Concepts: k-Itemsets and Their Supports
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
❑ Itemset: A set of one or more items
❑ k-itemset: X = {x1, …, xk}
  ❑ Ex. {Beer, Nuts, Diaper} is a 3-itemset
❑ (absolute) support (count) of X, sup{X}: the frequency, i.e., the number of occurrences, of itemset X
  ❑ Ex. sup{Beer} = 3
  ❑ Ex. sup{Diaper} = 4
  ❑ Ex. sup{Beer, Diaper} = 3
  ❑ Ex. sup{Beer, Eggs} = 1
❑ (relative) support, s{X}: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
  ❑ Ex. s{Beer} = 3/5 = 60%
  ❑ Ex. s{Diaper} = 4/5 = 80%
  ❑ Ex. s{Beer, Eggs} = 1/5 = 20%
5
Basic Concepts: Frequent Itemsets (Patterns)
❑ An itemset (or a pattern) X is frequent if the support of X is no less than a minsup threshold σ (illustrated on the same transaction table as above)
6
From Frequent Itemsets to Association Rules
❑ Compared with itemsets, rules can be more telling
  ❑ Ex. Diaper → Beer
    ❑ Buying diapers may likely lead to buying beer
    ❑ How strong is this rule? (support, confidence)
❑ Measuring association rules: X → Y (s, c)
  ❑ Both X and Y are itemsets
  ❑ Support, s: the probability that a transaction contains X ∪ Y
    ❑ Ex. s{Diaper, Beer} = 3/5 = 0.6 (i.e., 60%)
  ❑ Confidence, c: the conditional probability that a transaction containing X also contains Y
    ❑ Calculation: c = sup(X ∪ Y) / sup(X)
    ❑ Note: X ∪ Y is the union of the two itemsets; a transaction supporting it contains both X and Y
    ❑ Ex. c = sup{Diaper, Beer} / sup{Diaper} = 3/4 = 0.75
(In the transaction table above, 4 transactions contain {Diaper}, 3 contain {Beer}, and 3 contain both; a short computational sketch of s and c follows below.)
7
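A quick computational check of the support and confidence numbers above (a minimal Python sketch; the function names sup, s, and confidence are mine, not from the slides):

TDB = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def sup(itemset):
    # Absolute support: number of transactions containing the itemset
    return sum(1 for t in TDB if itemset <= t)

def s(itemset):
    # Relative support: fraction of transactions containing the itemset
    return sup(itemset) / len(TDB)

def confidence(X, Y):
    # Confidence of the rule X -> Y: sup(X ∪ Y) / sup(X)
    return sup(X | Y) / sup(X)

print(sup({"Beer", "Diaper"}))            # 3
print(s({"Diaper"}))                      # 0.8
print(confidence({"Diaper"}, {"Beer"}))   # 0.75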
Mining Frequent Itemsets and Association Rules
❑ Association rule mining
  ❑ Given two thresholds: minsup, minconf
  ❑ Find all association rules X → Y whose support s ≥ minsup and confidence c ≥ minconf
9
Apriori Pruning and Scalable Mining Methods
❑ Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not even be generated! (Agrawal &
Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
❑Scalable mining Methods: Three major approaches
❑ Level-wise, join-based approach: Apriori (Agrawal &
Srikant@VLDB’94)
❑ Vertical data format approach: Eclat (Zaki, Parthasarathy,
Ogihara, Li @KDD’97)
❑ Frequent pattern projection and growth: FPgrowth (Han, Pei,
Yin @SIGMOD’00)
10
The Apriori Algorithm—An Example
minsup = 2
Database TDB:
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E
1st scan → C1 with counts: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
F1 (frequent 1-itemsets): {A}: 2, {B}: 3, {C}: 3, {E}: 3
C2 (candidates generated from F1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
F2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
C3 (candidate from F2): {B, C, E}; a 3rd scan gives sup{B, C, E} = 2, so F3: {B, C, E}: 2
11
The Downward Closure Property of Frequent Patterns
❑ Observation: From TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
❑ We get a frequent itemset: {a1, …, a50}
❑ Also, its subsets are all frequent: {a1}, {a2}, …, {a50}, {a1, a2}, …, {a1, …, a49}, …
❑ There must be some hidden relationships among frequent patterns!
❑ The downward closure (also called “Apriori”) property of frequent patterns
❑ If {beer, diaper, nuts} is frequent, so is {beer, diaper}
❑ Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
❑ Apriori: Any subset of a frequent itemset must be frequent
❑ Efficient mining methodology
❑ If any subset of an itemset S is infrequent, then there is no chance for S to
be frequent—why do we even have to consider S!? A sharp knife for pruning!
12
Apriori: A Candidate Generation & Test Approach
❑ Outline of Apriori (level-wise, candidate generation and test)
❑ Initially, scan DB once to get frequent 1-itemset
❑ Repeat
❑ Generate length-(k+1) candidate itemsets from length-k frequent
itemsets
❑ Test the candidates against DB to find frequent (k+1)-itemsets
❑ Set k := k +1
❑ Until no frequent or candidate set can be generated
❑ Return all the frequent itemsets derived
13
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Fk: Frequent itemset of size k
k := 1;
Fk := {frequent items}; // frequent 1-itemsets
While (Fk != ∅) do { // while Fk is non-empty
  Ck+1 := candidates generated from Fk; // candidate generation
  Derive Fk+1 by counting candidates in Ck+1 with respect to TDB at minsup;
  k := k + 1
}
return ∪k Fk // return the Fk generated at each level
14
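The pseudo-code above maps directly onto a small program. The following is a minimal, illustrative Python sketch of the level-wise loop, including the self-join candidate generation and subset pruning detailed on the next slide; the names apriori and support_count are mine, not from the slides.

from itertools import combinations

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
minsup = 2   # absolute support threshold, as in the example above

def support_count(candidates, db):
    # Count how many transactions contain each candidate itemset
    return {c: sum(1 for t in db if c <= t) for c in candidates}

def apriori(db, minsup):
    items = {frozenset([x]) for t in db for x in t}
    counts = support_count(items, db)
    Fk = {c for c, n in counts.items() if n >= minsup}   # frequent 1-itemsets
    frequent = {c: counts[c] for c in Fk}
    k = 1
    while Fk:
        # Candidate generation: self-join Fk, then prune by downward closure
        # (every k-subset of a (k+1)-candidate must itself be frequent)
        Ck1 = set()
        for p in Fk:
            for q in Fk:
                u = p | q
                if len(u) == k + 1 and all(frozenset(s) in Fk
                                           for s in combinations(u, k)):
                    Ck1.add(u)
        counts = support_count(Ck1, db)
        Fk = {c for c, n in counts.items() if n >= minsup}
        frequent.update({c: counts[c] for c in Fk})
        k += 1
    return frequent

print(apriori(TDB, minsup))   # includes {B, C, E} with support 2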
Candidate Generation: An SQL Implementation
❑ Suppose the items in Fk-1 are listed in an order
❑ Step 1: self-joining Fk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Fk-1 as p, Fk-1 as q
  where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
❑ Step 2: pruning
  for all itemsets c in Ck do
    for all (k-1)-subsets s of c do
      if (s is not in Fk-1) then delete c from Ck
❑ Ex.: self-joining F3 = {abc, abd, acd, ace, bcd} produces the candidates abcd (from abc and abd) and acde (from acd and ace); acde is then pruned because its subset ade (and cde) is not in F3, leaving C4 = {abcd}
15
Apriori: Improvements and Alternatives
❑ Reduce passes of transaction database scans
  ❑ Partitioning (e.g., Savasere, et al., 1995) (to be discussed in subsequent slides)
  ❑ Dynamic itemset counting (Brin, et al., 1997)
❑ Shrink the number of candidates
  ❑ Hashing (e.g., DHP: Park, et al., 1995) (to be discussed in subsequent slides)
  ❑ Pruning by support lower bounding (e.g., Bayardo, 1998)
  ❑ Sampling (e.g., Toivonen, 1996)
❑ Exploring special data structures
  ❑ Tree projection (Agarwal, et al., 2001)
  ❑ H-Miner (Pei, et al., 2001)
  ❑ Hypercube decomposition (e.g., LCM: Uno, et al., 2004)
16
Partitioning: Scan Database Only Twice
❑ Theorem: Any itemset that is potentially frequent in TDB must be frequent in at least
one of the partitions of TDB
18
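One hedged reading of the two-scan partitioning idea in code (all names are mine; local mining is done here by brute-force subset enumeration purely for illustration, whereas a real implementation would run Apriori or FP-growth inside each partition):

from itertools import combinations

def local_frequent(partition, local_minsup):
    # All itemsets frequent within one partition (brute force; tiny data only)
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for c in combinations(sorted(t), k):
                counts[frozenset(c)] = counts.get(frozenset(c), 0) + 1
    return {c for c, cnt in counts.items() if cnt >= local_minsup}

def partition_mine(db, minsup, num_parts=2):
    n = len(db)
    size = (n + num_parts - 1) // num_parts
    # Scan 1: mine each partition with a proportionally scaled local threshold
    # (rounding down never loses a globally frequent itemset, by the theorem)
    candidates = set()
    for i in range(0, n, size):
        part = db[i:i + size]
        local_minsup = max(1, minsup * len(part) // n)
        candidates |= local_frequent(part, local_minsup)
    # Scan 2: count the global support of the combined local candidates
    counts = {c: sum(1 for t in db if c <= t) for c in candidates}
    return {c: cnt for c, cnt in counts.items() if cnt >= minsup}

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mine(db, minsup=2))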
Why Mining Frequent Patterns by Pattern Growth?
❑ Apriori: A breadth-first search mining algorithm
❑ First find the complete set of frequent k-itemsets
❑ Then derive frequent (k+1)-itemset candidates
❑ Scan DB again to find true frequent (k+1)-itemsets
❑ Motivation for a different mining methodology
❑ Can we develop a depth-first search mining algorithm?
❑ For a frequent itemset ρ, can subsequent search be confined to only those transactions that contain ρ?
❑ Such thinking leads to a frequent pattern growth approach:
❑ FPGrowth (J. Han, J. Pei, Y. Yin, “Mining Frequent Patterns
without Candidate Generation,” SIGMOD 2000)
19
Example: Construct FP-tree from a Transaction DB
TID Items in the Transaction → Ordered, frequent itemlist
100 {f, a, c, d, g, i, m, p} → f, c, a, m, p
200 {a, b, c, f, l, m, o} → f, c, a, b, m
300 {b, f, h, j, o, w} → f, b
400 {b, c, k, s, p} → c, b, p
500 {a, f, c, e, l, p, m, n} → f, c, a, m, p
Let min_support = 3
1. Scan DB once, find the single-item frequent patterns: f:4, a:3, c:4, b:3, m:3, p:3
2. Sort the frequent items in frequency-descending order into the F-list: f-c-a-b-m-p
3. Scan DB again and construct the FP-tree
  ❑ The ordered frequent itemlist of each transaction is inserted as a branch, with shared prefix sub-branches merged and their counts accumulated
  ❑ A header table records each frequent item's frequency and the head of its node-link list into the tree (f:4, c:4, a:3, b:3, m:3, p:3)
❑ After inserting the 1st frequent itemlist "f, c, a, m, p", the tree is a single path under the root {}: f:1 → c:1 → a:1 → m:1 → p:1
20
Example: Construct FP-tree from a Transaction DB
(Same transaction table, F-list, and construction steps as on the previous slide.)
❑ After inserting the 2nd frequent itemlist "f, c, a, b, m": the shared prefix accumulates to f:2 → c:2 → a:2; a new sub-branch b:1 → m:1 is created under a, while the first branch keeps m:1 → p:1
21
Example: Construct FP-tree from a Transaction DB
(Same transaction table, F-list, and construction steps as on the previous slide.)
❑ After inserting all five frequent itemlists, the final FP-tree has the branches:
  ❑ f:4 → c:3 → a:3 → m:2 → p:2 (transactions 100 and 500)
  ❑ b:1 → m:1 under a:3 (transaction 200)
  ❑ f:4 → b:1 (transaction 300)
  ❑ c:1 → b:1 → p:1 directly under the root (transaction 400)
❑ Header table: f:4, c:4, a:3, b:3, m:3, p:3, each item linking to its node occurrences in the tree
(A construction sketch in code follows below.)
22
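A compact sketch of the two-scan FP-tree construction just described (class and helper names are mine; ties in item frequency are broken arbitrarily here, so the item order may differ slightly from the slide's F-list f-c-a-b-m-p without changing the supports involved):

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_support):
    # Scan 1: count single items and keep the frequent ones
    freq = Counter(item for t in db for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    # F-list: frequent items in support-descending order
    rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: -freq[i]))}
    root = Node(None, None)
    header = defaultdict(list)          # item -> node-links into the tree
    # Scan 2: insert each transaction's ordered frequent itemlist as a branch
    for t in db:
        itemlist = sorted((i for i in t if i in freq), key=lambda i: rank[i])
        node = root
        for item in itemlist:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1             # shared prefixes accumulate counts
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, min_support=3)
print([(n.item, n.count) for n in header["p"]])   # [('p', 2), ('p', 1)]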
Mining FP-Tree: Divide and Conquer
Based on Patterns and Data
❑ Pattern mining can be partitioned according to current patterns
❑ Patterns containing p: p’s conditional database: fcam:2, cb:1
❑ p’s conditional database (i.e., the database under the condition that p exists):
❑ transformed prefix paths of item p
❑ Patterns having m but no p: m’s conditional database: fca:2, fcab:1
❑ … (and similarly for the remaining items in the F-list)
❑ min_support = 3; each conditional database is itself mined recursively in the same divide-and-conquer manner
25
Expressing Patterns in Compressed Form: Max-Patterns
❑ Solution 2: Max-patterns: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
❑ Difference from closed patterns?
  ❑ We do not care about the real support of the sub-patterns of a max-pattern
❑ Let Transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
❑ Suppose minsup = 1. How many max-patterns does TDB1 contain?
❑ One: P: “{a1, …, a100}: 1”
❑ Max-pattern is a lossy compression!
❑ We only know {a1, …, a40} is frequent
❑ But we do not know the real support of {a1, …, a40}, …, any more!
❑ Thus in many applications, mining closed patterns is more desirable than mining max-patterns
26
Mining Multiple-Level Frequent Patterns
❑ Items often form hierarchies
  ❑ Ex.: Dairyland 2% milk; Wonder wheat bread
❑ How to set min-support thresholds?
  ❑ Uniform min-support across multiple levels (reasonable?)
  ❑ Level-reduced min-support: items at the lower level are expected to have lower support
❑ Ex.: Milk [support = 10%] at level 1; 2% Milk [support = 6%] and Skim Milk [support = 2%] at level 2
  ❑ Uniform support: min_sup = 5% at both levels (so Skim Milk would be missed)
  ❑ Reduced support: min_sup = 5% at level 1 and min_sup = 1% at level 2
❑ Efficient mining: shared multi-level mining
  ❑ Use the lowest min-support to pass down the set of candidates
28
Redundancy Filtering at Mining Multi-Level
Associations
❑ Multi-level association mining may generate many redundant rules
❑ Redundancy filtering: Some rules may be redundant due to “ancestor”
relationships between items
❑ milk → wheat bread [support = 8%, confidence = 70%] (1)
❑ 2% milk → wheat bread [support = 2%, confidence = 72%] (2)
❑ Suppose the 2% milk sold is about ¼ of milk sold in gallons
❑ (2) should be able to be “derived” from (1)
❑ A rule is redundant if its support is close to the “expected” value, according
to its “ancestor” rule, and it has a similar confidence as its “ancestor”
❑ Rule (1) is an ancestor of rule (2), which one to prune?
29
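A quick check of the numbers above, under the stated assumption that 2% milk is about one quarter of all milk sold: the expected support of rule (2) is 8% × 1/4 = 2%, which equals its observed support, and its confidence (72%) is close to that of rule (1) (70%). Rule (2) therefore adds nothing beyond its ancestor, so rule (2) is the one to prune and the more general rule (1) is kept.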
Customized Min-Supports for Different Kinds of Items
❑ We have used the same min-support threshold for all the items or item sets
to be mined in each association mining
❑ In reality, some items (e.g., diamond, watch, …) are valuable but less
frequent
❑ It is necessary to have customized min-support settings for different kinds of
items
❑ One Method: Use group-based “individualized” min-support
❑ E.g., {diamond, watch}: 0.05%; {bread, milk}: 5%; …
❑ How to mine such rules efficiently?
❑ Existing scalable mining algorithms can be easily extended to cover such
cases
30
Mining Multi-Dimensional Associations
❑ Single-dimensional rules (e.g., items are all in “product” dimension)
❑ buys(X, “milk”) ⇒ buys(X, “bread”)
❑ Multi-dimensional rules (i.e., items in 2 dimensions or predicates)
❑ Inter-dimension association rules (no repeated predicates)
❑ age(X, “18-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
❑ Hybrid-dimension association rules (repeated predicates)
❑ age(X, “18-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
❑ Attributes can be categorical or numerical
❑ Categorical Attributes (e.g., profession, product: no ordering among
values): Data cube for inter-dimension association
❑ Quantitative Attributes: Numeric, implicit ordering among values—
discretization, clustering, and gradient approaches
31
Mining Quantitative Associations
❑ Mining associations with numerical attributes
❑ Ex.: Numerical attributes: age and salary
❑ Methods
❑ Static discretization based on predefined concept hierarchies
❑ Discretization on each dimension with hierarchy
❑ age: {0-10, 10-20, …, 90-100} → {young, mid-aged, old}
❑ Dynamic discretization based on data distribution
❑ Clustering: Distance-based association
❑ First one-dimensional clustering, then association
❑ Deviation analysis:
❑ Gender = female ⇒ Wage: mean = $7/hr (overall mean = $9)
32
Mining Extraordinary Phenomena in Quantitative
Association Mining
❑ Mining extraordinary (i.e., interesting) phenomena
❑ Ex.: Gender = female ⇒ Wage: mean = $7/hr (overall mean = $9)
❑ LHS: a subset of the population
❑ RHS: an extraordinary behavior of this subset
❑ The rule is accepted only if a statistical test (e.g., Z-test) confirms the
inference with high confidence
❑ Subrule: Highlights the extraordinary behavior of a subset of the
population of the super rule
❑ Ex.: (Gender = female) ∧ (South = yes) ⇒ mean wage = $6.3/hr
❑ Rule condition can be categorical or numerical (quantitative rules)
❑ Ex.: Education in [14-18] (yrs) ⇒ mean wage = $11.64/hr
❑ Efficient methods have been developed for mining such extraordinary
rules (e.g., Aumann and Lindell@KDD’99)
33
Rare Patterns vs. Negative Patterns
❑ Rare patterns
❑ Very low support but interesting (e.g., buying Rolex watches)
❑ How to mine them? Setting individualized, group-based min-support
thresholds for different groups of items
❑ Negative patterns
❑ Negatively correlated: Unlikely to happen together
❑ Ex.: Since it is unlikely that the same customer buys both a Ford
Expedition (an SUV car) and a Ford Fusion (a hybrid car), buying a
Ford Expedition and buying a Ford Fusion are likely negatively
correlated patterns
❑ How to define negative patterns?
34
Defining Negatively Correlated Patterns
❑ A support-based definition
❑ If itemsets A and B are both frequent but rarely occur together, i.e.,
  s(A ∪ B) << s(A) × s(B)
❑ Then A and B are negatively correlated
❑ Is this a good definition for large transaction datasets?
❑ Ex.: Suppose a store sold two needle packages A and B 100 times each,
but only one transaction contained both A and B
❑ When there are in total 200 transactions, we have
  ❑ s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) << s(A) × s(B)
❑ But when there are 10^5 transactions, we have
  ❑ s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) × s(B)
❑ What is the problem?—Null transactions: The support-based definition
is not null-invariant!
35
Defining Negative Correlation: Need
Null-Invariance in Definition
❑ A good definition on negative correlation should take care of the null-
invariance problem
❑ Whether two itemsets A and B are negatively correlated should not be
influenced by the number of null-transactions
❑ A Kulczynski measure-based definition
  ❑ If itemsets A and B are frequent but
    (s(A ∪ B)/s(A) + s(A ∪ B)/s(B))/2 < ε,
    where ε is a negative-pattern threshold, then A and B are negatively correlated
❑ For the same needle-package problem:
  ❑ Whether there are 200 or 10^5 transactions in total, the measure is (s(A ∪ B)/s(A) + s(A ∪ B)/s(B))/2 = (0.01 + 0.01)/2 = 0.01
  ❑ So for a threshold such as ε = 0.01, the judgment is the same in both cases: the measure is not affected by the number of null transactions
36
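The needle-package numbers can be replayed in a few lines to see why the expectation-based test flips with the database size while the Kulczynski-based measure does not (the function name is mine):

def negative_correlation_tests(n_transactions):
    # A and B are each bought 100 times; exactly one transaction has both
    s_a = s_b = 100 / n_transactions
    s_ab = 1 / n_transactions
    support_based = s_ab < s_a * s_b                 # expectation-based test
    kulczynski = (s_ab / s_a + s_ab / s_b) / 2       # null-invariant measure
    return support_based, kulczynski

for n in (200, 10**5):
    print(n, negative_correlation_tests(n))
# 200    -> (True, 0.01)   both views call A, B negatively correlated
# 100000 -> (False, 0.01)  the support-based test flips; Kulczynski stays 0.01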
Mining Compressed Patterns
❑ Why mine compressed patterns?
  ❑ Too many scattered patterns, which individually are not very meaningful
❑ Pattern distance measure
❑ Example pattern set:
  Pat-ID Item-Sets Support
  P1 {38, 16, 18, 12} 205227
  P2 {38, 16, 18, 12, 17} 205211
  P3 {39, 38, 16, 18, 12, 17} 101758
  P4 {39, 16, 18, 12, 17} 161563
  P5 {39, 16, 18, 12} 161576
❑ Summary
39
Constraint-Based Pattern Mining
❑ Why Constraint-Based Mining?
❑ Different Kinds of Constraints: Different Pruning Strategies
❑ Constrained Mining with Pattern Anti-Monotonicity
❑ Constrained Mining with Pattern Monotonicity
❑ Constrained Mining with Data Anti-Monotonicity
❑ Constrained Mining with Succinct Constraints
❑ Constrained Mining with Convertible Constraints
❑ Handling Multiple Constraints
❑ Constraint-Based Sequential-Pattern Mining
40
Why Constraint-Based Mining?
❑ Finding all the patterns in a dataset autonomously?—unrealistic!
❑ Too many patterns but not necessarily user-interested!
❑ Pattern mining in practice: Often a user-guided, interactive process
❑ User directs what to be mined using a data mining query language (or a
graphical user interface), specifying various kinds of constraints
❑ What is constraint-based mining?
❑ Mine together with user-provided constraints
❑ Why constraint-based mining?
❑ User flexibility: User provides constraints on what to be mined
❑ Optimization: System explores such constraints for mining efficiency
❑ E.g., Push constraints deeply into the mining process
41
Various Kinds of User-Specified Constraints
❑ Knowledge type constraint—Specifying what kinds of knowledge to mine
❑ Ex.: Classification, association, clustering, outlier finding, …
❑ Data constraint—using SQL-like queries
❑ Ex.: Find products sold together in NY stores this year
❑ Dimension/level constraint—similar to projection in relational database
❑ Ex.: In relevance to region, price, brand, customer category
❑ Interestingness constraint—various kinds of thresholds
  ❑ Ex.: Strong rules: min_sup ≥ 0.02, min_conf ≥ 0.6, min_correlation ≥ 0.7
❑ Rule (or pattern) constraint (the focus of this study)
  ❑ Ex.: Small sales (price < $10) triggers big sales (sum > $200)
42
Pattern Space Pruning with Pattern Anti-Monotonicity
TID Transaction
10 a, b, c, d, f, h
20 b, c, d, f, g, h
30 b, c, d, f, g
40 a, c, e, f, g
min_sup = 2
Item Price Profit
a 100 40
b 40 0
c 150 −20
d 35 −15
e 55 −30
f 45 −10
g 80 20
h 10 5
Note: item.price > 0; profit can be negative
◼ A constraint c is anti-monotone: if an itemset S violates constraint c, so does any of its supersets; that is, mining on itemset S can be terminated
◼ Ex. 1: c1: sum(S.price) ≤ v is anti-monotone
◼ Ex. 2: c2: range(S.profit) ≤ 15 is anti-monotone
  ◼ Itemset ab violates c2 (range(ab) = 40)
  ◼ So does every superset of ab
◼ Ex. 3: c3: sum(S.price) ≥ v is not anti-monotone
◼ Ex. 4: Is c4: support(S) ≥ σ anti-monotone?
  ◼ Yes! Apriori pruning is essentially pruning with an anti-monotone constraint
(A pruning sketch in code follows below.)
43
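As referenced above, here is a small illustrative sketch of pushing the anti-monotone constraint c2: range(S.profit) ≤ 15 into a depth-first itemset enumeration; support counting is left out to isolate the pruning step, and all names are mine.

profit = {"a": 40, "b": 0, "c": -20, "d": -15,
          "e": -30, "f": -10, "g": 20, "h": 5}
items = sorted(profit)

def satisfies_c2(itemset):
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals) <= 15      # c2: range(S.profit) <= 15

def enumerate_satisfying(prefix, start, out):
    for idx in range(start, len(items)):
        cand = prefix + [items[idx]]
        if not satisfies_c2(cand):
            # Anti-monotone: no superset of cand can satisfy c2,
            # so the whole subtree rooted at cand is skipped
            continue
        out.append(cand)
        enumerate_satisfying(cand, idx + 1, out)
    return out

result = enumerate_satisfying([], 0, [])
print(len(result), result[:5])
# e.g. ['a'] and ['b', 'd'] satisfy c2, but nothing containing both a and b
# does, because range({a, b}) = 40 already violates the constraint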
Pattern Monotonicity and Its Roles
(Same transaction table and item price/profit table as on the previous slide; min_sup = 2.)
◼ A constraint c is monotone: if an itemset S satisfies the constraint c, so does any of its supersets
  ◼ That is, we do not need to check c again in subsequent mining
◼ Ex. 1: c1: sum(S.price) ≥ v is monotone
◼ Ex. 2: c2: min(S.price) ≤ v is monotone
◼ Ex. 3: c3: range(S.profit) ≥ 15 is monotone
  ◼ Itemset ab satisfies c3
  ◼ So does every superset of ab
44
Data Space Pruning with Data Anti-Monotonicity
(Same transaction table and item price/profit table as above; min_sup = 2.)
❑ A constraint c is data anti-monotone: in the mining process, if a data entry t cannot satisfy a pattern p under c, then t cannot satisfy p's supersets either
  ❑ Data space pruning: data entry t can be pruned
❑ Ex. 1: c1: sum(S.Profit) ≥ v is data anti-monotone
  ❑ Let constraint c1 be sum(S.Profit) ≥ 25
  ❑ T30 = {b, c, d, f, g} can be removed, since no combination of its items can form an S whose profit sum is ≥ 25
❑ Ex. 2: c2: min(S.Price) ≤ v is data anti-monotone
  ❑ Consider v = 5, but every item in a transaction, say T50, has a price higher than 10
❑ Ex. 3: c3: range(S.Profit) > 25 is data anti-monotone
45
Data Space Pruning Should Be Explored Recursively
❑ Example: c3: range(S.Profit) > 25
❑ We check b's projected database:
  TID Transaction (b's projected DB)
  10 a, c, d, f, h
  20 c, d, f, g, h
❑ But item "a" is infrequent in it (sup = 1), so it is pruned by min_sup
❑ Note: c3 prunes T10 effectively only after "a" is pruned in b's projected DB: without a, the remaining items of T10 (c, d, f, h, with profits −20, −15, −10, 5) give a range of at most 25, so T10 cannot support any pattern satisfying c3 and can be dropped
46
Succinctness: Pruning Both Data and Pattern Spaces
◼ Succinctness: If the constraint c can be enforced by directly manipulating the data
◼ Ex. 1: To find those patterns without item i
◼ Remove i from DB and then mine (pattern space pruning)
◼ Ex. 2: To find those patterns containing item i
◼ Mine only i-projected DB (data space pruning)
◼ Ex. 3: c3: min(S.Price) ≤ v is succinct
  ◼ Start with only the items whose price is ≤ v, and remove the transactions that contain only high-price items (pattern + data space pruning)
◼ Ex. 4: c4: sum(S.Price) ≥ v is not succinct
  ◼ It cannot be determined beforehand, since the price sum of an itemset S keeps increasing as S grows
47
Convertible Constraints: Ordering Data in Transactions
TID Transaction
10 a, b, c, d, f, h
20 a, b, c, d, f, g, h
30 b, c, d, f, g
40 a, c, e, f, g
min_sup = 2
Item Price Profit
a 100 40
b 40 0
c 150 −20
d 35 −15
e 55 −30
f 45 −5
g 80 30
h 10 5
◼ Convert tough constraints into (anti-)monotone constraints by a proper ordering of the items in transactions
◼ Examine c1: avg(S.profit) > 20
  ◼ Order the items in profit-value-descending order: <a, g, f, b, h, d, c, e>
  ◼ The itemset ab violates c1 (avg(ab) = 20)
  ◼ So does ab* (i.e., any pattern grown from ab in the ab-projected DB)
  ◼ c1 becomes anti-monotone if patterns grow in the right order!
◼ Can item reordering work for Apriori?
  ◼ Level-wise candidate generation requires multi-way checking!
  ◼ avg(agf) = 21.7 > 20, but avg(gf) = 12.5 < 20
  ◼ So Apriori would not generate "agf" as a candidate
48
Advanced Frequent Pattern Mining
❑ Mining Diverse Patterns
❑ Summary
49
What Is Graph Pattern Mining?
❑ Chem-informatics:
❑ Mining frequent chemical compound structures
50
Frequent (Sub)Graph Patterns
❑ Given a labeled graph dataset D = {G1, G2, …, Gn}, the supporting graph set of a subgraph g is Dg = {Gi | g ⊆ Gi, Gi ∈ D}
  ❑ support(g) = |Dg| / |D|
❑ A (sub)graph g is frequent if support(g) ≥ min_sup
❑ Ex.: Chemical structures: with min_sup = 2, the example graph dataset of three compounds (A), (B), (C) yields two frequent graph patterns (1) and (2), with support = 67% (2 of the 3 graphs)
❑ Alternative: mining frequent subgraph patterns from a single large graph or network
51
Applications of Graph Pattern Mining
❑ Bioinformatics
❑ Gene networks, protein interactions, metabolic pathways
❑ Chem-informatics: Mining chemical compound structures
❑ Social networks, web communities, tweets, …
❑ Cell phone networks, computer networks, …
❑ Web graphs, XML structures, Semantic Web, information networks
❑ Software engineering: Program execution flow analysis
❑ Building blocks for graph classification, clustering, compression, comparison,
and correlation analysis
❑ Graph indexing and graph similarity search
52
Graph Pattern Mining Algorithms: Different
Methodologies
❑ Generation of candidate subgraphs
❑ Apriori vs. pattern growth (e.g., FSG vs. gSpan)
❑ Search order
❑ Breadth vs. depth
❑ Elimination of duplicate subgraphs
❑ Passive vs. active (e.g., gSpan [Yan & Han, 2002])
❑ Support calculation
❑ Store embeddings (e.g., GASTON [Nijssen & Kok, 2004], FFSM [Huan, Wang,
& Prins, 2003], MoFa [Borgelt & Berthold, ICDM’02])
❑ Order of pattern discovery
❑ Path → tree → graph (e.g., GASTON [Nijssen & Kok, 2004])
53
Apriori-Based Approach
❑ The Apriori property (anti-monotonicity): a (k+1)-edge subgraph can be frequent only if all of its k-edge subgraphs are frequent
❑ A candidate size-(k+1) edge/vertex subgraph is generated by joining two frequent k-edge/vertex subgraphs
❑ Iterative mining process:
  ❑ candidate generation → candidate pruning → support counting → candidate elimination
54
Candidate Generation:
Vertex Growing vs. Edge Growing
❑ Methodology: breadth-first search, Apriori-style joining of two size-k graphs
❑ Many possibilities for generating size-(k+1) candidate graphs
56
gSpan: Graph Pattern Growth in Order
❑ Right-most path extension in subgraph pattern growth
  ❑ Right-most path: the path from the root to the right-most leaf (choosing the vertex with the smallest index at each step)
  ❑ Reduces the generation of duplicate subgraphs
❑ Example DFS code of a small 5-vertex graph: e0 = (0,1), e1 = (1,2), e2 = (2,3), e3 = (3,0), e4 = (2,4)
57
Why Mine Closed Graph Patterns?
❑ Challenge: An n-edge frequent graph may have 2^n subgraphs
❑ Motivation: Explore closed frequent subgraphs to handle graph
pattern explosion problem
❑ A frequent graph G is closed if there exists no supergraph of G that
carries the same support as G
If this subgraph is closed in the
graph dataset, it implies that
none of its frequent super-graphs
carries the same support
58
CloseGraph: Directly Mining Closed Graph Patterns
❑ CloseGraph: Mining closed graph patterns by extending gSpan (Yan & Han, KDD’03)
(Performance plots over minimum support values from 0.05 to 0.1 omitted.)
❑ Summary
61
Sequential Pattern Mining
❑ Sequential Pattern and Sequential Pattern Mining
62
Sequence Databases & Sequential Patterns
❑ Sequential pattern mining has broad applications
❑ Customer shopping sequences
❑ Purchase a laptop first, then a digital camera, and then a smartphone,
within 6 months
❑ Medical treatments, natural disasters (e.g., earthquakes), science &
engineering processes, stocks and markets, ...
❑ Weblog click streams, calling patterns, …
❑ Software engineering: Program execution sequences, …
❑ Biological sequences: DNA, protein, …
❑ Transaction DB, sequence DB vs. time-series DB
❑ Gapped vs. non-gapped sequential patterns
❑ Shopping sequences, clicking streams vs. biological sequences
63
Sequential Pattern and Sequential Pattern Mining
❑ Sequential pattern mining: Given a set of sequences, find the complete set of
frequent subsequences (i.e., satisfying the min_sup threshold)
64
Sequential Pattern Mining Algorithms
❑ Algorithm requirement: Efficient, scalable, finding complete set, incorporating
various kinds of user-specific constraints
❑ The Apriori property still holds: if a subsequence s1 is infrequent, none of s1's super-sequences can be frequent
❑ Representative algorithms
  ❑ GSP (Generalized Sequential Patterns): Srikant & Agrawal @ EDBT'96
  ❑ Vertical format-based mining: SPADE (Zaki @ Machine Learning'01)
❑ Pattern-growth methods: PrefixSpan (Pei, et al. @TKDE’04)
❑ Mining closed sequential patterns: CloSpan (Yan, et al. @SDM’03)
❑ Constraint-based sequential pattern mining (to be covered in the constraint
mining section)
65
GSP: Apriori-Based Sequential Pattern Mining
SID Sequence (min_sup = 2)
10 <(bd)cb(ac)>
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>
❑ Initial candidates: all 8 singleton sequences
  ❑ <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
❑ Scan DB once, count support for each candidate:
  ❑ <a>: 3, <b>: 5, <c>: 4, <d>: 3, <e>: 3, <f>: 2, <g>: 1, <h>: 1
❑ Generate length-2 candidate sequences from the 6 frequent singletons <a>, …, <f>
  ❑ Two-element sequences <xy>: 6 × 6 = 36 candidates (<aa>, <ab>, …, <ff>)
  ❑ One-element, two-item sequences <(xy)>: 6 × 5 / 2 = 15 candidates (<(ab)>, <(ac)>, …, <(ef)>)
❑ Without Apriori pruning (all 8 singletons): 8 × 8 + 8 × 7 / 2 = 92 length-2 candidates
❑ With pruning, length-2 candidates: 36 + 15 = 51
GSP (Generalized Sequential Patterns): Srikant & Agrawal @ EDBT'96
66
GSP Mining and Pruning
❑ 5th scan: 1 candidate, 1 length-5 sequential pattern: <(bd)cba> (candidates that cannot pass the min_sup threshold are pruned)
❑ 4th scan: 8 candidates, 7 length-4 sequential patterns, e.g., <abba>, <(bd)bc>, … (candidates that do not appear in the DB are dropped)
68
PrefixSpan: A Pattern-Growth Approach
min_sup = 2
SID Sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
❑ Prefix and suffix
  ❑ Given the sequence <a(abc)(ac)d(cf)>
  ❑ Its prefixes include <a>, <aa>, <a(ab)>, <a(abc)>, …
  ❑ A suffix is the prefix-based projection: e.g., the suffix w.r.t. <a> is <(abc)(ac)d(cf)>, w.r.t. <aa> it is <(_bc)(ac)d(cf)>, and w.r.t. <ab> it is <(_c)(ac)d(cf)>
❑ PrefixSpan mining: prefix projections
  ❑ Step 1: Find the length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  ❑ Step 2: Divide the search space and mine each projected DB
    ❑ <a>-projected DB, <b>-projected DB, …, <f>-projected DB
PrefixSpan (Prefix-projected Sequential pattern mining): Pei, et al. @ TKDE'04
69
PrefixSpan: Mining Prefix-Projected DBs
SID Sequence (min_sup = 2)
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
❑ Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
❑ Partition the search space by prefix: <a>-projected DB, <b>-projected DB, …, <f>-projected DB
  ❑ <a>-projected DB: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  ❑ Mining it yields the length-2 sequential patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>; their projected DBs (<aa>-projected DB, …, <af>-projected DB) are then mined recursively
❑ Major strength of PrefixSpan:
  ◼ No candidate subsequences need to be generated
  ◼ Projected DBs keep shrinking
(A simplified code sketch follows below.)
70
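As referenced above, a deliberately simplified PrefixSpan sketch: each sequence is flattened to single items, so multi-item elements such as (abc) and the '_'-placeholder projections are not modeled, and all names are mine.

def prefixspan(db, min_sup, prefix=None, patterns=None):
    prefix = prefix or []
    patterns = patterns if patterns is not None else []
    # Count, in the current (projected) DB, the items that can extend prefix
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        new_prefix = prefix + [item]
        patterns.append(("".join(new_prefix), sup))
        # Project: keep the suffix after the first occurrence of item
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        projected = [suf for suf in projected if suf]
        prefixspan(projected, min_sup, new_prefix, patterns)
    return patterns

# Flattened versions of the four example sequences
db = ["aabcacdcf", "adcbcae", "efabdfcb", "egafcbc"]
for pattern, sup in prefixspan(db, min_sup=2)[:10]:
    print(pattern, sup)   # e.g. a 4, b 4, ... (iteration order may vary)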
Implementation Consideration: Pseudo-Projection vs.
Physical Projection
❑ Major cost of PrefixSpan: constructing the projected DBs
  ❑ Suffixes repeat heavily across the recursive projected DBs
❑ When the DB can be held in main memory, use pseudo-projection
  ❑ Do not physically copy suffixes; store only a pointer to the sequence and the offset of the suffix
  ❑ Ex.: for s = <a(abc)(ac)d(cf)>, s|<a> = (pointer to s, offset 2), i.e., <(abc)(ac)d(cf)>, and s|<ab> = (pointer to s, offset 5), i.e., <(_c)(ac)d(cf)>
❑ But if the DB does not fit in memory
  ❑ Use physical projection
❑ Suggested approach:
  ❑ Integrate physical and pseudo-projection
  ❑ Swap to pseudo-projection once the data fits in memory
(A small sketch of pseudo-projection follows below.)
71
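A tiny sketch of the pseudo-projection idea, again on flattened single-item sequences and with my own names: a projected database is represented as (sequence index, suffix offset) pairs, so suffixes are never physically copied.

db = ["aabcacdcf", "adcbcae", "efabdfcb", "egafcbc"]

def pseudo_project(pointers, item):
    # Advance each (sequence index, offset) pointer past the first
    # occurrence of item in its suffix; drop sequences without it
    out = []
    for sid, off in pointers:
        pos = db[sid].find(item, off)
        if pos != -1:
            out.append((sid, pos + 1))
    return out

full = [(sid, 0) for sid in range(len(db))]
a_proj = pseudo_project(full, "a")      # the <a>-projected DB
ab_proj = pseudo_project(a_proj, "b")   # the <ab>-projected DB
print(a_proj)    # [(0, 1), (1, 1), (2, 3), (3, 3)]
print(ab_proj)   # offsets only; no suffix text is ever materialized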
END OF UNIT - II
72