DM 2
Unit - II
Association Rule Mining
❑ Mining Frequent Patterns
❑ Associations and correlations
❑ Mining Methods
❑ Mining Various kinds of Association Rules
❑ Correlation Analysis
❑ Constraint based Association mining
❑ Graph Pattern Mining and Sequential Pattern Mining (SPM)
2
Pattern Discovery: Basic Concepts
3
What Is Pattern Discovery?
❑ What are patterns?
❑ Patterns: A set of items, subsequences, or substructures that occur
frequently together (or are strongly correlated) in a data set
❑ Patterns represent intrinsic and important properties of datasets
❑ Pattern discovery: Uncovering patterns from massive data sets
❑ Motivation examples:
❑ What products were often purchased together?
❑ What are the subsequent purchases after buying an iPad?
❑ What code segments likely contain copy-and-paste bugs?
❑ What word sequences likely form phrases in this corpus?
4
Basic Concepts: k-Itemsets and Their Supports
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
❑ Itemset: A set of one or more items
❑ k-itemset: X = {x1, …, xk}
  ❑ Ex. {Beer, Nuts, Diaper} is a 3-itemset
❑ (absolute) support (count) of X, sup{X}: the frequency, i.e., the number of occurrences, of itemset X
  ❑ Ex. sup{Beer} = 3
  ❑ Ex. sup{Diaper} = 4
  ❑ Ex. sup{Beer, Diaper} = 3
  ❑ Ex. sup{Beer, Eggs} = 1
❑ (relative) support, s{X}: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
  ❑ Ex. s{Beer} = 3/5 = 60%
  ❑ Ex. s{Diaper} = 4/5 = 80%
  ❑ Ex. s{Beer, Eggs} = 1/5 = 20%
5
Basic Concepts: Frequent Itemsets (Patterns)
❑ An itemset (or a pattern) X is frequent if the support of X is no less than a minsup threshold σ (illustrated on the same transaction table as above)
6
From Frequent Itemsets to Association Rules
❑ Compared with itemsets, rules can be more telling
  ❑ Ex. Diaper → Beer
    ❑ Buying diapers may likely lead to buying beer
    ❑ How strong is this rule? (support, confidence)
❑ Measuring association rules: X → Y (s, c)
  ❑ Both X and Y are itemsets
  ❑ Support, s: the probability that a transaction contains X ∪ Y
    ❑ Ex. s{Diaper, Beer} = 3/5 = 0.6 (i.e., 60%)
  ❑ Confidence, c: the conditional probability that a transaction containing X also contains Y
    ❑ Calculation: c = sup(X ∪ Y) / sup(X)
    ❑ Note: X ∪ Y is the union of the two itemsets; a transaction supporting it contains both X and Y
    ❑ Ex. c = sup{Diaper, Beer} / sup{Diaper} = 3/4 = 0.75
(In the transaction table above, 4 transactions contain {Diaper}, 3 contain {Beer}, and 3 contain both; a short computational sketch of s and c follows below.)
7
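A quick computational check of the support and confidence numbers above (a minimal Python sketch; the function names sup, s, and confidence are mine, not from the slides):

TDB = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def sup(itemset):
    # Absolute support: number of transactions containing the itemset
    return sum(1 for t in TDB if itemset <= t)

def s(itemset):
    # Relative support: fraction of transactions containing the itemset
    return sup(itemset) / len(TDB)

def confidence(X, Y):
    # Confidence of the rule X -> Y: sup(X ∪ Y) / sup(X)
    return sup(X | Y) / sup(X)

print(sup({"Beer", "Diaper"}))            # 3
print(s({"Diaper"}))                      # 0.8
print(confidence({"Diaper"}, {"Beer"}))   # 0.75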
Mining Frequent Itemsets and Association Rules
❑ Association rule mining
  ❑ Given two thresholds: minsup, minconf
  ❑ Find all association rules X → Y whose support s ≥ minsup and confidence c ≥ minconf
9
Apriori Pruning and Scalable Mining Methods
❑ Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not even be generated! (Agrawal &
Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
❑Scalable mining Methods: Three major approaches
❑ Level-wise, join-based approach: Apriori (Agrawal &
Srikant@VLDB’94)
❑ Vertical data format approach: Eclat (Zaki, Parthasarathy,
Ogihara, Li @KDD’97)
❑ Frequent pattern projection and growth: FPgrowth (Han, Pei,
Yin @SIGMOD’00)
10
The Apriori Algorithm—An Example
minsup = 2
Database TDB:
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E
1st scan → C1 with counts: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
F1 (frequent 1-itemsets): {A}: 2, {B}: 3, {C}: 3, {E}: 3
C2 (candidates generated from F1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
F2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
C3 (candidate from F2): {B, C, E}; a 3rd scan gives sup{B, C, E} = 2, so F3: {B, C, E}: 2
11
The Downward Closure Property of Frequent Patterns
❑ Observation: From TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
❑ We get a frequent itemset: {a1, …, a50}
❑ Also, its subsets are all frequent: {a1}, {a2}, …, {a50}, {a1, a2}, …, {a1, …, a49}, …
❑ There must be some hidden relationships among frequent patterns!
❑ The downward closure (also called “Apriori”) property of frequent patterns
❑ If {beer, diaper, nuts} is frequent, so is {beer, diaper}
❑ Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
❑ Apriori: Any subset of a frequent itemset must be frequent
❑ Efficient mining methodology
❑ If any subset of an itemset S is infrequent, then there is no chance for S to
be frequent—why do we even have to consider S!? A sharp knife for pruning!
12
Apriori: A Candidate Generation & Test Approach
❑ Outline of Apriori (level-wise, candidate generation and test)
❑ Initially, scan DB once to get frequent 1-itemset
❑ Repeat
❑ Generate length-(k+1) candidate itemsets from length-k frequent
itemsets
❑ Test the candidates against DB to find frequent (k+1)-itemsets
❑ Set k := k +1
❑ Until no frequent or candidate set can be generated
❑ Return all the frequent itemsets derived
13
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Fk: Frequent itemset of size k
k := 1;
Fk := {frequent items}; // frequent 1-itemsets
While (Fk != ∅) do { // while Fk is non-empty
  Ck+1 := candidates generated from Fk; // candidate generation
  Derive Fk+1 by counting candidates in Ck+1 with respect to TDB at minsup;
  k := k + 1
}
return ∪k Fk // return the Fk generated at each level
14
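The pseudo-code above maps directly onto a small program. The following is a minimal, illustrative Python sketch of the level-wise loop, including the self-join candidate generation and subset pruning detailed on the next slide; the names apriori and support_count are mine, not from the slides.

from itertools import combinations

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
minsup = 2   # absolute support threshold, as in the example above

def support_count(candidates, db):
    # Count how many transactions contain each candidate itemset
    return {c: sum(1 for t in db if c <= t) for c in candidates}

def apriori(db, minsup):
    items = {frozenset([x]) for t in db for x in t}
    counts = support_count(items, db)
    Fk = {c for c, n in counts.items() if n >= minsup}   # frequent 1-itemsets
    frequent = {c: counts[c] for c in Fk}
    k = 1
    while Fk:
        # Candidate generation: self-join Fk, then prune by downward closure
        # (every k-subset of a (k+1)-candidate must itself be frequent)
        Ck1 = set()
        for p in Fk:
            for q in Fk:
                u = p | q
                if len(u) == k + 1 and all(frozenset(s) in Fk
                                           for s in combinations(u, k)):
                    Ck1.add(u)
        counts = support_count(Ck1, db)
        Fk = {c for c, n in counts.items() if n >= minsup}
        frequent.update({c: counts[c] for c in Fk})
        k += 1
    return frequent

print(apriori(TDB, minsup))   # includes {B, C, E} with support 2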
Candidate Generation: An SQL Implementation
❑ Suppose the items in Fk-1 are listed in an order
❑ Step 1: self-joining Fk-1
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Fk-1 as p, Fk-1 as q
  where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
❑ Step 2: pruning
  for all itemsets c in Ck do
    for all (k-1)-subsets s of c do
      if (s is not in Fk-1) then delete c from Ck
❑ Ex.: self-joining F3 = {abc, abd, acd, ace, bcd} produces the candidates abcd (from abc and abd) and acde (from acd and ace); acde is then pruned because its subset ade (and cde) is not in F3, leaving C4 = {abcd}
15
Apriori: Improvements and Alternatives
❑ Reduce passes of transaction database scans
  ❑ Partitioning (e.g., Savasere, et al., 1995) (to be discussed in subsequent slides)
  ❑ Dynamic itemset counting (Brin, et al., 1997)
❑ Shrink the number of candidates
  ❑ Hashing (e.g., DHP: Park, et al., 1995) (to be discussed in subsequent slides)
  ❑ Pruning by support lower bounding (e.g., Bayardo, 1998)
  ❑ Sampling (e.g., Toivonen, 1996)
❑ Exploring special data structures
  ❑ Tree projection (Agarwal, et al., 2001)
  ❑ H-Miner (Pei, et al., 2001)
  ❑ Hypercube decomposition (e.g., LCM: Uno, et al., 2004)
16
Partitioning: Scan Database Only Twice
❑ Theorem: Any itemset that is potentially frequent in TDB must be frequent in at least
one of the partitions of TDB
18
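One hedged reading of the two-scan partitioning idea in code (all names are mine; local mining is done here by brute-force subset enumeration purely for illustration, whereas a real implementation would run Apriori or FP-growth inside each partition):

from itertools import combinations

def local_frequent(partition, local_minsup):
    # All itemsets frequent within one partition (brute force; tiny data only)
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for c in combinations(sorted(t), k):
                counts[frozenset(c)] = counts.get(frozenset(c), 0) + 1
    return {c for c, cnt in counts.items() if cnt >= local_minsup}

def partition_mine(db, minsup, num_parts=2):
    n = len(db)
    size = (n + num_parts - 1) // num_parts
    # Scan 1: mine each partition with a proportionally scaled local threshold
    # (rounding down never loses a globally frequent itemset, by the theorem)
    candidates = set()
    for i in range(0, n, size):
        part = db[i:i + size]
        local_minsup = max(1, minsup * len(part) // n)
        candidates |= local_frequent(part, local_minsup)
    # Scan 2: count the global support of the combined local candidates
    counts = {c: sum(1 for t in db if c <= t) for c in candidates}
    return {c: cnt for c, cnt in counts.items() if cnt >= minsup}

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mine(db, minsup=2))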
Why Mining Frequent Patterns by Pattern Growth?
❑ Apriori: A breadth-first search mining algorithm
❑ First find the complete set of frequent k-itemsets
❑ Then derive frequent (k+1)-itemset candidates
❑ Scan DB again to find true frequent (k+1)-itemsets
❑ Motivation for a different mining methodology
❑ Can we develop a depth-first search mining algorithm?
❑ For a frequent itemset ρ, can subsequent search be confined to only those transactions that contain ρ?
❑ Such thinking leads to a frequent pattern growth approach:
❑ FPGrowth (J. Han, J. Pei, Y. Yin, “Mining Frequent Patterns
without Candidate Generation,” SIGMOD 2000)
19
Example: Construct FP-tree from a Transaction DB
TID Items in the Transaction → Ordered, frequent itemlist
100 {f, a, c, d, g, i, m, p} → f, c, a, m, p
200 {a, b, c, f, l, m, o} → f, c, a, b, m
300 {b, f, h, j, o, w} → f, b
400 {b, c, k, s, p} → c, b, p
500 {a, f, c, e, l, p, m, n} → f, c, a, m, p
Let min_support = 3
1. Scan DB once, find the single-item frequent patterns: f:4, a:3, c:4, b:3, m:3, p:3
2. Sort the frequent items in frequency-descending order into the F-list: f-c-a-b-m-p
3. Scan DB again and construct the FP-tree
  ❑ The ordered frequent itemlist of each transaction is inserted as a branch, with shared prefix sub-branches merged and their counts accumulated
  ❑ A header table records each frequent item's frequency and the head of its node-link list into the tree (f:4, c:4, a:3, b:3, m:3, p:3)
❑ After inserting the 1st frequent itemlist "f, c, a, m, p", the tree is a single path under the root {}: f:1 → c:1 → a:1 → m:1 → p:1
20
Example: Construct FP-tree from a Transaction DB
(Same transaction table, F-list, and construction steps as on the previous slide.)
❑ After inserting the 2nd frequent itemlist "f, c, a, b, m": the shared prefix accumulates to f:2 → c:2 → a:2; a new sub-branch b:1 → m:1 is created under a, while the first branch keeps m:1 → p:1
21
Example: Construct FP-tree from a Transaction DB
(Same transaction table, F-list, and construction steps as on the previous slide.)
❑ After inserting all five frequent itemlists, the final FP-tree has the branches:
  ❑ f:4 → c:3 → a:3 → m:2 → p:2 (transactions 100 and 500)
  ❑ b:1 → m:1 under a:3 (transaction 200)
  ❑ f:4 → b:1 (transaction 300)
  ❑ c:1 → b:1 → p:1 directly under the root (transaction 400)
❑ Header table: f:4, c:4, a:3, b:3, m:3, p:3, each item linking to its node occurrences in the tree
(A construction sketch in code follows below.)
22
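A compact sketch of the two-scan FP-tree construction just described (class and helper names are mine; ties in item frequency are broken arbitrarily here, so the item order may differ slightly from the slide's F-list f-c-a-b-m-p without changing the supports involved):

from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_support):
    # Scan 1: count single items and keep the frequent ones
    freq = Counter(item for t in db for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    # F-list: frequent items in support-descending order
    rank = {i: r for r, i in enumerate(sorted(freq, key=lambda i: -freq[i]))}
    root = Node(None, None)
    header = defaultdict(list)          # item -> node-links into the tree
    # Scan 2: insert each transaction's ordered frequent itemlist as a branch
    for t in db:
        itemlist = sorted((i for i in t if i in freq), key=lambda i: rank[i])
        node = root
        for item in itemlist:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1             # shared prefixes accumulate counts
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, min_support=3)
print([(n.item, n.count) for n in header["p"]])   # [('p', 2), ('p', 1)]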
Mining FP-Tree: Divide and Conquer
Based on Patterns and Data
❑ Pattern mining can be partitioned according to current patterns
❑ Patterns containing p: p’s conditional database: fcam:2, cb:1
❑ p’s conditional database (i.e., the database under the condition that p exists):
❑ transformed prefix paths of item p
❑ Patterns having m but no p: m’s conditional database: fca:2, fcab:1
❑ … (and similarly for the remaining items in the F-list)
❑ min_support = 3; each conditional database is itself mined recursively in the same divide-and-conquer manner
25
Expressing Patterns in Compressed Form: Max-Patterns
❑ Solution 2: Max-patterns: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
❑ Difference from closed patterns?
  ❑ We do not care about the real support of the sub-patterns of a max-pattern
❑ Let Transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
❑ Suppose minsup = 1. How many max-patterns does TDB1 contain?
❑ One: P: “{a1, …, a100}: 1”
❑ Max-pattern is a lossy compression!
❑ We only know {a1, …, a40} is frequent
❑ But we do not know the real support of {a1, …, a40}, …, any more!
❑ Thus in many applications, mining closed patterns is more desirable than mining max-patterns
26
Mining Multiple-Level Frequent Patterns
❑ Items often form hierarchies
  ❑ Ex.: Dairyland 2% milk; Wonder wheat bread
❑ How to set min-support thresholds?
  ❑ Uniform min-support across multiple levels (reasonable?)
  ❑ Level-reduced min-support: items at the lower level are expected to have lower support
❑ Ex.: Milk [support = 10%] at level 1; 2% Milk [support = 6%] and Skim Milk [support = 2%] at level 2
  ❑ Uniform support: min_sup = 5% at both levels (so Skim Milk would be missed)
  ❑ Reduced support: min_sup = 5% at level 1 and min_sup = 1% at level 2
❑ Efficient mining: shared multi-level mining
  ❑ Use the lowest min-support to pass down the set of candidates
28
Redundancy Filtering at Mining Multi-Level
Associations
❑ Multi-level association mining may generate many redundant rules
❑ Redundancy filtering: Some rules may be redundant due to “ancestor”
relationships between items
❑ milk → wheat bread [support = 8%, confidence = 70%] (1)
❑ 2% milk → wheat bread [support = 2%, confidence = 72%] (2)
❑ Suppose the 2% milk sold is about ¼ of milk sold in gallons
❑ (2) should be able to be “derived” from (1)
❑ A rule is redundant if its support is close to the “expected” value, according
to its “ancestor” rule, and it has a similar confidence as its “ancestor”
❑ Rule (1) is an ancestor of rule (2), which one to prune?
29
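A quick check of the numbers above, under the stated assumption that 2% milk is about one quarter of all milk sold: the expected support of rule (2) is 8% × 1/4 = 2%, which equals its observed support, and its confidence (72%) is close to that of rule (1) (70%). Rule (2) therefore adds nothing beyond its ancestor, so rule (2) is the one to prune and the more general rule (1) is kept.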
Customized Min-Supports for Different Kinds of Items
❑ We have used the same min-support threshold for all the items or item sets
to be mined in each association mining
❑ In reality, some items (e.g., diamond, watch, …) are valuable but less
frequent
❑ It is necessary to have customized min-support settings for different kinds of
items
❑ One Method: Use group-based “individualized” min-support
❑ E.g., {diamond, watch}: 0.05%; {bread, milk}: 5%; …
❑ How to mine such rules efficiently?
❑ Existing scalable mining algorithms can be easily extended to cover such
cases
30
Mining Multi-Dimensional Associations
❑ Single-dimensional rules (e.g., items are all in “product” dimension)
❑ buys(X, “milk”) ⇒ buys(X, “bread”)
❑ Multi-dimensional rules (i.e., items in 2 dimensions or predicates)
❑ Inter-dimension association rules (no repeated predicates)
❑ age(X, “18-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
❑ Hybrid-dimension association rules (repeated predicates)
❑ age(X, “18-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
❑ Attributes can be categorical or numerical
❑ Categorical Attributes (e.g., profession, product: no ordering among
values): Data cube for inter-dimension association
❑ Quantitative Attributes: Numeric, implicit ordering among values—
discretization, clustering, and gradient approaches
31
Mining Quantitative Associations
❑ Mining associations with numerical attributes
❑ Ex.: Numerical attributes: age and salary
❑ Methods
❑ Static discretization based on predefined concept hierarchies
❑ Discretization on each dimension with hierarchy
❑ age: {0-10, 10-20, …, 90-100} → {young, mid-aged, old}
❑ Dynamic discretization based on data distribution
❑ Clustering: Distance-based association
❑ First one-dimensional clustering, then association
❑ Deviation analysis:
❑ Gender = female ⇒ Wage: mean = $7/hr (overall mean = $9)
32
Mining Extraordinary Phenomena in Quantitative
Association Mining
❑ Mining extraordinary (i.e., interesting) phenomena
❑ Ex.: Gender = female ⇒ Wage: mean = $7/hr (overall mean = $9)
❑ LHS: a subset of the population
❑ RHS: an extraordinary behavior of this subset
❑ The rule is accepted only if a statistical test (e.g., Z-test) confirms the
inference with high confidence
❑ Subrule: Highlights the extraordinary behavior of a subset of the
population of the super rule
❑ Ex.: (Gender = female) ∧ (South = yes) ⇒ mean wage = $6.3/hr
❑ Rule condition can be categorical or numerical (quantitative rules)
❑ Ex.: Education in [14-18] (yrs) ⇒ mean wage = $11.64/hr
❑ Efficient methods have been developed for mining such extraordinary
rules (e.g., Aumann and Lindell@KDD’99)
33
Rare Patterns vs. Negative Patterns
❑ Rare patterns
❑ Very low support but interesting (e.g., buying Rolex watches)
❑ How to mine them? Setting individualized, group-based min-support
thresholds for different groups of items
❑ Negative patterns
❑ Negatively correlated: Unlikely to happen together
❑ Ex.: Since it is unlikely that the same customer buys both a Ford
Expedition (an SUV car) and a Ford Fusion (a hybrid car), buying a
Ford Expedition and buying a Ford Fusion are likely negatively
correlated patterns
❑ How to define negative patterns?
34
Defining Negatively Correlated Patterns
❑ A support-based definition
❑ If itemsets A and B are both frequent but rarely occur together, i.e.,
  s(A ∪ B) << s(A) × s(B)
❑ Then A and B are negatively correlated
❑ Is this a good definition for large transaction datasets?
❑ Ex.: Suppose a store sold two needle packages A and B 100 times each,
but only one transaction contained both A and B
❑ When there are in total 200 transactions, we have
  ❑ s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) << s(A) × s(B)
❑ But when there are 10^5 transactions, we have
  ❑ s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) × s(B)
❑ What is the problem?—Null transactions: The support-based definition
is not null-invariant!
35
Defining Negative Correlation: Need
Null-Invariance in Definition
❑ A good definition on negative correlation should take care of the null-
invariance problem
❑ Whether two itemsets A and B are negatively correlated should not be
influenced by the number of null-transactions
❑ A Kulczynski measure-based definition
  ❑ If itemsets A and B are frequent but
    (s(A ∪ B)/s(A) + s(A ∪ B)/s(B))/2 < ε,
    where ε is a negative-pattern threshold, then A and B are negatively correlated
❑ For the same needle-package problem:
  ❑ Whether there are 200 or 10^5 transactions in total, the measure is (s(A ∪ B)/s(A) + s(A ∪ B)/s(B))/2 = (0.01 + 0.01)/2 = 0.01
  ❑ So for a threshold such as ε = 0.01, the judgment is the same in both cases: the measure is not affected by the number of null transactions
36
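The needle-package numbers can be replayed in a few lines to see why the expectation-based test flips with the database size while the Kulczynski-based measure does not (the function name is mine):

def negative_correlation_tests(n_transactions):
    # A and B are each bought 100 times; exactly one transaction has both
    s_a = s_b = 100 / n_transactions
    s_ab = 1 / n_transactions
    support_based = s_ab < s_a * s_b                 # expectation-based test
    kulczynski = (s_ab / s_a + s_ab / s_b) / 2       # null-invariant measure
    return support_based, kulczynski

for n in (200, 10**5):
    print(n, negative_correlation_tests(n))
# 200    -> (True, 0.01)   both views call A, B negatively correlated
# 100000 -> (False, 0.01)  the support-based test flips; Kulczynski stays 0.01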
Mining Compressed Patterns
❑ Why mine compressed patterns?
  ❑ Too many scattered patterns, which individually are not very meaningful
❑ Pattern distance measure
❑ Example pattern set:
  Pat-ID Item-Sets Support
  P1 {38, 16, 18, 12} 205227
  P2 {38, 16, 18, 12, 17} 205211
  P3 {39, 38, 16, 18, 12, 17} 101758
  P4 {39, 16, 18, 12, 17} 161563
  P5 {39, 16, 18, 12} 161576
❑ Summary
39
Constraint-Based Pattern Mining
❑ Why Constraint-Based Mining?
❑ Different Kinds of Constraints: Different Pruning Strategies
❑ Constrained Mining with Pattern Anti-Monotonicity
❑ Constrained Mining with Pattern Monotonicity
❑ Constrained Mining with Data Anti-Monotonicity
❑ Constrained Mining with Succinct Constraints
❑ Constrained Mining with Convertible Constraints
❑ Handling Multiple Constraints
❑ Constraint-Based Sequential-Pattern Mining
40
Why Constraint-Based Mining?
❑ Finding all the patterns in a dataset autonomously?—unrealistic!
❑ Too many patterns but not necessarily user-interested!
❑ Pattern mining in practice: Often a user-guided, interactive process
❑ User directs what to be mined using a data mining query language (or a
graphical user interface), specifying various kinds of constraints
❑ What is constraint-based mining?
❑ Mine together with user-provided constraints
❑ Why constraint-based mining?
❑ User flexibility: User provides constraints on what to be mined
❑ Optimization: System explores such constraints for mining efficiency
❑ E.g., Push constraints deeply into the mining process
41
Various Kinds of User-Specified Constraints
❑ Knowledge type constraint—Specifying what kinds of knowledge to mine
❑ Ex.: Classification, association, clustering, outlier finding, …
❑ Data constraint—using SQL-like queries
❑ Ex.: Find products sold together in NY stores this year
❑ Dimension/level constraint—similar to projection in relational database
❑ Ex.: In relevance to region, price, brand, customer category
❑ Interestingness constraint—various kinds of thresholds
  ❑ Ex.: Strong rules: min_sup ≥ 0.02, min_conf ≥ 0.6, min_correlation ≥ 0.7
❑ Rule (or pattern) constraint (the focus of this study)
  ❑ Ex.: Small sales (price < $10) triggers big sales (sum > $200)
42
Pattern Space Pruning with Pattern Anti-Monotonicity
TID Transaction
10 a, b, c, d, f, h
20 b, c, d, f, g, h
30 b, c, d, f, g
40 a, c, e, f, g
min_sup = 2
Item Price Profit
a 100 40
b 40 0
c 150 −20
d 35 −15
e 55 −30
f 45 −10
g 80 20
h 10 5
Note: item.price > 0; profit can be negative
◼ A constraint c is anti-monotone: if an itemset S violates constraint c, so does any of its supersets; that is, mining on itemset S can be terminated
◼ Ex. 1: c1: sum(S.price) ≤ v is anti-monotone
◼ Ex. 2: c2: range(S.profit) ≤ 15 is anti-monotone
  ◼ Itemset ab violates c2 (range(ab) = 40)
  ◼ So does every superset of ab
◼ Ex. 3: c3: sum(S.price) ≥ v is not anti-monotone
◼ Ex. 4: Is c4: support(S) ≥ σ anti-monotone?
  ◼ Yes! Apriori pruning is essentially pruning with an anti-monotone constraint
(A pruning sketch in code follows below.)
43
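As referenced above, here is a small illustrative sketch of pushing the anti-monotone constraint c2: range(S.profit) ≤ 15 into a depth-first itemset enumeration; support counting is left out to isolate the pruning step, and all names are mine.

profit = {"a": 40, "b": 0, "c": -20, "d": -15,
          "e": -30, "f": -10, "g": 20, "h": 5}
items = sorted(profit)

def satisfies_c2(itemset):
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals) <= 15      # c2: range(S.profit) <= 15

def enumerate_satisfying(prefix, start, out):
    for idx in range(start, len(items)):
        cand = prefix + [items[idx]]
        if not satisfies_c2(cand):
            # Anti-monotone: no superset of cand can satisfy c2,
            # so the whole subtree rooted at cand is skipped
            continue
        out.append(cand)
        enumerate_satisfying(cand, idx + 1, out)
    return out

result = enumerate_satisfying([], 0, [])
print(len(result), result[:5])
# e.g. ['a'] and ['b', 'd'] satisfy c2, but nothing containing both a and b
# does, because range({a, b}) = 40 already violates the constraint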
Pattern Monotonicity and Its Roles
(Same transaction table and item price/profit table as on the previous slide; min_sup = 2.)
◼ A constraint c is monotone: if an itemset S satisfies the constraint c, so does any of its supersets
  ◼ That is, we do not need to check c again in subsequent mining
◼ Ex. 1: c1: sum(S.price) ≥ v is monotone
◼ Ex. 2: c2: min(S.price) ≤ v is monotone
◼ Ex. 3: c3: range(S.profit) ≥ 15 is monotone
  ◼ Itemset ab satisfies c3
  ◼ So does every superset of ab
44
Data Space Pruning with Data Anti-Monotonicity
(Same transaction table and item price/profit table as above; min_sup = 2.)
❑ A constraint c is data anti-monotone: in the mining process, if a data entry t cannot satisfy a pattern p under c, then t cannot satisfy p's supersets either
  ❑ Data space pruning: data entry t can be pruned
❑ Ex. 1: c1: sum(S.Profit) ≥ v is data anti-monotone
  ❑ Let constraint c1 be sum(S.Profit) ≥ 25
  ❑ T30 = {b, c, d, f, g} can be removed, since no combination of its items can form an S whose profit sum is ≥ 25
❑ Ex. 2: c2: min(S.Price) ≤ v is data anti-monotone
  ❑ Consider v = 5, but every item in a transaction, say T50, has a price higher than 10
❑ Ex. 3: c3: range(S.Profit) > 25 is data anti-monotone
45
Data Space Pruning Should Be Explored Recursively
❑ Example: c3: range(S.Profit) > 25
❑ We check b's projected database:
  TID Transaction (b's projected DB)
  10 a, c, d, f, h
  20 c, d, f, g, h
❑ But item "a" is infrequent in it (sup = 1), so it is pruned by min_sup
❑ Note: c3 prunes T10 effectively only after "a" is pruned in b's projected DB: without a, the remaining items of T10 (c, d, f, h, with profits −20, −15, −10, 5) give a range of at most 25, so T10 cannot support any pattern satisfying c3 and can be dropped
46
Succinctness: Pruning Both Data and Pattern Spaces
◼ Succinctness: If the constraint c can be enforced by directly manipulating the data
◼ Ex. 1: To find those patterns without item i
◼ Remove i from DB and then mine (pattern space pruning)
◼ Ex. 2: To find those patterns containing item i
◼ Mine only i-projected DB (data space pruning)
◼ Ex. 3: c3: min(S.Price) ≤ v is succinct
  ◼ Start with only the items whose price is ≤ v, and remove the transactions that contain only high-price items (pattern + data space pruning)
◼ Ex. 4: c4: sum(S.Price) ≥ v is not succinct
  ◼ It cannot be determined beforehand, since the price sum of an itemset S keeps increasing as S grows
47
Convertible Constraints: Ordering Data in Transactions
TID Transaction
10 a, b, c, d, f, h
20 a, b, c, d, f, g, h
30 b, c, d, f, g
40 a, c, e, f, g
min_sup = 2
Item Price Profit
a 100 40
b 40 0
c 150 −20
d 35 −15
e 55 −30
f 45 −5
g 80 30
h 10 5
◼ Convert tough constraints into (anti-)monotone constraints by a proper ordering of the items in transactions
◼ Examine c1: avg(S.profit) > 20
  ◼ Order the items in profit-value-descending order: <a, g, f, b, h, d, c, e>
  ◼ The itemset ab violates c1 (avg(ab) = 20)
  ◼ So does ab* (i.e., any pattern grown from ab in the ab-projected DB)
  ◼ c1 becomes anti-monotone if patterns grow in the right order!
◼ Can item reordering work for Apriori?
  ◼ Level-wise candidate generation requires multi-way checking!
  ◼ avg(agf) = 21.7 > 20, but avg(gf) = 12.5 < 20
  ◼ So Apriori would not generate "agf" as a candidate
48
Advanced Frequent Pattern Mining
❑ Mining Diverse Patterns
❑ Summary
49
What Is Graph Pattern Mining?
❑ Chem-informatics:
❑ Mining frequent chemical compound structures
50
Frequent (Sub)Graph Patterns
❑ Given a labeled graph dataset D = {G1, G2, …, Gn}, the supporting graph set of a subgraph g is Dg = {Gi | g ⊆ Gi, Gi ∈ D}
  ❑ support(g) = |Dg| / |D|
❑ A (sub)graph g is frequent if support(g) ≥ min_sup
❑ Ex.: Chemical structures: with min_sup = 2, the example graph dataset of three compounds (A), (B), (C) yields two frequent graph patterns (1) and (2), with support = 67% (2 of the 3 graphs)
❑ Alternative: mining frequent subgraph patterns from a single large graph or network
51
Applications of Graph Pattern Mining
❑ Bioinformatics
❑ Gene networks, protein interactions, metabolic pathways
❑ Chem-informatics: Mining chemical compound structures
❑ Social networks, web communities, tweets, …
❑ Cell phone networks, computer networks, …
❑ Web graphs, XML structures, Semantic Web, information networks
❑ Software engineering: Program execution flow analysis
❑ Building blocks for graph classification, clustering, compression, comparison,
and correlation analysis
❑ Graph indexing and graph similarity search
52
Graph Pattern Mining Algorithms: Different
Methodologies
❑ Generation of candidate subgraphs
❑ Apriori vs. pattern growth (e.g., FSG vs. gSpan)
❑ Search order
❑ Breadth vs. depth
❑ Elimination of duplicate subgraphs
❑ Passive vs. active (e.g., gSpan [Yan & Han, 2002])
❑ Support calculation
❑ Store embeddings (e.g., GASTON [Nijssen & Kok, 2004], FFSM [Huan, Wang,
& Prins, 2003], MoFa [Borgelt & Berthold, ICDM’02])
❑ Order of pattern discovery
❑ Path → tree → graph (e.g., GASTON [Nijssen & Kok, 2004])
53
Apriori-Based Approach
❑ The Apriori property (anti-monotonicity): a (k+1)-edge subgraph can be frequent only if all of its k-edge subgraphs are frequent
❑ A candidate size-(k+1) edge/vertex subgraph is generated by joining two frequent k-edge/vertex subgraphs
❑ Iterative mining process:
  ❑ candidate generation → candidate pruning → support counting → candidate elimination
54
Candidate Generation:
Vertex Growing vs. Edge Growing
❑ Methodology: breadth-first search, Apriori-style joining of two size-k graphs
❑ Many possibilities for generating size-(k+1) candidate graphs
56
gSpan: Graph Pattern Growth in Order
❑ Right-most path extension in subgraph pattern growth
  ❑ Right-most path: the path from the root to the right-most leaf (choosing the vertex with the smallest index at each step)
  ❑ Reduces the generation of duplicate subgraphs
❑ Example DFS code of a small 5-vertex graph: e0 = (0,1), e1 = (1,2), e2 = (2,3), e3 = (3,0), e4 = (2,4)
57
Why Mine Closed Graph Patterns?
❑ Challenge: An n-edge frequent graph may have 2^n subgraphs
❑ Motivation: Explore closed frequent subgraphs to handle graph
pattern explosion problem
❑ A frequent graph G is closed if there exists no supergraph of G that
carries the same support as G
If this subgraph is closed in the
graph dataset, it implies that
none of its frequent super-graphs
carries the same support
58
CloseGraph: Directly Mining Closed Graph Patterns
❑ CloseGraph: Mining closed graph patterns by extending gSpan (Yan & Han, KDD’03)
(Performance plots over minimum support values from 0.05 to 0.1 omitted.)
❑ Summary
61
Sequential Pattern Mining
❑ Sequential Pattern and Sequential Pattern Mining
62
Sequence Databases & Sequential Patterns
❑ Sequential pattern mining has broad applications
❑ Customer shopping sequences
❑ Purchase a laptop first, then a digital camera, and then a smartphone,
within 6 months
❑ Medical treatments, natural disasters (e.g., earthquakes), science &
engineering processes, stocks and markets, ...
❑ Weblog click streams, calling patterns, …
❑ Software engineering: Program execution sequences, …
❑ Biological sequences: DNA, protein, …
❑ Transaction DB, sequence DB vs. time-series DB
❑ Gapped vs. non-gapped sequential patterns
❑ Shopping sequences, clicking streams vs. biological sequences
63
Sequential Pattern and Sequential Pattern Mining
❑ Sequential pattern mining: Given a set of sequences, find the complete set of
frequent subsequences (i.e., satisfying the min_sup threshold)
64
Sequential Pattern Mining Algorithms
❑ Algorithm requirement: Efficient, scalable, finding complete set, incorporating
various kinds of user-specific constraints
❑ The Apriori property still holds: if a subsequence s1 is infrequent, none of s1's super-sequences can be frequent
❑ Representative algorithms
  ❑ GSP (Generalized Sequential Patterns): Srikant & Agrawal @ EDBT'96
  ❑ Vertical format-based mining: SPADE (Zaki @ Machine Learning'01)
❑ Pattern-growth methods: PrefixSpan (Pei, et al. @TKDE’04)
❑ Mining closed sequential patterns: CloSpan (Yan, et al. @SDM’03)
❑ Constraint-based sequential pattern mining (to be covered in the constraint
mining section)
65
GSP: Apriori-Based Sequential Pattern Mining
SID Sequence (min_sup = 2)
10 <(bd)cb(ac)>
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>
❑ Initial candidates: all 8 singleton sequences
  ❑ <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
❑ Scan DB once, count support for each candidate:
  ❑ <a>: 3, <b>: 5, <c>: 4, <d>: 3, <e>: 3, <f>: 2, <g>: 1, <h>: 1
❑ Generate length-2 candidate sequences from the 6 frequent singletons <a>, …, <f>
  ❑ Two-element sequences <xy>: 6 × 6 = 36 candidates (<aa>, <ab>, …, <ff>)
  ❑ One-element, two-item sequences <(xy)>: 6 × 5 / 2 = 15 candidates (<(ab)>, <(ac)>, …, <(ef)>)
❑ Without Apriori pruning (all 8 singletons): 8 × 8 + 8 × 7 / 2 = 92 length-2 candidates
❑ With pruning, length-2 candidates: 36 + 15 = 51
GSP (Generalized Sequential Patterns): Srikant & Agrawal @ EDBT'96
66
GSP Mining and Pruning
❑ 5th scan: 1 candidate, 1 length-5 sequential pattern: <(bd)cba> (candidates that cannot pass the min_sup threshold are pruned)
❑ 4th scan: 8 candidates, 7 length-4 sequential patterns, e.g., <abba>, <(bd)bc>, … (candidates that do not appear in the DB are dropped)
68
PrefixSpan: A Pattern-Growth Approach
min_sup = 2
SID Sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
❑ Prefix and suffix
  ❑ Given the sequence <a(abc)(ac)d(cf)>
  ❑ Its prefixes include <a>, <aa>, <a(ab)>, <a(abc)>, …
  ❑ A suffix is the prefix-based projection: e.g., the suffix w.r.t. <a> is <(abc)(ac)d(cf)>, w.r.t. <aa> it is <(_bc)(ac)d(cf)>, and w.r.t. <ab> it is <(_c)(ac)d(cf)>
❑ PrefixSpan mining: prefix projections
  ❑ Step 1: Find the length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
  ❑ Step 2: Divide the search space and mine each projected DB
    ❑ <a>-projected DB, <b>-projected DB, …, <f>-projected DB
PrefixSpan (Prefix-projected Sequential pattern mining): Pei, et al. @ TKDE'04
69
PrefixSpan: Mining Prefix-Projected DBs
SID Sequence (min_sup = 2)
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
❑ Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
❑ Partition the search space by prefix: <a>-projected DB, <b>-projected DB, …, <f>-projected DB
  ❑ <a>-projected DB: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  ❑ Mining it yields the length-2 sequential patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>; their projected DBs (<aa>-projected DB, …, <af>-projected DB) are then mined recursively
❑ Major strength of PrefixSpan:
  ◼ No candidate subsequences need to be generated
  ◼ Projected DBs keep shrinking
(A simplified code sketch follows below.)
70
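As referenced above, a deliberately simplified PrefixSpan sketch: each sequence is flattened to single items, so multi-item elements such as (abc) and the '_'-placeholder projections are not modeled, and all names are mine.

def prefixspan(db, min_sup, prefix=None, patterns=None):
    prefix = prefix or []
    patterns = patterns if patterns is not None else []
    # Count, in the current (projected) DB, the items that can extend prefix
    counts = {}
    for seq in db:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        new_prefix = prefix + [item]
        patterns.append(("".join(new_prefix), sup))
        # Project: keep the suffix after the first occurrence of item
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        projected = [suf for suf in projected if suf]
        prefixspan(projected, min_sup, new_prefix, patterns)
    return patterns

# Flattened versions of the four example sequences
db = ["aabcacdcf", "adcbcae", "efabdfcb", "egafcbc"]
for pattern, sup in prefixspan(db, min_sup=2)[:10]:
    print(pattern, sup)   # e.g. a 4, b 4, ... (iteration order may vary)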
Implementation Consideration: Pseudo-Projection vs.
Physical Projection
❑ Major cost of PrefixSpan: constructing the projected DBs
  ❑ Suffixes repeat heavily across the recursive projected DBs
❑ When the DB can be held in main memory, use pseudo-projection
  ❑ Do not physically copy suffixes; store only a pointer to the sequence and the offset of the suffix
  ❑ Ex.: for s = <a(abc)(ac)d(cf)>, s|<a> = (pointer to s, offset 2), i.e., <(abc)(ac)d(cf)>, and s|<ab> = (pointer to s, offset 5), i.e., <(_c)(ac)d(cf)>
❑ But if the DB does not fit in memory
  ❑ Use physical projection
❑ Suggested approach:
  ❑ Integrate physical and pseudo-projection
  ❑ Swap to pseudo-projection once the data fits in memory
(A small sketch of pseudo-projection follows below.)
71
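A tiny sketch of the pseudo-projection idea, again on flattened single-item sequences and with my own names: a projected database is represented as (sequence index, suffix offset) pairs, so suffixes are never physically copied.

db = ["aabcacdcf", "adcbcae", "efabdfcb", "egafcbc"]

def pseudo_project(pointers, item):
    # Advance each (sequence index, offset) pointer past the first
    # occurrence of item in its suffix; drop sequences without it
    out = []
    for sid, off in pointers:
        pos = db[sid].find(item, off)
        if pos != -1:
            out.append((sid, pos + 1))
    return out

full = [(sid, 0) for sid in range(len(db))]
a_proj = pseudo_project(full, "a")      # the <a>-projected DB
ab_proj = pseudo_project(a_proj, "b")   # the <ab>-projected DB
print(a_proj)    # [(0, 1), (1, 1), (2, 3), (3, 3)]
print(ab_proj)   # offsets only; no suffix text is ever materialized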
END OF UNIT - II
72