
Chapter 5: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
   Apriori Algorithm, Improvements to Apriori
   Association Rule Mining
   FP-Growth Mining
 Pattern Evaluation Methods
 Summary
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?
 Web pages of interest to groups of users
 What are the subsequent purchases after buying a PC?
 Finding structural patterns from chemical compounds or social media
 Applications
 Basket data analysis, cross-marketing, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis & motif identification
Frequent Patterns: Frequent Itemsets
 A Frequent pattern in general captures an intrinsic and important
property of a dataset.
 Frequent patterns of a transaction database are the sets of items frequently purchased together, which are called Frequent Itemsets.
 itemset: A set of one or more items
 k-itemset X = {x1, …, xk}
 support count of X: Frequency or occurrence of an itemset X
 (relative) support, s, is the fraction of transactions that contains X (i.e.,
the probability that a transaction contains X)
 An itemset X is frequent if X’s support is no less than a minsup threshold

Tid | Items bought
10  | Bread, Nuts, Jam
20  | Bread, Coffee, Jam
30  | Bread, Jam, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Jam, Eggs, Milk

(Figure: Venn diagram of the transactions containing Bread, the transactions containing Nuts, and the transactions containing both.)
Basic Concepts: Association Rules

Tid | Items bought
10  | Bread, Nuts, Jam
20  | Bread, Coffee, Jam
30  | Bread, Jam, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Jam, Eggs, Milk

 support, s: the probability that a transaction contains X ∪ Y
 confidence, c: the conditional probability P(Y|X) that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%. The minimum support count is 50% of 5 = 2.5, i.e., at least 3 transactions.
Frequent patterns: Bread:3, Nuts:3, Jam:4, Eggs:3, {Bread, Jam}:3
 Find all the rules X ⇒ Y with minimum support and confidence
 Association rules formed from the 2-itemset:
 Bread ⇒ Jam (60%, 100%)
 Jam ⇒ Bread (60%, 75%)
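As a quick check of these numbers (a minimal Python sketch, not part of the original slides), support and confidence can be computed directly from the toy TDB:

tdb = [{'Bread', 'Nuts', 'Jam'},
       {'Bread', 'Coffee', 'Jam'},
       {'Bread', 'Jam', 'Eggs'},
       {'Nuts', 'Eggs', 'Milk'},
       {'Nuts', 'Coffee', 'Jam', 'Eggs', 'Milk'}]

def support(itemset, db):
    # fraction of transactions containing every item of `itemset`
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # conditional probability that a transaction with lhs also contains rhs
    return support(lhs | rhs, db) / support(lhs, db)

print(support({'Bread', 'Jam'}, tdb))       # 0.6  -> 60%
print(confidence({'Bread'}, {'Jam'}, tdb))  # 1.0  -> 100%
print(confidence({'Jam'}, {'Bread'}, tdb))  # 0.75 -> 75%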
Computational Complexity of Frequent Itemset
Mining
 How many itemsets can possibly be generated in the worst case?
 Worst case: M^N, where M is the number of distinct items and N is the maximum transaction length (all combinations of items in the longest transaction can be frequent enough)
 The number of frequent itemsets to be generated is sensitive to the minsup threshold
 When minsup is low, there exist an exponential number of frequent itemsets
 The worst-case complexity vs. the expected probability
 Ex. Suppose Walmart has 10^4 distinct items
 The probability of picking up a specific item is 10^-4
 The probability of picking up a particular set of 10 items: ~10^-40
 What is the chance that this particular set of 10 products is frequent, occurring 10^3 times in 10^9 transactions? (The expected number of occurrences is only 10^9 × 10^-40 = 10^-31, so the chance is negligible.)
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Pattern Evaluation Methods
 Summary
The Downward Closure Property and Scalable
Approaches to Frequent Pattern Mining
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {bread, jam, nuts} is frequent, so is {bread, jam}
 i.e., every transaction having {bread, jam, nuts} also contains {bread, jam}
 Scalable mining methods: three major approaches
 Apriori (Agrawal & Srikant @VLDB'94)
 Frequent pattern growth (FPgrowth; Han, Pei & Yin @SIGMOD'00)
 Vertical data format approach (Charm; Zaki & Hsiao @SDM'02)
 The first two approaches are covered in the syllabus
Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
Apriori: A Candidate Generation & Test Approach
 Apriori pruning based on anti-monotone property:
If an itemset is found to be infrequent, its supersets are not
candidates to be generated/tested!
 Method:
 1. Initially, scan the DB once to get the frequent 1-itemsets
 2. Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
 3. Test the candidates against the DB and identify the frequent candidates
 4. Repeat steps 2 & 3 for the next k
 5. Terminate when no frequent or candidate set can be generated
The Apriori Algorithm: An Example (Supmin = 2)

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan, C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan, C2 with counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan, L3: {B,C,E}:2
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
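The pseudo-code above can be turned into a small runnable sketch. This is an illustrative Python implementation under the slide's assumptions (self-join plus downward-closure pruning, then one counting pass per level), not the original course code; the toy TDB and minimum support count of 2 come from the example slide:

from itertools import combinations

def apriori(transactions, min_count):
    # Level-wise mining: generate C(k+1) from Lk by self-join + pruning,
    # then count the candidates against the database.
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {iset: c for iset, c in counts.items() if c >= min_count}
    all_frequent = dict(Lk)
    k = 1
    while Lk:
        keys = list(Lk)
        candidates = set()
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                union = keys[i] | keys[j]
                # keep only (k+1)-itemsets all of whose k-subsets are frequent
                if len(union) == k + 1 and all(
                        frozenset(s) in Lk for s in combinations(union, k)):
                    candidates.add(union)
        counts = {c: 0 for c in candidates}
        for t in transactions:                 # one DB scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_count}
        all_frequent.update(Lk)
        k += 1
    return all_frequent

# Toy TDB and Supmin = 2 from the example slide; finds {B,C,E}:2 among others.
tdb = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
print(apriori(tdb, 2))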
Implementation of Apriori

 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because cde and ade are not in L3
 C4 = {abcd}
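The self-join and pruning steps can be sketched as follows; this is an illustrative helper (the function name is my own) that reproduces the L3 to C4 example above:

from itertools import combinations

def gen_candidates(Lk, k):
    # Self-join Lk with itself, then prune candidates with an infrequent k-subset.
    Lk = set(map(frozenset, Lk))
    candidates = set()
    for a in Lk:
        for b in Lk:
            union = a | b
            if len(union) == k + 1:
                # prune: every k-subset of the candidate must be in Lk
                if all(frozenset(s) in Lk for s in combinations(union, k)):
                    candidates.add(union)
    return candidates

L3 = ['abc', 'abd', 'acd', 'ace', 'bcd']
print(gen_candidates(L3, 3))   # {frozenset({'a','b','c','d'})}: abcd kept, acde pruned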
Condensed Representation: Closed
Patterns and Max-Patterns
 A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns! (corresponding to the non-zero rows of a truth table of 100 variables)
 Solution: mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
 Closed patterns are a lossless compression of frequent patterns
 Used for reducing the number of patterns and rules
Closed Patterns and Max-Patterns
 Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
 Min_sup = 1.
 What is the set of closed itemsets?
 <a1, …, a100>: 1
 <a1, …, a50>: 2
 What is the set of max-patterns?
 <a1, …, a100>: 1
 What is the set of all patterns?
 Too many to enumerate: with min_sup = 1, every non-empty subset of {a1, …, a100} is frequent, i.e., 2^100 − 1 patterns!
Representing frequent patterns: Example

 All frequent itemsets, ∪k Lk, are listed below:
   {<BCE:2>, <AC:2>, <BC:2>, <BE:3>, <CE:2>, <A:2>, <B:3>, <C:3>, <E:3>}
 Association rules are generated from these patterns
 Maximal patterns, M = {<BCE:2>, <AC:2>}
 Closed patterns, C = {<BCE:2>, <AC:2>, <BE:3>, <C:3>}
 We can infer all patterns and their supports from the set of closed patterns, C.
 The support of a frequent itemset that is not in C is equal to the maximum support over all its closed super-patterns.
   E.g., sup(B) = max{sup(BCE), sup(BE)} = max{2, 3} = 3
   Similarly, sup(BC) = 2 and sup(CE) = 2
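The inference rule above is easy to express in code. A small sketch (the helper name and dictionary layout are my own), using the closed set C from this slide:

closed = {frozenset('BCE'): 2, frozenset('AC'): 2,
          frozenset('BE'): 3, frozenset('C'): 3}

def support_from_closed(itemset, closed):
    itemset = frozenset(itemset)
    if itemset in closed:
        return closed[itemset]
    # otherwise: maximum support over all closed super-patterns
    sups = [s for c, s in closed.items() if itemset <= c]
    return max(sups) if sups else 0

print(support_from_closed('B', closed))   # 3 = max{sup(BCE), sup(BE)}
print(support_from_closed('BC', closed))  # 2
print(support_from_closed('CE', closed))  # 2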
Association Rule Formation from a
Frequent Pattern

 Generate all non-empty proper subsets of the frequent itemset f; from each subset s, generate the rule s ⇒ (f - s), estimate its confidence, and check it against the minimum confidence.
 E.g., generating association rules from <BCE:2> with min. conf = 75%:

Rule Id | Rule    | Confidence  | Strong / weak
R1      | BC ⇒ E | 2/2 = 100%  | strong
R2      | BE ⇒ C | 2/3 = 67%   | weak
R3      | CE ⇒ B | 2/2 = 100%  | strong
R4      | B ⇒ CE | pruned because its parent rule (R2) is weak
R5      | E ⇒ BC | pruned because its parent rule (R2) is weak
R6      | C ⇒ BE | 2/3 = 67%   | weak
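A brute-force version of this rule-generation step (without the parent-rule pruning shown in the table) might look like the following sketch; the support dictionary simply restates the counts from the earlier slides:

from itertools import combinations

support = {frozenset('BCE'): 2, frozenset('BC'): 2, frozenset('BE'): 3,
           frozenset('CE'): 2, frozenset('B'): 3, frozenset('C'): 3,
           frozenset('E'): 3}

def rules_from_itemset(f, support, min_conf):
    f = frozenset(f)
    rules = []
    for r in range(1, len(f)):                 # every non-empty proper subset s
        for s in combinations(f, r):
            s = frozenset(s)
            conf = support[f] / support[s]     # conf(s => f - s) = sup(f) / sup(s)
            rules.append((set(s), set(f - s), conf, conf >= min_conf))
    return rules

for lhs, rhs, conf, strong in rules_from_itemset('BCE', support, 0.75):
    print(sorted(lhs), '=>', sorted(rhs), f'{conf:.0%}',
          'strong' if strong else 'weak')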
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
Further Improvement of the Apriori Method

 Major computational challenges
 Multiple scans of the transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce the number of transaction database scans
 Shrink the transaction database
 Shrink the number of candidates
 Facilitate support counting of candidates
Partition TDB: Scan Database Only Twice
 If TDB is too large to be memory resident, each scan in the Apriori algorithm requires a lot of I/O. Instead, TDB can be partitioned so that each partition fits in memory, and the Apriori algorithm finds the locally frequent itemsets separately in each partition.
 Rationale: any itemset that is globally frequent in TDB must be locally frequent in at least one of the partitions.
 Scan 1: partition the database, apply the Apriori algorithm to each partition separately, and find the locally frequent patterns
 Scan 2: consolidate the globally frequent patterns

DB1 + DB2 + … + DBk = TDB
If sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, supk(i) < σ|DBk|, then sup(i) < σ|TDB| (an itemset that is infrequent in every partition is infrequent globally).
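A rough sketch of the two-scan partition scheme, reusing the apriori() sketch given earlier (the partition sizing, threshold handling, and function names are illustrative assumptions):

def partitioned_mining(transactions, min_support_fraction, num_partitions):
    n = len(transactions)
    size = (n + num_partitions - 1) // num_partitions
    # Scan 1: mine each memory-sized partition with a proportional local threshold
    candidates = set()
    for start in range(0, n, size):
        part = transactions[start:start + size]
        local_min = max(1, int(min_support_fraction * len(part)))
        candidates.update(apriori(part, local_min).keys())
    # Scan 2: count the union of locally frequent itemsets over the whole TDB
    counts = {c: 0 for c in candidates}
    for t in transactions:
        t = frozenset(t)
        for c in candidates:
            if c <= t:
                counts[c] += 1
    global_min = min_support_fraction * n
    return {c: k for c, k in counts.items() if k >= global_min}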
Sampling for Frequent Patterns
1. Select a sample of the original database and mine the frequent patterns (FISS) within the sample using Apriori
2. Prepare a set of candidate itemsets containing FISS plus the itemsets on the border of the closure of FISS
   Example: include abcd as a candidate itemset if abc, acd, abd, and bcd are found to be frequent patterns in the sample
3. Scan the database once to count the support of the candidate itemsets generated in step 2 and identify frequent itemsets by thresholding
4. Scan the database again to find the support of possible extensions of the missed frequent patterns, if any
 This approach is very fast because it applies the Apriori algorithm only to a representative sample rather than the huge TDB, and it requires at most two scans of TDB.
DIC: Dynamic Itemset Counting

• TDB is partitioned into blocks marked by start points, and candidate itemsets can be added at any of these start points.
• The support count of a candidate itemset is finalized upon revisiting its start point while scanning the TDB.
• Reduces the number of scans.

Once both A and D gain the minimum required support, AD is introduced as a candidate itemset and the support counting of AD begins. Once all length-2 subsets of BCD are determined to be frequent, the counting of BCD begins.
DHP: Reduce the Number of Candidates

 In sparse transaction DBs, most items qualify as frequent items (L1) but get filtered out as members of larger itemsets (Lk). In a DB with hundreds of distinct items, |C1| and |L1| are almost equal, but |L2| is very small compared to |C2|. Hence the first scan also counts support for groups of 2-itemsets in a hash table.
 A 2-itemset whose corresponding hash bucket count is below the threshold cannot be frequent.

Hash table (bucket count | 2-itemsets hashed into the bucket):
35  | {ab, ad, ae}
88  | {bd, be, de}
…   | …
102 | {yz, qs, wt}

 Min sup count is 40.
 None of the 2-itemsets mapped into the bucket {ab, ad, ae} qualify as candidates, since the total count of that bucket (35) is below 40.
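A minimal sketch of the DHP hash-filtering idea for 2-itemsets (the bucket layout and function names are my own; a bucket's total count is an upper bound on the support of every pair hashed into it):

from itertools import combinations

def dhp_bucket_counts(transactions, num_buckets):
    # During the first scan, hash every 2-itemset of every transaction
    # into a bucket and accumulate the bucket counts.
    buckets = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            buckets[hash(pair) % num_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_count):
    # A 2-itemset can be a candidate only if its bucket total reaches min_count.
    return buckets[hash(tuple(sorted(pair))) % len(buckets)] >= min_count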
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
 Bottlenecks of the Apriori approach
 Breadth-first (i.e., level-wise) search
 Candidate generation and test
 Often generates a huge number of candidates
 The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
 Depth-first search
 Avoid explicit candidate generation
 Major philosophy: Grow long patterns from short ones using local
frequent items only
 “abc” is a frequent pattern
 Get all transactions having “abc”, i.e., project the DB on abc: DB|abc
 “d” is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
Construct FP-tree from a Transaction Database (min_support = 3)

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in descending frequency order to obtain the f-list: F-list = f-c-a-b-m-p (f:4, c:4, a:3, b:3, m:3, p:3)
3. Scan the DB again and construct the FP-tree

(Figure: the resulting FP-tree rooted at {}, with paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1, and c:1-b:1-p:1, plus a header table holding node-links for f, c, a, b, m, p.)
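The two scans and the ordered insertion can be sketched in Python as follows. This is an illustrative reconstruction (the class and variable names are my own), not the original implementation; note that frequency ties are broken alphabetically here, so the computed f-list may order c before f, while the slide fixes f-c-a-b-m-p; either ordering yields a valid FP-tree.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # None for the root
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions, min_count):
    # Pass 1: count single items and build the f-list (descending frequency)
    counts = Counter(item for t in transactions for item in set(t))
    f_list = [i for i, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
              if c >= min_count]
    rank = {item: r for r, item in enumerate(f_list)}
    # Pass 2: insert each transaction's frequent items in f-list order
    root = FPNode(None, None)
    header = {item: [] for item in f_list}    # item -> node-links
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)    # extend the node-link chain
            child.count += 1
            node = child
    return root, header, f_list

# The five transactions from this slide, min_support = 3
tdb = [set('facdgimp'), set('abcflmo'), set('bfhjow'), set('bcksp'), set('afcelpmn')]
root, header, f_list = build_fp_tree(tdb, 3)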
Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets according to the f-list
 F-list = f-c-a-b-m-p
 Patterns containing p
 Patterns having m but no p
 …
 Patterns having c but none of a, b, m, p
 Pattern f
 Completeness and non-redundancy are achieved
Find Patterns Having P From P-conditional Database

 Starting at each frequent item p in the header table of the FP-tree, follow the node-links of p and walk up the parent links of each of its nodes
 Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

(FP-tree and header table as on the previous slide.)

Conditional pattern bases:
item | conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
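Given the FP-tree and header table from the earlier construction sketch, the conditional pattern base of an item can be collected by following its node-links and walking up the parent pointers (a small illustrative helper, assuming the FPNode class defined previously):

def conditional_pattern_base(item, header):
    # Follow the node-links of `item`; for each node, walk up the parent
    # pointers to collect its transformed prefix path with the node's count.
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base('p', header))
# [(['c','f','a','m'], 2), (['c','b'], 1)], i.e. the slide's fcam:2 and cb:1
# (item order within a path follows this sketch's f-list)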
From Conditional Pattern-bases to Conditional FP-trees

 For the pattern base of each locally frequent item
 Construct the conditional FP-tree from the pattern base to grow the pattern
 Accumulate the count for each item in the base to identify extended patterns

Example for m:
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} - f:3 - c:3 - a:3 (b is dropped because it is not locally frequent)
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree
Starting from the m-conditional FP-tree ({} - f:3 - c:3 - a:3):
 Conditional pattern base of “am”: (fc:3) → am-conditional FP-tree: {} - f:3 - c:3
 Conditional pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} - f:3
 Conditional pattern base of “cam”: (f:3) → cam-conditional FP-tree: {} - f:3
A Special Case: Single Prefix Path in FP-tree

 Suppose a (conditional) FP-tree T has a shared single prefix path P
 Mining can be decomposed into two parts
 Reduction of the single prefix path into one node
 Concatenation of the mining results of the two parts

(Figure: an FP-tree whose top portion is the single prefix path a1:n1 - a2:n2 - a3:n3, below which the tree branches into subtrees b1:m1, C1:k1, C2:k2, C3:k3; the tree is decomposed into the single-path part r1 = a1:n1 - a2:n2 - a3:n3 plus the multi-branch part rooted at r1.)
Benefits of the FP-tree Structure

 Completeness
 Preserve complete information for frequent pattern
mining
 Never break a long pattern of any transaction
 Compactness
 Reduce irrelevant info—infrequent items are gone
 Items in frequency descending order: the more
frequently occurring, the more likely to be shared
 Never larger than the original database (not counting node-links and count fields)
The Frequent Pattern Growth Mining Method

 Idea: frequent pattern growth
 Recursively grow frequent patterns by pattern and database partition
 Method
 For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
 Repeat the process on each newly created conditional FP-tree
 Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
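The overall method can be sketched compactly by working directly on projected (conditional) databases instead of a compressed FP-tree; this keeps the divide-and-conquer logic of FP-growth while simplifying the data structure. An illustrative Python sketch (function names are my own), applied to the five-transaction TDB from the FP-tree slides:

from collections import Counter

def pattern_growth(transactions, min_count):
    # FP-growth-style divide and conquer over projected (conditional)
    # databases; each entry of cond_db is a pair (items, count).
    results = {}

    def mine(cond_db, suffix):
        counts = Counter()
        for items, cnt in cond_db:
            for it in items:
                counts[it] += cnt
        # local f-list: locally frequent items in descending frequency
        freq = sorted((it for it, c in counts.items() if c >= min_count),
                      key=lambda it: (-counts[it], it))
        # process items bottom-up; project each onto strictly more frequent items
        for pos in range(len(freq) - 1, -1, -1):
            item = freq[pos]
            pattern = suffix | {item}
            results[frozenset(pattern)] = counts[item]
            allowed = set(freq[:pos])
            projected = []
            for items, cnt in cond_db:
                if item in items:
                    kept = [it for it in items if it in allowed]
                    if kept:
                        projected.append((kept, cnt))
            if projected:
                mine(projected, pattern)    # recurse on the conditional database

    mine([(list(set(t)), 1) for t in transactions], frozenset())
    return results

# Five-transaction TDB from the FP-tree slides, min_support = 3
tdb = [set('facdgimp'), set('abcflmo'), set('bfhjow'), set('bcksp'), set('afcelpmn')]
patterns = pattern_growth(tdb, 3)
print(patterns[frozenset('fcam')])   # 3, i.e. fcam is frequent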
Performance of FPGrowth in Large Datasets

(Figure: FP-Growth vs. Apriori on data set T25I20D10K; run time in seconds plotted against the support threshold (%), for the series D1 FP-growth and D1 Apriori.)
Advantages of the Pattern Growth Approach
 Divide-and-conquer:
 Decompose both the mining task and DB according to the frequent
patterns obtained so far
 For huge TDBs the main memory may not be enough to hold the FP-tree in full. The TDB can then be partitioned into a set of projected databases along specific frequent items; the FP-growth algorithm is applied to each projection, and the patterns extracted are extended with the suffix representing that frequent item.
 Leads to focused search of smaller databases
 Other factors
 No candidate generation, no candidate test
 Compressed database: the FP-tree structure compresses dense TDBs roughly tenfold
 No repeated scan of the entire database
 Basic operations: counting local frequent items and building sub-FP-trees; no pattern search and matching
Chapter 5: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting? Pattern Evaluation Methods
Interestingness Measure: Correlations (Lift)
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall percentage of students eating cereal is 75% > 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) P(B))

            | Basketball | Not basketball | Sum (row)
Cereal      | 2000       | 1750           | 3750
Not cereal  | 1000       | 250            | 1250
Sum (col.)  | 3000       | 2000           | 5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
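The two lift values can be reproduced directly from the contingency table (a small illustrative calculation, variable names are my own):

n = 5000
n_b, n_c = 3000, 3750        # basketball, cereal
n_bc, n_b_notc = 2000, 1000  # basketball & cereal, basketball & not cereal

def lift(n_ab, n_a, n_b, n):
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(round(lift(n_bc, n_b, n_c, n), 2))          # 0.89 -> B and C negatively correlated
print(round(lift(n_b_notc, n_b, n - n_c, n), 2))  # 1.33 -> B and not-C positively correlated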
Are lift and χ² Good Measures of Correlation?

 "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
 Support and confidence measure co-occurrence and are not good indicators of correlation
 Other widely used interestingness measures:
 All_conf(A,B) = min{P(A|B), P(B|A)}; similarly, Max_conf(A,B) = max{P(A|B), P(B|A)}
 Kulczynski measure: Kulc(A,B) = (P(A|B) + P(B|A)) / 2
 Cosine(A,B) = sqrt(P(A|B) × P(B|A))
 A measure is null-invariant if its value is free from the influence of null-transactions. The above four measures are null-invariant, whereas lift and χ² are not.
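Expressed in terms of support counts, the four null-invariant measures are straightforward to compute; the sketch below (function names are my own) also makes visible why they ignore null-transactions: the total transaction count never appears.

from math import sqrt

# n_a = sup(A), n_b = sup(B), n_ab = sup(A and B)
def all_conf(n_ab, n_a, n_b):
    return n_ab / max(n_a, n_b)             # = min{P(A|B), P(B|A)}

def max_conf(n_ab, n_a, n_b):
    return n_ab / min(n_a, n_b)             # = max{P(A|B), P(B|A)}

def kulc(n_ab, n_a, n_b):
    return 0.5 * (n_ab / n_a + n_ab / n_b)  # average of the two confidences

def cosine(n_ab, n_a, n_b):
    return n_ab / sqrt(n_a * n_b)           # = sqrt(P(A|B) * P(B|A))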
Comparison of Interestingness Measures

Which Null-Invariant Measure Is Better?
 IR (Imbalance Ratio) measures the imbalance of two itemsets A and B in rule implications:
   IR(A,B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B))
 Datasets D4 through D6 are all neutral (Kulc = 0.5) even with a lot of variation in the individual frequencies of ‘m’ and ‘c’. Since their Kulczynski value is unaffected, it is recommended to use the Imbalance Ratio (IR) together with Kulczynski for extracting interesting patterns.
 D4 is balanced & neutral
 D5 is imbalanced & neutral
 D6 is very imbalanced & neutral
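A small illustrative calculation (the counts below are assumptions, chosen in the spirit of D4 and D5 so that Kulc stays at 0.5 while IR grows):

def kulc(n_ab, n_a, n_b):
    return 0.5 * (n_ab / n_a + n_ab / n_b)

def imbalance_ratio(n_a, n_b, n_ab):
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

# Balanced case: sup(m) = sup(c) = 2000, sup(mc) = 1000
print(kulc(1000, 2000, 2000), imbalance_ratio(2000, 2000, 1000))    # ~0.5  0.0
# Imbalanced case: sup(m) = 11000, sup(c) = 1100, sup(mc) = 1000
print(kulc(1000, 11000, 1100), imbalance_ratio(11000, 1100, 1000))  # ~0.5  ~0.89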
