Chapter 5
Data Mining: Concepts and Techniques
Chapter 5: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
What Is Frequent Pattern Analysis?
Imagine that you are a sales manager at AllElectronics, and you are
talking to a customer who recently bought a PC and a digital camera
from the store. What should you recommend to her next?
Frequent patterns and association rules are the knowledge that you want to mine in such a scenario.
Importance:
Identifying relationships between purchases
Enhancing product recommendations
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, a subsequence, a substructure, etc.) that occurs frequently in a data set.
Market Basket Analysis: A Motivating Example
Suppose, as manager of an AllElectronics branch, you would like to
learn more about the buying habits of your customers
Market basket analysis may help you design different store layouts
Basic Concepts: Frequent Patterns

Transaction ID   Items
1                {A, C, D}
2                {B, C, E}
3                {A, B, C, E}
4                {B, E}
5                {A, B, C, E}
Basic Concepts: Frequent Patterns
Let min support = 2/5.

Itemset  Support  F/I     Itemset  Support  F/I
A        3/5      F       ABC      2/5      F
B        4/5      F       ABD      0/5      I
C        4/5      F       ABE      2/5      F
D        1/5      I       ACD      1/5      I
E        4/5      F       ACE      2/5      F
AB       2/5      F       ADE      0/5      I
AC       3/5      F       BCD      0/5      I
AD       1/5      I       BCE      3/5      F
AE       2/5      F       BDE      0/5      I
BC       3/5      F       CDE      0/5      I
BD       0/5      I       ABCD     0/5      I
BE       4/5      F       ABCE     2/5      F
CD       1/5      I       ABDE     0/5      I
CE       3/5      F       ACDE     0/5      I
DE       0/5      I       BCDE     0/5      I
                          ABCDE    0/5      I
Basic Concepts: Frequent Patterns
Frequent itemsets = 15
Infrequent itemsets = 16
Basic Concepts: Association Rules

Tid   Items bought
10    Bread, Nuts, Diaper
20    Bread, Coffee, Diaper
30    Bread, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X => Y with minimum support and confidence:
support, s: the probability that a transaction contains X union Y
confidence, c: the conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%.
Frequent patterns: Bread:3, Nuts:3, Diaper:4, Eggs:3, {Bread, Diaper}:3
Association rules (many more exist!):
Bread => Diaper (support 60%, confidence 100%)
Diaper => Bread (support 60%, confidence 75%)
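As a quick illustration of these definitions, the following Python sketch recomputes the support and confidence of the two rules above from the five transactions in the table; the helper functions support and confidence are our own names, not from any particular library.

transactions = [
    {"Bread", "Nuts", "Diaper"},                       # Tid 10
    {"Bread", "Coffee", "Diaper"},                      # Tid 20
    {"Bread", "Diaper", "Eggs"},                        # Tid 30
    {"Nuts", "Eggs", "Milk"},                           # Tid 40
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},       # Tid 50
]

def support(itemset, db):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # Conditional probability that a transaction containing lhs also contains rhs.
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"Bread", "Diaper"}, transactions))       # 0.6  -> s = 60%
print(confidence({"Bread"}, {"Diaper"}, transactions))   # 1.0  -> c = 100%
print(confidence({"Diaper"}, {"Bread"}, transactions))   # 0.75 -> c = 75%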
Association Rule Mining Process
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count (min_sup).
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy both minimum support and minimum confidence.
The Downward Closure Property and Scalable Mining Methods
The downward closure property of frequent patterns: any subset of a frequent itemset must also be frequent.
For example, if {Milk, Bread, Butter} is frequent, then so is each of its subsets:
• Milk
• Bread
• Butter
• Milk, Bread
• Milk, Butter
• Bread, Butter
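The property is easy to check empirically. The sketch below uses a few toy transactions of our own (purely illustrative) and verifies that every subset of {Milk, Bread, Butter} occurs at least as often as the full itemset.

from itertools import combinations

db = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Butter", "Eggs"},
    {"Milk", "Bread"},
    {"Bread", "Butter"},
]

def count(itemset):
    return sum(set(itemset) <= t for t in db)

full = ("Milk", "Bread", "Butter")
print(full, count(full))                 # 2
for k in (1, 2):
    for sub in combinations(full, k):
        # Every subset occurs at least as often as the full itemset.
        assert count(sub) >= count(full)
        print(sub, count(sub))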
A major challenge in mining frequent itemsets from a large data set is that such mining often generates a huge number of itemsets satisfying min_sup, especially when min_sup is set low, because every subset of a frequent itemset is also frequent. A frequent itemset of length k has 2^k - 1 nonempty subsets; the total number of frequent itemsets that it contains is thus exponential in its length (for k = 100, about 1.27 x 10^30).
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns. Closed and maximal patterns give a more compact representation:
An itemset X is a closed frequent itemset in a data set D if X is frequent and there exists no proper super-itemset Y of X with the same support as X.
An itemset X is a maximal frequent itemset in D if X is frequent and there exists no frequent super-itemset Y of X.
Maximal vs. Closed Itemsets
Every maximal frequent itemset is a closed frequent itemset, and every closed frequent itemset is frequent:
maximal frequent itemsets are contained in the closed frequent itemsets, which are contained in the frequent itemsets.
EXAMPLE
Let’s take an example. Suppose we have a database
with four customer transactions, denoted as T1, T2,
T3 and T4:
T1: {a,b,c,d}
T2: {a,b,c}
T3: {a,b,d}
T4: {a,b}
With a minimum support of two transactions (minsup = 2), the frequent itemsets are:
{a},
{b},
{c},
{d},
{a,b},
{a,c},
{a,d},
{b,c},
{b,d},
{a,b,c},
{a,b,d}
The closed itemsets are:
{a,b} (support 4),
{a,b,c} (support 2),
{a,b,d} (support 2),
{a,b,c,d} (support 1).
With minsup = 2, the frequent closed itemsets are therefore {a,b}, {a,b,c}, and {a,b,d}.
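A brute-force way to verify the closed itemsets above (practical only for a tiny database such as this one; helper names are ours) is to enumerate every itemset and keep those that have no proper superset with the same support.

from itertools import combinations

db = [{"a","b","c","d"}, {"a","b","c"}, {"a","b","d"}, {"a","b"}]   # T1..T4
items = sorted(set().union(*db))

def support(itemset):
    return sum(itemset <= t for t in db)

# Enumerate every nonempty itemset that occurs at least once, with its support.
all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]
supp = {s: support(s) for s in all_itemsets if support(s) > 0}

# Closed: no proper superset has the same support.
closed = [s for s in supp
          if not any(s < t and supp[t] == supp[s] for t in supp)]
for s in sorted(closed, key=len):
    print(set(s), supp[s])    # {a,b}:4, {a,b,c}:2, {a,b,d}:2, {a,b,c,d}:1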
Find the Maximal Frequent Itemsets

Transaction ID   Items
1                ABCD
2                ABD
3                ACD
4                BCD
Itemset   Support count (number of transactions containing the itemset)
A         3
B         3
C         3
D         4
AB        2
AC        2
AD        3
BC        2
BD        3
CD        3
ABC       1
ABD       2
ACD       2
BCD       2
ABCD      1
With a minimum support count of 2, the frequent itemsets are A, B, C, D, AB, AC, AD, BC, BD, CD, ABD, ACD, and BCD. The maximal frequent itemsets are ABD, ACD, and BCD, since none of them has a frequent superset (ABCD appears only once).
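To double-check that answer, the following brute-force sketch (toy data only; a minimum support count of 2 is assumed, consistent with the frequent itemsets listed above) keeps the frequent itemsets that have no frequent proper superset.

from itertools import combinations

db = [set("ABCD"), set("ABD"), set("ACD"), set("BCD")]
min_count = 2    # assumed minimum support count

items = sorted(set().union(*db))

def count(itemset):
    return sum(itemset <= t for t in db)

all_sets = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)]
frequent = {s for s in all_sets if count(s) >= min_count}

# Maximal: frequent itemsets with no frequent proper superset.
maximal = [s for s in frequent if not any(s < t for t in frequent)]
print(sorted("".join(sorted(s)) for s in maximal))    # ['ABD', 'ACD', 'BCD']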
Why Closed and Maximal Frequent Itemsets?
Closed itemsets are important because they can reduce the
number of frequent itemsets presented to the user, without
losing any information.
Frequent itemsets can be very large and redundant, especially
when the minsup is low.
By mining closed itemsets, generally only a very small set of
itemsets is obtained, and still all the other frequent itemsets can
be directly derived from the closed itemsets.
Maximal frequent itemsets provide a compact representation
of all the frequent itemsets for a given dataset and minimum
support threshold.
Also discovering maximal frequent itemsets can be faster and
require less memory and storage space than finding all the
frequent itemsets.
We can derive all the other frequent itemsets from the maximal frequent itemsets, but we may not know their exact support counts.
Chapter 5: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
Scalable Frequent Itemset Mining Methods
Apriori: a candidate generation-and-test approach
Frequent pattern growth (FP-growth): a pattern-growth approach
Mining with the vertical data format
Apriori Algorithm
Apriori employs an iterative approach known as a
level-wise search, where k-itemsets are used to
explore (k +1)-itemsets.
To improve the efficiency of the level-wise
generation of frequent itemsets, an important
property called the Apriori property is used to
reduce the search space.
Apriori property: All nonempty subsets of a
frequent itemset must also be frequent.
This property belongs to a special category of
properties called antimonotonicity in the sense
that if a set cannot pass a test, all of its
supersets will fail the same test as well.
How Is the Apriori Property Used in the Algorithm?
A two-step process is followed, consisting of join and prune actions:
Join step: Lk-1 is joined with itself to generate a set of candidate k-itemsets, Ck.
Prune step: any candidate k-itemset that has an infrequent (k-1)-subset is removed from Ck, since it cannot be frequent.
Apriori: A Candidate Generation & Test Approach
Example (final step of the level-wise search), third scan of the database: C3 = {{B, C, E}}; L3 = {{B, C, E}} with support count 2.
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
The set of frequent 1-itemsets, L1, can then be
determined. It consists of the candidate 1-itemsets
satisfying minimum support.
To discover the set of frequent 2-itemsets, L2, the
algorithm uses the join L1 ⋈ L1 to generate a candidate
set of 2-itemsets, C2.
No candidates are removed from C2 during the prune
step because each subset of the
candidates is also frequent.
The set of frequent 2-itemsets, L2, is then determined,
consisting of those candidate 2-itemsets in C2 having
minimum support.
The generation of the set of the candidate 3-itemsets, C3.
From the join step, we first get C3 =L2 ⋈ L2 ={{I1, I2, I3},
{I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Based on the Apriori property that all subsets of a
frequent itemset must also be frequent, we can determine
that the four latter candidates cannot possibly be
frequent.
We therefore remove them from C3, thereby saving the
effort of unnecessarily obtaining their counts during the
subsequent scan of D to determine L3.
The transactions in D are scanned to determine L3,
consisting of those candidate 3-itemsets in C3 having
minimum support
The algorithm uses L3 ⋈ L3 to generate a candidate set
of 4-itemsets, C4. Although the join results in {{I1, I2, I3,
I5}}, itemset {I1, I2, I3, I5} is pruned because its subset
{I2, I3, I5} is not frequent. Thus, C4 = {}, and the algorithm
terminates, having found all of the frequent itemsets.
Generating Association Rules from Frequent
Itemsets
Once the frequent itemsets from transactions in database
D have been found, it is straightforward to generate
strong association rules from them (where strong
association rules satisfy both minimum support and
minimum confidence)
Based on the confidence equation, confidence(A => B) = support_count(A union B) / support_count(A), association rules can be generated as follows. Consider the frequent itemset X = {I1, I2, I5}; its nonempty proper subsets yield the candidate rules:
I1 ∧ I2 => I5 (confidence 2/4 = 50%)
I1 ∧ I5 => I2 (confidence 2/2 = 100%)
I2 ∧ I5 => I1 (confidence 2/2 = 100%)
I1 => I2 ∧ I5 (confidence 2/6 = 33%)
I2 => I1 ∧ I5 (confidence 2/7 = 29%)
I5 => I1 ∧ I2 (confidence 2/2 = 100%)
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong.
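The rule-generation step translates into a few lines of Python: for each frequent itemset, every nonempty proper subset becomes a candidate antecedent, and the rule is kept if its confidence meets the threshold. The support counts below are derived from the example database; the function name rules_from is ours.

from itertools import combinations

# Support counts from the example database D.
supp = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

def rules_from(itemset, min_conf):
    itemset = frozenset(itemset)
    for k in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, k)):
            rhs = itemset - lhs
            conf = supp[itemset] / supp[lhs]      # confidence(lhs => rhs)
            if conf >= min_conf:
                yield set(lhs), set(rhs), conf

for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, min_conf=0.70):
    print(lhs, "=>", rhs, "confidence = {:.0%}".format(conf))
# Prints the three strong rules (each with confidence 100%):
# {I5} => {I1, I2}, {I1, I5} => {I2}, {I2, I5} => {I1}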
The Apriori Algorithm (Pseudo-Code)

L1 = {frequent items};
for (k = 1; Lk != {}; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return the union of all Lk;
Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
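The self-join and prune steps translate directly into Python. The sketch below (the function name apriori_gen follows the textbook's name for the procedure; the code itself is our own sketch) reproduces the example: from L3 = {abc, abd, acd, ace, bcd} the join produces abcd and acde, and the prune step removes acde, leaving C4 = {abcd}.

from itertools import combinations

def apriori_gen(Lk):
    # Generate C(k+1) from the frequent k-itemsets Lk.
    Lk = sorted(tuple(sorted(s)) for s in Lk)
    candidates = []
    # Join step: merge itemsets that agree on their first k-1 items.
    for i, p in enumerate(Lk):
        for q in Lk[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    # Prune step: drop candidates having an infrequent k-subset.
    Lk_set = set(Lk)
    return [c for c in candidates
            if all(s in Lk_set for s in combinations(c, len(c) - 1))]

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))    # [('a', 'b', 'c', 'd')]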
How to Count Supports of Candidates?
The total number of candidates can be huge and a single transaction may contain many candidates, so candidate itemsets are typically stored in a hash tree, and a subset function finds all candidates contained in a given transaction.
Counting Supports of Candidates Using a Hash Tree
(Figure: candidate 3-itemsets such as 1 4 5, 1 2 4, 4 5 7, 1 2 5, 4 5 8, 1 5 9, 1 3 6, 2 3 4, 5 6 7, 3 4 5, 3 5 6, 3 5 7, 6 8 9, 3 6 7, 3 6 8 are stored in the leaves of a hash tree, with items hashed into the branches 1,4,7 / 2,5,8 / 3,6,9 at each level. The subset function decomposes the transaction 1 2 3 5 6 into 1+2356, 12+356, 13+56, and so on, so that only the leaves whose candidates could be contained in the transaction are visited.)
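The hash-tree data structure is hard to show compactly, so the sketch below counts candidate supports the naive way, by testing each candidate against each transaction. It is a functional stand-in for the subset function above, not the hash-tree implementation itself; the data and names are illustrative.

def count_supports(candidates, db):
    # Return a dict mapping each candidate itemset to its support count.
    counts = {frozenset(c): 0 for c in candidates}
    for transaction in db:
        t = set(transaction)
        for c in counts:
            if c <= t:            # candidate is contained in the transaction
                counts[c] += 1
    return counts

db = [{1, 2, 3, 5, 6}, {1, 4, 5}, {2, 3, 5}]      # toy transactions
C3 = [{1, 2, 3}, {2, 3, 5}, {1, 3, 5}]            # toy candidate 3-itemsets
print(count_supports(C3, db))    # {1,2,3}: 1, {2,3,5}: 2, {1,3,5}: 1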
Candidate Generation: An SQL Implementation
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and ... and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1;

Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck;

Use object-relational extensions like UDFs, BLOBs, and table functions for efficient implementation. [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD '98.]
Scalable Frequent Itemset Mining Methods
Improving the efficiency of Apriori, several variations reduce the number of scans or candidates:
Hash-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting (DIC)
Hash-based technique
Key idea: while scanning the database to count the candidate 1-itemsets in C1, also hash every 2-itemset of each transaction into a bucket of a hash table and increment the bucket count. A 2-itemset whose bucket count is below the minimum support cannot be frequent, so it is removed from the candidate set C2.
Hash-based technique: example

C1:
Itemset   Count
I1        6
I2        7
I3        6
I4        2
I5        2

Hash table for the 2-itemsets of all transactions:
Bucket 0 (count 2): {I1, I4}, {I3, I5}
Bucket 1 (count 2): {I1, I5}, {I1, I5}
Bucket 2 (count 4): {I2, I3}, {I2, I3}, {I2, I3}, {I2, I3}
Bucket 3 (count 2): {I2, I4}, {I2, I4}
Bucket 4 (count 2): {I2, I5}, {I2, I5}
Bucket 5 (count 4): {I1, I2}, {I1, I2}, {I1, I2}, {I1, I2}
Bucket 6 (count 4): {I1, I3}, {I1, I3}, {I1, I3}, {I1, I3}
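The bucket counting can be sketched as follows. The hash function, h(x, y) = (10 * order(x) + order(y)) mod 7, is an assumption on our part (it is the one commonly used for this example and it reproduces the bucket counts in the table above); the rest of the code simply hashes every 2-itemset of every transaction while the first scan runs.

from itertools import combinations

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"],
      ["I1","I3"], ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"],
      ["I1","I2","I3"]]
order = {"I1": 1, "I2": 2, "I3": 3, "I4": 4, "I5": 5}

def bucket(x, y):
    # Assumed hash function: h(x, y) = (10 * order(x) + order(y)) mod 7.
    x, y = sorted((x, y), key=order.get)
    return (10 * order[x] + order[y]) % 7

bucket_count = [0] * 7
for t in db:
    for x, y in combinations(sorted(t, key=order.get), 2):
        bucket_count[bucket(x, y)] += 1
print(bucket_count)    # [2, 2, 4, 2, 2, 4, 4], matching the table above

# A 2-itemset can be frequent only if its bucket count reaches min_sup,
# so low-count buckets let us prune candidates before the second scan.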
Transaction reduction
Key idea: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be marked or removed from further database scans.
Transaction reduction: example (min. support count = 2)

Transaction   Items
T1            I1, I2, I5
T2            I2, I3, I4
T3            I3, I4
T4            I1, I2, I3, I4

As a bit matrix:
      I1   I2   I3   I4   I5
T1    1    1    0    0    1
T2    0    1    1    1    0
T3    0    0    1    1    0
T4    1    1    1    1    0
After the first scan, I5 is found in only one transaction (below the minimum support count of 2), so it is dropped, shrinking the data that later scans must process:
      I1   I2   I3   I4
T1    1    1    0    0
T2    0    1    1    1
T3    0    0    1    1
T4    1    1    1    1
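A minimal sketch of the key idea stated above (helper names are ours): after the frequent k-itemsets are known, transactions that contain none of them are skipped when counting (k+1)-itemsets. It is shown on the same four transactions, with the single frequent 3-itemset at min. support count 2.

from itertools import combinations

def reduce_db(db, Lk, k):
    # Keep only transactions containing at least one frequent k-itemset;
    # the others cannot contain any frequent (k+1)-itemset.
    Lk = {frozenset(s) for s in Lk}
    return [t for t in db
            if any(frozenset(c) in Lk for c in combinations(sorted(t), k))]

db = [{"I1","I2","I5"}, {"I2","I3","I4"}, {"I3","I4"}, {"I1","I2","I3","I4"}]
L3 = [{"I2","I3","I4"}]            # the only frequent 3-itemset at min count 2
print(reduce_db(db, L3, 3))        # only T2 and T4 remain for the next scan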
Partitioning
Key idea: any itemset that is frequent in the whole database D must be frequent in at least one of the partitions of D (with the same fractional minimum support). In a first scan, D is divided into partitions that fit in memory and the local frequent itemsets of each partition are mined; in a second scan, the actual support of these candidates is counted over all of D. Only two database scans are needed.
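A sketch of the two-phase idea in Python, assuming a helper local_frequent_itemsets that mines one in-memory partition (here a brute-force toy miner; any frequent-itemset miner would do). Phase 1 collects local frequent itemsets from each partition; phase 2 rescans the full database once to keep the globally frequent ones.

from itertools import combinations

def local_frequent_itemsets(partition, min_fraction):
    # Toy in-memory miner: brute force over a small partition.
    items = sorted(set().union(*partition))
    out = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(set(c) <= t for t in partition) >= min_fraction * len(partition):
                out.add(frozenset(c))
    return out

def partition_apriori(db, n_parts, min_fraction):
    size = (len(db) + n_parts - 1) // n_parts
    candidates = set()
    for i in range(0, len(db), size):                  # phase 1: local mining
        candidates |= local_frequent_itemsets(db[i:i + size], min_fraction)
    # Phase 2: one full scan keeps only the globally frequent candidates.
    return {c for c in candidates
            if sum(c <= t for t in db) >= min_fraction * len(db)}

db = [{"a","b"}, {"a","c"}, {"b","c"}, {"a","b","c"}]
print(partition_apriori(db, n_parts=2, min_fraction=0.5))
# the six globally frequent itemsets: a, b, c, ab, ac, bc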
Sampling
Key idea: mine a random sample S of D, using a support threshold lower than min_sup to reduce the chance of missing globally frequent itemsets, and then verify the result against the full database.

Dynamic itemset counting (DIC)
Key idea: new candidate itemsets can be added at any start point during a database scan, as soon as all of their subsets are estimated to be frequent. In the itemset lattice, a box marks an itemset whose counter has reached minsup and a circle one whose counter has not; dashed shapes are still being counted, solid shapes are finished.

Example (minsup = 1). Initially, counters: A = 0, B = 0, C = 0. The empty itemset is marked with a solid box; all 1-itemsets are marked with dashed circles.
DIC: processing the transactions

After M transactions are read:
Counters: A = 2, B = 1, C = 0, AB = 0.
A and B change to dashed boxes because their counters have reached minsup (1). A counter for AB is added because both of its subsets are boxes.

After 2M transactions are read:
Counters: A = 2, B = 2, C = 1, AB = 0, AC = 0, BC = 0.
C changes to a dashed box because its counter has reached minsup. A, B, and C have now been counted all the way through, so we stop counting them and make their boxes solid. Counters for AC and BC are added because all of their subsets are boxes.

After further transactions are read:
Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 0.
AB has been counted all the way through and its counter satisfies minsup, so we change it to a solid box.

Finally:
Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 1.
BC changes to a dashed box; AC and BC are then counted all the way through. ABC is never counted because one of its subsets is a circle. There are no dashed itemsets left, so the algorithm is done.
Frequent Pattern Growth
The Apriori candidate generate-and-test method significantly reduces the size of the candidate sets. However, it may still need to scan the whole database repeatedly and check a large number of candidate itemsets. Frequent pattern growth (FP-growth) avoids explicit candidate generation altogether.
Example database D (scanned once, as in Apriori, to derive the frequent items, i.e. the 1-itemsets, and their support counts):

TID    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Item support counts: I1:6, I2:7, I3:6, I4:2, I5:2. Sorted in descending order of support, the list of frequent items is L = {I2:7, I1:6, I3:6, I4:2, I5:2}.
Each transaction's items are then reordered according to L and inserted into the FP-tree, so that transactions sharing a prefix share a path:

TID    Ordered frequent items
T100   I2, I1, I5
T200   I2, I4
T300   I2, I3
T400   I2, I1, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I2, I1, I3, I5
T900   I2, I1, I3

Resulting FP-tree (item:count; indentation shows parent-child links), with a header table {I2:7, I1:6, I3:6, I4:2, I5:2} linking all occurrences of each item:

null (root)
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2
FP-Growth Method: Construction of FP-Tree
Mining frequent patterns from the FP-tree proceeds bottom-up: starting from each frequent item (used as a suffix), build its conditional pattern base and conditional FP-tree, and mine that tree recursively, growing longer patterns by concatenation.
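The construction step can be sketched compactly in Python (the mining step is omitted; class and function names are ours): each transaction is filtered to its frequent items, sorted by descending global support, and inserted into a prefix tree whose node counts are incremented along the path.

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(db, min_count):
    freq = Counter(item for t in db for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Rank items by descending support (ties broken alphabetically).
    rank = {i: r for r, (i, _) in enumerate(
        sorted(freq.items(), key=lambda x: (-x[1], x[0])))}
    root = FPNode(None)
    for t in db:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + "{}:{}".format(child.item, child.count))
        show(child, depth + 1)

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
      ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
show(build_fp_tree(db, min_count=2))
# Prints the tree sketched above: I2:7 (with I1:4, I4:1, I3:2 below it) and I1:2 -> I3:2.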
Why Is Frequent Pattern Growth Fast?
Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
Reasoning:
No candidate generation, no candidate test
Uses a compact data structure (the FP-tree)
Eliminates repeated database scans
The basic operations are counting and FP-tree building
Challenges in FP Growth
Complexity in tree construction for
large datasets.
FP-Tree is not dynamic (must rebuild if
data changes).
Deep recursive calls (can lead to
memory issues).
Mining Frequent Itemsets Using the Vertical Data Format
Horizontal vs. Vertical Data Format
Horizontal Format: Transactions are stored as rows, each
containing a set of items.
Vertical Format: Each item is stored with its corresponding
transaction IDs (TIDs).
Example:
Horizontal Format:
T1: {A, B, C}
T2: {A, C, D}
Vertical Format:
A: {T1, T2}
B: {T1}
C: {T1, T2}
D: {T2}
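This small sketch (helper structure names are ours) converts the horizontal example above into the vertical TID-set format and computes a 2-itemset's support by intersecting TID sets, which is the core operation of vertical-format (ECLAT-style) mining.

from collections import defaultdict

horizontal = {"T1": {"A", "B", "C"}, "T2": {"A", "C", "D"}}

# Horizontal -> vertical: map each item to the set of TIDs containing it.
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)
print(dict(vertical))           # A:{T1,T2}, B:{T1}, C:{T1,T2}, D:{T2}

# Support of {A, C} = size of the intersection of the two TID sets.
tids_AC = vertical["A"] & vertical["C"]
print(tids_AC, len(tids_AC))    # {'T1', 'T2'} 2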
Advantages of Vertical Data Format
The support count of an itemset is simply the length of its TID set, so counting does not require repeated scans of the database; frequent (k+1)-itemsets are found by intersecting the TID sets of frequent k-itemsets.
Apriori Property in Vertical Format
A candidate (k+1)-itemset is generated only if all of its k-itemset subsets are frequent.
Example: {I1, I2, I3} is a candidate because {I1, I2}, {I1, I3}, and {I2, I3} are all frequent.
Steps with Example (min_sup = 2)
First, the horizontal database is transformed into the vertical format by scanning it once:

Itemset   TID set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}

The candidate 2-itemsets are then obtained by intersecting the TID sets of every pair of frequent single items.
There are 10 intersections performed in total, which lead to eight nonempty 2-itemsets:
{I1, I2}: {T100, T400, T800, T900}
{I1, I3}: {T500, T700, T800, T900}
{I1, I4}: {T400}
{I1, I5}: {T100, T800}
{I2, I3}: {T300, T600, T800, T900}
{I2, I4}: {T200, T400}
{I2, I5}: {T100, T800}
{I3, I5}: {T800}
Notice that because the itemsets {I1, I4} and {I3, I5} each contain only one transaction, they do not belong to the set of frequent 2-itemsets.
Based on the Apriori property, a given 3-itemset is a candidate 3-
itemset only if every one of its 2-itemset subsets is frequent.
The candidate generation process here will generate only two 3-itemsets: {I1, I2, I3} and
{I1, I2, I5}.
Optimization Techniques
Diffsets: store only the difference between an itemset's TID set and that of its prefix, instead of the full TID sets.
Example:
I1: {T100, T400, T500, T700, T800, T900}
{I1, I2}: {T100, T400, T800, T900}
diffset({I1, I2}, {I1}) = {T500, T700}, so only two TIDs need to be stored instead of four.
Summary