Module 5 - Frequent Pattern Mining
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Association Rules
• An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩Y = ∅
• Example: {Milk, Biscuit} → {FruitJuice}
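To see why brute force is prohibitive, the procedure can be sketched in a few lines of Python. The four-transaction database below is hypothetical, chosen to match the example items above; the sketch enumerates every candidate itemset and every binary split of it into a rule X → Y (for d items there are on the order of 3^d candidate rules).

from itertools import combinations

# Hypothetical four-transaction basket database (names for illustration only).
db = [
    {"Milk", "Biscuit", "FruitJuice"},
    {"Milk", "Biscuit"},
    {"Biscuit", "FruitJuice"},
    {"Milk", "FruitJuice"},
]
items = sorted(set().union(*db))
minsup, minconf = 0.5, 0.6

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

# Enumerate every itemset of size >= 2, then every split X -> Y of it.
for r in range(2, len(items) + 1):
    for itemset in map(frozenset, combinations(items, r)):
        sup = support(itemset)
        if sup == 0:
            continue
        for k in range(1, len(itemset)):
            for x in map(frozenset, combinations(sorted(itemset), k)):
                conf = sup / support(x)
                if sup >= minsup and conf >= minconf:
                    print(sorted(x), "->", sorted(itemset - x),
                          f"sup={sup:.2f}, conf={conf:.2f}")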
Association rule mining is a two-step process:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Apriori property:
All subsets of a frequent itemset must be frequent.
Equivalently, if an itemset is infrequent, all its supersets must be infrequent.
Exercise 1
TID List of Item IDs
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1, I2, I4
T500 I1, I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1, I2, I3
• Step-1: K=1. Scan the dataset and count each item; the items with support count ≥ min_support form L1.
L1:
Itemset  Support
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2
• Step-2: K=2
• Generate candidate set C2 by joining L1 with itself (the join step).
• Check whether all subsets of each candidate are frequent; if not, remove that candidate. (For example, the subsets of {I1, I2} are {I1} and {I2}, both frequent. Check each candidate this way.)
• Now find the support count of each remaining candidate by scanning the dataset.
C2:
Itemset    Support
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0
• Compare each candidate's support count in C2 with the minimum support count (here min_support = 2); remove the candidates whose support count is below min_support. This gives us L2.
L2:
Itemset    Support
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
• Step-3:
– Generate candidate set C3 from L2 (join step).
– Joining L2 with itself yields {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
– Check whether all subsets of each candidate are frequent; if not, remove it. (The subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Check every candidate the same way.)
– Find the support count of the remaining candidates by scanning the dataset.
• Compare each candidate's support count in C3 with the minimum support count (min_support = 2) and remove the candidates below it. This gives us L3.
L3:
Itemset        Support
{I1, I2, I3}   2
{I1, I2, I5}   2
• Step-4:
– Generate candidate set C4 from L3 (join step).
– Check whether all subsets are frequent. (The only itemset formed by joining L3 is {I1, I2, I3, I5}, whose subset {I1, I3, I5} is not frequent.) So C4 is empty.
– We stop here because no further frequent itemsets can be found.
Thus we have discovered all frequent itemsets; now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence –
A confidence of 50% means that 50% of the customers who purchased milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
We demonstrate rule generation using one frequent itemset as an example.
Itemset {I1, I2, I3} // from L3. The candidate rules are:
[I1^I2] => [I3]: confidence = sup(I1^I2^I3) / sup(I1^I2) = 2/4 = 50%
[I1^I3] => [I2]: confidence = sup(I1^I2^I3) / sup(I1^I3) = 2/4 = 50%
[I2^I3] => [I1]: confidence = sup(I1^I2^I3) / sup(I2^I3) = 2/4 = 50%
[I1] => [I2^I3]: confidence = sup(I1^I2^I3) / sup(I1) = 2/6 = 33%
[I2] => [I1^I3]: confidence = sup(I1^I2^I3) / sup(I2) = 2/7 ≈ 28.6%
[I3] => [I1^I2]: confidence = sup(I1^I2^I3) / sup(I3) = 2/6 = 33%
So if the minimum confidence is 50%, the first 3 rules can be considered strong association rules.
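The same computation can be scripted. A minimal Python sketch, with the support counts of {I1, I2, I3} and its subsets hard-coded from the tables above (the function name rules_from_itemset is my own):

from itertools import combinations

# Support counts taken from the worked example above.
support = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4, frozenset({"I1", "I2", "I3"}): 2,
}

def rules_from_itemset(itemset, support, min_conf=0.5):
    """Yield every rule X => Y with X ∪ Y = itemset and confidence >= min_conf."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            conf = support[itemset] / support[antecedent]
            if conf >= min_conf:
                yield sorted(antecedent), sorted(itemset - antecedent), conf

for lhs, rhs, conf in rules_from_itemset({"I1", "I2", "I3"}, support):
    print(lhs, "=>", rhs, f"confidence = {conf:.0%}")
# Prints exactly the three 50% rules found above.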
• C3 ={{I1, I2, I3}, {I1, I2, I5}} after pruning.
• Suppose the data contain the frequent itemset L = {I1, I2, I5}.
• What are the association rules that can be generated from L?
• The nonempty proper subsets of L are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
• The resulting association rules, each listed with its confidence, are:
– I1 ^ I2 => I5, confidence = 2/4 = 50%
– I1 ^ I5 => I2, confidence = 2/2 = 100%
– I2 ^ I5 => I1, confidence = 2/2 = 100%
– I1 => I2 ^ I5, confidence = 2/6 = 33%
– I2 => I1 ^ I5, confidence = 2/7 ≈ 28.6%
– I5 => I1 ^ I2, confidence = 2/2 = 100%
• If the minimum confidence threshold is 50%, then only the first, second, third, and last rules are output, because these are the only ones generated that are strong.
The Apriori Algorithm—An Example
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E
The Apriori Algorithm—An Example
Supmin = 2

1st scan → C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
Prune with Supmin → L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

Join L1 with itself → C2: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 with counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
Prune → L2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
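A runnable Python rendering of this pseudocode, as a sketch (the function name and the use of an absolute min_support count are my choices; candidate generation and pruning follow the join/prune steps detailed below):

from itertools import combinations

def apriori(transactions, min_support=2):
    """Level-wise Apriori: returns {frozenset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                       # 1st scan: count single items
        for item in t:
            key = frozenset({item})
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}   # L1
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join step: merge frequent k-itemsets that differ in one item.
        keys = list(Lk)
        candidates = {keys[i] | keys[j]
                      for i in range(len(keys)) for j in range(i + 1, len(keys))
                      if len(keys[i] | keys[j]) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Scan the database once to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

db = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
for itemset, sup in sorted(apriori(db).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)

Running this on the four-transaction example above with min_support = 2 reproduces L1 and L2, plus the frequent 3-itemset {B, C, E} with support 2.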
Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd}
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
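The same join-and-prune logic in Python (a sketch; generate_candidates is a hypothetical name), reproducing the L3 → C4 example above:

from itertools import combinations

def generate_candidates(L_prev, k):
    """Self-join L_{k-1} (sorted tuples agreeing on the first k-2 items),
    then prune candidates that have an infrequent (k-1)-subset."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    Ck = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            p, q = prev[i], prev[j]
            if p[:-1] == q[:-1] and p[-1] < q[-1]:      # join condition
                cand = frozenset(p + (q[-1],))
                if all(frozenset(s) in L_prev for s in combinations(cand, k - 1)):
                    Ck.add(cand)                        # survives pruning
    return Ck

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print(generate_candidates(L3, 4))   # {frozenset({'a', 'b', 'c', 'd'})}; acde is pruned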
Exercise 2
• A database has 5 transactions. Let min sup = 60% and min conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
(b) List all of the strong association rules (with support s and confidence c) matching the following meta rule, where X is a variable representing customers, and item i denotes variables representing items (e.g., A, B, etc.):
Ans 2
• Min support count = 4
• Min conf = 70%
Bottleneck of Frequent-pattern Mining
• Example: minimum support count = 3
Techniques to improve Apriori's efficiency:
1. Hashing-based technique
3. Partitioning
– Phase 2: find global frequent itemsets
4. Sampling
BUT HOW?
FP Growth
• Simply a two-step procedure:
– Step 1: Build a compact data structure called the FP-tree
• Built using 2 passes over the data set.
– Step 2: Extract frequent itemsets directly from the FP-tree
Construct FP-tree
Two Steps:
1. Scan the transaction DB for the first time, find frequent
items (single item patterns) and order them into a list L in
frequency descending order.
e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}
In the format of (item-name, support)
2. For each transaction, order its frequent items according to
the order in L; Scan DB the second time, construct FP-tree
by putting each frequency ordered transaction onto it.
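A compact Python sketch of this two-pass construction (class and function names are my own; frequency ties are broken alphabetically, and the header table is kept as a simple item → list-of-nodes mapping rather than a chain of node-links):

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                  # item -> FPNode

def build_fp_tree(transactions, min_support=2):
    # Pass 1: count items, keep the frequent ones, fix the global order L.
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Pass 2: insert each transaction with its items in descending-frequency order.
    root = FPNode(None, None)
    header = defaultdict(list)              # item -> its nodes in the tree
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order

# The nine-transaction database used in the next slides:
db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
      ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
root, header, order = build_fp_tree(db)
print(order)    # ['I2', 'I1', 'I3', 'I4', 'I5'], i.e. L = {I2:7, I1:6, I3:6, I4:2, I5:2}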
FP Growth
• Now let's consider the following transaction database (the same nine transactions T100–T900 as in Exercise 1), with minimum support count = 2 and minimum confidence = 70%.

First scan of the database: find the support count of each item and order the items in descending order of support count:
I2  7
I1  6
I3  6
I4  2
I5  2

Then scan the DB a second time, ordering the frequent items in each transaction according to this list.
FP Growth
• Now we will build an FP-tree for that database.
• Within each transaction, items are taken in descending order of their support counts.
For each transaction, a path is added to the tree, or the counts along an existing path are incremented:

For I2, I1, I5: new path null -> I2:1 -> I1:1 -> I5:1
For I2, I4: I2:2; new child I4:1 under I2
For I2, I3: I2:3; new child I3:1 under I2
For I2, I1, I4: I2:4, I1:2; new child I4:1 under I1
For I1, I3: new path null -> I1:1 -> I3:1
For I2, I3: I2:5; its child I3:2
For I1, I3: the null-branch I1:2 and its child I3:2
For I2, I1, I3, I5: I2:6, I1:3; new child I3:1 under I1, with new child I5:1
For I2, I1, I3: I2:7, I1:4, and that I3:2

The completed FP-tree:

null
├─ I2:7
│  ├─ I1:4
│  │  ├─ I5:1
│  │  ├─ I4:1
│  │  └─ I3:2
│  │     └─ I5:1
│  ├─ I4:1
│  └─ I3:2
└─ I1:2
   └─ I3:2
To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links:

Item  Support
I2    7
I1    6
I3    6
I4    2
I5    2

FP-tree construction over! Now we need to find the conditional pattern base and the conditional FP-tree for each item.
FP-Tree Definition
• FP-tree is a frequent-pattern tree. Formally, an FP-tree is a tree structure defined as follows:
1. One root labeled as "null", a set of item-prefix subtrees as the children of the root, and a frequent-item header table.
2. Each node in the item-prefix subtrees has three fields:
– item-name: registers which item this node represents,
– count: the number of transactions represented by the portion of the path reaching this node,
– node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields:
– item-name, and
– head of node-link, which points to the first node in the FP-tree carrying the item-name.
Construct Conditional Pattern Base
• Start at the bottom of the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item
• Accumulate all transformed prefix paths of that item to form its conditional pattern base (see the sketch below)
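Continuing the build_fp_tree sketch from earlier, the conditional pattern base of an item can be collected by walking each of its nodes up to the root (a sketch; it assumes the FPNode and header structures defined above):

def conditional_pattern_base(item, header):
    """For each node of `item`, record its prefix path and the node's count."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base

print(conditional_pattern_base("I4", header))   # [(['I2'], 1), (['I2', 'I1'], 1)]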
Properties of FP-Tree
• Node-link property
– For any frequent item ai, all the possible frequent patterns that
contain ai can be obtained by following ai's node-links, starting from
ai's head in the FP-tree header.
• Prefix path property
– To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.
Construct Conditional FP-tree
Following the node-links in the FP-tree above, from the bottom of the header table upward, gives for each item its conditional pattern base, its conditional FP-tree, and the frequent patterns generated (minimum support count = 2):

Item  Conditional pattern base          Conditional FP-tree    Frequent patterns generated
I5    {{I2, I1: 1}, {I2, I1, I3: 1}}    <I2:2, I1:2>           {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4    {{I2, I1: 1}, {I2: 1}}            <I2:2>                 {I2, I4: 2}
I3    {{I2, I1: 2}, {I2: 2}, {I1: 2}}   <I2:4, I1:2>, <I1:2>   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1    {{I2: 4}}                         <I2:4>                 {I2, I1: 4}
Transaction ID Items
10 A, B, D
20 D, E, F
30 A, F
40 B, C, D
50 E, F
60 D, E, F
70 C, D, F
80 A, C, D, F
TID A B C D E F
10 1 1 0 1 0 0
20 0 0 0 1 1 1
30 1 0 0 0 0 1
40 0 1 1 1 0 0
50 0 0 0 0 1 1
60 0 0 0 1 1 1
70 0 0 1 1 0 1
80 1 0 1 1 0 1
Vertical List Format:
For each item, a list of the TIDs of the records containing that item is stored; this is also referred to as the tidlist of that item.

Items  Transactions
A      10, 30, 80
B      10, 40
C      40, 70, 80
D      10, 20, 40, 60, 70, 80
E      20, 50, 60
F      20, 30, 50, 60, 70, 80
Vertical Representation
Consider only eight transactions, with transaction IDs {10, 20, 30, 40, 50, 60, 70, 80}. This set of eight transactions contains six items.

Items 10 20 30 40 50 60 70 80
A      1  0  1  0  0  0  0  1
B      1  0  0  1  0  0  0  0
C      0  0  0  1  0  0  1  1
D      1  1  0  1  0  1  1  1
E      0  1  0  0  1  1  0  0
F      0  1  1  0  1  1  1  1
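With the vertical format, support counting reduces to intersecting tidlists: the support count of an itemset is the size of the intersection of its items' tidlists. A minimal sketch over the table above:

tidlists = {
    "A": {10, 30, 80},             "B": {10, 40},
    "C": {40, 70, 80},             "D": {10, 20, 40, 60, 70, 80},
    "E": {20, 50, 60},             "F": {20, 30, 50, 60, 70, 80},
}

def support(items, tidlists):
    """Intersect the tidlists; the itemset occurs in exactly these transactions."""
    tids = set.intersection(*(tidlists[i] for i in items))
    return len(tids), sorted(tids)

print(support(["D", "F"], tidlists))   # (4, [20, 60, 70, 80])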
[Example figure: a concept hierarchy with items m1, m2, b1, b2; minconf = 50%; transaction T3 contains {b2}, i.e., {bread} at the higher level; frequent itemset {m2, b1, b2}; association rule m2 <=> b1]
Multiple-Level Association Rules
Generalizing/specializing values of attributes:
– from specialized to general: the support of rules increases, and new rules may become valid;
– from general to specialized: the support of rules decreases, and rules may become invalid.
At too low a level of abstraction:
– a high minsup yields too few rules;
– a low minsup yields too many rules, most of them uninteresting.
At too high a level of abstraction => uninteresting rules
Approach: adjust minsup on each level of the hierarchy.
[Concept-hierarchy figure: food -> {milk, bread}; milk -> {m1, m2}; bread -> {b1, b2}]
Multiple-Level Association Rules
• A basic approach (top-down progressive deepening):
• Calculate frequent itemsets at each concept level, until no
more frequent itemsets can be found
• For each level use, e.g., Apriori for finding frequent
itemsets
• Example:
– First mine high-level frequent items, e.g., milk (15%), bread (10%), and rules based on them.
– Then mine their lower-level "weaker" frequent itemsets, e.g., 2% milk (5%), wheat bread (4%), and the rules based on them.
Multiple-Level Association Rules
• Variations of the basic approach:
• Uniform minimum support for all levels
– one minimum support threshold; simplified search
– if too high, we can miss low-level associations
– if too low, we can generate too many uninteresting high-level associations
• Reduced minimum support at lower levels
– different strategies are possible: level-by-level, level-cross filtering by single item, level-cross filtering by k-itemset, and controlled level-cross filtering by single item
Algorithm: An Example
An entry of the sales_transaction table:
Transaction_id  Bar_code_set
351428          {17325, 92108, 55349, 88157, …}

The encoded transaction table T[1]:
TID Items
T1 {111,121,211,221}
T2 {111,211,222,323}
T3 {112,122,221,411}
T4 {111,121}
T5 {111,122,211,221,413}
T6 {211,323,524}
T7 {323,411,524,713}
Algorithm: An Example
The frequent 1-itemsets on level 1 (level-1 minsup = 4):

L[1,1]:
Itemset  Support
{1**}    5
{2**}    5

T[2] (keep only the items from T[1] that are in L[1,1]):
TID  Items
T1   {111, 121, 211, 221}
T2   {111, 211, 222}
T3   {112, 122, 221}
T4   {111, 121}
T5   {111, 122, 211, 221}
T6   {211}

L[1,2]:
Itemset      Support
{1**, 2**}   4

Use Apriori on each level.
[Concept-hierarchy figure: food -> {milk, bread}; milk -> {2%, chocolate}; bread -> {white, …}; brands Dairyland, Foremost]
Algorithm: An Example
Level-2 minsup = 3

L[2,1]:
Itemset  Support
{11*}    5
{12*}    4
{21*}    4
{22*}    4

L[2,2]:
Itemset      Support
{11*, 12*}   4
{11*, 21*}   3
{11*, 22*}   4
{12*, 22*}   3
{21*, 22*}   3

L[2,3]:
Itemset           Support
{11*, 12*, 22*}   3
{11*, 21*, 22*}   3
Frequent Itemsets at Level 3
Level-3 minsup = 3

L[3,1]:
Itemset  Support
{111}    4
{211}    4
{221}    3

L[3,2]:
Itemset      Support
{111, 211}   3
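Support counting at any concept level can be done directly on the encoded items, since the j-th digit of a code gives the category at level j (e.g., 111 generalizes to 11* on level 2 and 1** on level 1). A minimal sketch reproducing the counts above (function names are my own):

transactions = [
    {"111", "121", "211", "221"}, {"111", "211", "222", "323"},
    {"112", "122", "221", "411"}, {"111", "121"},
    {"111", "122", "211", "221", "413"}, {"211", "323", "524"},
    {"323", "411", "524", "713"},
]

def generalize(item, level):
    """Keep the first `level` digits, mask the rest: generalize('111', 1) -> '1**'."""
    return item[:level] + "*" * (len(item) - level)

def level_support(itemset, level):
    """Count transactions containing every generalized item of `itemset`."""
    return sum(all(any(generalize(i, level) == g for i in t) for g in itemset)
               for t in transactions)

print(level_support({"1**"}, 1))          # 5  -> in L[1,1]
print(level_support({"1**", "2**"}, 1))   # 4  -> in L[1,2]
print(level_support({"11*", "22*"}, 2))   # 4  -> in L[2,2]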
E.g.
Level-1:
80% of customers that purchase milk also purchase bread.
milk ➔ bread with confidence = 80%
Level-2:
75% of people who buy 2% milk also buy wheat bread.
2% milk ➔ wheat bread with confidence = 75%
Redundancy of multilevel association rules
• Some rules may be redundant due to "ancestor" relationships between items.
• Example (milk has 4 subclasses):
– milk -> wheat bread [support = 8%, confidence = 70%]
– 2% milk -> wheat bread [support = 2%, confidence = 72%]
• The first rule is an ancestor of the second rule.
• The second rule could be redundant: with 4 subclasses of milk, its expected support is about 8% / 4 = 2%, and its confidence is close to the ancestor's 70%, so it carries little extra information.
Multidimensional Association rules
• Single-dimensional rules:
• Items in the rule refer to only one dimension or
predicate, e.g., to "buys".
– buys(x, “milk") ^ buys(x, "Chips") -> buys(x, "Bread")
[0.4%, 52%]
• Multidimensional rules:
• Items in the rule refer to two or more dimensions
or predicates, e.g.,
– "buys", "time_of_transaction", "customer_category".
Multidimensional Association rules
• Rules:
• “Nationality = French” -> “Income = high” [50%, 100%]
• “Income = high” -> “Nationality = French“ [50%, 75%]
• “Age = 50” -> “Nationality = Italian” [33%, 100%]
Multidimensional Association rules
• Single-dimensional rules (intradimensional rules):
– Transactional data is used.
– Items in a rule are assumed to belong to the same
transaction.
• Multidimensional rules:
– Relational data is used.
– Attribute A in a rule is assumed to have value a,
attribute B value b and attribute C value c in the
same tuple.
Multidimensional Association rules
• Two types of rules
• Interdimensional rules: the same attribute/predicate may not be repeated in a rule
– age(X, "20...29") ^ occupation(X, "student") => buys(X, "laptop")
• Hybrid-dimensional rules: the same attribute/predicate can be repeated in a rule
– age(X, "20...29") ^ buys(X, "laptop") => buys(X, "HP printer")