Module 5 - Frequent Pattern Mining
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Association Rules
• An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩Y = ∅
• Example: {Milk, Biscuit} → {FruitJuice}
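To see why brute force is prohibitive, the procedure can be sketched in a few lines of Python. The four-transaction database below is hypothetical, chosen to match the example items above; the sketch enumerates every candidate itemset and every binary split of it into a rule X → Y (for d items there are on the order of 3^d candidate rules).

from itertools import combinations

# Hypothetical four-transaction basket database (names for illustration only).
db = [
    {"Milk", "Biscuit", "FruitJuice"},
    {"Milk", "Biscuit"},
    {"Biscuit", "FruitJuice"},
    {"Milk", "FruitJuice"},
]
items = sorted(set().union(*db))
minsup, minconf = 0.5, 0.6

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

# Enumerate every itemset of size >= 2, then every split X -> Y of it.
for r in range(2, len(items) + 1):
    for itemset in map(frozenset, combinations(items, r)):
        sup = support(itemset)
        if sup == 0:
            continue
        for k in range(1, len(itemset)):
            for x in map(frozenset, combinations(sorted(itemset), k)):
                conf = sup / support(x)
                if sup >= minsup and conf >= minconf:
                    print(sorted(x), "->", sorted(itemset - x),
                          f"sup={sup:.2f}, conf={conf:.2f}")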
Association rule mining is a two-step process:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Apriori property:
All subsets of a frequent itemset must be frequent.
Equivalently, if an itemset is infrequent, all its supersets must be infrequent.
Exercise 1
TID List of Item IDs
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1, I2, I4
T500 I1, I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1, I2, I3
• Step-1: K=1. Scan the dataset and count each item; the items with support count ≥ min_support form L1.
L1:
Itemset  Support
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2
• Step-2: K=2
• Generate candidate set C2 by joining L1 with itself (the join step).
• Check whether all subsets of each candidate are frequent; if not, remove that candidate. (For example, the subsets of {I1, I2} are {I1} and {I2}, both frequent. Check each candidate this way.)
• Now find the support count of each remaining candidate by scanning the dataset.
C2:
Itemset    Support
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0
• Compare each candidate's support count in C2 with the minimum support count (here min_support = 2); remove the candidates whose support count is below min_support. This gives us L2.
L2:
Itemset    Support
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
• Step-3:
– Generate candidate set C3 from L2 (join step).
– Joining L2 with itself yields {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
– Check whether all subsets of each candidate are frequent; if not, remove it. (The subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Check every candidate the same way.)
– Find the support count of the remaining candidates by scanning the dataset.
• Compare each candidate's support count in C3 with the minimum support count (min_support = 2) and remove the candidates below it. This gives us L3.
L3:
Itemset        Support
{I1, I2, I3}   2
{I1, I2, I5}   2
• Step-4:
– Generate candidate set C4 from L3 (join step).
– Check whether all subsets are frequent. (The only itemset formed by joining L3 is {I1, I2, I3, I5}, whose subset {I1, I3, I5} is not frequent.) So C4 is empty.
– We stop here because no further frequent itemsets can be found.
Thus we have discovered all frequent itemsets; now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence –
A confidence of 50% means that 50% of the customers who purchased milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
We demonstrate rule generation using one frequent itemset as an example.
Itemset {I1, I2, I3} // from L3. The candidate rules are:
[I1^I2] => [I3]: confidence = sup(I1^I2^I3) / sup(I1^I2) = 2/4 = 50%
[I1^I3] => [I2]: confidence = sup(I1^I2^I3) / sup(I1^I3) = 2/4 = 50%
[I2^I3] => [I1]: confidence = sup(I1^I2^I3) / sup(I2^I3) = 2/4 = 50%
[I1] => [I2^I3]: confidence = sup(I1^I2^I3) / sup(I1) = 2/6 = 33%
[I2] => [I1^I3]: confidence = sup(I1^I2^I3) / sup(I2) = 2/7 ≈ 28.6%
[I3] => [I1^I2]: confidence = sup(I1^I2^I3) / sup(I3) = 2/6 = 33%
So if the minimum confidence is 50%, the first 3 rules can be considered strong association rules.
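The same computation can be scripted. A minimal Python sketch, with the support counts of {I1, I2, I3} and its subsets hard-coded from the tables above (the function name rules_from_itemset is my own):

from itertools import combinations

# Support counts taken from the worked example above.
support = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4, frozenset({"I1", "I2", "I3"}): 2,
}

def rules_from_itemset(itemset, support, min_conf=0.5):
    """Yield every rule X => Y with X ∪ Y = itemset and confidence >= min_conf."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            conf = support[itemset] / support[antecedent]
            if conf >= min_conf:
                yield sorted(antecedent), sorted(itemset - antecedent), conf

for lhs, rhs, conf in rules_from_itemset({"I1", "I2", "I3"}, support):
    print(lhs, "=>", rhs, f"confidence = {conf:.0%}")
# Prints exactly the three 50% rules found above.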
• C3 ={{I1, I2, I3}, {I1, I2, I5}} after pruning.
• Suppose the data contain the frequent itemset L = {I1, I2, I5}.
• What are the association rules that can be generated from L?
• The nonempty proper subsets of L are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
• The resulting association rules, each listed with its confidence, are:
– I1 ^ I2 => I5, confidence = 2/4 = 50%
– I1 ^ I5 => I2, confidence = 2/2 = 100%
– I2 ^ I5 => I1, confidence = 2/2 = 100%
– I1 => I2 ^ I5, confidence = 2/6 = 33%
– I2 => I1 ^ I5, confidence = 2/7 ≈ 28.6%
– I5 => I1 ^ I2, confidence = 2/2 = 100%
• If the minimum confidence threshold is 50%, then only the first, second, third, and last rules are output, because these are the only ones generated that are strong.
The Apriori Algorithm—An Example
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E
The Apriori Algorithm—An Example
Supmin = 2

1st scan → C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
Prune with Supmin → L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

Join L1 with itself → C2: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 with counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
Prune → L2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
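A runnable Python rendering of this pseudocode, as a sketch (the function name and the use of an absolute min_support count are my choices; candidate generation and pruning follow the join/prune steps detailed below):

from itertools import combinations

def apriori(transactions, min_support=2):
    """Level-wise Apriori: returns {frozenset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                       # 1st scan: count single items
        for item in t:
            key = frozenset({item})
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}   # L1
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join step: merge frequent k-itemsets that differ in one item.
        keys = list(Lk)
        candidates = {keys[i] | keys[j]
                      for i in range(len(keys)) for j in range(i + 1, len(keys))
                      if len(keys[i] | keys[j]) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Scan the database once to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

db = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
for itemset, sup in sorted(apriori(db).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)

Running this on the four-transaction example above with min_support = 2 reproduces L1 and L2, plus the frequent 3-itemset {B, C, E} with support 2.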
Important Details of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd}
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
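The same join-and-prune logic in Python (a sketch; generate_candidates is a hypothetical name), reproducing the L3 → C4 example above:

from itertools import combinations

def generate_candidates(L_prev, k):
    """Self-join L_{k-1} (sorted tuples agreeing on the first k-2 items),
    then prune candidates that have an infrequent (k-1)-subset."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    Ck = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            p, q = prev[i], prev[j]
            if p[:-1] == q[:-1] and p[-1] < q[-1]:      # join condition
                cand = frozenset(p + (q[-1],))
                if all(frozenset(s) in L_prev for s in combinations(cand, k - 1)):
                    Ck.add(cand)                        # survives pruning
    return Ck

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print(generate_candidates(L3, 4))   # {frozenset({'a', 'b', 'c', 'd'})}; acde is pruned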
Exercise 2
• A database has 5 transactions. Let min sup = 60% and min conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
(b) List all of the strong association rules (with support s and confidence c) matching the following meta rule, where X is a variable representing customers, and item i denotes variables representing items (e.g., A, B, etc.):
Ans 2
• Min support count = 4
• Min conf = 70%
Bottleneck of Frequent-pattern Mining
• Example: minimum support count = 3
Techniques to improve Apriori's efficiency:
1. Hashing-based technique
3. Partitioning
– Phase 2: find global frequent itemsets
4. Sampling
BUT HOW?
FP Growth
• Simply a two-step procedure:
– Step 1: Build a compact data structure called the FP-tree
• Built using 2 passes over the data set.
– Step 2: Extract frequent itemsets directly from the FP-tree
Construct FP-tree
Two Steps:
1. Scan the transaction DB for the first time, find frequent
items (single item patterns) and order them into a list L in
frequency descending order.
e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}
In the format of (item-name, support)
2. For each transaction, order its frequent items according to
the order in L; Scan DB the second time, construct FP-tree
by putting each frequency ordered transaction onto it.
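A compact Python sketch of this two-pass construction (class and function names are my own; frequency ties are broken alphabetically, and the header table is kept as a simple item → list-of-nodes mapping rather than a chain of node-links):

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                  # item -> FPNode

def build_fp_tree(transactions, min_support=2):
    # Pass 1: count items, keep the frequent ones, fix the global order L.
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Pass 2: insert each transaction with its items in descending-frequency order.
    root = FPNode(None, None)
    header = defaultdict(list)              # item -> its nodes in the tree
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order

# The nine-transaction database used in the next slides:
db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
      ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
root, header, order = build_fp_tree(db)
print(order)    # ['I2', 'I1', 'I3', 'I4', 'I5'], i.e. L = {I2:7, I1:6, I3:6, I4:2, I5:2}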
FP Growth
• Now let's consider the following transaction database (the same nine transactions T100–T900 as in Exercise 1), with minimum support count = 2 and minimum confidence = 70%.

First scan of the database: find the support count of each item and order the items in descending order of support count:
I2  7
I1  6
I3  6
I4  2
I5  2

Then scan the DB a second time, ordering the frequent items in each transaction according to this list.
FP Growth
• Now we will build an FP-tree for that database.
• Within each transaction, items are taken in descending order of their support counts.
For each transaction, a path is added to the tree, or the counts along an existing path are incremented:

For I2, I1, I5: new path null -> I2:1 -> I1:1 -> I5:1
For I2, I4: I2:2; new child I4:1 under I2
For I2, I3: I2:3; new child I3:1 under I2
For I2, I1, I4: I2:4, I1:2; new child I4:1 under I1
For I1, I3: new path null -> I1:1 -> I3:1
For I2, I3: I2:5; its child I3:2
For I1, I3: the null-branch I1:2 and its child I3:2
For I2, I1, I3, I5: I2:6, I1:3; new child I3:1 under I1, with new child I5:1
For I2, I1, I3: I2:7, I1:4, and that I3:2

The completed FP-tree:

null
├─ I2:7
│  ├─ I1:4
│  │  ├─ I5:1
│  │  ├─ I4:1
│  │  └─ I3:2
│  │     └─ I5:1
│  ├─ I4:1
│  └─ I3:2
└─ I1:2
   └─ I3:2
To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links:

Item  Support
I2    7
I1    6
I3    6
I4    2
I5    2

FP-tree construction over! Now we need to find the conditional pattern base and the conditional FP-tree for each item.
FP-Tree Definition
• FP-tree is a frequent-pattern tree. Formally, an FP-tree is a tree structure defined as follows:
1. One root labeled as "null", a set of item-prefix subtrees as the children of the root, and a frequent-item header table.
2. Each node in the item-prefix subtrees has three fields:
– item-name: registers which item this node represents,
– count: the number of transactions represented by the portion of the path reaching this node,
– node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields:
– item-name, and
– head of node-link, which points to the first node in the FP-tree carrying the item-name.
Construct Conditional Pattern Base
• Start at the bottom of the frequent-item header table of the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item
• Accumulate all transformed prefix paths of that item to form its conditional pattern base (see the sketch below)
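Continuing the build_fp_tree sketch from earlier, the conditional pattern base of an item can be collected by walking each of its nodes up to the root (a sketch; it assumes the FPNode and header structures defined above):

def conditional_pattern_base(item, header):
    """For each node of `item`, record its prefix path and the node's count."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base

print(conditional_pattern_base("I4", header))   # [(['I2'], 1), (['I2', 'I1'], 1)]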
Properties of FP-Tree
• Node-link property
– For any frequent item ai, all the possible frequent patterns that
contain ai can be obtained by following ai's node-links, starting from
ai's head in the FP-tree header.
• Prefix path property
– To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.
Construct Conditional FP-tree
Following the node-links in the FP-tree above, from the bottom of the header table upward, gives for each item its conditional pattern base, its conditional FP-tree, and the frequent patterns generated (minimum support count = 2):

Item  Conditional pattern base          Conditional FP-tree    Frequent patterns generated
I5    {{I2, I1: 1}, {I2, I1, I3: 1}}    <I2:2, I1:2>           {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4    {{I2, I1: 1}, {I2: 1}}            <I2:2>                 {I2, I4: 2}
I3    {{I2, I1: 2}, {I2: 2}, {I1: 2}}   <I2:4, I1:2>, <I1:2>   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1    {{I2: 4}}                         <I2:4>                 {I2, I1: 4}
Transaction ID Items
10 A, B, D
20 D, E, F
30 A, F
40 B, C, D
50 E, F
60 D, E, F
70 C, D, F
80 A, C, D, F
TID A B C D E F
10 1 1 0 1 0 0
20 0 0 0 1 1 1
30 1 0 0 0 0 1
40 0 1 1 1 0 0
50 0 0 0 0 1 1
60 0 0 0 1 1 1
70 0 0 1 1 0 1
80 1 0 1 1 0 1
Vertical List Format:
For each item, a list of the TIDs of the records containing that item is stored; this is also referred to as the tidlist of that item.

Items  Transactions
A      10, 30, 80
B      10, 40
C      40, 70, 80
D      10, 20, 40, 60, 70, 80
E      20, 50, 60
F      20, 30, 50, 60, 70, 80
Vertical Representation
Consider only eight transactions, with transaction IDs {10, 20, 30, 40, 50, 60, 70, 80}. This set of eight transactions contains six items.

Items 10 20 30 40 50 60 70 80
A      1  0  1  0  0  0  0  1
B      1  0  0  1  0  0  0  0
C      0  0  0  1  0  0  1  1
D      1  1  0  1  0  1  1  1
E      0  1  0  0  1  1  0  0
F      0  1  1  0  1  1  1  1
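With the vertical format, support counting reduces to intersecting tidlists: the support count of an itemset is the size of the intersection of its items' tidlists. A minimal sketch over the table above:

tidlists = {
    "A": {10, 30, 80},             "B": {10, 40},
    "C": {40, 70, 80},             "D": {10, 20, 40, 60, 70, 80},
    "E": {20, 50, 60},             "F": {20, 30, 50, 60, 70, 80},
}

def support(items, tidlists):
    """Intersect the tidlists; the itemset occurs in exactly these transactions."""
    tids = set.intersection(*(tidlists[i] for i in items))
    return len(tids), sorted(tids)

print(support(["D", "F"], tidlists))   # (4, [20, 60, 70, 80])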
[Example figure: a concept hierarchy with items m1, m2, b1, b2; minconf = 50%; transaction T3 contains {b2}, i.e., {bread} at the higher level; frequent itemset {m2, b1, b2}; association rule m2 <=> b1]
Multiple-Level Association Rules
Generalizing/specializing values of attributes:
– from specialized to general: the support of rules increases, and new rules may become valid;
– from general to specialized: the support of rules decreases, and rules may become invalid.
At too low a level of abstraction:
– a high minsup yields too few rules;
– a low minsup yields too many rules, most of them uninteresting.
At too high a level of abstraction => uninteresting rules
Approach: adjust minsup on each level of the hierarchy.
[Concept-hierarchy figure: food -> {milk, bread}; milk -> {m1, m2}; bread -> {b1, b2}]
Multiple-Level Association Rules
• A basic approach (top-down progressive deepening):
• Calculate frequent itemsets at each concept level, until no
more frequent itemsets can be found
• For each level use, e.g., Apriori for finding frequent
itemsets
• Example:
– First mine high-level frequent items, e.g., milk (15%), bread (10%), and rules based on them.
– Then mine their lower-level "weaker" frequent itemsets, e.g., 2% milk (5%), wheat bread (4%), and the rules based on them.
Multiple-Level Association Rules
• Variations of the basic approach:
• Uniform minimum support for all levels
– one minimum support threshold; simplified search
– if too high, we can miss low-level associations
– if too low, we can generate too many uninteresting high-level associations
• Reduced minimum support at lower levels
– different strategies are possible: level-by-level, level-cross filtering by single item, level-cross filtering by k-itemset, and controlled level-cross filtering by single item
Algorithm: An Example
An entry of the sales_transaction table:
Transaction_id  Bar_code_set
351428          {17325, 92108, 55349, 88157, …}

The encoded transaction table T[1]:
TID Items
T1 {111,121,211,221}
T2 {111,211,222,323}
T3 {112,122,221,411}
T4 {111,121}
T5 {111,122,211,221,413}
T6 {211,323,524}
T7 {323,411,524,713}
Algorithm: An Example
The frequent 1-itemsets on level 1 (level-1 minsup = 4):

L[1,1]:
Itemset  Support
{1**}    5
{2**}    5

T[2] (keep only the items from T[1] that are in L[1,1]):
TID  Items
T1   {111, 121, 211, 221}
T2   {111, 211, 222}
T3   {112, 122, 221}
T4   {111, 121}
T5   {111, 122, 211, 221}
T6   {211}

L[1,2]:
Itemset      Support
{1**, 2**}   4

Use Apriori on each level.
[Concept-hierarchy figure: food -> {milk, bread}; milk -> {2%, chocolate}; bread -> {white, …}; brands Dairyland, Foremost]
Algorithm: An Example
Level-2 minsup = 3

L[2,1]:
Itemset  Support
{11*}    5
{12*}    4
{21*}    4
{22*}    4

L[2,2]:
Itemset      Support
{11*, 12*}   4
{11*, 21*}   3
{11*, 22*}   4
{12*, 22*}   3
{21*, 22*}   3

L[2,3]:
Itemset           Support
{11*, 12*, 22*}   3
{11*, 21*, 22*}   3
Frequent Itemsets at Level 3
Level-3 minsup = 3

L[3,1]:
Itemset  Support
{111}    4
{211}    4
{221}    3

L[3,2]:
Itemset      Support
{111, 211}   3
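Support counting at any concept level can be done directly on the encoded items, since the j-th digit of a code gives the category at level j (e.g., 111 generalizes to 11* on level 2 and 1** on level 1). A minimal sketch reproducing the counts above (function names are my own):

transactions = [
    {"111", "121", "211", "221"}, {"111", "211", "222", "323"},
    {"112", "122", "221", "411"}, {"111", "121"},
    {"111", "122", "211", "221", "413"}, {"211", "323", "524"},
    {"323", "411", "524", "713"},
]

def generalize(item, level):
    """Keep the first `level` digits, mask the rest: generalize('111', 1) -> '1**'."""
    return item[:level] + "*" * (len(item) - level)

def level_support(itemset, level):
    """Count transactions containing every generalized item of `itemset`."""
    return sum(all(any(generalize(i, level) == g for i in t) for g in itemset)
               for t in transactions)

print(level_support({"1**"}, 1))          # 5  -> in L[1,1]
print(level_support({"1**", "2**"}, 1))   # 4  -> in L[1,2]
print(level_support({"11*", "22*"}, 2))   # 4  -> in L[2,2]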
E.g.
Level-1:
80% of customers that purchase milk also purchase bread.
milk ➔ bread with confidence = 80%
Level-2:
75% of people who buy 2% milk also buy wheat bread.
2% milk ➔ wheat bread with confidence = 75%
Redundancy of multilevel association rules
• Some rules may be redundant due to "ancestor" relationships between items.
• Example (milk has 4 subclasses):
– milk -> wheat bread [support = 8%, confidence = 70%]
– 2% milk -> wheat bread [support = 2%, confidence = 72%]
• The first rule is an ancestor of the second rule.
• The second rule could be redundant: with 4 subclasses of milk, its expected support is about 8% / 4 = 2%, and its confidence is close to the ancestor's 70%, so it carries little extra information.
Multidimensional Association rules
• Single-dimensional rules:
• Items in the rule refer to only one dimension or
predicate, e.g., to "buys".
– buys(x, “milk") ^ buys(x, "Chips") -> buys(x, "Bread")
[0.4%, 52%]
• Multidimensional rules:
• Items in the rule refer to two or more dimensions
or predicates, e.g.,
– "buys", "time_of_transaction", "customer_category".
Multidimensional Association rules
• Rules:
• “Nationality = French” -> “Income = high” [50%, 100%]
• “Income = high” -> “Nationality = French“ [50%, 75%]
• “Age = 50” -> “Nationality = Italian” [33%, 100%]
Multidimensional Association rules
• Single-dimensional rules (intradimensional rules):
– Transactional data is used.
– Items in a rule are assumed to belong to the same
transaction.
• Multidimensional rules:
– Relational data is used.
– Attribute A in a rule is assumed to have value a,
attribute B value b and attribute C value c in the
same tuple.
Multidimensional Association rules
• Two types of rules
• Interdimensional rules: the same attribute/predicate may not be repeated in a rule
– age(X, "20...29") ^ occupation(X, "student") => buys(X, "laptop")
• Hybrid-dimensional rules: the same attribute/predicate can be repeated in a rule
– age(X, "20...29") ^ buys(X, "laptop") => buys(X, "HP printer")