
Data Mining: Concepts and Techniques

1
Chapter 5: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern

Evaluation Methods

 Summary

2
What Is Frequent Pattern
Analysis?
 Imagine that you are a sales manager at AllElectronics, and you are
talking to a customer who recently bought a PC and a digital camera
from the store. What should you recommend to her next?

 Information about which products are frequently purchased by your


customers following their purchases of a PC and a digital camera in
sequence would be very helpful in making your recommendation

 Frequent patterns and association rules are the knowledge that you
want to mine in such a scenario
 Importance:

Identifying relationships between purchases

Enhancing product recommendations

3
What Is Frequent Pattern
Analysis?
 Frequent pattern: a pattern (itemset, subsequence, substructure, etc.) that occurs frequently in a data set

 Itemset: Milk and bread bought together

 Subsequence: Buying a PC → Digital Camera → Memory Card

 Substructure: Subgraphs, subtrees, sublattices (A structured form of


frequent patterns, such as graphs, trees, or networks)

 Finding frequent patterns plays an essential role in mining


associations, correlations, and many other interesting relationships
among data
4
Market Basket Analysis: A
Motivating Example

5
Market Basket Analysis: A
Motivating Example
 Suppose, as manager of an AllElectronics branch, you would like to
learn more about the buying habits of your customers

 Which groups or sets of items are customers likely to purchase on a


given trip to the store?

 Market basket analysis may help you design different store layouts

 If customers who purchase computers also tend to buy antivirus software at the same time, then placing the hardware display close to the software display may help increase the sales of both items
 In an alternative strategy, placing hardware and software at opposite
ends may attract customers who purchase such items to pick up other
items along the way
6
Market Basket Analysis: A
Motivating Example
 Market basket analysis can also help retailers plan which items to put
on sale at reduced prices

 If customers tend to purchase computers and printers together, then


having a sale on printers may encourage the sale of printers as well
as computers

 If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing its presence or absence. Each basket can then be represented by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together

 These patterns can be represented in the form of association rules


7
Market Basket Analysis: A
Motivating Example
 For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in the following association rule:

   computer ⇒ antivirus_software [support = 2%, confidence = 60%]

 Rule support and confidence are two measures of rule interestingness

 A support of 2% for this rule means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together

 A confidence of 60% means that 60% of the customers who


purchased a computer also bought the software
8
Basic Concepts: Frequent Patterns

Tid   Items bought
10    Bread, Nuts, Diaper
20    Bread, Coffee, Diaper
30    Bread, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

 itemset: a set of one or more items
 k-itemset: X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
 (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X’s support is no less than a minsup threshold
10
Basic Concepts: Frequent Patterns

 Support(Bread → Diaper) = 3/5 = 60%
 Confidence(Bread → Diaper) = 3/3 = 100%
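These figures can be reproduced directly from the five-transaction table above. A minimal Python sketch (illustrative only; the helper names support and confidence are not from the slides):

transactions = [
    {"Bread", "Nuts", "Diaper"},
    {"Bread", "Coffee", "Diaper"},
    {"Bread", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset, db):
    # relative support: fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # confidence(lhs -> rhs) = support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"Bread", "Diaper"}, transactions))       # 0.6  -> 60%
print(confidence({"Bread"}, {"Diaper"}, transactions))   # 1.0  -> 100%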

11
Basic Concepts: Frequent Patterns

Transaction ID   Items
1                {A, C, D}
2                {B, C, E}
3                {A, B, C, E}
4                {B, E}
5                {A, B, C, E}
Basic Concepts: Frequent Patterns

Let Min Support = 2/5

Itemset  Support  F/I      Itemset   Support  F/I
A        3/5      F        ABC       2/5      F
B        4/5      F        ABD       0/5      I
C        4/5      F        ABE       2/5      F
D        1/5      I        ACD       1/5      I
E        4/5      F        ACE       2/5      F
AB       2/5      F        ADE       0/5      I
AC       3/5      F        BCD       0/5      I
AD       1/5      I        BCE       3/5      F
AE       2/5      F        BDE       0/5      I
BC       3/5      F        CDE       0/5      I
BD       0/5      I        ABCD      0/5      I
BE       4/5      F        ABCE      2/5      F
CD       1/5      I        ABDE      0/5      I
CE       3/5      F        ACDE      0/5      I
DE       0/5      I        BCDE      0/5      I
                           ABCDE     0/5      I
Basic Concepts: Frequent Patterns

 Min Support = 2/5
 Total itemsets = 31
 Frequent itemsets = 15
 Infrequent itemsets = 16
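These counts can be checked with a short Python sketch (illustrative, not part of the slides) that enumerates all 31 non-empty itemsets over {A, B, C, D, E} and splits them at min support 2/5:

from itertools import combinations

transactions = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}, {"A","B","C","E"}]
items = sorted(set().union(*transactions))        # ['A', 'B', 'C', 'D', 'E']
min_support = 2 / 5

frequent, infrequent = [], []
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):           # every non-empty itemset
        support = sum(set(combo) <= t for t in transactions) / len(transactions)
        (frequent if support >= min_support else infrequent).append(combo)

print(len(frequent) + len(infrequent))   # 31 itemsets in total
print(len(frequent), len(infrequent))    # 15 frequent, 16 infrequent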
Basic Concepts: Association Rules

Tid   Items bought
10    Bread, Nuts, Diaper
20    Bread, Coffee, Diaper
30    Bread, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

 Find all the rules X → Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Frequent patterns: Bread:3, Nuts:3, Diaper:4, Eggs:3, {Bread, Diaper}:3
Association rules (many more exist!):
 Bread → Diaper (60%, 100%)
 Diaper → Bread (60%, 75%)
15
Association rule mining Process
 In general, association rule mining can be viewed
as a two-step process:
1. Find all frequent itemsets: By definition, each of these
itemsets will occur at least as frequently as a
predetermined minimum support count (min sup).

2. Generate strong association rules from the frequent


itemsets: By definition, these rules must satisfy minimum
support and minimum confidence.

16
The Downward Closure Property and
Scalable Mining Methods
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be

frequent

 If {Milk, Bread, Butter} is a frequent itemset, then the following itemsets must also be frequent (a quick enumeration sketch follows the list):

• {Milk}
• {Bread}
• {Butter}
• {Milk, Bread}
• {Milk, Butter}
• {Bread, Butter}
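As a quick sketch (illustrative), these six itemsets are exactly the non-empty proper subsets of {Milk, Bread, Butter}, which can be enumerated mechanically:

from itertools import combinations

frequent_itemset = ("Bread", "Butter", "Milk")
proper_subsets = [c for k in range(1, len(frequent_itemset))
                  for c in combinations(frequent_itemset, k)]
print(proper_subsets)
# [('Bread',), ('Butter',), ('Milk',), ('Bread', 'Butter'), ('Bread', 'Milk'), ('Butter', 'Milk')]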
17
A major challenge in mining frequent itemsets from a large data set
 Such mining often generates a huge number of itemsets satisfying minsup, especially when minsup is set low, because if an itemset is frequent, each of its subsets is frequent as well. For example, consider a frequent itemset of length 100.

18
 The total number of frequent itemsets that it contains is thus (100 choose 1) + (100 choose 2) + ··· + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30.
 This is too huge a number of itemsets for any computer to compute or store.
 To overcome this difficulty, we introduce the concepts of closed frequent itemset and maximal frequent itemset.

19
Closed Patterns and Max-
Patterns
 A long pattern contains a combinatorial number of
sub-patterns.

 Solution: Mine closed patterns and max-patterns


instead

 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X

 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
20
Closed and Maximal
 Frequent Itemset: A frequent itemset is an itemset
whose support is greater than some user-specified
minimum support

 Closed Frequent Itemset: An itemset is closed if


none of its immediate supersets has the same
support as that of the itemset.

 Maximal Frequent Itemset: An itemset is maximal


frequent if none of its immediate supersets is
frequent.

21
Maximal vs Closed Itemsets

 Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets: every maximal frequent itemset is closed, and every closed frequent itemset is frequent

26
EXAMPLE
 Let’s take an example. Suppose we have a database
with four customer transactions, denoted as T1, T2,
T3 and T4:

 T1: {a,b,c,d}
T2: {a,b,c}
T3: {a,b,d}
T4: {a,b}

 where the letters a, b, c, d indicate the purchase of the items apple, bread, cake, and dates.

 If we set the minimum support threshold to 50% (which


means that we want to find itemsets appearing in at least two
transactions)
27
 frequent itemsets:
 {a},

 {b},

 {c},

 {d},

 {a,b},

 {a,c},

 {a,d},

 {b,c},

 {b,d},

 {a,b,c},

 {a,b,d}
28
 frequent closed itemsets:
 {a,b} (support 4)
 {a,b,c} (support 2)
 {a,b,d} (support 2)

 Note that {a,b,c,d} is closed (no superset has the same support), but it is not a frequent closed itemset, because its support (1 transaction) is below the minimum support threshold (2).
29
FIND maximal frequent itemset

Transaction ID Items
1 ABCD
2 ABD
3 ACD
4 BCD

Assume that the minimum support threshold is


50%, meaning that an itemset must occur in at
least 2 transactions to be frequent (since this
datasets has four transactions).

30
Itemset   Support count (number of times the itemset appears in the database)
A         3
B         3
C         3
D         4
AB        2
AC        2
AD        3
BC        2
BD        3
CD        3
ABC       1
ABD       2
ACD       2
BCD       2
ABCD      1
31
The frequent itemsets are A, B, C, D, AB, AC, AD, BC, BD, CD, ABD, ACD, and BCD.

 Three of them are maximal frequent: ABD, ACD, and BCD, because none of their immediate supersets (here, only ABCD) is frequent.
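The same result can be checked with a small Python sketch (illustrative; helper names such as immediate_supersets are not from the slides) that recomputes the frequent, closed, and maximal itemsets for this four-transaction example:

from itertools import combinations

transactions = [{"A","B","C","D"}, {"A","B","D"}, {"A","C","D"}, {"B","C","D"}]
min_count = 2
items = sorted(set().union(*transactions))

def count(itemset):
    # support count: number of transactions containing every item of the itemset
    return sum(set(itemset) <= t for t in transactions)

frequent = {c: count(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(c) >= min_count}

def immediate_supersets(itemset):
    return [tuple(sorted(set(itemset) | {i})) for i in items if i not in itemset]

closed  = [s for s in frequent if all(count(sup) != frequent[s] for sup in immediate_supersets(s))]
maximal = [s for s in frequent if all(count(sup) < min_count for sup in immediate_supersets(s))]

print(maximal)   # [('A', 'B', 'D'), ('A', 'C', 'D'), ('B', 'C', 'D')]
print(closed)    # ('D',), ('A','D'), ('B','D'), ('C','D') plus the three maximal itemsets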

32
Why Closed and Maximal Frequent
itemset
 Closed itemsets are important because they can reduce the
number of frequent itemsets presented to the user, without
losing any information.
 Frequent itemsets can be very large and redundant, especially
when the minsup is low.
 By mining closed itemsets, generally only a very small set of
itemsets is obtained, and still all the other frequent itemsets can
be directly derived from the closed itemsets.
 Maximal frequent itemsets provide a compact representation
of all the frequent itemsets for a given dataset and minimum
support threshold.
 Also discovering maximal frequent itemsets can be faster and
require less memory and storage space than finding all the
frequent itemsets.
 We can derive all the other frequent itemsets from the maximal frequent itemsets, but we may not know their exact support counts.
33
Chapter 5: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern

Evaluation Methods

 Summary

36
Scalable Frequent Itemset Mining
Methods

 Apriori: A Candidate Generation-and-Test

Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth

Approach

 ECLAT: Frequent Pattern Mining with Vertical

Data Format
37
Apriori Algorithm
 Apriori employs an iterative approach known as a
level-wise search, where k-itemsets are used to
explore (k +1)-itemsets.
 To improve the efficiency of the level-wise
generation of frequent itemsets, an important
property called the Apriori property is used to
reduce the search space.

38
 Apriori property: All nonempty subsets of a
frequent itemset must also be frequent.
 This property belongs to a special category of
properties called antimonotonicity in the sense
that if a set cannot pass a test, all of its
supersets will fail the same test as well.

39
How is the Apriori property used in
the algorithm?
 A two-step process is followed, consisting of join
and prune actions.

 To find Lk, a set of candidate k-itemsets is


generated by joining Lk−1 with itself.

 For efficient implementation, Apriori assumes that


items within a transaction or itemset are sorted in
lexicographic order. For the (k−1)-itemset, li, this
means that the items are sorted such that
li[1]<li[2]<···<li[k−1].
40
 Join Process
 To find Lk, a set of candidate k-itemsets is generated by joining
Lk−1 with itself.
 The join, Lk−1 ⋈ Lk−1, is performed, where members of Lk−1 are joinable if their first (k−2) items are in common. That is, members l1 and l2 of Lk−1 are joined if (l1[1]=l2[1]) ∧ (l1[2]=l2[2]) ∧ ··· ∧ (l1[k−2]=l2[k−2]) ∧ (l1[k−1] < l2[k−1]).
 The condition l1[k−1] < l2[k−1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], ..., l1[k−2], l1[k−1], l2[k−1]}.

41
Apriori: A Candidate Generation & Test
Approach

 Apriori pruning principle: If there is any itemset


which is infrequent, its superset should not be
generated/tested! (Agrawal & Srikant @VLDB’94,
Mannila, et al. @ KDD’ 94)
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from
length k frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can
be generated
42
The Apriori Algorithm—An Example (Supmin = 2)

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (1st scan)        L1
Itemset  sup         Itemset  sup
{A}      2           {A}      2
{B}      3           {B}      3
{C}      3           {C}      3
{D}      1           {E}      3
{E}      3

C2                   C2 (2nd scan)        L2
Itemset              Itemset  sup         Itemset  sup
{A, B}               {A, B}   1           {A, C}   2
{A, C}               {A, C}   2           {B, C}   2
{A, E}               {A, E}   1           {B, E}   3
{B, C}               {B, C}   2           {C, E}   2
{B, E}               {B, E}   3
{C, E}               {C, E}   2

C3 (3rd scan)        L3
Itemset              Itemset    sup
{B, C, E}            {B, C, E}  2
43
 In the first iteration of the algorithm, each item is a
member of the set of candidate1-itemsets, C1.
 The set of frequent 1-itemsets, L1, can then be
determined. It consists of the candidate 1-itemsets
satisfying minimum support.
 To discover the set of frequent 2-itemsets, L2, the
algorithm uses the join L1 ⋈ L1 to generate a candidate
set of 2-itemsets, C2.
 No candidates are removed from C2 during the prune
step because each subset of the
 candidates is also frequent.

46
 The set of frequent 2-itemsets, L2, is then determined,
consisting of those candidate 2-itemsets in C2 having
minimum support.
 The generation of the set of the candidate 3-itemsets, C3.
 From the join step, we first get C3 =L2 ⋈ L2 ={{I1, I2, I3},
{I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Based on the Apriori property that all subsets of a
frequent itemset must also be frequent, we can determine
that the four latter candidates cannot possibly be
frequent.
 We therefore remove them from C3, thereby saving the
effort of unnecessarily obtaining their counts during the
subsequent scan of D to determine L3.
47
48
 The transactions in D are scanned to determine L3,
consisting of those candidate 3-itemsets in C3 having
minimum support
 The algorithm uses L3 ⋈ L3 to generate a candidate set
of 4-itemsets, C4. Although the join results in {{I1, I2, I3,
I5}}, itemset {I1, I2, I3, I5} is pruned because its subset
{I2, I3, I5} is not frequent. Thus, C4 =φ, and the algorithm
terminates, having found all of the frequent itemsets.

49
 Generating Association Rules from Frequent
Itemsets
 Once the frequent itemsets from transactions in database
D have been found, it is straightforward to generate
strong association rules from them (where strong
association rules satisfy both minimum support and
minimum confidence)

50
 Confidence of a rule A ⇒ B: confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)
 Based on this equation, association rules can be generated as follows: for each frequent itemset l, generate all nonempty proper subsets of l; for every such subset s, output the rule s ⇒ (l − s) if support_count(l) / support_count(s) ≥ min_conf

51
For example, the frequent itemset X = {I1, I2, I5} yields six candidate rules: I1 ∧ I2 ⇒ I5 (confidence 2/4 = 50%), I1 ∧ I5 ⇒ I2 (2/2 = 100%), I2 ∧ I5 ⇒ I1 (2/2 = 100%), I1 ⇒ I2 ∧ I5 (2/6 = 33%), I2 ⇒ I1 ∧ I5 (2/7 = 29%), and I5 ⇒ I1 ∧ I2 (2/2 = 100%). If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong.
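A small Python sketch of this generation step (illustrative; it assumes the nine-transaction AllElectronics data used later in these slides and the frequent itemset X = {I1, I2, I5}):

from itertools import combinations

transactions = [
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
    {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"},
]

def count(itemset):
    return sum(itemset <= t for t in transactions)

def rules_from(l, min_conf):
    # For each nonempty proper subset s of l, output s => (l - s) if its confidence >= min_conf
    l = frozenset(l)
    out = []
    for k in range(1, len(l)):
        for s in map(frozenset, combinations(sorted(l), k)):
            conf = count(l) / count(s)
            if conf >= min_conf:
                out.append((sorted(s), sorted(l - s), conf))
    return out

for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, min_conf=0.70):
    print(lhs, "=>", rhs, f"{conf:.0%}")
# ['I5'] => ['I1', 'I2'] 100%;  ['I1', 'I5'] => ['I2'] 100%;  ['I2', 'I5'] => ['I1'] 100%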

52
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
53
Implementation of Apriori
 How to generate candidates?
 Step 1: self-joining Lk

Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace

Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
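This example can be checked against the apriori_gen sketch given after the pseudocode above (again illustrative): abcd survives the join and the prune, while acde is pruned because ade is not in L3.

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))   # [('a', 'b', 'c', 'd')]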
54
How to Count Supports of Candidates?

 Why counting supports of candidates a problem?


 The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
 Candidate itemsets are stored in a hash-tree
 Leaf node of hash-tree contains a list of itemsets
and counts
 Interior node contains a hash table
 Subset function: finds all the candidates
contained in a transaction

55
Counting Supports of Candidates Using a Hash Tree

(Figure: a hash tree storing candidate 3-itemsets such as {1,4,5}, {1,2,4}, {4,5,7}, {1,2,5}, {4,5,8}, {1,5,9}, {1,3,6}, {2,3,4}, {5,6,7}, {3,4,5}, {3,5,6}, {3,5,7}, {6,8,9}, {3,6,7}, {3,6,8}. Items hash into the branches 1,4,7 / 2,5,8 / 3,6,9, and the subset function walks the tree to find all candidates contained in the transaction 1 2 3 5 6.)
56
Candidate Generation: An SQL
Implementation
 SQL Implementation of candidate generation
 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
 Use object-relational extensions like UDFs, BLOBs, and Table functions
for efficient implementation [See: S. Sarawagi, S. Thomas, and R.
Agrawal. Integrating association rule mining with relational database
systems: Alternatives and implications. SIGMOD’98]
57
Scalable Frequent Itemset Mining
Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data

Format

 Mining Closed Frequent Patterns and Max-Patterns


58
Further Improvement of the Apriori Method

 Major computational challenges


 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for
candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
59
Improving the Efficiency of Apriori

Apriori faces challenges due to


 Large candidate itemsets
 Multiple database scans
 High computational complexity

How can we further improve the efficiency of Apriori-


based mining?
Hash-based technique

Transaction reduction

Partitioning

Sampling

Dynamic itemset counting



60
Hash-based technique

What is the Hash-Based Technique?

A method to reduce the number of candidate


itemsets by using a hash table

Helps in pruning infrequent itemsets early in the


process

Key Idea:

Use a hash function to map itemsets into buckets


 If a bucket’s count is below the support threshold, all of the itemsets hashed to that bucket can be pruned, since none of them can be frequent

61
Hash-based technique

62
Hash-based technique

TID    List of items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Min. Support Count = 3
Order of items: I1 = 1, I2 = 2, I3 = 3, I4 = 4, I5 = 5
63
Hash-based technique

C1
Itemset  Count
I1       6
I2       7
I3       6
I4       2
I5       2

H(x, y) = ((order of x) * 10 + (order of y)) mod 7
64
Hash-based technique

Itemset   Count   Hash function
I1, I2    4       (1 * 10 + 2) mod 7 = 5
I1, I3    4       (1 * 10 + 3) mod 7 = 6
I1, I4    1       (1 * 10 + 4) mod 7 = 0
I1, I5    2       (1 * 10 + 5) mod 7 = 1
I2, I3    4       (2 * 10 + 3) mod 7 = 2
I2, I4    2       (2 * 10 + 4) mod 7 = 3
I2, I5    2       (2 * 10 + 5) mod 7 = 4
I3, I4    0       --
I3, I5    1       (3 * 10 + 5) mod 7 = 0
65
Hash-based technique

Hash table
Bucket address    0          1          2          3          4          5          6
Bucket count      2          2          4          2          2          4          4
Bucket contents   {I1, I4}   {I1, I5}   {I2, I3}   {I2, I4}   {I2, I5}   {I1, I2}   {I1, I3}
                  {I3, I5}   {I1, I5}   {I2, I3}   {I2, I4}   {I2, I5}   {I1, I2}   {I1, I3}
                                        {I2, I3}                         {I1, I2}   {I1, I3}
                                        {I2, I3}                         {I1, I2}   {I1, I3}
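The bucket counts above can be reproduced with a short Python sketch (illustrative; it uses the hash function defined two slides earlier). With the minimum support count of 3, buckets 0, 1, 3, and 4 fall below the threshold, so the 2-itemsets hashed there ({I1, I4}, {I3, I5}, {I1, I5}, {I2, I4}, {I2, I5}) are pruned from C2; only {I2, I3}, {I1, I2}, and {I1, I3} remain as candidates.

from itertools import combinations

transactions = [
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
    {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"},
]
order = {"I1": 1, "I2": 2, "I3": 3, "I4": 4, "I5": 5}

def bucket(x, y):
    # H(x, y) = ((order of x) * 10 + (order of y)) mod 7, with order(x) < order(y)
    return (order[x] * 10 + order[y]) % 7

counts = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t, key=order.get), 2):   # every 2-itemset in the transaction
        counts[bucket(x, y)] += 1

print(counts)   # [2, 2, 4, 2, 2, 4, 4] -- the bucket counts in the hash table above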

66
Transaction reduction

 A transaction that does not contain any frequent k-


itemsets cannot contain any frequent (k + 1)-
itemsets

Key Idea:

 If a transaction does not contribute to the


generation of frequent itemsets, it can be safely
ignored

67
Transaction reduction

Transactions   Items
T1             I1, I2, I5
T2             I2, I3, I4
T3             I3, I4
T4             I1, I2, I3, I4

Min. Support Count = 2

      I1   I2   I3   I4   I5
T1    1    1    0    0    1
T2    0    1    1    1    0
T3    0    0    1    1    0
T4    1    1    1    1    0
68
Transaction reduction

      I1   I2   I3   I4   I5
T1    1    1    0    0    1
T2    0    1    1    1    0
T3    0    0    1    1    0
T4    1    1    1    1    0

I5 is infrequent (count 1 < 2), so its column is dropped:

      I1   I2   I3   I4
T1    1    1    0    0
T2    0    1    1    1
T3    0    0    1    1
T4    1    1    1    1
69
Transaction reduction

      I1,I2   I1,I3   I1,I4   I2,I3   I2,I4   I3,I4
T1    1       0       0       0       0       0
T2    0       0       0       1       1       1
T3    0       0       0       0       0       1
T4    1       1       1       1       1       1

The frequent 2-itemsets are I1,I2; I2,I3; I2,I4; and I3,I4. A transaction can contribute to a frequent 3-itemset only if it contains at least three frequent 2-itemsets, so T1 and T3 are removed from further scans:

      I1,I2   I2,I3   I2,I4   I3,I4
T2    0       1       1       1
T4    1       1       1       1
...
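A minimal Python sketch of the reduction step (illustrative): after counting the frequent 2-itemsets, any transaction that cannot possibly contain a frequent 3-itemset is dropped before the next scan.

from itertools import combinations

transactions = [{"I1","I2","I5"}, {"I2","I3","I4"}, {"I3","I4"}, {"I1","I2","I3","I4"}]
min_count = 2

def frequent_k(db, k):
    counts = {}
    for t in db:
        for c in combinations(sorted(t), k):
            counts[c] = counts.get(c, 0) + 1
    return {c for c, n in counts.items() if n >= min_count}

L2 = frequent_k(transactions, 2)
# keep only transactions containing at least three frequent 2-itemsets
reduced = [t for t in transactions
           if sum(c in L2 for c in combinations(sorted(t), 2)) >= 3]

print(len(transactions), "->", len(reduced))   # 4 -> 2 (T1 and T3 are dropped)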

70
Partitioning

 A partitioning technique can be used that requires


just two database scans to mine the frequent
itemsets
 It consists of two phases, Phase I and Phase II
 Partitioning is a technique that divides the dataset
into smaller partitions
 Each partition is processed independently to find
local frequent itemsets
These local frequent itemsets are then combined to form the global candidate itemsets, and a second scan of the full database determines which candidates are actually frequent (any itemset that is frequent in D must be frequent in at least one partition)
71
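A rough Python sketch of the two-phase idea (illustrative only; a real implementation streams each partition from disk and uses a proper local miner rather than brute-force enumeration):

from itertools import combinations

def local_frequent(partition, min_rel_support):
    # Phase I: all itemsets that are frequent within this single partition
    counts = {}
    for t in partition:
        for k in range(1, len(t) + 1):
            for c in combinations(sorted(t), k):
                counts[c] = counts.get(c, 0) + 1
    threshold = min_rel_support * len(partition)
    return {c for c, n in counts.items() if n >= threshold}

def partition_mine(db, n_partitions, min_rel_support):
    size = -(-len(db) // n_partitions)                 # ceiling division
    partitions = [db[i:i + size] for i in range(0, len(db), size)]
    # union of local frequent itemsets = global candidate itemsets
    candidates = set().union(*(local_frequent(p, min_rel_support) for p in partitions))
    # Phase II: one scan of the whole database to count the candidates
    return {c for c in candidates
            if sum(set(c) <= t for t in db) >= min_rel_support * len(db)}

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
print(sorted(partition_mine(db, n_partitions=2, min_rel_support=0.5), key=len))
# the same frequent itemsets Apriori finds: A, B, C, E, AC, BC, BE, CE, and BCE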
Sampling

 The basic idea of the sampling approach is to pick a


random sample S of the given data D, and then
search for frequent itemsets in S instead of D

 We trade off some degree of accuracy against


efficiency

 Because we are searching for frequent itemsets in S rather than in D, it is possible that we will miss some of the global frequent itemsets. To reduce this possibility, a support threshold lower than the minimum support is used when mining S
74
DIC -Dynamic Itemset
Counting

 Alternative to Apriori Itemset Generation

 Itemsets are dynamically added and


deleted as transactions are read

 Relies on the fact that for an itemset to be


frequent, all of its subsets must also be
frequent, so we only examine those
itemsets whose subsets are all frequent
Itemset lattices
 Itemset lattices: An itemset lattice contains all
of the possible itemsets for a transaction
database. Each itemset in the lattice points to all
of its supersets. When represented graphically, a
itemset lattice can help us to understand the
concepts behind the DIC algorithm.

 Example: minsupp = 25% and M = 2.


DIC Algorithm:
 Mark the empty itemset with a solid square. Mark all
the 1-itemsets with dashed circles and Assign
counter. Leave all other itemsets unmarked.
 While any dashed itemsets remain:

Read M transactions (if we reach the end of the
transaction file, continue from the beginning). For
each transaction, increment the respective
counters for the itemsets that appear in the
transaction and are marked with dashes.

If a dashed circle's count exceeds minsupp, turn
it into a dashed square. If any immediate
superset of it has all of its subsets as solid or
dashed squares, add a new counter for it and
make it a dashed circle.

Once a dashed itemset has been counted
through all the transactions, make it solid and
stop counting it.
DIC - transactions
 Itemset lattice for the
above transaction
database:
 Itemset lattice before
any transactions are
read:

 Counters: A = 0, B = 0, C = 0
Empty itemset is marked with a solid
box. All 1-itemsets are marked with
dashed circles.
DIC - transactions

 After M transactions are read:
   Counters: A = 2, B = 1, C = 0, AB = 0
   We change A and B to dashed boxes because their counters are greater than minsup (1), and add a counter for AB because both of its subsets are boxes.

 After 2M transactions are read:
   Counters: A = 2, B = 2, C = 1, AB = 0, AC = 0, BC = 0
   C changes to a square because its counter is greater than minsup. A, B and C have been counted all the way through, so we stop counting them and make their boxes solid. Add counters for AC and BC because their subsets are all boxes.
DIC - transactions

 After 3M transactions are read:
   Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 1
   AB has been counted all the way through and its counter satisfies minsup, so we change it to a solid box. BC changes to a dashed box.

 After 4M transactions are read:
   Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 0
   AC and BC are counted all the way through. We do not count ABC because one of its subsets is a circle. There are no dashed itemsets left, so the algorithm is done.
Frequent Pattern Growth
 Apriori candidate generate-and-test method significantly reduces the size

of candidate sets, leading to good performance gain. However, it can

suffer from two nontrivial costs:

 It may still need to generate a huge number of candidate sets. For example, if there are 10^4 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets.

 It may need to repeatedly scan the whole database and check a large

set of candidates by pattern matching. It is costly to go over each

transaction in the database to determine the support of the candidate

itemsets.

“Can we design a method that mines the complete set of frequent

itemsets without such a costly candidate generation process?”


Frequent Pattern Growth
 The FP-Growth Algorithm (Frequent Pattern Growth) is
a powerful alternative to Apriori for mining frequent
itemsets. A frequent pattern is generated without the
need for candidate generation. FP growth algorithm
represents the database in the form of a tree called a
frequent pattern tree or FP tree.

 This tree structure will maintain the association


between the itemsets. The database is fragmented
using one frequent item. This fragmented part is
called “pattern fragment”. The itemsets of these
fragmented patterns are analyzed. Thus with this
method, the search for frequent itemsets is reduced
comparatively.
FP-growth (finding frequent itemsets without candidate generation)

Min support count = 2

Tid    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

• Scan the DB once, as in Apriori
• Derive the set of frequent items (1-itemsets) and their support counts

item  freq          item  freq (sorted)
I1    6             I2    7
I2    7             I1    6
I3    6             I3    6
I4    2             I4    2
I5    2             I5    2

 The set of frequent items is sorted in descending order of support count
FP-growth (finding frequent itemsets without candidate generation)

Transactions with items reordered by descending support count:

Tid    Items (ordered)       item  freq
T100   I2, I1, I5            I2    7
T200   I2, I4                I1    6
T300   I2, I3                I3    6
T400   I2, I1, I4            I4    2
T500   I1, I3                I5    2
T600   I2, I3
T700   I1, I3
T800   I2, I1, I3, I5
T900   I2, I1, I3

(Figure: the FP-tree built by inserting these ordered transactions under a null root, with a header table linking the occurrences of each item.)
FP-Growth Method: Construction of
FP-Tree
Mining Frequent Patterns (bottom-up
approach)
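A compact Python sketch of the FP-tree construction step (illustrative; the class and field names are ours). It rebuilds the tree for the nine transactions above: item counts are taken in one scan, infrequent items are dropped, each transaction is reordered by descending support, and the ordered transactions are inserted into a prefix tree whose header table links all nodes holding the same item.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(transactions, min_count):
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}          # drop infrequent items
    root = FPNode(None, None)
    header = {i: [] for i in freq}                                     # item -> node links
    for t in transactions:
        # keep frequent items only, ordered by descending support count
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1                         # shared prefix: bump the count
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)                             # maintain the header-table link
            node = node.children[item]
    return root, header

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

transactions = [
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
    {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"},
]
root, header = build_fp_tree(transactions, min_count=2)
show(root)   # prints branches such as I2:7 -> I1:4 -> I5:1 and I1:2 -> I3:2

Mining then proceeds bottom-up by building a conditional pattern base and a conditional FP-tree for each item, which is the step the slide title above refers to.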
Why Frequent Pattern Growth
Fast ?
 Performance study shows :
 FP-growth is an order of magnitude faster
than Apriori, and is also faster than tree-
projection
 Reasoning :
 No candidate generation, no candidate test
 Use compact data structure
 Eliminate repeated database scan
 Basic operation is counting and FP-tree
building
Challenges in FP Growth
 Complexity in tree construction for
large datasets.
 FP-Tree is not dynamic (must rebuild if
data changes).
 Deep recursive calls (can lead to
memory issues).

94
Mining Frequent Itemsets Using
the Vertical Data Format

95
Horizontal vs. Vertical Data Format
 Horizontal Format: Transactions are stored as rows, each
containing a set of items.
 Vertical Format: Each item is stored with its corresponding
transaction IDs (TIDs).

 Example:
 Horizontal Format:

T1: {A, B, C}
T2: {A, C, D}

 Vertical Format:
A: {T1, T2}
B: {T1}
C: {T1, T2}
D: {T2}
96
Advantages of Vertical Data
Format

 Reduces the need for multiple database scans.


 Frequent itemset mining is performed using TID
Set Intersections.
 Faster support count computation.
 Easier application of the Apriori Property.
 More efficient for dense datasets.

97
Apriori Property in Vertical Format
 A candidate k+1 itemset is generated
only if all its k-itemset subsets are
frequent.

 Example:
 {I1, I2, I3} is a candidate because {I1,

I2}, {I1, I3}, and {I2, I3} are frequent.

98
Steps with Example

Min supp = 2

The transaction data in vertical data format (item → TID set):

Itemset   TID set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}
There are 10 intersections performed in total, which lead to eight nonempty 2-itemsets, as shown below. Notice that because the itemsets {I1, I4} and {I3, I5} each contain only one transaction, they do not belong to the set of frequent 2-itemsets.

Itemset     TID set
{I1, I2}    {T100, T400, T800, T900}
{I1, I3}    {T500, T700, T800, T900}
{I1, I4}    {T400}
{I1, I5}    {T100, T800}
{I2, I3}    {T300, T600, T800, T900}
{I2, I4}    {T200, T400}
{I2, I5}    {T100, T800}
{I3, I5}    {T800}
101
Based on the Apriori property, a given 3-itemset is a candidate 3-
itemset only if every one of its 2-itemset subsets is frequent.
The candidate generation process here will generate only two 3-itemsets: {I1, I2, I3} and
{I1, I2, I5}.
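A short Python sketch of vertical-format mining (illustrative; it uses the same nine transactions). Support counting is just the size of a TID-set intersection, and a (k+1)-itemset is kept as a candidate only if all of its k-subsets are frequent:

from itertools import combinations

horizontal = {
    "T100": {"I1","I2","I5"}, "T200": {"I2","I4"}, "T300": {"I2","I3"},
    "T400": {"I1","I2","I4"}, "T500": {"I1","I3"}, "T600": {"I2","I3"},
    "T700": {"I1","I3"}, "T800": {"I1","I2","I3","I5"}, "T900": {"I1","I2","I3"},
}
min_count = 2

# Convert to vertical data format: item -> set of TIDs
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

L = {frozenset([i]): tids for i, tids in vertical.items() if len(tids) >= min_count}
level = 1
while L:
    print(f"frequent {level}-itemsets:", sorted(tuple(sorted(s)) for s in L))
    next_level = {}
    for (s1, t1), (s2, t2) in combinations(L.items(), 2):
        union = s1 | s2
        if len(union) == level + 1:                     # the two k-itemsets share k-1 items
            tids = t1 & t2                              # TID-set intersection = support counting
            if len(tids) >= min_count and all(frozenset(c) in L for c in combinations(union, level)):
                next_level[union] = tids
    L, level = next_level, level + 1
# last level printed: frequent 3-itemsets: [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]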

102
Optimization Techniques
 Diffsets: Store only differences in transaction IDs
instead of full sets.

Example:
 I1: {T100, T400, T500, T700, T800, T900}

 I1 ∩ I2: {T100, T400, T800, T900}

 Diffset: {T500, T700}

 Reduces memory usage and computational cost.
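A tiny sketch of the diffset idea (illustrative): rather than carrying the full TID set of {I1, I2}, store only the TIDs of I1 that disappear when I2 is added; the support count follows by subtraction.

tids_I1    = {"T100", "T400", "T500", "T700", "T800", "T900"}
tids_I1_I2 = {"T100", "T400", "T800", "T900"}

diffset = tids_I1 - tids_I1_I2
print(diffset)                          # {'T500', 'T700'}
print(len(tids_I1) - len(diffset))      # 4 = support count of {I1, I2}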

103
Summary

 Vertical format mining avoids full database scans


by using TID Set intersections.
 Frequent itemsets are mined efficiently using
the Apriori Property.
 Diffset technique further reduces the storage
and computation overhead.
 Ideal for dense datasets with many frequent
patterns.

104
