Chapter 5
Data Mining: Concepts and Techniques
Chapter 5: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
What Is Frequent Pattern Analysis?
Imagine that you are a sales manager at AllElectronics, and you are
talking to a customer who recently bought a PC and a digital camera
from the store. What should you recommend to her next?
Frequent patterns and association rules are the knowledge that you want to mine in such a scenario.
Importance:
Identifying relationships between purchases
Enhancing product recommendations
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, a subsequence, a substructure, etc.) that occurs frequently in a data set.
Market Basket Analysis: A Motivating Example
Suppose, as manager of an AllElectronics branch, you would like to
learn more about the buying habits of your customers
Market basket analysis may help you design different store layouts
Basic Concepts: Frequent Patterns

Transaction ID   Items
1                {A, C, D}
2                {B, C, E}
3                {A, B, C, E}
4                {B, E}
5                {A, B, C, E}
Basic Concepts: Frequent Patterns
Let min support = 2/5.

Itemset  Support  F/I     Itemset  Support  F/I
A        3/5      F       ABC      2/5      F
B        4/5      F       ABD      0/5      I
C        4/5      F       ABE      2/5      F
D        1/5      I       ACD      1/5      I
E        4/5      F       ACE      2/5      F
AB       2/5      F       ADE      0/5      I
AC       3/5      F       BCD      0/5      I
AD       1/5      I       BCE      3/5      F
AE       2/5      F       BDE      0/5      I
BC       3/5      F       CDE      0/5      I
BD       0/5      I       ABCD     0/5      I
BE       4/5      F       ABCE     2/5      F
CD       1/5      I       ABDE     0/5      I
CE       3/5      F       ACDE     0/5      I
DE       0/5      I       BCDE     0/5      I
                          ABCDE    0/5      I
Basic Concepts: Frequent Patterns
Frequent itemsets = 15
Infrequent itemsets = 16
Basic Concepts: Association Rules

Tid   Items bought
10    Bread, Nuts, Diaper
20    Bread, Coffee, Diaper
30    Bread, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X => Y with minimum support and confidence:
support, s: the probability that a transaction contains X union Y
confidence, c: the conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%.
Frequent patterns: Bread:3, Nuts:3, Diaper:4, Eggs:3, {Bread, Diaper}:3
Association rules (many more exist!):
Bread => Diaper (support 60%, confidence 100%)
Diaper => Bread (support 60%, confidence 75%)
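As a quick illustration of these definitions, the following Python sketch recomputes the support and confidence of the two rules above from the five transactions in the table; the helper functions support and confidence are our own names, not from any particular library.

transactions = [
    {"Bread", "Nuts", "Diaper"},                       # Tid 10
    {"Bread", "Coffee", "Diaper"},                      # Tid 20
    {"Bread", "Diaper", "Eggs"},                        # Tid 30
    {"Nuts", "Eggs", "Milk"},                           # Tid 40
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},       # Tid 50
]

def support(itemset, db):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    # Conditional probability that a transaction containing lhs also contains rhs.
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"Bread", "Diaper"}, transactions))       # 0.6  -> s = 60%
print(confidence({"Bread"}, {"Diaper"}, transactions))   # 1.0  -> c = 100%
print(confidence({"Diaper"}, {"Bread"}, transactions))   # 0.75 -> c = 75%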
Association Rule Mining Process
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count (min_sup).
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy both minimum support and minimum confidence.
The Downward Closure Property and Scalable Mining Methods
The downward closure property of frequent patterns: any subset of a frequent itemset must also be frequent.
For example, if {Milk, Bread, Butter} is frequent, then so is each of its subsets:
• Milk
• Bread
• Butter
• Milk, Bread
• Milk, Butter
• Bread, Butter
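The property is easy to check empirically. The sketch below uses a few toy transactions of our own (purely illustrative) and verifies that every subset of {Milk, Bread, Butter} occurs at least as often as the full itemset.

from itertools import combinations

db = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Butter", "Eggs"},
    {"Milk", "Bread"},
    {"Bread", "Butter"},
]

def count(itemset):
    return sum(set(itemset) <= t for t in db)

full = ("Milk", "Bread", "Butter")
print(full, count(full))                 # 2
for k in (1, 2):
    for sub in combinations(full, k):
        # Every subset occurs at least as often as the full itemset.
        assert count(sub) >= count(full)
        print(sub, count(sub))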
A major challenge in mining frequent itemsets from a large data set is that such mining often generates a huge number of itemsets satisfying min_sup, especially when min_sup is set low, because every subset of a frequent itemset is also frequent. A frequent itemset of length k has 2^k - 1 nonempty subsets; the total number of frequent itemsets that it contains is thus exponential in its length (for k = 100, about 1.27 x 10^30).
Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns. Closed and maximal patterns give a more compact representation:
An itemset X is a closed frequent itemset in a data set D if X is frequent and there exists no proper super-itemset Y of X with the same support as X.
An itemset X is a maximal frequent itemset in D if X is frequent and there exists no frequent super-itemset Y of X.
Maximal vs. Closed Itemsets
Every maximal frequent itemset is a closed frequent itemset, and every closed frequent itemset is frequent:
maximal frequent itemsets are contained in the closed frequent itemsets, which are contained in the frequent itemsets.
EXAMPLE
Let’s take an example. Suppose we have a database
with four customer transactions, denoted as T1, T2,
T3 and T4:
T1: {a,b,c,d}
T2: {a,b,c}
T3: {a,b,d}
T4: {a,b}
With a minimum support of two transactions (minsup = 2), the frequent itemsets are:
{a},
{b},
{c},
{d},
{a,b},
{a,c},
{a,d},
{b,c},
{b,d},
{a,b,c},
{a,b,d}
The closed itemsets are:
{a,b} (support 4),
{a,b,c} (support 2),
{a,b,d} (support 2),
{a,b,c,d} (support 1).
With minsup = 2, the frequent closed itemsets are therefore {a,b}, {a,b,c}, and {a,b,d}.
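A brute-force way to verify the closed itemsets above (practical only for a tiny database such as this one; helper names are ours) is to enumerate every itemset and keep those that have no proper superset with the same support.

from itertools import combinations

db = [{"a","b","c","d"}, {"a","b","c"}, {"a","b","d"}, {"a","b"}]   # T1..T4
items = sorted(set().union(*db))

def support(itemset):
    return sum(itemset <= t for t in db)

# Enumerate every nonempty itemset that occurs at least once, with its support.
all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]
supp = {s: support(s) for s in all_itemsets if support(s) > 0}

# Closed: no proper superset has the same support.
closed = [s for s in supp
          if not any(s < t and supp[t] == supp[s] for t in supp)]
for s in sorted(closed, key=len):
    print(set(s), supp[s])    # {a,b}:4, {a,b,c}:2, {a,b,d}:2, {a,b,c,d}:1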
Find the Maximal Frequent Itemsets

Transaction ID   Items
1                ABCD
2                ABD
3                ACD
4                BCD
Itemset   Support count (number of transactions containing the itemset)
A         3
B         3
C         3
D         4
AB        2
AC        2
AD        3
BC        2
BD        3
CD        3
ABC       1
ABD       2
ACD       2
BCD       2
ABCD      1
With a minimum support count of 2, the frequent itemsets are A, B, C, D, AB, AC, AD, BC, BD, CD, ABD, ACD, and BCD. The maximal frequent itemsets are ABD, ACD, and BCD, since none of them has a frequent superset (ABCD appears only once).
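To double-check that answer, the following brute-force sketch (toy data only; a minimum support count of 2 is assumed, consistent with the frequent itemsets listed above) keeps the frequent itemsets that have no frequent proper superset.

from itertools import combinations

db = [set("ABCD"), set("ABD"), set("ACD"), set("BCD")]
min_count = 2    # assumed minimum support count

items = sorted(set().union(*db))

def count(itemset):
    return sum(itemset <= t for t in db)

all_sets = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)]
frequent = {s for s in all_sets if count(s) >= min_count}

# Maximal: frequent itemsets with no frequent proper superset.
maximal = [s for s in frequent if not any(s < t for t in frequent)]
print(sorted("".join(sorted(s)) for s in maximal))    # ['ABD', 'ACD', 'BCD']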
Why Closed and Maximal Frequent Itemsets?
Closed itemsets are important because they can reduce the
number of frequent itemsets presented to the user, without
losing any information.
Frequent itemsets can be very large and redundant, especially
when the minsup is low.
By mining closed itemsets, generally only a very small set of
itemsets is obtained, and still all the other frequent itemsets can
be directly derived from the closed itemsets.
Maximal frequent itemsets provide a compact representation
of all the frequent itemsets for a given dataset and minimum
support threshold.
Also discovering maximal frequent itemsets can be faster and
require less memory and storage space than finding all the
frequent itemsets.
We can derive all the other frequent itemsets from the maximal frequent itemsets, but we may not know their exact support counts.
Chapter 5: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
Scalable Frequent Itemset Mining Methods
Apriori: a candidate generation-and-test approach
Frequent pattern growth (FP-growth): a pattern-growth approach
Mining with the vertical data format
Apriori Algorithm
Apriori employs an iterative approach known as a
level-wise search, where k-itemsets are used to
explore (k +1)-itemsets.
To improve the efficiency of the level-wise
generation of frequent itemsets, an important
property called the Apriori property is used to
reduce the search space.
Apriori property: All nonempty subsets of a
frequent itemset must also be frequent.
This property belongs to a special category of
properties called antimonotonicity in the sense
that if a set cannot pass a test, all of its
supersets will fail the same test as well.
How Is the Apriori Property Used in the Algorithm?
A two-step process is followed, consisting of join and prune actions:
Join step: Lk-1 is joined with itself to generate a set of candidate k-itemsets, Ck.
Prune step: any candidate k-itemset that has an infrequent (k-1)-subset is removed from Ck, since it cannot be frequent.
Apriori: A Candidate Generation & Test Approach
Example (final step of the level-wise search), third scan of the database: C3 = {{B, C, E}}; L3 = {{B, C, E}} with support count 2.
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
The set of frequent 1-itemsets, L1, can then be
determined. It consists of the candidate 1-itemsets
satisfying minimum support.
To discover the set of frequent 2-itemsets, L2, the
algorithm uses the join L1 ⋈ L1 to generate a candidate
set of 2-itemsets, C2.
No candidates are removed from C2 during the prune
step because each subset of the
candidates is also frequent.
The set of frequent 2-itemsets, L2, is then determined,
consisting of those candidate 2-itemsets in C2 having
minimum support.
The generation of the set of the candidate 3-itemsets, C3.
From the join step, we first get C3 =L2 ⋈ L2 ={{I1, I2, I3},
{I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Based on the Apriori property that all subsets of a
frequent itemset must also be frequent, we can determine
that the four latter candidates cannot possibly be
frequent.
We therefore remove them from C3, thereby saving the
effort of unnecessarily obtaining their counts during the
subsequent scan of D to determine L3.
The transactions in D are scanned to determine L3,
consisting of those candidate 3-itemsets in C3 having
minimum support
The algorithm uses L3 ⋈ L3 to generate a candidate set
of 4-itemsets, C4. Although the join results in {{I1, I2, I3,
I5}}, itemset {I1, I2, I3, I5} is pruned because its subset
{I2, I3, I5} is not frequent. Thus, C4 = {}, and the algorithm
terminates, having found all of the frequent itemsets.
Generating Association Rules from Frequent
Itemsets
Once the frequent itemsets from transactions in database
D have been found, it is straightforward to generate
strong association rules from them (where strong
association rules satisfy both minimum support and
minimum confidence)
Based on the confidence equation, confidence(A => B) = support_count(A union B) / support_count(A), association rules can be generated as follows. Consider the frequent itemset X = {I1, I2, I5}; its nonempty proper subsets yield the candidate rules:
I1 ∧ I2 => I5 (confidence 2/4 = 50%)
I1 ∧ I5 => I2 (confidence 2/2 = 100%)
I2 ∧ I5 => I1 (confidence 2/2 = 100%)
I1 => I2 ∧ I5 (confidence 2/6 = 33%)
I2 => I1 ∧ I5 (confidence 2/7 = 29%)
I5 => I1 ∧ I2 (confidence 2/2 = 100%)
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules are output, because these are the only ones generated that are strong.
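The rule-generation step translates into a few lines of Python: for each frequent itemset, every nonempty proper subset becomes a candidate antecedent, and the rule is kept if its confidence meets the threshold. The support counts below are derived from the example database; the function name rules_from is ours.

from itertools import combinations

# Support counts from the example database D.
supp = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

def rules_from(itemset, min_conf):
    itemset = frozenset(itemset)
    for k in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, k)):
            rhs = itemset - lhs
            conf = supp[itemset] / supp[lhs]      # confidence(lhs => rhs)
            if conf >= min_conf:
                yield set(lhs), set(rhs), conf

for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, min_conf=0.70):
    print(lhs, "=>", rhs, "confidence = {:.0%}".format(conf))
# Prints the three strong rules (each with confidence 100%):
# {I5} => {I1, I2}, {I1, I5} => {I2}, {I2, I5} => {I1}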
The Apriori Algorithm (Pseudo-Code)

L1 = {frequent items};
for (k = 1; Lk != {}; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return the union of all Lk;
Implementation of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
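The self-join and prune steps translate directly into Python. The sketch below (the function name apriori_gen follows the textbook's name for the procedure; the code itself is our own sketch) reproduces the example: from L3 = {abc, abd, acd, ace, bcd} the join produces abcd and acde, and the prune step removes acde, leaving C4 = {abcd}.

from itertools import combinations

def apriori_gen(Lk):
    # Generate C(k+1) from the frequent k-itemsets Lk.
    Lk = sorted(tuple(sorted(s)) for s in Lk)
    candidates = []
    # Join step: merge itemsets that agree on their first k-1 items.
    for i, p in enumerate(Lk):
        for q in Lk[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    # Prune step: drop candidates having an infrequent k-subset.
    Lk_set = set(Lk)
    return [c for c in candidates
            if all(s in Lk_set for s in combinations(c, len(c) - 1))]

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))    # [('a', 'b', 'c', 'd')]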
How to Count Supports of Candidates?
The total number of candidates can be huge and a single transaction may contain many candidates, so candidate itemsets are typically stored in a hash tree, and a subset function finds all candidates contained in a given transaction.
Counting Supports of Candidates Using a Hash Tree
(Figure: candidate 3-itemsets such as 1 4 5, 1 2 4, 4 5 7, 1 2 5, 4 5 8, 1 5 9, 1 3 6, 2 3 4, 5 6 7, 3 4 5, 3 5 6, 3 5 7, 6 8 9, 3 6 7, 3 6 8 are stored in the leaves of a hash tree, with items hashed into the branches 1,4,7 / 2,5,8 / 3,6,9 at each level. The subset function decomposes the transaction 1 2 3 5 6 into 1+2356, 12+356, 13+56, and so on, so that only the leaves whose candidates could be contained in the transaction are visited.)
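The hash-tree data structure is hard to show compactly, so the sketch below counts candidate supports the naive way, by testing each candidate against each transaction. It is a functional stand-in for the subset function above, not the hash-tree implementation itself; the data and names are illustrative.

def count_supports(candidates, db):
    # Return a dict mapping each candidate itemset to its support count.
    counts = {frozenset(c): 0 for c in candidates}
    for transaction in db:
        t = set(transaction)
        for c in counts:
            if c <= t:            # candidate is contained in the transaction
                counts[c] += 1
    return counts

db = [{1, 2, 3, 5, 6}, {1, 4, 5}, {2, 3, 5}]      # toy transactions
C3 = [{1, 2, 3}, {2, 3, 5}, {1, 3, 5}]            # toy candidate 3-itemsets
print(count_supports(C3, db))    # {1,2,3}: 1, {2,3,5}: 2, {1,3,5}: 1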
Candidate Generation: An SQL Implementation
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and ... and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1;

Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck;

Use object-relational extensions like UDFs, BLOBs, and table functions for efficient implementation. [See: S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD '98.]
Scalable Frequent Itemset Mining Methods
Improving the efficiency of Apriori, several variations reduce the number of scans or candidates:
Hash-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting (DIC)
Hash-based technique
Key idea: while scanning the database to count the candidate 1-itemsets in C1, also hash every 2-itemset of each transaction into a bucket of a hash table and increment the bucket count. A 2-itemset whose bucket count is below the minimum support cannot be frequent, so it is removed from the candidate set C2.
Hash-based technique: example

C1:
Itemset   Count
I1        6
I2        7
I3        6
I4        2
I5        2

Hash table for the 2-itemsets of all transactions:
Bucket 0 (count 2): {I1, I4}, {I3, I5}
Bucket 1 (count 2): {I1, I5}, {I1, I5}
Bucket 2 (count 4): {I2, I3}, {I2, I3}, {I2, I3}, {I2, I3}
Bucket 3 (count 2): {I2, I4}, {I2, I4}
Bucket 4 (count 2): {I2, I5}, {I2, I5}
Bucket 5 (count 4): {I1, I2}, {I1, I2}, {I1, I2}, {I1, I2}
Bucket 6 (count 4): {I1, I3}, {I1, I3}, {I1, I3}, {I1, I3}
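The bucket counting can be sketched as follows. The hash function, h(x, y) = (10 * order(x) + order(y)) mod 7, is an assumption on our part (it is the one commonly used for this example and it reproduces the bucket counts in the table above); the rest of the code simply hashes every 2-itemset of every transaction while the first scan runs.

from itertools import combinations

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"],
      ["I1","I3"], ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"],
      ["I1","I2","I3"]]
order = {"I1": 1, "I2": 2, "I3": 3, "I4": 4, "I5": 5}

def bucket(x, y):
    # Assumed hash function: h(x, y) = (10 * order(x) + order(y)) mod 7.
    x, y = sorted((x, y), key=order.get)
    return (10 * order[x] + order[y]) % 7

bucket_count = [0] * 7
for t in db:
    for x, y in combinations(sorted(t, key=order.get), 2):
        bucket_count[bucket(x, y)] += 1
print(bucket_count)    # [2, 2, 4, 2, 2, 4, 4], matching the table above

# A 2-itemset can be frequent only if its bucket count reaches min_sup,
# so low-count buckets let us prune candidates before the second scan.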
Transaction reduction
Key idea: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be marked or removed from further database scans.
Transaction reduction: example (min. support count = 2)

Transaction   Items
T1            I1, I2, I5
T2            I2, I3, I4
T3            I3, I4
T4            I1, I2, I3, I4

As a bit matrix:
      I1   I2   I3   I4   I5
T1    1    1    0    0    1
T2    0    1    1    1    0
T3    0    0    1    1    0
T4    1    1    1    1    0
After the first scan, I5 is found in only one transaction (below the minimum support count of 2), so it is dropped, shrinking the data that later scans must process:
      I1   I2   I3   I4
T1    1    1    0    0
T2    0    1    1    1
T3    0    0    1    1
T4    1    1    1    1
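A minimal sketch of the key idea stated above (helper names are ours): after the frequent k-itemsets are known, transactions that contain none of them are skipped when counting (k+1)-itemsets. It is shown on the same four transactions, with the single frequent 3-itemset at min. support count 2.

from itertools import combinations

def reduce_db(db, Lk, k):
    # Keep only transactions containing at least one frequent k-itemset;
    # the others cannot contain any frequent (k+1)-itemset.
    Lk = {frozenset(s) for s in Lk}
    return [t for t in db
            if any(frozenset(c) in Lk for c in combinations(sorted(t), k))]

db = [{"I1","I2","I5"}, {"I2","I3","I4"}, {"I3","I4"}, {"I1","I2","I3","I4"}]
L3 = [{"I2","I3","I4"}]            # the only frequent 3-itemset at min count 2
print(reduce_db(db, L3, 3))        # only T2 and T4 remain for the next scan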
Partitioning
Key idea: any itemset that is frequent in the whole database D must be frequent in at least one of the partitions of D (with the same fractional minimum support). In a first scan, D is divided into partitions that fit in memory and the local frequent itemsets of each partition are mined; in a second scan, the actual support of these candidates is counted over all of D. Only two database scans are needed.
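A sketch of the two-phase idea in Python, assuming a helper local_frequent_itemsets that mines one in-memory partition (here a brute-force toy miner; any frequent-itemset miner would do). Phase 1 collects local frequent itemsets from each partition; phase 2 rescans the full database once to keep the globally frequent ones.

from itertools import combinations

def local_frequent_itemsets(partition, min_fraction):
    # Toy in-memory miner: brute force over a small partition.
    items = sorted(set().union(*partition))
    out = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(set(c) <= t for t in partition) >= min_fraction * len(partition):
                out.add(frozenset(c))
    return out

def partition_apriori(db, n_parts, min_fraction):
    size = (len(db) + n_parts - 1) // n_parts
    candidates = set()
    for i in range(0, len(db), size):                  # phase 1: local mining
        candidates |= local_frequent_itemsets(db[i:i + size], min_fraction)
    # Phase 2: one full scan keeps only the globally frequent candidates.
    return {c for c in candidates
            if sum(c <= t for t in db) >= min_fraction * len(db)}

db = [{"a","b"}, {"a","c"}, {"b","c"}, {"a","b","c"}]
print(partition_apriori(db, n_parts=2, min_fraction=0.5))
# the six globally frequent itemsets: a, b, c, ab, ac, bc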
Sampling
Key idea: mine a random sample S of D, using a support threshold lower than min_sup to reduce the chance of missing globally frequent itemsets, and then verify the result against the full database.

Dynamic itemset counting (DIC)
Key idea: new candidate itemsets can be added at any start point during a database scan, as soon as all of their subsets are estimated to be frequent. In the itemset lattice, a box marks an itemset whose counter has reached minsup and a circle one whose counter has not; dashed shapes are still being counted, solid shapes are finished.

Example (minsup = 1). Initially, counters: A = 0, B = 0, C = 0. The empty itemset is marked with a solid box; all 1-itemsets are marked with dashed circles.
DIC: processing the transactions

After M transactions are read:
Counters: A = 2, B = 1, C = 0, AB = 0.
A and B change to dashed boxes because their counters have reached minsup (1). A counter for AB is added because both of its subsets are boxes.

After 2M transactions are read:
Counters: A = 2, B = 2, C = 1, AB = 0, AC = 0, BC = 0.
C changes to a dashed box because its counter has reached minsup. A, B, and C have now been counted all the way through, so we stop counting them and make their boxes solid. Counters for AC and BC are added because all of their subsets are boxes.

After further transactions are read:
Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 0.
AB has been counted all the way through and its counter satisfies minsup, so we change it to a solid box.

Finally:
Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 1.
BC changes to a dashed box; AC and BC are then counted all the way through. ABC is never counted because one of its subsets is a circle. There are no dashed itemsets left, so the algorithm is done.
Frequent Pattern Growth
The Apriori candidate generate-and-test method significantly reduces the size of the candidate sets. However, it may still need to scan the whole database repeatedly and check a large number of candidate itemsets. Frequent pattern growth (FP-growth) avoids explicit candidate generation altogether.
Example database D (scanned once, as in Apriori, to derive the frequent items, i.e. the 1-itemsets, and their support counts):

TID    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Item support counts: I1:6, I2:7, I3:6, I4:2, I5:2. Sorted in descending order of support, the list of frequent items is L = {I2:7, I1:6, I3:6, I4:2, I5:2}.
Each transaction's items are then reordered according to L and inserted into the FP-tree, so that transactions sharing a prefix share a path:

TID    Ordered frequent items
T100   I2, I1, I5
T200   I2, I4
T300   I2, I3
T400   I2, I1, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I2, I1, I3, I5
T900   I2, I1, I3

Resulting FP-tree (item:count; indentation shows parent-child links), with a header table {I2:7, I1:6, I3:6, I4:2, I5:2} linking all occurrences of each item:

null (root)
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2
FP-Growth Method: Construction of FP-Tree
Mining frequent patterns from the FP-tree proceeds bottom-up: starting from each frequent item (used as a suffix), build its conditional pattern base and conditional FP-tree, and mine that tree recursively, growing longer patterns by concatenation.
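The construction step can be sketched compactly in Python (the mining step is omitted; class and function names are ours): each transaction is filtered to its frequent items, sorted by descending global support, and inserted into a prefix tree whose node counts are incremented along the path.

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(db, min_count):
    freq = Counter(item for t in db for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    # Rank items by descending support (ties broken alphabetically).
    rank = {i: r for r, (i, _) in enumerate(
        sorted(freq.items(), key=lambda x: (-x[1], x[0])))}
    root = FPNode(None)
    for t in db:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + "{}:{}".format(child.item, child.count))
        show(child, depth + 1)

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
      ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
show(build_fp_tree(db, min_count=2))
# Prints the tree sketched above: I2:7 (with I1:4, I4:1, I3:2 below it) and I1:2 -> I3:2.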
Why Is Frequent Pattern Growth Fast?
Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
Reasoning:
No candidate generation, no candidate test
Uses a compact data structure (the FP-tree)
Eliminates repeated database scans
The basic operations are counting and FP-tree building
Challenges in FP Growth
Complexity in tree construction for
large datasets.
FP-Tree is not dynamic (must rebuild if
data changes).
Deep recursive calls (can lead to
memory issues).
Mining Frequent Itemsets Using the Vertical Data Format
Horizontal vs. Vertical Data Format
Horizontal Format: Transactions are stored as rows, each
containing a set of items.
Vertical Format: Each item is stored with its corresponding
transaction IDs (TIDs).
Example:
Horizontal Format:
T1: {A, B, C}
T2: {A, C, D}
Vertical Format:
A: {T1, T2}
B: {T1}
C: {T1, T2}
D: {T2}
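This small sketch (helper structure names are ours) converts the horizontal example above into the vertical TID-set format and computes a 2-itemset's support by intersecting TID sets, which is the core operation of vertical-format (ECLAT-style) mining.

from collections import defaultdict

horizontal = {"T1": {"A", "B", "C"}, "T2": {"A", "C", "D"}}

# Horizontal -> vertical: map each item to the set of TIDs containing it.
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)
print(dict(vertical))           # A:{T1,T2}, B:{T1}, C:{T1,T2}, D:{T2}

# Support of {A, C} = size of the intersection of the two TID sets.
tids_AC = vertical["A"] & vertical["C"]
print(tids_AC, len(tids_AC))    # {'T1', 'T2'} 2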
Advantages of Vertical Data Format
The support count of an itemset is simply the length of its TID set, so counting does not require repeated scans of the database; frequent (k+1)-itemsets are found by intersecting the TID sets of frequent k-itemsets.
Apriori Property in Vertical Format
A candidate (k+1)-itemset is generated only if all of its k-itemset subsets are frequent.
Example: {I1, I2, I3} is a candidate because {I1, I2}, {I1, I3}, and {I2, I3} are all frequent.
Steps with Example (min_sup = 2)
First, the horizontal database is transformed into the vertical format by scanning it once:

Itemset   TID set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}

The candidate 2-itemsets are then obtained by intersecting the TID sets of every pair of frequent single items.
There are 10 intersections performed in total, which lead to eight nonempty 2-itemsets:
{I1, I2}: {T100, T400, T800, T900}
{I1, I3}: {T500, T700, T800, T900}
{I1, I4}: {T400}
{I1, I5}: {T100, T800}
{I2, I3}: {T300, T600, T800, T900}
{I2, I4}: {T200, T400}
{I2, I5}: {T100, T800}
{I3, I5}: {T800}
Notice that because the itemsets {I1, I4} and {I3, I5} each contain only one transaction, they do not belong to the set of frequent 2-itemsets.
Based on the Apriori property, a given 3-itemset is a candidate 3-
itemset only if every one of its 2-itemset subsets is frequent.
The candidate generation process here will generate only two 3-itemsets: {I1, I2, I3} and
{I1, I2, I5}.
Optimization Techniques
Diffsets: store only the difference between an itemset's TID set and that of its prefix, instead of the full TID sets.
Example:
I1: {T100, T400, T500, T700, T800, T900}
{I1, I2}: {T100, T400, T800, T900}
diffset({I1, I2}, {I1}) = {T500, T700}, so only two TIDs need to be stored instead of four.
Summary