
UNIT – II

Association Rules: Problem Definition, Frequent Item Set Generation, The APRIORI Principle,
Support and Confidence Measures, Association Rule Generation; APRIORI Algorithm, The
Partition Algorithms, FP-Growth Algorithms, Compact Representation of Frequent Item Set-
Maximal Frequent Item Set, Closed Frequent Item Set.

Association Analysis:

Many business enterprises accumulate large quantities of data from their day-to-day
operations. For example, huge amounts of customer purchase data are collected daily at the
checkout counters of grocery stores. The table below illustrates an example of such data, commonly
known as market basket transactions. Each row in this table corresponds to a transaction, which
contains a unique identifier labeled TID and a set of items bought by a given customer. Retailers
are interested in analyzing the data to learn about the purchasing behavior of their customers.
Such valuable information can be used to support a variety of business-related applications such
as marketing promotions, inventory management, and customer relationship management.

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Association analysis is useful for discovering interesting relationships hidden in large
data sets. The uncovered relationships can be represented in the form of association rules or sets
of frequent items. For example, the following rule can be extracted from the data set shown in
the table above:

{Diapers} → {Beer}

The rule suggests that a strong relationship exists between the sale of diapers and beer because
many customers who buy diapers also buy beer. Retailers can use this type of rule to help them
identify new opportunities for cross-selling their products to customers. Besides market
basket data, association analysis is also applicable to other application domains such as
bioinformatics, medical diagnosis, Web mining, and scientific data analysis. In the analysis of
Earth science data, for example, the association patterns may reveal interesting connections
among the ocean, land, and atmospheric processes.

There are two key issues that need to be addressed when applying association analysis to market
basket data. First, discovering patterns from a large transaction data set can be computationally
expensive. Second, some of the discovered patterns are potentially spurious because they may
happen simply by chance. The first part of the chapter explains the basic concepts of association
analysis and the algorithms used to efficiently mine such patterns. The second part of the chapter
deals with the issue of evaluating the discovered patterns in order to prevent the generation of
spurious results.

2.1 Problem Definition: The basic terminology used in association analysis is as follows.

Binary Representation: Market basket data can be represented in a binary format as shown in the
table below.

TID   Bread   Milk   Diapers   Beer   Eggs   Cola
1     1       1      0         0      0      0
2     1       0      1         1      1      0
3     0       1      1         1      0      1
4     1       1      1         1      0      0
5     1       1      1         0      0      1
Each row corresponds to a transaction and each column corresponds to an item. An item can be
treated as a binary variable whose value is one if the item is present in a transaction and zero
otherwise.

Itemset : A collection of one or more items is called itemset.

Example: {Milk, Bread, Diaper}

k-itemset : An itemset that contains k items

Support count () : Frequency of occurrence of an itemset E.g. ({Milk, Bread,Diaper}) = 2

Support : Fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset : An itemset whose support is greater than or equal to a minsup threshold.

Association Rule: An association rule is an implication expression of the form X → Y, where X
and Y are disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured
in terms of its support and confidence. Support determines how often a rule is applicable to a
given data set, while confidence determines how frequently items in Y appear in transactions that
contain X. The formal definitions of these metrics are

Support, s(X → Y) = σ(X ∪ Y) / N

Confidence, c(X → Y) = σ(X ∪ Y) / σ(X)


TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Association Rule

An implication expression of the form X → Y, where X and Y are itemsets

Example:
{Milk, Diaper} → {Beer}

Rule Evaluation Metrics

Support (s)

Fraction of transactions that contain both X and Y

Confidence (c)

Measures how often items in Y appear in transactions that contain X.

Example:

{Milk, Diaper} → {Beer}


s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
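
The same computation can be written directly in code. The following is a minimal Python sketch
(the transaction list is taken from the table above; the function and variable names are
illustrative, not from any particular library):

# Market basket transactions from the table above
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain every item of X
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ≈ 0.67
print(round(s, 2), round(c, 2))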

Given a set of transactions T, the goal of association rule mining is to find all rules
having

– support ≥ minsup threshold

– confidence ≥ minconf threshold


Brute-force approach:

– List all possible association rules

– Compute the support and confidence for each rule

– Prune rules that fail the minsup and minconf thresholds

Mining Association Rules:

Two-step approach:

1. Frequent Itemset Generation

– Generate all itemsets whose support ≥ minsup

2. Rule Generation

– Generate high confidence rules from each frequent itemset, where each
rule is a binary partitioning of a frequent itemset (a small sketch of this step is given below)

Frequent itemset generation is still computationally expensive.
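
As a small illustration of step 2 before turning to step 1, the following Python sketch
enumerates every binary partitioning X → (F − X) of a frequent itemset F and keeps the rules whose
confidence clears minconf. It is only an illustrative sketch over the market basket data above;
all names are my own:

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    # support count of an itemset over the transactions above
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(F, minconf):
    # Each rule is a binary partitioning X -> (F - X) of the frequent itemset F
    rules = []
    for r in range(1, len(F)):
        for lhs in combinations(sorted(F), r):
            X = frozenset(lhs)
            conf = sigma(F) / sigma(X)
            if conf >= minconf:
                rules.append((set(X), set(F - X), round(conf, 2)))
    return rules

print(rules_from_itemset(frozenset({"Milk", "Diaper", "Beer"}), minconf=0.6))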

2.2 Frequent Itemset Generation:


null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

A lattice structure can be used to enumerate the list of all possible itemsets. The figure above shows
an itemset lattice for I = {A, B, C, D, E}. In general, a data set that contains k items can potentially
generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be very large in many
practical applications, the search space of itemsets that need to be explored is exponentially
large.

Given d items, there are 2^d possible candidate itemsets.

Brute-force approach:

– Each itemset in the lattice is a candidate frequent itemset

– Count the support of each candidate by scanning the database

In the brute-force approach, each transaction in the database is matched against every candidate
itemset. With N transactions (the five market basket transactions above), M candidate itemsets, and
a maximum transaction width of w, the cost of support counting is roughly O(NMw), which is expensive
since M = 2^d.

Frequent Itemset Generation Strategies:

1. Reduce the number of candidates (M)

– Complete search: M = 2^d

– Use pruning techniques to reduce M

2. Reduce the number of transactions (N)

– Reduce size of N as the size of itemset increases

Used by DHP and vertical-based mining algorithms

3. Reduce the number of comparisons (NM)


-- Use efficient data structures to store the candidates or transactions
-- No need to match every candidate against every transaction.

2.3 The Apriori Principle: This section describes how the support measure helps to reduce
the number of candidate itemsets explored during frequent itemset generation. The use of
support for pruning candidate itemsets is guided by the following principle.

Apriori Principle: If an itemset is frequent, then all of its subsets must also be frequent.

To illustrate the idea behind the Apriori principle, consider the itemset lattice shown below.
Suppose {C, D, E} is a frequent itemset. Clearly, any transaction that contains {C, D, E} must
also contain its subsets {C, D}, {C, E}, {D, E}, {C}, {D}, and {E}. As a result, if {C, D, E} is
frequent, then all subsets of {C, D, E} must also be frequent.

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE
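
The principle is used in its contrapositive form: if any (k−1)-subset of a candidate k-itemset is
infrequent, the candidate itself cannot be frequent and can be pruned without counting its support.
A minimal Python sketch of that check (the function name and the sample data are illustrative):

from itertools import combinations

def has_infrequent_subset(candidate, prev_frequent):
    # candidate: a k-itemset; prev_frequent: the set of frequent (k-1)-itemsets
    return any(frozenset(s) not in prev_frequent
               for s in combinations(candidate, len(candidate) - 1))

L2 = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}
print(has_infrequent_subset(frozenset({"B", "C", "E"}), L2))  # False -> keep the candidate
print(has_infrequent_subset(frozenset({"A", "B", "C"}), L2))  # True  -> prune ({A, B} is infrequent)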
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

The Apriori Algorithm—An Example


Minimum support count (Supmin) = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1 (candidate 1-itemsets with support counts):
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (frequent 1-itemsets):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidates generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 support counts:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2 (frequent 2-itemsets):
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3 (candidates generated from L2):
{B, C, E}

3rd scan → L3 (frequent 3-itemsets):
{B, C, E}: 2
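
Running the sketch given earlier on this database (assuming the apriori function from that sketch)
reproduces the same result:

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(TDB, minsup_count=2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), count)
# {A}: 2, {B}: 3, {C}: 3, {E}: 3, {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2, {B,C,E}: 2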
2.4 Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

◼ Bottlenecks of the Apriori approach

◼ Breadth-first (i.e., level-wise) search

◼ Candidate generation and test

◼ Often generates a huge number of candidates

◼ The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD '00)

◼ Depth-first search

◼ Avoid explicit candidate generation

◼ Major philosophy: Grow long patterns from short ones using local frequent items only

◼ “abc” is a frequent pattern

◼ Get all transactions having “abc”, i.e., project DB on abc: DB|abc

◼ “d” is a local frequent item in DB|abc → abcd is a frequent pattern

Construct FP-tree from a Transaction Database (min_support = 3)

TID   Items bought                   (ordered) frequent items
100   {f, a, c, d, g, i, m, p}       {f, c, a, m, p}
200   {a, b, c, f, l, m, o}          {f, c, a, b, m}
300   {b, f, h, j, o, w}             {f, b}
400   {b, c, k, s, p}                {c, b, p}
500   {a, f, c, e, l, p, m, n}       {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single item patterns): f:4, c:4, a:3, b:3, m:3, p:3.

2. Sort the frequent items in frequency descending order to obtain the f-list: f-c-a-b-m-p.

3. Scan the DB again and construct the FP-tree by inserting each transaction's frequent items in
f-list order; transactions that share a prefix share a path in the tree, and a header table records
each item's frequency together with a head pointer to its chain of node-links.

(The resulting FP-tree: the root {} has children f:4 and c:1; under f:4 are c:3 and b:1; under c:3
is a:3; under a:3 are m:2 and b:1; under m:2 is p:2; under that b:1 is m:1; under the root's c:1 is
b:1, and under it p:1.)
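
A compact Python sketch of this construction, using a plain node class with parent pointers and a
header table of node-links, is given below. It is a simplified illustration (no path compression or
other refinements) and all names are my own:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(transactions, min_support):
    # Pass 1: find the frequent items and build the f-list (frequency-descending order)
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    flist = [i for i, c in sorted(freq.items(), key=lambda x: -x[1]) if c >= min_support]
    rank = {item: r for r, item in enumerate(flist)}

    root = FPNode(None, None)
    header = defaultdict(list)   # item -> node-links (all nodes carrying that item)
    # Pass 2: insert each transaction's frequent items in f-list order
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
tree, header, flist = build_fptree(db, min_support=3)
print(flist)  # the six frequent items f, c, a, b, m, p (order among equal-count items may vary)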
Partition Patterns and Databases

➢ Frequent patterns can be partitioned into subsets according to the f-list

• F-list = f-c-a-b-m-p

• Patterns containing p

• Patterns having m but no p

• Patterns having b but none of m, p

• Patterns having a but none of b, m, p

• Patterns having c but none of a, b, m, p

• Pattern f

➢ This partitioning is complete (every frequent pattern falls into exactly one subset) and non-redundant


Find Patterns Having P From P-conditional Database

◼ Starting at the frequent item header table in the FP-tree


◼ Traverse the FP-tree by following the link of each frequent item p
◼ Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (built from the FP-tree constructed above):

Item   Conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
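
Collecting a conditional pattern base is a walk up the tree from every node-link of the item via
parent pointers. The following sketch assumes the FPNode/header structure from the construction
sketch above (an assumption of this text, not a fixed API):

def prefix_path(node):
    # Climb from the node's parent to the root, collecting the items on the way
    path, node = [], node.parent
    while node is not None and node.item is not None:
        path.append(node.item)
        node = node.parent
    return path

def conditional_pattern_base(item, header):
    # One (prefix path, count) pair per occurrence of the item in the tree
    return [(prefix_path(n), n.count) for n in header[item]]

print(conditional_pattern_base("p", header))  # e.g. [(['m', 'a', 'c', 'f'], 2), (['b', 'c'], 1)]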
From Conditional Pattern-bases to Conditional FP-trees

◼ For each pattern-base


◼ Accumulate the count for each item in the base

◼ Construct the FP-tree for the frequent items of the

pattern base

m-conditional pattern base: fca:2, fcab:1

Accumulating the counts in this base gives f:3, c:3, a:3 (b appears only once, so it is infrequent
with min_support = 3). The m-conditional FP-tree is therefore the single path {} → f:3 → c:3 → a:3.

All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam.

Recursion: Mining Each Conditional FP-tree


Mining the m-conditional FP-tree ({} → f:3 → c:3 → a:3) proceeds recursively:

• The conditional pattern base of "am" is (fc:3); the am-conditional FP-tree is {} → f:3 → c:3

• The conditional pattern base of "cm" is (f:3); the cm-conditional FP-tree is {} → f:3

• The conditional pattern base of "cam" is (f:3); the cam-conditional FP-tree is {} → f:3
A Special Case: Single Prefix Path in FP-tree

◼ Suppose a (conditional) FP-tree T has a shared single prefix path P

◼ Mining can be decomposed into two parts:

◼ Reduction of the single prefix path into one node

◼ Concatenation of the mining results of the two parts

(In the original figure, a tree whose root runs through the single path a1:n1 → a2:n2 → a3:n3
before branching into b1:m1, C1:k1, C2:k2 and C3:k3 is split into the single-prefix-path part and
the branching part r1; the two parts are mined separately and their results are concatenated.)

Benefits of the FP-tree Structure:

➢ Completeness

• Preserve complete information for frequent pattern mining

• Never break a long pattern of any transaction

➢ Compactness

• Reduce irrelevant info—infrequent items are gone

• Items in frequency descending order: the more frequently occurring, the more
likely to be shared

• Never larger than the original database (not counting node-links and the count
fields)

The Frequent Pattern Growth Mining Method:

Recursively grow frequent patterns by pattern and database partition.

Method: For each frequent item, construct its conditional pattern base, and then its conditional
FP-tree. Repeat the process on each newly created conditional FP-tree until the resulting FP-tree
is empty or contains only a single path; a single path generates all the combinations of its
sub-paths, each of which is a frequent pattern.
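
Putting the pieces together, the recursion can be sketched as follows. This reuses the build_fptree
and conditional_pattern_base sketches from earlier (again an assumption of this text, not a standard
API). For simplicity, each conditional FP-tree is built by expanding the conditional pattern base
back into plain transactions, which is correct but not the most efficient formulation:

def fp_growth(transactions, min_support, suffix=frozenset()):
    tree, header, flist = build_fptree(transactions, min_support)
    patterns = {}
    for item in reversed(flist):                      # start from the least frequent item
        new_pattern = suffix | {item}
        patterns[new_pattern] = sum(n.count for n in header[item])
        # Conditional pattern base, expanded into ordinary transactions
        # (each prefix path repeated `count` times) so build_fptree can be reused
        cond_db = []
        for path, count in conditional_pattern_base(item, header):
            cond_db.extend([set(path)] * count)
        if cond_db:
            patterns.update(fp_growth(cond_db, min_support, new_pattern))
    return patterns

all_patterns = fp_growth(db, min_support=3)           # db from the construction example above
print(sorted(("".join(sorted(p)), s) for p, s in all_patterns.items()))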

Advantages of the Pattern Growth Approach:

➢ Divide-and-conquer:

• Decompose both the mining task and DB according to the frequent patterns
obtained so far

• Lead to focused search of smaller databases

➢ Other factors

• No candidate generation, no candidate test

• Compressed database: FP-tree structure

• No repeated scan of entire database

• Basic ops: counting local freq items and building sub FP-tree, no pattern search
and matching

➢ Good open-source implementations and refinements of FP-Growth are available.

2.5 Compact Representation of Frequent Itemsets

The number of frequent itemsets produced from a transaction data set can be very large. It is
useful to identify a small representative set of itemsets from which all other frequent itemsets
can be derived. Two such representations are maximal and closed frequent itemsets.
2.6 Maximal Frequent Itemset

An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.

(In the itemset lattice, the maximal frequent itemsets lie just below the border that separates the
frequent itemsets from the infrequent ones. Every frequent itemset is a subset of some maximal
frequent itemset, so the maximal frequent itemsets form a compact representation from which all
frequent itemsets can be derived; their supports, however, are not preserved.)

2.7 Closed Frequent Itemset: An itemset X is closed if none of its immediate supersets has the
same support as X.

X is not closed if at least one of its immediate supersets has the same support count as X.

TID   Items
1     {A, B}
2     {B, C, D}
3     {A, B, C, D}
4     {A, B, D}
5     {A, B, C, D}

Itemset    Support        Itemset         Support
{A}        4              {A, B, C}       2
{B}        5              {A, B, D}       3
{C}        3              {A, C, D}       2
{D}        4              {B, C, D}       3
{A, B}     4              {A, B, C, D}    2
{A, C}     2
{A, D}     3
{B, C}     3
{B, D}     4
{C, D}     3
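
Both definitions can be checked mechanically from a table of supports: an itemset is closed if no
immediate superset has the same support, and maximal frequent if it is frequent and no immediate
superset is frequent. A small Python sketch over the support table above (the minsup value of 3 and
the helper names are my own choices for illustration):

supports = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3, frozenset("D"): 4,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("AD"): 3,
    frozenset("BC"): 3, frozenset("BD"): 4, frozenset("CD"): 3,
    frozenset("ABC"): 2, frozenset("ABD"): 3, frozenset("ACD"): 2,
    frozenset("BCD"): 3, frozenset("ABCD"): 2,
}

def immediate_supersets(itemset):
    return [s for s in supports if itemset < s and len(s) == len(itemset) + 1]

def is_closed(itemset):
    # closed: no immediate superset has the same support
    return all(supports[s] != supports[itemset] for s in immediate_supersets(itemset))

def is_maximal(itemset, minsup):
    # maximal frequent: frequent, and no immediate superset is frequent
    return supports[itemset] >= minsup and all(
        supports[s] < minsup for s in immediate_supersets(itemset))

print(sorted("".join(sorted(i)) for i in supports if is_closed(i)))
print(sorted("".join(sorted(i)) for i in supports if is_maximal(i, minsup=3)))
# closed: B, AB, BD, ABD, BCD, ABCD; maximal (with minsup = 3): ABD, BCD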
Maximal vs Closed Itemsets

TID   Items
1     ABC
2     ABCD
3     BCE
4     ACDE
5     DE

(The original figure annotates the full itemset lattice for {A, B, C, D, E} with the IDs of the
transactions that contain each itemset; itemsets contained in no transaction, such as ABCDE, receive
no IDs. Comparing the two notions on such a lattice shows that every maximal frequent itemset is
also closed, while a closed frequent itemset need not be maximal.)
