Unit IV
• Association rule mining finds interesting association or
correlation relationships among a large set of data items
• The discovery of such associations can help retailers develop
marketing strategies by gaining insight into which items are
frequently purchased together by the customers.
• For instance, if customers are buying milk, how likely are they also to buy bread (and which kind of bread) on the same trip to the supermarket?
• For example, placing milk and bread within close proximity may further encourage the sale of these items together within single visits to the store.
Market Basket Analysis
A motivating example for Association Rule Mining
• Basic Concepts
support (A ⇒ B) = P(A ∪ B)
confidence (A ⇒ B) = P(B | A) = support(A ∪ B) / support(A)
Association Rule Mining : A Road Map
For example:
computer ⇒ financial_management_software
[support = 2%, confidence = 60%]
Association Rule Mining : A Road Map
The rule above can be rewritten as
buys(x, “computer”) ⇒ buys(x, “financial_management_software”)
Types of Association Rule Mining
• Let us consider the following set of transactions in a bookshop.
t1 := {ANN, CC, TC, CG}
t2 := {CC, D, CG}
t3 := {ANN, CC, TC, CG}
t4 := {ANN, CC, D, CG}
t5 := {ANN, CC, D, TC, CG}
t6 := {CC, D, TC}
• I := {ANN, CC, D, TC, CG} and T := {t1, t2, t3, t4, t5, t6}
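A small Python sketch computing support and confidence over these bookshop transactions (names are illustrative, not part of the slides):

T = [
    {"ANN", "CC", "TC", "CG"},
    {"CC", "D", "CG"},
    {"ANN", "CC", "TC", "CG"},
    {"ANN", "CC", "D", "CG"},
    {"ANN", "CC", "D", "TC", "CG"},
    {"CC", "D", "TC"},
]

def support(itemset):
    # fraction of transactions that contain every item of itemset
    return sum(1 for t in T if itemset <= t) / len(T)

def confidence(antecedent, consequent):
    # support(A ∪ B) / support(A)
    return support(antecedent | consequent) / support(antecedent)

print(support({"ANN", "CC"}))          # 4/6 ≈ 0.67
print(confidence({"ANN"}, {"TC"}))     # 3/4 = 0.75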
Association Rule: Basic Concepts
• Given:
(1) database of transactions,
(2) each transaction is a list of items (purchased by a customer in a
visit)
• Find: all rules that correlate the presence of one set of items
with that of another set of items
• E.g., 98% of people who purchase tires and auto accessories
also get automotive services done
Association Rule Definitions
Rule Measures: Support & Confidence
Association Discovery
• 1. Find all large itemsets: those with support > minsup (the frequent sets).
• 2. Use the frequent itemsets to generate association rules that satisfy minimum confidence.
Mining Association Rules in Large Databases
The Apriori Algorithm: Basics
The Apriori Algorithm in a Nutshell
• Find the frequent itemsets: the sets of items that have minimum
support
• Every subset of a frequent itemset must also be a frequent itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
• Iteratively find frequent itemsets with cardinality from 1 to k
(k-itemset)
• Use the frequent itemsets to generate association rules.
The Apriori Algorithm : Pseudo code
• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
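As a concrete illustration of this pseudo-code, here is a minimal, runnable Python sketch of the same level-wise procedure (function and variable names such as apriori and apriori_gen are illustrative, not from the slides):

from itertools import combinations

def apriori_gen(prev_frequent, k):
    # Join step: combine frequent (k-1)-itemsets that agree on their first k-2 items,
    # then prune candidates that have an infrequent (k-1)-subset.
    prev = sorted(prev_frequent)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:                         # join condition
                cand = tuple(sorted(set(a) | set(b)))
                if all(sub in prev_frequent
                       for sub in combinations(cand, k - 1)):  # prune step
                    candidates.add(cand)
    return candidates

def apriori(transactions, min_support_count):
    # Returns a dict mapping each frequent itemset (a sorted tuple) to its support count.
    counts = {}
    for t in transactions:
        for item in set(t):                                    # count 1-itemsets
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support_count}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        candidates = apriori_gen(set(frequent), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            t = set(t)
            for cand in candidates:
                if set(cand) <= t:
                    counts[cand] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent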
The Apriori Algorithm: Example
Step 1: Generating 1-itemset Frequent Pattern
C1 (scan D for the count of each candidate):
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

L1 (compare candidate support counts with the minimum support count):
Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
• The set of frequent 1-itemsets, L1 , consists of the candidate 1-itemsets
satisfying minimum support.
Step 2: Generating 2-itemset Frequent Pattern
L1 = {I1, I2, I3, I4, I5}
The candidate set C2 = L1 join L1, i.e.
{I1, I2, I3, I4, I5} join {I1, I2, I3, I4, I5}
which gives
[ {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5} ].
Now we check these candidate itemsets against the minimum support count, and we get
L2 = [ {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} ].
We proceed similarly for L3.
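As a quick check, the same join can be reproduced with the apriori_gen helper sketched after the pseudo-code (itemsets represented as sorted tuples; illustrative only):

L1 = {("I1",), ("I2",), ("I3",), ("I4",), ("I5",)}
C2 = apriori_gen(L1, 2)       # join L1 with itself
print(sorted(C2))             # the ten candidate pairs {I1,I2} ... {I4,I5} listed above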
Step 2: Generating 2-itemset Frequent Pattern [Cont.]
• The C2 table also contains candidates such as {I4, I5}, whose support count of 0 is below the minimum support count, so they are excluded from L2.
Step 3: Generating 3-itemset Frequent Pattern
• Step 1 (JOIN): C3 = L2 join L2, where L2 = [ {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} ]. Joining members that share their first item gives {I1, I2, I3} and {I1, I2, I5}.
• Step 2: Looking for itemsets that start with I3 (and similarly I4 and I5) gives NIL, so no further candidates are produced.
• Step 3: Find the infrequent itemsets using the minimum support count and remove them.
Step 3: Generating 3-itemset Frequent Pattern
C3 (scan D for the count of each candidate):
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2

L3 (compare candidate support counts with the minimum support count):
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2
Step 4: Generating 4-itemset Frequent Pattern
• The algorithm uses L3 join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
• Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes the Apriori algorithm.
• What’s Next ?
These frequent itemsets will be used to generate strong
association rules ( where strong association rules satisfy
both minimum support & minimum confidence).
Step 5: Generating Association Rules from Frequent Itemsets
• Procedure:
• For each frequent itemset “l”, generate all nonempty subsets of
l.
• For every nonempty subset s of l, output the rule “s → (l-s)” if
support_count(l) / support_count(s) >= min_conf where
min_conf is minimum confidence threshold.
• Back To Example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4},
{I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
• Let's take l = {I1, I2, I5}.
• All of its nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
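A minimal Python sketch of this rule-generation step (it assumes frequent itemsets are stored as sorted tuples mapped to support counts, as in the Apriori sketch earlier; names are illustrative):

from itertools import combinations

def generate_rules(frequent, min_conf):
    # Emit rules s -> (l - s) whose confidence meets min_conf.
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for s in combinations(itemset, r):          # nonempty proper subset s of l
                rest = tuple(sorted(set(itemset) - set(s)))
                conf = count / frequent[s]               # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((s, rest, conf))
    return rules

Applied to l = {I1, I2, I5} with the counts from this example, only the rules with 100% confidence (shown on the next slides) survive a 70% threshold.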
Step 5: Generating Association Rules from Frequent
Itemsets [Cont.]
• Let the minimum confidence threshold be, say, 70%.
• The resulting association rules are shown below, each listed
with its confidence.
• R1: I1 ^ I2 → I5
• Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
• R1 is Rejected.
• R2: I1 ^ I5 → I2
• Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
• R2 is Selected.
• R3: I2 ^ I5 → I1
• Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
• R3 is Selected.
• R4: I1 → I2 ^ I5 confidence=2/6=33%
• R5: I2 → I1 ^ I5 confidence=2/7=29%
• R6: I5 → I1 ^ I2 confidence=2/2=100%
Step 5: Generating Association Rules from Frequent
Itemsets [Cont.]
• R4: I1 → I2 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
• R4 is Rejected.
• R5: I2 → I1 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%
• R5 is Rejected.
• R6: I5 → I1 ^ I2
• Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%
• R6 is Selected.
In this way, we have found three strong association rules.
Methods to Improve Apriori’s Efficiency
Hash-based technique
Let us assume:
Item Order
• I1 - 1
• I2 - 2
• I3 - 3
• I4 - 4
• I5 - 5
Bucket address    0         1         2         3         4         5         6
Bucket count      2         2         4         2         2         4         4
Bucket contents   {I1,I4}   {I1,I5}   {I2,I3}   {I2,I4}   {I2,I5}   {I1,I2}   {I1,I3}
                  {I3,I5}   {I1,I5}   {I2,I3}   {I2,I4}   {I2,I5}   {I1,I2}   {I1,I3}
                                      {I2,I3}                       {I1,I2}   {I1,I3}
                                      {I2,I3}                       {I1,I2}   {I1,I3}
• FORMULA: h(x, y) = ((order of x) × 10 + (order of y)) mod 7
• From the table of Step 2 (the 2-itemsets), first take {I1, I3}.
• Then apply the formula: h(I1, I3) = ((1) × 10 + (3)) mod 7.
• This gives 6, so place the itemset in bucket 6 and increase that bucket count by 1.
• Then go to the main table, find all occurrences of {I1, I3}, and do the same.
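A small Python sketch of this hashing step (the order dictionary encodes the item ordering assumed above; names are illustrative):

from itertools import combinations

order = {"I1": 1, "I2": 2, "I3": 3, "I4": 4, "I5": 5}

def hash_bucket_counts(transactions, num_buckets=7):
    # Hash every 2-itemset of every transaction with h(x, y) = (10*order(x) + order(y)) mod 7.
    counts = [0] * num_buckets
    for t in transactions:
        items = sorted(set(t), key=order.get)
        for x, y in combinations(items, 2):
            counts[(order[x] * 10 + order[y]) % num_buckets] += 1
    return counts

For example, h(I1, I3) = (10 + 3) mod 7 = 6, so every occurrence of {I1, I3} lands in bucket 6, matching the table above; a 2-itemset whose bucket count is below the minimum support count cannot be frequent and can be pruned from C2.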
Methods to Improve Apriori’s Efficiency
Is Apriori Fast Enough? — Performance Bottlenecks
Mining Frequent Patterns Without Candidate Generation
Example :
• We reexamine the mining of transaction database D.
Benefits of the FP-tree Structure
• Completeness:
• never breaks a long pattern of any transaction
• preserves complete information for frequent pattern mining
• Compactness
• reduce irrelevant information—infrequent items are gone
• frequency descending ordering: more frequent items are more likely to
be shared
• never larger than the original database (not counting node-links and counts)
• Example: For Connect-4 DB, compression ratio could be over 100
Mining Frequent Patterns Using FP-tree
Major Steps to Mine FP-tree
• Mining of the FP-tree is summarized in the table below.
• Let us consider I5, which is the last item in L1, rather than the first.
• The reasoning behind this will become apparent as we explain the FP-tree process.
• I5 occurs in two branches of the FP-tree (three slides back).
• The occurrences of I5 can easily be found by following its chain of node-links.
• The paths formed by these branches are <(I2 I1 I5 : 1)> and <(I2 I1 I3 I5 : 1)>.
• Therefore, considering I5 as a suffix, its corresponding two prefix paths are <(I2 I1 : 1)> and <(I2 I1 I3 : 1)>, which form its conditional pattern base.
• Its conditional FP-tree contains only a single path, <I2:2, I1:2>; I3 is not included because its support count of 1 is less than the minimum support count.
• The single path generates all the combinations of frequent patterns: I2 I5:2, I1 I5:2, I2 I1 I5:2.
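A tiny Python sketch of how the single path <I2:2, I1:2> with suffix I5 expands into those patterns (names are illustrative):

from itertools import combinations

path = [("I2", 2), ("I1", 2)]          # single prefix path of I5's conditional FP-tree
suffix, suffix_count = "I5", 2

patterns = {}
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        items = tuple(i for i, _ in combo) + (suffix,)
        patterns[items] = min(min(c for _, c in combo), suffix_count)
print(patterns)                        # I2 I5:2, I1 I5:2, I2 I1 I5:2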
Item   Conditional pattern base        Conditional FP-tree     Frequent patterns generated
I5     {(I2 I1:1), (I2 I1 I3:1)}       <I2:2, I1:2>            I2 I5:2, I1 I5:2, I2 I1 I5:2
I4     {(I2 I1:1), (I2:1)}             <I2:2>                  I2 I4:2
I3     {(I2 I1:2), (I2:2), (I1:2)}     <I2:4, I1:2>, <I1:2>    I2 I3:4, I1 I3:4, I2 I1 I3:2
I1     {(I2:4)}                        <I2:4>                  I2 I1:4
• For I4, its two prefix paths form the conditional pattern base {(I2 I1 : 1), (I2 : 1)}, which generates a single-node conditional FP-tree <I2:2> and derives one frequent pattern, I2 I4:2.
• Similarly, I3's conditional pattern base is {(I2 I1:2), (I2:2), (I1:2)}. Its conditional FP-tree has two branches, <I2:4, I1:2> and <I1:2>, as shown in the figure below, which generate the set of patterns {I2 I3:4, I1 I3:4, I2 I1 I3:2}.
• Finally I1’s conditional pattern base is {(I2:4)}, whose FP-tree contains only
one node <I2:4>, which generates one frequent pattern I2 I1 : 4.
Major Steps to Mine FP-tree
Why Is Frequent Pattern Growth Fast?
• Reasoning
• No candidate generation, no candidate test
• Use compact data structure
• Eliminate repeated database scan
• Basic operation is counting and FP-tree building
Iceberg Queries
Example
• A database has five transactions. Let min_sup = 60% and min_conf = 80%.
TID items bought
T100 M, O, N, K, E, Y
T200 D, O, N, K, E, Y
T300 M, A, K, E
T400 M, U, C, K, Y
T500 C, O, O, K, I, E
(a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the two mining processes.
(b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers and item_i denotes variables representing items (e.g., “A”, “B”, etc.):
∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
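A possible sketch of the Apriori side of part (a), reusing the apriori() function outlined earlier (min_sup = 60% of 5 transactions, i.e. a minimum support count of 3):

D = [
    ["M", "O", "N", "K", "E", "Y"],
    ["D", "O", "N", "K", "E", "Y"],
    ["M", "A", "K", "E"],
    ["M", "U", "C", "K", "Y"],
    ["C", "O", "O", "K", "I", "E"],
]
print(apriori(D, min_support_count=3))
# e.g. {E, K, O} is frequent: it appears in 3 of the 5 transactions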
Mining Association Rules in Large
Databases
Multiple-Level Association Rules
(Table of sample transactions: TID, Items Purchased)
• By convention, levels within a concept hierarchy are numbered
from top to bottom, starting with level 0 at the root node for all (the
most general abstract level).
• The items in the table are at the lowest level of the concept hierarchy in the figure. It is difficult to find interesting purchase patterns at such a raw or primitive level of data.
• One would expect that it is easier to find strong associations between “IBM desktop computer” and “b/w printer” than between “IBM desktop computer” and “Sony b/w printer”.
Mining Multi-Level Associations
Multi-level Association: Uniform Support vs.
Reduced Support
Level-by-level independent: This is a full-breadth search. Each node is examined, regardless of whether or not its parent node is found to be frequent.
Multi-level Association: Redundancy Filtering
Computing Interestingness Measure
• Given a rule X → Y, information needed to compute rule
interestingness can be obtained from a contingency table
         Coffee   ¬Coffee   Total
Tea      15       5         20
¬Tea     75       5         80
Total    90       10        100
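As a worked illustration from this table: confidence(Tea ⇒ Coffee) = 15/20 = 75%, which looks strong, yet P(Coffee) = 90/100 = 90%, so lift = 0.75 / 0.90 ≈ 0.83 < 1. Despite the high confidence, tea drinkers are actually slightly less likely than average to buy coffee.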
Statistical Independence
Statistical-based Measures
Lift = P(Y | X) / P(Y)

Interest = P(X, Y) / ( P(X) P(Y) )

PS = P(X, Y) − P(X) P(Y)

φ-coefficient = ( P(X, Y) − P(X) P(Y) ) / sqrt( P(X)[1 − P(X)] P(Y)[1 − P(Y)] )
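A minimal Python sketch computing these measures from a 2×2 contingency table (the argument names are illustrative; the counts are n(X,Y), n(X,¬Y), n(¬X,Y), n(¬X,¬Y)):

from math import sqrt

def interestingness(n_xy, n_x_only, n_y_only, n_neither):
    n = n_xy + n_x_only + n_y_only + n_neither
    p_x = (n_xy + n_x_only) / n
    p_y = (n_xy + n_y_only) / n
    p_xy = n_xy / n
    lift = (p_xy / p_x) / p_y                      # P(Y|X) / P(Y)
    interest = p_xy / (p_x * p_y)                  # same value, written as P(X,Y)/(P(X)P(Y))
    ps = p_xy - p_x * p_y
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return lift, interest, ps, phi

print(interestingness(15, 5, 75, 5))   # tea/coffee table: lift ≈ 0.83, PS and phi negative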
Example: φ-Coefficient
• The φ-coefficient is analogous to the correlation coefficient for continuous variables
First table:
         Y     ¬Y    Total
X        60    10    70
¬X       10    20    30
Total    70    30    100

Second table:
         Y     ¬Y    Total
X        20    10    30
¬X       10    60    70
Total    30    70    100
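As a worked check: for the first table, φ = (0.6 − 0.7·0.7) / sqrt(0.7·0.3·0.7·0.3) = 0.11 / 0.21 ≈ 0.52; for the second table, φ = (0.2 − 0.3·0.3) / sqrt(0.3·0.7·0.3·0.7) = 0.11 / 0.21 ≈ 0.52. Both tables give the same coefficient, even though the first is dominated by the co-presence of X and Y and the second by their co-absence.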
Constraint-Based Mining
• Interactive, exploratory mining of gigabytes of data?
• Could it be real? — Making good use of constraints!
• What kinds of constraints can be used in mining?
• Knowledge type constraint: classification, association, etc.
• Data constraint: SQL-like queries
• Find product pairs sold together in Vancouver in Dec.’98.
• Dimension/level constraints:
• in relevance to region, price, brand, customer category.
• Interestingness constraints:
• strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
• Rule constraints
• cheap item sales (price < $10) trigger big sales (sum > $200).
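A hypothetical sketch of how such a rule constraint could be checked (the price table and rule representation are illustrative assumptions, not part of the slides):

price = {"pen": 2.0, "notebook": 5.0, "printer": 250.0, "desktop": 900.0}   # hypothetical prices

def satisfies_constraint(antecedent, consequent):
    # Cheap items (price < $10) on the left must trigger big sales (sum > $200) on the right.
    return (all(price[i] < 10 for i in antecedent)
            and sum(price[i] for i in consequent) > 200)

print(satisfies_constraint({"pen", "notebook"}, {"printer"}))   # True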