
COMP5009 DATA MINING

ASSOCIATION PATTERN MINING
DR PAUL HANCOCK
CURTIN UNIVERSITY
SEMESTER 2

ASSOCIATION PATTERN MINING
 Associations and patterns (Aggarwal Ch 4.1-4.4.2, 4.4.4)
 Algorithms (Aggarwal Ch 5.2, 5.4)
 Applications
 Summary

COMP5009 – DATA MINING, CURTIN UNIVERSITY 2


ASSOCIATIONS AND PATTERNS

 Association patterns to describe behavior
 Association rules to predict behavior
 Rules are often the desired outcome, but patterns are required first

COMP5009 – DATA MINING, CURTIN UNIVERSITY 3


INTRODUCTION

 Frequent patterns: when two or more objects co-occur
 Association rules: some statement about how likely two sets of items are to co-occur or to conditionally occur
 "Customers who buy milk and cereal also tend to buy bananas"
 Potential use: promotion, marketing, rearranging items, suggestions, software bug analysis, etc.
 Sparse problem: the number of items in a transaction is typically much less than the total number of items in the supermarket
 Frequent pattern mining
 Simple: purely based on frequency of itemsets
 Good explanatory power, but poor predictive power
 Basis of association rules mining (to study later in the course)

COMP5009 – DATA MINING, CURTIN UNIVERSITY 4


NOMENCLATURE

 Database T
 n transactions T1, T2, ..., Tn
 Set of all d items U
 Itemset: a set of (some) items
 k-itemset: an itemset that contains exactly k items
 Note: items → unique items, i.e. we do not count how many bananas, milk bottles, etc. are in a transaction
 Primarily look at the frequency of itemsets
 frequent patterns ∼ frequent itemsets

COMP5009 – DATA MINING, CURTIN UNIVERSITY 5


CONCEPT - SUPPORT

 Support – sup(I): the fraction (relative) or number (absolute) of transactions that contain the itemset I
 Minimum support – minsup: the minimum support for a set to be included in our list of interesting patterns
 Sets that have sup(I) >= minsup are said to be frequent itemsets

COMP5009 – DATA MINING, CURTIN UNIVERSITY 6
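A minimal sketch of how support could be computed, assuming transactions are represented as Python sets (the transactions below are illustrative, not the table used on the next slide):

# Sketch: absolute and relative support of an itemset.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Eggs", "Milk"},
    {"Milk", "Yoghurt"},
    {"Bread", "Milk", "Yoghurt"},
    {"Eggs"},
]

def support(itemset, transactions, relative=True):
    """Count the transactions that contain every item of `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)  # subset test
    return count / len(transactions) if relative else count

print(support({"Bread", "Milk"}, transactions))                   # 0.6 (relative)
print(support({"Bread", "Milk"}, transactions, relative=False))   # 3   (absolute)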


CONCEPT - SUPPORT

 For the table below, what is the absolute and relative sup({Bread, Milk})?
 If minsup = 0.5, is {Milk, Yogurt} a frequent itemset?

 Support – sup(I): can be relative (fraction of transactions) or absolute (number of transactions)
 Minimum support – minsup: the minimum support for a set to be included in our list of interesting patterns
 Sets that have sup(I) >= minsup are said to be frequent itemsets

COMP5009 – DATA MINING, CURTIN UNIVERSITY 7


SUPPORT PROPERTIES

Claim: if sup(I) >= minsup and J is a subset of I, then J is also frequent
 e.g. sup({Bread, Milk}) >= sup({Bread, Eggs, Milk})
 e.g. sup({Bread}) >= sup({Bread, Milk})

Proof:
 sup(J) >= sup(I) >= minsup
 so sup(J) >= minsup
 => J is also frequent

Corollary:
 The number of frequent itemsets with k items decreases with increasing k.

COMP5009 – DATA MINING, CURTIN UNIVERSITY 8


PATTERN MINING

 k-itemsets
 k >= 1
 More interested in k > 2
 How to select minsup?
 Small minsup: many itemsets returned, hard to see the interesting ones
 Large minsup: may not even return itemsets with k >= 2
 Once we have mined our data for frequent patterns, we can then start generating association rules

COMP5009 – DATA MINING, CURTIN UNIVERSITY 9


PYTHON EXAMPLE

 supermarket.arff data set from the Weka MOOC
 Unfiltered with (absolute) minsup = 0.25
 See more in prac 04
 Note that increasing k (number of items) is correlated with decreasing support

COMP5009 – DATA MINING, CURTIN UNIVERSITY 10
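As a hedged illustration of the same kind of run in Python, here is a sketch using the mlxtend library. It assumes the transactions have already been one-hot encoded into a boolean pandas DataFrame; the columns and data are toy values, not supermarket.arff, and this is not necessarily the tool used in prac 04:

import pandas as pd
from mlxtend.frequent_patterns import apriori

# One row per transaction, one boolean column per item (toy data).
baskets = pd.DataFrame({
    "bread": [1, 1, 0, 1, 0],
    "milk":  [1, 1, 1, 1, 0],
    "eggs":  [0, 1, 0, 0, 1],
}).astype(bool)

frequent = apriori(baskets, min_support=0.25, use_colnames=True)
frequent["length"] = frequent["itemsets"].apply(len)
print(frequent.sort_values("support", ascending=False))
# Note how larger itemsets (greater length) tend to have lower support.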


MAXIMAL FREQUENT ITEMSETS

 Maximal frequent itemsets can't be made longer without dropping below the minsup
 All frequent itemsets can be derived from the maximal frequent itemsets
 sup({Eggs, Milk}) is 3/5 = 0.6
 sup({Eggs, Milk, Yoghurt}) is 2/5 = 0.4
 sup({Eggs, Milk, Cheese}) is 1/5 = 0.2
 For minsup = 0.5, {Eggs, Milk} is maximal frequent
 For minsup = 0.3, {Eggs, Milk} is not maximal frequent (since {Eggs, Milk, Yoghurt} is also frequent)

COMP5009 – DATA MINING, CURTIN UNIVERSITY 11
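A small sketch of checking maximality: an itemset is maximal frequent if no strict superset of it is frequent. The list of frequent itemsets below is illustrative (roughly what the slide's example would give at minsup = 0.3), not a complete enumeration:

# Sketch: maximal frequent itemsets from a given collection of frequent itemsets.
frequent = [frozenset(s) for s in (
    {"Eggs"}, {"Milk"}, {"Yoghurt"},
    {"Eggs", "Milk"}, {"Eggs", "Yoghurt"}, {"Milk", "Yoghurt"},
    {"Eggs", "Milk", "Yoghurt"},
)]

def is_maximal(itemset, frequent_itemsets):
    # Maximal if no strict superset appears in the frequent collection.
    return not any(itemset < other for other in frequent_itemsets)

maximal = [set(s) for s in frequent if is_maximal(s, frequent)]
print(maximal)  # within this illustrative list, only {'Eggs', 'Milk', 'Yoghurt'} is maximal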


ITEMSET LATTICE

 Represents all possible itemsets
 Some 2^|U| entries, so not usually practical to visualize
 Pink line represents the border between frequent and infrequent itemsets (k increases moving down the lattice)
 Maximal frequent itemsets are immediately above the line: a, ae, bd, bce, cde

[Figure: itemset lattice over items {a, b, c, d, e} with the frequent/infrequent border marked]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 12


CONSTRUCT AN ITEMSET LATTICE

 For the database shown below, construct an itemset lattice, and identify the maximal frequent itemsets assuming minsup = 0.5

COMP5009 – DATA MINING, CURTIN UNIVERSITY 13


ASSOCIATION RULES

Itemset
 Describes items that co-occur
 Similar itemsets may have relations
 Itemsets don't describe these relations

Association Rule
 Describes the relation between different itemsets
 Written as X => Y with some confidence measure
 "If a shopper buys eggs, then it's 45% likely that they'll also buy cheese"
 {Eggs} => {Cheese} with confidence 45%
COMP5009 – DATA MINING, CURTIN UNIVERSITY 14
CONFIDENCE

 conf(X => Y) = sup(X U Y) / sup(X)
 X U Y is the union of the sets X, Y
 Confidence is the conditional probability of seeing the union of X and Y, given that X is already seen
 Confidence is therefore in [0, 1]

Example:
 Let A be the association rule: {Eggs, Milk} => {Eggs, Milk, Cheese}
 conf(A) = sup({Eggs, Milk, Cheese}) / sup({Eggs, Milk}) = (1/5) / (3/5) = 1/3
 Our rule has a confidence of 33%

COMP5009 – DATA MINING, CURTIN UNIVERSITY 15
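The worked example can be reproduced directly from support counts. A brief sketch, using the five transactions that appear later in this deck's FP-growth example:

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Yoghurt", "Eggs"},
    {"Milk", "Eggs", "Bread", "Cheese"},
    {"Milk", "Yoghurt", "Eggs"},
    {"Milk", "Yoghurt", "Cheese"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y):
    # conf(X => Y) = sup(X u Y) / sup(X)
    return support(x | y) / support(x)

print(confidence({"Eggs", "Milk"}, {"Eggs", "Milk", "Cheese"}))  # (1/5)/(3/5) = 0.333...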


ASSOCIATION RULES

 Criterion 1 – relevance (measured by support)
 Criterion 2 – strength (measured by confidence)
 Use frequent itemsets to find rules

COMP5009 – DATA MINING, CURTIN UNIVERSITY 16


CONFIDENCE MONOTONICITY

Recall: conf(X => Y) = sup(X U Y) / sup(X)
 Let I = {A, B, C}, X2 = {A, B}, X1 = {A}

Confidence monotonicity implies that:
 conf({A,B} => {C}) >= conf({A} => {B,C})

Calculate:
 conf({A,B} => {C}) = sup({A,B} U {C}) / sup({A,B})
 conf({A} => {B,C}) = sup({A} U {B,C}) / sup({A})

Note:
 {A,B} U {C} = {A} U {B,C} = {A, B, C}
 sup({A}) >= sup({A,B}) (support monotonicity)

Thus:
 conf({A,B} => {C}) >= conf({A} => {B,C})

Alternative view:
 More evidence, less prediction >= less evidence, more prediction

COMP5009 – DATA MINING, CURTIN UNIVERSITY 17
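A minimal sketch of turning frequent itemsets into rules: split each frequent itemset into antecedent and consequent, and keep the splits whose confidence clears a threshold. The supports dictionary and minconf value are illustrative (computed from the deck's five-transaction example); in practice they would come from Apriori or FP-Growth:

from itertools import combinations

supports = {
    frozenset({"Eggs"}): 0.6,
    frozenset({"Milk"}): 1.0,
    frozenset({"Yoghurt"}): 0.6,
    frozenset({"Eggs", "Milk"}): 0.6,
    frozenset({"Milk", "Yoghurt"}): 0.6,
}

def rules(supports, minconf):
    for itemset in supports:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                if antecedent not in supports:
                    continue  # should not happen if supports is downward closed
                consequent = itemset - antecedent
                conf = supports[itemset] / supports[antecedent]  # sup(X u Y) / sup(X)
                if conf >= minconf:
                    yield antecedent, consequent, conf

for x, y, c in rules(supports, minconf=0.8):
    print(set(x), "=>", set(y), f"conf={c:.2f}")
# e.g. {'Eggs'} => {'Milk'} conf=1.00 and {'Yoghurt'} => {'Milk'} conf=1.00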


ALGORITHMS
 Brute force
 Apriori
 FP-Growth

COMP5009 – DATA MINING, CURTIN UNIVERSITY 18


BRUTE FORCE ALGORITHM

Algorithm:
 Step 1: generate all candidate itemsets (2^|U| − 1 of them)
 Step 2: scan the database and count the number of occurrences of each itemset
 Step 3: select itemsets with support >= minimum support

 Impractical if d = |U| is large
 2^1000 ~ 10^301 is an awful lot of trials, and 1000 is not a large universe
 Probably suitable only for very small problems

COMP5009 – DATA MINING, CURTIN UNIVERSITY 19
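A sketch of the brute-force approach for a tiny universe; itertools enumerates all 2^|U| − 1 candidate itemsets, which is only viable when |U| is very small (the transactions are the deck's five-basket example):

from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Yoghurt", "Eggs"},
    {"Milk", "Eggs", "Bread", "Cheese"},
    {"Milk", "Yoghurt", "Eggs"},
    {"Milk", "Yoghurt", "Cheese"},
]
universe = sorted(set().union(*transactions))
minsup = 0.5  # relative

# Step 1: generate all 2^|U| - 1 candidate itemsets.
candidates = [frozenset(c) for k in range(1, len(universe) + 1)
              for c in combinations(universe, k)]

# Step 2: scan the database and count occurrences of each candidate.
def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Step 3: keep candidates meeting minimum support.
frequent = {c: support(c) for c in candidates if support(c) >= minsup}
for s, sup in frequent.items():
    print(set(s), sup)  # e.g. {Milk}, {Eggs}, {Yoghurt}, {Milk, Eggs}, {Milk, Yoghurt}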


APRIORI ALGORITHM

 Key idea: exploiting the downward closure property
 If a candidate itemset X does not meet minimum support, then any superset of X would also fail.
 How? Ignore all supersets of a failed candidate itemset
 Algorithm:
 Start with 1-itemsets and filter infrequent 1-itemsets
 Combine 1-itemsets to create candidate 2-itemsets, and filter infrequent 2-itemsets
 Repeat for 3, 4, 5, …

https://developpaper.com/association-rule-mining-and-apriori-algorithm/

COMP5009 – DATA MINING, CURTIN UNIVERSITY 20


APRIORI EXAMPLE

Algorithm:
 Create 1-itemsets with sup >= minsup
 For N = 2 … K do
 Join (N−1)-itemsets to make candidate N-itemsets, keeping those with sup >= minsup
 Report all itemsets with sup >= minsup

https://developpaper.com/association-rule-mining-and-apriori-algorithm/

COMP5009 – DATA MINING, CURTIN UNIVERSITY 21
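A compact level-wise sketch of the Apriori idea (not an optimized implementation; it re-scans the transactions for every candidate, and uses the deck's five-basket example):

transactions = [frozenset(t) for t in (
    {"Milk", "Bread", "Butter"},
    {"Milk", "Yoghurt", "Eggs"},
    {"Milk", "Eggs", "Bread", "Cheese"},
    {"Milk", "Yoghurt", "Eggs"},
    {"Milk", "Yoghurt", "Cheese"},
)]
minsup = 0.5

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Level 1: frequent 1-itemsets.
items = {i for t in transactions for i in t}
level = {frozenset({i}) for i in items if support(frozenset({i})) >= minsup}
frequent = set(level)

# Level k: join frequent (k-1)-itemsets to form candidate k-itemsets, then prune by minsup.
# Downward closure means infrequent candidates never need to be extended.
k = 2
while level:
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    level = {c for c in candidates if support(c) >= minsup}
    frequent |= level
    k += 1

for s in sorted(frequent, key=len):
    print(set(s), support(s))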


CONSTRUCT AN ITEMSET LATTICE

 For the database shown below, find all itemsets with sup > 0.2 using the Apriori algorithm.

COMP5009 – DATA MINING, CURTIN UNIVERSITY 22


FREQUENT PATTERN (FP) GROWTH ALGORITHM

 Two phases:
 Constructing a frequent-pattern tree (FP-tree)
 "Mining" the tree for patterns meeting minsup

The FP-tree is not our itemset lattice:
 Nodes are items
 Paths are itemsets
 Each node stores the support information of the path from the root to that node
 The FP-tree is a transform of the database

COMP5009 – DATA MINING, CURTIN UNIVERSITY 23


EXAMPLE FP-GROWTH

Lexicon:
 Bread, Butter, Eggs, Milk, Yoghurt, Cheese

Compute sup for each singleton:

item     sup
Bread    2
Butter   1
Eggs     3
Milk     5
Yoghurt  3
Cheese   2

Order by support:
 Milk, Yoghurt, Eggs, Bread, Cheese, Butter

COMP5009 – DATA MINING, CURTIN UNIVERSITY 24


Reordered itemsets:

tid  Set of items
1    {Milk, Bread, Butter}
2    {Milk, Yoghurt, Eggs}
3    {Milk, Eggs, Bread, Cheese}
4    {Milk, Yoghurt, Eggs}
5    {Milk, Yoghurt, Cheese}

Ordering:
 Milk, Yoghurt, Eggs, Bread, Cheese, Butter

 Starting with the first transaction, build a tree from the items
 Nodes are {Item: count}

COMP5009 – DATA MINING, CURTIN UNIVERSITY 25
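A rough sketch of the tree-building phase (nodes as a small class; the header-table / link-list bookkeeping used later for mining is omitted):

class FPNode:
    def __init__(self, item=None):
        self.item = item      # None for the root
        self.count = 0        # how many transactions pass through this node
        self.children = {}    # item -> FPNode

    def insert(self, items):
        """Insert one already-reordered transaction below this node."""
        self.count += 1
        if items:
            child = self.children.setdefault(items[0], FPNode(items[0]))
            child.insert(items[1:])

order = ["Milk", "Yoghurt", "Eggs", "Bread", "Cheese", "Butter"]
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Milk", "Yoghurt", "Eggs"},
    {"Milk", "Eggs", "Bread", "Cheese"},
    {"Milk", "Yoghurt", "Eggs"},
    {"Milk", "Yoghurt", "Cheese"},
]

root = FPNode()
for t in transactions:
    root.insert([i for i in order if i in t])  # reorder by descending item support first

def show(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in node.children.values():
        show(child, depth + 1)

show(root)  # Milk:5 -> {Bread:1 -> Butter:1, Yoghurt:3 -> {Eggs:2, Cheese:1}, Eggs:1 -> Bread:1 -> Cheese:1}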


[FP-tree construction, step 0: the tree is just the root node {}:0; the transaction table is repeated on each of the following slides]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 26


[Step 1: insert tid 1 {Milk, Bread, Butter}: root {}:1 with path Milk:1 -> Bread:1 -> Butter:1]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 27


[Step 2: insert tid 2 {Milk, Yoghurt, Eggs}: root {}:2; Milk:2 now has two branches, Bread:1 -> Butter:1 and Yoghurt:1 -> Eggs:1]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 28


[Step 3: insert tid 3 {Milk, Eggs, Bread, Cheese}: root {}:3; Milk:3 gains a third branch Eggs:1 -> Bread:1 -> Cheese:1]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 29


[Step 4: insert tid 4 {Milk, Yoghurt, Eggs}: root {}:4; Milk:4, Yoghurt:2 -> Eggs:2; other branches unchanged]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 30


[Step 5: insert tid 5 {Milk, Yoghurt, Cheese}: root {}:5; Milk:5, Yoghurt:3 with children Eggs:2 and Cheese:1]

Final tree:
{}:5
└─ Milk:5
   ├─ Bread:1
   │  └─ Butter:1
   ├─ Yoghurt:3
   │  ├─ Eggs:2
   │  └─ Cheese:1
   └─ Eggs:1
      └─ Bread:1
         └─ Cheese:1

COMP5009 – DATA MINING, CURTIN UNIVERSITY 31


[Same final FP-tree, now with a header table: each item (Milk, Yoghurt, Eggs, Bread, Cheese, Butter) keeps a linked list pointing to its occurrences in the tree]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 32


The FP-tree is an alternate representation of the transaction database -> it contains the same information.

[Figure: the final FP-tree with its header table, alongside the singleton support table from earlier]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 33


FP-GROWTH AND MINSUP

Let minsup = 3 (absolute)
 No frequent itemsets with Butter, Bread, or Cheese (each has sup < 3)

Consider paths leading to Eggs:
 {M, Y, E:2} and {M, E:1} -> individually neither path meets minsup
 Combining the two paths, Milk co-occurs with Eggs 2 + 1 = 3 times, so {M, E} does meet minsup

Consider paths leading to Yoghurt:
 {M, Y:3} -> meets minsup

Itemsets with sup >= 3 are:
 {M}, {Y}, {E}, {M, Y}, {M, E}

[Figure: the final FP-tree and singleton support table, as on the previous slides]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 34


FP-TREE AND ASSOCIATION RULES

 If a basket has {Milk, Eggs}, what is the chance that they will also buy {Bread}?
 If a basket has {Milk, Yoghurt}, what is the chance that they will also buy {Eggs}?

[Figure: the final FP-tree, as on the previous slides]

COMP5009 – DATA MINING, CURTIN UNIVERSITY 35


COMPARISON

Apriori
 Candidate generation is easy to parallelize
 Considers all possible candidates
 Requires multiple DB scans
 Runtime is O(2^|U|)

FP-Growth
 Data are interdependent, hard to parallelize
 Only candidates present in the DB are considered
 Can be accomplished in 2 scans of the DB
 Runtime is O(|U|·N)

COMP5009 – DATA MINING, CURTIN UNIVERSITY 36


APPLICATIONS
Aggarwal, Ch 5.4

Other data mining tasks
 Classification: rules of the form X ⇒ c to find discriminative features
 Clustering: to find highly correlated subsets of attributes
 Outlier detection: to detect abnormal transactions that are not "covered" by most of the association patterns in the data

Market basket analysis
 Demographic and profile analysis: the antecedent of the rule typically identifies a profile segment, and the consequent identifies a population segment for target marketing
 Recommendations and collaborative filtering: cluster the data into segments, and then determine the patterns in these segments (localized pattern mining)
 Weblog analysis: to identify pages frequently visited in a session but not yet linked to each other

COMP5009 – DATA MINING, CURTIN UNIVERSITY 37


NEXT: DATA CLASSIFICATION
AGGARWAL CH 10

COMP5009 – DATA MINING, CURTIN UNIVERSITY 38
